This folder contains the code for evaluating the evidence selected by agents trained in convince/allennlp. We used ParlAI (March 12, 2019 commit), making changes specific to our evaluation setup in a few files.
This repo reads the evidence selections from evaluation/inference log files of trained evidence agents (available here). We then launch HITs (human evaluation jobs) on Amazon Mechanical Turk, using ParlAI's MTurk code (convince/ParlAI/parlai/mturk - no GPU required). We made our own ParlAI task, which contains all code specific to our evaluations (convince/ParlAI/parlai/mturk/tasks/context_evaluator); we give an overview of the files in this task-specific folder below:
| Python File | Functionality |
| --- | --- |
| run.py | Initialize, launch, and end HITs |
| task_configs.py | Human evaluation "hyperparameters" |
| worlds.py | Logic for evaluating evidence and saving results |
| worlds_onboard.py | Logic for the Onboarding World; filters out workers based on performance on a few "easy" example evaluations |
We also added data reading/processing code for RACE (convince/ParlAI/parlai/tasks/race/) and DREAM (convince/ParlAI/parlai/tasks/dream/).
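Once you've finished the setup below, you can spot-check that the RACE and DREAM readers load using ParlAI's standard data display script; this assumes the tasks are registered under the names race and dream, matching their folder names:

```bash
# Print a few RACE training examples (swap in `dream` to check the DREAM reader)
python examples/display_data.py --task race --datatype train
```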
Conda can be used to set up a virtual environment (Python 3):
- Create a Conda environment with Python 3.6:

  ```bash
  conda create -n convince_human python=3.6
  ```

- Activate the Conda environment:

  ```bash
  conda activate convince_human
  ```
Clone this repo and move to convince/ParlAI/ (where all commands should be run from):
```bash
git clone https://github.com/ethanjperez/convince.git
cd convince/ParlAI
```

Install dependencies using pip:

```bash
pip install -r requirements.txt
```

You may also need to install PyTorch 1.1 if you have dependency issues later on.

Link the cloned directory to your site-packages:

```bash
python setup.py develop
```

Any necessary data will be downloaded to ~/ParlAI/data.
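As an optional sanity check (this command is just a convenient suggestion, not something the repo requires), you can confirm that Python now resolves parlai to your cloned copy:

```bash
# Should print a path inside your cloned convince/ParlAI directory
python -c "import parlai; print(parlai.__file__)"
```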
Now that you've installed ParlAI, follow these instructions to set up and walk through ParlAI's MTurk functionality.
To run human evaluation:
```bash
# --dataset: use evidence found on this dataset ('race' or 'dream')
# --prompt-type: the evidence evaluation setup ("quote and question" evaluates single-sentence evidence)
# --live: without this flag, you'll run a debugging HIT in MTurk Sandbox without fees
python parlai/mturk/tasks/context_evaluator/run.py \
    --dataset race \
    --prompt-type "quote and question" \
    --live
```

We support the following evidence evaluation setups (via arguments to --prompt-type):
| --prompt-type | Evaluation Setup |
| --- | --- |
| 'question' | Question-only baseline (no evidence shown) |
| 'passage and question' | Full passage baseline |
| 'quote and question' | Show one evidence sentence for one answer |
| 'quotes and question' | Show one evidence sentence for each answer (concatenated as a summary) |
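For example, to try the full passage baseline on DREAM in the MTurk Sandbox first (no fees), a run might look like the sketch below; it only uses the flags documented above, and any other settings come from the defaults in task_configs.py:

```bash
# Sandbox run (no --live flag) of the full passage baseline on DREAM
python parlai/mturk/tasks/context_evaluator/run.py \
    --dataset dream \
    --prompt-type "passage and question"
```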
Sometimes, you'll need to delete a set of HITs if launched evaluations are not cancelled properly (workers will email you that your HIT isn't working, though it was already cancelled). To do so, run:
```bash
python parlai/mturk/core/scripts/delete_hits.py
```

You can bonus a worker if they were not paid for a HIT (requires that the worker has completed a previous HIT of yours):

```bash
python parlai/mturk/core/scripts/bonus_workers.py --hit-id
```

You'll need to provide the HIT ID that you're bonusing.
Omit --hit-id if you have the Assignment ID instead of the HIT ID.
Try both (with or without --hit-id) if you have some ID related to the HIT but don't know if it's an Assignment ID or HIT ID.
We reject HITs very sparingly, as rejected HITs have major consequences for workers. When we do reject HITs, it's usually because the worker was answering too quickly. If you do give a rejection unfairly, the worker will likely email you, and you can ask for their HIT ID or Assignment ID (or perhaps find it in their email). To reverse a rejected HIT that was given out unfairly, run the following code in Python:
```python
from parlai.mturk.core.mturk_manager import MTurkManager

manager = MTurkManager.make_taskless_instance()

# Run one of the below, depending on which ID you have. Try both if you don't know.
manager.approve_work('[INSERT ASSIGNMENT ID]', override_rejection=True)
manager.approve_assignments_for_hit('[INSERT HIT ID]', override_rejection=True)
```

Use the following steps to evaluate the evidence of your own trained agents:
- Run inference with your own agent 4 times total, with `--debate-mode` as Ⅰ, Ⅱ, Ⅲ, or Ⅳ (once each).
- For each run, the code will save a log file of the form `debate_log.*json` in the save directory (whatever you specified after `--serialization-dir` during inference).
- Rename each file to `$DM.json`, where `$DM` is the `--debate-mode` you ran inference with to produce that file.
- Place the files together in some new directory `$DIR` (see the sketch after this list).
- Change the `evaluation_data_dir` field value to `$DIR` in task_configs.py.
- Run human evaluation as described above.
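For example, collecting the four inference logs might look roughly like the sketch below; the checkpoint paths and the choice of `$DIR` are placeholders, not paths required by this repo:

```bash
# Hypothetical layout: one inference run per debate mode, each with its own --serialization-dir
DIR=$HOME/convince_eval_logs/my_agent   # set 'evaluation_data_dir' in task_configs.py to this path
mkdir -p "$DIR"
for DM in Ⅰ Ⅱ Ⅲ Ⅳ; do
    # Copy the single debate_log.*json produced by the run with --debate-mode $DM
    cp "$HOME/checkpoints/my_agent_$DM"/debate_log.*json "$DIR/$DM.json"
done
```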