This folder contains the code for evaluating the evidence selected by agents trained in convince/allennlp. We used ParlAI (March 12, 2019 commit), making changes specific to our evaluation setup in a few files.
This repo reads the evidence selections from evaluation/inference log files of trained evidence agents (available here). We then launch HITs (human evaluation jobs) on Amazon Mechanical Turk, using ParlAI's MTurk code (convince/ParlAI/parlai/mturk - no GPU required). We made our own ParlAI task, which contains all code specific to our evaluations (convince/ParlAI/parlai/mturk/tasks/context_evaluator); we give an overview of the files in this task-specific folder below:
| Python File | Functionality |
| --- | --- |
| run.py | Initialize, launch, and end HITs |
| task_configs.py | Human evaluation "hyperparameters" |
| worlds.py | Logic for evaluating evidence and saving results |
| worlds_onboard.py | Logic for the Onboarding World; filters out workers based on performance on a few "easy" example evaluations |
We also added data reading/processing code for RACE (convince/ParlAI/parlai/tasks/race/) and DREAM (convince/ParlAI/parlai/tasks/dream/).
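Once you've finished the setup below, you can spot-check that the RACE and DREAM readers load using ParlAI's standard data display script; this assumes the tasks are registered under the names race and dream, matching their folder names:

```bash
# Print a few RACE training examples (swap in `dream` to check the DREAM reader)
python examples/display_data.py --task race --datatype train
```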
Conda can be used to set up a virtual environment (Python 3):
- Create a Conda environment with Python 3.6:

  ```bash
  conda create -n convince_human python=3.6
  ```

- Activate the Conda environment:

  ```bash
  conda activate convince_human
  ```
Clone this repo and move to convince/ParlAI/ (where all commands should be run from):
```bash
git clone https://github.com/ethanjperez/convince.git
cd convince/ParlAI
```

Install dependencies using pip:

```bash
pip install -r requirements.txt
```

You may also need to install PyTorch 1.1 if you have dependency issues later on.

Link the cloned directory to your site-packages:

```bash
python setup.py develop
```

Any necessary data will be downloaded to ~/ParlAI/data.
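As an optional sanity check (this command is just a convenient suggestion, not something the repo requires), you can confirm that Python now resolves parlai to your cloned copy:

```bash
# Should print a path inside your cloned convince/ParlAI directory
python -c "import parlai; print(parlai.__file__)"
```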
Now that you've installed ParlAI, follow these instructions to set up and walk through ParlAI's MTurk functionality.
To run human evaluation:
```bash
# --dataset: use evidence found on this dataset ('race' or 'dream')
# --prompt-type: the evidence evaluation setup ("quote and question" evaluates single-sentence evidence)
# --live: without this flag, you'll run a debugging HIT in MTurk Sandbox without fees
python parlai/mturk/tasks/context_evaluator/run.py \
    --dataset race \
    --prompt-type "quote and question" \
    --live
```

We support the following evidence evaluation setups (via arguments to --prompt-type):
| --prompt-type | Evaluation Setup |
| --- | --- |
| 'question' | Question-only baseline (no evidence shown) |
| 'passage and question' | Full passage baseline |
| 'quote and question' | Show one evidence sentence for one answer |
| 'quotes and question' | Show one evidence sentence for each answer (concatenated as a summary) |
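For example, to try the full passage baseline on DREAM in the MTurk Sandbox first (no fees), a run might look like the sketch below; it only uses the flags documented above, and any other settings come from the defaults in task_configs.py:

```bash
# Sandbox run (no --live flag) of the full passage baseline on DREAM
python parlai/mturk/tasks/context_evaluator/run.py \
    --dataset dream \
    --prompt-type "passage and question"
```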
Sometimes, you'll need to delete a set of HITs if launched evaluations are not cancelled properly (workers will email you that your HIT isn't working, though it was already cancelled). To do so, run:
```bash
python parlai/mturk/core/scripts/delete_hits.py
```

You can bonus a worker if they were not paid for a HIT (requires that the worker has completed a previous HIT of yours):

```bash
python parlai/mturk/core/scripts/bonus_workers.py --hit-id
```

You'll need to provide the HIT ID that you're bonusing.
Omit --hit-id if you have the Assignment ID instead of the HIT ID.
Try both (with or without --hit-id) if you have some ID related to the HIT but don't know if it's an Assignment ID or HIT ID.
We reject HITs very sparingly, as rejected HITs have major consequences for workers. When we do reject HITs, it's usually because the worker was answering too quickly. If you do give a rejection unfairly, the worker will likely email you, and you can ask for their HIT ID or Assignment ID (or perhaps find it in their email). To reverse a rejected HIT that was given out unfairly, run the following code in Python:
```python
from parlai.mturk.core.mturk_manager import MTurkManager

manager = MTurkManager.make_taskless_instance()

# Run one of the below, depending on which ID you have. Try both if you don't know.
manager.approve_work('[INSERT ASSIGNMENT ID]', override_rejection=True)
manager.approve_assignments_for_hit('[INSERT HIT ID]', override_rejection=True)
```

Use the following steps to evaluate the evidence of your own trained agents:
- Run inference with your own agent 4 times total, with `--debate-mode` as Ⅰ, Ⅱ, Ⅲ, or Ⅳ (once each).
- For each run, the code will save a log file of the form `debate_log.*json` in the save directory (whatever you specified after `--serialization-dir` during inference).
- Rename each file to `$DM.json`, where `$DM` is the `--debate-mode` you ran inference with to produce that file.
- Place the files together in some new directory `$DIR` (see the sketch after this list).
- Change the `evaluation_data_dir` field value to `$DIR` in task_configs.py.
- Run human evaluation as described above.
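For example, collecting the four inference logs might look roughly like the sketch below; the checkpoint paths and the choice of `$DIR` are placeholders, not paths required by this repo:

```bash
# Hypothetical layout: one inference run per debate mode, each with its own --serialization-dir
DIR=$HOME/convince_eval_logs/my_agent   # set 'evaluation_data_dir' in task_configs.py to this path
mkdir -p "$DIR"
for DM in Ⅰ Ⅱ Ⅲ Ⅳ; do
    # Copy the single debate_log.*json produced by the run with --debate-mode $DM
    cp "$HOME/checkpoints/my_agent_$DM"/debate_log.*json "$DIR/$DM.json"
done
```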