data

Moji

Download processed data

Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh)

From scratch

Download the TwitterAAE dataset from Su Lin Blodgett et al., described in "Demographic Dialectal Variation in Social Media: A Case Study of African-American English."
```
wget http://slanglab.cs.umass.edu/TwitterAAE/TwitterAAE-full-v1.zip

# or directly from the site: http://slanglab.cs.umass.edu/TwitterAAE/
```

Follow Elazar et al. in preprocessing the dataset to get race and sentiment labels.

```
python make_data.py /path/to/downloaded/twitteraae_all /path/to/project/data/processed/sentiment_race sentiment race
```

See this doc for details about the scripts.

Note that the scripts require python==2.7, and that the original scripts map tokens to ids in place, i.e., the original tokens are not stored. To save the texts as well, modify the following function:

https://github.com/yanaiela/demog-text-removal/blob/f11b243c3f2f24f2179348c468b2caf76e7a3b23/src/data/make_data.py#L59

```
import os


def to_file(output_dir, voc2id, vocab, pos_pos, pos_neg, neg_pos, neg_neg):
    if output_dir[-1] != '/':
        output_dir += '/'

    if not os.path.isdir(output_dir):
        os.makedirs(output_dir)

    # One vocabulary token per line.
    with open(output_dir + 'vocab', 'w') as f:
        f.writelines('\n'.join(vocab))

    for data, name in zip([pos_pos, pos_neg, neg_pos, neg_neg], ['pos_pos', 'pos_neg', 'neg_pos', 'neg_neg']):
        # Write each sentence as space-separated token ids.
        with open(output_dir + name, 'w') as f:
            for sen in data:
                ids = map(lambda x: str(voc2id[x]), sen)
                f.write(' '.join(ids) + '\n')

        # Also write the raw tokens, one sentence per line, to <name>_text.
        with open(output_dir + name + "_text", 'w') as f:
            for sen in data:
                tokens = map(lambda x: str(x), sen)
                f.write(' '.join(tokens) + '\n')
```
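Once both the id files and the `_text` files exist, it can be useful to confirm that they line up before encoding the texts. The following is a minimal sketch in Python 3, not part of the original scripts. The helper name `load_split` and the demo directory are hypothetical, and it assumes that a token's id is its line index in the `vocab` file and that tokens contain no whitespace.

```python
import os
import tempfile


def load_split(output_dir, name):
    """Return (ids, tokens) pairs for one split, checking that they line up."""
    with open(os.path.join(output_dir, "vocab"), encoding="utf-8") as f:
        vocab = f.read().split("\n")
    # Assumption: a token's id is its line index in the vocab file.
    voc2id = {tok: i for i, tok in enumerate(vocab)}

    pairs = []
    ids_path = os.path.join(output_dir, name)
    text_path = os.path.join(output_dir, name + "_text")
    with open(ids_path, encoding="utf-8") as f_ids, \
            open(text_path, encoding="utf-8") as f_txt:
        for line_no, (id_line, txt_line) in enumerate(zip(f_ids, f_txt), 1):
            ids = [int(i) for i in id_line.split()]
            tokens = txt_line.split()
            # Each text line must map back to the ids on the matching line.
            if ids != [voc2id[t] for t in tokens]:
                raise ValueError("id/text mismatch on line %d" % line_no)
            pairs.append((ids, tokens))
    return pairs


# Tiny hypothetical directory written in the same layout as to_file.
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "vocab"), "w", encoding="utf-8") as f:
    f.write("\n".join(["so", "happy", "today"]))
with open(os.path.join(demo_dir, "pos_pos"), "w", encoding="utf-8") as f:
    f.write("0 1\n1 2\n")
with open(os.path.join(demo_dir, "pos_pos_text"), "w", encoding="utf-8") as f:
    f.write("so happy\nhappy today\n")

pairs = load_split(demo_dir, "pos_pos")
print(pairs[0])
```

The same call can be repeated for `pos_neg`, `neg_pos`, and `neg_neg`; a `ValueError` indicates that the text and id files were written from different data.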

Encode texts with torchMoji. We provide an example of extracting text representations at src/Moji.

Bios

Download processed data without economy labels

Please use the script provided at [download_data.sh](https://github.com/shauli-ravfogel/nullspace_projection/blob/master/download_data.sh)

From scratch

Download the dataset as described in "Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting."

See https://github.com/microsoft/biosbias for instructions for downloading and processing all bio records as a single file.
Create splits and get BERT encodings. We follow Ravfogel et al. in creating the data splits and extracting the BERT encodings. Please see create_dataset_biasbios.py and encode_bert_states.py.

We provide an example of the dataset splits.
Augmented economy labels. TODO
