GM-RKB WikiText Error Correction Task (LREC-2020)

GM-RKBWikiText Error Correction Task goal is to benchmark systems that attempt to automatically detect and fix simple typographical errors in WikiText (Wiki Pages).

This repository contains three different datasets used for evaluation, two different models used to fix Wiki pages, along with different tools and test scripts.

Example:

Original WikiText: <B>Subject Headings:</B> [[Text Corpus]], [[Language Model]]
WikiText with Noise: <B>Subject Headings:<B/> [[Text Corpus]], [[Language Model]]

WikiText with noise is the input of the WikiFixer models. WikiFixer aims at converting the text with noise to its original (clean) form.

WikiFixer Usage

The repository contains the model files for the best evaluated WikiFixer NNet and it can be used using the sample code below.

Download Datasets and models

A zip file containing the datasets files and all the tested models of the WikiFixer can be found here (794 Megabytes): Download link
the file should be unzipped in the same directory The zipped file data.zip contains two directories Datasets and nnet_models. The first one has the files for the 3 datasets used for training and evaluation. The second directory contains the trained models for WikiFixer NNet.

WikiFixer MLE

from WikiFixerMLE import WikiFixer
text_noise = '[[Text Corpus]], [[Lnaguage Model]]'
fixer = WikiFixer()
fixer.load_models()
fixer.fix_text(text_noise)

'[[Text Corpus]], [[Language Model]]'

WikiFixer NNet

from WikiFixerNNet import WikiFixerNNet
text_noise = '<B>Subject Headings:<B/> [[Text Corpus]], [[Language Model]]'
fixer = WikiFixer()
fixer.load_models()
fixer.fix_text(text_noise)

'<B>Subject Headings:</B> [[Text Corpus]], [[Language Model]]'

Run tests

Models Evaluation

The repository contains code for evaluating any system used for Wiki Pages errors fixing.

First the model has to be defined in test_config.py. A class with fix_text function using this system has to defined. The following code is an example to add a normal spelling correction tool as a system to be evaluated on the Wiki data JamSpell

class jspell(object):
        def __init__(self):
                self.corrector = jamspell.TSpellCorrector()
                self.corrector.LoadLangModel('/mnt/efs/data/en.bin')

        def fix_text(self, text):
                out = []
                for line in text.splitlines():
                        out.append(self.corrector.FixFragment(line))
                return "\n".join(out)

Second, using the same file, the configuration of the test can be set as following

    def test_models(self):
        log_file = './output/log_file.csv' # output file

        models = ["mle"] # model to be evaluated
        datafiles = ["./Datasets/MWDump.20191001.Noisetest.parquet"] #dataset used for evaluation 
        for datafile in datafiles:
            for model in models:
                 config = test_config.get_config(datafile=datafile,Model=model) #load configuraiton
                config["sample_size"] = 100 #number of pages used in the evaluation porcess
  
                Emetric, score, types_stats = test_script.run_test(config)
                l = [datafile, model, Emetric[0], Emetric[1], Emetric[2],metric[3], score]
                with open(log_file, 'a') as csvfile:
                    spamwriter = csv.writer(csvfile, delimiter=' ',
                                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
                    spamwriter.writerow(l)

third to run the test using command python -m unittest tests.test_model

Unit tests

A number of unit testing scripts are available:

test_tools unit testing for WikiTools, which is a class containing different functions used in the process of running and evaluating different models.
test_mle unit testing for WikiFixer MLE
test_nnet unit testing for WikiFixer NNet

They can be run using the command python -m unittest tests.test_tools (this is example to run the first unit testing script).

Datasets

GM-RKB Dataset

Wikipedia Dataset

Directory Structure

├── WikiFixerMLE.py : WikiFixer MLE Model
├── WikiFixerNNet.py : WikiFixer NNet seq2seq Model
├── model_config.py :  WikiFixer NNet configuration file
├── requirements.txt
├── Datasets
│   ├── MWDump.20191001.Noisetest.parquet :  GM-RKB Wiki dataset
│   ├── Wikipedia.Noise.parquet : Wikipedia Training dataset
│   └── WikipediaTest.Noisetest.parquet : Wikipedia Test dataset
├── mle
│   ├── LanguageModel.py : character based language model implementation
│   ├── model6-0.json
│   ├── model7-1.json
├── nnet
│   ├── data
│   │   └── allowed_chars_sm.json
│   ├── data_processing.py
│   ├── data_vectorization.py
├── nnet_models
│   ├── gm_rkb_nnet_fixer_GMRKB&Wiki7_sm_e22.h5
│   ├── gm_rkb_nnet_fixer_GMRKB2019_sm_e22.h5
│   ├── gm_rkb_nnet_fixer_GMRKB_PREWiki_sm_e12.h5
│   └── gm_rkb_nnet_fixer_Wikipedia_sm_e10.h5
├── tests
│   ├── log_file.txt
│   ├── multiproc_evaluation.py
│   ├── path.py
│   ├── test_config.py
│   ├── test_integration.py
│   ├── test_mle.py
│   ├── test_model.py
│   ├── test_nnet.py
│   ├── test_script.py
│   └── test_tools.py
└── tools
    ├── WikiTextTools.py
    ├── clean_lm.py
    ├── diff_match_patch.py
    ├── enums.py
    ├── fixer_evaluation.py
    ├── path.py
    ├── review_logs_output.py
    └── test_WikiTextTools.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GM-RKB WikiText Error Correction Task (LREC-2020)

Example:

WikiFixer Usage

Download Datasets and models

WikiFixer MLE

WikiFixer NNet

Run tests

Models Evaluation

Unit tests

Datasets

GM-RKB Dataset

Wikipedia Dataset

Directory Structure

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 33 Commits
mle		mle
nnet		nnet
nnet_models		nnet_models
output		output
tests		tests
tools		tools
.gitignore		.gitignore
README.md		README.md
WikiFixerMLE.py		WikiFixerMLE.py
WikiFixerNNet.py		WikiFixerNNet.py
model_config.py		model_config.py
requirements.txt		requirements.txt

GM-RKB/LREC-2020

Folders and files

Latest commit

History

Repository files navigation

GM-RKB WikiText Error Correction Task (LREC-2020)

Example:

WikiFixer Usage

Download Datasets and models

WikiFixer MLE

WikiFixer NNet

Run tests

Models Evaluation

Unit tests

Datasets

GM-RKB Dataset

Wikipedia Dataset

Directory Structure

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Languages

Packages