An end-to-end closed-domain question answering system with BERT and classic IR methods
Install cdQA from PyPI:

```shell
pip install cdqa
```

Or install from source:

```shell
git clone https://github.com/fmikaelian/cdQA.git
cd cdQA
pip install -e .
```

Experiments have been done on an AWS EC2 p3.2xlarge Deep Learning AMI (Ubuntu) Version 22.0 with a single Tesla V100 16GB and 16-bit training enabled (to accelerate training and prediction). To enable this feature, you will need to install apex:
```shell
git clone https://github.com/NVIDIA/apex.git
cd apex/
python setup.py install --cuda_ext --cpp_ext
```
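For context, apex enables 16-bit (mixed-precision) training through its `amp` module. The snippet below is only a generic illustration of that API, not cdQA's internal training code; the model, optimizer, and data are placeholders.

```python
# Generic apex mixed-precision illustration (placeholders, not cdQA code).
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 'O1' patches ops to run in FP16 where safe, keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # scale the loss to avoid FP16 gradient underflow
optimizer.step()
```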
To use cdqa you need a `.csv` corpus file with the following columns:

| date | title | category | link | abstract | paragraphs | content |
|---|---|---|---|---|---|---|
| DD/MM/YY | The Article Title | The Article Category | https://the-article-link.com | The Article Abstract | [Paragraph 1 of Article, Paragraph N of Article] | Paragraph 1 of Article Paragraph N of Article |
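If your documents are not already in this shape, you can also build the DataFrame by hand. A minimal sketch (all values are placeholders; note that a list-valued `paragraphs` column is stringified when round-tripped through CSV, so you may need `ast.literal_eval` when reading it back):

```python
import pandas as pd

# Hypothetical two-article corpus; each row is one article and
# 'paragraphs' holds that article's paragraphs as a list of strings.
df = pd.DataFrame({
    'date': ['01/01/19', '02/01/19'],
    'title': ['First article', 'Second article'],
    'category': ['News', 'News'],
    'link': ['https://example.com/1', 'https://example.com/2'],
    'abstract': ['Abstract of article 1', 'Abstract of article 2'],
    'paragraphs': [['Paragraph 1 of article 1', 'Paragraph 2 of article 1'],
                   ['Paragraph 1 of article 2']],
})
df['content'] = df['paragraphs'].apply(' '.join)
df.to_csv('your-custom-corpus-here.csv', index=False)
```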
You can use the converters to create this file:
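For example, to build a corpus DataFrame from a directory of PDF files (a sketch: the `pdf_converter` helper and its module path are assumed here, so check the converters available in your installed version):

```python
# create a corpus dataframe from a directory containing .pdf files
# (helper name and module path assumed; check your installed version)
from cdqa.utils.converters import pdf_converter

df = pdf_converter(directory_path='path_to_pdf_folder')
df.to_csv('your-custom-corpus-here.csv', index=False)
```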
Read your corpus in `.csv` format:
```python
import pandas as pd
from cdqa.pipeline.cdqa_sklearn import QAPipeline

df = pd.read_csv('your-custom-corpus-here.csv')
```

Fit the pipeline on your corpus using the pre-trained reader:
```python
cdqa_pipeline = QAPipeline(model='bert_qa_squad_v1.1_sklearn.joblib')
cdqa_pipeline.fit(X=df)
```

If you want to fine-tune the reader on your custom data:
```python
cdqa_pipeline = QAPipeline()
cdqa_pipeline.fit(X=df, fit_reader=True)
```
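Once fitted, the pipeline can be persisted so you do not have to re-fit it every time. A sketch assuming the pipeline object is joblib-serializable, as the distributed `bert_qa_squad_v1.1_sklearn.joblib` artifact suggests (the filename below is hypothetical):

```python
import joblib

# Persist the fitted pipeline ('my_cdqa_pipeline.joblib' is a hypothetical name)...
joblib.dump(cdqa_pipeline, 'my_cdqa_pipeline.joblib')

# ...and reload it later without re-fitting.
cdqa_pipeline = joblib.load('my_cdqa_pipeline.joblib')
```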
To get the best prediction given an input query:

```python
query = 'your custom question here'
cdqa_pipeline.predict(X=query)
```
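The prediction bundles the answer with the document it was extracted from. A sketch of inspecting it, assuming an (answer, title, paragraph) tuple as return value (the exact shape may differ across versions):

```python
# Inspect the prediction; the (answer, title, paragraph) layout is assumed.
prediction = cdqa_pipeline.predict(X=query)

print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))
```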
In order to evaluate models on your custom dataset you will need to annotate it. The annotation process can be done in 3 steps:

1. Convert your pandas DataFrame into a JSON file with SQuAD format:

   ```python
   from cdqa.utils.converter import df2squad

   json_data = df2squad(df=df, squad_version='v2.0', output_dir='../data', filename='bnpp_newsroom-v1.1')
   ```

2. Use an annotator to add ground truth question-answer pairs. Please refer to cdQA-annotator, a web-based annotator for closed-domain question answering datasets with SQuAD format.

3. Evaluate your model:

   ```python
   from cdqa.utils.metrics import evaluate, evaluate_from_files

   evaluate(dataset, predictions)  # as json objects
   evaluate_from_files(dataset_file='dev-v1.1.json', prediction_file='predictions.json')  # as json files
   ```
A complete workflow is described in our examples notebook.
You can deploy a cdQA REST API by executing:
```shell
FLASK_APP=api.py flask run -h 0.0.0.0
```

To try it, execute:
```shell
http localhost:5000/api query=='your question here'
```
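The same request can be sent from Python; a sketch using the requests library (this assumes the endpoint answers GET requests and returns JSON, as the httpie call above suggests):

```python
import requests

# Same call as the httpie command above: 'query' is sent as a URL parameter.
response = requests.get('http://localhost:5000/api',
                        params={'query': 'your question here'})
print(response.json())
```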
If you wish to serve a user interface, follow the instructions of cdQA-ui, a web interface developed for cdQA.

You can use the Dockerfile to deploy the full cdQA app.
Read our Contributing Guidelines.
| Type | Title | Author | Year |
|---|---|---|---|
| 📹 Video | Stanford CS224N: NLP with Deep Learning Lecture 10 – Question Answering | Christopher Manning | 2019 |
| 📰 Paper | End-to-End Open-Domain Question Answering with BERTserini | Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin | 2019 |
| 📰 Paper | Contextual Word Representations: A Contextual Introduction | Noah A. Smith | 2019 |
| 📰 Paper | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | 2018 |
| 📰 Paper | Neural Reading Comprehension and Beyond | Danqi Chen | 2018 |
| 📰 Paper | Reading Wikipedia to Answer Open-Domain Questions | Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes | 2017 |