An end-to-end closed-domain question answering system with BERT and classic IR methods
Install cdQA from PyPI:

```shell
pip install cdqa
```

Or install from source:

```shell
git clone https://github.com/fmikaelian/cdQA.git
cd cdQA
pip install -e .
```

Experiments have been done on an AWS EC2 p3.2xlarge Deep Learning AMI (Ubuntu) Version 22.0 with a single Tesla V100 16GB and 16-bit training enabled (to accelerate training and prediction). To enable this feature, you will need to install apex:
```shell
git clone https://github.com/NVIDIA/apex.git
cd apex/
python setup.py install --cuda_ext --cpp_ext
```
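For context, apex enables 16-bit (mixed-precision) training through its `amp` module. The snippet below is only a generic illustration of that API, not cdQA's internal training code; the model, optimizer, and data are placeholders.

```python
# Generic apex mixed-precision illustration (placeholders, not cdQA code).
import torch
from apex import amp

model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

# 'O1' patches ops to run in FP16 where safe, keeping FP32 master weights.
model, optimizer = amp.initialize(model, optimizer, opt_level='O1')

loss = model(torch.randn(4, 10).cuda()).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()  # scale the loss to avoid FP16 gradient underflow
optimizer.step()
```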
To use cdqa you need a `.csv` corpus file with the following columns:

| date | title | category | link | abstract | paragraphs | content |
|---|---|---|---|---|---|---|
| DD/MM/YY | The Article Title | The Article Category | https://the-article-link.com | The Article Abstract | [Paragraph 1 of Article, Paragraph N of Article] | Paragraph 1 of Article Paragraph N of Article |
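If your documents are not already in this shape, you can also build the DataFrame by hand. A minimal sketch (all values are placeholders; note that a list-valued `paragraphs` column is stringified when round-tripped through CSV, so you may need `ast.literal_eval` when reading it back):

```python
import pandas as pd

# Hypothetical two-article corpus; each row is one article and
# 'paragraphs' holds that article's paragraphs as a list of strings.
df = pd.DataFrame({
    'date': ['01/01/19', '02/01/19'],
    'title': ['First article', 'Second article'],
    'category': ['News', 'News'],
    'link': ['https://example.com/1', 'https://example.com/2'],
    'abstract': ['Abstract of article 1', 'Abstract of article 2'],
    'paragraphs': [['Paragraph 1 of article 1', 'Paragraph 2 of article 1'],
                   ['Paragraph 1 of article 2']],
})
df['content'] = df['paragraphs'].apply(' '.join)
df.to_csv('your-custom-corpus-here.csv', index=False)
```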
You can use the converters to create this file:
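For example, to build a corpus DataFrame from a directory of PDF files (a sketch: the `pdf_converter` helper and its module path are assumed here, so check the converters available in your installed version):

```python
# create a corpus dataframe from a directory containing .pdf files
# (helper name and module path assumed; check your installed version)
from cdqa.utils.converters import pdf_converter

df = pdf_converter(directory_path='path_to_pdf_folder')
df.to_csv('your-custom-corpus-here.csv', index=False)
```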
Read your corpus in `.csv` format:
```python
import pandas as pd
from cdqa.pipeline.cdqa_sklearn import QAPipeline

df = pd.read_csv('your-custom-corpus-here.csv')
```

Fit the pipeline on your corpus using the pre-trained reader:
```python
cdqa_pipeline = QAPipeline(model='bert_qa_squad_v1.1_sklearn.joblib')
cdqa_pipeline.fit(X=df)
```

If you want to fine-tune the reader on your custom data:
```python
cdqa_pipeline = QAPipeline()
cdqa_pipeline.fit(X=df, fit_reader=True)
```
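Once fitted, the pipeline can be persisted so you do not have to re-fit it every time. A sketch assuming the pipeline object is joblib-serializable, as the distributed `bert_qa_squad_v1.1_sklearn.joblib` artifact suggests (the filename below is hypothetical):

```python
import joblib

# Persist the fitted pipeline ('my_cdqa_pipeline.joblib' is a hypothetical name)...
joblib.dump(cdqa_pipeline, 'my_cdqa_pipeline.joblib')

# ...and reload it later without re-fitting.
cdqa_pipeline = joblib.load('my_cdqa_pipeline.joblib')
```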
To get the best prediction given an input query:

```python
query = 'your custom question here'
cdqa_pipeline.predict(X=query)
```
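The prediction bundles the answer with the document it was extracted from. A sketch of inspecting it, assuming an (answer, title, paragraph) tuple as return value (the exact shape may differ across versions):

```python
# Inspect the prediction; the (answer, title, paragraph) layout is assumed.
prediction = cdqa_pipeline.predict(X=query)

print('query: {}'.format(query))
print('answer: {}'.format(prediction[0]))
print('title: {}'.format(prediction[1]))
print('paragraph: {}'.format(prediction[2]))
```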
In order to evaluate models on your custom dataset you will need to annotate it. The annotation process can be done in 3 steps:

1. Convert your pandas DataFrame into a JSON file with SQuAD format:

   ```python
   from cdqa.utils.converter import df2squad

   json_data = df2squad(df=df, squad_version='v2.0', output_dir='../data', filename='bnpp_newsroom-v1.1')
   ```

2. Use an annotator to add ground truth question-answer pairs. Please refer to cdQA-annotator, a web-based annotator for closed-domain question answering datasets with SQuAD format.

3. Evaluate your model:

   ```python
   from cdqa.utils.metrics import evaluate, evaluate_from_files

   evaluate(dataset, predictions)  # as json objects
   evaluate_from_files(dataset_file='dev-v1.1.json', prediction_file='predictions.json')  # as json files
   ```
A complete workflow is described in our examples notebook.
You can deploy a cdQA REST API by executing:
```shell
FLASK_APP=api.py flask run -h 0.0.0.0
```

To try it, execute:
```shell
http localhost:5000/api query=='your question here'
```
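The same request can be sent from Python; a sketch using the requests library (this assumes the endpoint answers GET requests and returns JSON, as the httpie call above suggests):

```python
import requests

# Same call as the httpie command above: 'query' is sent as a URL parameter.
response = requests.get('http://localhost:5000/api',
                        params={'query': 'your question here'})
print(response.json())
```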
If you wish to serve a user interface, follow the instructions of cdQA-ui, a web interface developed for cdQA.

You can use the Dockerfile to deploy the full cdQA app.
Read our Contributing Guidelines.
| Type | Title | Author | Year |
|---|---|---|---|
| 📹 Video | Stanford CS224N: NLP with Deep Learning Lecture 10 – Question Answering | Christopher Manning | 2019 |
| 📰 Paper | End-to-End Open-Domain Question Answering with BERTserini | Wei Yang, Yuqing Xie, Aileen Lin, Xingyu Li, Luchen Tan, Kun Xiong, Ming Li, Jimmy Lin | 2019 |
| 📰 Paper | Contextual Word Representations: A Contextual Introduction | Noah A. Smith | 2019 |
| 📰 Paper | BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova | 2018 |
| 📰 Paper | Neural Reading Comprehension and Beyond | Danqi Chen | 2018 |
| 📰 Paper | Reading Wikipedia to Answer Open-Domain Questions | Danqi Chen, Adam Fisch, Jason Weston, Antoine Bordes | 2017 |