Thanks to visit codestin.com
Credit goes to github.com

Skip to content

nehangit/extract_tool

Repository files navigation

PeopleWeave Intelligent Paper Extraction

Extracts the abstract or content from research paper PDFs using GROBID models first and then resorting to OCR. May retrain GROBID in future.

Setup

  1. Install Tesseract OCR with the instructions found here: https://tesseract-ocr.github.io/tessdoc/Installation.html

  2. Clone this repository in the working directory and install dependencies:

    git clone --recurse-submodules https://github.com/nehangit/extract_tool.git
    cd extract_tool
    pip install -r requirements.txt
    
  3. Modify grobid.yaml, then open a new terminal and run the Grobid Server Docker image. There are two options, you can use the full Grobid image with deep learning models (accuracy better, long installation and runtime, ideal for small # of pdfs, a good machine, and ideally a GPU) or the lightweight image without DL models (for efficiency, low resources, lots of pdfs). You must have Docker installed for both options, and make sure the engine is running.

    • Full image: For more accurate abstract extraction (header model) e.g using a BiLSTM fed into a ChainCRF (and can also use a SciBert transformer). grobid.yaml is configured for this by default.

      Ensure that the concurrency parameter in the grobid.yaml file is set according to your CPU/GPU capacity and matches passed "--threads" parameter (the default of 10 should be okay).

      Installation takes a while, but you only need to do it once.

      docker run --rm --gpus all --init --ulimit core=0 -p 8070:8070 -p 8071:8071 -v {Full path to local grobid.yaml}:/opt/grobid/grobid-home/config/grobid.yaml:ro grobid/grobid:0.8.0
      
    • Lightweight image: The lightweight image is much faster to install and run and doesn't use DL models (only linear chain CRFs).

      Important: In grobid.yaml, simply modify the engine parameter of the "header" model to be "wapiti" instead of "delft" (line 117 and 118).

      Ensure that the concurrency parameter in the grobid.yaml file is set according to your CPU capacity and matches passed "--threads" parameter (the default of 10 should be okay).

      docker run --rm --init --ulimit core=0 -p 8070:8070 -p 8071:8071 -v {Full path to local grobid.yaml}:/opt/grobid/grobid-home/config/grobid.yaml:ro lfoppiano/grobid:0.8.0
      

      In theory, runtime and params shouldn't be an issue since we generally have a small number of new papers to process.

Usage

Ensure that the GROBID container is running on a separate terminal, then use:

python extract.py

By default, this looks for a directory "Conference Papers", uses 10 threads, and does abstract extraction.

To change these options, see the help menu:

python extract.py -h

Notes

  • Runtime depends on models used, model parameters, number of PDFs, and number of threads ("concurrency") all are specified in the grobid.yaml file.
  • Configuration docs for grobid.yaml: https://grobid.readthedocs.io/en/latest/Configuration/
  • I've ran into issues with GROBID if full paths become too long, move the project to a previous/shorter directory.
  • Using DL engine recommended for certain models other than header, depends on your use cases see https://grobid.readthedocs.io/en/latest/Deep-Learning-models/
  • If using Linux/macOS and want to train the models, you can build GROBID from source, see grobid docs for details.
  • WORK IN PROGRESS

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages