LoA (Librarian of Alexandria) is a comprehensive tool designed for researchers, data scientists, and information professionals. It automates the process of searching, scraping, and analyzing scientific papers from various preprint servers and databases. The tool combines web scraping, natural language processing, and data extraction to efficiently gather and process large volumes of academic literature.
- Multi-Source Scraping: Supports scraping from multiple sources including PubMed Central, arXiv, ChemRxiv, ScienceOpen, and Unpaywall.
- Customizable Search: Allows users to define search terms and filters for targeted paper retrieval.
- Full-Text Download: Automatically downloads full-text PDFs and XMLs when available.
- Intelligent Data Extraction: Uses advanced NLP models to extract specific information from papers based on user-defined schemas.
- Flexible Schema Definition: Enables users to create custom data extraction schemas for various research needs.
- Concurrent Processing: Implements concurrent scraping and extraction for improved efficiency.
- Resume Capability: Ability to resume interrupted scraping or extraction processes.
- Automatic Model Management: Handles the download and management of required NLP models.
- Clone the repository:

  ```bash
  git clone https://github.com/MorganRO8/LoA.git
  cd LoA
  ```

- Install the dependencies listed in install_commands.txt (a requirements.txt and a Dockerized version are planned).
- If you want to use DECIMER, create a conda environment from the DECIMER-GPU file.
- Run:

  ```bash
  python main.py
  ```
LoA can be used in two modes: Interactive (UI) mode and Automatic mode.
Run the main script without any arguments:

```bash
python main.py
```
Follow the prompts to:
- Choose between scraping, defining a CSV structure, or extracting data.
- Enter search terms and select data sources.
- Define or select a schema for data extraction.
- Specify the NLP model to use.
Prepare a JSON configuration file with your desired settings, then run:

```bash
python main.py -auto ./job_scripts/example_small_molecule.json
```

Be sure to replace example_small_molecule.json with the actual file you want to use.
If the def_search_terms list contains only "local" and maybe_search_terms
is set to "none", LoA will skip the scraping step and instead process every
file found in the scraped_docs directory. This mode is useful when you have
already downloaded all the papers you want to analyze. Any directories inside
scraped_docs are ignored.
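For example, a job script for this local-only mode would contain the following (a sketch; any other keys depend on your schema and the options described below):

```json
{
  "def_search_terms": ["local"],
  "maybe_search_terms": "none"
}
```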
Example json files for small molecules, proteins, and peptides can be found in job_scripts.
The decimer_synthesis.json configuration focuses on synthesis papers rich in
chemical figures so DECIMER insertion can be tested easily.
Each configuration can optionally include a use_comments setting ("y" or "n") to control
whether a trailing comments column is added automatically. A similar use_solvent setting
("y" or "n") toggles a built-in solvent column that expects a SMILES string or common name.
If use_solvent is enabled, you can also set assume_water ("y" or "n"). When assume_water
is on and the model outputs null for the solvent column, the value will automatically be filled
with the SMILES string for water (O) without further validation.
Common solvents like water and ethanol are resolved through a built-in lookup
table before any online database queries.
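As a rough illustration of that flow, here is a minimal sketch; the table contents and function name are assumptions, not LoA's actual internals:

```python
# Illustrative sketch of the solvent handling described above.
COMMON_SOLVENTS = {"water": "O", "ethanol": "CCO", "methanol": "CO"}

def resolve_solvent(value, assume_water=False):
    """Map a solvent name to a SMILES string."""
    if value is None:
        # assume_water: a null model output becomes water's SMILES (O),
        # with no further validation.
        return "O" if assume_water else None
    # The built-in lookup is consulted first; unknown names would fall
    # through to an online database query (omitted in this sketch).
    return COMMON_SOLVENTS.get(value.strip().lower(), value)
```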
The use_decimer option ("y" or "n") controls whether images are processed
with DECIMER to extract additional SMILES strings from figures. When enabled,
any predicted SMILES are inserted back into the text at the location of the
corresponding figure instead of being appended separately.
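Conceptually, the insertion works something like the snippet below; the figure-marker format is an assumption for illustration:

```python
def splice_smiles(text: str, figure_marker: str, smiles: str) -> str:
    """Replace a figure's placeholder in extracted text with the
    DECIMER-predicted SMILES, rather than appending it at the end."""
    return text.replace(figure_marker, smiles, 1)

# Example with a hypothetical marker and the SMILES for aspirin:
text = "The product (see [FIGURE 2]) was isolated in 80% yield."
print(splice_smiles(text, "[FIGURE 2]", "CC(=O)Oc1ccccc1C(=O)O"))
```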
use_openai can be set to "y" if you want to send prompts to the OpenAI API
instead of using local Ollama models. Provide your API key with the api_key
setting. You can list available models for your key using:
```bash
curl https://api.openai.com/v1/models \
  -H "Authorization: Bearer $OPENAI_API_KEY"
```
When use_openai is enabled, model files are not downloaded through Ollama.
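Putting the options above together, a job script might look roughly like this; only keys discussed in this README are shown, and the value formats are best guesses, so treat the files in job_scripts as the authoritative reference:

```json
{
  "def_search_terms": ["perovskite", "solar cell"],
  "maybe_search_terms": ["efficiency", "stability"],
  "use_comments": "n",
  "use_solvent": "y",
  "assume_water": "y",
  "use_decimer": "n",
  "use_openai": "y",
  "api_key": "sk-..."
}
```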
- Scraping Module (scrape.py): Handles the retrieval of papers from various sources.
- Extraction Module (extract.py): Manages the process of extracting information from downloaded papers.
- Schema Creator (meta_model.py): Allows users to define custom schemas for data extraction.
- Document Reader (document_reader.py): Converts various document formats into processable text.
- Utilities (utils.py): Contains helper functions used throughout the project.
- Concurrent Mode (single_paper.py)
- Adding New Sources: Extend the scrape.py file to include new paper sources (see the sketch after this list).
- Custom Extraction Logic: Modify the extraction prompts in extract.py to suit specific research needs.
- Schema Definitions: Use the interactive schema creator to define new data extraction templates.
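As an example of the first point, a new source function might take a shape like the sketch below; the signature, endpoint, and file naming are assumptions, so match the conventions of the existing scrapers in scrape.py:

```python
import os
import requests

def scrape_new_source(search_terms, output_dir="scraped_docs"):
    """Hypothetical scraper for a new paper source."""
    resp = requests.get(
        "https://example.org/api/search",  # placeholder endpoint
        params={"q": " ".join(search_terms)},
        timeout=30,
    )
    resp.raise_for_status()
    os.makedirs(output_dir, exist_ok=True)
    for record in resp.json().get("results", []):
        pdf = requests.get(record["pdf_url"], timeout=30)
        pdf.raise_for_status()
        with open(os.path.join(output_dir, f"{record['id']}.pdf"), "wb") as fh:
            fh.write(pdf.content)
```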
- Respect the terms of service and rate limits of the sources you're scraping from.
- Ensure you have the necessary permissions to access and use the papers you're downloading.
- Regularly update your NLP models to benefit from the latest improvements.
- For large-scale scraping, consider using a distributed system to avoid overloading single sources.
LoA can optionally extract SMILES strings from PDFs and images using the
DECIMER project. Because DECIMER relies on a separate set of dependencies,
the recommended approach is to install it in its own conda environment. The
install_commands.txt file includes the commands to create an environment
named DECIMER and install the required packages. LoA will invoke this
environment via conda run whenever SMILES extraction is requested. SMILES
strings predicted from figures are inserted directly into the extracted text
where the corresponding images were located.
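The invocation is roughly equivalent to the following; the script name and arguments are placeholders, not LoA's actual call:

```python
import subprocess

# Run a DECIMER prediction inside the dedicated conda environment,
# mirroring the "conda run" mechanism described above.
result = subprocess.run(
    ["conda", "run", "-n", "DECIMER", "python", "predict_figure.py", "fig1.png"],
    capture_output=True,
    text=True,
    check=True,
)
predicted_smiles = result.stdout.strip()
```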
- If you encounter issues with model downloads, ensure you have a stable internet connection and sufficient disk space.
- For API-related errors, check your API keys and ensure they have the necessary permissions.
- If extraction results are unsatisfactory, try adjusting the schema or providing more specific user instructions.
Contributions to LoA are welcome! Please follow these steps:
- Fork the repository.
- Create a new branch for your feature.
- Commit your changes and push to your fork.
- Submit a pull request with a clear description of your changes.
This project uses several open-source libraries and APIs, such as Unstructured and Ollama. We're grateful to the maintainers of these projects for their work.
For more detailed information on each module and function, please refer to the inline documentation in the source code.