Merged: changes from all commits (39 commits)
- 47372b4 files to be merged added (kellywujy, Jun 6, 2023)
- e4a1653 1.0 notebook final (kellywujy, Jun 6, 2023)
- 672f0cd 2.0 rerun (kellywujy, Jun 6, 2023)
- 8ac7873 json format fix (kellywujy, Jun 7, 2023)
- f024118 4.0 update model eval (kellywujy, Jun 7, 2023)
- 2bb2628 gdd query script added (kellywujy, Jun 9, 2023)
- f760187 Usage Update (kellywujy, Jun 9, 2023)
- 5487b1d gddid var name fix (kellywujy, Jun 9, 2023)
- c1236be col name fixed to match gdd API return (kellywujy, Jun 9, 2023)
- c7e78ab delete duplicate files (kellywujy, Jun 9, 2023)
- cacf7ef more log info added (kellywujy, Jun 9, 2023)
- 9081dde test model file added (kellywujy, Jun 9, 2023)
- c962d1e model file update (kellywujy, Jun 9, 2023)
- 521a4c0 logger added to gdd api (kellywujy, Jun 9, 2023)
- 81abc0a updated logging info (kellywujy, Jun 9, 2023)
- 4f00209 NaN value converted to Null (kellywujy, Jun 9, 2023)
- df5aa62 Api fix in progress (kellywujy, Jun 11, 2023)
- 7c41967 Add functionality to search by term (kellywujy, Jun 12, 2023)
- 1ad53dd lang detect test added (kellywujy, Jun 12, 2023)
- 54cb5cb NaN bug fix (kellywujy, Jun 13, 2023)
- ff8d40d gdd parquet added (kellywujy, Jun 13, 2023)
- 21f6068 Version checkpoint - no parquet file set up (kellywujy, Jun 16, 2023)
- 85bc107 parquet file (kellywujy, Jun 16, 2023)
- ff5090f detect bug fixed (kellywujy, Jun 16, 2023)
- 4d979e9 parquet single run set up (kellywujy, Jun 19, 2023)
- 796aba7 pipeline fixed (kellywujy, Jun 20, 2023)
- e0d8773 Dockerfile setup (kellywujy, Jun 21, 2023)
- 6a65aee xdd placeholder added (kellywujy, Jun 21, 2023)
- 6eb378a optional argument edits (kellywujy, Jun 21, 2023)
- 640afe4 Dockerization done (kellywujy, Jun 21, 2023)
- f61ed6d readme added (kellywujy, Jun 21, 2023)
- 85a9d7a extra text removed (kellywujy, Jun 21, 2023)
- 7341c11 docker README updated (kellywujy, Jun 21, 2023)
- 7329086 Relevance Prediction Notebook Cleaned (kellywujy, Jun 21, 2023)
- e554317 redundant notebook clean up (kellywujy, Jun 21, 2023)
- 91a40db Remove redundant script (kellywujy, Jun 21, 2023)
- 469ff72 Remove local files from the commit (kellywujy, Jun 21, 2023)
- 39573c1 main README updated (kellywujy, Jun 21, 2023)
- 55bf95d Merge branch 'dev' into kw-pipeline-notebook-merge (shaunhutch, Jun 21, 2023)
10 changes: 9 additions & 1 deletion README.md

@@ -17,7 +17,15 @@ There are 3 primary components to this project:

## **Article Relevance Prediction**

The goal of this component is to monitor and identify new articles that are relevant to Neotoma. The public [xDD API](https://geodeepdive.org/) is used to regularly retrieve recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, and abstract. This metadata is then used to predict whether the article is relevant to Neotoma.
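
As a rough illustration of the metadata step, a single article's record can be pulled from the CrossRef REST API like this (the DOI is a placeholder and the field handling is a simplified sketch, not the pipeline's actual query code):

```python
import requests

# Fetch CrossRef metadata for one article by DOI (placeholder DOI).
doi = "10.5555/example"  # hypothetical
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
msg = resp.json()["message"]
title = (msg.get("title") or [""])[0]               # article title
journal = (msg.get("container-title") or [""])[0]   # journal name
abstract = msg.get("abstract", "")                  # JATS XML string when present
```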

The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or only loosely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.

Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.

![](assets/article_prediction_flow.png)

To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](https://github.com/NeotomaDB/MetaExtractor/tree/main/docker/article-relevance).

## **Data Extraction Pipeline**

26 changes: 26 additions & 0 deletions docker/article-relevance/Dockerfile

@@ -0,0 +1,26 @@
# Use the official Python 3.10 image as the base image
FROM python:3.10

# Set the working directory inside the container
WORKDIR /app/

# Copy the requirements file to the container
COPY docker/article-relevance/requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source code folder into the container
COPY src ./src

# Copy the model folder into the container
COPY models/article-relevance ./models/article-relevance

# Copy the shell script to the container
COPY docker/article-relevance/run-prediction.sh .

# Make the shell script executable
RUN chmod +x run-prediction.sh

# Set the entry point for the Docker container
ENTRYPOINT ["./run-prediction.sh"]
91 changes: 91 additions & 0 deletions docker/article-relevance/README.md

@@ -0,0 +1,91 @@
# Meta Extractor Article Relevance Prediction Pipeline Docker Image

This Docker image contains the models and code required to run article relevance prediction for research articles in the xDD system. It queries the xDD article repository using user-specified parameters, returns the DOIs of the matching articles, and predicts whether each article is relevant to paleoecology/paleoenvironment research.

Running this docker image will:

1. Create a JSON file containing the gddid and DOI information for a list of xDD articles.
2. Load the article IDs from the JSON file and query the CrossRef API to retrieve article metadata.
3. Use the metadata and the model to predict the relevance of each article.
4. Create a parquet file containing the article metadata and the prediction results (the sketch below shows one way to read it back).
5. (Placeholder) Pass articles deemed relevant to the xDD API on the xDD server; the resulting parquet file records whether the API call succeeded for each article.
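
For example, the resulting parquet output can be read back with pandas for inspection (the path matches the `PARQUET_PATH` used in the samples below):

```python
import pandas as pd

# Load the prediction results written by the pipeline (pyarrow engine).
df = pd.read_parquet("data/article-relevance/processed/prediction_parquet")
print(df.head())  # inspect the article metadata and predicted relevance
```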

To run the article relevance prediction pipeline, the Docker image must be built first:

```bash
docker build -t metaextractor-article-relevance-prediction:v0.0.1 -f docker/article-relevance/Dockerfile .
```
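
As an alternative to the compose setup below, the image can also be run directly with `docker run`, passing the same environment variables. A minimal sketch (the volume mount is an assumption so that outputs written under `/app/data` persist on the host):

```bash
docker run --rm \
  -e DOI_PATH=data/article-relevance/raw \
  -e PARQUET_PATH=data/article-relevance/processed/prediction_parquet \
  -e N_RECENT=10 \
  -e AUTO_MIN_DATE=False \
  -e AUTO_CHECK_DUP=False \
  -e DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json \
  -e MODEL_PATH=models/article-relevance/logistic_regression_model.joblib \
  -e OUTPUT_PATH=data/article-relevance/processed \
  -e SEND_XDD=False \
  -v "$(pwd)/data:/app/data" \
  metaextractor-article-relevance-prediction:v0.0.1
```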

## Additional Options Enabled by Environment Variables

The search criteria for the article list and the file paths are specified through environment variables.

The following environment variables can be set to change the behavior of the pipeline:

Arguments for controlling the xDD API Query:
- `DOI_PATH`: Mandatory. The JSON file containing the queried article list will be saved here.
- `PARQUET_PATH`: Mandatory. This is the folder where processed parquet files are stored.
- `N_RECENT`: This variable can be set to a number to retrieve the n most recently added articles. When this variable is set, MIN_DATE and MAX_DATE should not be set.
- `MIN_DATE`: This variable can be set to establish the earliest date for the range of articles to be included. The date should follow the format yyyy-mm-dd.
- `MAX_DATE`: This variable can be set to establish the latest date for the range of articles to be included. The date should follow the format yyyy-mm-dd.
- `TERM`: This variable can be set to a term to search for in the articles.
- `AUTO_MIN_DATE`: This variable can be set to True or False. If set to True, the pipeline will scan the dates of existing processed parquet files and use the latest date found as the earliest date for this run.
- `AUTO_CHECK_DUP`: This variable can be set to True or False. If set to True, the pipeline will scan existing processed parquet files and exclude already-processed articles from the list.

Arguments for controlling the relevance prediction:
- `DOI_FILE_PATH`: This is the path to the JSON file containing the article list.
- `MODEL_PATH`: This is the path to the classification model.
- `OUTPUT_PATH`: This is the path where the parquet files (containing article metadata and prediction results) are saved.
- `SEND_XDD`: This variable can be set to True or False. If set to True, articles predicted to be relevant will be sent to the xDD API to go through named entity recognition (NER) processing.
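
As a minimal sketch of the prediction step, assuming the classifier consumes sentence-transformer embeddings of the article text (the embedding model name and feature construction here are illustrative assumptions, not necessarily what `relevance_prediction_parquet.py` does):

```python
from joblib import load
from sentence_transformers import SentenceTransformer

# Load the trained logistic regression classifier.
clf = load("models/article-relevance/logistic_regression_model.joblib")

# Embed the article text; the embedding model name is an assumption.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
text = "Example title. Example abstract about Holocene pollen records."
X = embedder.encode([text])

# Predict relevance and the associated class probabilities.
print(clf.predict(X), clf.predict_proba(X))
```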

## Sample Docker Compose Setup

Below is a sample docker compose configuration for running the image.

Sample 1: Query by number of most recently added articles
```yaml
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=10
- MIN_DATE=
- MAX_DATE=
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
```

Sample 2: Query by date range
```yaml
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=
- MIN_DATE=2023-06-04
- MAX_DATE=2023-06-05
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
```
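
With either sample saved as `docker/article-relevance/docker-compose.yml` (as in this repository), the pipeline can then be started from the repository root:

```bash
docker compose -f docker/article-relevance/docker-compose.yml up
```
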
20 changes: 20 additions & 0 deletions docker/article-relevance/docker-compose.yml

@@ -0,0 +1,20 @@
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=10
- MIN_DATE=
- MAX_DATE=
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
10 changes: 10 additions & 0 deletions docker/article-relevance/requirements.txt

@@ -0,0 +1,10 @@
# python version 3.10
pandas==1.5.3
numpy~=1.23
requests~=2.28
docopt-ng~=0.8
pyarrow==11.0.0
langdetect==1.0.9
joblib==1.2.0
scikit-learn==1.2.2
sentence-transformers==2.2.2
17 changes: 17 additions & 0 deletions docker/article-relevance/run-prediction.sh

@@ -0,0 +1,17 @@
#!/bin/bash

python src/article_relevance/gdd_api_query.py \
--doi_path="$DOI_PATH" \
--parquet_path="$PARQUET_PATH" \
--n_recent="$N_RECENT" \
--term="$TERM" \
--min_date="$MIN_DATE" \
--max_date="$MAX_DATE" \
--auto_min_date="$AUTO_MIN_DATE" \
--auto_check_dup="$AUTO_CHECK_DUP"

python src/article_relevance/relevance_prediction_parquet.py \
--doi_file_path="$DOI_FILE_PATH" \
--model_path="$MODEL_PATH" \
--output_path="$OUTPUT_PATH" \
--send_xdd="$SEND_XDD"
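
Note: for local debugging outside Docker, the same script can be run from the repository root with the environment variables set inline (assuming a Python environment with `requirements.txt` installed; the values below are examples):

```bash
DOI_PATH=data/article-relevance/raw \
PARQUET_PATH=data/article-relevance/processed/prediction_parquet \
N_RECENT=10 AUTO_MIN_DATE=False AUTO_CHECK_DUP=False \
DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json \
MODEL_PATH=models/article-relevance/logistic_regression_model.joblib \
OUTPUT_PATH=data/article-relevance/processed SEND_XDD=False \
bash docker/article-relevance/run-prediction.sh
```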