Merged: changes from all commits (39 commits)
- 47372b4 files to be merged added (kellywujy, Jun 6, 2023)
- e4a1653 1.0 notebook final (kellywujy, Jun 6, 2023)
- 672f0cd 2.0 rerun (kellywujy, Jun 6, 2023)
- 8ac7873 json format fix (kellywujy, Jun 7, 2023)
- f024118 4.0 update model eval (kellywujy, Jun 7, 2023)
- 2bb2628 gdd query script added (kellywujy, Jun 9, 2023)
- f760187 Usage Update (kellywujy, Jun 9, 2023)
- 5487b1d gddid var name fix (kellywujy, Jun 9, 2023)
- c1236be col name fixed to match gdd API return (kellywujy, Jun 9, 2023)
- c7e78ab delete duplicate files (kellywujy, Jun 9, 2023)
- cacf7ef more log info added (kellywujy, Jun 9, 2023)
- 9081dde test model file added (kellywujy, Jun 9, 2023)
- c962d1e model file update (kellywujy, Jun 9, 2023)
- 521a4c0 logger added to gdd api (kellywujy, Jun 9, 2023)
- 81abc0a updated logging info (kellywujy, Jun 9, 2023)
- 4f00209 NaN value converted to Null (kellywujy, Jun 9, 2023)
- df5aa62 Api fix in progress (kellywujy, Jun 11, 2023)
- 7c41967 Add functionality to search by term (kellywujy, Jun 12, 2023)
- 1ad53dd lang detect test added (kellywujy, Jun 12, 2023)
- 54cb5cb NaN bug fix (kellywujy, Jun 13, 2023)
- ff8d40d gdd parquet added (kellywujy, Jun 13, 2023)
- 21f6068 Version checkpoint - no parquet file set up (kellywujy, Jun 16, 2023)
- 85bc107 parquet file (kellywujy, Jun 16, 2023)
- ff5090f detect bug fixed (kellywujy, Jun 16, 2023)
- 4d979e9 parquet single run set up (kellywujy, Jun 19, 2023)
- 796aba7 pipeline fixed (kellywujy, Jun 20, 2023)
- e0d8773 Dockerfile setup (kellywujy, Jun 21, 2023)
- 6a65aee xdd placeholder added (kellywujy, Jun 21, 2023)
- 6eb378a optional argument edits (kellywujy, Jun 21, 2023)
- 640afe4 Dockerization done (kellywujy, Jun 21, 2023)
- f61ed6d readme added (kellywujy, Jun 21, 2023)
- 85a9d7a extra text removed (kellywujy, Jun 21, 2023)
- 7341c11 docker README updated (kellywujy, Jun 21, 2023)
- 7329086 Relevance Prediction Notebook Cleaned (kellywujy, Jun 21, 2023)
- e554317 redundant notebook clean up (kellywujy, Jun 21, 2023)
- 91a40db Remove redundant script (kellywujy, Jun 21, 2023)
- 469ff72 Remove local files from the commit (kellywujy, Jun 21, 2023)
- 39573c1 main README updated (kellywujy, Jun 21, 2023)
- 55bf95d Merge branch 'dev' into kw-pipeline-notebook-merge (shaunhutch, Jun 21, 2023)
10 changes: 9 additions & 1 deletion README.md

@@ -17,7 +17,15 @@ There are 3 primary components to this project:

## **Article Relevance Prediction**

The goal of this component is to monitor and identify new articles that are relevant to Neotoma. The public [xDD API](https://geodeepdive.org/) is used to regularly retrieve recently published articles. Article metadata is queried from the [CrossRef API](https://www.crossref.org/documentation/retrieve-metadata/rest-api/) to obtain data such as journal name, title, and abstract. This metadata is then used to predict whether the article is relevant to Neotoma.
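
As a rough illustration of the metadata step, a single article's record can be pulled from the CrossRef REST API like this (the DOI is a placeholder and the field handling is a simplified sketch, not the pipeline's actual query code):

```python
import requests

# Fetch CrossRef metadata for one article by DOI (placeholder DOI).
doi = "10.5555/example"  # hypothetical
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
msg = resp.json()["message"]
title = (msg.get("title") or [""])[0]               # article title
journal = (msg.get("container-title") or [""])[0]   # journal name
abstract = msg.get("abstract", "")                  # JATS XML string when present
```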

The model was trained on ~900 positive examples (a sample of articles currently contributing to Neotoma) and ~3500 negative examples (a sample of articles unrelated or only loosely related to Neotoma). A logistic regression model was chosen for its strong performance and interpretability.

Articles predicted to be relevant will then be submitted to the Data Extraction Pipeline for processing.

![](assets/article_prediction_flow.png)

To run the Docker image for the article relevance prediction pipeline, please refer to the instructions [here](https://github.com/NeotomaDB/MetaExtractor/tree/main/docker/article-relevance).

## **Data Extraction Pipeline**

26 changes: 26 additions & 0 deletions docker/article-relevance/Dockerfile

@@ -0,0 +1,26 @@
# Use the official Python 3.10 image as the base image
FROM python:3.10

# Set the working directory inside the container
WORKDIR /app/

# Copy the requirements file to the container
COPY docker/article-relevance/requirements.txt .

# Install the required Python packages
RUN pip install --no-cache-dir -r requirements.txt

# Copy the source code folder into the container
COPY src ./src

# Copy the model folder into the container
COPY models/article-relevance ./models/article-relevance

# Copy the shell script to the container
COPY docker/article-relevance/run-prediction.sh .

# Make the shell script executable
RUN chmod +x run-prediction.sh

# Set the entry point for the Docker container
ENTRYPOINT ["./run-prediction.sh"]
91 changes: 91 additions & 0 deletions docker/article-relevance/README.md

@@ -0,0 +1,91 @@
# Meta Extractor Article Relevance Prediction Pipeline Docker Image

This Docker image contains the models and code required to run article relevance prediction for research articles in the xDD system. It queries the xDD article repository using user-specified parameters, returns the DOIs of the matching articles, and predicts whether each article is relevant to paleoecology/paleoenvironment research.

Running this docker image will:

1. Create a JSON file containing the gddid and DOI information for a list of xDD articles.
2. Load the article IDs from the JSON file and query the CrossRef API to retrieve article metadata.
3. Use the metadata and the model to predict the relevance of each article.
4. Create a parquet file containing the article metadata and the prediction results (the sketch below shows one way to read it back).
5. (Placeholder) Pass articles deemed relevant to the xDD API on the xDD server; the resulting parquet file records whether the API call succeeded for each article.
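
For example, the resulting parquet output can be read back with pandas for inspection (the path matches the `PARQUET_PATH` used in the samples below):

```python
import pandas as pd

# Load the prediction results written by the pipeline (pyarrow engine).
df = pd.read_parquet("data/article-relevance/processed/prediction_parquet")
print(df.head())  # inspect the article metadata and predicted relevance
```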

To run the article relevance prediction pipeline, the Docker image must be built first:

```bash
docker build -t metaextractor-article-relevance-prediction:v0.0.1 -f docker/article-relevance/Dockerfile .
```
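
As an alternative to the compose setup below, the image can also be run directly with `docker run`, passing the same environment variables. A minimal sketch (the volume mount is an assumption so that outputs written under `/app/data` persist on the host):

```bash
docker run --rm \
  -e DOI_PATH=data/article-relevance/raw \
  -e PARQUET_PATH=data/article-relevance/processed/prediction_parquet \
  -e N_RECENT=10 \
  -e AUTO_MIN_DATE=False \
  -e AUTO_CHECK_DUP=False \
  -e DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json \
  -e MODEL_PATH=models/article-relevance/logistic_regression_model.joblib \
  -e OUTPUT_PATH=data/article-relevance/processed \
  -e SEND_XDD=False \
  -v "$(pwd)/data:/app/data" \
  metaextractor-article-relevance-prediction:v0.0.1
```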

## Additional Options Enabled by Environment Variables

The search criteria for the article list and the file paths are specified through environment variables.

The following environment variables can be set to change the behavior of the pipeline:

Arguments for controlling the xDD API Query:
- `DOI_PATH`: Mandatory. The JSON file containing the queried article list will be saved here.
- `PARQUET_PATH`: Mandatory. This is the folder where processed parquet files are stored.
- `N_RECENT`: This variable can be set to a number to retrieve the n most recently added articles. When this variable is set, MIN_DATE and MAX_DATE should not be set.
- `MIN_DATE`: This variable can be set to establish the earliest date for the range of articles to be included. The date should follow the format yyyy-mm-dd.
- `MAX_DATE`: This variable can be set to establish the latest date for the range of articles to be included. The date should follow the format yyyy-mm-dd.
- `TERM`: This variable can be set to a term to search for in the articles.
- `AUTO_MIN_DATE`: This variable can be set to True or False. If set to True, the pipeline will scan the dates of existing processed parquet files and use the latest date found as the earliest date for this run.
- `AUTO_CHECK_DUP`: This variable can be set to True or False. If set to True, the pipeline will scan existing processed parquet files and exclude already-processed articles from the list.

Arguments for controlling the relevance prediction:
- `DOI_FILE_PATH`: This is the path to the JSON file containing the article list.
- `MODEL_PATH`: This is the path to the classification model.
- `OUTPUT_PATH`: This is the path where the parquet files (containing article metadata and prediction results) are saved.
- `SEND_XDD`: This variable can be set to True or False. If set to True, articles predicted to be relevant will be sent to the xDD API to go through named entity recognition (NER) processing.
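
As a minimal sketch of the prediction step, assuming the classifier consumes sentence-transformer embeddings of the article text (the embedding model name and feature construction here are illustrative assumptions, not necessarily what `relevance_prediction_parquet.py` does):

```python
from joblib import load
from sentence_transformers import SentenceTransformer

# Load the trained logistic regression classifier.
clf = load("models/article-relevance/logistic_regression_model.joblib")

# Embed the article text; the embedding model name is an assumption.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
text = "Example title. Example abstract about Holocene pollen records."
X = embedder.encode([text])

# Predict relevance and the associated class probabilities.
print(clf.predict(X), clf.predict_proba(X))
```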

## Sample Docker Compose Setup

Below is a sample docker compose configuration for running the image.

Sample 1: Query by number of most recently added articles
```yaml
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=10
- MIN_DATE=
- MAX_DATE=
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
```

Sample 2: Query by date range
```yaml
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=
- MIN_DATE=2023-06-04
- MAX_DATE=2023-06-05
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
```
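
With either sample saved as `docker/article-relevance/docker-compose.yml` (as in this repository), the pipeline can then be started from the repository root:

```bash
docker compose -f docker/article-relevance/docker-compose.yml up
```
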
20 changes: 20 additions & 0 deletions docker/article-relevance/docker-compose.yml

@@ -0,0 +1,20 @@
version: "3.8"
services:
article-relevance-prediction:
image: metaextractor-article-relevance-prediction:v0.0.1
environment:
# Arguments for xDD API Query
- DOI_PATH=data/article-relevance/raw
- PARQUET_PATH=data/article-relevance/processed/prediction_parquet
- N_RECENT=10
- MIN_DATE=
- MAX_DATE=
- TERM=
- AUTO_MIN_DATE=False
- AUTO_CHECK_DUP=False

# Arguments for relevance prediction script
- DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json
- MODEL_PATH=models/article-relevance/logistic_regression_model.joblib
- OUTPUT_PATH=data/article-relevance/processed
- SEND_XDD=False
10 changes: 10 additions & 0 deletions docker/article-relevance/requirements.txt

@@ -0,0 +1,10 @@
# python version 3.10
pandas==1.5.3
numpy~=1.23
requests~=2.28
docopt-ng~=0.8
pyarrow==11.0.0
langdetect==1.0.9
joblib==1.2.0
scikit-learn==1.2.2
sentence-transformers==2.2.2
17 changes: 17 additions & 0 deletions docker/article-relevance/run-prediction.sh

@@ -0,0 +1,17 @@
#!/bin/bash

python src/article_relevance/gdd_api_query.py \
--doi_path="$DOI_PATH" \
--parquet_path="$PARQUET_PATH" \
--n_recent="$N_RECENT" \
--term="$TERM" \
--min_date="$MIN_DATE" \
--max_date="$MAX_DATE" \
--auto_min_date="$AUTO_MIN_DATE" \
--auto_check_dup="$AUTO_CHECK_DUP"

python src/article_relevance/relevance_prediction_parquet.py \
--doi_file_path="$DOI_FILE_PATH" \
--model_path="$MODEL_PATH" \
--output_path="$OUTPUT_PATH" \
--send_xdd="$SEND_XDD"
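
Note: for local debugging outside Docker, the same script can be run from the repository root with the environment variables set inline (assuming a Python environment with `requirements.txt` installed; the values below are examples):

```bash
DOI_PATH=data/article-relevance/raw \
PARQUET_PATH=data/article-relevance/processed/prediction_parquet \
N_RECENT=10 AUTO_MIN_DATE=False AUTO_CHECK_DUP=False \
DOI_FILE_PATH=data/article-relevance/raw/gdd_api_return.json \
MODEL_PATH=models/article-relevance/logistic_regression_model.joblib \
OUTPUT_PATH=data/article-relevance/processed SEND_XDD=False \
bash docker/article-relevance/run-prediction.sh
```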