NeotomaDB · tieandrews · Jun 28, 2023 · Jun 26, 2023
diff --git a/reports/final/finding-fossils-final.pdf b/reports/final/finding-fossils-final.pdf
diff --git a/reports/final/assets/data_review.png b/reports/final_partner/assets/data_review.png
diff --git a/...s/finding-fossils-logo-symbol_highres.png b/...s/finding-fossils-logo-symbol_highres.png
diff --git a/reports/final/assets/references.bib b/reports/final_partner/assets/references.bib
@@ -16,7 +16,7 @@ @article{NeotomaDB
 pages={156-177}
 }
 
-@misc{geodeepdive,
+@misc{xDD,
   title = {xDD API},
   author = {{Peters, S.E., I.A. Ross, T. Rekatsinas, M. Livny}},
   year = {2021},
@@ -76,12 +76,16 @@ @misc{ontonotes
   url={https://catalog.ldc.upenn.edu/LDC2013T19}
 }
 
-@software{LabelStudio,
-    title = {{Label Studio}: Data labeling software},
-    url = {https://github.com/heartexlabs/label-studio},
-    version = {1.7.3},
-    note={Open source software available from https://github.com/heartexlabs/label-studio},
-    date = {2023-05-09}
+@misc{LabelStudio,
+  title={{Label Studio}: Data labeling software},
+  url={https://github.com/heartexlabs/label-studio},
+  note={Open source software available from https://github.com/heartexlabs/label-studio},
+  author={
+    Maxim Tkachenko and
+    Mikhail Malyuk and
+    Andrey Holmanyuk and
+    Nikolai Liubimov},
+  year={2020-2022},
 }
 
 @article{inproceedings,
@@ -146,4 +150,29 @@ @misc{borealisai2023tutorial
   author = {Borealis AI},
   howpublished = {\url{https://www.borealisai.com/research-blogs/tutorial-17-transformers-iii-training/}},
   year = {2023}
-}
+}
+@misc{crossref,
+  title = {Crossref REST API},
+  author = {{Crossref}},
+  year = {2023},
+  url = {https://www.crossref.org/services/metadata-delivery/rest-api/},
+  note = {[Accessed May 9, 2023]},
+}
+
+@article{roberta-ner-wang,
+    title={Application of Pre-training Models in Named Entity Recognition},
+    author={Yu Wang},
+    journal={2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)},
+    year={2020},
+    volume={},
+    doi={10.1109/IHMSC49165.2020.00013}
+}
+@software{docker,
+  publisher={Docker},
+  title = {Docker},
+  url = {https://www.docker.com/},
+  version = {23.0.5},
+  date = {2023-06-27}
+
+}
+
diff --git a/...nal/assets/relevance-model-comparison.png b/...ner/assets/relevance-model-comparison.png
diff --git a/reports/final_partner/finding-fossils-final.pdf b/reports/final_partner/finding-fossils-final.pdf
diff --git a/reports/final/finding-fossils-final.qmd b/...s/final_partner/finding-fossils-final.qmd
diff --git a/reports/proposal/finding-fossils-proposal.qmd b/reports/proposal/finding-fossils-proposal.qmd
@@ -37,7 +37,6 @@ Finding Fossils in the Literature is sponsored by the Neotoma database (Neotoma)
 \setcounter{tocdepth}{4}
 \tableofcontents
 ```
-
 \newpage
 
 # Introduction
@@ -46,7 +45,6 @@ The Neotoma database (Neotoma) [@NeotomaDB] is used by researchers studying ecol
 
 \newpage
 
-
 # Article Relevance Prediction
 
 The first step is to build a document classification model to assess the relevance of the new articles to Neotoma.
@@ -68,7 +66,7 @@ The data to be used for developing the article relevance prediction model comes
 +------------------------+----------------------------------------+----------------------------------------------------------------+
 | language               | Article language                       | One-hot encoding                                               |
 +------------------------+----------------------------------------+----------------------------------------------------------------+
-| published              | Date article was published             | Year as numeric features                                 |
+| published              | Date article was published             | Year as numeric features                                       |
 +------------------------+----------------------------------------+----------------------------------------------------------------+
 | publisher              | Publisher name                         | One-hot encoding                                               |
 +------------------------+----------------------------------------+----------------------------------------------------------------+
@@ -89,13 +87,13 @@ It is proposed that supervised machine learning approaches are used to predict a
 
 Approaches will be evaluated primarily on F1-Score with recall being monitored to reduce false negatives (Table 2).
 
-+-------------+------------------+
-| **Metric**  | **Target**       |
-+:===========:+:================:+
-| F1-Score    | \> 83%           |
-|             |                  |
-|             | [@alex2022raft]  |
-+-------------+------------------+
++---------------+-----------------+
+| **Metric**    | **Target**      |
++:=============:+:===============:+
+| F1-Score      | \> 83%          |
+|               |                 |
+|               | [@alex2022raft] |
++---------------+-----------------+
 
 : Proposed evaluation metric and target value for article relevancy prediction.
 
@@ -121,7 +119,6 @@ Additionally, we will leverage pre-trained BERT based large language models for
 
 \newpage
 
-
 # Fossil Data Extraction Pipeline
 
 The fossil data extraction pipeline receives the list of articles which are predicted to be relevant and processes each article's full text to pull out data that fits within the Neotoma DB tables.
@@ -183,7 +180,6 @@ For each entity to be extracted the following approaches are proposed in Table 5
 
 : Proposed baseline approach for each entity.
 
-
 ### Approach 1: Fine Tuned SpaCy NER Model
 
 The spaCy Python package (@spacy) includes the en_core_web_lg NER model. This model utilizes convolutional neural networks to generate text embeddings, which are used to classify each token of text according to the custom entity labels specific to the dataset. It has been pre-trained on texts from the English language and achieves NER accuracy of 85.5% on the OntoNotes 5.0 corpus [@ontonotes].
@@ -200,7 +196,7 @@ The final step in the process is data stewards from Neotoma reviewing the extrac
 
 ## Success Criteria
 
-Requirements are summarized in Table 6. 
+Requirements are summarized in Table 6.
 
 +-------------------------------------------------------+-------------------------------------------------+
 | **Requirement**                                       | **Target**                                      |
@@ -224,11 +220,10 @@ Requirements are summarized in Table 6.
 
 A draft wireframe for how the tool may look is below in Figure 1.
 
-![Data review tool wireframe.](assets/data-review-tool-wireframe.png){width=350 fig.pos='h'}
+![Data review tool wireframe.](assets/data-review-tool-wireframe.png){width="350" fig.pos="h"}
 
 \newpage
 
-
 # Timeline
 
 Deadlines and proposed intermediate milestones are outlined below and in Figure 2. Tasks will be completed in parallel by the team where appropriate.
@@ -243,8 +238,6 @@ Deadlines and proposed intermediate milestones are outlined below and in Figure
 
 -   *Milestone 4 - June 9th*: Solution deployment & final presentation.
 
-
-
 ![Proposed project timeline](assets/timeline.png)
 
 \newpage
@@ -255,7 +248,6 @@ Data were obtained from the Neotoma Paleoecology Database (http://www.neotomadb.
 
 A huge thanks to Simon Goring & Socorro Dominguez from the Neotoma Database team for their support on this project thus far.
 
-
 \newpage
 
 # References