Binary file removed reports/final/finding-fossils-final.pdf
Binary file not shown.
@@ -16,7 +16,7 @@ @article{NeotomaDB
pages={156-177}
}

@misc{geodeepdive,
@misc{xDD,
title = {xDD API},
author = {{Peters, S.E., I.A. Ross, T. Rekatsinas, M. Livny}},
year = {2021},
@@ -76,12 +76,16 @@ @misc{ontonotes
url={https://catalog.ldc.upenn.edu/LDC2013T19}
}

@software{LabelStudio,
title = {{Label Studio}: Data labeling software},
url = {https://github.com/heartexlabs/label-studio},
version = {1.7.3},
note={Open source software available from https://github.com/heartexlabs/label-studio},
date = {2023-05-09}
@misc{LabelStudio,
title={{Label Studio}: Data labeling software},
url={https://github.com/heartexlabs/label-studio},
note={Open source software available from https://github.com/heartexlabs/label-studio},
author={
Maxim Tkachenko and
Mikhail Malyuk and
Andrey Holmanyuk and
Nikolai Liubimov},
year={2020-2022},
}

@article{inproceedings,
@@ -146,4 +150,29 @@ @misc{borealisai2023tutorial
author = {Borealis AI},
howpublished = {\url{https://www.borealisai.com/research-blogs/tutorial-17-transformers-iii-training/}},
year = {2023}
}
}
@misc{crossref,
title = {Crossref REST API},
author = {{Crossref}},
year = {2023},
url = {https://www.crossref.org/services/metadata-delivery/rest-api/},
note = {[Accessed May 9, 2023]},
}

@article{roberta-ner-wang,
title={Application of Pre-training Models in Named Entity Recognition},
author={Yu Wang},
journal={2020 12th International Conference on Intelligent Human-Machine Systems and Cybernetics (IHMSC)},
year={2020},
volume={},
doi={10.1109/IHMSC49165.2020.00013}
}
@software{docker,
publisher={Docker},
title = {Docker},
url = {https://www.docker.com/},
version = {23.0.5},
date = {2023-06-27}

}

Binary file added reports/final_partner/finding-fossils-final.pdf
Binary file not shown.

Large diffs are not rendered by default.

28 changes: 10 additions & 18 deletions reports/proposal/finding-fossils-proposal.qmd
@@ -37,7 +37,6 @@ Finding Fossils in the Literature is sponsored by the Neotoma database (Neotoma)
\setcounter{tocdepth}{4}
\tableofcontents
```

\newpage

# Introduction
@@ -46,7 +45,6 @@ The Neotoma database (Neotoma) [@NeotomaDB] is used by researchers studying ecol

\newpage


# Article Relevance Prediction

The first step is to build a document classification model to assess the relevance of the new articles to Neotoma.
@@ -68,7 +66,7 @@ The data to be used for developing the article relevance prediction model comes
+------------------------+----------------------------------------+----------------------------------------------------------------+
| language | Article language | One-hot encoding |
+------------------------+----------------------------------------+----------------------------------------------------------------+
| published | Date article was published | Year as numeric features |
| published | Date article was published | Year as numeric features \| |
+------------------------+----------------------------------------+----------------------------------------------------------------+
| publisher | Publisher name | One-hot encoding |
+------------------------+----------------------------------------+----------------------------------------------------------------+
@@ -89,13 +87,13 @@ It is proposed that supervised machine learning approaches are used to predict a

Approaches will be evaluated primarily on F1-Score with recall being monitored to reduce false negatives (Table 2).

+-------------+------------------+
| **Metric** | **Target** |
+:===========:+:================:+
| F1-Score | \> 83% |
| | |
| | [@alex2022raft] |
+-------------+------------------+
+---------------+-----------------+
| **Metric** | **Target** |
+:=============:+:===============:+
| F1-Score | \> 83% |
| | |
| | [@alex2022raft] |
+---------------+-----------------+

: Proposed evaluation metric and target value for article relevancy prediction.
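
The F1-Score target and the recall check described above can be illustrated with a small, dependency-free sketch. The function name `precision_recall_f1`, the toy labels, and the `TARGET_F1` constant are assumptions made for illustration only; they are not taken from the project code, and label `1` is assumed to mean "relevant to Neotoma".

```python
# Minimal sketch: precision, recall and F1 for a binary relevance classifier.
# Label 1 = relevant article, label 0 = not relevant (assumed convention).

TARGET_F1 = 0.83  # proposed target from the evaluation table


def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels given as 0/1 sequences."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical labels: one false negative and one false positive.
    y_true = [1, 1, 1, 0, 0, 1, 0, 1]
    y_pred = [1, 1, 0, 0, 1, 1, 0, 1]
    precision, recall, f1 = precision_recall_f1(y_true, y_pred)
    print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
    print("meets F1 target:", f1 > TARGET_F1)
```

Each false negative is a relevant article that never reaches the extraction pipeline, which is why recall is reported alongside F1 rather than folded into it.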

@@ -121,7 +119,6 @@ Additionally, we will leverage pre-trained BERT based large language models for

\newpage


# Fossil Data Extraction Pipeline

The fossil data extraction pipeline receives the list of articles which are predicted to be relevant and processes each article's full text to pull out data that fits within the Neotoma DB tables.
@@ -183,7 +180,6 @@ For each entity to be extracted the following approaches are proposed in Table 5

: Proposed baseline approach for each entity.


### Approach 1: Fine Tuned SpaCy NER Model

The spaCy Python package (@spacy) includes the en_core_web_lg NER model. This model utilizes convolutional neural networks to generate text embeddings, which are used to classify each token of text according to the custom entity labels specific to the dataset. It has been pre-trained on texts from the English language and achieves NER accuracy of 85.5% on the OntoNotes 5.0 corpus [@ontonotes].
@@ -200,7 +196,7 @@ The final step in the process is data stewards from Neotoma reviewing the extrac

## Success Criteria

Requirements are summarized in Table 6.
Requirements are summarized in Table 6.

+-------------------------------------------------------+-------------------------------------------------+
| **Requirement** | **Target** |
@@ -224,11 +220,10 @@ Requirements are summarized in Table 6.

A draft wireframe for how the tool may look is below in Figure 1.

![Data review tool wireframe.](assets/data-review-tool-wireframe.png){width=350 fig.pos='h'}
![Data review tool wireframe.](assets/data-review-tool-wireframe.png){width="350" fig.pos="h"}

\newpage


# Timeline

Deadlines and proposed intermediate milestones are outlined below and in Figure 2. Tasks will be completed in parallel by the team where appropriate.
@@ -243,8 +238,6 @@ Deadlines and proposed intermediate milestones are outlined below and in Figure

- *Milestone 4 - June 9th*: Solution deployment & final presentation.



![Proposed project timeline](assets/timeline.png)

\newpage
@@ -255,7 +248,6 @@ Data were obtained from the Neotoma Paleoecology Database (http://www.neotomadb.

A huge thanks to Simon Goring & Socorro Dominguez from the Neotoma Database team for their support on this project thus far.


\newpage

# References