Angelez/bibentries #113
Conversation
```python
# TODO: these start/ends are page token IDs?
start=,
end=,
```
@cmwilhelm This is question #1
```python
# TODO tokens .. ?
self.postprocess(model_outputs, document.tokens[image_index], image_index, image)
```
@cmwilhelm This is question #2: how do I learn more about how tokens exist in the Document that will be given as input? I'm assuming I can make it a requirement that the Document / Instance input contains tokens per page somehow, but I'm struggling to understand how to match those tokens to what my bounding boxes contain. In my other code I use layoutparser's `Layout.filter_by`: `bib_tokens = page_tokens.filter_by(bib_block, center=True)`
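For reference, here's a rough sketch of the center-containment check I'd need to replicate against mmda boxes (untested; the helper name, the per-token `spans[0].box` access, and the `l`/`t`/`w`/`h` attributes are my assumptions about the mmda types):

```python
def tokens_inside(bib_box, page_tokens):
    """Keep tokens whose center point falls inside the detected bib box,
    mirroring layoutparser's filter_by(..., center=True)."""
    matched = []
    for token in page_tokens:
        box = token.spans[0].box  # assuming one span with one box per token
        cx, cy = box.l + box.w / 2, box.t + box.h / 2
        if (bib_box.l <= cx <= bib_box.l + bib_box.w
                and bib_box.t <= cy <= bib_box.t + bib_box.h):
            matched.append(token)
    return matched
```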
…tools to tighten box around tokens
cmwilhelm left a comment
TIMO interface looks good to me. I'll defer to @kyleclo on the particulars of the mmda predictor since there's a fair amount of logic on using the vila info heuristically.
Just a couple requested changes:
- Let's give `original_boxes` a more specific name, since it's at the top-level document namespace.
- We probably shouldn't depend on `vila_predictors` for your model. From what I can tell that's just there to allow the integration test to process the PDF before handing it to your new predictor. I think we want to do that offline and generate a fixture JSON file instead.
ai2_internal/config.yaml (Outdated)
```yaml
python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
```
This should be true, right?
(This doesn't commit us to using GPU, but it builds the image such that it can be.)
```yaml
# One or more bash commands to execute as part of a RUN step in a Dockerfile AFTER extras require.
# Leave this unset unless your model has special system requirements beyond
# those in your setup.py.
docker_run_commands: [ "apt-get update && apt-get install -y poppler-utils libgl1",
```
A comment here about why we're manually installing Python deps instead of relying on the setup.py would be helpful, re: the weirdness in detectron installation order.
ai2_internal/config.yaml (Outdated)
```yaml
# Any additional sets of dependencies required by the model.
# These are the 'extras_require' keys in your setup.py.
extras_require: [ "bibentry_detection_predictor", "vila_predictors" ]
```
This shouldn't require `vila_predictors`, right?
yes you're right - I'll change the test as you suggested!
```
@@ -0,0 +1,53 @@
"""
```
This is a stray file that should be removed, right? From before settling on a name for the predictor?
yep, that's how it got here, my bad!
```python
doc.annotate_images(page_images)

ivilaA = IVILATokenClassificationPredictor.from_pretrained(
```
If possible, I think it would be better to pre-process the PDF with PDFPlumber and particularly IVILA: doing it at runtime here requires making your model depend on IVILA's pip deps etc., and incurs a longer runtime for the integration tests to download these artifacts from HF.
We could take the PDF, pass it through those two models offline, and save the output JSON into the fixtures dir here?
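Something like this one-off script could generate the fixture (a sketch only; the exact mmda import paths, the HF model name, the fixture paths, and the `to_json` keyword are assumptions to be checked against the repo):

```python
import json

from mmda.parsers.pdfplumber_parser import PDFPlumberParser
from mmda.rasterizers.rasterizer import PDF2ImageRasterizer
from mmda.predictors.hf_predictors.vila_predictor import IVILATokenClassificationPredictor

pdf_path = "fixtures/test.pdf"  # placeholder path

# Parse the PDF once, attach page images, and run IVILA offline.
doc = PDFPlumberParser().parse(input_pdf_path=pdf_path)
doc.annotate_images(PDF2ImageRasterizer().rasterize(input_pdf_path=pdf_path, dpi=72))

ivila = IVILATokenClassificationPredictor.from_pretrained(
    "allenai/ivila-block-layoutlm-finetuned-docbank"  # illustrative model name
)
doc.annotate(vila_span_groups=ivila.predict(doc))

# Save the fully annotated Document so the integration test can just load it.
with open("fixtures/preprocessed_doc.json", "w") as f:
    json.dump(doc.to_json(with_images=True), f)
```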
```python
predictions = container.predict_batch(instances)

for bib_entry in predictions[0].bib_entries:
```
Are there any other assertions we want to make beyond the types of the data? Maybe even the number of bib entries found would be a good smoke test.
I thought about making it more specific and decided against it, but I think at least checking that the number of `bib_entries` and `raw_bib_entry_boxes` is the same would be a good addition.
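Something like this (assuming the raw boxes end up exposed as `raw_bib_entry_boxes`, per this thread):

```python
prediction = predictions[0]
# Smoke test: we detected at least one bib entry, and every tightened
# entry has a corresponding raw model-output box.
assert len(prediction.bib_entries) > 0
assert len(prediction.bib_entries) == len(prediction.raw_bib_entry_boxes)
```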
```python
Describes the outcome of inference for one Instance
"""
bib_entries: List[api.SpanGroup]
original_boxes: List[api.BoxGroup]
```
Since these annotations live at the top level of the document, we should probably give this field an explicit name. Maybe `bib_entry_boxes`?
```python
original_boxes: List[api.BoxGroup]


class PredictorConfig(BaseSettings):
```
Nice use of the predictor config 🤘
```python
bib_entries=[api.SpanGroup.from_mmda(sg) for sg in predicted_span_groups_with_boxes],
original_boxes=[api.BoxGroup.from_mmda(bg) for bg in original_box_groups])

return prediction
```
💯
```python
from copy import copy
from functools import reduce

import itertools
```
nit: `itertools` is stdlib, so it should probably be up in one stanza with `typing`, `copy`, and `functools`.
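e.g. (the `typing` names here are placeholders for whatever the module actually imports):

```python
# stdlib imports, grouped in one stanza
import itertools
from copy import copy
from functools import reduce
from typing import Dict, List
```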
ai2_internal/config.yaml (Outdated)
```yaml
python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
```
whoops
Going to make some changes to the output Prediction after the sync today: s2airs doesn't need the text stored here anymore, so I'm just going to save boxes only. (Output will be `bib_entries: List[BoxGroup]` and `original_boxes: List[BoxGroup]`.)
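i.e. roughly this shape (a sketch; assuming the TIMO `Prediction` is a pydantic model, and `original_boxes` may still be renamed per the review above):

```python
from typing import List

from pydantic import BaseModel

from ai2_internal import api


class Prediction(BaseModel):
    """Describes the outcome of inference for one Instance."""
    bib_entries: List[api.BoxGroup]      # boxes tightened around tokens
    original_boxes: List[api.BoxGroup]   # raw model-detected boxes
```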
cmwilhelm left a comment
Thanks for the back and forth, Angele. LGTM to merge after the merge conflict is resolved.
This is my PR for https://github.com/allenai/scholar/issues/33041, step 1: move code to mmda.
In moving the code over I learned I had to figure out how the outputs will be used in SPP / stored in the annotation store. Talking with Kyle, I realized the importance of including the raw model output, which is why a `Prediction` includes `bib_entries` (which contain boxes tightened around the tokens found inside of them, plus text and spans) as well as `original_boxes` (the raw boxes detected by the model).
There are optional config variables: `BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD` (the prediction confidence score used as the threshold for returned predictions) and `BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS` (the minimum number of rows in a bibliography VILA SpanGroup required to qualify as a bibliography section). These can be adjusted if e2e eval shows signals that we should change them.
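A minimal sketch of how those knobs plug into the `PredictorConfig` shown in the diff (default values here are placeholders, not the PR's actual defaults); pydantic's `BaseSettings` reads matching environment variables automatically, so e.g. exporting `BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD=0.95` in the service environment overrides the default without a code change:

```python
from pydantic import BaseSettings


class PredictorConfig(BaseSettings):
    # Prediction confidence threshold below which detections are dropped
    BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD: float = 0.9
    # Minimum rows in a bibliography VILA SpanGroup to count as a bib section
    BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS: int = 2
```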