
Conversation

@geli-gel
Contributor

@geli-gel geli-gel commented Jul 29, 2022

This is my PR for https://github.com/allenai/scholar/issues/33041, step 1: move the code to mmda.

In moving the code over, I had to figure out how the outputs will be used in SPP and stored in the annotation store. After talking with Kyle, I realized the importance of including the raw model output. That is why a Prediction includes bib_entries, which carry boxes tightened around the tokens found inside them (plus text and spans), as well as original_boxes, which are the raw detected boxes from the model output.
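The "tightened" boxes can be thought of as the union of the token boxes inside a detected region. A minimal sketch, assuming boxes are plain (x0, y0, x1, y1) tuples rather than mmda's actual types:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1); an assumption, not mmda's type

def tighten(token_boxes: List[Box]) -> Box:
    """Smallest box enclosing all of the given token boxes."""
    return (
        min(b[0] for b in token_boxes),
        min(b[1] for b in token_boxes),
        max(b[2] for b in token_boxes),
        max(b[3] for b in token_boxes),
    )
```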

There are two optional config variables that can be adjusted if the e2e eval shows signals we should change them:

  1. BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD: the prediction confidence score used as the threshold for returned predictions.
  2. BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS: the minimum number of rows a Bibliography VILA SpanGroup must contain to qualify as a Bibliography section.
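For illustration, environment-variable overrides like these could be read as follows (the fallback values here are placeholders, not the predictor's real defaults):

```python
import os

def load_thresholds() -> dict:
    """Read the two optional overrides from the environment.

    The fallback values below are placeholders for illustration only;
    the real defaults live in the predictor's config.
    """
    return {
        "score_threshold": float(
            os.environ.get("BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD", "0.5")),
        "min_vila_bib_rows": int(
            os.environ.get("BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS", "2")),
    }
```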

@geli-gel geli-gel requested a review from cmwilhelm July 29, 2022 22:16
Comment on lines 64 to 66
# TODO: these start/ends are page token IDs?
start=,
end=,
Contributor Author

@cmwilhelm This is question #1

Comment on lines 101 to 102
# TODO tokens .. ?
self.postprocess(model_outputs, document.tokens[image_index], image_index, image)
Contributor Author

@cmwilhelm This is question #2: how do I learn more about how tokens exist in the Document that will be given as input? I'm assuming I can require that the Document / Instance input contains tokens per page somehow, but I'm struggling to understand how to match those tokens to what my bounding boxes contain. In my other code I use layoutparser's Layout.filter_by: bib_tokens = page_tokens.filter_by(bib_block, center=True)
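The center-point containment test that filter_by(..., center=True) performs is simple enough to sketch without layoutparser; the box shapes below are assumptions, not mmda's or layoutparser's actual types:

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)

def center_inside(token: Box, block: Box) -> bool:
    """True if the token box's center point falls inside the block box."""
    cx = (token[0] + token[2]) / 2
    cy = (token[1] + token[3]) / 2
    return block[0] <= cx <= block[2] and block[1] <= cy <= block[3]

def tokens_in_block(tokens: List[Box], block: Box) -> List[Box]:
    """Keep only the tokens whose centers lie inside the block."""
    return [t for t in tokens if center_inside(t, block)]
```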

@geli-gel geli-gel marked this pull request as ready for review August 22, 2022 22:35
@geli-gel geli-gel requested a review from kyleclo August 22, 2022 22:44
Contributor

@cmwilhelm cmwilhelm left a comment

TIMO interface looks good to me. I'll defer to @kyleclo on the particulars of the mmda predictor since there's a fair amount of logic on using the vila info heuristically.

Just a couple requested changes:

  1. Let's give original_boxes a more specific name since it's at the top-level document namespace.
  2. We probably shouldn't depend on vila_predictors for your model. From what I can tell that's just there to allow the integration test to process the PDF before handing to your new predictor. I think we want to do that offline and generate a fixture JSON file instead.

python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
Contributor

This should be true, right?

Contributor

(this doesn't commit us to using GPU but it builds the image such that it can be)

# One or more bash commands to execute as part of a RUN step in a Dockerfile AFTER extras require.
# Leave this unset unless your model has special system requirements beyond
# those in your setup.py.
docker_run_commands: [ "apt-get update && apt-get install -y poppler-utils libgl1",
Contributor

A comment here about why we're manually installing Python deps instead of relying on the setup.py would be helpful, re: the weirdness of detectron's installation order.


# Any additional sets of dependencies required by the model.
# These are the 'extras_require' keys in your setup.py.
extras_require: [ "bibentry_detection_predictor", "vila_predictors" ]
Contributor

This shouldn't require vila_predictors right?

Contributor Author

Yes, you're right - I'll change the test as you suggested!

@@ -0,0 +1,53 @@
"""
Contributor

This is a stray file that should be removed, right? From before settling on a name for the predictor?

Contributor Author

yep, that's how it got here, my bad!


doc.annotate_images(page_images)

ivilaA = IVILATokenClassificationPredictor.from_pretrained(
Contributor

If possible, I think it would be better to pre-process the PDF with PDFPlumber and especially IVILA. Doing it at runtime here forces your model to depend on ivila's pip deps, and it makes the integration tests take longer while they download these artifacts from HF.

We could take the pdf and pass through those two models offline and save the output JSON into the fixtures dir here?


predictions = container.predict_batch(instances)

for bib_entry in predictions[0].bib_entries:
Contributor

Are there any other assertions we want to make beyond the types of the data? Even the number of bib entries found would be a good smoke test.

Contributor Author

I thought about making it more specific and decided against it, but I think at least checking that the number of bib_entries and raw_bib_entry_boxes is the same would be a good addition.
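That extra check might look something like this; the function name and signature are hypothetical, not the actual test code:

```python
def check_counts(bib_entries: list, raw_boxes: list) -> None:
    """Smoke-test assertions: at least one bib entry was found, and each
    entry pairs with exactly one raw detected box."""
    assert bib_entries, "expected at least one detected bib entry"
    assert len(bib_entries) == len(raw_boxes), \
        "bib entries and raw detected boxes should be 1:1"
```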

Describes the outcome of inference for one Instance
"""
bib_entries: List[api.SpanGroup]
original_boxes: List[api.BoxGroup]
Contributor

Since these annotations live at the top-level of the document, we should probably give this field an explicit name. Maybe bib_entry_boxes?

original_boxes: List[api.BoxGroup]


class PredictorConfig(BaseSettings):
Contributor

Nice use of the predictor config 🤘

bib_entries=[api.SpanGroup.from_mmda(sg) for sg in predicted_span_groups_with_boxes],
original_boxes=[api.BoxGroup.from_mmda(bg) for bg in original_box_groups])

return prediction
Contributor

💯

from copy import copy
from functools import reduce

import itertools
Contributor

nit: itertools is stdlib; it should probably be up in one stanza with typing, copy, and functools.
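The corrected stanza would group the stdlib imports together, e.g.:

```python
# stdlib imports grouped in a single stanza
import itertools
from copy import copy
from functools import reduce
```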

python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
Contributor Author

whoops

@geli-gel
Contributor Author

Going to make some changes to the output Prediction after the sync today - s2airs doesn't need the text stored here anymore, so I'm just going to save boxes only. (Output will be bib_entries: List[BoxGroup] and original_boxes: List[BoxGroup].)
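The slimmed-down Prediction would then carry only boxes. A sketch with a placeholder BoxGroup standing in for mmda's api.BoxGroup (only the field shape is the point here):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class BoxGroup:  # placeholder for mmda's api.BoxGroup
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class Prediction:
    """Boxes only: tightened bib-entry boxes plus the raw detected boxes."""
    bib_entries: List[BoxGroup]
    original_boxes: List[BoxGroup]
```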

Contributor

@cmwilhelm cmwilhelm left a comment

Thanks for the back and forth Angele. LGTM to merge after the merge conflict is resolved.

@geli-gel geli-gel merged commit ea03e45 into main Aug 24, 2022
@geli-gel geli-gel deleted the angelez/bibentries branch August 24, 2022 21:02