Angelez/bibentries #113
Conversation
```python
# TODO: these start/ends are page token IDs?
start=,
end=,
```
@cmwilhelm This is question #1
```python
# TODO tokens .. ?
self.postprocess(model_outputs, document.tokens[image_index], image_index, image)
```
@cmwilhelm This is question #2: how do I learn more about how tokens exist in the Document that will be given as input? I'm assuming I can make it a requirement that the Document / Instance input contains tokens per page somehow, but I'm struggling to understand how to match those tokens to what my bounding boxes contain. In my other code I use layoutparser's `Layout.filter_by`: `bib_tokens = page_tokens.filter_by(bib_block, center=True)`
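For reference, here's a rough sketch of the center-containment check I'd need to replicate against mmda boxes (untested; the helper name, the per-token `spans[0].box` access, and the `l`/`t`/`w`/`h` attributes are my assumptions about the mmda types):

```python
def tokens_inside(bib_box, page_tokens):
    """Keep tokens whose center point falls inside the detected bib box,
    mirroring layoutparser's filter_by(..., center=True)."""
    matched = []
    for token in page_tokens:
        box = token.spans[0].box  # assuming one span with one box per token
        cx, cy = box.l + box.w / 2, box.t + box.h / 2
        if (bib_box.l <= cx <= bib_box.l + bib_box.w
                and bib_box.t <= cy <= bib_box.t + bib_box.h):
            matched.append(token)
    return matched
```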
…tools to tighten box around tokens
cmwilhelm left a comment
TIMO interface looks good to me. I'll defer to @kyleclo on the particulars of the mmda predictor since there's a fair amount of logic on using the vila info heuristically.
Just a couple requested changes:
- Let's give `original_boxes` a more specific name, since it's at the top-level document namespace.
- We probably shouldn't depend on `vila_predictors` for your model. From what I can tell that's just there to allow the integration test to process the PDF before handing it to your new predictor. I think we want to do that offline and generate a fixture JSON file instead.
ai2_internal/config.yaml (Outdated)
```yaml
python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
```
This should be true, right?
(This doesn't commit us to using GPU, but it builds the image such that it can be.)
```yaml
# One or more bash commands to execute as part of a RUN step in a Dockerfile AFTER extras require.
# Leave this unset unless your model has special system requirements beyond
# those in your setup.py.
docker_run_commands: [ "apt-get update && apt-get install -y poppler-utils libgl1",
```
A comment here about why we're manually installing Python deps instead of relying on the setup.py would be helpful, re: the weirdness in detectron installation order.
ai2_internal/config.yaml (Outdated)
```yaml
# Any additional sets of dependencies required by the model.
# These are the 'extras_require' keys in your setup.py.
extras_require: [ "bibentry_detection_predictor", "vila_predictors" ]
```
This shouldn't require `vila_predictors`, right?
yes you're right - I'll change the test as you suggested!
```
@@ -0,0 +1,53 @@
"""
```
This is a stray file that should be removed, right? From before settling on a name for the predictor?
yep, that's how it got here, my bad!
```python
doc.annotate_images(page_images)

ivilaA = IVILATokenClassificationPredictor.from_pretrained(
```
If possible, I think it would be better to pre-process the PDF with PDFPlumber and particularly IVILA: doing it at runtime here requires making your model depend on IVILA's pip deps etc., and incurs a longer runtime for the integration tests to download these artifacts from HF.
We could take the PDF, pass it through those two models offline, and save the output JSON into the fixtures dir here?
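Something like this one-off script could generate the fixture (a sketch only; the exact mmda import paths, the HF model name, the fixture paths, and the `to_json` keyword are assumptions to be checked against the repo):

```python
import json

from mmda.parsers.pdfplumber_parser import PDFPlumberParser
from mmda.rasterizers.rasterizer import PDF2ImageRasterizer
from mmda.predictors.hf_predictors.vila_predictor import IVILATokenClassificationPredictor

pdf_path = "fixtures/test.pdf"  # placeholder path

# Parse the PDF once, attach page images, and run IVILA offline.
doc = PDFPlumberParser().parse(input_pdf_path=pdf_path)
doc.annotate_images(PDF2ImageRasterizer().rasterize(input_pdf_path=pdf_path, dpi=72))

ivila = IVILATokenClassificationPredictor.from_pretrained(
    "allenai/ivila-block-layoutlm-finetuned-docbank"  # illustrative model name
)
doc.annotate(vila_span_groups=ivila.predict(doc))

# Save the fully annotated Document so the integration test can just load it.
with open("fixtures/preprocessed_doc.json", "w") as f:
    json.dump(doc.to_json(with_images=True), f)
```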
```python
predictions = container.predict_batch(instances)

for bib_entry in predictions[0].bib_entries:
```
Are there any other assertions we want to make beyond the types of the data? Maybe even the number of bib entries found would be a good smoke test.
I thought about making it more specific and decided against it, but I think at least checking that the number of `bib_entries` and `raw_bib_entry_boxes` is the same would be a good addition.
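Something like this (assuming the raw boxes end up exposed as `raw_bib_entry_boxes`, per this thread):

```python
prediction = predictions[0]
# Smoke test: we detected at least one bib entry, and every tightened
# entry has a corresponding raw model-output box.
assert len(prediction.bib_entries) > 0
assert len(prediction.bib_entries) == len(prediction.raw_bib_entry_boxes)
```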
```python
Describes the outcome of inference for one Instance
"""
bib_entries: List[api.SpanGroup]
original_boxes: List[api.BoxGroup]
```
Since these annotations live at the top level of the document, we should probably give this field an explicit name. Maybe `bib_entry_boxes`?
```python
original_boxes: List[api.BoxGroup]


class PredictorConfig(BaseSettings):
```
Nice use of the predictor config 🤘
```python
bib_entries=[api.SpanGroup.from_mmda(sg) for sg in predicted_span_groups_with_boxes],
original_boxes=[api.BoxGroup.from_mmda(bg) for bg in original_box_groups])

return prediction
```
💯
```python
from copy import copy
from functools import reduce

import itertools
```
nit: `itertools` is stdlib, so it should probably be up in one stanza with `typing`, `copy`, and `functools`.
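e.g. (the `typing` names here are placeholders for whatever the module actually imports):

```python
# stdlib imports, grouped in one stanza
import itertools
from copy import copy
from functools import reduce
from typing import Dict, List
```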
ai2_internal/config.yaml (Outdated)
```yaml
python_version: "3.8"

# Whether this model supports CUDA GPU acceleration
cuda: false
```
whoops
Going to make some changes to the output Prediction after the sync today: s2airs doesn't need the text stored here anymore, so I'm just going to save boxes only. (Output will be `bib_entries: List[BoxGroup]` and `original_boxes: List[BoxGroup]`.)
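i.e. roughly this shape (a sketch; assuming the TIMO `Prediction` is a pydantic model, and `original_boxes` may still be renamed per the review above):

```python
from typing import List

from pydantic import BaseModel

from ai2_internal import api


class Prediction(BaseModel):
    """Describes the outcome of inference for one Instance."""
    bib_entries: List[api.BoxGroup]      # boxes tightened around tokens
    original_boxes: List[api.BoxGroup]   # raw model-detected boxes
```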
cmwilhelm left a comment
Thanks for the back and forth, Angele. LGTM to merge after the merge conflict is resolved.
This is my PR for https://github.com/allenai/scholar/issues/33041, step 1: move code to mmda.
In moving the code over I learned I had to figure out how the outputs will be used in SPP / stored in the annotation store. Talking with Kyle, I realized the importance of including the raw model output, which is why a `Prediction` includes `bib_entries` (which contain boxes tightened around the tokens found inside of them, plus text and spans) as well as `original_boxes` (the raw boxes detected by the model).
There are optional config variables: `BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD` (the prediction confidence score used as the threshold for returned predictions) and `BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS` (the minimum number of rows in a bibliography VILA SpanGroup required to qualify as a bibliography section). These can be adjusted if e2e eval shows signals that we should change them.
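A minimal sketch of how those knobs plug into the `PredictorConfig` shown in the diff (default values here are placeholders, not the PR's actual defaults); pydantic's `BaseSettings` reads matching environment variables automatically, so e.g. exporting `BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD=0.95` in the service environment overrides the default without a code change:

```python
from pydantic import BaseSettings


class PredictorConfig(BaseSettings):
    # Prediction confidence threshold below which detections are dropped
    BIB_ENTRY_DETECTION_PREDICTOR_SCORE_THRESHOLD: float = 0.9
    # Minimum rows in a bibliography VILA SpanGroup to count as a bib section
    BIB_ENTRY_DETECTION_MIN_VILA_BIB_ROWS: int = 2
```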