Milestones

  • Define a policy around how MMDA will:
    - Represent Images in a Document
    - Serialize/Load Images
    - Integrate with other vision libraries like LayoutParser. For this point, aim for 2 options:
      (1) Indirect integration, where the user runs their vision models outside of MMDA, formats the resulting image data to be compatible with MMDA, then loads it into MMDA to manipulate it there. This suits libraries like LayoutParser that depend on detectron2 and may have environments incompatible with the rest of MMDA.
      (2) Direct integration, where a user runs vision models in the same environment as other MMDA.Predictors. This suits libraries like Hugging Face, which are adding vision models.
    - Include MMDA Image fields, such as Tables/Figures and their associated Captions

    No due date
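One possible shape for the Serialize/Load part of this policy, as a minimal sketch only: the `PageImage` class and its `png_base64` key are hypothetical names invented for illustration and are not part of MMDA. It assumes images are kept as encoded bytes and written as base64 text so they can sit inside a JSON document, which is also how an externally run vision pipeline could hand images over in the indirect-integration option.

```python
import base64
import json
from dataclasses import dataclass


@dataclass
class PageImage:
    """Hypothetical container for one rendered page image (not MMDA's API)."""

    page: int
    png_bytes: bytes

    def to_json(self) -> dict:
        # JSON cannot hold raw bytes, so store them as base64-encoded text.
        encoded = base64.b64encode(self.png_bytes).decode("ascii")
        return {"page": self.page, "png_base64": encoded}

    @classmethod
    def from_json(cls, data: dict) -> "PageImage":
        # Reverse of to_json: decode the base64 text back into bytes.
        return cls(page=data["page"], png_bytes=base64.b64decode(data["png_base64"]))


# Round trip through a JSON string, as a file written by an external tool would be.
original = PageImage(page=0, png_bytes=b"\x89PNG\r\n\x1a\n placeholder payload")
text = json.dumps(original.to_json())
restored = PageImage.from_json(json.loads(text))
```

Base64 inflates the payload by roughly a third, so this is a convenience format rather than an efficient one.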
  • MMDA is currently developed without much consideration for efficiency. Some major refactors could boost performance:
    - Switch to a better serialization data structure than JSON
    - Switch to a better indexing data structure than Interval Trees

    No due date
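The milestone does not say which indexing structure would replace Interval Trees. As one candidate, sketched under the assumption that the indexed spans are half-open `[start, end)` and never overlap each other, two sorted arrays plus binary search can answer overlap queries. `SortedSpanIndex` is a hypothetical name, not an MMDA class.

```python
import bisect


class SortedSpanIndex:
    """Hypothetical overlap index for NON-overlapping half-open spans."""

    def __init__(self, spans):
        # Sorting by start also sorts the ends, because spans do not overlap.
        self._spans = sorted(spans)
        self._starts = [start for start, _ in self._spans]
        self._ends = [end for _, end in self._spans]

    def find(self, start, end):
        """Return stored spans that overlap the query [start, end)."""
        # Skip every span that ends at or before the query start.
        lo = bisect.bisect_right(self._ends, start)
        # Stop before the first span that starts at or after the query end.
        hi = bisect.bisect_left(self._starts, end)
        return self._spans[lo:hi]


index = SortedSpanIndex([(12, 20), (0, 5), (5, 10)])
hits = index.find(4, 6)
gap = index.find(10, 12)
```

Each query costs two binary searches plus the size of the result. The trade-off is that the non-overlap assumption must hold, which an interval tree does not require.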
  • MMDA currently segments Documents into SpanGroups (e.g. entities), but has no natively supported way of representing relations between those units. Currently, relational information is stored explicitly as metadata within the Source and Target units, which is unintuitive and costly.

    No due date
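To illustrate what a native representation could look like, here is a minimal sketch that stores relations as their own objects, indexed in both directions, instead of copying them into the metadata of the Source and Target units. `Relation`, `RelationStore`, and the integer unit ids are all hypothetical and not taken from MMDA.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """Hypothetical directed, labeled link between two units, referenced by id."""

    source_id: int
    target_id: int
    label: str


class RelationStore:
    """Hypothetical store that indexes each relation by source and by target."""

    def __init__(self):
        self._by_source = defaultdict(list)
        self._by_target = defaultdict(list)

    def add(self, relation):
        self._by_source[relation.source_id].append(relation)
        self._by_target[relation.target_id].append(relation)

    def targets_of(self, source_id, label=None):
        # .get avoids creating empty entries for ids that were never added.
        found = self._by_source.get(source_id, [])
        return [r.target_id for r in found if label is None or r.label == label]

    def sources_of(self, target_id, label=None):
        found = self._by_target.get(target_id, [])
        return [r.source_id for r in found if label is None or r.label == label]


store = RelationStore()
store.add(Relation(source_id=7, target_id=2, label="cites"))
store.add(Relation(source_id=9, target_id=2, label="cites"))
store.add(Relation(source_id=7, target_id=4, label="refers_to_figure"))
```

With this layout each relation is stored once, and either endpoint can look it up without the units themselves carrying duplicate metadata.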
  • MMDA needs a fairly major refactor that will break some existing usage. The changes are:
    1. A way of managing namespaces of different fields to allow for overloading (`bib.title` vs `doc.title`)
    2. A way to `.annotate()` on a `span_group` rather than at the Document level, for example, adding titles to bib entries
    3. Making the annotation of `BoxGroup` from `SpanGroup` explicit and defining explicit conversions from one to the other

    No due date
    1/3 issues closed
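Items 1 and 2 can be pictured with a small sketch in which every unit, not only the Document, owns its own field namespace. The `Annotated` class below is hypothetical and only mimics the `.annotate()` idea named above; it does not reproduce MMDA's actual API.

```python
class Annotated:
    """Hypothetical unit that carries its own namespace of named fields."""

    def __init__(self, text):
        self.text = text
        self._fields = {}

    def annotate(self, **fields):
        # Field names only need to be unique within this unit, so a bib entry
        # and the document can both own a field called "title".
        for name, groups in fields.items():
            if name in self._fields:
                raise ValueError(f"field {name!r} is already annotated on this unit")
            self._fields[name] = groups

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        try:
            return self.__dict__["_fields"][name]
        except KeyError:
            raise AttributeError(name) from None


doc = Annotated("full paper text")
bib = Annotated("Smith 2020. A Cited Title.")
doc.annotate(title=[Annotated("Paper Title")], bib_entries=[bib])
bib.annotate(title=[Annotated("A Cited Title")])
```

Here `doc.title` and `doc.bib_entries[0].title` resolve to different annotations without any name clash.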
  • MMDA includes HTML-ified tables, which are only found in CORD19

    No due date
  • MMDA adds additional functionality:
    1. Section-Section hierarchies
    2. Table of contents metadata that links to associated sections
    3. Identification of inline and display formulas
    4. LaTeX representation of identified formulas

    No due date
    0/1 issues closed
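For the Section-Section hierarchies in item 1, a common way to turn a flat list of headings into a tree is a stack of open ancestors. The sketch below assumes each heading already comes with a numeric nesting level (1 for top-level sections); `Section` and `build_hierarchy` are hypothetical names, not MMDA functions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Section:
    """Hypothetical section node: a heading plus its nested subsections."""

    title: str
    level: int
    children: List["Section"] = field(default_factory=list)


def build_hierarchy(headings: List[Tuple[int, str]]) -> List[Section]:
    """Nest (level, title) pairs, given in reading order, into a tree."""
    roots: List[Section] = []
    stack: List[Section] = []  # currently open ancestors, deepest last
    for level, title in headings:
        node = Section(title=title, level=level)
        # Close every open section that is at the same depth or deeper.
        while stack and stack[-1].level >= level:
            stack.pop()
        if stack:
            stack[-1].children.append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


tree = build_hierarchy([
    (1, "Introduction"),
    (1, "Method"),
    (2, "Data"),
    (2, "Model"),
    (3, "Encoder"),
    (1, "Results"),
])
```

The same tree could back the table of contents metadata in item 2, since each node already knows its parent chain through the nesting.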
  • MMDA contains enough functionality to reproduce S2ORC PDF Parse JSONs. It contains:
    1. Citation mentions in context linked to bibliography entries that are separated from body text and parsed
    2. Identification of inline references to floating elements (e.g. tables/figures/sections/footnotes) which are pulled out of the main body text
    3. Identification of captions which are pulled out of main body text and associated with the corresponding table/figure
    4. Identification of section headings with appropriate association of body text to the section

    Overdue by 3 year(s)
    Due by September 30, 2022
    0/5 issues closed