Entity Localization Bug: Missing Sentences in many papers. #188

@andrewhead

Description: This issue consolidates our analysis of which types of sentences are currently not detected in the papers we have processed.

27 papers were inspected (see the list at the bottom of this post).

Of these papers, the pipeline failed to find sentences in the following situations:

  • Title of the paper (1702.01287v1, 1906.01502v1, 1802.07740v2, 1901.10159v1, 1906.04604v1)
  • Part or all of the authors text (1802.07740v2, 1901.10159v1)
  • Section header (1701.07481v3, 1702.01287v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2 (not just abstract header here), 1802.07740v2, 1806.02371v1 (abstract and other headers), 1901.10159v1, 1709.07902v1 (not just abstract here), 1711.08028v4, 1905.10887v2 (not just abstract))
  • First sentence in abstract (1805.08660v1, 1903.00621v1)
  • First sentence in section (1701.07481v3, 1702.01287v1, 1705.06566v2, 1901.10159v1, 1709.07902v1)
  • Figure caption (1701.07481v3, 1702.01287v1, 1701.02810v2, 1805.08660v1, 1906.01502v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2, 1802.07740v2, 1811.12359v4, 1709.07902v1, 1711.08028v4, 1905.10887v2)
  • Subfigure caption (1702.01287v1, 1908.00300v1, 1903.00621v1, 1707.00683v3, 1811.12359v4, 1709.07902v1, 1711.08028v4)
  • Display equation (1702.01287v1, 1901.10159v1, 1709.07902v1)
  • Table header (1701.02810v2, 1805.08660v1, 1906.01502v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2, 1806.02371v1, 1707.00683v3, 1811.12359v4, 1709.07902v1, 1905.10887v2)
  • Row of result table (1701.07481v3, 1702.01287v1, 1805.08660v1, 1906.01502v1, 1905.05475v2, 1903.00621v1, 1707.00683v3, 1811.12359v4, 1709.07902v1)
  • Footnote (1702.01287v1, 1906.01502v1, 1905.05475v2, 1705.06566v2, 1901.10159v1 (first sentence), 1811.12359v4, 1709.07902v1)
  • Body text sentence (1701.07481v3, 1805.08660v1, 1906.00414v2 (many), 1906.01502v1, 1908.00300v1, 1806.02371v1, 1901.10159v1, 1811.12359v4 (many), 1709.07902v1 (a lot, in appendix), 1711.08028v4 (many), 1906.04604v1 (many), 1905.10887v2 (many))
  • Missing single word in sentence (1707.00683v3: "MODERN", possibly a glossary term; also "raw ResNet" at the end of the caption of Fig. 6a)
  • Line in algorithm (1805.08660v1, 1901.10159v1)
  • Start of theorem (1901.10159v1)
  • Everything (1806.09231v2)

To observe these issues, use the trick described in #187 (comment) to highlight all instances of detected sentences.

Here are additional notes about the situations of missing sentences described above:

  • Section header: The missing section header is often the abstract header (which I suspect is never written explicitly in the TeX files, but rather appears in the style files)
  • Row of result table: The missing row is most commonly the first row in the table
  • Table header / Row of result table: In some tables, part of a header cell is detected while another part is not. I believe this is because the sentence splitter splits on "\\" (line breaks), which are often used in LaTeX table cells to wrap the text
  • Figure caption: The missing sentence is most commonly the first sentence in the caption
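
The table-cell hypothesis above can be illustrated with a small sketch. Everything here is an assumption for illustration: `naive_split` is a hypothetical stand-in that treats a LaTeX line break as a boundary (it is not the pipeline's actual splitter), `join_linebreaks` is one possible preprocessing step, and the header cell is made up.

```python
import re

# A LaTeX line break: two backslashes, optionally followed by a length
# argument such as [2pt].
LINEBREAK = re.compile(r"\\\\(?:\[[^\]]*\])?")


def naive_split(cell: str) -> list:
    """Hypothetical splitter that treats a LaTeX line break as a boundary."""
    return [part.strip() for part in LINEBREAK.split(cell) if part.strip()]


def join_linebreaks(cell: str) -> str:
    """Replace line breaks with a space so a wrapped cell stays one unit."""
    return re.sub(r"\s+", " ", LINEBREAK.sub(" ", cell)).strip()


# Made-up header cell that wraps its text with a line break.
header = r"Top-1 \\ accuracy (\%)"

fragments = naive_split(header)      # the cell is broken into two pieces
joined = join_linebreaks(header)     # the cell survives as one piece
print(fragments)
print(joined)
```

If the real splitter behaves like `naive_split`, each fragment would be localized separately, which would be consistent with only part of a header cell being detected.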

How to fix: I suspect that the sentence splitter is getting the boundaries between headers and the rest of the content wrong. We should start by collecting examples from these papers, passing them through the sentence detector, and checking whether the sentence boundaries match our expectations.
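
One way to run that check is a small harness like the sketch below. It is a sketch under stated assumptions, not the pipeline's code: `placeholder_detector` is a hypothetical regex-based stand-in for the real sentence detector, `missing_sentences` is an illustrative helper, and the example text is made up. The real detector would be swapped in for the `detect` argument.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets


def placeholder_detector(text: str) -> List[Span]:
    """Stand-in for the real sentence detector (an assumption).

    Treats every run of text ending in '.', '!' or '?' as one sentence.
    """
    return [m.span() for m in re.finditer(r"\S[^.!?]*[.!?]", text)]


def missing_sentences(
    text: str, expected: List[str], detect: Callable[[str], List[Span]]
) -> List[str]:
    """Return the expected sentences whose exact span was not detected."""
    detected = set(detect(text))
    missing = []
    for sentence in expected:
        # Uses the first occurrence only; repeated sentences need more care.
        start = text.find(sentence)
        if start == -1 or (start, start + len(sentence)) not in detected:
            missing.append(sentence)
    return missing


# Made-up example: a section header followed by two body sentences.
example = "Introduction\nWe study parsing. It is hard."
expected = ["Introduction", "We study parsing.", "It is hard."]

report = missing_sentences(example, expected, placeholder_detector)
print(report)
```

With this placeholder, the header and the first sentence are merged into one span, so both are reported as missing. That is the kind of header-boundary error suspected above, though whether the real detector fails the same way still has to be checked against examples from the papers.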

High-priority papers to fix, due to the severity of the problem: 1906.00414v2

The papers inspected include:

  • 1701.07481v3
  • 1702.01287v1
  • 1701.02810v2
  • 1805.08660v1
  • 1906.00414v2
  • 1906.01502v1
  • 1805.08092v1
  • 1908.00300v1
  • 1905.05475v2
  • 1706.08482v1
  • 1704.05838v1
  • 1804.08286v1
  • 1712.05773v2
  • 1903.00621v1
  • 1902.03680v3
  • 1706.03850v3
  • 1705.06566v2
  • 1802.07740v2
  • 1806.02371v1
  • 1901.10159v1
  • 1811.12359v4
  • 1707.00683v3
  • 1709.07902v1
  • 1711.08028v4
  • 1806.09231v2
  • 1906.04604v1
  • 1905.10887v2

Metadata

    Labels

    bug (Something isn't working), entity-localization (An issue or task related to entity localization), missing-entity-detection (An issue or task related to entities that weren't detected), sentences (An issue or task related to sentences)
