Description: This issue consolidates our analysis of which types of sentences the pipeline currently fails to detect in the papers we have processed.
27 papers were inspected (see the list at the bottom of this post).
Of these papers, the pipeline failed to find sentences in the following situations:
- Title of the paper (1702.01287v1, 1906.01502v1, 1802.07740v2, 1901.10159v1, 1906.04604v1)
- Part or all of the authors text (1802.07740v2, 1901.10159v1)
- Section header (1701.07481v3, 1702.01287v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2 (not just the abstract header), 1802.07740v2, 1806.02371v1 (abstract and other headers), 1901.10159v1, 1709.07902v1 (not just the abstract header), 1711.08028v4, 1905.10887v2 (not just the abstract header))
- First sentence in abstract (1805.08660v1, 1903.00621v1)
- First sentence in section (1701.07481v3, 1702.01287v1, 1705.06566v2, 1901.10159v1, 1709.07902v1)
- Figure caption (1701.07481v3, 1702.01287v1, 1701.02810v2, 1805.08660v1, 1906.01502v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2, 1802.07740v2, 1811.12359v4, 1709.07902v1, 1711.08028v4, 1905.10887v2)
- Subfigure caption (1702.01287v1, 1908.00300v1, 1903.00621v1, 1707.00683v3, 1811.12359v4, 1709.07902v1, 1711.08028v4)
- Display equation (1702.01287v1, 1901.10159v1, 1709.07902v1)
- Table header (1701.02810v2, 1805.08660v1, 1906.01502v1, 1908.00300v1, 1905.05475v2, 1706.08482v1, 1903.00621v1, 1705.06566v2, 1806.02371v1, 1707.00683v3, 1811.12359v4, 1709.07902v1, 1905.10887v2)
- Row of result table (1701.07481v3, 1702.01287v1, 1805.08660v1, 1906.01502v1, 1905.05475v2, 1903.00621v1, 1707.00683v3, 1811.12359v4, 1709.07902v1)
- Footnote (1702.01287v1, 1906.01502v1, 1905.05475v2, 1705.06566v2, 1901.10159v1 (first sentence), 1811.12359v4, 1709.07902v1)
- Body text sentence (1701.07481v3, 1805.08660v1, 1906.00414v2 (many), 1906.01502v1, 1908.00300v1, 1806.02371v1, 1901.10159v1, 1811.12359v4 (many), 1709.07902v1 (a lot, in the appendix), 1711.08028v4 (many), 1906.04604v1 (many), 1905.10887v2 (many))
- Missing single word in sentence (1707.00683v3: "MODERN", possibly a glossary term; also "raw ResNet" at the end of the caption of Fig. 6a)
- Line in algorithm (1805.08660v1, 1901.10159v1)
- Start of theorem (1901.10159v1)
- Everything (1806.09231v2)
To observe these issues, use the trick described in #187 (comment) to highlight all instances of detected sentences.
Here are additional notes about the situations of missing sentences described above:
- Section header: The missing section header is often the abstract header (which I suspect is never written explicitly in the TeX files, but rather appears in the style files)
- Row of result table: The missing row is most commonly the first row in the table
- Table header / Row of result table: In some tables, part of a header cell is detected while another part is not. I believe this is because the sentence splitter splits on "\\" (line breaks), which are often used in LaTeX table cells to wrap the text
- Figure caption: The missing sentence is most commonly the first sentence of the caption
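To make the table-cell note concrete, here is a minimal sketch of one possible mitigation: replacing forced line breaks inside a single cell with spaces before the text reaches the sentence splitter. It assumes the splitter does treat `\\` as a boundary, which is only a suspicion at this point. `normalize_cell_linebreaks` is a hypothetical helper, not something in the pipeline, and it is meant for the contents of one cell (for example the argument of a wrapped header cell), not for a whole table, where `\\` also ends rows.

```python
import re

# Matches a LaTeX forced line break: two backslashes, an optional star,
# and an optional spacing argument such as [2pt].
LINEBREAK = re.compile(r"\\\\\*?(?:\[[^\]]*\])?")


def normalize_cell_linebreaks(cell: str) -> str:
    """Replace forced line breaks inside one table cell with a single space.

    Hypothetical helper for illustration only.
    """
    joined = LINEBREAK.sub(" ", cell)
    # Collapse the whitespace left behind by the substitution.
    return re.sub(r"\s+", " ", joined).strip()


header_cell = r"Top-1 \\ accuracy"
print(normalize_cell_linebreaks(header_cell))  # Top-1 accuracy
```

If the suspicion is right, the normalized cell should reach the splitter as one unit instead of two fragments.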
How to fix: I suspect that the sentence splitter is getting the boundaries between headers and the rest of the content wrong. We should start by collecting examples from these papers, passing them through the sentence detector, and seeing if the sentence boundaries match our expectations.
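The check proposed above could be sketched roughly as follows. This is not the pipeline's sentence detector: `naive_split` is a stand-in regex splitter used only to show the shape of the comparison, and `missing_sentences` and the example text are likewise hypothetical. To run the real check, pass the actual splitter in as the `splitter` argument.

```python
import re
from typing import Callable, List


def naive_split(text: str) -> List[str]:
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def missing_sentences(
    text: str, expected: List[str], splitter: Callable[[str], List[str]]
) -> List[str]:
    """Return the expected sentences that the splitter did not emit verbatim."""
    detected = {s.strip() for s in splitter(text)}
    return [s for s in expected if s.strip() not in detected]


# Hypothetical example: a header with no terminal punctuation, then body text.
example = "Abstract\nWe propose a model. It works well."
expected = ["Abstract", "We propose a model.", "It works well."]

missed = missing_sentences(example, expected, naive_split)
print(missed)  # ['Abstract', 'We propose a model.']
```

With the stand-in splitter, the header is fused with the first sentence of the section, so both are reported as missing. That is the same pattern as the "Section header" and "First sentence in section" failures listed above, though whether the real detector fails for the same reason still needs to be confirmed on examples from the papers.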
High-priority papers to fix, due to the severity of the problem: 1906.00414v2
The papers inspected include:
- 1701.07481v3
- 1702.01287v1
- 1701.02810v2
- 1805.08660v1
- 1906.00414v2
- 1906.01502v1
- 1805.08092v1
- 1908.00300v1
- 1905.05475v2
- 1706.08482v1
- 1704.05838v1
- 1804.08286v1
- 1712.05773v2
- 1903.00621v1
- 1902.03680v3
- 1706.03850v3
- 1705.06566v2
- 1802.07740v2
- 1806.02371v1
- 1901.10159v1
- 1811.12359v4
- 1707.00683v3
- 1709.07902v1
- 1711.08028v4
- 1806.09231v2
- 1906.04604v1
- 1905.10887v2