Releases: ispras/dedoc
v2.6
- Improved the table merge algorithm in MultiPageTableExtractor (added a check on table layout).
- Refactored table merging in MultiPageTableExtractor.
- Improved header and footer analysis in HeaderFooterDetector.
- Added header and footer analysis support in Tabby.
- Added header and footer analysis info (parameter need_header_footer_analysis) to the documentation (readthedocs).
- Updated to python3.10.
- Updated to ubuntu22.04.
- Added Contributing Information.
v2.5
- Added simple multilingual textual layer correctness classification based on letter percentage calculation (textual_layer_classifier=letter).
- Added a new parameter textual_layer_classifier = [simple, ml (default), letter].
- Removed parameter fast_textual_layer_detection. It is now textual_layer_classifier=simple.
- Fixed bug with table_type=table_wo_external_bounds (fixed cv2.BoundingRect).
- Some refactoring of TableRecognition.
- Added parameter table_type and TableRecognition info to the documentation.
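The letter-percentage idea behind textual_layer_classifier=letter can be illustrated with a minimal sketch. This is not dedoc's implementation: the function names, the choice to ignore whitespace, and the 0.5 threshold are all assumptions made for illustration only.

```python
def letter_ratio(text: str) -> float:
    """Share of alphabetic characters among non-whitespace characters."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    # str.isalpha() is Unicode-aware, so Latin, Cyrillic, etc. all count as letters.
    letters = sum(ch.isalpha() for ch in visible)
    return letters / len(visible)


def looks_like_correct_text(text: str, threshold: float = 0.5) -> bool:
    """Hypothetical decision rule: accept the textual layer if enough characters are letters.

    The threshold value is an assumption, not a value taken from dedoc.
    """
    return letter_ratio(text) >= threshold
```

Because the check relies only on Unicode letter classification, it does not depend on a particular language, which is one plausible reading of "multilingual" in the note above.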
v2.4
- Upgrade PyPDF2 to pypdf>4 and fix bugs in attachments extraction from PDF files.
- Added each_page_textual_layer_detection parameter for textual layer detection on each page of PDF documents (for PdfAutoReader).
- Added ENABLE_CANCELLATION env variable for enabling/disabling parsing cancellation after client disconnection (enabled by default).
- Fixed location coordinates of attached images extracted by PdfTabbyReader.
- Added new reader PdfBrokenEncodingReader for PDF documents with textual layer but broken encoding (pdf_with_text_layer=bad_encoding).
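As a rough sketch of how the PDF options named above might be collected into a parameter dictionary: the helper name and its arguments are hypothetical, and encoding the boolean flag as the string "true" is an assumption. Only the parameter names and the bad_encoding value come from the release notes.

```python
from typing import Dict


def build_pdf_parameters(broken_encoding: bool = False,
                         per_page_detection: bool = False) -> Dict[str, str]:
    """Hypothetical helper that assembles PDF-related parsing parameters."""
    params: Dict[str, str] = {}
    if broken_encoding:
        # Value taken from the release note: pdf_with_text_layer=bad_encoding
        params["pdf_with_text_layer"] = "bad_encoding"
    if per_page_detection:
        # Assumption: boolean parameters are passed as the string "true"
        params["each_page_textual_layer_detection"] = "true"
    return params


if __name__ == "__main__":
    print(build_pdf_parameters(broken_encoding=True))
```

The resulting dictionary would then be supplied as the parsing parameters, for example as form fields of an API request; check the documentation for the exact accepted values.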
v2.3.2
v2.3.1
Release note: v2.3.1
- Fix bug with bold lines in DocxReader (see issue 479).
- Upgraded requirements.txt (beautifulsoup4 to 4.12.3 version).
- Added support for external grobid (added support parameter Authorization).
- Added GOST (Russian government standard) frame recognition in PdfTabbyReader (need_gost_frame_analysis parameter).
- Update documentation (added GOST frame recognition).
- Added multi-page table handling to PdfTabbyReader.
v2.3
- Dedoc telegram chat created.
- Added patterns parameter for configuring default structure type.
- Added notebooks with Dedoc usage (see issue 484).
- Fix bug OutOfMemoryError: Java heap space in PdfTabbyReader (see issue 489).
- Fix bug with numbering in DocxReader (see issue 494).
- Added GOST (Russian government standard) frame recognition in PdfImageReader and PdfTxtlayerReader (need_gost_frame_analysis parameter).
v2.2.7
- Fix bugs with start, end of BBoxAnnotation in PdfTabbyReader.
- Improve columns classification and orientation detection for PDF and images (is_one_column_document and document_orientation parameters).
- Upgrade docker: docker-compose is no longer supported, use docker compose instead.
- Fix bug of tables parsing in DocxReader (see issue).
- Added simple textual layer detection in PdfAutoReader (fast_textual_layer_detection parameter).
- Improve paragraph extraction from PDF documents and images.
- Retrain a classifier for diplomas (document_type="diploma") on a new dataset.
v2.2.6
- Upgrade dependencies: numpy<2.0 and dedoc-utils==0.3.7.
v2.2.5
v2.2.4
- Show page division and page numbers in the HTML output representation (API usage, return_format="html").
- Make imports from dedoc library faster.
- Added a tutorial on how to add a new language to dedoc (not finished entirely).
- Added additional page_id metadata for multi-page nodes (structure_type="tree" in API, TreeConstructor in the library).
- Updated OCR and orientation/columns classification benchmarks.
- Minor edits of README.md.
- Fixed empty cells handling in CSVReader.
- Fixed bounding boxes extraction for text in tables for PdfTabbyReader.