Highlight text in documents
txtmarker highlights text in documents. txtmarker takes a list of (name, text) pairs, scans an input document and creates a modified version with highlights embedded.
Currently supported file formats: PDF
The easiest way to install is via pip and PyPI:
pip install txtmarker
Python 3.10+ is supported. Using a Python virtual environment is recommended.
txtmarker can also be installed directly from GitHub to access the latest, unreleased features.
pip install git+https://github.com/neuml/txtmarker
The examples directory has a series of examples and notebooks giving an overview of txtmarker. See the list of notebooks below.
| Notebook | Description |
|---|---|
| Introducing txtmarker | Overview of the functionality provided by txtmarker |
| Highlighting with Transformers | AI-driven highlighting with Transformers |
The following section gives an overview of highlighters and available methods/configuration. See the notebooks above for detailed examples.
Creates a new highlighter instance.
from txtmarker.factory import Factory
highlighter = Factory.create("pdf")

extension: string
Type of highlighter to create (e.g. pdf)

formatter: callable
Formats queries and input text using this method. Helps clean up files with lots of symbols and other content.

chunks: int
Splits queries into multiple chunks. This is designed for very long text matches.
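As an illustration of the formatter option, here is a minimal sketch of a cleanup function. The name `clean` and the helper `create_highlighter` are hypothetical, and the sketch assumes the formatter receives a string and returns a string, and that `formatter` and `chunks` can be passed as keyword arguments; check the txtmarker source before relying on either.

```python
import re

def clean(text):
    """Hypothetical formatter: replaces symbols with spaces and collapses whitespace."""
    # Replace anything that is not a letter, digit or whitespace with a space
    text = re.sub(r"[^A-Za-z0-9\s]", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

def create_highlighter():
    """Builds a PDF highlighter that uses clean() as its formatter."""
    # Imported lazily so clean() can be tried without txtmarker installed
    from txtmarker.factory import Factory

    # Assumption: formatter and chunks are accepted as keyword arguments
    return Factory.create("pdf", formatter=clean, chunks=4)
```

Because the same function is applied to both queries and document text, a query such as `"a  --  b!!"` and document text reading `"a b"` would normalize to the same string under this sketch.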
Extracts page text from infile and returns it as a generator. This enables analysis of the text exactly as it will appear to the highlighter.

highlighter.pages("input.pdf")

infile: string
Full path to input file
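One way to use the extracted text is to check which pages contain a phrase before building highlights. The sketch below is only an illustration: `pages_containing` is a hypothetical helper, and it assumes each item yielded by `pages` is a plain string with one page's text.

```python
def pages_containing(pages, phrase):
    """Returns 0-based indexes of pages whose text contains phrase (case-insensitive)."""
    phrase = phrase.lower()
    return [index for index, text in enumerate(pages) if phrase in text.lower()]

# With txtmarker installed, the generator can be passed in directly:
# pages_containing(highlighter.pages("input.pdf"), "text to highlight")
```

The helper accepts any iterable of strings, so it consumes the generator lazily and never holds the full document in memory.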
Highlights the input file using the provided annotations. The annotated file is saved to outfile.

highlighter.highlight("input.pdf", "output.pdf", [("name", "text to highlight")])

infile: string
Full path to input file

outfile: string
Full path to output file, i.e. the highlighted file

highlights: list of (string, string|regex)
List of highlight elements. Each pair has a name (can be None) and a text value. The text can be either a string or a regular expression. When matching a literal string, escape any regular expression special characters first (i.e. call re.escape).
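The escaping advice can be sketched as a small helper that builds the highlights list. `build_highlights` is a hypothetical name, and the sketch assumes regular expressions are supplied as pattern strings; the example values are made up for illustration.

```python
import re

def build_highlights(literals, patterns=()):
    """Builds a highlights list from literal strings and regular expression patterns."""
    # Literal text is escaped so characters like $ . ( ) are matched verbatim
    highlights = [(name, re.escape(text)) for name, text in literals]
    # Patterns are passed through unchanged
    highlights.extend(patterns)
    return highlights

highlights = build_highlights(
    [("price", "$9.99 (USD)")],
    [("year", r"\b(19|20)\d{2}\b")],
)

# With txtmarker installed:
# highlighter.highlight("input.pdf", "output.pdf", highlights)
```

Without escaping, the literal `$9.99 (USD)` would be read as a pattern in which `$` anchors the end of the text and the parentheses form a group, so it would not match itself.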