A natural language processing pipeline for analyzing works of fiction, including entity detection, quotation attribution, and character relationship analysis.
- Python 3.9 or higher
- pip (Python package installer)
- Virtual environment (recommended)
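
The prerequisites above can be checked from Python before installing anything. This is a minimal sketch, not part of the repository: the helper names `meets_minimum` and `in_virtualenv` are hypothetical, and the only fact taken from this README is the 3.9 minimum.

```python
import sys

# Minimum interpreter version, taken from the prerequisites list.
MIN_VERSION = (3, 9)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if the given (or running) interpreter is at least `minimum`."""
    current = tuple((version_info or sys.version_info)[:2])
    return current >= minimum


def in_virtualenv():
    """Return True when running inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment and
    # sys.base_prefix at the interpreter it was created from.
    return sys.prefix != sys.base_prefix


print("Python version OK:", meets_minimum())
print("Virtual environment active:", in_virtualenv())
```

Running this inside the activated environment should report both checks as `True`; outside a virtual environment the second check reports `False`.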
- Clone the repository:
    git clone https://github.com/yourusername/booknlp.git
    cd booknlp
- Create and activate a virtual environment:
    python -m venv venv
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
- Install the required packages:
    pip install --upgrade pip
    pip install -r requirements.txt
- Install the spaCy model:
    python -m spacy download en_core_web_sm
- Make sure your virtual environment is activated:
    source venv/bin/activate  # On Windows, use: venv\Scripts\activate
- Run BookNLP on a text file:
    ./run_booknlp.py input_file.txt --output-dir output/directory
  - input_file: The text file to process (required)
  - --output-dir: Directory where output files will be saved (default: 'output')
  - --model: Model size to use - 'big' or 'small' (default: 'small')
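
The documented arguments map naturally onto Python's `argparse`. The sketch below is an assumption about how such an interface could be declared, not the actual contents of `run_booknlp.py`; only the argument names, choices, and defaults come from this README, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser():
    """Build a parser mirroring the arguments documented in this README."""
    parser = argparse.ArgumentParser(
        description="Run BookNLP on a text file (illustrative sketch)"
    )
    # Required positional argument: the text file to process.
    parser.add_argument("input_file", help="The text file to process")
    # Optional output directory, defaulting to 'output'.
    parser.add_argument(
        "--output-dir",
        default="output",
        help="Directory where output files will be saved",
    )
    # Model size restricted to the two documented values.
    parser.add_argument(
        "--model",
        choices=["big", "small"],
        default="small",
        help="Model size to use",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["input_file.txt", "--output-dir", "output/directory"])
print(args.input_file, args.output_dir, args.model)
```

Under this sketch, omitting `--model` falls back to `small`, and any value other than `big` or `small` is rejected by `argparse` before the pipeline would start.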
The pipeline generates several output files in the specified output directory:
- {book_id}.tokens: Word-level information including:
  - Paragraph and sentence IDs
  - Word forms and lemmas
  - Part-of-speech tags
  - Dependency relations
  - Event annotations
- {book_id}.entities: Named entity information including:
  - Entity types
  - Coreference IDs
  - Text spans
- {book_id}.quotes: Quotation information including:
  - Quoted text
  - Speaker attribution
  - Coreference information
- {book_id}.supersense: Semantic categories for words:
  - Verb categories
  - Noun categories
- {book_id}.event: Event annotations including:
  - Event types
  - Participants
  - Temporal information
- {book_id}.book: JSON file containing:
  - Character information
  - Relationships
  - Actions
  - Attributes
- {book_id}.book.html: Interactive HTML visualization of the text with:
  - Entity annotations
  - Character relationships
  - Interactive features
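
After a run, it can be useful to confirm that every file listed above was produced. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the file suffixes are taken from the list above, and how `book_id` is derived from the input file is not specified here, so it is passed in explicitly. Only the `.book` file is loaded, since it is the one whose format (JSON) this README states.

```python
import json
from pathlib import Path

# File suffixes as listed in this README.
OUTPUT_SUFFIXES = [
    "tokens", "entities", "quotes", "supersense", "event", "book", "book.html",
]


def expected_outputs(output_dir, book_id):
    """Map each documented suffix to the path where it is expected."""
    base = Path(output_dir)
    return {suffix: base / f"{book_id}.{suffix}" for suffix in OUTPUT_SUFFIXES}


def missing_outputs(output_dir, book_id):
    """Return the suffixes whose files are not present in output_dir."""
    paths = expected_outputs(output_dir, book_id)
    return [suffix for suffix, path in paths.items() if not path.exists()]


def load_book_json(output_dir, book_id):
    """Load the {book_id}.book JSON file and return the parsed object."""
    path = expected_outputs(output_dir, book_id)["book"]
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

For example, `missing_outputs("output/emma", "emma")` returns an empty list when all seven files exist, assuming the run used `emma` as its book ID.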
# Process a text file named "emma.txt" with maximum accuracy
./run_booknlp.py emma.txt --output-dir output/emma --model big
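
To process several books the same way, the command above can be assembled programmatically. This is a hedged sketch: `build_command` is a hypothetical helper, and naming the output directory after the input file's stem is an assumption modeled on the `emma.txt` example, not a documented behavior of the script.

```python
from pathlib import Path

VALID_MODELS = ("big", "small")  # the two values documented for --model


def build_command(input_path, output_root="output", model="small"):
    """Return the argument list for one run_booknlp.py invocation."""
    if model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {model!r}")
    input_path = Path(input_path)
    # Assumption: one output directory per book, named after the file stem.
    output_dir = Path(output_root) / input_path.stem
    return [
        "./run_booknlp.py",
        input_path.as_posix(),
        "--output-dir",
        output_dir.as_posix(),
        "--model",
        model,
    ]


# Print the commands for a small batch without executing them.
for name in ["emma.txt", "persuasion.txt"]:
    print(" ".join(build_command(name, model="big")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the run.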