A powerful, configurable text extraction system for building knowledge graphs
Built on top of langextract with advanced configuration management and visualization capabilities
Quick Start • Documentation • Examples • Contributing
- Strategy-based extraction with YAML configuration files
- Multi-dimensional granularity control (breadth, depth, confidence, context scope)
- Dynamic prompt generation using Jinja2 templates
- Few-shot learning with customizable examples
- Real-time node visualization with pyvis integration
- Pre-import data quality checks before Neo4j insertion
- Multi-strategy comparison views for analysis
- Customizable styling for different node types
- Neo4j-ready data formatting with automatic Cypher generation
- Flexible schema support for various entity types
- Relationship mapping with configurable properties
- Batch import capabilities with MERGE statements
- Type-safe configuration with Pydantic models
- Comprehensive logging and debugging support
- Extensible architecture for custom strategies
- Rich examples and documentation
- Python 3.11+
- OpenAI-compatible API (OpenAI, Azure OpenAI, local models, etc.)
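The granularity dimensions listed among the features (breadth, depth, confidence, context scope) lend themselves to validated settings objects. The following is a minimal sketch using a standard-library dataclass as a stand-in for the Pydantic models the project mentions. The class name `Granularity` and the sets of allowed values are assumptions made for illustration; only "comprehensive", "inferential", "high", "medium" and "document" actually appear in this README.

```python
from dataclasses import dataclass

# Hypothetical allowed values: the real project may accept a different set.
ALLOWED = {
    "breadth": {"focused", "comprehensive"},
    "depth": {"surface", "inferential"},
    "confidence": {"low", "medium", "high"},
    "context_scope": {"sentence", "document"},
}


@dataclass(frozen=True)
class Granularity:
    """Illustrative container for the four granularity dimensions."""

    breadth: str = "comprehensive"
    depth: str = "inferential"
    confidence: str = "high"
    context_scope: str = "document"

    def __post_init__(self) -> None:
        # Reject any value outside the (assumed) allowed set for its field.
        for field_name, allowed in ALLOWED.items():
            value = getattr(self, field_name)
            if value not in allowed:
                raise ValueError(
                    f"{field_name}={value!r} is not one of {sorted(allowed)}"
                )


settings = Granularity(confidence="medium")
print(settings)
```

Failing early on an unknown value keeps a typo in a strategy file from silently changing extraction behavior.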
# Clone the repository
git clone https://github.com/Adoubf/ExtractGraph.git
cd ExtractGraph
# Install with uv (recommended)
uv sync
# Or use pip
pip install -e .
- Configure your API credentials:
cp .env.example .env
# Edit .env with your API settings
- Run a quick extraction:
from src.core.extractor import extractor
text = "Alice is a data scientist at TechCorp. She feels excited about the new AI project."
# Extract with default strategy
result = extractor.extract_for_neo4j(text)
print(f"Found {len(result['neo4j_data']['nodes'])} nodes and {len(result['neo4j_data']['relationships'])} relationships")
- Visualize the results:
from src.core.visual_nodes import visual_nodes
# Generate interactive visualization
html_path = visual_nodes.visualize_text_extraction(
    text=text,
    strategy="literary",
    save_path="output/demo.html"
)
# Open the HTML file in your browser!
ExtractGraph/
├── Strategy Layer        # YAML-based extraction strategies
├── Configuration Layer   # Dynamic prompt generation with Jinja2
├── Granularity Layer     # Multi-dimensional extraction control
├── Visualization Layer   # Interactive node preview and analysis
└── Database Layer        # Neo4j integration with Cypher generation
| Component | Description | Key Features |
|---|---|---|
| ConfigurableExtractor | Main extraction engine | Strategy management, dynamic prompting |
| VisualNodes | Visualization engine | Interactive graphs, comparison views |
| StrategyManager | Configuration management | YAML loading, custom strategy creation |
| CypherGenerator | Database integration | CREATE/MERGE statement generation |
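To make the "CREATE/MERGE statement generation" row concrete, here is a minimal sketch of how MERGE statements could be built from extracted nodes and relationships. It is not the project's `CypherGenerator`: the function names and the dictionary keys (`id`, `label`, `properties`, `source`, `target`, `type`) are assumptions for illustration. For real imports, passing values as query parameters through the Neo4j driver is generally safer than building literals.

```python
import json
import re

_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def _identifier(name: str) -> str:
    """Allow only plain identifiers as labels, types and property keys."""
    if not _IDENTIFIER.fullmatch(name):
        raise ValueError(f"unsafe Cypher identifier: {name!r}")
    return name


def _props_to_cypher(props: dict) -> str:
    """Render simple values (str, int, float, bool) as a Cypher map literal."""
    # json.dumps yields double-quoted, escaped literals for simple values.
    pairs = (f"{_identifier(k)}: {json.dumps(v)}" for k, v in props.items())
    return "{" + ", ".join(pairs) + "}"


def node_merge(node: dict) -> str:
    """Build an idempotent MERGE for one node (assumed keys: id, label, properties)."""
    label = _identifier(node["label"])
    node_id = json.dumps(node["id"])
    stmt = f"MERGE (n:{label} {{id: {node_id}}})"
    props = node.get("properties", {})
    if props:
        stmt += f" SET n += {_props_to_cypher(props)}"
    return stmt


def relationship_merge(rel: dict) -> str:
    """Build a MERGE for one relationship (assumed keys: source, target, type)."""
    src = json.dumps(rel["source"])
    dst = json.dumps(rel["target"])
    rel_type = _identifier(rel["type"])
    return (
        f"MATCH (a {{id: {src}}}), (b {{id: {dst}}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


print(node_merge({"id": "alice", "label": "Person", "properties": {"name": "Alice"}}))
print(relationship_merge({"source": "alice", "target": "techcorp", "type": "WORKS_AT"}))
```

MERGE is used instead of CREATE so that re-running an import does not duplicate nodes that already exist with the same `id`.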
Create custom extraction strategies with YAML configuration:
# strategies/custom_strategy.yaml
name: "scientific_papers"
description: "Extract entities from scientific literature"
entities:
  - "researcher"
  - "institution"
  - "concept"
  - "method"
relations:
  - "affiliated_with"
  - "researches"
  - "cites"
granularity:
  breadth: "comprehensive"
  depth: "inferential"
  confidence: "high"
  context_scope: "document"
# Basic visualization
visual_nodes.visualize_text_extraction(text, strategy="scientific")
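# Custom styling
# Assumption: the VisualNodes class is importable from the same module as visual_nodes
from src.core.visual_nodes import VisualNodes
custom_visual = VisualNodes(
    width="1200px",
    height="800px",
    bgcolor="#f8f9fa"
)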
# Comparison analysis
visual_nodes.create_comparison_view(
    data_list=[result1, result2],
    titles=["Strategy A", "Strategy B"]
)
# Generate Cypher statements
result = extractor.extract_for_neo4j_merge(text, strategy="business")
# Get MERGE statements
nodes_cypher = result['merge_statements']['nodes']
relationships_cypher = result['merge_statements']['relationships']
# Execute in Neo4j
# driver.session().run(nodes_cypher)
# driver.session().run(relationships_cypher)
from src.core.extractor import extractor
from src.core.visual_nodes import visual_nodes
# Multi-step analysis pipeline
text = """
Dr. Sarah Chen, a machine learning researcher at Stanford University,
published groundbreaking work on neural networks. She collaborates
with Dr. Mike Johnson from MIT on deep learning applications.
"""
# 1. Extract with academic strategy
result = extractor.extract_for_neo4j(text, strategy="academic")
# 2. Visualize for quality check
visual_nodes.visualize_neo4j_data(
    result['neo4j_data'],
    title="Academic Knowledge Graph"
)
# 3. Generate database import
cypher_statements = result['cypher_statements']
print("Ready for Neo4j import!")
# Compare different extraction approaches
strategies = ["literary", "business", "academic"]
results = []
for strategy in strategies:
    result = extractor.extract_for_neo4j(text, strategy=strategy)
    results.append(result['neo4j_data'])
# Generate comparison visualization
visual_nodes.create_comparison_view(
    data_list=results,
    titles=[f"Strategy: {s}" for s in strategies],
    save_dir="analysis/strategy_comparison"
)
# Create custom extraction with granular control
result = extractor.extract(
    text=text,
    entities=["person", "organization", "project"],
    relations=["works_at", "collaborates_with"],
    breadth="comprehensive",
    depth="inferential",
    confidence="medium",
    context_scope="document"
)
Interactive visualization of entities and relationships extracted from text.
Specialized extraction for literary texts with character and emotion analysis.
Structured data ready for Neo4j import with organizational relationships.
Compare different extraction strategies side by side.
All visualizations support:
- Drag nodes - Adjust graph layout
- Hover details - View detailed information
- Click highlighting - Highlight connected nodes
- Zoom & pan - Explore large graphs
- Physics layout - Auto-optimize positioning
Experience the full interactive features:
git clone https://github.com/Adoubf/ExtractGraph.git
cd ExtractGraph
uv sync
python -m code_examples.visual_nodes_demo
# Open the generated HTML files in your browser!
- Create strategy file:
# strategies/my_domain.yaml
name: "my_domain"
entities: ["entity1", "entity2"]
relations: ["relation1"]
# ... configuration
- Load and use:
strategy = strategy_manager.load_strategy("my_domain")
result = extractor.extract(text, strategy="my_domain")
# Custom node styles
visual_nodes.node_styles.update({
    'RESEARCHER': {
        'color': '#e74c3c',
        'shape': 'star',
        'size': 30
    }
})
# Custom relationship styles
visual_nodes.edge_styles.update({
    'COLLABORATES_WITH': {
        'color': '#3498db',
        'width': 4
    }
})
- Efficient processing: Optimized for large documents with configurable chunking
- Memory management: Streaming processing for large datasets
- Parallel extraction: Multi-strategy concurrent processing
- Caching: Built-in result caching for repeated analyses
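The chunking and concurrent multi-strategy points above can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the project's implementation: `chunk_text`, `run_strategies` and `fake_extract` are hypothetical names, and `fake_extract` is a stand-in for a real call such as `extractor.extract_for_neo4j(text, strategy=...)`.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Split text into windows of `size` characters that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining tail is already covered by the previous window.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def run_strategies(extract, text: str, strategies: list[str], max_workers: int = 4) -> dict:
    """Call extract(text, strategy=s) for every strategy concurrently."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {s: pool.submit(extract, text, strategy=s) for s in strategies}
        # .result() re-raises any exception from the worker thread.
        return {s: f.result() for s, f in futures.items()}


def fake_extract(text: str, strategy: str = "default") -> dict:
    """Stand-in extractor so the sketch runs without an API key."""
    return {"strategy": strategy, "length": len(text)}


print(chunk_text("abcdefghij", size=4, overlap=1))
print(run_strategies(fake_extract, "hello", ["literary", "business"]))
```

Threads fit here because extraction calls are typically I/O-bound (waiting on an LLM API). Whether the real extractor is safe to call from several threads at once is an assumption that should be checked before relying on this pattern.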
We welcome contributions! Here's how to get started:
- Fork the repository
- Create a feature branch: git checkout -b feature/amazing-feature
- Make your changes and add tests
- Run the test suite: python -m pytest
- Submit a pull request
# Clone and setup development environment
git clone https://github.com/Adoubf/ExtractGraph.git
cd ExtractGraph
uv sync --dev
# Run tests
python -m pytest tests/
# Run examples
python -m code_examples.configurable_extraction_demo
python -m code_examples.visual_nodes_demo
This project is licensed under the MIT License; see the LICENSE file for details.
- langextract - The powerful extraction engine that powers this project
- pyvis - Interactive network visualization
- Neo4j - Graph database platform
- Pydantic - Data validation and settings management
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: [email protected]
⭐ Star this repository if you find it helpful!
Made with ❤️ by Haoyue