"ChatMT - Egghia, parliamo materano!" - LangGraph-based dialect preservation chatbot with multi-agent cultural intelligence

andmon97/ChatMT

🏛️ ChatMT - Materano Dialect Chatbot

An intelligent chatbot that translates and teaches the Materano dialect, preserving the linguistic and cultural heritage of Matera.

ChatMT = Chat + MT (province of Matera) - Your personal assistant for the Materano dialect!

📋 Description

The project aims to build a conversational assistant that:

  • Translates between Italian and the Materano dialect
  • Teaches the grammar and pronunciation of the dialect
  • Provides cultural and historical context
  • Shares traditions and anecdotes from the Sassi

🏗️ Architecture

Phase 1 - MVP (Basic Dictionary)

User Input → LangGraph Workflow → Response
    ↓
[Preprocessing] → [Dictionary Lookup] → [Response Formatting]
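
The three Phase 1 stages can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the project's code: `TranslationState`, `SAMPLE_DICTIONARY`, and all function names are hypothetical, and in the real workflow each stage would presumably be registered as a LangGraph node rather than called in a loop.

```python
import unicodedata
from dataclasses import dataclass
from typing import Optional


def _nfc(text: str) -> str:
    """Normalize accented characters so lookups compare consistently."""
    return unicodedata.normalize("NFC", text)


# Hypothetical one-entry dictionary; real data would come from the extracted PDF.
SAMPLE_DICTIONARY = {_nfc("abbinì"): "hai da venire, devi venire"}


@dataclass
class TranslationState:
    """State handed from one stage to the next (an assumed shape)."""
    user_input: str
    normalized: str = ""
    translation: Optional[str] = None
    response: str = ""


def preprocess(state: TranslationState) -> TranslationState:
    # Trim whitespace and lowercase so the input matches dictionary keys.
    state.normalized = _nfc(state.user_input).strip().lower()
    return state


def dictionary_lookup(state: TranslationState) -> TranslationState:
    # Exact-match lookup; None means the term is not in the dictionary.
    state.translation = SAMPLE_DICTIONARY.get(state.normalized)
    return state


def format_response(state: TranslationState) -> TranslationState:
    if state.translation is None:
        state.response = f"No entry found for '{state.normalized}'."
    else:
        state.response = f"{state.normalized} → {state.translation}"
    return state


def run_pipeline(text: str) -> str:
    """Run the three stages in the order shown in the diagram."""
    state = TranslationState(user_input=text)
    for stage in (preprocess, dictionary_lookup, format_response):
        state = stage(state)
    return state.response
```

Keeping every stage as a function from state to state means the same functions could later be wired into a graph without changing their bodies.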

Phase 2 - Multi-Agent System (Future)

User Input → Chat Manager → Specialized Agents → Coordinated Response
                   ↓
    ┌─ Materano Translator (LangGraph)
    ├─ Cultural Storyteller
    ├─ Matera Tour Guide
    └─ Dialect Teacher
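
As a rough illustration of the Chat Manager's role, the sketch below routes a message to one of four stub agents by keyword. Everything here is hypothetical: the keyword lists, the stub agents, and the fallback to the translator are assumptions made for the example, and a real orchestrator would more likely use an LLM-based intent classifier than substring matching.

```python
from typing import Callable, List, Tuple

Agent = Callable[[str], str]


# Stub agents: each one only tags the message so the routing is visible.
def translator(message: str) -> str:
    return f"[translator] {message}"


def storyteller(message: str) -> str:
    return f"[storyteller] {message}"


def guide(message: str) -> str:
    return f"[guide] {message}"


def teacher(message: str) -> str:
    return f"[teacher] {message}"


# Hypothetical keyword table, checked in order; the first match wins.
ROUTES: List[Tuple[Tuple[str, ...], Agent]] = [
    (("translate", "how do you say"), translator),
    (("story", "tradition", "anecdote"), storyteller),
    (("visit", "tour"), guide),
    (("grammar", "pronounce", "lesson"), teacher),
]


def chat_manager(message: str) -> str:
    """Dispatch the message to the first agent whose keywords match."""
    lowered = message.lower()
    for keywords, agent in ROUTES:
        if any(keyword in lowered for keyword in keywords):
            return agent(message)
    # Assumed default: treat anything unrecognized as a translation request.
    return translator(message)
```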

📚 Resources

📖 Primary Resources (Phase 1)

  • Materano Dictionary (Antonio D'Ercole - "Voci di Sassi")
    • 80 pages, PDF
    • Format: Materano → Italian
    • Usage examples and context
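
Because the dictionary runs in one direction only (Materano → Italian), translating from Italian needs a reverse index. The sketch below shows one way to build it. It assumes that a translation string lists alternative glosses separated by commas, as in the sample record "hai da venire, devi venire"; the function name is hypothetical, and the actual PDF layout may call for different parsing.

```python
from collections import defaultdict
from typing import Dict, List


def build_reverse_index(materano_to_italian: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each Italian gloss to the Materano terms that translate to it."""
    reverse: Dict[str, List[str]] = defaultdict(list)
    for term, glosses in materano_to_italian.items():
        # Assumption: commas separate alternative Italian glosses.
        for gloss in glosses.split(","):
            key = gloss.strip().lower()
            if key and term not in reverse[key]:
                reverse[key].append(term)
    return dict(reverse)
```

Several dialect terms can share one Italian gloss, which is why each key maps to a list rather than a single term.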

🌐 WikiMatera Resources (Phases 2-3)

Lexical Resources

Grammar Resources

Cultural Resources

Main Page

📚 Additional Resources (Phase 4)

  • SassiTour: articles on typical expressions
  • Wikipedia: detailed linguistic analysis
  • Angelo Sarra: "Dizionario 'Na chedd' di parole in disuso" (a dictionary of obsolete words, with audio CD)

🛠️ Technology Stack

Core Technologies

  • Python 3.12+
  • LangGraph - Workflow orchestration
  • LangChain - LLM integration
  • Ollama - Local LLM models
  • OpenAI API - Cloud LLM models

Data Processing

  • pdfplumber - PDF text extraction
  • BeautifulSoup4 - Web scraping WikiMatera
  • pandas - Data manipulation
  • Redis - Vector database and caching
  • redis-py - Redis Python client

Web Framework (Future)

  • Streamlit or Gradio - Chat interface
  • FastAPI - Backend API

🗄️ Data Storage Strategy

Redis Vector Database

# Redis data structure for dialect terms
KEY: "term:materano:{term_id}"
VALUE: {
    "materano_term": "abbinì",
    "italian_translation": "hai da venire, devi venire", 
    "category": "verbi",
    "examples": ["abbinì appess a miéì"],
    "cultural_notes": "formula di invito tipica",
    "source": "dizionario_dercole",
    "embedding": [0.1, 0.2, ...],  # Vector for semantic search
    "created_at": "2025-01-01T00:00:00Z"
}

# Search indexes
SET: "terms:by_category:{category}" → {term_id1, term_id2, ...}
SET: "terms:by_source:{source}" → {term_id1, term_id2, ...}
ZSET: "terms:by_popularity" → term_id (score = usage_count)
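
The key layout above can be turned into concrete Redis commands along these lines. This is a hedged sketch: the helper names and `plan_writes` are hypothetical, the record is stored as a JSON string purely for illustration, and the `embedding` field is left out because vector search would likely need a dedicated index structure rather than a plain string value.

```python
import json
from typing import Any, Dict, List, Tuple

POPULARITY_KEY = "terms:by_popularity"


def term_key(term_id: str) -> str:
    return f"term:materano:{term_id}"


def category_index_key(category: str) -> str:
    return f"terms:by_category:{category}"


def source_index_key(source: str) -> str:
    return f"terms:by_source:{source}"


def plan_writes(term_id: str, record: Dict[str, Any]) -> List[Tuple[Any, ...]]:
    """Return the Redis commands needed to store one term and index it."""
    payload = json.dumps(record, ensure_ascii=False)
    return [
        ("SET", term_key(term_id), payload),
        ("SADD", category_index_key(record["category"]), term_id),
        ("SADD", source_index_key(record["source"]), term_id),
        # NX adds the member with score 0 only if absent, so an existing
        # usage count is not reset when a term is re-imported.
        ("ZADD", POPULARITY_KEY, "NX", 0, term_id),
    ]
```

With redis-py, each tuple could be sent through `client.execute_command(*command)`, and a lookup would bump popularity with `ZINCRBY terms:by_popularity 1 <term_id>`. Planning the commands as data keeps the key scheme testable without a running Redis server.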

⚙️ Setup

1. Installation

git clone [repository-url]
cd ChatMT

# Install UV (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Install dependencies
uv sync

2. Model Configuration

# Copy and edit environment file
cp .env.example .env

# .env file
OPENAI_API_KEY=your_openai_key
OLLAMA_MODEL=llama3.1  # or preferred local model
USE_LOCAL_MODEL=true   # or false for OpenAI

# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_DB=0
REDIS_PASSWORD=  # if required

LOG_LEVEL=INFO
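
One possible way to read these variables into a typed settings object is sketched below. `Settings` and `load_settings` are hypothetical names, not taken from the repository's `config.py`, and the sketch assumes the variables are already present in the process environment (for example, exported by the shell or loaded by a dotenv library); it does not parse the `.env` file itself.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    use_local_model: bool
    ollama_model: str
    openai_api_key: str
    redis_host: str
    redis_port: int
    redis_db: int
    redis_password: Optional[str]
    log_level: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping, defaulting to os.environ."""
    env = os.environ if env is None else env
    use_local = env.get("USE_LOCAL_MODEL", "true").strip().lower()
    return Settings(
        use_local_model=use_local in {"1", "true", "yes"},
        ollama_model=env.get("OLLAMA_MODEL", "llama3.1"),
        openai_api_key=env.get("OPENAI_API_KEY", ""),
        redis_host=env.get("REDIS_HOST", "localhost"),
        redis_port=int(env.get("REDIS_PORT", "6379")),
        redis_db=int(env.get("REDIS_DB", "0")),
        # An empty password is treated the same as no password.
        redis_password=env.get("REDIS_PASSWORD") or None,
        log_level=env.get("LOG_LEVEL", "INFO"),
    )
```

Accepting an optional mapping makes the loader easy to test with a plain dictionary, for example `load_settings({"USE_LOCAL_MODEL": "false"})`.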

3. Data Preparation

# Install Redis (if not already installed)
# macOS: brew install redis
# Ubuntu: sudo apt install redis-server
# Start Redis: redis-server

# Data processing (development-driven approach)
uv run src/data/pdf_extractor.py
uv run src/data/web_scraper.py

# Note: Database setup and schema will be implemented during development

4. Running

# CLI version
uv run main.py

# Web interface (Phase 3+)
uv run streamlit run src/interface/streamlit_app.py

📁 Repository Structure

ChatMT/
├── 📄 pyproject.toml              # UV dependency management
├── 📄 uv.lock                     # UV lockfile
├── 📄 README.md
├── 📄 ROADMAP.md
├── 📄 .env.example
├── 📄 .gitignore
├── 📄 main.py                     # CLI entry point
│
├── 📁 data/                       # All data resources
│   ├── 📁 raw/                    # Original, immutable resources
│   │   ├── 📄 dizionario_materano.pdf
│   │   └── 📁 images/             # Dialect-related images from WikiMatera
│   │
│   ├── 📁 scraped/                # Web-scraped content
│   │   ├── 📄 wikimatera_pages.json
│   │   ├── 📄 proverbi.json
│   │   ├── 📄 grammatica.json
│   │   ├── 📄 numeri.json
│   │   ├── 📄 poesie.json
│   │   ├── 📄 preghiere.json
│   │   ├── 📄 soprannomi.json
│   │   └── 📄 parole_antiche.json
│   │
│   ├── 📁 processed/              # Cleaned, structured data
│   │   ├── 📄 dictionary_terms.json
│   │   ├── 📄 phonetic_rules.json
│   │   ├── 📄 cultural_content.json
│   │   └── 📄 training_examples.json
│   │
│   └── 📁 exports/                # Generated datasets for training
│       ├── 📄 translation_pairs.csv
│       ├── 📄 conversation_examples.json
│       └── 📄 validation_set.json
│
├── 📁 src/                        # Source code
│   ├── 📄 __init__.py
│   │
│   ├── 📁 core/                   # Core application logic
│   │   ├── 📄 __init__.py
│   │   ├── 📄 config.py           # Configuration management
│   │   ├── 📄 exceptions.py       # Custom exceptions
│   │   └── 📄 constants.py        # Application constants
│   │
│   ├── 📁 models/                 # LLM and data models
│   │   ├── 📄 __init__.py
│   │   ├── 📄 model_factory.py    # Ollama/OpenAI factory
│   │   ├── 📄 schemas.py          # Pydantic models
│   │   └── 📄 prompts.py          # Prompt templates
│   │
│   ├── 📁 data/                   # Data processing and management
│   │   ├── 📄 __init__.py
│   │   ├── 📄 pdf_extractor.py    # PDF dictionary processing
│   │   ├── 📄 web_scraper.py      # WikiMatera scraping
│   │   ├── 📄 redis_manager.py    # Redis operations and vector search
│   │   ├── 📄 text_processor.py   # Text cleaning and normalization
│   │   └── 📄 knowledge_builder.py # Knowledge base construction
│   │
│   ├── 📁 workflows/              # LangGraph workflows
│   │   ├── 📄 __init__.py
│   │   ├── 📄 translation_workflow.py  # Core translation logic
│   │   ├── 📄 chat_workflow.py         # Conversational flow
│   │   ├── 📄 teaching_workflow.py     # Educational interactions
│   │   └── 📁 nodes/                   # Workflow nodes
│   │       ├── 📄 __init__.py
│   │       ├── 📄 language_detection.py
│   │       ├── 📄 dictionary_lookup.py
│   │       ├── 📄 phonetic_rules.py
│   │       ├── 📄 cultural_context.py
│   │       └── 📄 response_formatter.py
│   │
│   ├── 📁 agents/                 # Multi-agent system (Phase 4)
│   │   ├── 📄 __init__.py
│   │   ├── 📄 base_agent.py       # Abstract base agent
│   │   ├── 📄 chat_manager.py     # Main orchestrator
│   │   ├── 📄 translator.py       # Translation specialist
│   │   ├── 📄 storyteller.py      # Cultural narratives
│   │   ├── 📄 teacher.py          # Grammar and lessons
│   │   └── 📄 guide.py            # Matera tourism info
│   │
│   ├── 📁 interface/              # User interfaces
│   │   ├── 📄 __init__.py
│   │   ├── 📄 cli.py              # Command line interface
│   │   ├── 📄 streamlit_app.py    # Web chat interface
│   │   └── 📄 gradio_app.py       # Alternative web interface
│   │
│   ├── 📁 services/               # External service integrations
│   │   ├── 📄 __init__.py
│   │   ├── 📄 ollama_service.py   # Ollama integration
│   │   └── 📄 openai_service.py   # OpenAI integration
│   │
│   └── 📁 utils/                  # Utility functions
│       ├── 📄 __init__.py
│       ├── 📄 logging_config.py   # Logging setup
│       ├── 📄 validation.py       # Data validation
│       ├── 📄 text_utils.py       # Text processing utilities
│       └── 📄 performance.py      # Performance monitoring
│
├── 📁 scripts/                    # Development utilities (as needed)
│   └── 📄 __init__.py
│
├── 📁 tests/                      # Test suite
│   ├── 📄 __init__.py
│   ├── 📄 conftest.py            # Pytest configuration
│   ├── 📁 unit/                  # Unit tests
│   │   ├── 📄 test_data_processing.py
│   │   ├── 📄 test_workflows.py
│   │   ├── 📄 test_models.py
│   │   └── 📄 test_utils.py
│   ├── 📁 integration/           # Integration tests
│   │   ├── 📄 test_translation_flow.py
│   │   ├── 📄 test_redis_operations.py
│   │   └── 📄 test_agents.py
│   └── 📁 fixtures/              # Test data
│       ├── 📄 sample_dictionary.json
│       ├── 📄 test_conversations.json
│       └── 📄 validation_cases.json
│
├── 📁 config/                     # Configuration files
│   ├── 📄 logging.yaml           # Logging configuration
│   ├── 📄 redis.yaml             # Redis connection and schema config
│   └── 📄 agents.yaml            # Agent configurations
│
├── 📁 notebooks/                  # Jupyter notebooks for analysis
│   ├── 📄 01_dictionary_analysis.ipynb
│   ├── 📄 02_phonetic_patterns.ipynb
│   ├── 📄 03_cultural_content_exploration.ipynb
│   └── 📄 04_model_evaluation.ipynb
│
├── 📁 docs/                       # Documentation
│   ├── 📄 architecture.md        # System architecture
│   ├── 📄 data_sources.md        # Data documentation
│   ├── 📄 api_reference.md       # API documentation
│   ├── 📄 deployment.md          # Deployment guide
│   └── 📁 examples/              # Usage examples
│       ├── 📄 basic_translation.py
│       ├── 📄 chat_examples.py
│       └── 📄 agent_workflows.py
│
└── 📁 deployment/                 # Deployment configurations
    ├── 📄 Dockerfile
    ├── 📄 docker-compose.yml
    ├── 📄 requirements-prod.txt   # Production dependencies
    └── 📁 k8s/                   # Kubernetes configs (future)
        ├── 📄 deployment.yaml
        └── 📄 service.yaml

🎯 Project Goals

Educational

  • Preserve the Materano dialect for future generations
  • Provide an interactive learning tool
  • Document linguistic variants and nuances

Technological

  • Experiment with LangGraph for complex workflows
  • Implement conversational multi-agent systems
  • Integrate heterogeneous text resources

Cultural

  • Promote the cultural heritage of Matera
  • Build a bridge between tradition and technological innovation
  • Develop a tool for cultural tourism

🤝 Contributing

The project welcomes contributions from:

  • Native Materano speakers, for linguistic validation
  • Developers interested in regional dialects
  • Experts in NLP and conversational systems
  • Enthusiasts of Materano culture

📄 License

[To be defined - consider an open source license for the technical part]

🙏 Acknowledgments

  • Antonio D'Ercole for the "Dizionario Materano"
  • WikiMatera.it for the cultural resources
  • The Materano community for preserving the dialect
