BuiDoKhoiNguyen/PDFMiner

PDFMiner - Intelligent Document Processing Microservice System

A microservice system for managing, processing, and analyzing PDF documents, with OCR, advanced search, and AI-powered document processing.

🏗️ Architecture Overview

                            ┌─────────────────────┐
                            │   Config Server     │
                            │   (Port 8888)       │
                            └──────────┬──────────┘
                                       │
                            ┌──────────▼──────────┐
                            │ Discovery Service   │
                            │   (Eureka 8761)     │
                            └──────────┬──────────┘
                                       │
┌──────────────────────────────────────┼──────────────────────────────────────┐
│                                      │                                      │
│  ┌───────────────────────────────────▼─────────────────────────────────┐    │
│  │                         Gateway (Port 8080)                         │    │
│  │                    Spring Cloud Gateway + JWT                       │    │
│  └───────────────────────────────────┬─────────────────────────────────┘    │
│                                      │                                      │
│  ┌───────────────────┬───────────────┼───────────────┬──────────────────┐   │
│  │                   │               │               │                  │   │
│  ▼                   ▼               ▼               ▼                  ▼   │
│ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐ ┌────────────┐  │
│ │   User     │ │  Document  │ │  Storage   │ │ Processing │ │   WebApp   │  │
│ │  Service   │ │  Service   │ │  Service   │ │  Service   │ │  (React)   │  │
│ │ Port 8081  │ │ Port 8082  │ │ Port 8084  │ │  (Python)  │ │  (Vite)    │  │
│ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └─────┬──────┘ └────────────┘  │
│       │              │              │              │                        │
│       │   MongoDB    │ Elasticsearch│   AWS S3     │   Kafka                │
│       └──────────────┴──────────────┴──────────────┴────────────────────┐   │
│                                                                             │
└────────────────────────────────────┬────────────────────────────────────────┘
                                     │
                          ┌──────────▼──────────┐
                          │  Infrastructure     │
                          │  • Kafka/Zookeeper  │
                          │  • Elasticsearch    │
                          │  • Kibana           │
                          └─────────────────────┘

📦 Services

Core Services (Java/Spring Boot 3.4.4)

| Service | Port | Description | Technology Stack |
|---|---|---|---|
| Config Server | 8888 | Centralized configuration management | Spring Cloud Config |
| Discovery Service | 8761 | Service Registry & Discovery | Netflix Eureka Server |
| Gateway | 8080 | API Gateway, Routing, JWT Auth | Spring Cloud Gateway + WebFlux |
| User Service | 8081 | User management & authentication | Spring Boot + MongoDB + JWT |
| Document Service | 8082 | Document metadata management & search | Spring Boot + Elasticsearch + Kafka |
| Storage Service | 8084 | PDF file storage and management | Spring Boot + AWS S3 + JWT |

AI/ML Services (Python)

| Service | Description | Technology Stack |
|---|---|---|
| Processing Service | OCR, Table Extraction, Document Processing | Python + PaddleOCR + VietOCR + FastAPI + Kafka |

Frontend (React + TypeScript)

| Service | Port | Description | Technology Stack |
|---|---|---|---|
| WebApp | 5173 | User interface | React 19 + TypeScript + Vite + Ant Design + Material-UI |

🚀 Quick Start

Prerequisites

  • Java 17+
  • Maven 3.9+
  • Node.js 18+ & npm/yarn
  • Python 3.9+
  • Docker & Docker Compose
  • MongoDB (for User Service)
  • Elasticsearch 7.17+ (for Document Service)
  • AWS S3 (or S3-compatible storage, for Storage Service)

1. Clone Repository

git clone https://github.com/BuiDoKhoiNguyen/PDFMiner.git
cd PDFMiner

2. Start Infrastructure Services

# Start Kafka, Zookeeper, Elasticsearch, Kibana
cd infrastructure
docker-compose up -d

# Verify services are running
docker-compose ps

Services started:

  • Zookeeper: localhost:2181
  • Kafka: localhost:9092
  • Kafka UI: localhost:8386
  • Elasticsearch: localhost:9200
  • Kibana: localhost:5601
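
To confirm that the ports listed above are reachable, a small TCP probe can help. The sketch below is illustrative only: the function names are hypothetical and not part of the repository. An open port only shows that something is listening, not that the service behind it is healthy.

```python
import socket

# Ports taken from the "Services started" list above.
INFRA_PORTS = {
    "Zookeeper": 2181,
    "Kafka": 9092,
    "Kafka UI": 8386,
    "Elasticsearch": 9200,
    "Kibana": 5601,
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_infrastructure(host: str = "localhost", ports=None) -> dict:
    """Map each service name to whether its port accepts connections."""
    ports = INFRA_PORTS if ports is None else ports
    return {name: is_port_open(host, port) for name, port in ports.items()}
```

Calling `check_infrastructure()` after `docker-compose up -d` returns a dictionary such as `{"Kafka": True, ...}`; any `False` entry points at a container worth inspecting with `docker-compose ps`.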

3. Start Core Services (Java)

3.1. Build All Services

# Build from the project root
mvn clean install -DskipTests

3.2. Start Services (in order)

Each command below blocks while its service runs, so start each service in its own terminal from the project root.

Step 1: Start Config Server (must be started first)

cd config-server
mvn spring-boot:run

Step 2: Start Discovery Service

cd discovery-service
mvn spring-boot:run

Step 3: Start API Gateway

cd gateway
mvn spring-boot:run

Step 4: Start Business Services (can run in parallel)

# Terminal 1 - User Service
cd user-service
mvn spring-boot:run

# Terminal 2 - Document Service
cd document-service
mvn spring-boot:run

# Terminal 3 - Storage Service
cd storage-service
mvn spring-boot:run
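
The startup order above can be written down as a dependency graph and sorted into "waves" of services that may start together. This is a minimal sketch using Python's standard `graphlib`. The `DEPENDS_ON` mapping simply encodes the order given in steps 1 to 4 and is an assumption about the actual runtime dependencies.

```python
from graphlib import TopologicalSorter

# Each service maps to the services that must be running before it starts.
# Assumption: this mirrors steps 1-4 above, not a verified dependency list.
DEPENDS_ON = {
    "config-server": set(),
    "discovery-service": {"config-server"},
    "gateway": {"discovery-service"},
    "user-service": {"gateway"},
    "document-service": {"gateway"},
    "storage-service": {"gateway"},
}


def startup_waves(depends_on: dict) -> list:
    """Group services into waves; services in one wave can start in parallel."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(startup_waves(DEPENDS_ON), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

The last wave contains the three business services, which matches the note that they can run in parallel.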

4. Start AI/ML Processing Service (Python)

cd processing-service

# Install dependencies
pip install -r requirements.txt

# Start service
chmod +x start.sh
./start.sh

# Or run directly
python server.py

5. Start Frontend WebApp

cd webapp

# Install dependencies
npm install

# Start development server
npm run dev

6. Access Services

| Service | URL | Description |
|---|---|---|
| WebApp | http://localhost:5173 | User interface |
| API Gateway | http://localhost:8080 | API Gateway |
| Eureka Dashboard | http://localhost:8761 | Service Registry Dashboard |
| Kafka UI | http://localhost:8386 | Kafka Management UI |
| Kibana | http://localhost:5601 | Elasticsearch Dashboard |
| Elasticsearch | http://localhost:9200 | Elasticsearch API |

🔧 Configuration

Centralized Configuration (Config Server)

Configuration files are managed centrally in the config/ directory:

config/
├── application.properties        # Shared configuration
├── document-service.yml         # Document Service config
├── eureka-server.yml            # Discovery Service config
├── gateway.yml                  # Gateway config
├── storage-service.yml          # Storage Service config
└── user-service.yml             # User Service config

Environment Variables

Create a .env file or set the environment variables:

# MongoDB (User Service)
MONGODB_URI=mongodb://localhost:27017/pdfminer
MONGODB_DATABASE=pdfminer

# Elasticsearch (Document Service)
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200

# AWS S3 (Storage Service)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=ap-southeast-1
AWS_S3_BUCKET=pdfminer-storage

# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092

# JWT Secret
JWT_SECRET=your_jwt_secret_key_here
JWT_EXPIRATION=86400000

# Google Gemini API (Document Service)
GEMINI_API_KEY=your_gemini_api_key_here
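
As an illustration of how a Python component such as the Processing Service might read these variables, here is a minimal sketch. The `Settings` class and `load_settings` function are hypothetical. The defaults are copied from the sample values above, and treating `JWT_EXPIRATION` as milliseconds is an assumption (86400000 ms equals 24 hours).

```python
import os
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Settings:
    mongodb_uri: str
    elasticsearch_url: str
    kafka_bootstrap_servers: list
    jwt_expiration: timedelta


def load_settings(env=None) -> Settings:
    """Build Settings from a mapping of environment variables."""
    env = os.environ if env is None else env
    host = env.get("ELASTICSEARCH_HOST", "localhost")
    port = int(env.get("ELASTICSEARCH_PORT", "9200"))
    servers = env.get("KAFKA_BOOTSTRAP_SERVERS", "localhost:9092")
    return Settings(
        mongodb_uri=env.get("MONGODB_URI", "mongodb://localhost:27017/pdfminer"),
        elasticsearch_url=f"http://{host}:{port}",
        # Kafka accepts a comma-separated list of brokers.
        kafka_bootstrap_servers=[s.strip() for s in servers.split(",") if s.strip()],
        # Assumption: JWT_EXPIRATION is expressed in milliseconds.
        jwt_expiration=timedelta(milliseconds=int(env.get("JWT_EXPIRATION", "86400000"))),
    )
```

Passing an explicit mapping, for example `load_settings({"ELASTICSEARCH_PORT": "9201"})`, keeps the function easy to test without touching the real environment.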

🌟 Features

Core Features

  • βœ… QuαΊ£n lΓ½ người dΓΉng: Đăng kΓ½, Δ‘Δƒng nhαΊ­p, phΓ’n quyền vα»›i JWT
  • βœ… Upload & Storage: Upload PDF files lΓͺn AWS S3
  • βœ… Document Processing: OCR tiαΊΏng Việt vα»›i PaddleOCR & VietOCR
  • βœ… Table Extraction: TrΓ­ch xuαΊ₯t bαΊ£ng tα»« PDF thΓ nh structured data
  • βœ… Full-text Search: TΓ¬m kiαΊΏm nα»™i dung tΓ i liệu vα»›i Elasticsearch
  • βœ… Metadata Management: QuαΊ£n lΓ½ metadata vΓ  indexing
  • βœ… Real-time Processing: Xα»­ lΓ½ bαΊ₯t Δ‘α»“ng bα»™ vα»›i Kafka

Advanced Features

  • 🔄 Microservice Architecture: Scalable and maintainable
  • 🔐 Security: JWT authentication & authorization
  • 📊 Monitoring: Service health checks & monitoring
  • 🚀 Service Discovery: Automatic service registration with Eureka
  • ⚙️ Centralized Config: Centralized configuration management
  • 🎯 API Gateway: Single entry point with smart routing

📚 API Documentation

Authentication APIs (via Gateway)

Base URL: http://localhost:8080/api/users

# Register
POST /api/users/auth/register
Content-Type: application/json
{
  "username": "user@example.com",
  "password": "password123",
  "fullName": "John Doe"
}

# Login
POST /api/users/auth/login
Content-Type: application/json
{
  "username": "user@example.com",
  "password": "password123"
}

# Response
{
  "accessToken": "eyJhbGciOiJIUzI1NiIs...",
  "tokenType": "Bearer"
}
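
The login call and the resulting `Authorization` header can be sketched with the standard library alone. The helper names below are hypothetical. The requests are built but not sent, so the snippet runs without a live Gateway.

```python
import json
import urllib.request

GATEWAY_URL = "http://localhost:8080"  # Gateway address used throughout this README


def build_json_post(path: str, payload: dict) -> urllib.request.Request:
    """Create (but do not send) a JSON POST request against the Gateway."""
    return urllib.request.Request(
        GATEWAY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_login_request(username: str, password: str) -> urllib.request.Request:
    """Request for POST /api/users/auth/login with the documented body."""
    return build_json_post(
        "/api/users/auth/login", {"username": username, "password": password}
    )


def authorization_header(login_response: dict) -> dict:
    """Turn the documented login response into an Authorization header."""
    token_type = login_response["tokenType"]
    access_token = login_response["accessToken"]
    return {"Authorization": f"{token_type} {access_token}"}
```

To actually send a request, pass it to `urllib.request.urlopen(...)` and decode the body with `json.load`; the returned dictionary can then be fed to `authorization_header`.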

Document APIs (via Gateway)

Base URL: http://localhost:8080/api/documents (uploads go through /api/storage/upload, as listed below)

# Upload Document
POST /api/storage/upload
Authorization: Bearer {token}
Content-Type: multipart/form-data
file: [PDF file]

# Search Documents
GET /api/documents/search?query=keyword
Authorization: Bearer {token}

# Get Document Details
GET /api/documents/{id}
Authorization: Bearer {token}
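
The Upload Document request above sends the PDF as `multipart/form-data`. The sketch below shows one way to encode such a body by hand with the standard library. The helper names are hypothetical, and it assumes a simple filename without quotes or non-ASCII characters.

```python
import urllib.request
import uuid


def encode_multipart_file(field: str, filename: str, content: bytes, boundary=None):
    """Encode one file as a multipart/form-data body; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + content + tail, f"multipart/form-data; boundary={boundary}"


def build_upload_request(token: str, filename: str, content: bytes,
                         base_url: str = "http://localhost:8080") -> urllib.request.Request:
    """Create (but do not send) the POST /api/storage/upload request."""
    body, content_type = encode_multipart_file("file", filename, content)
    return urllib.request.Request(
        f"{base_url}/api/storage/upload",
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
```

In practice an HTTP client library would handle the encoding; the manual version is shown only to make the wire format visible.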

🛠️ Technology Stack

Backend (Java)

  • Spring Boot 3.4.4
  • Spring Cloud 2024.0.1
    • Spring Cloud Config
    • Spring Cloud Gateway
    • Netflix Eureka
    • OpenFeign
  • Spring Security + JWT
  • Spring Data MongoDB
  • Spring Data Elasticsearch
  • Spring Kafka
  • AWS SDK for S3
  • Lombok
  • ModelMapper

AI/ML (Python)

  • FastAPI - Web framework
  • PaddleOCR - OCR engine
  • VietOCR - Vietnamese OCR
  • Kafka-Python - Kafka consumer/producer
  • PyTorch - Deep learning framework
  • PIL/OpenCV - Image processing
  • pandas - Data manipulation
  • Google Generative AI - AI-powered text processing

Frontend

  • React 19
  • TypeScript
  • Vite - Build tool
  • Ant Design Pro Components
  • Material-UI
  • React Router DOM
  • Axios - HTTP client
  • TanStack Query - Server state management
  • JWT Decode

Infrastructure

  • Kafka + Zookeeper - Message broker
  • Elasticsearch 7.17 - Search engine
  • Kibana 7.17 - Elasticsearch UI
  • MongoDB - User data storage
  • AWS S3 - File storage
  • Docker - Containerization

📂 Project Structure

PDFMiner/
├── config/                          # Centralized configuration files
│   ├── application.properties
│   ├── document-service.yml
│   ├── eureka-server.yml
│   ├── gateway.yml
│   ├── storage-service.yml
│   └── user-service.yml
├── config-server/                   # Spring Cloud Config Server
├── discovery-service/               # Eureka Server
├── gateway/                         # API Gateway
├── user-service/                    # User management & authentication
├── document-service/                # Document metadata & search
├── storage-service/                 # File storage with AWS S3
├── processing-service/              # Python AI/ML service
│   ├── PaddleOCR/                  # OCR engine
│   ├── vietocr/                    # Vietnamese OCR
│   ├── kafka_consumer.py           # Kafka consumer
│   ├── server.py                   # FastAPI server
│   ├── table_ocr.py                # Table extraction
│   └── requirements.txt
├── webapp/                          # React frontend
│   ├── src/
│   ├── public/
│   └── package.json
├── infrastructure/                  # Docker compose files
│   └── docker-compose.yml
└── pom.xml                         # Parent POM

🔄 Data Flow

Upload & Process Document Flow

1. User uploads PDF via WebApp (React)
   ↓
2. Gateway routes to Storage Service
   ↓
3. Storage Service uploads to AWS S3
   ↓
4. Storage Service publishes event to Kafka
   ↓
5. Processing Service (Python) consumes event
   ↓
6. Processing Service performs OCR & Table Extraction
   ↓
7. Processing Service sends results to Document Service
   ↓
8. Document Service indexes to Elasticsearch
   ↓
9. WebApp displays processing status & results
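
Steps 4 to 8 of the flow above can be mimicked in memory to see how the pieces hand off work. This is only a simulation under stated assumptions: a `queue.Queue` stands in for the Kafka topic, a dictionary stands in for Elasticsearch, `fake_ocr` replaces the real OCR, and the event fields (`documentId`, `s3Key`) are hypothetical because the actual message schema is not documented here.

```python
import queue


def make_upload_event(document_id: str, s3_key: str) -> dict:
    """Hypothetical event published by the Storage Service (step 4)."""
    return {"documentId": document_id, "s3Key": s3_key}


def fake_ocr(s3_key: str) -> str:
    """Placeholder for steps 5-6: download from S3, run OCR and table extraction."""
    return f"text extracted from {s3_key}"


def process_events(events: queue.Queue, index: dict) -> int:
    """Drain the queue, 'OCR' each document, and store the result in the index."""
    processed = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return processed
        # Steps 7-8: hand the result over and index it under the document id.
        index[event["documentId"]] = {
            "content": fake_ocr(event["s3Key"]),
            "status": "PROCESSED",
        }
        processed += 1


topic = queue.Queue()   # stands in for the Kafka topic
search_index = {}       # stands in for Elasticsearch
topic.put(make_upload_event("doc-1", "uploads/report.pdf"))
print(process_events(topic, search_index))   # 1
print(search_index["doc-1"]["status"])       # PROCESSED
```

The point of the queue in the middle is decoupling: the upload returns as soon as the event is published, while OCR runs asynchronously on the consumer side.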

Search Flow

1. User searches from WebApp
   ↓
2. Gateway routes to Document Service
   ↓
3. Document Service queries Elasticsearch
   ↓
4. Results returned with metadata
   ↓
5. WebApp displays search results
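
Step 3 of the search flow amounts to sending a query body to Elasticsearch. Below is a minimal sketch of such a body using the `multi_match` query from the Elasticsearch Query DSL. The field names `title` and `content` are hypothetical, since the actual index mapping is not shown in this README.

```python
def build_search_body(keyword: str, fields=("title", "content"), size: int = 10) -> dict:
    """Build a multi_match full-text query body for Elasticsearch's _search API."""
    if not keyword.strip():
        raise ValueError("keyword must not be empty")
    return {
        "size": size,
        # multi_match runs the same full-text query across several fields.
        "query": {"multi_match": {"query": keyword, "fields": list(fields)}},
        # Ask Elasticsearch to return highlighted snippets for the same fields.
        "highlight": {"fields": {name: {} for name in fields}},
    }
```

The resulting dictionary can be serialized with `json.dumps` and posted to an index's `_search` endpoint, for example `http://localhost:9200/<index>/_search`.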

🧪 Testing

# Test all Java services
mvn test

# Test specific service
cd user-service
mvn test

# Test Python service
cd processing-service
pytest

📊 Monitoring & Health Checks

Service Health Endpoints

# Check all registered services
curl http://localhost:8761/eureka/apps

# Individual service health
curl http://localhost:8081/actuator/health  # User Service
curl http://localhost:8082/actuator/health  # Document Service
curl http://localhost:8084/actuator/health  # Storage Service
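
The individual health checks above can be collected into one report. The sketch below assumes each endpoint returns the usual Spring Boot Actuator shape, a JSON object with a `status` field such as `"UP"`. The function names are hypothetical, and any connection or HTTP error is reported as `"DOWN"`.

```python
import json
import urllib.request

# URLs taken from the curl commands above.
HEALTH_URLS = {
    "user-service": "http://localhost:8081/actuator/health",
    "document-service": "http://localhost:8082/actuator/health",
    "storage-service": "http://localhost:8084/actuator/health",
}


def fetch_json(url: str, timeout: float = 3.0) -> dict:
    """GET a URL and decode the JSON response body."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def health_report(urls=None, fetch=fetch_json) -> dict:
    """Map each service name to its reported status, or DOWN if unreachable."""
    urls = HEALTH_URLS if urls is None else urls
    report = {}
    for name, url in urls.items():
        try:
            report[name] = fetch(url).get("status", "UNKNOWN")
        except (OSError, ValueError):  # connection/HTTP errors, invalid JSON
            report[name] = "DOWN"
    return report
```

The `fetch` parameter exists so the report logic can be exercised with a stub instead of live services.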

Kafka Monitoring

Access Kafka UI: http://localhost:8386

Elasticsearch Monitoring

Access Kibana: http://localhost:5601

🐛 Troubleshooting

Common Issues

1. Service does not register with Eureka

# Check that the Eureka server is running
curl http://localhost:8761

# Check the config in application.yml
eureka:
  client:
    service-url:
      defaultZone: http://localhost:8761/eureka/
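
To see which services actually registered, the `/eureka/apps` response can be inspected programmatically. The sketch below assumes the JSON form of that response (requested with an `Accept: application/json` header), in which applications are listed under `applications.application` and each has an `instance` list carrying a `status`. The expected service names are hypothetical, because the `spring.application.name` values are not shown in this README.

```python
def registered_instances(eureka_apps: dict) -> dict:
    """Map each application name (upper-cased) to the statuses of its instances."""
    apps = eureka_apps.get("applications", {}).get("application", [])
    return {
        app["name"].upper(): [inst.get("status", "UNKNOWN") for inst in app.get("instance", [])]
        for app in apps
    }


def missing_services(eureka_apps: dict, expected) -> list:
    """Return expected services that have no instance with status UP."""
    registered = registered_instances(eureka_apps)
    up = {name for name, statuses in registered.items() if "UP" in statuses}
    return sorted(name for name in expected if name.upper() not in up)


# Hand-written sample in the assumed response shape, not real output.
sample = {
    "applications": {
        "application": [
            {"name": "USER-SERVICE", "instance": [{"status": "UP"}]},
            {"name": "GATEWAY", "instance": [{"status": "DOWN"}]},
        ]
    }
}
print(missing_services(sample, ["user-service", "gateway", "document-service"]))
```

A service that appears in the result either never registered or has no healthy instance, which narrows down where to look in its logs.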

2. Kafka connection refused

# Check that Kafka is running
docker ps | grep kafka

# Restart Kafka
cd infrastructure
docker-compose restart kafka

3. Elasticsearch connection timeout

# Check Elasticsearch
curl http://localhost:9200

# Restart Elasticsearch (run from the infrastructure/ directory)
docker-compose restart elasticsearch

4. MongoDB connection error

# Check that MongoDB is running
mongosh --eval "db.adminCommand('ping')"

# Check the connection string in the config

Development Guide

Adding a New Service

  1. Create new Maven module
  2. Add to parent pom.xml
  3. Configure bootstrap.yml with Config Server
  4. Register with Eureka
  5. Add routing in Gateway
  6. Update documentation

Code Style

  • Follow Google Java Style Guide
  • Use Lombok for boilerplate code
  • Write meaningful commit messages
  • Add Javadoc for public APIs

🚀 Deployment

Docker Deployment (Coming Soon)

# Build all services
./build-all.sh

# Deploy with Docker Compose
docker-compose up -d

Kubernetes Deployment (Coming Soon)

kubectl apply -f k8s/

🀝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📝 License

This project is licensed under the MIT License.

👥 Team

📞 Contact

For questions or support, please open an issue on GitHub.


Made with ❤️ by PDFMiner Team

# User Management
POST /api/users/register
POST /api/users/login
GET  /api/users/profile

# Document Management
POST /api/documents/upload
GET  /api/documents/{id}
POST /api/documents/search

# Vector Search
POST /api/embeddings/search
POST /api/embeddings/similar

# File Storage
POST /api/storage/upload
GET  /api/storage/download/{id}

Direct Service APIs

🔍 Monitoring & Observability

Health Checks

# Overall system health
curl http://localhost:8080/actuator/health

# Individual services
curl http://localhost:8081/actuator/health  # User Service
curl http://localhost:8082/actuator/health  # Metadata Service
curl http://localhost:8083/health           # Embedding Service

Metrics & Logging

🛠️ Development

Adding New Service

  1. Create new Maven module:
mkdir new-service
cd new-service
# Copy from template service
  2. Update root pom.xml:
<modules>
    <!-- existing modules -->
    <module>new-service</module>
</modules>
  3. Configure service discovery and config

Testing

# Unit tests
mvn test

# Integration tests
mvn integration-test

# End-to-end tests
cd tests && python -m pytest

🔒 Security

  • JWT Authentication with Spring Security
  • Rate Limiting in the Gateway
  • Input Validation and sanitization
  • HTTPS/TLS for production
  • API Key management for external services

📊 Performance

Benchmarks

  • Gateway Throughput: 10,000 RPS
  • Vector Search: <100ms response time
  • Document Processing: 50 documents/minute
  • OCR Processing: 2-5 pages/minute
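
The OCR figure above translates directly into a rough time estimate per document. The helper below is a hypothetical back-of-the-envelope calculation that takes the 2-5 pages/minute range at face value. Real throughput depends on hardware, page content, and concurrency.

```python
def ocr_minutes(pages: int, slow_rate: float = 2.0, fast_rate: float = 5.0) -> tuple:
    """Return (best_case, worst_case) minutes to OCR `pages` pages.

    Rates are in pages per minute; the defaults come from the benchmark list.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    if slow_rate <= 0 or fast_rate < slow_rate:
        raise ValueError("rates must satisfy 0 < slow_rate <= fast_rate")
    return pages / fast_rate, pages / slow_rate


best, worst = ocr_minutes(100)
print(f"100 pages: between {best:.0f} and {worst:.0f} minutes")
```

For a 100-page document this gives 100 / 5 = 20 minutes at best and 100 / 2 = 50 minutes at worst on a single worker.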

Scaling

  • Horizontal Scaling: Multiple instances with load balancing
  • Database Sharding: Partitioned by tenant/user
  • Caching Strategy: Redis for frequently accessed data
  • CDN: Static files and images

🚒 Deployment

Docker

# Build all services
docker-compose build

# Deploy to staging
docker-compose -f docker-compose.staging.yml up -d

# Deploy to production
docker-compose -f docker-compose.prod.yml up -d

Kubernetes

# Apply configurations
kubectl apply -f k8s/

# Check deployment status
kubectl get pods
kubectl get services

Cloud Deployment

  • AWS: EKS + RDS + ElastiCache + S3
  • GCP: GKE + Cloud SQL + Cloud Storage
  • Azure: AKS + Azure Database + Blob Storage

📚 Documentation

🀝 Contributing

  1. Fork repository
  2. Create feature branch: git checkout -b feature/new-feature
  3. Commit changes: git commit -m 'Add new feature'
  4. Push branch: git push origin feature/new-feature
  5. Create Pull Request

📄 License

This project is licensed under the MIT License - see LICENSE file.

🆘 Support


Built with ❤️ using Spring Boot, FastAPI, and modern microservice patterns
