A microservice system for managing, processing, and analyzing PDF documents, with OCR, advanced search, and AI-powered document processing.
                         +---------------------+
                         |    Config Server    |
                         |     (Port 8888)     |
                         +----------+----------+
                                    |
                         +----------v----------+
                         |  Discovery Service  |
                         |    (Eureka 8761)    |
                         +----------+----------+
                                    |
                  +-----------------v------------------+
                  |        Gateway (Port 8080)         |
                  |     Spring Cloud Gateway + JWT     |
                  +-----------------+------------------+
                                    |
      +--------------+--------------+--------------+--------------+
      |              |              |              |              |
      v              v              v              v              v
+------------+ +------------+ +------------+ +------------+ +------------+
|    User    | |  Document  | |  Storage   | | Processing | |   WebApp   |
|  Service   | |  Service   | |  Service   | |  Service   | |  (React)   |
| Port 8081  | | Port 8082  | | Port 8084  | |  (Python)  | |   (Vite)   |
+-----+------+ +-----+------+ +-----+------+ +-----+------+ +------------+
      |              |              |              |
   MongoDB     Elasticsearch     AWS S3          Kafka
      |              |              |              |
      +--------------+--------------+--------------+
                                    |
                         +----------v----------+
                         |   Infrastructure    |
                         |  • Kafka/Zookeeper  |
                         |  • Elasticsearch    |
                         |  • Kibana           |
                         +---------------------+
Service | Port | Description | Technology Stack |
---|---|---|---|
Config Server | 8888 | Centralized configuration management | Spring Cloud Config |
Discovery Service | 8761 | Service Registry & Discovery | Netflix Eureka Server |
Gateway | 8080 | API Gateway, Routing, JWT Auth | Spring Cloud Gateway + WebFlux |
User Service | 8081 | User management & authentication | Spring Boot + MongoDB + JWT |
Document Service | 8082 | Document metadata management & search | Spring Boot + Elasticsearch + Kafka |
Storage Service | 8084 | PDF file storage and management | Spring Boot + AWS S3 + JWT |
Service | Description | Technology Stack |
---|---|---|
Processing Service | OCR, Table Extraction, Document Processing | Python + PaddleOCR + VietOCR + FastAPI + Kafka |
Service | Port | Description | Technology Stack |
---|---|---|---|
WebApp | 5173 | User interface | React 19 + TypeScript + Vite + Ant Design + Material-UI |
- Java 17+
- Maven 3.9+
- Node.js 18+ & npm/yarn
- Python 3.9+
- Docker & Docker Compose
- MongoDB (for User Service)
- Elasticsearch 7.17+ (for Document Service)
- AWS S3 (or S3-compatible storage, for Storage Service)
git clone https://github.com/BuiDoKhoiNguyen/PDFMiner.git
cd PDFMiner
# Start Kafka, Zookeeper, Elasticsearch, Kibana
cd infrastructure
docker-compose up -d
# Verify services are running
docker-compose ps
Services started:
- Zookeeper: localhost:2181
- Kafka: localhost:9092
- Kafka UI: localhost:8386
- Elasticsearch: localhost:9200
- Kibana: localhost:5601
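As a quick cross-check of the ports listed above, a minimal Python sketch can probe each one over TCP. The helper names `is_port_open` and `check_infrastructure` are illustrative and not part of this repository; an open port only shows that something is listening, not that the service is fully ready.

```python
import socket

# Host/port pairs copied from the "Services started" list.
INFRA_PORTS = {
    "Zookeeper": ("localhost", 2181),
    "Kafka": ("localhost", 9092),
    "Kafka UI": ("localhost", 8386),
    "Elasticsearch": ("localhost", 9200),
    "Kibana": ("localhost", 5601),
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_infrastructure(ports=INFRA_PORTS) -> dict:
    """Map each service name to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, (host, port) in ports.items()}
```

Calling `check_infrastructure()` after `docker-compose up -d` returns a dict such as `{"Kafka": True, ...}`, which can be printed or used to block startup of the Java services.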
# Build from the project root
mvn clean install -DskipTests
Step 1: Start Config Server (must be started first)
cd config-server
mvn spring-boot:run
Step 2: Start Discovery Service
cd discovery-service
mvn spring-boot:run
Step 3: Start API Gateway
cd gateway
mvn spring-boot:run
Step 4: Start Business Services (can run in parallel)
# Terminal 1 - User Service
cd user-service
mvn spring-boot:run
# Terminal 2 - Document Service
cd document-service
mvn spring-boot:run
# Terminal 3 - Storage Service
cd storage-service
mvn spring-boot:run
cd processing-service
# Install dependencies
pip install -r requirements.txt
# Start service
chmod +x start.sh
./start.sh
# Or run directly
python server.py
cd webapp
# Install dependencies
npm install
# Start development server
npm run dev
Service | URL | Description |
---|---|---|
WebApp | http://localhost:5173 | User interface |
API Gateway | http://localhost:8080 | API Gateway |
Eureka Dashboard | http://localhost:8761 | Service Registry Dashboard |
Kafka UI | http://localhost:8386 | Kafka Management UI |
Kibana | http://localhost:5601 | Elasticsearch Dashboard |
Elasticsearch | http://localhost:9200 | Elasticsearch API |
Configuration files are managed centrally in the config/ directory:
config/
├── application.properties   # Shared configuration
├── document-service.yml     # Document Service config
├── eureka-server.yml        # Discovery Service config
├── gateway.yml              # Gateway config
├── storage-service.yml      # Storage Service config
└── user-service.yml         # User Service config
Create a .env file or set environment variables:
# MongoDB (User Service)
MONGODB_URI=mongodb://localhost:27017/pdfminer
MONGODB_DATABASE=pdfminer
# Elasticsearch (Document Service)
ELASTICSEARCH_HOST=localhost
ELASTICSEARCH_PORT=9200
# AWS S3 (Storage Service)
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=ap-southeast-1
AWS_S3_BUCKET=pdfminer-storage
# Kafka
KAFKA_BOOTSTRAP_SERVERS=localhost:9092
# JWT Secret
JWT_SECRET=your_jwt_secret_key_here
JWT_EXPIRATION=86400000
# Google Gemini API (Document Service)
GEMINI_API_KEY=your_gemini_api_key_here
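To illustrate how these variables might be consumed on the Python side, here is a minimal sketch that parses `KEY=VALUE` text and builds a typed settings object. The `Settings` class, its field names, and the reading of `JWT_EXPIRATION` as milliseconds are assumptions for the example; the repository may load its configuration differently.

```python
from dataclasses import dataclass


def parse_env_text(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


@dataclass(frozen=True)
class Settings:
    # Hypothetical grouping of a few of the variables listed above.
    mongodb_uri: str
    elasticsearch_url: str
    kafka_bootstrap_servers: str
    jwt_expiration_ms: int  # assumes JWT_EXPIRATION is in milliseconds


def load_settings(env: dict) -> Settings:
    """Build Settings from a mapping such as os.environ or parse_env_text output."""
    return Settings(
        mongodb_uri=env["MONGODB_URI"],
        elasticsearch_url=f"http://{env['ELASTICSEARCH_HOST']}:{env['ELASTICSEARCH_PORT']}",
        kafka_bootstrap_servers=env["KAFKA_BOOTSTRAP_SERVERS"],
        jwt_expiration_ms=int(env["JWT_EXPIRATION"]),
    )
```

If the expiration value is indeed in milliseconds, 86400000 corresponds to 24 hours. Passing `{**parse_env_text(text), **os.environ}` would let real environment variables override the file.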
- User management: registration, login, and authorization with JWT
- Upload & Storage: Upload PDF files to AWS S3
- Document Processing: Vietnamese OCR with PaddleOCR & VietOCR
- Table Extraction: Extract tables from PDFs into structured data
- Full-text Search: Search document content with Elasticsearch
- Metadata Management: Metadata management and indexing
- Real-time Processing: Asynchronous processing with Kafka
- Microservice Architecture: Scalable and maintainable
- Security: JWT authentication & authorization
- Monitoring: Service health checks & monitoring
- Service Discovery: Automatic service registration with Eureka
- Centralized Config: Centralized configuration management
- API Gateway: Single entry point with smart routing
Base URL: http://localhost:8080/api/users
# Register
POST /api/users/auth/register
Content-Type: application/json
{
"username": "[email protected]",
"password": "password123",
"fullName": "John Doe"
}
# Login
POST /api/users/auth/login
Content-Type: application/json
{
"username": "[email protected]",
"password": "password123"
}
# Response
{
"accessToken": "eyJhbGciOiJIUzI1NiIs...",
"tokenType": "Bearer"
}
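A client typically turns the login response into an `Authorization` header for later calls. The sketch below shows that step in Python, plus a helper that reads the token's payload for debugging. Both function names are illustrative, and the decoder deliberately does not verify the signature, so it must not be used for authorization decisions.

```python
import base64
import json


def auth_header(login_response: dict) -> dict:
    """Build the Authorization header from a login response like the one above."""
    token_type = login_response["tokenType"]
    token = login_response["accessToken"]
    return {"Authorization": f"{token_type} {token}"}


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    payload_b64 = token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padding = "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64 + padding))
```

The claims inside the payload (for example a subject or expiry) depend on how User Service issues tokens, which this README does not specify.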
Base URL: http://localhost:8080/api/documents
# Upload Document
POST /api/storage/upload
Authorization: Bearer {token}
Content-Type: multipart/form-data
file: [PDF file]
# Search Documents
GET /api/documents/search?query=keyword
Authorization: Bearer {token}
# Get Document Details
GET /api/documents/{id}
Authorization: Bearer {token}
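The search call above can be built with the Python standard library as follows. This is a sketch: `GATEWAY` and `build_search_request` are illustrative names, and only the path, the `query` parameter, and the Bearer header come from this README.

```python
import urllib.parse
import urllib.request

GATEWAY = "http://localhost:8080"  # API Gateway address from the access table


def build_search_request(query: str, token: str, base_url: str = GATEWAY) -> urllib.request.Request:
    """Build (but do not send) GET /api/documents/search?query=... with a Bearer token."""
    params = urllib.parse.urlencode({"query": query})
    url = f"{base_url}/api/documents/search?{params}"
    return urllib.request.Request(
        url,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
```

Sending it is a matter of `urllib.request.urlopen(build_search_request("invoice", token))`; the shape of the response body is not documented here, so the sketch stops at building the request.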
- Spring Boot 3.4.4
- Spring Cloud 2024.0.1
- Spring Cloud Config
- Spring Cloud Gateway
- Netflix Eureka
- OpenFeign
- Spring Security + JWT
- Spring Data MongoDB
- Spring Data Elasticsearch
- Spring Kafka
- AWS SDK for S3
- Lombok
- ModelMapper
- FastAPI - Web framework
- PaddleOCR - OCR engine
- VietOCR - Vietnamese OCR
- Kafka-Python - Kafka consumer/producer
- PyTorch - Deep learning framework
- PIL/OpenCV - Image processing
- pandas - Data manipulation
- Google Generative AI - AI-powered text processing
- React 19
- TypeScript
- Vite - Build tool
- Ant Design Pro Components
- Material-UI
- React Router DOM
- Axios - HTTP client
- TanStack Query - Server state management
- JWT Decode
- Kafka + Zookeeper - Message broker
- Elasticsearch 7.17 - Search engine
- Kibana 7.17 - Elasticsearch UI
- MongoDB - User data storage
- AWS S3 - File storage
- Docker - Containerization
PDFMiner/
├── config/                     # Centralized configuration files
│   ├── application.properties
│   ├── document-service.yml
│   ├── eureka-server.yml
│   ├── gateway.yml
│   ├── storage-service.yml
│   └── user-service.yml
├── config-server/              # Spring Cloud Config Server
├── discovery-service/          # Eureka Server
├── gateway/                    # API Gateway
├── user-service/               # User management & authentication
├── document-service/           # Document metadata & search
├── storage-service/            # File storage with AWS S3
├── processing-service/         # Python AI/ML service
│   ├── PaddleOCR/              # OCR engine
│   ├── vietocr/                # Vietnamese OCR
│   ├── kafka_consumer.py       # Kafka consumer
│   ├── server.py               # FastAPI server
│   ├── table_ocr.py            # Table extraction
│   └── requirements.txt
├── webapp/                     # React frontend
│   ├── src/
│   ├── public/
│   └── package.json
├── infrastructure/             # Docker compose files
│   └── docker-compose.yml
└── pom.xml                     # Parent POM
1. User uploads PDF via WebApp (React)
↓
2. Gateway routes to Storage Service
↓
3. Storage Service uploads to AWS S3
↓
4. Storage Service publishes event to Kafka
↓
5. Processing Service (Python) consumes event
↓
6. Processing Service performs OCR & Table Extraction
↓
7. Processing Service sends results to Document Service
↓
8. Document Service indexes to Elasticsearch
↓
9. WebApp displays processing status & results
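Steps 4 and 5 hinge on a message that both Storage Service and Processing Service understand. The README does not give the event schema, so the sketch below uses a hypothetical `DocumentUploaded` event with made-up fields to show the encode and decode round trip as UTF-8 JSON bytes, which is the form a Kafka producer or consumer would handle.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DocumentUploaded:
    # Hypothetical fields; the real event schema is not given in this README.
    document_id: str
    bucket: str
    object_key: str


def encode_event(event: DocumentUploaded) -> bytes:
    """Serialize the event to UTF-8 JSON bytes (producer side)."""
    return json.dumps(asdict(event)).encode("utf-8")


def decode_event(raw: bytes) -> DocumentUploaded:
    """Rebuild the event from message bytes (consumer side)."""
    return DocumentUploaded(**json.loads(raw.decode("utf-8")))
```

With a Kafka client, functions like these would usually be plugged in as the value serializer and deserializer, so the consumer in step 5 receives a typed object rather than raw bytes.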
1. User searches from WebApp
↓
2. Gateway routes to Document Service
↓
3. Document Service queries Elasticsearch
↓
4. Results returned with metadata
↓
5. WebApp displays search results
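For step 3, a full-text query against Elasticsearch is commonly expressed as a `multi_match` request body. The following sketch builds such a body as a plain dict. The field names `title` and `content` are assumptions, since the index mapping is not shown in this README, and Document Service itself may construct its queries differently through Spring Data Elasticsearch.

```python
def build_search_body(keyword: str, page: int = 0, size: int = 10) -> dict:
    """Build an Elasticsearch request body for a paginated full-text search."""
    return {
        "from": page * size,  # offset of the first hit on this page
        "size": size,
        "query": {
            "multi_match": {
                "query": keyword,
                "fields": ["title", "content"],  # hypothetical field names
            }
        },
        # Ask Elasticsearch to return matching snippets from the content field.
        "highlight": {"fields": {"content": {}}},
    }
```

The resulting dict can be sent as JSON to the `_search` endpoint of whichever index holds the documents.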
# Test all Java services
mvn test
# Test specific service
cd user-service
mvn test
# Test Python service
cd processing-service
pytest
# Check all registered services
curl http://localhost:8761/eureka/apps
# Individual service health
curl http://localhost:8081/actuator/health # User Service
curl http://localhost:8082/actuator/health # Document Service
curl http://localhost:8084/actuator/health # Storage Service
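The same health checks can be scripted. This Python sketch polls the three Actuator endpoints above and treats a service as healthy only when the JSON body reports `"status": "UP"`. The names `SERVICES`, `parse_health`, and `is_healthy` are illustrative.

```python
import json
import urllib.error
import urllib.request

# Endpoints copied from the curl commands above.
SERVICES = {
    "User Service": "http://localhost:8081/actuator/health",
    "Document Service": "http://localhost:8082/actuator/health",
    "Storage Service": "http://localhost:8084/actuator/health",
}


def parse_health(body: str) -> bool:
    """Return True if an Actuator health payload reports status UP."""
    try:
        return json.loads(body).get("status") == "UP"
    except (ValueError, AttributeError):
        # Not JSON, or JSON that is not an object.
        return False


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Fetch a health endpoint; any network or HTTP error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return False
```

A loop such as `{name: is_healthy(url) for name, url in SERVICES.items()}` gives a one-line status summary for all three services.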
Access Kafka UI: http://localhost:8386
Access Kibana: http://localhost:5601
1. Service does not register with Eureka
# Check that the Eureka server is running
curl http://localhost:8761
# Check the config in application.yml
eureka:
  client:
    service-url:
      defaultZone: http://localhost:8761/eureka/
2. Kafka connection refused
# Check that Kafka is running
docker ps | grep kafka
# Restart Kafka
cd infrastructure
docker-compose restart kafka
3. Elasticsearch connection timeout
# Check Elasticsearch
curl http://localhost:9200
# Restart Elasticsearch
docker-compose restart elasticsearch
4. MongoDB connection error
# Check that MongoDB is running
mongosh --eval "db.adminCommand('ping')"
# Check the connection string in the config
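For the Eureka registration issue, it can help to list which applications the registry actually knows about. The sketch below summarizes a registry response that has already been parsed from JSON. It assumes the response shape commonly returned by `/eureka/apps` when requested with `Accept: application/json` (an `applications.application` list whose entries carry `name` and an `instance` list with `status`); verify this against your Eureka version before relying on it.

```python
def registered_services(apps_json: dict) -> dict:
    """Map each registered application name to its number of UP instances.

    Assumes the shape {"applications": {"application": [{"name": ..., "instance": [...]}]}}.
    """
    result = {}
    applications = apps_json.get("applications", {}).get("application", [])
    for app in applications:
        up_count = sum(1 for inst in app.get("instance", []) if inst.get("status") == "UP")
        result[app["name"]] = up_count
    return result
```

A service that is missing from the result, or present with zero UP instances, is the one whose `defaultZone` setting and startup logs are worth checking first.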
- Create new Maven module
- Add to parent pom.xml
- Configure bootstrap.yml with Config Server
- Register with Eureka
- Add routing in Gateway
- Update documentation
- Follow Google Java Style Guide
- Use Lombok for boilerplate code
- Write meaningful commit messages
- Add Javadoc for public APIs
# Build all services
./build-all.sh
# Deploy with Docker Compose
docker-compose up -d
kubectl apply -f k8s/
- Fork the repository
- Create your feature branch (git checkout -b feature/AmazingFeature)
- Commit your changes (git commit -m 'Add some AmazingFeature')
- Push to the branch (git push origin feature/AmazingFeature)
- Open a Pull Request
This project is licensed under the MIT License.
- Bui Do Khoi Nguyen - @BuiDoKhoiNguyen
For questions or support, please open an issue on GitHub.
Made with ❤️ by PDFMiner Team
# User Management
POST /api/users/register
POST /api/users/login
GET /api/users/profile
# Document Management
POST /api/documents/upload
GET /api/documents/{id}
POST /api/documents/search
# Vector Search
POST /api/embeddings/search
POST /api/embeddings/similar
# File Storage
POST /api/storage/upload
GET /api/storage/download/{id}
- Swagger UI: http://localhost:{port}/swagger-ui.html
- OpenAPI: http://localhost:{port}/v3/api-docs
# Overall system health
curl http://localhost:8080/actuator/health
# Individual services
curl http://localhost:8081/actuator/health # User Service
curl http://localhost:8082/actuator/health # Metadata Service
curl http://localhost:8083/health # Embedding Service
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000
- Centralized Logging: ELK Stack
- Distributed Tracing: Sleuth + Zipkin
- Create new Maven module:
mkdir new-service
cd new-service
# Copy from template service
- Update root pom.xml:
<modules>
  <!-- existing modules -->
  <module>new-service</module>
</modules>
- Configure service discovery and config
# Unit tests
mvn test
# Integration tests
mvn integration-test
# End-to-end tests
cd tests && python -m pytest
- JWT Authentication with Spring Security
- Rate Limiting in the Gateway
- Input Validation and sanitization
- HTTPS/TLS for production
- API Key management for external services
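Rate limiting is often implemented with a token bucket: each client has a bucket that refills at a fixed rate, and a request is allowed only if a token is available. The README does not say how the Gateway enforces its limits, so the Python class below is a generic illustration of the technique, not the Gateway's implementation. The `clock` parameter exists only to make the behavior easy to test.

```python
import time


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity  # start with a full bucket
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Consume `cost` tokens if available; return whether the request may proceed."""
        now = self._clock()
        # Refill in proportion to elapsed time, never exceeding capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

With `rate=1` and `capacity=2`, a client can send two requests back to back, is rejected on the third, and regains one request per second after that.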
- Gateway Throughput: 10,000 RPS
- Vector Search: <100ms response time
- Document Processing: 50 documents/minute
- OCR Processing: 2-5 pages/minute
- Horizontal Scaling: Multiple instances with load balancing
- Database Sharding: Partitioned by tenant/user
- Caching Strategy: Redis for frequently accessed data
- CDN: Static files and images
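The OCR figure quoted above (2-5 pages/minute) gives a rough way to size the Processing Service. The sketch below does the arithmetic in Python, assuming throughput scales linearly with the number of workers, which real deployments only approximate. The function names are illustrative.

```python
import math


def ocr_minutes(pages: int, pages_per_minute: float) -> float:
    """Estimated minutes for one worker to OCR a document of `pages` pages."""
    return pages / pages_per_minute


def workers_needed(target_pages_per_minute: float, per_worker_pages_per_minute: float) -> int:
    """Workers required to sustain a target rate, assuming linear scaling."""
    return math.ceil(target_pages_per_minute / per_worker_pages_per_minute)
```

At the quoted rates, a 100-page document would take roughly 20 to 50 minutes on a single worker, and sustaining 20 pages/minute would need between 4 and 10 workers.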
# Build all services
docker-compose build
# Deploy to staging
docker-compose -f docker-compose.staging.yml up -d
# Deploy to production
docker-compose -f docker-compose.prod.yml up -d
# Apply configurations
kubectl apply -f k8s/
# Check deployment status
kubectl get pods
kubectl get services
- AWS: EKS + RDS + ElastiCache + S3
- GCP: GKE + Cloud SQL + Cloud Storage
- Azure: AKS + Azure Database + Blob Storage
- Fork repository
- Create feature branch: git checkout -b feature/new-feature
- Commit changes: git commit -m 'Add new feature'
- Push branch: git push origin feature/new-feature
- Create Pull Request
This project is licensed under the MIT License - see LICENSE file.
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Wiki: Project Wiki
- Email: [email protected]
Built with ❤️ using Spring Boot, FastAPI, and modern microservice patterns