A full-stack application designed to fetch, verify, and manage company information efficiently using AI-powered data extraction and web scraping technologies.
- Overview
- Features
- Tech Stack
- Prerequisites
- Installation
- Configuration
- Usage
- API Documentation
- Contributing
- Security
- Future Enhancements
- License
- Support
The Company Data Extractor is a comprehensive solution for businesses and researchers who need to collect and verify company information at scale. The application combines the power of OpenAI's API with intelligent web scraping to provide high-quality, verified company data including operational status, location details, and official websites.
    ┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
    │                 │     │                 │     │                 │
    │  React Frontend │◄───►│ Spring Boot API │◄───►│ MySQL Database  │
    │                 │     │                 │     │                 │
    └─────────────────┘     └─────────────────┘     └─────────────────┘
                                     │
                                     ▼
                            ┌─────────────────┐
                            │   OpenAI API    │
                            │   Integration   │
                            └─────────────────┘
- Intelligent Data Extraction: Leverage OpenAI API for accurate company information retrieval
- Location-Based Search: Filter companies by city and industry
- Verification System: Verify company operational status (Active, Acquired, Closed)
- Duplicate Prevention: Built-in exclusion lists to avoid duplicate entries
- Excel Export: Export company data to Excel files for further analysis
- File Upload: Drag-and-drop interface for batch processing
- Real-time Processing: Interactive table display with real-time updates
- Secure: Environment-based configuration for sensitive data
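To make the duplicate-prevention feature concrete, here is a minimal sketch of how an exclusion list could filter incoming company names. The README does not show the project's actual matching logic, so the `ExclusionFilter` class, its `normalize` rules, and the sample names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: the project's real duplicate check is not shown in this README.
class ExclusionFilter {

    // Normalize a name so that "Example Corp." and "example corp" compare equal.
    // Assumption: only ASCII letters, digits, and spaces are significant.
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replaceAll("[^a-z0-9 ]", "")  // drop punctuation and other symbols
                .replaceAll("\\s+", " ")       // collapse runs of whitespace
                .trim();
    }

    // Keep candidates whose normalized name is neither excluded nor already seen.
    static List<String> filter(List<String> candidates, Set<String> excluded) {
        Set<String> seen = new HashSet<>();
        for (String name : excluded) {
            seen.add(normalize(name));
        }
        List<String> kept = new ArrayList<>();
        for (String name : candidates) {
            if (seen.add(normalize(name))) {   // add() returns false for a duplicate
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> result = filter(
                List.of("Example Corp.", "Acme Inc", "example corp", "Globex"),
                Set.of("ACME Inc."));
        System.out.println(result);
    }
}
```

Normalizing before comparing is a design choice: it catches case and punctuation variants, but it would not catch different spellings of the same company.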
| Component | Technology |
|---|---|
| Backend | Java 17, Spring Boot 3.5.3, Maven |
| Database | MySQL 8.0 |
| Frontend | React 18.x, JavaScript ES6+, CSS3 |
| API Integration | OpenAI API (GPT Models) |
| File Processing | Apache POI (Excel export/import) |
| Build Tools | Maven (Backend), npm (Frontend) |
| Security | Environment Variables, Spring Security |
Before you begin, ensure you have the following installed:
- Java Development Kit (JDK) 17+
- Node.js 16+ and npm
- MySQL 8.0+
- Maven 3.8+
- OpenAI API Key (obtain from OpenAI Platform)
- Clone the repository

      git clone https://github.com/yourusername/company-data-extractor-backend.git
      cd company-data-extractor-backend

- Set up the database

      CREATE DATABASE company_extractor;
      CREATE USER 'your_username'@'localhost' IDENTIFIED BY 'your_password';
      GRANT ALL PRIVILEGES ON company_extractor.* TO 'your_username'@'localhost';
      FLUSH PRIVILEGES;

- Configure environment variables

      export DB_USERNAME=your_username
      export DB_PASSWORD=your_password
      export OPENAI_API_KEY=your_openai_api_key

- Build and run the application

      mvn clean install
      mvn spring-boot:run

  The backend will be available at http://localhost:8081
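The backend depends on the three environment variables exported during setup, and a missing one typically only surfaces as a startup error. As a rough sketch, a check like the following could report missing variables up front. The `EnvCheck` class is hypothetical and not part of the project.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical pre-flight check; the variable names come from the setup steps.
class EnvCheck {

    static final List<String> REQUIRED =
            List.of("DB_USERNAME", "DB_PASSWORD", "OPENAI_API_KEY");

    // Return the required variables that are absent or blank in the given environment.
    static List<String> findMissing(Map<String, String> env) {
        List<String> missing = new ArrayList<>();
        for (String name : REQUIRED) {
            String value = env.get(name);
            if (value == null || value.isBlank()) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = findMissing(System.getenv());
        if (missing.isEmpty()) {
            System.out.println("All required variables are set.");
        } else {
            System.out.println("Missing: " + String.join(", ", missing));
        }
    }
}
```

Passing the environment in as a `Map` keeps the check testable without touching the real process environment.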
- Clone the frontend repository

      git clone https://github.com/yourusername/company-data-extractor-frontend.git
      cd company-data-extractor-frontend

- Install dependencies

      npm install

- Start the development server

      npm start

  The frontend will be available at http://localhost:3000
Backend settings (Spring Boot application properties):

    # Application
    spring.application.name=company-data-extractor
    server.port=8081

    # Database Configuration
    spring.datasource.url=jdbc:mysql://localhost:3306/company_extractor?useSSL=false&allowPublicKeyRetrieval=true&serverTimezone=UTC
    spring.datasource.username=${DB_USERNAME}
    spring.datasource.password=${DB_PASSWORD}
    spring.datasource.driver-class-name=com.mysql.cj.jdbc.Driver

    # JPA Configuration
    spring.jpa.database-platform=org.hibernate.dialect.MySQLDialect
    spring.jpa.hibernate.ddl-auto=update
    spring.jpa.show-sql=false
    spring.jpa.open-in-view=false

    # OpenAI API Configuration
    openai.api.key=${OPENAI_API_KEY}
    openai.api.timeout=30000

    # File Upload Configuration
    spring.servlet.multipart.max-file-size=10MB
    spring.servlet.multipart.max-request-size=10MB

Create a .env file in your project root (for development):
    DB_USERNAME=your_db_username
    DB_PASSWORD=your_db_password
    OPENAI_API_KEY=sk-your-openai-api-key
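Spring Boot does not read a `.env` file by default, and this README does not say how the project loads it. One possible approach, sketched here under that assumption, is to parse the `KEY=value` lines yourself. The `DotEnv` class is hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical parser for simple KEY=value lines; quoting and escapes are not handled.
class DotEnv {

    // Parse lines into an ordered map, skipping blanks, comments, and lines without a key.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> values = new LinkedHashMap<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            int eq = line.indexOf('=');
            if (eq <= 0) {
                continue; // no '=' at all, or nothing before it
            }
            String key = line.substring(0, eq).trim();
            String value = line.substring(eq + 1).trim();
            values.put(key, value);
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> env = parse(List.of(
                "# development settings",
                "DB_USERNAME=your_db_username",
                "DB_PASSWORD=your_db_password",
                "OPENAI_API_KEY=sk-your-openai-api-key"));
        // Print only the keys so secrets never end up in logs.
        System.out.println(env.keySet());
    }
}
```

To read a real file, pass `Files.readAllLines(Path.of(".env"))` to `parse`. Keep the `.env` file out of version control.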
- Start the Application
  - Ensure both backend and frontend are running
  - Navigate to http://localhost:3000
- Search for Companies
  - Enter the target city (e.g., "San Francisco")
  - Specify the industry (e.g., "Software")
  - Set the number of companies to fetch
  - Click "Fetch Companies"
- View Results
  - Browse companies in the interactive table
  - View operational status (Active, Acquired, Closed)
  - Check company websites and locations
- Export Data
  - Click "Export to Excel" to download results
  - Use the exported file for further analysis
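The search step can also be driven without the UI by calling the `POST /api/companies/fetch` endpoint listed in the API documentation. The sketch below builds that request with the JDK's `java.net.http` client. It assumes the backend runs at `http://localhost:8081` as in the setup steps; the `FetchCompaniesClient` class and its hand-rolled JSON escaping are illustrative only.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client; endpoint path and JSON fields are taken from the API documentation.
class FetchCompaniesClient {

    static final String BASE_URL = "http://localhost:8081"; // assumes the default backend port

    // Minimal escaping for backslashes and quotes; control characters are not handled.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Build the JSON body by hand to avoid a dependency on a JSON library.
    static String requestBody(String city, String industry, int count) {
        return "{\"city\":\"" + escape(city)
                + "\",\"industry\":\"" + escape(industry)
                + "\",\"count\":" + count + "}";
    }

    static HttpRequest buildRequest(String city, String industry, int count) {
        return HttpRequest.newBuilder()
                .uri(URI.create(BASE_URL + "/api/companies/fetch"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(requestBody(city, industry, count)))
                .build();
    }

    // Send the request and return the raw response body; requires a running backend.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) {
        // Only builds and prints the request, so it runs without a backend.
        HttpRequest request = buildRequest("San Francisco", "Software", 50);
        System.out.println(request.method() + " " + request.uri());
        System.out.println(requestBody("San Francisco", "Software", 50));
    }
}
```

Call `send(buildRequest(...))` once the backend is running to get the JSON response as a string.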
- Drag and Drop: Simply drag Excel files onto the upload area
- Batch Processing: Upload multiple company lists for processing
- Format Support: Supports .xlsx and .xls formats
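The upload constraints above, together with the 10MB multipart limit in the configuration, can be checked before a file is processed. This is a minimal sketch of such a check; the `UploadCheck` class is hypothetical, and checking only the extension and size (not the file contents) is a simplifying assumption.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical pre-check for uploaded files; it does not inspect file contents.
class UploadCheck {

    static final Set<String> ALLOWED_EXTENSIONS = Set.of("xlsx", "xls");

    // Mirrors spring.servlet.multipart.max-file-size=10MB (binary megabytes).
    static final long MAX_BYTES = 10L * 1024 * 1024;

    // Accept a non-empty file within the size limit whose extension is .xlsx or .xls.
    static boolean isAccepted(String filename, long sizeBytes) {
        if (filename == null || sizeBytes <= 0 || sizeBytes > MAX_BYTES) {
            return false;
        }
        int dot = filename.lastIndexOf('.');
        if (dot <= 0 || dot == filename.length() - 1) {
            return false; // no base name or no extension
        }
        String extension = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ALLOWED_EXTENSIONS.contains(extension);
    }

    public static void main(String[] args) {
        System.out.println(isAccepted("companies.xlsx", 2048));
        System.out.println(isAccepted("notes.csv", 2048));
    }
}
```

An extension check is easy to bypass by renaming a file, so a production check would also need to confirm that the content parses as a workbook.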
Fetch companies:

    POST /api/companies/fetch
    Content-Type: application/json

    {
      "city": "San Francisco",
      "industry": "Software",
      "count": 50
    }

List companies:

    GET /api/companies

Export to Excel (query parameters must be URL-encoded):

    GET /api/companies/export?city=San%20Francisco&industry=Software

Upload an Excel file:

    POST /api/companies/upload
    Content-Type: multipart/form-data

    Form field: file = [Excel file]

Example response:

    {
      "success": true,
      "data": [
        {
          "id": 1,
          "name": "Example Corp",
          "website": "https://example.com",
          "status": "Active",
          "city": "San Francisco",
          "industry": "Software",
          "createdAt": "2025-09-24T19:30:00Z"
        }
      ],
      "total": 1,
      "message": "Companies fetched successfully"
    }

We welcome contributions! Please follow these steps:
- Fork the repository
- Create a feature branch

      git checkout -b feature/amazing-feature

- Commit your changes

      git commit -m 'Add some amazing feature'

- Push to the branch

      git push origin feature/amazing-feature

- Open a Pull Request
- Backend: Follow Java naming conventions and Spring Boot best practices
- Frontend: Use ESLint and Prettier for consistent formatting
- Database: Use meaningful table and column names
- Comments: Write clear, concise comments for complex logic
- Environment Variables: All sensitive data is stored in environment variables
- API Key Protection: OpenAI API keys are never exposed in client-side code
- Database Security: Use strong passwords and limit database access
- Input Validation: All user inputs are validated and sanitized
- CORS Configuration: Properly configured for production deployments
- API keys stored in environment variables
- Database credentials secured
- Input validation implemented
- SQL injection prevention
- XSS protection enabled
- HTTPS in production
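As an illustration of the input-validation item, the sketch below validates the three fields of a fetch request before they reach the OpenAI call or the database. The README does not state the project's actual rules, so the limits (`MAX_COUNT`, `MAX_TEXT_LENGTH`), the allowed character set, and the `FetchRequestValidator` class are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical validator; every limit here is an assumed value, not a project setting.
class FetchRequestValidator {

    static final int MAX_COUNT = 100;        // assumed upper bound per request
    static final int MAX_TEXT_LENGTH = 100;  // assumed maximum length for city and industry

    // Letters and digits in any script, spaces, and a few common punctuation marks.
    static final Pattern SAFE_TEXT = Pattern.compile("[\\p{L}\\p{N} .,'&-]+");

    // Return a list of human-readable errors; an empty list means the input is valid.
    static List<String> validate(String city, String industry, int count) {
        List<String> errors = new ArrayList<>();
        checkText("city", city, errors);
        checkText("industry", industry, errors);
        if (count < 1 || count > MAX_COUNT) {
            errors.add("count must be between 1 and " + MAX_COUNT);
        }
        return errors;
    }

    private static void checkText(String field, String value, List<String> errors) {
        if (value == null || value.isBlank()) {
            errors.add(field + " is required");
            return;
        }
        String trimmed = value.trim();
        if (trimmed.length() > MAX_TEXT_LENGTH) {
            errors.add(field + " is too long");
        } else if (!SAFE_TEXT.matcher(trimmed).matches()) {
            errors.add(field + " contains unsupported characters");
        }
    }

    public static void main(String[] args) {
        System.out.println(validate("San Francisco", "Software", 50));
        System.out.println(validate("<script>", "", 0));
    }
}
```

An allow-list pattern rejects markup characters such as `<` and `>` outright. It complements, rather than replaces, parameterized queries and output encoding.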
- User Authentication: JWT-based authentication system
- Role-Based Access Control: Different permission levels
- Advanced Analytics: Company data insights and trends
- Bulk Operations: Process thousands of companies simultaneously
- API Rate Limiting: Prevent abuse and ensure fair usage
- Caching System: Redis integration for better performance
- Monitoring: Application performance monitoring
- Docker Support: Containerization for easy deployment
- Backend: AWS EC2, Heroku, Google Cloud Platform
- Database: AWS RDS, Google Cloud SQL
- Frontend: Vercel, Netlify, AWS S3 + CloudFront
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Company Data Extractor
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- Documentation: Check our Wiki for detailed guides
- Issues: Report bugs and request features via GitHub Issues
- Discussions: Join the community in GitHub Discussions
- Email: [email protected]
- Twitter: @CompanyDataExt
- LinkedIn: Company Data Extractor