# DuxSoup ETL

A canonical LinkedIn extraction and profile-intelligence pipeline.
DuxSoup ETL is a production-grade LinkedIn extraction, ingestion, and identity-resolution pipeline, designed to safely process real-time extraction webhooks while maintaining a canonical profile model. It ingests scan and visit events from the DuxSoup API as immutable observations, resolves identities deterministically, and maintains a continuously updated People Snapshot optimized for analytics, CRM enrichment, and graph-based intelligence workflows. The system is built for background processing and extensibility.

Safety controls include idempotent ingestion, automatic deduplication, and failure recovery.

Normalized profile data is stored as documents in MongoDB Atlas, which makes the pipeline well suited to lead enrichment, CRM modeling, and broader people-intelligence work.
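As a concrete illustration of the idempotent-ingestion and deduplication controls, here is a minimal sketch in which each webhook payload is stored as an immutable observation keyed by a content hash, so a replayed webhook becomes a harmless no-op. The schema and field names (`dedupeKey`, `payload`, `receivedAt`) are assumptions for illustration, not the project's actual schema.

```javascript
// Minimal sketch, not the project's actual schema.
// Assumes mongoose.connect(process.env.MONGODB_URI) has been called elsewhere.
const mongoose = require('mongoose');
const crypto = require('crypto');

const observationSchema = new mongoose.Schema({
  dedupeKey: { type: String, unique: true }, // content hash; duplicate payloads collide here
  type: String,                              // 'scan' or 'visit'
  payload: Object,                           // raw webhook body, kept immutable
  receivedAt: { type: Date, default: Date.now },
});
const Observation = mongoose.model('Observation', observationSchema);

async function ingest(payload) {
  const dedupeKey = crypto
    .createHash('sha256')
    .update(JSON.stringify(payload))
    .digest('hex');
  try {
    await Observation.create({ dedupeKey, type: payload.type, payload });
    return { stored: true, duplicate: false };
  } catch (err) {
    if (err.code === 11000) return { stored: false, duplicate: true }; // unique-index hit
    throw err;
  }
}
```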
## Table of Contents
- Features
- Benefits
- Tech Stack
- API Reference
- Data Model
- Architecture & Data Flow
- Read Modes & Cutover Strategy
- Project Structure
- Operations & Debugging
- License
## Features

- Webhook Processing: Handles DuxSoup LinkedIn data via `POST /api/webhook`.
- Type-Based Routing: Automatically routes scan vs. visit data to the appropriate handler based on the `type` field in the payload (see the routing sketch after this list).
- Data Validation: Comprehensive validation of required fields, including a custom validator that ensures the `id` field is a non-empty string.
- Error Handling: Robust error handling with detailed logging via Winston.
- Production Ready: Designed for deployment on platforms like Render with health monitoring.
- Extensible: Designed so new storage targets and normalization steps are easy to add.
- MongoDB Storage: Integrates with MongoDB using Mongoose to persist processed data.
- Real-time Processing: Processes LinkedIn data in real time.
- Background Jobs: Built for background processing.
- Health Monitoring: Built-in health checks and monitoring.
- Custom Routing: Differentiates between scans and visits.
- Data Normalization: Normalizes profile data into structured records.
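The routing and validation features above can be pictured with a minimal Express 4 sketch; `handleScan` and `handleVisit` are hypothetical stand-ins, not the project's real handlers.

```javascript
// Minimal sketch, assuming Express 4; handler names are hypothetical.
const express = require('express');
const app = express();
app.use(express.json());

// Stand-in handlers for illustration only.
const handleScan = async (body) => ({ stored: true, kind: 'scan' });
const handleVisit = async (body) => ({ stored: true, kind: 'visit' });

app.post('/api/webhook', async (req, res) => {
  const { type, id } = req.body || {};
  // Custom validator: `id` must be a non-empty string.
  if (typeof id !== 'string' || id.trim() === '') {
    return res.status(400).json({ error: 'id must be a non-empty string' });
  }
  // Type-based routing on the `type` field.
  if (type === 'scan') return res.json(await handleScan(req.body));
  if (type === 'visit') return res.json(await handleVisit(req.body));
  return res.status(400).json({ error: `unknown type: ${type}` });
});

app.listen(process.env.PORT || 3000);
```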
## Benefits

- Lead Enrichment: Automatically enriches CRM data with LinkedIn insights for sales and marketing teams
- Graph-based CRM: Builds comprehensive relationship graphs for advanced customer relationship management
- Time Efficiency: Eliminates manual LinkedIn data collection, saving hours of manual work
- Data Consistency: Ensures clean, normalized data across all records with standardized formatting
- Scalability: Horizontal scaling ready with stateless architecture
- Reliability: 99.9% uptime target with comprehensive error handling
- Low Maintenance: Minimal operational overhead with automated health checks
- Easy Integration: Simple REST API for third-party integration
- Comprehensive Monitoring: Built-in health checks and logging for production environments
## Tech Stack

- Node.js 18+
- Express.js 4.x
- MongoDB Atlas
- Mongoose 7.x
- Winston 3.x
- Dotenv 16.x
- cors 2.x
- Jest (unit + integration)
- Nodemon 3.x
- Render (primary)
- AWS / DigitalOcean / Heroku / Fly.io
- PM2 recommended
- Docker-compatible
## API Reference

### `POST /api/webhook`

Primary ingestion endpoint.
Example Response:

```json
{
  "stored": true,
  "people_upsert": true,
  "duplicate": false
}
```

Additional read endpoints:

- Fetch canonical person by ID.
- Resolve person by any alias.
- Read-path metrics.
- Fetch canonical company by ID.
- Resolve company by alias (CompanyID, profile URL, or name).
- Fetch canonical location by ID.
- Resolve location by alias (raw or normalized).
Health and metrics endpoints:

- `/health`
- `/api/health/ingestion`
- `/api/health/parity`
- `/api/health/coverage-breakdown`
- `/api/health/canonical-coverage`
- `/api/health/company-coverage`
- `/api/health/location-coverage`
- `/api/health/metrics`
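A quick way to exercise the ingestion endpoint from Node 18+, which ships a global `fetch`; the URL is a placeholder for your deployment.

```javascript
// Hypothetical smoke test; replace the URL with your deployment.
(async () => {
  const res = await fetch('http://localhost:3000/api/webhook', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ type: 'visit', id: 'profile-123' }),
  });
  // Expected success shape: { stored: true, people_upsert: true, duplicate: false }
  console.log(res.status, await res.json());
})();
```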
## Data Model

- Observations: Visit and scan events; immutable and idempotent.
- People: One document per person; alias-based identity; role timeline; provenance + metrics (illustrated below).
- Companies: Canonical LinkedIn company registry.
- Dead Letters: Failed upserts; replayable.
- Merges: Identity merge audit trail.
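To make the People snapshot concrete, a document might look roughly like the sketch below; every field name here is an illustrative assumption rather than the actual schema.

```javascript
// Illustrative shape only; the real schema lives in the codebase.
const examplePerson = {
  canonicalId: 'c2c1f3e8-...',                 // placeholder; stable ID (see Operations below)
  aliases: ['linkedin:janedoe', 'dux:12345'],  // any alias resolves back to this document
  name: 'Jane Doe',
  roles: [
    { company: 'Acme Corp', title: 'CTO', from: '2021-01', to: null }, // role timeline
  ],
  provenance: { firstSeenAt: '2024-01-02T00:00:00Z', sourceObservations: 17 },
};
```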
## Architecture & Data Flow

DuxSoup Webhook ➡️ Validation ➡️ Observations ➡️ Identity Resolution ➡️ People Snapshot
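Read as code, the flow above amounts to four sequential stages; the sketch below uses stub implementations purely to show the order and data hand-offs.

```javascript
// Stage order only; every body here is a stub for illustration.
const validate = (p) => {
  if (!p || typeof p.id !== 'string' || p.id.trim() === '') throw new Error('invalid payload');
};
const storeObservation = async (p) => ({ id: 'obs-1', payload: p });  // immutable, idempotent write
const resolveIdentity = async (obs) => 'person-1';                    // deterministic alias matching
const upsertPeopleSnapshot = async (personId, obs) => ({ personId }); // refresh the canonical view

async function processWebhook(payload) {
  validate(payload);
  const obs = await storeObservation(payload);
  const personId = await resolveIdentity(obs);
  return upsertPeopleSnapshot(personId, obs);
}
```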
## Read Modes & Cutover Strategy

`READ_SOURCE` selects the read path (sketched below):

- `legacy`
- `hybrid`
- `people`

Instant rollback via an environment change.
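One plausible wiring for this switch, as a sketch; the three reader functions and the hybrid fallback semantics are assumptions, not documented behavior.

```javascript
// Sketch only; reader implementations and hybrid semantics are assumed.
const readFromLegacy = async (id) => null; // stub: query legacy collections
const readFromPeople = async (id) => null; // stub: query the People snapshot
const readWithFallback = async (id) =>
  (await readFromPeople(id)) ?? readFromLegacy(id); // assumed: people first, legacy fallback

function getReader() {
  switch (process.env.READ_SOURCE) {
    case 'legacy': return readFromLegacy;
    case 'hybrid': return readWithFallback;
    case 'people': return readFromPeople;
    default: throw new Error(`unknown READ_SOURCE: ${process.env.READ_SOURCE}`);
  }
}
```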
## Project Structure

`src/`, `scripts/`, `tests/`, `render.yaml`, `.env.example`
## Operations & Debugging

```bash
node scripts/replayDeadLetters.js
node scripts/rebuildPeople.js
node scripts/rebuildCompanies.js
node scripts/rebuildLocations.js
node scripts/linkIdentities.js
node scripts/backfillCanonicalId.js --dry-run
node scripts/backfillCompanyCanonicalId.js --dry-run
node scripts/backfillLocationCanonicalId.js --dry-run
node scripts/dedupeAliases.js --dry-run
```

Set `CANONICAL_ID_NAMESPACE` in your environment before running the backfills so that UUIDs stay stable across deployments. Use the same value in every environment to keep canonical IDs consistent.
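Why the namespace matters: assuming canonical IDs are derived as name-based (v5) UUIDs (an assumption, not something this README confirms), the same alias and namespace always produce the same ID.

```javascript
// Assumes name-based (v5) UUIDs via the `uuid` npm package.
const { v5: uuidv5 } = require('uuid');

// CANONICAL_ID_NAMESPACE must itself be a valid UUID string.
const NAMESPACE = process.env.CANONICAL_ID_NAMESPACE;

function canonicalIdFor(alias) {
  // Deterministic: the same alias + namespace always yields the same ID,
  // which is why the namespace must match across environments.
  return uuidv5(alias, NAMESPACE);
}
```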
## License

MIT License © 2024 Mike Hare