harehimself/duxsoup-etl

ETL system utilizing the DuxSoup API for programmatic LinkedIn extraction. The project is a data extraction pipeline that automatically retrieves extensive LinkedIn profile data from first-degree connections for network analysis and relationship intelligence applications.

A canonical LinkedIn extraction and profile-intelligence pipeline.

DuxSoup ETL is a production-grade LinkedIn extraction, ingestion, and identity-resolution pipeline designed to safely process real-time extraction webhooks while maintaining a canonical profile model. The system ingests scan and visit events (via the DuxSoup API) as immutable observations, resolves identities deterministically, and maintains a continuously updated People Snapshot optimized for analytics, CRM enrichment, and graph-based intelligence workflows. It is built for background processing and extensibility.

Safety controls include idempotent ingestion, automatic deduplication, and failure recovery.

Finally, the system normalizes profile data and stores it as MongoDB Atlas documents, making it useful for lead enrichment, CRM modeling, and broader people-intelligence workflows.








Features

  • Webhook Processing: Handles DuxSoup LinkedIn data via POST /api/webhook.
  • Type-Based Routing: Automatically routes scan vs. visit data to the appropriate handler based on the type field in the payload.
  • Data Validation: Comprehensive validation of required fields, including a custom validator that ensures the id field is a non-empty string.
  • Error Handling: Robust error handling with detailed logging via Winston.
  • MongoDB Storage: Persists processed data to MongoDB via Mongoose.
  • Data Normalization: Normalizes profile data into structured records.
  • Real-time Processing: Processes LinkedIn data as webhooks arrive.
  • Background Jobs: Built for background processing.
  • Health Monitoring: Built-in health checks and monitoring.
  • Production Ready: Designed for deployment on platforms like Render.
  • Extensible: Modular handlers make it straightforward to add new storage targets and normalization steps.
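
The type-based routing and id validation above can be sketched as a small dispatch function. This is an illustrative sketch only; the handler names and error messages are assumptions, not the project's actual internals:

```javascript
// Sketch of type-based webhook routing with id validation.
// Handler names are illustrative; the real pipeline's internals may differ.
function routeWebhookEvent(payload, handlers) {
  // Custom validator: id must be a non-empty string.
  if (typeof payload.id !== 'string' || payload.id.trim() === '') {
    throw new Error('Invalid payload: "id" must be a non-empty string');
  }
  switch (payload.type) {
    case 'scan':
      return handlers.onScan(payload);
    case 'visit':
      return handlers.onVisit(payload);
    default:
      throw new Error(`Unknown event type: ${payload.type}`);
  }
}

// Example dispatch of a visit event.
const handlers = {
  onScan: (p) => ({ kind: 'scan', id: p.id }),
  onVisit: (p) => ({ kind: 'visit', id: p.id }),
};
const result = routeWebhookEvent({ id: 'abc123', type: 'visit' }, handlers);
```

Keeping validation ahead of routing means malformed events are rejected before any handler runs.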



Benefits

Business Value

  • Lead Enrichment: Automatically enriches CRM data with LinkedIn insights for sales and marketing teams
  • Graph-based CRM: Builds comprehensive relationship graphs for advanced customer relationship management
  • Time Efficiency: Eliminates manual LinkedIn data collection, saving hours of repetitive work
  • Data Consistency: Ensures clean, normalized data across all records with standardized formatting

Technical Advantages

  • Scalability: Horizontal scaling ready with stateless architecture
  • Reliability: 99.9% uptime target with comprehensive error handling
  • Low Maintenance: Minimal operational overhead with automated health checks
  • Easy Integration: Simple REST API for third-party integration
  • Comprehensive Monitoring: Built-in health checks and logging for production environments



Tech Stack

Core Technologies

  • Node.js 18+
  • Express.js 4.x
  • MongoDB Atlas
  • Mongoose 7.x
  • Winston 3.x
  • Dotenv 16.x
  • cors 2.x
  • Jest (unit + integration)
  • Nodemon 3.x

Deployment Platforms

  • Render (primary)
  • AWS / DigitalOcean / Heroku / Fly.io
  • PM2 recommended
  • Docker-compatible
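
Since PM2 is the recommended process manager, a minimal ecosystem file might look like the following. The entry-point path and instance count are illustrative assumptions; adjust them to the actual server entry point:

```javascript
// ecosystem.config.js — minimal PM2 process file (a sketch; values are
// illustrative, and 'src/server.js' is an assumed entry point).
module.exports = {
  apps: [
    {
      name: 'duxsoup-etl',
      script: 'src/server.js',
      instances: 2,           // stateless design allows multiple workers
      exec_mode: 'cluster',
      env: { NODE_ENV: 'production' },
    },
  ],
};
```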



API Reference

POST /api/webhook

Primary ingestion endpoint.

Example Response:

```json
{
  "stored": true,
  "people_upsert": true,
  "duplicate": false
}
```
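
The `duplicate` flag reflects idempotent ingestion: re-posting an event with an id that has already been stored is a no-op. A simplified in-memory sketch of that behavior (the real system presumably enforces this with a MongoDB unique index rather than a Set):

```javascript
// Simplified idempotent ingestion: the first write stores the event;
// replays of the same id are flagged as duplicates and not re-stored.
const seenEventIds = new Set(); // stand-in for a unique index in MongoDB

function ingestOnce(event) {
  if (seenEventIds.has(event.id)) {
    return { stored: false, duplicate: true };
  }
  seenEventIds.add(event.id);
  return { stored: true, duplicate: false };
}

const first = ingestOnce({ id: 'evt-1', type: 'visit' });
const replay = ingestOnce({ id: 'evt-1', type: 'visit' });
```

Idempotency is what makes webhook retries safe: the sender can re-deliver without creating duplicate observations.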

GET /api/people/:id

Fetch canonical person by ID.

GET /api/people/by-alias/:value

Resolve person by any alias.

GET /api/people/metrics

Read-path metrics.

GET /api/companies/:id

Fetch canonical company by ID.

GET /api/companies/by-alias/:value

Resolve company by alias (CompanyID, profile URL, or name).

GET /api/locations/:id

Fetch canonical location by ID.

GET /api/locations/by-alias/:value

Resolve location by alias (raw or normalized).
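
The by-alias endpoints above imply that aliases are normalized before lookup, so differently formatted inputs resolve to the same record. A hypothetical normalizer for profile-URL aliases (the rules here are assumptions, not the project's actual normalization logic):

```javascript
// Hypothetical alias normalization: lowercase, strip protocol, "www.",
// query string, and trailing slashes, so equivalent URLs share one key.
function normalizeProfileAlias(raw) {
  return raw
    .trim()
    .toLowerCase()
    .replace(/^https?:\/\//, '') // drop protocol
    .replace(/^www\./, '')       // drop www prefix
    .split(/[?#]/)[0]            // drop query string / fragment
    .replace(/\/+$/, '');        // drop trailing slashes
}

const a = normalizeProfileAlias('https://www.linkedin.com/in/Example/?utm=x');
const b = normalizeProfileAlias('linkedin.com/in/example');
// a and b normalize to the same alias key
```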

Health

  • /health
  • /api/health/ingestion
  • /api/health/parity
  • /api/health/coverage-breakdown
  • /api/health/canonical-coverage
  • /api/health/company-coverage
  • /api/health/location-coverage
  • /api/health/metrics



Data Model

  • Observations: visit and scan events; immutable and idempotent.
  • People: One document per person; alias-based identity; role timeline; provenance + metrics.
  • Companies: Canonical LinkedIn company registry.
  • Dead Letters: Failed upserts; replayable.
  • Merges: Identity merge audit trail.
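
An illustrative (not actual) People snapshot document, showing the alias-based identity, role timeline, and provenance described above; all field names and values here are hypothetical:

```json
{
  "_id": "canonical-uuid",
  "aliases": ["linkedin.com/in/example", "dux-profile-id"],
  "name": "Jane Example",
  "roles": [
    { "title": "Engineer", "company": "Example Co", "from": "2021-03", "to": null }
  ],
  "provenance": { "lastVisit": "2024-05-01T12:00:00Z", "observationCount": 7 },
  "metrics": { "degree": 1 }
}
```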



Architecture & Data Flow

DuxSoup Webhook ➡️ Validation ➡️ Observations ➡️ Identity Resolution ➡️ People Snapshot
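
The stages above can be expressed as a simple async pipeline. All stage implementations below are stubs to make the sketch self-contained; the real stages are more involved:

```javascript
// The flow above as composed async stages (stubbed for illustration).
async function handleWebhook(payload) {
  const valid = validate(payload);                      // reject malformed events
  const observation = await storeObservation(valid);    // immutable, idempotent write
  const personId = await resolveIdentity(observation);  // deterministic alias match
  return updatePeopleSnapshot(personId, observation);   // canonical upsert
}

// Minimal stubs so the sketch runs end-to-end.
function validate(p) {
  if (typeof p.id !== 'string' || p.id === '') throw new Error('invalid id');
  return p;
}
async function storeObservation(p) { return { ...p, observedAt: Date.now() }; }
async function resolveIdentity(obs) { return `person:${obs.id}`; }
async function updatePeopleSnapshot(personId, obs) {
  return { personId, lastEventType: obs.type };
}
```

The key property is that each stage consumes the previous stage's output, so a failure at any point can be dead-lettered and replayed without re-running earlier stages.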



Read Modes & Cutover Strategy

READ_SOURCE:

  • legacy
  • hybrid
  • people

Instant rollback via env change.
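
A sketch of how READ_SOURCE dispatch could be wired; the fallback-to-legacy behavior is an assumption, not a documented guarantee:

```javascript
// Reads are dispatched by READ_SOURCE, so a cutover (or rollback) is just
// an environment change — no code deploy required.
const READ_MODES = ['legacy', 'hybrid', 'people'];

function selectReadMode(env) {
  const mode = (env.READ_SOURCE || 'legacy').toLowerCase();
  return READ_MODES.includes(mode) ? mode : 'legacy'; // fall back safely
}

const mode = selectReadMode({ READ_SOURCE: 'hybrid' });
```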



Project Structure

src/, scripts/, tests/, render.yaml, .env.example



Operations & Debugging

```shell
node scripts/replayDeadLetters.js
node scripts/rebuildPeople.js
node scripts/rebuildCompanies.js
node scripts/rebuildLocations.js
node scripts/linkIdentities.js
node scripts/backfillCanonicalId.js --dry-run
node scripts/backfillCompanyCanonicalId.js --dry-run
node scripts/backfillLocationCanonicalId.js --dry-run
node scripts/dedupeAliases.js --dry-run
```

Set CANONICAL_ID_NAMESPACE in your environment before running the backfill to keep UUIDs stable across deployments. The same value should be used across all environments to keep canonical IDs consistent.



License

MIT License © 2024 Mike Hare
