Author: Jeremiah Kroesche | Halfservers LLC
Version: 1.0.0
License: MIT
Last Updated: December 2025
- Introduction
- The Problem This Solves
- Architecture Overview
- untrunc CLI Reference
- Prerequisites
- Building the untrunc Binary
- AWS Batch Pipeline Deployment
- Edge Service Deployment
- API Reference
- Webhook Notifications
- Dynamic Resource Scaling
- Configuration Reference
- Security Features
- Monitoring and Observability
- Cost Estimation
- Troubleshooting Guide
- FAQ
- Development and Testing
- Changelog
This is a production-ready, automated pipeline for repairing corrupted MP4, MOV, MKV, AVI, and M4V video files using the open-source untrunc tool.
The pipeline provides two deployment options:
- Edge Service - A Docker-based service that runs on a local Mac (Apple Silicon), watching an SMB share for new files and repairing them automatically
- AWS Batch Pipeline - A serverless, cloud-based solution using AWS Batch with Fargate Spot for cost-effective batch processing
Both components can work independently or together, with the edge service falling back to AWS when local repairs fail.
When video recording is interrupted unexpectedly (power loss, battery depletion, SD card removal, application crash, etc.), the resulting file is often unplayable. This happens because:
- MP4/MOV containers store their index structure (the "moov atom") at the end of the file
- If recording stops abruptly, this index is never written
- Without the index, video players cannot decode the file, even though the actual video/audio data is intact
untrunc analyzes a working reference video from the same camera/software to understand:
- Codec parameters (resolution, frame rate, encoding settings)
- Atom structure and ordering
- Sample-to-chunk mapping patterns
It then reconstructs the missing index for the corrupted file, making it playable again.
Before this pipeline, repairing videos required:
- Manually running untrunc on each file
- Running a Windows VM 24/7 (~$150-300/month on AWS)
- No automation, no notifications, no batch processing
This pipeline provides:
- Fully automated batch processing of multiple files
- Auto-selection of reference files (smallest or newest)
- Cost-effective processing with Fargate Spot (~$20-50/month)
- Webhook notifications for integration with other systems
- Local + cloud hybrid deployment options
┌─────────────────────────────────────────────────────────────────────────────────┐
│ EDGE DEPLOYMENT │
│ (MacBook Air M-series / Apple Silicon) │
│ │
│ ┌──────────────────┐ │
│ │ SMB Share │ The SMB share is mounted from your NAS or file server │
│ │ /Volumes/... │ and contains subdirectories for the workflow: │
│ │ │ │
│ │ ├── ready/ │◄── Drop corrupted video files here │
│ │ ├── export/ │◄── Repaired files are moved here │
│ │ └── quarantine/│◄── Failed files go here for manual review │
│ └────────┬─────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────────────┐ │
│ │ Edge Untrunc Service (Docker) │ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ ┌─────────────────┐ │ │
│ │ │ Scanner │ │ Reference │ │ Untrunc │ │ AWS Fallback │ │ │
│ │ │ (watches │─▶│ Selector │─▶│ Runner │─▶│ (on failure) │ │ │
│ │ │ ready/) │ │ (smallest) │ │ (-n -s) │ │ (API call) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ └────────┬────────┘ │ │
│ │ │ │ │
│ │ FastAPI Server: http://localhost:8080 │ │ │
│ │ GET /health - Health check │ │ │
│ │ POST /scan-now - Trigger immediate scan │ │ │
│ │ POST /repair - Manual single-file repair │ │ │
│ │ GET /stats - Current scanner status │ │ │
│ └───────────────────────────────────────────────────────────────┼───────────┘ │
│ │ │
└───────────────────────────────────────────────────────────────────┼───────────────┘
│
│ HTTPS API Call
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AWS CLOUD │
│ │
│ ┌─────────────────┐ ┌──────────────────┐ ┌────────────────────────┐ │
│ │ API Gateway │ │ Lambda │ │ AWS Batch │ │
│ │ (HTTP API) │─────▶│ (Job Submitter) │─────▶│ (Fargate Spot) │ │
│ │ │ │ │ │ │ │
│ │ • API Key Auth │ │ • Lists S3 files│ │ • Downloads from S3 │ │
│ │ • Rate limiting│ │ • Selects ref │ │ • Runs untrunc │ │
│ │ • HTTPS only │ │ • Scales resources│ │ • Uploads repaired │ │
│ │ │ │ • Submits job │ │ • Sends notification │ │
│ └─────────────────┘ └──────────────────┘ └───────────┬────────────┘ │
│ │ │
│ ┌────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ S3 Bucket │ │ Batch Job │ │ S3 Bucket │ │
│ │ (Raw) │────────▶│ Container │────────▶│ (Processed) │ │
│ │ │ │ │ │ │ │
│ │ Encrypted │ │ untrunc │ │ Encrypted │ │
│ │ Versioned │ │ -n -s -dst │ │ Versioned │ │
│ │ Lifecycle │ │ │ │ Lifecycle │ │
│ └─────────────┘ └──────┬──────┘ └─────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ ┌─────────────────────────┐ │
│ │ SNS Topic │────────▶│ Notifications │ │
│ │ │ │ • HTTPS Webhook │ │
│ │ │ │ • Email (optional) │ │
│ └─────────────┘ └─────────────────────────┘ │
│ │
│ ┌───────────────────────────────────────────────────────────────────────────┐ │
│ │ Monitoring & Alerting │ │
│ │ │ │
│ │ CloudWatch Logs ──── CloudWatch Alarms ──── SNS Alerts │ │
│ │ • Job output • Failed jobs • Email notifications │ │
│ │ • Lambda logs • Lambda errors • Webhook delivery │ │
│ │ • Structured JSON • Queue depth • PagerDuty integration │ │
│ └───────────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────────┘
- Upload: Corrupted video files are uploaded to S3 (or dropped in SMB share for edge)
- Trigger: API call (or automatic scan) initiates repair job
- Reference Selection: System automatically picks the smallest file as reference (assumption: smallest = complete/working)
- Repair: untrunc processes each corrupted file using the reference
- Output: Repaired files are uploaded to processed bucket (or export directory)
- Notification: Webhook/email sent with job results
This pipeline uses the anthwlock/untrunc fork, which is actively maintained and includes many bug fixes over the original.
untrunc [options] <reference.mp4> <corrupted.mp4>Important: The reference file comes FIRST, then the corrupted file.
| Option | Description |
|---|---|
-n |
Non-interactive mode. Don't prompt for confirmation. Required for automation. |
-s |
Step through unknown sequences. Improves recovery of partially corrupted files. |
-dst <path> |
Set output destination file or directory. Without this, output is <input>_fixed.<ext> |
# Basic repair (output: corrupted_fixed.mp4)
untrunc reference.mp4 corrupted.mp4
# Non-interactive with stepping (recommended)
untrunc -n -s reference.mp4 corrupted.mp4
# Specify output destination
untrunc -n -s -dst /output/repaired.mp4 reference.mp4 corrupted.mp4
# Quiet mode (errors only)
untrunc -n -s -q reference.mp4 corrupted.mp4Usage: untrunc [options] <ok.mp4> [corrupt.mp4]
general options:
-V - version
-n - no interactive
repair options:
-s - step through unknown sequences
-st <step_size> - used with '-s'
-sv - stretches video to match audio duration (beta)
-dw - don't write _fixed.mp4
-dr - dump repaired tracks, implies '-dw'
-k - keep unknown sequences
-sm - search mdat, even if no mp4-structure found
-dcc - dont check if chunks are inside mdat
-dyn - use dynamic stats
-range <A:B> - raw data range
-dst <dir|file> - set destination
-skip - skip existing
-noctts - dont restore ctts
-mp <bytes> - set max partsize
analyze options:
-a - analyze
-i[t|a|s] - info [tracks|atoms|stats]
-d - dump samples
-f - find all atoms and check their lengths
-lsm - find all mdat,moov atoms
-m <offset> - match/analyze file offset
other options:
-ms - make streamable
-sh - shorten
-u <mdat> <moov>- unite fragments
logging options:
-q - quiet, only errors
-w - show hidden
-v / -vv - verbose / very verbose
For best results, your reference file should:
- Be from the same camera/software as the corrupted files
- Use the same settings (resolution, frame rate, codec)
- Be complete and playable (verify it plays correctly first)
- Be at least a few seconds long (longer is sometimes better)
| Software | Minimum Version | Purpose |
|---|---|---|
| Docker | 20.10+ | Container runtime |
| Docker Compose | 2.0+ | Multi-container orchestration |
| AWS CLI | 2.0+ | AWS resource management |
| Terraform | 1.5.0+ | Infrastructure as code |
| jq | 1.6+ | JSON processing |
| Git | 2.0+ | Source code management |
The IAM user/role deploying this infrastructure needs:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:*",
"batch:*",
"ecs:*",
"ecr:*",
"iam:*",
"lambda:*",
"apigateway:*",
"logs:*",
"sns:*",
"cloudwatch:*",
"events:*"
],
"Resource": "*"
}
]
}For production, scope these permissions down to specific resources.
For Edge Service (Mac):
- Apple Silicon (M1/M2/M3) recommended
- 8GB RAM minimum
- Fast SSD storage
- Stable network connection to SMB share
For AWS Batch Jobs:
- Resources scale automatically based on file sizes
- Maximum: 8 vCPU, 30GB RAM, 200GB storage per job
The pipeline includes a helper script to build untrunc for both platforms:
# From the project root
./build-untrunc.shThis script:
- Clones the anthwlock/untrunc repository
- Builds for linux/amd64 (AWS Batch containers)
- Builds for linux/arm64 (Apple Silicon edge service)
- Copies binaries to the correct locations
# On a Linux machine or in Docker
git clone --depth 5 https://github.com/anthwlock/untrunc.git
cd untrunc
# Install dependencies (Debian/Ubuntu)
sudo apt-get install -y build-essential libavformat-dev libavcodec-dev libavutil-dev
# Build
make FF_VER=6.0
# Copy binary
cp untrunc /path/to/batch-pipeline/container/bin/untrunc-linux-amd64# Install Homebrew dependencies
brew install ffmpeg pkg-config
# Clone and build
git clone --depth 5 https://github.com/anthwlock/untrunc.git
cd untrunc
export PKG_CONFIG_PATH="/opt/homebrew/lib/pkgconfig"
export CPPFLAGS="-I/opt/homebrew/include"
export LDFLAGS="-L/opt/homebrew/lib"
make
# Copy binary
cp untrunc /path/to/edge-service/bin/untrunc-arm64# Build for AMD64 (AWS)
docker build --platform linux/amd64 -t untrunc-builder .
docker create --name temp-untrunc untrunc-builder
docker cp temp-untrunc:/untrunc ./untrunc-linux-amd64
docker rm temp-untrunc
# Build for ARM64 (Apple Silicon)
docker build --platform linux/arm64 -t untrunc-builder-arm .
docker create --name temp-untrunc-arm untrunc-builder-arm
docker cp temp-untrunc-arm:/untrunc ./untrunc-arm64
docker rm temp-untrunc-armcd batch-pipeline
cp terraform.tfvars.example terraform.tfvarsEdit terraform.tfvars:
# Required settings
aws_region = "us-east-1"
environment = "prod"
raw_bucket_name = "mycompany-untrunc-raw-prod"
processed_bucket_name = "mycompany-untrunc-processed-prod"
untrunc_image_uri = "123456789012.dkr.ecr.us-east-1.amazonaws.com/untrunc:latest"
# Generate API key hash: echo -n 'your-secret-key' | sha256sum | cut -d' ' -f1
api_key_hash = "5e884898da28047d9f4b6ab7b3b6e6..."
# Notifications (optional but recommended)
webhook_url = "https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXX"
alert_email = "[email protected]"
# Job resource defaults (auto-scaled per job, these are fallbacks)
job_vcpu = 4
job_memory_mb = 8192
job_ephemeral_storage_gb = 100
job_timeout_seconds = 7200 # 2 hours
job_retry_attempts = 2
# Log retention
log_retention_days = 30
raw_file_retention_days = 30terraform initExpected output:
Initializing provider plugins...
- Finding hashicorp/aws versions matching "~> 5.0"...
- Installing hashicorp/aws v5.x.x...
Terraform has been successfully initialized!
terraform plan -out=tfplanReview the resources that will be created:
- 2 S3 buckets (raw and processed)
- API Gateway HTTP API
- Lambda function
- AWS Batch compute environment, job queue, and job definition
- IAM roles and policies
- SNS topic for notifications
- CloudWatch log groups and alarms
terraform apply tfplanSave the outputs:
terraform output > outputs.txtKey outputs:
api_endpoint- The URL for submitting jobsraw_bucket_name- Where to upload corrupted filesprocessed_bucket_name- Where repaired files appearecr_repository_url- Where to push the container image
# Create ECR repository (if not using Terraform-managed one)
aws ecr create-repository --repository-name untrunc --region us-east-1
# Get login credentials
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin 123456789012.dkr.ecr.us-east-1.amazonaws.com
# Build the container
cd container
docker build --platform linux/amd64 -t untrunc:latest .
# Tag and push
docker tag untrunc:latest 123456789012.dkr.ecr.us-east-1.amazonaws.com/untrunc:latest
docker push 123456789012.dkr.ecr.us-east-1.amazonaws.com/untrunc:latest# Upload test files
aws s3 cp test-videos/ s3://mycompany-untrunc-raw-prod/test-batch/ --recursive
# Submit a job
curl -X POST "$(terraform output -raw api_endpoint)/submit-batch" \
-H "Content-Type: application/json" \
-H "X-Api-Key: your-secret-key" \
-d '{"input_prefix": "test-batch"}'
# Check job status in AWS Console or CLI
aws batch describe-jobs --jobs <job-id-from-response>cd edge-service
# Copy the untrunc binary for ARM64
mkdir -p bin
cp /path/to/untrunc-arm64 bin/untrunc-arm64
chmod +x bin/untrunc-arm64
# Create configuration
cp .env.example .envEdit .env:
# SMB share mount point on Mac
HOST_SMB_ROOT=/Volumes/mosaic_share
# Directory structure within the share
READY_DIR=ready
EXPORT_DIR=export
QUARANTINE_DIR=quarantine
# Scanner settings
SCAN_INTERVAL_SECONDS=30
MIN_FILE_AGE_SECONDS=60
MAX_CONCURRENT_JOBS=2
UNTRUNC_TIMEOUT_SECONDS=3600
# Reference file selection: "smallest" or "newest"
REFERENCE_STRATEGY=smallest
# AWS fallback (get these from Terraform outputs)
AWS_REPAIR_API_BASE_URL=https://abc123.execute-api.us-east-1.amazonaws.com/prod
AWS_API_KEY=your-secret-api-key
AWS_FALLBACK_RETRIES=3
# Logging: "json" for production, "text" for debugging
LOG_FORMAT=jsonUsing Finder:
- Press Cmd+K
- Enter:
smb://server/share - Authenticate and mount
Using Terminal:
mkdir -p /Volumes/mosaic_share
mount_smbfs //user:password@server/share /Volumes/mosaic_shareFor persistent mounts, add to /etc/auto_master or use a login script.
docker compose up -d --build# Check container is running
docker compose ps
# View logs
docker compose logs -f
# Check health endpoint
curl http://localhost:8080/healthExpected health response:
{
"status": "healthy",
"scanner_running": true,
"known_files": 0,
"current_reference": null
}- Drop a known-good video file in the
ready/directory (this will be the reference) - Drop a corrupted video file in the same directory
- Wait for the scan interval (30 seconds by default)
- Check the
export/directory for repaired files - Check the
quarantine/directory for any failures
https://<api-id>.execute-api.<region>.amazonaws.com/prod
Get this from Terraform output: terraform output api_endpoint
All endpoints except /health require an API key in the X-Api-Key header:
curl -H "X-Api-Key: your-secret-key" https://...Health check endpoint. No authentication required.
Response:
{
"status": "healthy",
"service": "untrunc-batch-api",
"default_input_bucket": "mycompany-untrunc-raw-prod",
"default_output_bucket": "mycompany-untrunc-processed-prod"
}Submit a batch repair job.
Request Body:
| Field | Type | Required | Description |
|---|---|---|---|
input_prefix |
string | Yes | S3 prefix containing video files to repair |
input_bucket |
string | No | Override default input bucket |
output_bucket |
string | No | Override default output bucket |
output_prefix |
string | No | Override output prefix (defaults to input_prefix) |
reference_strategy |
string | No | "smallest" (default) or "newest" |
reference_key |
string | No | Explicit S3 key for reference file |
vcpu |
integer | No | Override auto-scaled vCPU (1-16) |
memory_mb |
integer | No | Override auto-scaled memory (512-122880) |
storage_gb |
integer | No | Override auto-scaled storage (21-200) |
Example Request:
curl -X POST https://abc123.execute-api.us-east-1.amazonaws.com/prod/submit-batch \
-H "Content-Type: application/json" \
-H "X-Api-Key: your-secret-key" \
-d '{
"input_prefix": "session-2025-01-15/camera1",
"reference_strategy": "smallest"
}'Success Response (202 Accepted):
{
"message": "Batch repair job submitted",
"job_id": "untrunc-a1b2c3d4e5f6",
"batch_job_id": "12345678-1234-1234-1234-123456789012",
"input_bucket": "mycompany-untrunc-raw-prod",
"input_prefix": "session-2025-01-15/camera1",
"output_bucket": "mycompany-untrunc-processed-prod",
"output_prefix": "session-2025-01-15/camera1",
"reference_file": "session-2025-01-15/camera1/clip001.mp4",
"reference_size_mb": 45.67,
"reference_strategy": "smallest",
"files_to_repair": [
"session-2025-01-15/camera1/clip002.mp4",
"session-2025-01-15/camera1/clip003.mp4",
"session-2025-01-15/camera1/clip004.mp4"
],
"file_count": 3,
"total_size_mb": 1536.89,
"largest_file_mb": 678.45,
"resources": {
"vcpu": "4",
"memory_mb": 8192,
"storage_gb": 120,
"auto_scaled": true
}
}Error Responses:
| Status | Description |
|---|---|
| 400 | Invalid request (missing prefix, no files found, etc.) |
| 401 | Invalid or missing API key |
| 500 | Internal server error |
When a job completes (success, partial, or failure), a notification is sent to the configured webhook URL.
{
"job_id": "untrunc-a1b2c3d4e5f6",
"status": "COMPLETED",
"message": "All 3 files repaired successfully",
"input_bucket": "mycompany-untrunc-raw-prod",
"input_prefix": "session-2025-01-15/camera1",
"output_bucket": "mycompany-untrunc-processed-prod",
"output_prefix": "session-2025-01-15/camera1",
"reference_file": "session-2025-01-15/camera1/clip001.mp4",
"total_files": 3,
"success_count": 3,
"failure_count": 0,
"success_files": [
"session-2025-01-15/camera1/clip002.mp4",
"session-2025-01-15/camera1/clip003.mp4",
"session-2025-01-15/camera1/clip004.mp4"
],
"failed_files": [],
"timestamp": "2025-01-15T14:30:00Z"
}| Status | Description |
|---|---|
COMPLETED |
All files repaired successfully |
PARTIAL |
Some files repaired, some failed |
FAILED |
All files failed to repair |
Example Slack webhook format:
{
"text": "Untrunc Job Completed",
"blocks": [
{
"type": "section",
"text": {
"type": "mrkdwn",
"text": "*Job ID:* untrunc-a1b2c3d4e5f6\n*Status:* COMPLETED\n*Files:* 3 repaired, 0 failed"
}
}
]
}You may need a Lambda or middleware to transform the payload to Slack's format.
The Lambda function automatically calculates appropriate job resources based on the files being processed.
| Total Size | vCPU | Memory | Storage | Use Case |
|---|---|---|---|---|
| ≤ 5 GB | 1 | 2 GB | 30 GB | Small batches (a few short clips) |
| ≤ 20 GB | 2 | 4 GB | 60 GB | Medium batches |
| ≤ 50 GB | 4 | 8 GB | 120 GB | Large batches |
| ≤ 100 GB | 4 | 16 GB | 175 GB | XL batches |
| ≤ 200 GB | 8 | 30 GB | 200 GB | XXL batches |
- Largest file consideration: If the largest single file is very large, memory is bumped to 2x that file's size
- Storage buffer: Storage is calculated as 3x the largest file + 10GB buffer
- Fargate limits: Resources are capped at Fargate maximums (16 vCPU, 120GB RAM, 200GB storage)
For special cases, you can override the auto-scaling:
curl -X POST .../submit-batch \
-d '{
"input_prefix": "huge-files",
"vcpu": 8,
"memory_mb": 30720,
"storage_gb": 200
}'| Variable | Type | Default | Description |
|---|---|---|---|
aws_region |
string | us-east-1 |
AWS region for deployment |
environment |
string | prod |
Environment name (prod, staging, dev) |
raw_bucket_name |
string | required | S3 bucket for corrupted files |
processed_bucket_name |
string | required | S3 bucket for repaired files |
untrunc_image_uri |
string | required | ECR image URI for batch container |
api_key_hash |
string | required | SHA256 hash of API key |
webhook_url |
string | "" |
HTTPS webhook for notifications |
alert_email |
string | "" |
Email for CloudWatch alarms |
job_vcpu |
number | 4 |
Default vCPUs per job |
job_memory_mb |
number | 8192 |
Default memory per job (MB) |
job_ephemeral_storage_gb |
number | 100 |
Default ephemeral storage (GB) |
job_timeout_seconds |
number | 7200 |
Job timeout (2 hours) |
job_retry_attempts |
number | 2 |
Retry failed jobs |
batch_max_vcpus |
number | 16 |
Max vCPUs in compute environment |
log_retention_days |
number | 30 |
CloudWatch log retention |
raw_file_retention_days |
number | 30 |
S3 lifecycle for raw bucket |
| Variable | Default | Description |
|---|---|---|
HOST_SMB_ROOT |
/Volumes/mosaic_share |
SMB mount path on Mac |
READY_DIR |
ready |
Subdirectory for input files |
EXPORT_DIR |
export |
Subdirectory for output files |
QUARANTINE_DIR |
quarantine |
Subdirectory for failed files |
SCAN_INTERVAL_SECONDS |
30 |
Time between directory scans |
MIN_FILE_AGE_SECONDS |
60 |
File stability threshold |
MAX_CONCURRENT_JOBS |
2 |
Parallel repair limit |
UNTRUNC_TIMEOUT_SECONDS |
3600 |
Single file timeout |
REFERENCE_STRATEGY |
smallest |
Reference selection method |
AWS_REPAIR_API_BASE_URL |
"" |
AWS API endpoint for fallback |
AWS_API_KEY |
"" |
API key for AWS auth |
AWS_FALLBACK_RETRIES |
3 |
Retry count for AWS calls |
LOG_FORMAT |
json |
Log format (json or text) |
- API Gateway uses API key authentication
- Key is stored as SHA256 hash (never plaintext)
- Keys can be rotated by updating Terraform variable
- API Gateway: HTTPS only, TLS 1.2+
- Batch jobs run in VPC with no inbound access
- Outbound only for S3 and SNS
- S3 buckets: AES-256 server-side encryption
- S3 buckets: Public access blocked
- S3 buckets: Versioning enabled for recovery
- CloudWatch logs: Encrypted
- Edge service runs as non-root user
- No shell injection possible (subprocess exec, not shell)
- Minimal base image (Debian slim)
- No unnecessary packages installed
- Lambda: Only S3 list/read, Batch submit
- Batch jobs: Only S3 read/write to specific buckets, SNS publish
- No admin permissions
All components log to CloudWatch in structured JSON format:
{
"timestamp": "2025-01-15T14:30:00Z",
"level": "INFO",
"job_id": "untrunc-a1b2c3d4e5f6",
"message": "Repair successful",
"file": "clip002.mp4",
"output": "s3://bucket/path/clip002.mp4",
"size": 123456789
}Log groups:
/aws/lambda/untrunc-prod-submit- Lambda execution logs/aws/batch/untrunc-prod- Batch job logs
Pre-configured alarms:
| Alarm | Threshold | Action |
|---|---|---|
| Batch Job Failures | > 0 in 5 min | SNS notification |
| Lambda Errors | > 0 in 5 min | SNS notification |
AWS/Batch/JobsFailed- Failed batch jobsAWS/Batch/JobsSucceeded- Successful batch jobsAWS/Lambda/Errors- Lambda function errorsAWS/Lambda/Duration- Lambda execution timeAWS/S3/BucketSizeBytes- Storage usage
# Example: Create CloudWatch dashboard
aws cloudwatch put-dashboard \
--dashboard-name "UntruncPipeline" \
--dashboard-body file://dashboard.jsonFor a typical workload of 10-20 files (10-50GB each) processed weekly:
| Component | Unit Cost | Monthly Usage | Monthly Cost |
|---|---|---|---|
| Fargate Spot (4 vCPU, 8GB) | ~$0.05/hour | ~20 hours | $1-5 |
| S3 Storage (500GB) | $0.023/GB | 500GB | $11.50 |
| S3 Requests | $0.005/1000 | ~10,000 | $0.05 |
| API Gateway | $1.00/million | ~1,000 | <$0.01 |
| Lambda | $0.20/million + duration | ~1,000 | <$0.01 |
| CloudWatch Logs | $0.50/GB | ~1GB | $0.50 |
| SNS | $0.50/million | ~100 | <$0.01 |
| Total | ~$15-25/month |
| Component | Monthly Cost |
|---|---|
| EC2 t3.large (24/7) | $60-80 |
| EBS Storage (200GB) | $20 |
| Windows License | $15 |
| Total | $95-115/month |
Savings: 75-85% with this pipeline vs. always-on VM
The edge service runs on existing hardware (your Mac), so the only cost is electricity (~$5-10/month for a Mac mini running 24/7).
Cause: The auto-selected reference file might be corrupted.
Solutions:
- Specify a known-good reference file explicitly:
curl -X POST .../submit-batch \ -d '{"input_prefix": "...", "reference_key": "path/to/known-good.mp4"}' - Use "newest" strategy instead of "smallest":
-d '{"input_prefix": "...", "reference_strategy": "newest"}'
Cause: untrunc couldn't repair the file, possibly due to:
- Reference file from different camera/settings
- File is too corrupted to recover
- Wrong codec
Solutions:
- Try a different reference file
- Try the
-dccflag (add to container script) - Check if the file is actually recoverable with the GUI version
Cause: Large files taking too long to process.
Solutions:
- Increase timeout in Terraform:
job_timeout_seconds = 14400 # 4 hours
- Process fewer files per batch
- Use larger instance (more vCPU)
Cause: Ephemeral storage not large enough for files.
Solutions:
- Let auto-scaling handle it (it considers file sizes)
- Override storage manually:
-d '{"storage_gb": 200}' - Maximum is 200GB for Fargate
Causes:
- Share not mounted
- Docker doesn't have file sharing permissions
- Mac went to sleep and dropped mount
Solutions:
- Verify mount:
ls /Volumes/mosaic_share - Docker Desktop → Settings → Resources → File Sharing → Add
/Volumes - Use
caffeinateor disable sleep - Create a LaunchDaemon to remount on wake
Causes:
- Files in wrong directory
- Files still being written (not stable)
- Wrong file extensions
Solutions:
- Files must be in the
ready/subdirectory - Wait for
MIN_FILE_AGE_SECONDS(60s default) - Supported extensions: .mp4, .mov, .mkv, .avi, .m4v
Causes:
- API URL misconfigured
- API key incorrect
- Network issues
Solutions:
- Verify URL format:
https://xxx.execute-api.region.amazonaws.com/prod - Test API key with curl directly
- Check edge service logs:
docker compose logs -f
A: Generally no. untrunc needs a reference file from the same camera/software with the same settings. Different cameras use different codecs, frame rates, and atom structures.
A: MP4, MOV, MKV, AVI, and M4V containers. The actual codec support depends on FFmpeg (H.264, H.265/HEVC, ProRes, etc. are all supported).
A: Use any complete, playable video file from the same camera with the same settings. The pipeline defaults to using the smallest file, assuming it's most likely to be complete.
A: Not in a single batch job (Fargate ephemeral storage limit). For very large files:
- Process one file at a time
- Use EC2 instead of Fargate (requires code changes)
- Use the edge service on a machine with enough local storage
A: Yes. All data is encrypted at rest (S3) and in transit (HTTPS/TLS). Access is controlled via IAM and API keys. However, for highly sensitive content, review the security settings and consider using KMS encryption.
A: Yes. AWS Batch manages a queue and runs jobs based on compute capacity. The default batch_max_vcpus is 16, allowing multiple concurrent jobs.
A:
- Batch: Failed files are logged, webhook shows failures, original files remain in raw bucket
- Edge: Files are moved to
quarantine/directory, AWS fallback is attempted
A: Yes, but you'll need to modify the shell script (run_untrunc.sh) or Python runner. Current options: -n -s -dst
./validate.shThis checks:
- Required tools installed
- untrunc binaries present
- Terraform configuration valid
- Python syntax valid
- Shell script syntax valid
cd edge-service
# Build without Docker Compose
docker build -t untrunc-edge:test .
# Run with local directory instead of SMB
docker run -it --rm \
-v /tmp/test-videos:/data \
-e READY_DIR=ready \
-e EXPORT_DIR=export \
-e QUARANTINE_DIR=quarantine \
-p 8080:8080 \
untrunc-edge:testcd batch-pipeline/lambda
# Install dependencies
pip install boto3
# Test with mock event
python -c "
import lambda_function
event = {
'headers': {'x-api-key': 'test'},
'body': '{\"input_prefix\": \"test\"}'
}
print(lambda_function.lambda_handler(event, None))
"- Deploy to a staging environment
- Upload test files to S3
- Submit a job
- Verify output files
- Check webhook notification
- Initial production release
- AWS Batch pipeline with Fargate Spot
- Edge service for local Mac processing
- Auto-scaling based on file sizes
- Webhook notifications
- Security hardening (encryption, auth, least privilege)
- Verified untrunc CLI syntax
MIT License
Copyright (c) 2025 Jeremiah Kroesche | Halfservers LLC
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
For issues and feature requests, please create an issue in the repository.
For commercial support or custom development, contact: Halfservers LLC