# A Scalable Wireshark-to-SecOps Pipeline on Google Cloud Platform
Author: Filippo Lucchesi
Course: Scalable and Reliable Services, University of Bologna
Evolution of: Wireshark-to-Chronicle-Pipeline (Cybersecurity Projects 2024)
This project implements a robust, scalable, event-driven pipeline that captures network traffic with `tshark`, processes it, and transforms it into the Unified Data Model (UDM) for security analytics, all orchestrated on Google Cloud Platform (GCP) with Terraform. It evolves an initial local processing concept into a cloud-native solution designed for enhanced reliability and scalability.
## Key Features & Enhancements

- Hybrid Capture Model: A Dockerized `tshark` sniffer designed for on-premises or edge deployment handles initial packet capture, automatically rotating PCAP files (supporting `.pcap` and `.pcapng`), uploading them to Google Cloud Storage (GCS), and notifying a Pub/Sub topic.
- Serverless, Scalable Processing: A GCP Cloud Run service acts as a serverless processor, triggered by Pub/Sub messages, to manage the demanding PCAP-to-UDM transformation.
- Optimized Core Transformation (`json2udm_cloud.py`): The central Python script, originally designed for local batch processing, has been significantly re-engineered. It now employs streaming JSON parsing (`ijson`) to handle potentially massive `tshark` outputs efficiently within Cloud Run's memory constraints, mapping raw packet data to UDM. This is the analytical heart of the project.
- Resilient and Decoupled Architecture: Leverages a Pub/Sub-driven workflow for loose coupling between capture and processing. Includes dead-letter queue (DLQ) support for failed messages, Cloud Run health probes for service reliability, and robust error handling within the processing logic.
- Infrastructure as Code (IaC): The entire GCP infrastructure is managed by Terraform, promoting repeatability, version control, and automated provisioning.
- Minimal On-Premises Footprint: All heavy computation (JSON parsing, UDM mapping) is offloaded to the cloud, requiring minimal resources on the capture (sniffer) side.
- Secure by Design: Implements IAM least-privilege principles for service accounts, OIDC-authenticated Cloud Run invocations from Pub/Sub, and secure SA key management for the on-premises sniffer.
- Observable System: Integrates with Cloud Logging for structured, centralized application and service logs. Leverages Cloud Monitoring with a comprehensive custom operational dashboard defined as code via Terraform, providing deep insight into pipeline health, performance, and error rates. Key performance indicators (KPIs) are tracked through numerous Log-Based Metrics.
## Table of Contents

- Key Features & Enhancements
- Architecture Overview
- From Local Batch to Cloud-Native Streaming
- Repository Layout
- Implementation Details
- How to Use
- Educational Value & Cloud-Native Principles
- Security Considerations
- Maintenance & Troubleshooting
## Architecture Overview

The system employs a distributed, event-driven architecture:
- Capture & Notify (On-Premises/Edge - `sniffer` container):
  - The `sniffer` container runs `tshark` on a designated network interface.
  - PCAP files are rotated based on size or duration.
  - Upon rotation, the completed PCAP file is uploaded to a GCS "incoming-pcaps" bucket.
  - A notification containing the filename is published to a GCP Pub/Sub topic.
- Trigger & Process (GCP - Cloud Run `processor` service):
  - A Pub/Sub push subscription, secured with OIDC, invokes the `processor` Cloud Run service.
  - The Cloud Run service:
    - Downloads the specified PCAP file from the "incoming-pcaps" GCS bucket.
    - Converts the PCAP to a JSON representation using an embedded `tshark` instance (`tshark -T json`).
    - Executes the `json2udm_cloud.py` script, which streams the large JSON output from `tshark` and maps each packet to the UDM format.
    - Uploads the resulting UDM JSON file to a "processed-udm" GCS bucket.
- Error Handling & Observability (GCP):
  - The Pub/Sub push subscription is configured with a dead-letter topic to capture messages that fail processing after multiple retries.
  - All application logs (sniffer and processor) are sent to Cloud Logging.
  - Key service metrics (Cloud Run invocations, latency, errors; Pub/Sub message counts; GCS operations) and detailed application-level metrics (e.g., PCAP processing stages, UDM conversion details) are available in Cloud Monitoring, primarily through a dedicated operational dashboard.
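The contract between Pub/Sub push delivery and the service is worth making concrete. Below is a minimal, illustrative Flask handler for this kind of endpoint (a sketch only; `process_pcap` is a hypothetical placeholder, not the project's actual code):

```python
# Minimal sketch of a Pub/Sub push endpoint (illustrative only; the
# project's real handler lives in processor/processor_app.py).
import base64

from flask import Flask, request

app = Flask(__name__)

def process_pcap(object_name: str) -> None:
    """Placeholder for the real download -> tshark -> UDM -> upload flow."""
    print(f"would process {object_name}")

@app.route("/", methods=["POST"])
def handle_push():
    envelope = request.get_json(silent=True)
    if not envelope or "message" not in envelope:
        # Any non-2xx response is a nack: Pub/Sub retries the message
        # and eventually routes it to the dead-letter topic.
        return "Bad Request: invalid Pub/Sub envelope", 400

    # Pub/Sub base64-encodes the payload in the envelope's "data" field.
    object_name = base64.b64decode(
        envelope["message"].get("data", "")
    ).decode("utf-8").strip()

    try:
        process_pcap(object_name)
    except Exception as exc:
        # 5xx: nack, so Pub/Sub retries and later dead-letters the message.
        return f"Processing failed: {exc}", 500

    # 204 acknowledges the message so Pub/Sub stops redelivering it.
    return "", 204
```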
## From Local Batch to Cloud-Native Streaming

This project originated from a Cybersecurity course project focused on local PCAP processing. The initial `json2udm.py` script (included in the repository for reference) was designed to:
- Load an entire `tshark`-generated JSON file into memory.
- Iterate through the parsed packets.
- Handle local file system operations for input and output, including splitting large UDM outputs.
Key improvements in `json2udm_cloud.py` for this "Scalable and Reliable Services" project:
- Memory Efficiency: The most significant change is the adoption of `ijson` for streaming JSON. This lets the script process massive `tshark` JSON outputs packet by packet, drastically reducing the memory footprint and making it suitable for resource-constrained environments like Cloud Run; the original `json.loads()` on a multi-gigabyte JSON file would cause OOM errors. A sketch of this pattern follows the list below.
- Robustness: Enhanced error handling for individual packets. Instead of skipping packets or failing entirely on malformed data, the script now attempts to create a minimal UDM event even for problematic packets, often including error details. Timestamp conversion is also more robust, with fallbacks.
- Cloud Environment Focus: Removal of local file system concerns such as multi-file output splitting. The script now produces a single UDM JSON output stream, which `processor_app.py` then uploads to GCS.
- UDM Alignment: The UDM structure produced has been refined to align more closely with common UDM schemas (e.g., Chronicle UDM), featuring distinct `metadata`, `principal`, `target`, and `network` sections.
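The streaming pattern is simple to illustrate. A minimal sketch, assuming the input is the top-level JSON array produced by `tshark -T json` (this is not the project's actual script):

```python
# Minimal sketch of streaming a huge tshark JSON export with ijson,
# assuming the file is a top-level array of packet objects
# (the layout produced by `tshark -T json`).
import ijson

def stream_packets(json_path):
    """Yield one packet dict at a time without loading the whole file."""
    with open(json_path, "rb") as f:
        # "item" addresses each element of the top-level JSON array.
        for packet in ijson.items(f, "item"):
            yield packet

if __name__ == "__main__":
    for i, pkt in enumerate(stream_packets("capture.json")):
        layers = pkt.get("_source", {}).get("layers", {})
        print(i, list(layers.keys())[:3])
```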
These adaptations were crucial to transition the core logic from a local, batch-oriented tool to a scalable, cloud-native component.
## Repository Layout

```text
Chronicle-Sniffer/
├── terraform/                      # Terraform IaC modules and configurations
│   ├── modules/
│   │   ├── gcs_buckets/            # Manages GCS buckets
│   │   ├── pubsub_topic/           # Manages Pub/Sub topic and DLQ
│   │   ├── cloudrun_processor/     # Manages Cloud Run processor service
│   │   └── test_generator_vm/      # Optional VM for on-prem simulation
│   │       └── startup_script_vm.sh
│   ├── dashboards/
│   │   └── main_operational_dashboard.json  # Dashboard definition
│   ├── provider.tf
│   ├── variables.tf
│   ├── main.tf                     # Main Terraform configuration
│   ├── outputs.tf
│   └── terraform.tfvars.example    # Example variables for Terraform
├── sniffer/                        # On-Premises/Edge Sniffer component
│   ├── Dockerfile                  # Dockerfile for the sniffer
│   ├── sniffer_entrypoint.sh       # Entrypoint script for capture and upload
│   ├── compose.yml                 # Docker Compose for local sniffer testing
│   ├── .env.example                # Example environment variables for the sniffer
│   └── readme.md                   # Sniffer-specific README
├── processor/                      # Cloud Run Processor component
│   ├── Dockerfile                  # Dockerfile for the processor
│   ├── processor_app.py            # Flask app orchestrating the processing
│   ├── json2udm_cloud.py           # Core PCAP-JSON-to-UDM transformation (streaming)
│   └── requirements.txt            # Python dependencies for the processor
├── LICENSE                         # MIT License
└── readme.md                       # This file (main project README)
```
## Implementation Details

### Terraform Modules
- `gcs_buckets`: Provisions two GCS buckets: one for incoming raw PCAP files and another for processed UDM JSON files. Configured with uniform bucket-level access, optional versioning, CMEK, and lifecycle rules for object deletion.
- `pubsub_topic`: Creates the main Pub/Sub topic for PCAP file notifications and a corresponding dead-letter topic (DLQ). Configures a push subscription to the Cloud Run processor, using OIDC for authenticated invocations and a dead-letter policy.
- `cloudrun_processor`: Deploys the PCAP processor as a Cloud Run v2 service. Defines resource limits (CPU, memory), concurrency settings, startup and liveness probes, and injects the necessary environment variables (bucket names, project ID).
- `test_generator_vm` (optional): Creates a GCE instance to simulate an on-premises environment. Its startup script installs network tools and prepares the environment to run the sniffer Docker container.
### Sniffer Component
- `Dockerfile`: Based on `gcr.io/google.com/cloudsdktool/google-cloud-cli:alpine`, it installs `tshark`, `procps` (for `lsof`), and `iproute2`.
- `sniffer_entrypoint.sh`:
  - Validates required environment variables (GCP Project ID, GCS Bucket, Pub/Sub Topic ID, SA Key Path, Sniffer ID).
  - Activates the provided service account using `gcloud auth activate-service-account`.
  - Automatically detects the primary active network interface (excluding loopback, docker, etc.).
  - Starts `tshark` in the background, configured to rotate capture files based on size or duration (env vars `ROTATE`, `LIMITS`).
  - Includes a background heartbeat function that logs `TSHARK_STATUS` (running/stopped) for monitoring.
  - Continuously monitors the capture directory for newly closed (rotated) PCAP files (matching `*.pcap*` to include the `.pcapng` format).
  - For each completed PCAP: logs its size, uploads it to the specified GCS `INCOMING_BUCKET`, publishes the filename as a message to `PUBSUB_TOPIC_ID`, and then removes the local PCAP file (see the sketch after this list).
  - Handles `SIGTERM` and `SIGINT` for graceful shutdown of `tshark` and the heartbeat process.
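For readers more comfortable with the client libraries than with shell, the following is a rough Python equivalent of the entrypoint's upload-and-notify step (illustrative only; the real sniffer is a shell script, and `ship_pcap` is a hypothetical name):

```python
# Illustrative Python equivalent of the entrypoint's upload-and-notify
# step (the actual sniffer uses a shell script; this is only a sketch).
import os
from pathlib import Path

from google.cloud import pubsub_v1, storage

INCOMING_BUCKET = os.environ["INCOMING_BUCKET"]    # GCS bucket name
PUBSUB_TOPIC_ID = os.environ["PUBSUB_TOPIC_ID"]    # projects/.../topics/...

storage_client = storage.Client()
publisher = pubsub_v1.PublisherClient()

def ship_pcap(pcap_path: Path) -> None:
    """Upload a rotated PCAP, notify the topic, then delete the local file."""
    size = pcap_path.stat().st_size
    print(f"uploading {pcap_path.name} ({size} bytes)")

    bucket = storage_client.bucket(INCOMING_BUCKET)
    bucket.blob(pcap_path.name).upload_from_filename(str(pcap_path))

    # The message body is just the object name; the processor downloads it.
    publisher.publish(PUBSUB_TOPIC_ID, pcap_path.name.encode("utf-8")).result()

    pcap_path.unlink()  # free local disk space once the file is safely in GCS
```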
### Processor Component
- `processor_app.py`:
  - A Flask web application serving as the endpoint for Pub/Sub push notifications.
  - Initializes the Google Cloud Storage client, with "lazy" verification of bucket accessibility to improve resilience against IAM propagation delays.
  - Upon receiving a Pub/Sub message (containing a PCAP filename):
    - Downloads the specified PCAP file from the `INCOMING_BUCKET` to a temporary local directory.
    - Executes `tshark -T json` as a subprocess to convert the PCAP to a raw JSON representation (see the sketch after this list).
    - Invokes the `json2udm_cloud.py` script (also as a subprocess) to transform the tshark JSON output into UDM JSON.
    - Uploads the resulting UDM JSON file to the `OUTPUT_BUCKET`.
  - Logs key events for metrics (download complete, tshark conversion successful, UDM conversion script output, upload complete, processing duration).
  - Returns HTTP `204 No Content` on successful processing, and appropriate HTTP `4xx` or `5xx` status codes for error conditions, facilitating Pub/Sub's retry and dead-lettering mechanisms.
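A minimal sketch of the download-and-convert step described above (function and variable names are hypothetical; error handling is reduced to `check=True`):

```python
# Illustrative sketch of the processor's download-and-convert step
# (hypothetical names; not the project's actual processor_app.py).
import subprocess
import tempfile
from pathlib import Path

from google.cloud import storage

def pcap_to_json(bucket_name: str, object_name: str) -> Path:
    """Download a PCAP from GCS and convert it to JSON with tshark."""
    workdir = Path(tempfile.mkdtemp())
    pcap_path = workdir / object_name
    json_path = pcap_path.with_suffix(".json")

    storage.Client().bucket(bucket_name).blob(object_name) \
        .download_to_filename(str(pcap_path))

    # Stream tshark's stdout straight to disk to avoid buffering the
    # (potentially huge) JSON in this process's memory.
    with open(json_path, "wb") as out:
        subprocess.run(
            ["tshark", "-r", str(pcap_path), "-T", "json"],
            stdout=out,
            check=True,  # raise CalledProcessError -> caller returns HTTP 5xx
        )
    return json_path
```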
- `json2udm_cloud.py`:
  - The core transformation logic, adapted for efficient cloud execution.
  - Streaming Processing: Uses the `ijson` library to parse the (potentially very large) JSON output from `tshark` incrementally, packet by packet. This avoids loading the entire JSON into memory, preventing OOM errors in Cloud Run.
  - Robust Conversion: For each packet, extracts data from the relevant layers and maps it to a standardized UDM structure. Performs robust timestamp conversion to ISO 8601 UTC, with a fallback to the current processing time if the original timestamp is missing or malformed (see the sketch after this list).
  - Error Handling per Packet: If an error occurs while processing an individual packet, it generates a minimal UDM event containing the error details.
  - Logs `UDM_PACKETS_PROCESSED` and `UDM_PACKET_ERRORS` counts per input file for metrics.
  - Outputs a list of UDM event dictionaries.
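A rough sketch of the timestamp fallback and the UDM event shape described above (the field choices are illustrative examples; the actual script's mapping is considerably more detailed):

```python
# Illustrative sketch of per-packet UDM mapping with a timestamp
# fallback (field choices are examples, not the script's exact output).
from datetime import datetime, timezone

def to_iso8601_utc(epoch_str):
    """Convert tshark's epoch timestamp; fall back to 'now' if malformed."""
    try:
        ts = datetime.fromtimestamp(float(epoch_str), tz=timezone.utc)
    except (TypeError, ValueError):
        ts = datetime.now(timezone.utc)  # fallback: processing time
    return ts.isoformat().replace("+00:00", "Z")

def packet_to_udm(packet):
    layers = packet.get("_source", {}).get("layers", {})
    ip = layers.get("ip", {})
    return {
        "metadata": {
            "event_timestamp": to_iso8601_utc(
                layers.get("frame", {}).get("frame.time_epoch")
            ),
            "event_type": "NETWORK_CONNECTION",
        },
        "principal": {"ip": ip.get("ip.src")},
        "target": {"ip": ip.get("ip.dst")},
        "network": {"ip_protocol": ip.get("ip.proto")},
    }
```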
- `requirements.txt`: Lists the Python dependencies: `Flask`, `gunicorn`, `google-cloud-storage`, and `ijson`.
### Observability & Monitoring
The pipeline is designed for comprehensive observability:
- Cloud Logging: Both the on-premises sniffer (`sniffer_entrypoint.sh`) and the Cloud Run processor (`processor_app.py`, `json2udm_cloud.py`) generate detailed logs, structured to include crucial information such as sniffer IDs, filenames, processing stages, and error messages, facilitating debugging and operational monitoring. All logs are centralized in Google Cloud Logging.
- Log-Based Metrics (LBMs): The Terraform configuration in `terraform/main.tf` defines a rich set of Log-Based Metrics, which convert specific log patterns into quantifiable time-series data in Cloud Monitoring. Examples include:
  - Sniffer metrics: heartbeat counts, PCAP files uploaded, PCAP file sizes (distribution), GCS upload errors, and Pub/Sub publish errors. (A `sniffer_tshark_status_running_count` metric was also defined for TShark status.)
  - Processor metrics: PCAP download successes/failures, TShark conversion successes/errors, UDM packets processed (distribution, per file), UDM packet processing errors (distribution, per file), UDM file upload successes, and end-to-end processing latency (distribution).
  - These LBMs form the backbone of the operational dashboard.
- Operational Dashboard (`terraform/dashboards/main_operational_dashboard.json`): A key deliverable of this project is a comprehensive operational dashboard, defined as Infrastructure as Code and deployed by Terraform. Configured with Monitoring Query Language (MQL), it provides a centralized view of the entire pipeline's health and performance.

Dashboard Structure and Key Sections: The dashboard is organized into logical sections with a four-column layout:
- 🛰️ Sniffer & Edge Overview: Focuses on the health and output of the on-premises/edge sniffer components. (This section uses time-series charts only, without scorecards.)
  - Time-series charts: detailed views of sniffer heartbeats (by ID and interface), PCAP file upload rates (by sniffer ID), average PCAP file sizes (by sniffer ID, computed in MQL from the distribution metric), and error counts for PCAP uploads.
- 📣 Cloud Pub/Sub: Monitors the health of the message queue.
  - Time-series charts: unacknowledged messages, DLQ messages, and Pub/Sub publish errors originating from the sniffers.
- ⚙️ Cloud Processor: Provides insight into the Cloud Run processing service.
  - Time-series charts: PCAP download success/not-found, TShark conversion success/errors, and UDM upload success rates, alongside standard Cloud Run metrics such as successful request rates.
- UDM Conversion & Latency (integrated with the Processor section):
  - Time-series charts: UDM packet processed rates and UDM packet-level error rates (grouped by filename, leveraging MQL on distribution metrics), plus average PCAP processing latency (computed in MQL from the distribution) and 95th-percentile latency.

Query Language: The dashboard uses MQL (Monitoring Query Language) exclusively, for both standard GCP metrics and the custom Log-Based Metrics. MQL was adopted for its direct, robust integration with Cloud Monitoring metric types, especially for LBMs, and because it supports complex aggregations and calculations directly in the query.

Customization and Iteration: The dashboard's JSON definition allows precise control over its appearance and a version-controlled approach to its evolution.
## How to Use

### Prerequisites
- A Google Cloud Platform (GCP) account with billing enabled.
- Required GCP APIs enabled in your project: Cloud Run, Pub/Sub, Cloud Storage, IAM, Artifact Registry, Compute Engine (if using the test VM), and Cloud Monitoring.
- `gcloud` CLI installed and authenticated.
- Terraform (>= 1.1.0) installed.
- Docker installed (for building images and optionally running the sniffer locally).
- An Artifact Registry Docker repository (e.g., `chronicle-sniffer`) in your GCP project and region, if you intend to host your custom-built images there.
### Initial Authentication
Before deploying, authenticate `gcloud` and configure Docker for Artifact Registry (if using private images from AR):

```bash
# Log in to your Google account (this will open a browser window)
gcloud auth login

# Set your default GCP project
gcloud config set project YOUR_PROJECT_ID

# Authenticate Application Default Credentials (used by Terraform and other tools)
gcloud auth application-default login

# Configure Docker to authenticate with Artifact Registry (if needed)
# Replace REGION with your Artifact Registry region (e.g., europe-west8)
gcloud auth configure-docker REGION-docker.pkg.dev
```
### Deployment Steps
1. Clone the repository:

   ```bash
   git clone https://github.com/fillol/Chronicle-Sniffer.git  # or your repo URL
   cd Chronicle-Sniffer
   ```
2. Build and push the processor Docker image (skip if using a pre-built public image for the processor). Navigate to the `processor` directory, build the image, then push it to your Artifact Registry:

   ```bash
   cd processor
   # Replace REGION, YOUR_PROJECT_ID, YOUR_REPO_NAME, and TAG accordingly
   docker build -t REGION-docker.pkg.dev/YOUR_PROJECT_ID/YOUR_REPO_NAME/pcap-processor:latest .
   docker push REGION-docker.pkg.dev/YOUR_PROJECT_ID/YOUR_REPO_NAME/pcap-processor:latest
   cd ..
   ```

   Example:

   ```bash
   docker build -t europe-west8-docker.pkg.dev/my-project/my-repo/pcap-processor:latest .
   ```
3. Deploy the infrastructure with Terraform. Navigate to the `terraform` directory:

   ```bash
   cd terraform
   cp terraform.tfvars.example terraform.tfvars
   ```

   Edit `terraform.tfvars` to set:
   - `gcp_project_id`
   - `gcp_region`
   - `incoming_pcap_bucket_name` and `processed_udm_bucket_name` (must be globally unique)
   - `processor_cloud_run_image` (the full URI of the processor image, e.g., the one you just pushed or a public one)
   - `sniffer_image_uri` (e.g., `fillol/chronicle-sniffer:latest`, or your own Artifact Registry sniffer image if you built one)
   - `ssh_source_ranges` for the test VM (e.g., `["YOUR_IP_ADDRESS/32"]`)

   Then initialize and apply Terraform:

   ```bash
   terraform init -reconfigure
   terraform validate
   terraform plan -out=tfplan.out
   terraform apply tfplan.out
   ```

   Note that applying a saved plan file runs without a confirmation prompt. This also deploys the operational dashboard.
4. (Optional) Test VM & on-premises sniffer setup: Terraform outputs `test_vm_sniffer_setup_instructions` describing how to set up and run the sniffer on the provisioned test GCE VM. This involves generating an SA key, copying it to the VM, and then running `docker-compose` on the VM.

   To run the sniffer locally with Docker Compose (e.g., on your development machine, not the test VM):

   a. Ensure you are in the project's root directory (`Chronicle-Sniffer/`).
   b. Generate the sniffer service account key if you haven't already (from the Terraform output `generate_sniffer_key_command`). This creates `./sniffer-key.json` in the root.
   c. Navigate to the sniffer directory: `cd sniffer`
   d. Create the key directory: `mkdir -p gcp-key`
   e. Copy the generated key: `cp ../sniffer-key.json ./gcp-key/key.json` (this places the key from the project root into `sniffer/gcp-key/`).
   f. Create and configure your `.env` file from `.env.example`: `cp .env.example .env`, then edit `sniffer/.env` with your `GCP_PROJECT_ID`, `INCOMING_BUCKET` (from the Terraform output), and `PUBSUB_TOPIC_ID` (from the Terraform output, e.g., `projects/YOUR_PROJECT_ID/topics/YOUR_TOPIC_NAME`).
   g. (Optional) Create a directory for local captures if you want them persisted on your host: `mkdir captures` (the `sniffer/compose.yml` maps this).
   h. Build (if needed) and run the sniffer: `docker-compose up --build -d` (run this from within the `sniffer/` directory).
   i. To see logs: `docker-compose logs -f` (from within the `sniffer/` directory, or specify the service name).
   j. To stop: `docker-compose down` (from within the `sniffer/` directory).
## Educational Value & Cloud-Native Principles

This project demonstrates several key concepts relevant to building scalable and reliable cloud services:
- Scalability & Decoupling: Offloading intensive UDM conversion to serverless Cloud Run, triggered by Pub/Sub, allows the on-premises sniffer to remain lightweight. This design supports horizontal scaling of the processing layer independently of the capture points.
- Infrastructure as Code (IaC): Using Terraform with modular design ensures consistent, repeatable, and version-controlled infrastructure deployments, including the monitoring dashboard.
- Managed Services: Leveraging GCP's managed services (GCS, Pub/Sub, Cloud Run, IAM, Cloud Monitoring) reduces operational overhead and enhances reliability.
- Event-Driven Architecture: The Pub/Sub message queue decouples the sniffer from the processor, improving resilience and allowing components to evolve independently.
- Security: OIDC for secure, token-based authentication between Pub/Sub and Cloud Run, and IAM least-privilege for service accounts.
- Observability: Deep integration with Cloud Logging and Cloud Monitoring, featuring custom metrics and a detailed operational dashboard for comprehensive system insight.
## Security Considerations

- Least-Privilege IAM: Service Accounts for the sniffer (on-prem/VM) and the Cloud Run processor are granted only the necessary permissions for their tasks.
- OIDC-Secured Cloud Run Invocation: The Pub/Sub push subscription uses OIDC tokens to securely invoke the Cloud Run processor, ensuring that only legitimate Pub/Sub messages from the configured topic can trigger the service.
- Service Account Key Management: For the on-premises sniffer, the SA key is intended to be mounted securely into the Docker container. Best practices for key rotation and restricted access should be followed.
- Firewall Rules: The Terraform configuration for the optional test VM includes firewall rules that restrict SSH access to specified source IP ranges.
- GCS Bucket Security: Buckets are configured with Uniform Bucket-Level Access (UBLA), and public access is prevented. Optional CMEK can be configured for an additional layer of encryption control.
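Cloud Run enforces the OIDC check at the platform layer when the service requires authentication; as an optional defense-in-depth measure, the push endpoint itself could also verify the token. A minimal sketch using the `google-auth` library (this is not code from the project):

```python
# Optional app-level verification of the OIDC token attached by the
# Pub/Sub push subscription (a sketch using google-auth; Cloud Run's
# built-in IAM check normally makes this redundant).
from google.auth.transport import requests as google_requests
from google.oauth2 import id_token

def verify_push_token(auth_header: str, expected_audience: str) -> str:
    """Validate 'Authorization: Bearer <token>' and return the caller email."""
    if not auth_header.startswith("Bearer "):
        raise PermissionError("missing bearer token")
    token = auth_header.split(" ", 1)[1]

    # Raises ValueError if the signature, expiry, or audience is invalid.
    claims = id_token.verify_oauth2_token(
        token, google_requests.Request(), audience=expected_audience
    )
    return claims.get("email", "")  # the push subscription's service account
```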
## Maintenance & Troubleshooting

- Updating the Processor:
  - Modify `processor_app.py` or `json2udm_cloud.py`.
  - Rebuild the Docker image and push it to Artifact Registry.
  - Update `processor_cloud_run_image` in `terraform.tfvars` if using a new tag.
  - Run `terraform apply`. Alternatively, manually deploy a new revision in the Cloud Run console pointing at the new image tag.
- Updating the Sniffer:
  - Modify `sniffer_entrypoint.sh` or the sniffer `Dockerfile`.
  - Rebuild and push the sniffer Docker image (e.g., to Docker Hub or your Artifact Registry).
  - Update the image reference (`var.sniffer_image_uri` in `terraform.tfvars` if the test VM pulls it, and on any actual on-prem hosts) and restart the sniffer containers.
- Scaling:
  - Cloud Run processor: Adjust `cloud_run_memory`, `cloud_run_cpu`, and `max_instance_count` (via Terraform or the Cloud Run console) for the desired throughput.
  - Pub/Sub: Modify subscription retry policies if needed.
- Common Issues & Debugging:
  - Sniffer not uploading/publishing: Check the sniffer container logs. Verify the SA key's validity and permissions (especially the Pub/Sub Publisher role for the sniffer's SA).
  - Pub/Sub messages in the DLQ or a high unacked count: Inspect the Cloud Run processor logs; this usually points to issues in the processing scripts or GCS permissions for the Cloud Run SA. (A small pull-and-inspect sketch appears after this list.)
  - UDM conversion errors: Examine `json2udm_cloud.py` stderr messages in the Cloud Run logs. Test locally with the problematic JSON if possible.
  - Terraform apply failures: Read the Terraform error messages, validate `terraform.tfvars`, and ensure the `gcloud` user has permission to create/modify all resources.
  - Dashboard widgets empty or erroring:
    - Verify the Log-Based Metrics are correctly defined in `terraform/main.tf` and are active in Cloud Monitoring (Metrics Management).
    - Check whether logs matching the LBM filters are being generated by the sniffer or processor.
    - Use Metrics Explorer in Cloud Monitoring to test the MQL queries or inspect the raw metric data for your custom LBMs.
    - Ensure the variable names in the dashboard JSON (`${cloud_run_processor_service_name}`, etc.) match those passed by the `templatefile` function in `terraform/main.tf`.
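For DLQ triage, messages can be pulled from the dead-letter subscription for inspection. A minimal sketch (the subscription name `pcap-dlq-sub` is hypothetical; substitute the one created by your Terraform configuration):

```python
# Pull a few messages from the dead-letter subscription for inspection
# without acknowledging them (so they stay available for replay).
# "pcap-dlq-sub" is a hypothetical name; use your actual DLQ subscription.
from google.cloud import pubsub_v1

PROJECT_ID = "YOUR_PROJECT_ID"
DLQ_SUBSCRIPTION = "pcap-dlq-sub"

subscriber = pubsub_v1.SubscriberClient()
sub_path = subscriber.subscription_path(PROJECT_ID, DLQ_SUBSCRIPTION)

response = subscriber.pull(
    request={"subscription": sub_path, "max_messages": 5},
)
for received in response.received_messages:
    msg = received.message
    # For this pipeline, the payload is the PCAP object name.
    print(msg.message_id, msg.publish_time, msg.data.decode("utf-8"))
# No acknowledge() call: the messages will be redelivered later.
```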