Network Attack Data Pipeline

Overview

This repository implements a pipeline for streaming network attack data to Google Cloud for threat detection using Vertex AI. The pipeline includes:

Data ingestion to Google Cloud Datastore
Real-time threat detection using Vertex AI
Comprehensive monitoring and alerting
Structured logging
Automated cleanup and lifecycle management

Requirements

Python 3.9+
Google Cloud SDK
Google Cloud project with the following APIs enabled:
- Datastore API
- Vertex AI API
- Cloud Logging API
- Pub/Sub API

Installation

Install Python dependencies:

pip install -r requirements.txt

Configure Google Cloud credentials:

export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account.json"
export PROJECT_ID="your-project-id"

Enable required APIs:

./deploy.sh enable-apis

Pipeline Components

1. Data Ingestion

data_pipeline.py: Core pipeline implementation
datastore_instance_checker.py: Verifies Datastore instance status
logging_utils.py: Provides structured logging
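
To illustrate what structured logging can look like, here is a minimal sketch using only the Python standard library. It is not the contents of logging_utils.py. The `JsonFormatter` class, the `get_logger` function, and the `fields` attribute are hypothetical names chosen for this example.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line (hypothetical helper)."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "severity": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Optional structured context passed via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def get_logger(name: str) -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `get_logger("pipeline").info("batch stored", extra={"fields": {"count": 100}})` would then emit a single JSON line that log-based filters can query by key.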

2. Threat Detection

vertex_ai_utils.py: Vertex AI integration
Handles real-time predictions
Supports batch processing
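
One way batch processing can be layered over a per-request prediction call is sketched below. This is an assumption about the general shape, not the code in vertex_ai_utils.py: `predict_fn` is a hypothetical stand-in for whatever function sends records to the model, and nothing here calls Vertex AI.

```python
from typing import Callable, Iterable, Iterator, List, Sequence


def chunked(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Yield successive lists of at most `size` records."""
    if size < 1:
        raise ValueError("size must be at least 1")
    batch: List[dict] = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter, batch
        yield batch


def predict_in_batches(
    records: Iterable[dict],
    predict_fn: Callable[[Sequence[dict]], Sequence[float]],
    batch_size: int = 32,
) -> List[float]:
    """Call `predict_fn` once per batch and concatenate the results in order."""
    scores: List[float] = []
    for batch in chunked(records, batch_size):
        scores.extend(predict_fn(batch))
    return scores
```

Passing the prediction function as a parameter keeps the batching logic testable without network access.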

3. Monitoring & Alerting

Cloud Monitoring dashboard
Cloud Alerting policies
Cloud Logging integration

Configuration

Vertex AI Configuration

{
    "project_id": "your-project-id",
    "region": "us-central1",
    "model_id": "your-model-id"
}
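
A configuration file like the one above can be checked before the pipeline starts. The sketch below assumes only the three keys shown. `load_vertex_config` is a hypothetical helper, not a function known to exist in this repository.

```python
import json
from pathlib import Path

# Keys taken from the example configuration above.
REQUIRED_KEYS = ("project_id", "region", "model_id")


def load_vertex_config(path: str) -> dict:
    """Read the JSON config and fail early if a required key is missing or empty."""
    config = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError(f"vertex config is missing: {', '.join(missing)}")
    return config
```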

Environment Variables

export GOOGLE_APPLICATION_CREDENTIALS="path/to/your/service-account.json"
export PROJECT_ID="your-project-id"

Usage

Verify Datastore Instance

python datastore_instance_checker.py \
    --project_id your-project-id \
    --instance_id your-instance-id

Process Data

python data_pipeline.py \
    --project_id your-project-id \
    --datastore_kind AttackData \
    --datastore_namespace attack_data \
    --vertex_config vertex_config.json \
    --input_file path/to/your/data.json
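
For readers who want to see how these flags could be parsed, here is a minimal argparse sketch. The flag names come from the command above. Marking every flag as required is an assumption, and the actual parser in data_pipeline.py may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags used in the command above."""
    parser = argparse.ArgumentParser(description="Ingest attack data and score it.")
    # Assumption: all five flags are mandatory.
    parser.add_argument("--project_id", required=True)
    parser.add_argument("--datastore_kind", required=True)
    parser.add_argument("--datastore_namespace", required=True)
    parser.add_argument("--vertex_config", required=True)
    parser.add_argument("--input_file", required=True)
    return parser
```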

Run Tests

python test_pipeline.py

Monitoring

Cloud Dashboard

Data ingestion rate
Vertex AI prediction latency
Error rates
Datastore operations
Pipeline health

Alerting

Low data ingestion rate
High prediction latency
High error rate
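
The three alert conditions above amount to simple threshold checks. The sketch below shows that logic in plain Python. The threshold values and all names (`Thresholds`, `evaluate_alerts`) are illustrative assumptions, not the values used by the alerting policies in this repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Thresholds:
    # Illustrative defaults only; tune them to your own traffic.
    min_ingest_per_min: float = 10.0
    max_latency_ms: float = 500.0
    max_error_rate: float = 0.05


def evaluate_alerts(ingest_per_min: float, latency_ms: float,
                    errors: int, requests: int,
                    limits: Thresholds = Thresholds()) -> List[str]:
    """Return the names of the alert conditions that are currently firing."""
    firing = []
    if ingest_per_min < limits.min_ingest_per_min:
        firing.append("low_ingestion_rate")
    if latency_ms > limits.max_latency_ms:
        firing.append("high_prediction_latency")
    # Guard against division by zero when no requests were made.
    error_rate = errors / requests if requests else 0.0
    if error_rate > limits.max_error_rate:
        firing.append("high_error_rate")
    return firing
```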

Documentation

Security

IAM permissions are configured automatically
Data is encrypted at rest and in transit
Automated cleanup policies are in place
Security Command Center integration

License

MIT License - see LICENSE file for details

Contributing

Fork the repository
Create your feature branch
Commit your changes
Push to the branch
Create a new Pull Request

Support

For support, please open an issue in the repository.

Each dataset is described by a YML file with the following fields:

| Field | Description |
|---|---|
| id | UUID of dataset |
| name | name of author |
| date | last modified date |
| dataset | array of URLs where the hosted version of the dataset is located |
| description | describes the dataset in as much detail as possible |
| environment | markdown filename of the environment description (see below) |
| technique | array of MITRE ATT&CK techniques associated with dataset |
| references | array of URLs that reference the dataset |
| sourcetypes | array of sourcetypes that are contained in the dataset |

For example:

id: 405d5889-16c7-42e3-8865-1485d7a5b2b6
author: Patrick Bareiss
date: '2020-10-08'
description: 'Atomic Test Results: Successful Execution of test T1003.001-1 Windows
  Credential Editor Successful Execution of test T1003.001-2 Dump LSASS.exe Memory
  using ProcDump Return value unclear for test T1003.001-3 Dump LSASS.exe Memory using
  comsvcs.dll Successful Execution of test T1003.001-4 Dump LSASS.exe Memory using
  direct system calls and API unhooking Return value unclear for test T1003.001-6
  Offline Credential Theft With Mimikatz Return value unclear for test T1003.001-7
  LSASS read with pypykatz '
environment: attack_range
technique:
- T1003.001
dataset:
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-powershell.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-security.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-sysmon.log
- https://media.githubusercontent.com/media/splunk/attack_data/master/datasets/attack_techniques/T1003.001/atomic_red_team/windows-system.log
references:
- https://attack.mitre.org/techniques/T1003/001/
- https://github.com/redcanaryco/atomic-red-team/blob/master/atomics/T1003.001/T1003.001.md
- https://github.com/splunk/security-content/blob/develop/tests/T1003_001.yml
sourcetypes:
- XmlWinEventLog:Microsoft-Windows-Sysmon/Operational
- WinEventLog:Microsoft-Windows-PowerShell/Operational
- WinEventLog:System
- WinEventLog:Security
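
A metadata entry like the example above can be sanity-checked once it has been parsed into a dictionary. The sketch below is hypothetical and not part of the repository. It uses the keys that appear in the example, and it assumes that every key is required and that technique IDs follow the `T1234` or `T1234.001` pattern.

```python
import re
import uuid

# Keys as they appear in the example entry; treating all as required is an assumption.
REQUIRED_FIELDS = ("id", "author", "date", "description", "environment",
                   "technique", "dataset", "references", "sourcetypes")
TECHNIQUE_RE = re.compile(r"^T\d{4}(\.\d{3})?$")


def validate_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the entry looks valid."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in meta]
    if "id" in meta:
        try:
            uuid.UUID(str(meta["id"]))
        except ValueError:
            problems.append("id is not a valid UUID")
    for technique in meta.get("technique", []):
        if not TECHNIQUE_RE.match(str(technique)):
            problems.append(f"unexpected technique format: {technique}")
    for url in meta.get("dataset", []):
        if not str(url).startswith(("http://", "https://")):
            problems.append(f"dataset entry is not a URL: {url}")
    return problems
```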

Environments

An environment describes where a dataset was collected. At the moment there are no specific restrictions, although we do have a simple template a user can start with here. The most common environment for datasets is the attack_range, since this is the tool used to generate attack datasets automatically.

Replay Datasets 📼

Most generated datasets are raw log files. There are two main ways to ingest them.

Into Splunk

using replay.py

Prerequisites: clone the repository, create a virtual environment, and install the Python dependencies:

git clone git@github.com:splunk/attack_data.git
cd attack_data
pip install virtualenv
virtualenv venv
source venv/bin/activate
pip install -r bin/requirements.txt

Download the dataset
Configure bin/replay.yml
Run python bin/replay.py -c bin/replay.yml
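
The download step can be scripted. The sketch below uses only the Python standard library to fetch one file from a `dataset` URL listed in a dataset's YML file. `local_name` and `download` are hypothetical helpers, not part of bin/replay.py.

```python
import os
import urllib.request
from urllib.parse import urlparse


def local_name(url: str) -> str:
    """Return the file name component of a dataset URL."""
    return os.path.basename(urlparse(url).path)


def download(url: str, dest_dir: str = ".") -> str:
    """Fetch one dataset file into `dest_dir` and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    target = os.path.join(dest_dir, local_name(url))
    urllib.request.urlretrieve(url, target)  # performs the network request
    return target
```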

using UI

Download the dataset
In Splunk Enterprise, go to Add Data -> Files & Directories and select the dataset
Set the sourcetype as specified in the YML file
Explore your data

See a quick demo 📺 of this process here.

Into DSP

The simplest way to send datasets into DSP is with the scloud command-line tool, which must be installed first.

Download the dataset
Ingest the dataset into DSP with the scloud command `cat attack_data.json | scloud ingest post-events --format JSON`
Build a pipeline that reads from the firehose and you should see the events.

Contribute Datasets 🥰

Generate a dataset
Under the corresponding MITRE Technique ID folder, create a folder named after the tool the dataset comes from, for example: atomic_red_team
Open a PR that adds a <tool_name_yaml>.yml file to the newly created folder, and upload the dataset into the same folder.

See T1003.002 for a complete example.
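
The folder layout described in the steps above can be sketched as a small path helper. The `datasets/attack_techniques/<technique>/<tool>/` prefix is inferred from the dataset URLs in the example entry. Naming the YML file after the tool is an assumption, and `contribution_paths` is a hypothetical function.

```python
from pathlib import PurePosixPath


def contribution_paths(technique_id: str, tool_name: str, log_files: list) -> list:
    """Return the repository-relative paths a contribution would add."""
    folder = PurePosixPath("datasets/attack_techniques") / technique_id / tool_name
    # Assumption: the metadata file is named after the tool.
    paths = [folder / f"{tool_name}.yml"]
    paths += [folder / name for name in log_files]
    return [str(p) for p in paths]
```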

Note: the simplest way to generate a dataset to contribute is to launch your simulations in the attack_range, or attack the machines manually, and then dump the data using the dump function.

See a quick demo 📺 of the process to dump a dataset here.

To contribute a dataset, create a PR on this repository. For general instructions on creating a PR, see this guide.

Automatically generated Datasets ⚙️

This project uses automation to generate datasets with the attack_range. Details about this service are in the attack_data_service sub-project folder.

Author

License

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
