Machine learning-based DNS classifier for detecting Domain Generation Algorithms (DGAs), tunneling, and data exfiltration by malicious actors.
Explore the docs »
View Demo · Report Bug · Request Feature
> [!CAUTION]
> The project is under active development right now. Everything might change, break, or move around quickly.
If you want to use heiDGAF, just use the provided Docker Compose file to quickly bootstrap your environment:

```sh
docker compose -f docker/docker-compose.yml up
```
The following table lists the most important configuration parameters with their default values. The configuration options can be set in `config.yaml` in the root directory.
| Path | Description | Default Value |
|---|---|---|
| `logging` | Global and module-specific logging configurations. | |
| `logging.base.debug` | Default debug logging level for all modules if not overridden. | `false` |
| `logging.modules.<module_name>.debug` | Specific debug logging level for a given module (e.g., `log_storage.logserver`). | `false` (for all listed modules) |
| `pipeline` | Configuration for the data processing pipeline stages. | |
| `pipeline.log_storage.logserver.input_file` | Path to the input file for the log server. | `"/opt/file.txt"` |
| `pipeline.log_collection.collector.logline_format` | Defines the format of incoming log lines, specifying field name, type, and parsing rules/values. | Array of field definitions (e.g., `["timestamp", Timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"]`) |
| `pipeline.log_collection.batch_handler.batch_size` | Number of log lines to collect before sending a batch. | `10000` |
| `pipeline.log_collection.batch_handler.batch_timeout` | Maximum time (in seconds) to wait before sending a partially filled batch. | `30.0` |
| `pipeline.log_collection.batch_handler.subnet_id.ipv4_prefix_length` | IPv4 prefix length for subnet identification. | `24` |
| `pipeline.log_collection.batch_handler.subnet_id.ipv6_prefix_length` | IPv6 prefix length for subnet identification. | `64` |
| `pipeline.data_inspection.inspector.mode` | Mode of operation for the data inspector. | `univariate` (options: `multivariate`, `ensemble`) |
| `pipeline.data_inspection.inspector.ensemble.model` | Model to use when the inspector mode is `ensemble`. | `WeightEnsemble` |
| `pipeline.data_inspection.inspector.ensemble.module` | Python module for the ensemble model. | `streamad.process` |
| `pipeline.data_inspection.inspector.ensemble.model_args` | Arguments for the ensemble model. | (empty by default) |
| `pipeline.data_inspection.inspector.models` | List of models to use for data inspection (e.g., anomaly detection). | Array of model definitions (e.g., `{"model": "ZScoreDetector", "module": "streamad.model", "model_args": {"is_global": false}}`) |
| `pipeline.data_inspection.inspector.anomaly_threshold` | Threshold for classifying an observation as an anomaly. | `0.01` |
| `pipeline.data_inspection.inspector.score_threshold` | Threshold for the anomaly score. | `0.5` |
| `pipeline.data_inspection.inspector.time_type` | Unit of time used in time range calculations. | `ms` |
| `pipeline.data_inspection.inspector.time_range` | Time range for inspection. | `20` |
| `pipeline.data_analysis.detector.model` | Model to use for data analysis (e.g., DGA detection). | `rf` (Random Forest); alternative: `XGBoost` |
| `pipeline.data_analysis.detector.checksum` | Checksum of the model file, used to verify its integrity. | `ba1f718179191348fe2abd51644d76191d42a5d967c6844feb3371b6f798bf06` |
| `pipeline.data_analysis.detector.base_url` | Base URL for downloading the model if it is not present locally. | `https://heibox.uni-heidelberg.de/d/0d5cbcbe16cd46a58021/` |
| `pipeline.data_analysis.detector.threshold` | Threshold for the detector's classification. | `0.5` |
| `pipeline.monitoring.clickhouse_connector.batch_size` | Batch size for sending data to ClickHouse. | `50` |
| `pipeline.monitoring.clickhouse_connector.batch_timeout` | Batch timeout (in seconds) for sending data to ClickHouse. | `2.0` |
| `environment` | Configuration for external services and infrastructure. | |
| `environment.kafka_brokers` | List of Kafka broker hostnames and ports. | `[{"hostname": "kafka1", "port": 8097}, {"hostname": "kafka2", "port": 8098}, {"hostname": "kafka3", "port": 8099}]` |
| `environment.kafka_topics.pipeline.<topic_name>` | Kafka topic names for the various pipeline stages. | e.g., `logserver_in: "pipeline-logserver_in"` |
| `environment.monitoring.clickhouse_server.hostname` | Hostname of the ClickHouse server for monitoring data. | `clickhouse-server` |
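As an illustration, a minimal `config.yaml` overriding a few of these defaults might look as follows. This is a sketch inferred from the dotted paths above, not an exhaustive or authoritative configuration:

```yaml
logging:
  base:
    debug: true   # enable verbose logging for all modules unless overridden

pipeline:
  log_collection:
    batch_handler:
      batch_size: 5000     # send smaller batches than the default 10000
      batch_timeout: 15.0  # flush partially filled batches after 15 s
  data_analysis:
    detector:
      model: rf            # Random Forest; XGBoost is also available

environment:
  kafka_brokers:
    - hostname: kafka1
      port: 8097
```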
> [!IMPORTANT]
> More information will be added soon! Go and watch the repository for updates.
Install all Python requirements:
```sh
python -m venv .venv
source .venv/bin/activate
sh install_requirements.sh
```

Alternatively, you can use `pip install` and add all needed requirements individually with `-r requirements.*.txt`.
Now, you can start each stage, e.g. the inspector:
```sh
python src/inspector/main.py
```

> [!IMPORTANT]
> More information will be added soon! Go and watch the repository for updates.
Currently, we provide two trained models, namely XGBoost and RandomForest.
```sh
python -m venv .venv
source .venv/bin/activate
pip install -r requirements/requirements.train.txt
```

For training our models, we rely on the following data sets:
- CICBellDNS2021
- DGTA Benchmark
- DNS Tunneling Queries for Binary Classification
- UMUDGA - University of Murcia Domain Generation Algorithm Dataset
- Real-CyberSecurity-Datasets
However, we compute all features separately and rely only on the domain and its class.
Currently, we are only interested in binary classification; thus, the class is either benign or malicious.
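To make the setup concrete, here is a minimal sketch of such a binary domain classifier. The toy feature set (length, character entropy, digit ratio) and the `domains.csv` file are illustrative assumptions; the project's actual features are computed separately, as noted above.

```python
# Illustrative sketch only: toy features and a hypothetical domains.csv
# (columns: domain, class); not the project's actual training pipeline.
import math
from collections import Counter

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


def entropy(s: str) -> float:
    """Shannon entropy of the character distribution of a domain."""
    counts = Counter(s)
    return -sum(c / len(s) * math.log2(c / len(s)) for c in counts.values())


def features(domain: str) -> list[float]:
    # Toy features: length, character entropy, and digit ratio.
    return [
        len(domain),
        entropy(domain),
        sum(ch.isdigit() for ch in domain) / len(domain),
    ]


df = pd.read_csv("domains.csv")  # hypothetical file with columns: domain, class
X = [features(d) for d in df["domain"]]
y = (df["class"] == "malicious").astype(int)  # binary target: benign vs. malicious

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.3f}")
```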
> [!IMPORTANT]
> We support custom schemes.

For example, the following scheme defines the expected fields of a log line:
```yaml
loglines:
  fields:
    - [ "timestamp", RegEx, '^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z$' ]
    - [ "status_code", ListItem, [ "NOERROR", "NXDOMAIN" ], [ "NXDOMAIN" ] ]
    - [ "client_ip", IpAddress ]
    - [ "dns_server_ip", IpAddress ]
    - [ "domain_name", RegEx, '^(?=.{1,253}$)((?!-)[A-Za-z0-9-]{1,63}(?<!-)\.)+[A-Za-z]{2,63}$' ]
    - [ "record_type", ListItem, [ "A", "AAAA" ] ]
    - [ "response_ip", IpAddress ]
    - [ "size", RegEx, '^\d+b$' ]
```

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
If you have a suggestion that would make this better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!
Distributed under the EUPL License. See LICENSE.txt for more information.