A semantic-aware log parser generator powered by Large Language Models
Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks.
Matryoshka is the first end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers. Matryoshka consists of three complementary stages:
- Template Generator: Learns log syntax to extract relevant fields
- Schema Creator: Associates templates with semantically meaningful schemas for querying
- OCSF Mapper: Maps custom schemas to standardized OCSF taxonomy fields
While Matryoshka uses LLMs to generate parsers, the parsers themselves rely only on regular expressions to convert log data into structured formats, making them suitable for high-volume log sources. Matryoshka's template generator outperforms prior work, and its schema creator achieves an F1 score of 0.95 on realistic security queries. OCSF mapping remains a difficult task: Matryoshka maps the most important fields in a log to their OCSF counterparts, but struggles with the long tail of custom attributes. For now, we recommend relying on the custom attributes created in stage 2.
Based on the paper "Semantic-Aware Parsing for Security Logs" by Julien Piet, Vivian Fang, Rishi Khare, Vern Paxson, Raluca Ada Popa, and David Wagner from UC Berkeley.
⚠️ Disclaimer: This is research code and may contain bugs. For production use cases, please contact Julien Piet.
- Python: 3.9 or higher
- API Access: At least one of Gemini, OpenAI, or Anthropic API keys
```bash
# Clone the repository
git clone https://github.com/julien-piet/matryoshka
cd matryoshka

# Create virtual environment
python3.9 -m venv env
source env/bin/activate

# Install Matryoshka
pip install .
```

Choose your preferred LLM provider and follow the setup instructions:
The original paper uses Gemini models. To set up the Gemini API, you will need to follow the instructions here to log in to your GCP account. At the time of writing, the steps were:
- Enable the VertexAI API in Google Cloud, and create OAuth credentials
- Install gcloud on your server
- Set up Application Default Credentials

```bash
gcloud auth application-default login
```

- Export your GCP project name

```bash
export GEMINI_PROJECT=YOUR_GCP_PROJECT_NAME
```
- Usage: Specify your model using the `--model MODEL` flag
- Supported Models: Any Gemini generative model
- Embeddings: `text-embedding-005`
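For instance, once the GCP setup above is done, the full pipeline can be run against Gemini with a command along these lines (Gemini uses the default `google` backend; the configuration file is described in the configuration section below):

```bash
# Gemini uses the default backend, so no --backend flag is needed
matryoshka --config_file config.json --model gemini-2.5-flash
```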
```bash
export OPENAI_API_KEY=YOUR_API_KEY
```

- Usage: Add the `--backend openai` flag and specify your model using the `--model MODEL` flag
- Supported Models: Any OpenAI generative model
- Embeddings: `text-embedding-3-large`
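For instance (the model name below is just an illustrative choice; any OpenAI generative model should work):

```bash
# Select the OpenAI backend explicitly; gpt-4o is an example model name
matryoshka --config_file config.json --backend openai --model gpt-4o
```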
```bash
export ANTHROPIC_API_KEY=YOUR_API_KEY
export OPENAI_API_KEY=YOUR_OPENAI_KEY # Required for embeddings
```

- Usage: Add the `--backend anthropic` flag and specify your model using the `--model MODEL` flag
- Supported Models: Any Anthropic generative model
- Note: Uses OpenAI embeddings, as Anthropic doesn't provide embedding models
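For instance (the model name below is illustrative; any Anthropic generative model should work):

```bash
# Both keys must be exported: Anthropic for generation, OpenAI for embeddings
matryoshka --config_file config.json --backend anthropic --model claude-3-5-sonnet-20241022
```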
| Command | Description |
|---|---|
| `matryoshka-syntax` | Run template generation step |
| `matryoshka-schema` | Run schema generation step |
| `matryoshka-map` | Run OCSF mapping step |
| `matryoshka` | Execute full pipeline |
Create a JSON configuration file:
```json
{
    "data_path": "path/to/your/log/file.log",
    "example_path": "path/to/example/templates.json", // Optional
    "results_path": "path/to/output/folder",
    "description_path": "path/to/log/description.txt" // Optional
}
```

Configuration Fields:
- `data_path`: Required - Path to your log file. An example sshd log file is provided in `examples/sshd.txt`.
- `example_path`: Optional - Example templates (Matryoshka runs zero-shot by default). Example templates take time to write, since they require writing templates by hand in JSON and describing the different parts of each template; this could be made simpler by having a language model fill in the gaps. You can find an example in `examples/template.json`.
- `results_path`: Required - Output directory.
- `description_path`: Optional - A text description of the log's context, useful if you want to give the LLM more background about the log. You can find an example description file in `examples/descrpition.txt`.
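As a concrete starting point, here is a minimal sketch that writes a config using only the two required fields and the bundled sshd sample log (the `output/` directory name is an arbitrary choice):

```bash
# Minimal config: point data_path at the bundled sshd example and pick an output folder
cat > config.json <<'EOF'
{
    "data_path": "examples/sshd.txt",
    "results_path": "output/"
}
EOF
```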
| Argument | Default | Description |
|---|---|---|
| `--config_file` | - | Required - Configuration file path |
| `--model` | `gemini-2.5-flash` | LLM model to use |
| `--backend` | `google` | Provider: google, openai, anthropic |
| `--thread_count` | `16` | Number of processing threads |
| `--file_percent` | `1.0` | Percentage of log file to process |
| `--buffer_size` | `2500` | Log processing buffer size |
| `--force_overlap` | - | Skip overlap confirmation (recommended) |
| `--use_fewshot` | - | Enable few-shot examples |
| `--cache_dir` | `.cache/` | Cache directory path |
| `--existing_parser` | - | Path to a saved template-generation dill file; use it to recover progress if template generation was halted |
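As an illustration, the invocation below combines several of these flags (the values are arbitrary examples, and we assume the flags apply to the template-generation command):

```bash
# Process half the log file with extra threads, skipping the overlap confirmation
matryoshka-syntax --config_file config.json --model gemini-2.5-flash \
    --thread_count 32 --file_percent 0.5 --force_overlap
```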
```bash
# 1. Generate templates
matryoshka-syntax --config_file config.json --model gemini-2.5-flash --force_overlap

# 2. Create semantic schema
matryoshka-schema --config_file config.json --model gemini-2.5-flash

# 3. Map to OCSF standards
matryoshka-map --config_file config.json --model gemini-2.5-flash
```

Launch the interactive web interface for parser editing and querying:
```bash
# With specific files
matryoshka-edit --log_file logs/sshd.log --parser_file parsers/sshd_parser.dill

# With configuration file
matryoshka-edit --config_file config.json
```

Web Interface Options:
- `--listen_addr`: Bind address (default: `127.0.0.1`)
- `--listen_port`: Port number (default: `8080`)
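For instance, to make the editor reachable from other machines (the port below is an arbitrary choice):

```bash
# Bind to all interfaces on a non-default port
matryoshka-edit --config_file config.json --listen_addr 0.0.0.0 --listen_port 9090
```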
Convert logs to structured JSON format:
```bash
# Basic ingestion
matryoshka-ingest logs/input.log parsers/parser.dill output.json

# With OCSF mapping
matryoshka-ingest logs/input.log parsers/parser.dill output.json --OCSF
```

Matryoshka was evaluated on both LogHub 2.0 and a curated dataset from public RedHat bug reports.
📥 Download Bug Tracker Dataset
Included Log Types:
- Audit logs
- SSH Server logs
- DHCP Client logs
- CRON logs
- Puppet logs
In OCSF, attributes are only described in the context of their parent attribute. Matryoshka augments OCSF attributes with a description relative to the event they are part of. For instance, the field `authentication.src_endpoint.ip` is described by OCSF as "The IP address of the endpoint, in either IPv4 or IPv6 format.", while our generated description is "The IP address (IPv4 or IPv6) of the source endpoint involved in the authentication event (e.g., a user's login or logout attempt).", which takes the role of its parent into account.
Generating these descriptions is costly, but only needs to be done once per OCSF release. The descriptions are saved in a cache directory, which is created automatically when the mapping algorithm runs. To avoid generating them yourself, you can download cached descriptions 📥 here, extract them, and place both JSON files in an OCSF directory at the `.cache/OCSF` path.
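A rough sketch of that manual step, assuming the download is a gzipped tarball named `ocsf_descriptions.tar.gz` containing the two JSON files (both the archive name and its format are assumptions; adjust to whatever the link above provides):

```bash
# Create the OCSF cache directory and extract the cached descriptions into it
mkdir -p .cache/OCSF
tar -xzf ocsf_descriptions.tar.gz -C .cache/OCSF
```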
This project is licensed under the GNU General Public License v3.0.
For questions, issues, or collaboration opportunities, please contact Julien Piet 🌐 Personal Website