Matryoshka

A semantic-aware log parser generator powered by Large Language Models

Overview

Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks.

Matryoshka is the first end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers. Matryoshka consists of three complementary stages:

Template Generator: Learns log syntax to extract relevant fields
Schema Creator: Associates templates with semantically meaningful schemas for querying
OCSF Mapper: Maps custom schemas to standardized OCSF taxonomy fields

While Matryoshka uses LLMs for generating parsers, these parsers only rely on regular expressions for conveting log data to structured formats, making them suitable for high volume log sources. Matryoshka’s tempate generator outperforms prior works, and its schema creator layer achieves an 𝐹1 score of 0.95 on realistic security queries. OCSF mapping remains a difficult task. Matryoshka is able to map the most important fields in a log to their OCSF counterparts, but struggles with the long tail of custom attributes. We recommend at this time relying on created attributes from step 2.

Research Background

Based on the paper "Semantic-Aware Parsing for Security Logs" by Julien Piet, Vivian Fang, Rishi Khare, Vern Paxson, Raluca Ada Popa, and David Wagner from UC Berkeley.

📖 Read the Paper

⚠️ Disclaimer: This is research code and may contain bugs. For production use cases, please contact Julien Piet.

Getting Started

Prerequisites

Python: 3.9 or higher
API Access: At least one of Gemini, OpenAI, or Anthropic API keys

Installation

1. Clone and Setup

# Clone the repository
git clone https://github.com/julien-piet/matryoshka
cd matryoshka

# Create virtual environment
python3.9 -m venv env
source env/bin/activate

# Install Matryoshka
pip install .

LLM API Configuration

Choose your preferred LLM provider and follow the setup instructions:

Gemini

The original paper uses Gemini models. In order to setup the Gemini API, you will need to follow instructions here to login to your GCP account. At the time of writing, the instructions for this were:

Enable the VertexAI API in Google Cloud, and create OAuth credentials
Install gcloud on your server
Setup Application Default Credentials

gcloud auth application-default login

Export your GCP project name

export GEMINI_PROJECT=YOUR_GCP_PROJECT_NAME

Usage: Specify your model using --model MODEL flag
Supported Models: Any Gemini generative model
Embeddings: text-embedding-005

OpenAI

export OPENAI_API_KEY=YOUR_API_KEY

Usage: Add --backend openai flag and specify your model using --model MODEL flag
Supported Models: Any OpenAI generative model
Embeddings: text-embedding-3-large

Anthropic

export ANTHROPIC_API_KEY=YOUR_API_KEY
export OPENAI_API_KEY=YOUR_OPENAI_KEY  # Required for embeddings

Usage: Add --backend anthropic flag and specify your model using --model MODEL flag
Supported Models: Any Anthropic generative model
Note: Uses OpenAI embeddings as Anthropic doesn't provide embedding models

Usage

Parser Generation

Available Commands

Command	Description
`matryoshka-syntax`	Run template generation step
`matryoshka-schema`	Run schema generation step
`matryoshka-map`	Run OCSF mapping step
`matryoshka`	Execute full pipeline

Configuration File

Create a JSON configuration file:

{
  "data_path": "path/to/your/log/file.log",
  "example_path": "path/to/example/templates.json",  // Optional
  "results_path": "path/to/output/folder",
  "description_path": "path/to/log/description.txt"  // Optional
}

Configuration Fields:

data_path: Required - Path to your log file. An example sshd log file is provided in examples/sshd.txt.
example_path: Optional - Example templates (Matryoshka runs zero-shot by default). If you did want to use example templates, they are long to write, because they require writing templates by hand in json, as well as providing a description of the different parts of the template. This could be made simpler by having a language model fill in the gaps. You can find an example in examples/template.json.
results_path: Required - Output directory.
description_path: Optional - Text description of log context. This is optional if you want to provide the LLM with more context about the log. You can find an example description file in examples/descrpition.txt.

Command Arguments

Argument	Default	Description
`--config_file`	-	Required - Configuration file path
`--model`	`gemini-2.5-flash`	LLM model to use
`--backend`	`google`	Provider: `google`, `openai`, `anthropic`
`--thread_count`	`16`	Number of processing threads
`--file_percent`	`1.0`	Percentage of log file to process
`--buffer_size`	`2500`	Log processing buffer size
`--force_overlap`	-	Skip overlap confirmation (recommended)
`--use_fewshot`	-	Enable few-shot examples
`--cache_dir`	`.cache/`	Cache directory path
`--existing_parser`	-	Path to saved template generation dill file. Use if template generation was halted to recover progress.

Step-by-Step Execution (Recommended)

# 1. Generate templates
matryoshka-syntax --config_file config.json --model gemini-2.5-flash --force_overlap

# 2. Create semantic schema
matryoshka-schema --config_file config.json --model gemini-2.5-flash

# 3. Map to OCSF standards
matryoshka-map --config_file config.json --model gemini-2.5-flash

Parser Curation

Launch the interactive web interface for parser editing and querying:

# With specific files
matryoshka-edit --log_file logs/sshd.log --parser_file parsers/sshd_parser.dill

# With configuration file
matryoshka-edit --config_file config.json

Web Interface Options:

--listen_addr: Bind address (default: 127.0.0.1)
--listen_port: Port number (default: 8080)

Log Ingestion

Convert logs to structured JSON format:

# Basic ingestion
matryoshka-ingest logs/input.log parsers/parser.dill output.json

# With OCSF mapping
matryoshka-ingest logs/input.log parsers/parser.dill output.json --OCSF

Additional Data

Bug Tracker Dataset

Matryoshka was evaluated on both LogHub 2.0 and a curated dataset from public RedHat bug reports.

📥 Download Bug Tracker Dataset

Included Log Types:

Audit logs
SSH Server logs
DHCP Client logs
CRON logs
Puppet logs

OCSF Descriptions

OCSF attributes are only described in the context of their parent attribute. Matryoshka augments OCSF attributes with a description relative to the event they are part of. For instance, field authentication.src_endpoint.ip is described as "The IP address of the endpoint, in either IPv4 or IPv6 format.", while our generated description for this field is "The IP address (IPv4 or IPv6) of the source endpoint involved in the authentication event (e.g., a user's login or logout attempt).", taking into account the role of its parent.

Generating these descriptions is costly, but only needs to be done once every time OCSF releases a new version. These descriptions are saved in a cache directory. When running the mapping algorithm, this cache will be created automatically. To avoid doing this yourself, you can download cached descriptions 📥 here, extract them, and put both json files in an OCSF directory at the .cache/OCSF path.

License

This project is licensed under the GNU General Public License v3.0.

Contributing & Contact

For questions, issues, or collaboration opportunities, please contact Julien Piet 🌐 Personal Website

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
examples		examples
src/matryoshka		src/matryoshka
.gitignore		.gitignore
LICENSE.txt		LICENSE.txt
README.md		README.md
logo.png		logo.png
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Matryoshka

Table of Contents

Overview

Research Background

Getting Started

Prerequisites

Installation

1. Clone and Setup

LLM API Configuration

Gemini

OpenAI

Anthropic

Usage

Parser Generation

Available Commands

Configuration File

Command Arguments

Step-by-Step Execution (Recommended)

Parser Curation

Log Ingestion

Additional Data

Bug Tracker Dataset

OCSF Descriptions

License

Contributing & Contact

About

Uh oh!

Releases

Packages

Languages

License

zschmerber/matryoshka

Folders and files

Latest commit

History

Repository files navigation

Matryoshka

Table of Contents

Overview

Research Background

Getting Started

Prerequisites

Installation

1. Clone and Setup

LLM API Configuration

Gemini

OpenAI

Anthropic

Usage

Parser Generation

Available Commands

Configuration File

Command Arguments

Step-by-Step Execution (Recommended)

Parser Curation

Log Ingestion

Additional Data

Bug Tracker Dataset

OCSF Descriptions

License

Contributing & Contact

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages