Thanks to visit codestin.com
Credit goes to github.com

Skip to content

zschmerber/matryoshka

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

6 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Matryoshka Logo

Matryoshka

A semantic-aware log parser generator powered by Large Language Models

License: GPL v3 Python 3.9+

Table of Contents


Overview

Security analysts struggle to quickly and efficiently query and correlate log data due to the heterogeneity and lack of structure in real-world logs. Existing AI-based parsers focus on learning syntactic log templates but lack the semantic interpretation needed for querying. Directly querying large language models on raw logs is impractical at scale and vulnerable to prompt injection attacks.

Matryoshka is the first end-to-end system leveraging LLMs to automatically generate semantically-aware structured log parsers. Matryoshka consists of three complementary stages:

  1. Template Generator: Learns log syntax to extract relevant fields
  2. Schema Creator: Associates templates with semantically meaningful schemas for querying
  3. OCSF Mapper: Maps custom schemas to standardized OCSF taxonomy fields

While Matryoshka uses LLMs for generating parsers, these parsers only rely on regular expressions for conveting log data to structured formats, making them suitable for high volume log sources. Matryoshka’s tempate generator outperforms prior works, and its schema creator layer achieves an 𝐹1 score of 0.95 on realistic security queries. OCSF mapping remains a difficult task. Matryoshka is able to map the most important fields in a log to their OCSF counterparts, but struggles with the long tail of custom attributes. We recommend at this time relying on created attributes from step 2.

Research Background

Based on the paper "Semantic-Aware Parsing for Security Logs" by Julien Piet, Vivian Fang, Rishi Khare, Vern Paxson, Raluca Ada Popa, and David Wagner from UC Berkeley.

📖 Read the Paper

⚠️ Disclaimer: This is research code and may contain bugs. For production use cases, please contact Julien Piet.


Getting Started

Prerequisites

  • Python: 3.9 or higher
  • API Access: At least one of Gemini, OpenAI, or Anthropic API keys

Installation

1. Clone and Setup

# Clone the repository
git clone https://github.com/julien-piet/matryoshka
cd matryoshka

# Create virtual environment
python3.9 -m venv env
source env/bin/activate

# Install Matryoshka
pip install .

LLM API Configuration

Choose your preferred LLM provider and follow the setup instructions:

Gemini

The original paper uses Gemini models. In order to setup the Gemini API, you will need to follow instructions here to login to your GCP account. At the time of writing, the instructions for this were:

  1. Enable the VertexAI API in Google Cloud, and create OAuth credentials
  2. Install gcloud on your server
  3. Setup Application Default Credentials
gcloud auth application-default login
  1. Export your GCP project name
export GEMINI_PROJECT=YOUR_GCP_PROJECT_NAME
  • Usage: Specify your model using --model MODEL flag
  • Supported Models: Any Gemini generative model
  • Embeddings: text-embedding-005

OpenAI

export OPENAI_API_KEY=YOUR_API_KEY
  • Usage: Add --backend openai flag and specify your model using --model MODEL flag
  • Supported Models: Any OpenAI generative model
  • Embeddings: text-embedding-3-large

Anthropic

export ANTHROPIC_API_KEY=YOUR_API_KEY
export OPENAI_API_KEY=YOUR_OPENAI_KEY  # Required for embeddings
  • Usage: Add --backend anthropic flag and specify your model using --model MODEL flag
  • Supported Models: Any Anthropic generative model
  • Note: Uses OpenAI embeddings as Anthropic doesn't provide embedding models

Usage

Parser Generation

Available Commands
Command Description
matryoshka-syntax Run template generation step
matryoshka-schema Run schema generation step
matryoshka-map Run OCSF mapping step
matryoshka Execute full pipeline
Configuration File

Create a JSON configuration file:

{
  "data_path": "path/to/your/log/file.log",
  "example_path": "path/to/example/templates.json",  // Optional
  "results_path": "path/to/output/folder",
  "description_path": "path/to/log/description.txt"  // Optional
}

Configuration Fields:

  • data_path: Required - Path to your log file. An example sshd log file is provided in examples/sshd.txt.
  • example_path: Optional - Example templates (Matryoshka runs zero-shot by default). If you did want to use example templates, they are long to write, because they require writing templates by hand in json, as well as providing a description of the different parts of the template. This could be made simpler by having a language model fill in the gaps. You can find an example in examples/template.json.
  • results_path: Required - Output directory.
  • description_path: Optional - Text description of log context. This is optional if you want to provide the LLM with more context about the log. You can find an example description file in examples/descrpition.txt.
Command Arguments
Argument Default Description
--config_file - Required - Configuration file path
--model gemini-2.5-flash LLM model to use
--backend google Provider: google, openai, anthropic
--thread_count 16 Number of processing threads
--file_percent 1.0 Percentage of log file to process
--buffer_size 2500 Log processing buffer size
--force_overlap - Skip overlap confirmation (recommended)
--use_fewshot - Enable few-shot examples
--cache_dir .cache/ Cache directory path
--existing_parser - Path to saved template generation dill file.
Use if template generation was halted to recover progress.
Step-by-Step Execution (Recommended)
# 1. Generate templates
matryoshka-syntax --config_file config.json --model gemini-2.5-flash --force_overlap

# 2. Create semantic schema
matryoshka-schema --config_file config.json --model gemini-2.5-flash

# 3. Map to OCSF standards
matryoshka-map --config_file config.json --model gemini-2.5-flash

Parser Curation

Launch the interactive web interface for parser editing and querying:

# With specific files
matryoshka-edit --log_file logs/sshd.log --parser_file parsers/sshd_parser.dill

# With configuration file
matryoshka-edit --config_file config.json

Web Interface Options:

  • --listen_addr: Bind address (default: 127.0.0.1)
  • --listen_port: Port number (default: 8080)

Log Ingestion

Convert logs to structured JSON format:

# Basic ingestion
matryoshka-ingest logs/input.log parsers/parser.dill output.json

# With OCSF mapping
matryoshka-ingest logs/input.log parsers/parser.dill output.json --OCSF

Additional Data

Bug Tracker Dataset

Matryoshka was evaluated on both LogHub 2.0 and a curated dataset from public RedHat bug reports.

📥 Download Bug Tracker Dataset

Included Log Types:

  • Audit logs
  • SSH Server logs
  • DHCP Client logs
  • CRON logs
  • Puppet logs

OCSF Descriptions

OCSF attributes are only described in the context of their parent attribute. Matryoshka augments OCSF attributes with a description relative to the event they are part of. For instance, field authentication.src_endpoint.ip is described as "The IP address of the endpoint, in either IPv4 or IPv6 format.", while our generated description for this field is "The IP address (IPv4 or IPv6) of the source endpoint involved in the authentication event (e.g., a user's login or logout attempt).", taking into account the role of its parent.

Generating these descriptions is costly, but only needs to be done once every time OCSF releases a new version. These descriptions are saved in a cache directory. When running the mapping algorithm, this cache will be created automatically. To avoid doing this yourself, you can download cached descriptions 📥 here, extract them, and put both json files in an OCSF directory at the .cache/OCSF path.


License

This project is licensed under the GNU General Public License v3.0.


Contributing & Contact

For questions, issues, or collaboration opportunities, please contact Julien Piet 🌐 Personal Website

About

A LLM-powered log ingestion tool.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 87.7%
  • HTML 12.3%