Efficient and extensible Event Extraction with Code Prompts and Annotation Guidelines, built on top of TextEE.
This repository includes code for:
- PyCode-TextEE: Tools to obtain code prompts for 15 event extraction datasets supported by TextEE.
- Instruction Tuning with Guidelines: Source code to reproduce our work on utilizing code prompts and annotation guidelines for Event Extraction. Please navigate to the directory `instruction_tuning_with_guidelines_ACL_2025` for the source code.
If you find our work helpful, please cite:

```bibtex
@inproceedings{srivastava-etal-2025-instruction,
title = "Instruction-Tuning {LLM}s for Event Extraction with Annotation Guidelines",
author = "Srivastava, Saurabh and
Pati, Sweta and
Yao, Ziyu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.677/",
pages = "13055--13071",
ISBN = "979-8-89176-256-5",
abstract = "In this work, we study the effect of annotation guidelines{--}textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance."
}
```
Authors:
Saurabh Srivastava, Sweta Pati, Ziyu Yao
PyCode-TextEE extends TextEE, bringing event extraction into the era of prompt-based large language models.
While TextEE standardizes 10+ event extraction datasets into a unified JSON format, making them reproducible and comparable, PyCode-TextEE takes the next leap:
We transform TextEE-formatted data into code-style prompts, a format that is both readable and executable by LLMs and ideal for structured evaluation. In addition, we annotate the code prompts with annotation guidelines. Below, we provide an example of a code prompt and how we integrate annotation guidelines within it:
- **Code prompting** is a technique that enhances reasoning abilities in text+code LLMs by transforming natural language (NL) tasks into code representations. Instead of executing the code, the model uses it as a structured input format to reason and generate answers. Labels such as event classes and arguments are represented as Python classes, and the guidelines or instructions are introduced as docstrings. The model starts generating after the `result =` line.
- **Annotation Guidelines** define how to identify and classify events and their arguments within a text or other data. These guidelines help ensure consistency and quality in the annotation process, which is crucial for training machine learning models for event extraction. The performance of current SoTA models heavily depends on the quantity of human-annotated data, as the model learns the guidelines from these examples.
```python
# This is an event extraction task where the goal is to extract structured events from the text. A structured event contains an event trigger word, an event type, the arguments participating in the event, and their roles in the event. For each different event type, please output the extracted information from the text into python-style dictionaries where the first key will be 'mention' with the value of the event trigger. Next, please output the arguments and their roles following the same format. The event type definitions and their argument roles are defined next.
# Here are the event definitions:
@dataclass
class Meet(ContactEvent):
    """A 'Meet(ContactEvent)' is triggered by interactions where individuals or groups come together for a specific purpose, either physically or virtually. This event involves direct interaction, distinguishing it from remote communication events like 'PhoneWrite'. It encompasses formal and informal gatherings such as diplomatic talks, business meetings, press conferences, and forums, but excludes casual or unplanned encounters."""
    mention: str  # The text span that triggers the event.
    entity: List  # Entities are individuals, groups, organizations, or countries participating in the meeting. They represent the participants involved in the event.
    time: List  # The time at which the meeting takes place.
    place: List  # The place is the location where the meeting occurs, providing context for the event. It can be a city, building, specific venue, or virtual platform.

# This is the text to analyze
text = "The meeting concluded with the delegates voting by show of hands to meet again in 10 days."
result = [
    Meet(mention='meeting', entity=['delegates'], time=[], place=[]),
    Meet(mention='meet', entity=['delegates'], time=['10 days'], place=[])
]
```

PyCode-TextEE transforms EE datasets into the above format, which has been shown to perform well with LLMs. For more details, please refer to our paper Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines.
- **CodePrompt Format Conversion**: We convert event structures (event triggers and arguments, if available) into Python-like prompts (e.g., `Attack(mention="...", attacker=[...], target=[...])`) to help LLMs handle structured outputs.
- **Annotation Guideline Generation**: While annotation guidelines have helped LLMs achieve SOTA results for EE, previous approaches assume these guidelines are available, which is not always the case. We take the next step by generating these guidelines automatically from a few training samples.
- **Plug-and-Play with TextEE**: Directly load standardized datasets from TextEE and transform them with one command into training-ready CodePrompts.
- **Evaluation Toolkit for Prompted LLMs**: We provide exact-match evaluation utilities that compute precision, recall, and F1 scores over structured LLM outputs.
- **Code to Reproduce LLaMAEvents**: Includes all data transformations and training scripts used for our paper on utilizing code prompts and annotation guidelines. The code for that lives in `LLaMAEvents/`.
- April 23, 2025: We release PyCode-TextEE, a modular framework for converting standardized event extraction datasets (via TextEE) into code-style prompts, along with exact-match evaluation scripts.
Feel free to reach out if you'd like to contribute your models, datasets, or ideas!
We support 15 datasets for Event Detection (ED), Event Argument Extraction (EAE), and End-to-End (E2E) Event Extraction. All are converted into code-style prompts and support evaluation using our exact-match metric suite.
The table below also shows whether annotation guidelines are included for each dataset.
| Dataset | Task(s) | Paper Title | Source | Guidelines |
|---|---|---|---|---|
| ACE05 | ED, EAE, E2E | The Automatic Content Extraction (ACE) Program | LDC | Yes |
| ERE | ED, EAE, E2E | From Light to Rich ERE | LDC | Yes |
| MLEE | ED, EAE, E2E | Biological Event Extraction | Bioinformatics | No |
| Genia2011 | ED, EAE, E2E | Genia Event Task (2011) | BioNLP 2011 | No |
| Genia2013 | ED, EAE, E2E | Genia Event Task (2013) | BioNLP 2013 | No |
| M2E2 | ED, EAE, E2E | Cross-media Structured Common Space | ACL 2020 | No |
| CASIE | ED, EAE, E2E | CASIE: Cybersecurity Event Extraction | AAAI 2020 | No |
| PHEE | ED, EAE, E2E | Pharmacovigilance Event Extraction | EMNLP 2022 | No |
| MEE | ED | Multilingual Event Extraction | EMNLP 2022 | No |
| FewEvent | ED | Few-Shot Event Detection | WSDM 2020 | No |
| MAVEN | ED | Massive General-Domain ED | EMNLP 2020 | No |
| SPEED | ED | ED from Social Media for Epidemic Prediction | NAACL 2024 | No |
| MUC-4 | EAE | Fourth Message Understanding Conference | MUC 1992 | No |
| RAMS | EAE | Multi-Sentence Argument Linking | ACL 2020 | Yes |
| WikiEvents | EAE | Conditional Generation for Doc-level EAE | NAACL 2021 | Yes |
| GENEVA | EAE | Benchmarking Generalizability for EAE | ACL 2023 | Yes |
Although no additional packages are strictly required to run PyCode-TextEE, we recommend using Python 3.9+ with a clean virtual environment (e.g., via venv or conda).
```bash
# Clone the repo
git clone https://github.com/yourname/PyCode-TextEE.git
cd PyCode-TextEE

# Create a virtual environment (optional)
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

# Install requirements (optional)
pip install -r requirements.txt
```

These are the minimal dependencies to run the code:

- `datasets`
- `openai` (used for guideline generation)
- `wandb` (optional, for experiment tracking)
Some datasets (e.g., ACE, ERE) require an LDC license to access the raw files. We provide code for preprocessing them, but not the data itself.
Below is a step-by-step guide to run PyCode-TextEE.
Our pipeline is divided into 4 main stages:
Our code accepts data formatted by TextEE's pre-processing. Please follow the instructions in the `data` directory of the TextEE repo.
Make sure that, after running TextEE, your data is saved in the following structure:
```text
<your_dataset_dir>/
├── ace05-en/
│   ├── split1/
│   │   ├── train.json
│   │   ├── dev.json
│   │   └── test.json
│   └── ...
├── casie/
└── ...
```
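As a quick sanity check, the short sketch below counts the instances in each split. The path is a placeholder, and it assumes one JSON object per line; if your TextEE output stores a single JSON list instead, use `json.load(f)` directly.

```python
# Sketch: verify that the TextEE-preprocessed splits are in place (illustrative only).
# Assumes the directory layout shown above and JSON Lines files (one object per line).
import json
import os

dataset_dir = "<your_dataset_dir>/ace05-en/split1"  # placeholder; adjust to your local path

for split in ("train", "dev", "test"):
    path = os.path.join(dataset_dir, f"{split}.json")
    with open(path, encoding="utf-8") as f:
        instances = [json.loads(line) for line in f if line.strip()]
    print(f"{split}: {len(instances)} instances")
```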
If you're working with custom datasets (or want to regenerate schemas for the 15 supported ones), you'll first convert them into TextEE format and generate the corresponding Python-style event definitions.
Directory structure:
```text
PyCode-TextEE/
├── code_schema_generation/
│   ├── generate_schema.py
│   ├── init_prompts/        # Contains per-dataset class schemas (*.txt)
│   ├── python_event_defs/   # Python classes for eval (dataset-wise + all_ee_definitions.py)
│   ├── mapper.json          # Maps cleaned names to class names
│   └── schema.json          # All cleaned event/arg schemas
```
To generate the schema:

```bash
cd code_schema_generation
python generate_schema.py --dataset_folder <your_dataset_dir>
```

Example output schema (for the ACE05 `Attack` event):
```python
@dataclass
class Attack(ConflictEvent):
    mention: str
    target: List
    victim: List
    attacker: List
    instrument: List
    place: List
    agent: List
```

Note: We've already generated schemas for all 15 supported datasets. This step is only required for new datasets.
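One benefit of emitting plain Python dataclasses is that predicted and gold events can later be compared field by field with ordinary equality, which is what makes exact-match evaluation straightforward. The minimal sketch below uses a simplified `Attack` class of our own (not the repository's generated definition) to illustrate this:

```python
# Minimal sketch (not the repository's actual classes): dataclass equality gives a
# field-wise exact match between a predicted event and a gold event.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConflictEvent:
    mention: str  # trigger span

@dataclass
class Attack(ConflictEvent):
    target: List = field(default_factory=list)
    attacker: List = field(default_factory=list)
    place: List = field(default_factory=list)

gold = Attack(mention="bombing", attacker=["militants"], target=["embassy"])
pred = Attack(mention="bombing", attacker=["militants"], target=["embassy"])
print(gold == pred)  # True: every field matches exactly
```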
While code prompts convert EE datasets into a structured format, annotating the schema with guidelines helps LLMs understand event and argument definitions. As shown in our paper, schemas annotated with these guidelines help us achieve SOTA results with LLaMA-3.1-8B. However, not all datasets release their annotation guidelines, and we address this in our paper by proposing five different ways to generate them automatically. Specifically, we generate guidelines using the following five variants (a rough sketch of how positive and negative samples could feed an LLM follows the list):
- Guideline-P: Uses training samples from an event class `e` to generate guidelines. We denote such instances as positive samples in our approach.
- Guideline-PN: In addition to positive training samples, we also utilize 15 negative samples from different event classes to generate guidelines.
- Guideline-PS: We designate sibling event classes in the event schema as negative samples and utilize them to generate guidelines.
- Guideline-PN-Int and Guideline-PS-Int: We create two more variants that integrate the five diverse guideline samples from Guideline-PN and Guideline-PS, respectively, into a comprehensive one.
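As a rough illustration of how the positive and negative samples above could be turned into a guideline with an LLM (this is not the exact prompt or code used in `guideline_generation/`; the helper name, prompt wording, and model choice are assumptions), consider:

```python
# Rough sketch of Guideline-P / Guideline-PN style generation (illustrative only;
# the actual prompts live under guideline_generation/prompting/).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def draft_guideline(event_type, positive_samples, negative_samples=None,
                    model="gpt-4o-mini"):  # hypothetical helper and model choice
    prompt = (
        f"Write a concise annotation guideline for the event type '{event_type}'.\n"
        "Positive examples (sentences that contain this event):\n"
        + "\n".join(f"- {s}" for s in positive_samples)
    )
    if negative_samples:  # Guideline-PN: contrast with samples from other event classes
        prompt += (
            "\nNegative examples (sentences from other event types, to sharpen the boundary):\n"
            + "\n".join(f"- {s}" for s in negative_samples)
        )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with made-up samples:
# guideline = draft_guideline("Meet", ["The leaders met in Geneva."],
#                             ["The army attacked the village."])
```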
Note: We've already included the synthesized guidelines and the available human guidelines in the directory `guideline_generation/synthesize_guidelines/synthesized_guidelines`.
To generate the guidelines, please run the following commands:

```bash
cd guideline_generation
python synthesize_guidelines/create_dictionaries.py --dataset_name <dataset_name>
python prompting/prompt_llms.py                # generates guidelines P, PN, PS
python prompting/prompt_llm_adv_guidelines.py  # generates Int- guidelines
cd ..  # navigate back to the home directory
```

where `<dataset_name>` refers to the dataset for which the guidelines need to be generated (e.g., `ace05-en`), and `<guideline_type>` refers to one of the five variants discussed above, i.e., Guideline-P (P), Guideline-PN (PN), Guideline-PS (PS), Guideline-PN-Int (PNI), or Guideline-PS-Int (PSI).
After the above commands finish, the guidelines will be stored in the file `<output_file>`. Please make sure that your guideline file looks like this:
```json
{
    "EventName1": {
        "description": [
            "One possible definition.",
            "Another variation of the same."
        ],
        "attributes": {
            "mention": "Trigger span of the event.",
            "arg_1": ["One definition for arg_1", "another definition for arg_1"]
        }
    }
}
```

This enables randomized sampling during conversion to avoid overfitting to one phrasing, an approach highlighted in our paper.
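The sketch below (with a hypothetical helper name; the actual conversion is handled by `code_prompts/prepare_dataset.py`) shows how one phrasing per event and per argument could be drawn at random from a guideline file in the format above:

```python
# Minimal sketch (hypothetical helper): draw one guideline phrasing per event and
# per argument from the JSON format shown above, so repeated conversions do not
# always reuse the same wording.
import json
import random

def sample_guideline(guideline_file, event_name):
    with open(guideline_file, encoding="utf-8") as f:
        guidelines = json.load(f)
    entry = guidelines[event_name]
    docstring = random.choice(entry["description"])  # one of the event definitions
    arg_comments = {}
    for arg, definition in entry["attributes"].items():
        # 'mention' holds a single string; other arguments hold a list of variants.
        arg_comments[arg] = (definition if isinstance(definition, str)
                             else random.choice(definition))
    return docstring, arg_comments

# docstring, comments = sample_guideline("guidelines.json", "EventName1")
```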
We first need to make sure that the Python event definitions are importable from the current environment so that the code prompts can be verified.
```bash
cd python_event_defs  # already included in the repo, or generated in Step 1; see "PyCode-TextEE/code_schema_generation/python_event_defs"
export PYCODE_HOME=$(pwd)
export PYTHONPATH=$PYCODE_HOME:$PYTHONPATH
cd ../../  # return to the PyCode home directory
```

Run the following:
```bash
cd code_prompts
python prepare_dataset.py \
    --input_dir <your_dataset_directory> \
    --dataset_name <dataset_name> \
    --annotate_schema <True/False> \       # if unspecified, the schema is left unannotated (the flag defaults to False)
    --guideline_file <guideline_file> \    # if unspecified, guidelines are generated automatically as described in Step 2
    --add_negative_samples <True/False> \  # used to reproduce our LLaMAEvents results
    --output_dir ./processed_code_prompts/
```

| Argument | Description |
|---|---|
| `--input_dir` | Path to TextEE-formatted JSONs (default: `../../TextEE/processed_data`) |
| `--dataset_name` | Name of the dataset to process (e.g., `ace05-en`) |
| `--annotate_schema` | Add class docstrings and inline comments using guidelines (default: `False`) |
| `--guideline_file` | Guideline JSON file for schema annotation (required if `annotate_schema=True`) |
| `--add_negative_samples` | Add negative examples to the training set (default: `False`) |
| `--output_dir` | Where to save the converted code prompts (default: `./processed_code_prompts/`) |
When --annotate_schema=True, we generate prompts like:
```python
@dataclass
class Event(ParentEvent):
    """The event definition."""
    mention: str  # Event trigger definition
    arg_1: List   # Definition of argument 1
    arg_2: List   # Definition of argument 2
```

This format supports LLM-compatible structure learning and improves interpretability.
1. Skip `--guideline_file` and `--annotate_schema` if you're only interested in raw code prompts. If `annotate_schema` is `True` but the `guideline_file` is unspecified or not found, Step 2 will be executed automatically to produce the guideline file.
2. Use `--add_negative_samples` if you want to add a negative sample per instance, similar to DEGREE.
To train the model, you can use the following scripts, which default to LLaMA models as the backbone:

```bash
cd training_scripts
python train_completion.py   # train a chat completion model with LLaMA-3.1-8B as the backbone
```

You can also run the following command to resume training from a checkpoint:

```bash
python resume_from_ckpt.py   # please specify the checkpoint directory in the script; by default, it will download and run LLaMA-3.1-8B
```

Once you've trained your model to generate Python-style event prompts, you can use our evaluation suite in `code_evaluation/` to compute standard precision, recall, and F1 scores via exact-match comparison of predicted and gold structured outputs.
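If you want to inspect raw model outputs before scoring, the hedged sketch below (checkpoint path, prompt file, and generation settings are placeholders, not the repository's inference code) shows how a fine-tuned checkpoint can complete the `result =` line of a code prompt with Hugging Face Transformers:

```python
# Sketch: generate a structured completion from a fine-tuned checkpoint
# (paths and generation settings are placeholders, not the repo's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/your/finetuned-checkpoint"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

code_prompt = open("example_code_prompt.txt").read()  # a prompt ending with "result ="
inputs = tokenizer(code_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Keep only the newly generated tokens, i.e., the model's completion after "result =".
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)
```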
```text
code_evaluation/
├── all_ee_definitions.py   # Event classes copied from schema generation (Step 1)
├── event_scorer.py         # Main evaluation logic
└── utils_typing.py         # Type helper module (attribution to GoLLIE)
```
The core script compares model-generated code prompts with gold ones using Python object introspection.
- Extracts arguments from predicted and gold event objects
- Computes micro/macro F1 across all examples
- Identifies:
- Trigger-level mismatches
- Argument-level hallucinations
- Logs detailed stats (TP / FP / FN per role)
- `compute_f1(...)`: calculates precision, recall, and F1 from match counts
- `extract_objects(...)`: extracts all fields except `mention` to compare arguments
- `micro_ed_scores`: calculates the micro F1 score on the Event Detection task
- `micro_eae_scores`: calculates the micro F1 score on the Event Argument Extraction task
- `micro_e2e_scores`: calculates the micro F1 score on the End-to-End Event Extraction task
- `log_hallucinations_and_mismatches(...)`: logs mismatches such as hallucinated roles
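For reference, the exact-match scores reduce to the standard precision/recall/F1 formulas over true-positive, false-positive, and false-negative counts; the generic sketch below illustrates the computation (it is not necessarily identical to `compute_f1` in `event_scorer.py`):

```python
# Generic exact-match P/R/F1 from match counts (illustrative; see event_scorer.py
# for the repository's own implementation).
def exact_match_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 correctly extracted events, 2 spurious, 4 missed
print(exact_match_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```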
We provide a ready-to-run example in:
demo/e2e_demo.json
This file contains three illustrative cases:
- one fully correct prediction
- one partially correct prediction
- one incorrect prediction
To run the evaluation:
```bash
cd code_evaluation
python event_scorer.py --input_file ./../demo/e2e_demo.json
```