Efficient and extensible Event Extraction with Code Prompts and Annotation Guidelines, built on top of TextEE.
This repository includes code for:
- PyCode-TextEE: Tools to obtain code prompts for 15 event extraction datasets supported by TextEE.
- Instruction Tuning with Guidelines: Source code to reproduce our work on utilizing code prompts and annotation guidelines for Event Extraction. Please navigate to the directory `instruction_tuning_with_guidelines_ACL_2025` for the source code.
If you find our work helpful, please cite:

```bibtex
@inproceedings{srivastava-etal-2025-instruction,
title = "Instruction-Tuning {LLM}s for Event Extraction with Annotation Guidelines",
author = "Srivastava, Saurabh and
Pati, Sweta and
Yao, Ziyu",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-acl.677/",
pages = "13055--13071",
ISBN = "979-8-89176-256-5",
abstract = "In this work, we study the effect of annotation guidelines{--}textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance."
}
```
Authors:
Saurabh Srivastava, Sweta Pati, Ziyu Yao
PyCode-TextEE extends TextEE, bringing event extraction into the era of prompt-based large language models.
While TextEE standardizes 10+ event extraction datasets into a unified JSON format, making them reproducible and comparable, PyCode-TextEE takes the next leap:
We transform TextEE-formatted data into code-style prompts, a format that is both readable and executable by LLMs and ideal for structured evaluation. In addition, we annotate the code prompts with annotation guidelines. Below, we provide an example of a code prompt and how we integrate annotation guidelines within it:
- **Code prompting** is a technique that enhances reasoning abilities in text+code LLMs by transforming natural language (NL) tasks into code representations. Instead of executing the code, the model uses it as a structured input format to reason and generate answers. Labels such as event classes and arguments are represented as Python classes, and the guidelines or instructions are introduced as docstrings. The model starts generating after the `result =` line.
- **Annotation Guidelines** define how to identify and classify events and their arguments within a text or other data. These guidelines help ensure consistency and quality in the annotation process, which is crucial for training machine learning models for event extraction. The performance of current SoTA models heavily depends on the quantity of human-annotated data, as the model learns the guidelines from these examples.
```python
# This is an event extraction task where the goal is to extract structured events from the text. A structured event contains an event trigger word, an event type, the arguments participating in the event, and their roles in the event. For each different event type, please output the extracted information from the text into python-style dictionaries where the first key will be 'mention' with the value of the event trigger. Next, please output the arguments and their roles following the same format. The event type definitions and their argument roles are defined next.
# Here are the event definitions:
@dataclass
class Meet(ContactEvent):
    """A 'Meet(ContactEvent)' is triggered by interactions where individuals or groups come together for a specific purpose, either physically or virtually. This event involves direct interaction, distinguishing it from remote communication events like 'PhoneWrite'. It encompasses formal and informal gatherings such as diplomatic talks, business meetings, press conferences, and forums, but excludes casual or unplanned encounters."""
    mention: str  # The text span that triggers the event.
    entity: List  # Entities are individuals, groups, organizations, or countries participating in the meeting. They represent the participants involved in the event.
    time: List  # The time at which the meeting takes place.
    place: List  # The place is the location where the meeting occurs, providing context for the event. It can be a city, building, specific venue, or virtual platform.

# This is the text to analyze
text = "The meeting concluded with the delegates voting by show of hands to meet again in 10 days."
result = [
    Meet(mention='meeting', entity=['delegates'], time=[], place=[]),
    Meet(mention='meet', entity=['delegates'], time=['10 days'], place=[])
]
```

PyCode-TextEE transforms EE datasets into the above format, which has been shown to perform well with LLMs. For more details, please refer to our paper Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines.
- **CodePrompt Format Conversion**: We convert event structures (event triggers and arguments, if available) into Python-like prompts (e.g., `Attack(mention="...", attacker=[...], target=[...])`) to help LLMs handle structured outputs.
- **Annotation Guideline Generation**: While annotation guidelines have helped LLMs achieve SOTA results for EE, previous approaches assume these guidelines are available, which is not always the case. We take the next step by generating these guidelines automatically from a few training samples.
- **Plug-and-Play with TextEE**: Directly load standardized datasets from TextEE and transform them with one command into training-ready CodePrompts.
- **Evaluation Toolkit for Prompted LLMs**: We provide exact-match evaluation utilities that compute precision, recall, and F1 scores over structured LLM outputs.
- **Code to Reproduce LLaMAEvents**: Includes all data transformations and training scripts used for our paper on utilizing code prompts and annotation guidelines. The code for that lives in `LLaMAEvents/`.
- April 23, 2025: We release PyCode-TextEE, a modular framework for converting standardized event extraction datasets (via TextEE) into code-style prompts, along with exact-match evaluation scripts.
Feel free to reach out if you'd like to contribute your models, datasets, or ideas!
We support 15 datasets for Event Detection (ED), Event Argument Extraction (EAE), and End-to-End (E2E) Event Extraction. All are converted into code-style prompts and support evaluation using our exact-match metric suite.
The table below also shows whether annotation guidelines are included for each dataset.
| Dataset | Task(s) | Paper Title | Source | Guidelines |
|---|---|---|---|---|
| ACE05 | ED, EAE, E2E | The Automatic Content Extraction (ACE) Program | LDC | Yes |
| ERE | ED, EAE, E2E | From Light to Rich ERE | LDC | Yes |
| MLEE | ED, EAE, E2E | Biological Event Extraction | Bioinformatics | No |
| Genia2011 | ED, EAE, E2E | Genia Event Task (2011) | BioNLP 2011 | No |
| Genia2013 | ED, EAE, E2E | Genia Event Task (2013) | BioNLP 2013 | No |
| M2E2 | ED, EAE, E2E | Cross-media Structured Common Space | ACL 2020 | No |
| CASIE | ED, EAE, E2E | CASIE: Cybersecurity Event Extraction | AAAI 2020 | No |
| PHEE | ED, EAE, E2E | Pharmacovigilance Event Extraction | EMNLP 2022 | No |
| MEE | ED | Multilingual Event Extraction | EMNLP 2022 | No |
| FewEvent | ED | Few-Shot Event Detection | WSDM 2020 | No |
| MAVEN | ED | Massive General-Domain ED | EMNLP 2020 | No |
| SPEED | ED | ED from Social Media for Epidemic Prediction | NAACL 2024 | No |
| MUC-4 | EAE | Fourth Message Understanding Conference | MUC 1992 | No |
| RAMS | EAE | Multi-Sentence Argument Linking | ACL 2020 | Yes |
| WikiEvents | EAE | Conditional Generation for Doc-level EAE | NAACL 2021 | Yes |
| GENEVA | EAE | Benchmarking Generalizability for EAE | ACL 2023 | Yes |
Although no additional packages are strictly required to run PyCode-TextEE, we recommend using Python 3.9+ with a clean virtual environment (e.g., via venv or conda).
```bash
# Clone the repo
git clone https://github.com/yourname/PyCode-TextEE.git
cd PyCode-TextEE

# Create a virtual environment (optional)
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

# Install requirements (optional)
pip install -r requirements.txt
```

These are the minimal dependencies to run the code:

- `datasets`
- `openai` (used for guideline generation)
- `wandb` (optional, for experiment tracking)
Some datasets (e.g., ACE, ERE) require an LDC license to access the raw files. We provide code for preprocessing them, but not the data itself.
Below is a step-by-step guide to run PyCode-TextEE.
Our pipeline is divided into 4 main stages:
Our code accepts data formatted by TextEE's pre-processing. Please follow the instructions in the `data` directory of the TextEE repo.
Make sure that, after running TextEE, your data is saved in the following structure:
```text
<your_dataset_dir>/
├── ace05-en/
│   ├── split1/
│   │   ├── train.json
│   │   ├── dev.json
│   │   └── test.json
│   └── ...
├── casie/
└── ...
```
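As a quick sanity check, the short sketch below counts the instances in each split. The path is a placeholder, and it assumes one JSON object per line; if your TextEE output stores a single JSON list instead, use `json.load(f)` directly.

```python
# Sketch: verify that the TextEE-preprocessed splits are in place (illustrative only).
# Assumes the directory layout shown above and JSON Lines files (one object per line).
import json
import os

dataset_dir = "<your_dataset_dir>/ace05-en/split1"  # placeholder; adjust to your local path

for split in ("train", "dev", "test"):
    path = os.path.join(dataset_dir, f"{split}.json")
    with open(path, encoding="utf-8") as f:
        instances = [json.loads(line) for line in f if line.strip()]
    print(f"{split}: {len(instances)} instances")
```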
If you're working with custom datasets (or want to regenerate schemas for the 15 supported ones), you'll first convert them into TextEE format and generate the corresponding Python-style event definitions.
Directory structure:
```text
PyCode-TextEE/
├── code_schema_generation/
│   ├── generate_schema.py
│   ├── init_prompts/        # Contains per-dataset class schemas (*.txt)
│   ├── python_event_defs/   # Python classes for eval (dataset-wise + all_ee_definitions.py)
│   ├── mapper.json          # Maps cleaned names to class names
│   └── schema.json          # All cleaned event/arg schemas
```
To generate the schema:

```bash
cd code_schema_generation
python generate_schema.py --dataset_folder <your_dataset_dir>
```

Example output schema (for the ACE05 `Attack` event):
```python
@dataclass
class Attack(ConflictEvent):
    mention: str
    target: List
    victim: List
    attacker: List
    instrument: List
    place: List
    agent: List
```

Note: We've already generated schemas for all 15 supported datasets. This step is only required for new datasets.
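One benefit of emitting plain Python dataclasses is that predicted and gold events can later be compared field by field with ordinary equality, which is what makes exact-match evaluation straightforward. The minimal sketch below uses a simplified `Attack` class of our own (not the repository's generated definition) to illustrate this:

```python
# Minimal sketch (not the repository's actual classes): dataclass equality gives a
# field-wise exact match between a predicted event and a gold event.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ConflictEvent:
    mention: str  # trigger span

@dataclass
class Attack(ConflictEvent):
    target: List = field(default_factory=list)
    attacker: List = field(default_factory=list)
    place: List = field(default_factory=list)

gold = Attack(mention="bombing", attacker=["militants"], target=["embassy"])
pred = Attack(mention="bombing", attacker=["militants"], target=["embassy"])
print(gold == pred)  # True: every field matches exactly
```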
While code prompts convert EE datasets into a structured format, annotating the schema with guidelines helps LLMs understand event and argument definitions. As shown in our paper, schemas annotated with these guidelines help us achieve SOTA results with LLaMA-3.1-8B. However, not all datasets release their annotation guidelines, and we address this in our paper by proposing five different ways to generate them automatically. Specifically, we generate guidelines using the following five variants (a rough sketch of how positive and negative samples could feed an LLM follows the list):
- Guideline-P: Uses training samples from an event class `e` to generate guidelines. We denote such instances as positive samples in our approach.
- Guideline-PN: In addition to positive training samples, we also utilize 15 negative samples from different event classes to generate guidelines.
- Guideline-PS: We designate sibling event classes in the event schema as negative samples and utilize them to generate guidelines.
- Guideline-PN-Int and Guideline-PS-Int: We create two more variants that integrate the five diverse guideline samples from Guideline-PN and Guideline-PS, respectively, into a comprehensive one.
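As a rough illustration of how the positive and negative samples above could be turned into a guideline with an LLM (this is not the exact prompt or code used in `guideline_generation/`; the helper name, prompt wording, and model choice are assumptions), consider:

```python
# Rough sketch of Guideline-P / Guideline-PN style generation (illustrative only;
# the actual prompts live under guideline_generation/prompting/).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def draft_guideline(event_type, positive_samples, negative_samples=None,
                    model="gpt-4o-mini"):  # hypothetical helper and model choice
    prompt = (
        f"Write a concise annotation guideline for the event type '{event_type}'.\n"
        "Positive examples (sentences that contain this event):\n"
        + "\n".join(f"- {s}" for s in positive_samples)
    )
    if negative_samples:  # Guideline-PN: contrast with samples from other event classes
        prompt += (
            "\nNegative examples (sentences from other event types, to sharpen the boundary):\n"
            + "\n".join(f"- {s}" for s in negative_samples)
        )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Example usage with made-up samples:
# guideline = draft_guideline("Meet", ["The leaders met in Geneva."],
#                             ["The army attacked the village."])
```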
Note: We've already included the synthesized guidelines and the available human guidelines in the directory `guideline_generation/synthesize_guidelines/synthesized_guidelines`.
To generate the guidelines, please run the following commands:

```bash
cd guideline_generation
python synthesize_guidelines/create_dictionaries.py --dataset_name <dataset_name>
python prompting/prompt_llms.py                # generates guidelines P, PN, PS
python prompting/prompt_llm_adv_guidelines.py  # generates Int- guidelines
cd ..  # navigate back to the home directory
```

where `<dataset_name>` refers to the dataset for which the guidelines need to be generated (e.g., `ace05-en`), and `<guideline_type>` refers to one of the five variants discussed above, i.e., Guideline-P (P), Guideline-PN (PN), Guideline-PS (PS), Guideline-PN-Int (PNI), or Guideline-PS-Int (PSI).
After the above commands finish, the guidelines will be stored in the file `<output_file>`. Please make sure that your guideline file looks like this:
```json
{
    "EventName1": {
        "description": [
            "One possible definition.",
            "Another variation of the same."
        ],
        "attributes": {
            "mention": "Trigger span of the event.",
            "arg_1": ["One definition for arg_1", "another definition for arg_1"]
        }
    }
}
```

This enables randomized sampling during conversion to avoid overfitting to one phrasing, an approach highlighted in our paper.
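The sketch below (with a hypothetical helper name; the actual conversion is handled by `code_prompts/prepare_dataset.py`) shows how one phrasing per event and per argument could be drawn at random from a guideline file in the format above:

```python
# Minimal sketch (hypothetical helper): draw one guideline phrasing per event and
# per argument from the JSON format shown above, so repeated conversions do not
# always reuse the same wording.
import json
import random

def sample_guideline(guideline_file, event_name):
    with open(guideline_file, encoding="utf-8") as f:
        guidelines = json.load(f)
    entry = guidelines[event_name]
    docstring = random.choice(entry["description"])  # one of the event definitions
    arg_comments = {}
    for arg, definition in entry["attributes"].items():
        # 'mention' holds a single string; other arguments hold a list of variants.
        arg_comments[arg] = (definition if isinstance(definition, str)
                             else random.choice(definition))
    return docstring, arg_comments

# docstring, comments = sample_guideline("guidelines.json", "EventName1")
```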
We first need to make sure that the Python event definitions are importable from the current environment so that the code prompts can be verified.
```bash
cd python_event_defs  # already included in the repo, or generated in Step 1; see "PyCode-TextEE/code_schema_generation/python_event_defs"
export PYCODE_HOME=$(pwd)
export PYTHONPATH=$PYCODE_HOME:$PYTHONPATH
cd ../../  # return to the PyCode home directory
```

Run the following:
```bash
cd code_prompts
python prepare_dataset.py \
    --input_dir <your_dataset_directory> \
    --dataset_name <dataset_name> \
    --annotate_schema <True/False> \       # if unspecified, the schema is left unannotated (the flag defaults to False)
    --guideline_file <guideline_file> \    # if unspecified, guidelines are generated automatically as described in Step 2
    --add_negative_samples <True/False> \  # used to reproduce our LLaMAEvents results
    --output_dir ./processed_code_prompts/
```

| Argument | Description |
|---|---|
| `--input_dir` | Path to TextEE-formatted JSONs (default: `../../TextEE/processed_data`) |
| `--dataset_name` | Name of the dataset to process (e.g., `ace05-en`) |
| `--annotate_schema` | Add class docstrings and inline comments using guidelines (default: `False`) |
| `--guideline_file` | Guideline JSON file for schema annotation (required if `annotate_schema=True`) |
| `--add_negative_samples` | Add negative examples to the training set (default: `False`) |
| `--output_dir` | Where to save the converted code prompts (default: `./processed_code_prompts/`) |
When --annotate_schema=True, we generate prompts like:
```python
@dataclass
class Event(ParentEvent):
    """The event definition."""
    mention: str  # Event trigger definition
    arg_1: List   # Definition of argument 1
    arg_2: List   # Definition of argument 2
```

This format supports LLM-compatible structure learning and improves interpretability.
1. Skip `--guideline_file` and `--annotate_schema` if you're only interested in raw code prompts. If `annotate_schema` is `True` but the `guideline_file` is unspecified or not found, Step 2 will be executed automatically to produce the guideline file.
2. Use `--add_negative_samples` if you want to add a negative sample per instance, similar to DEGREE.
To train the model, you can use the following scripts, which default to LLaMA models as the backbone:

```bash
cd training_scripts
python train_completion.py   # train a chat completion model with LLaMA-3.1-8B as the backbone
```

You can also run the following command to resume training from a checkpoint:

```bash
python resume_from_ckpt.py   # please specify the checkpoint directory in the script; by default, it will download and run LLaMA-3.1-8B
```

Once you've trained your model to generate Python-style event prompts, you can use our evaluation suite in `code_evaluation/` to compute standard precision, recall, and F1 scores via exact-match comparison of predicted and gold structured outputs.
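If you want to inspect raw model outputs before scoring, the hedged sketch below (checkpoint path, prompt file, and generation settings are placeholders, not the repository's inference code) shows how a fine-tuned checkpoint can complete the `result =` line of a code prompt with Hugging Face Transformers:

```python
# Sketch: generate a structured completion from a fine-tuned checkpoint
# (paths and generation settings are placeholders, not the repo's exact setup).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "path/to/your/finetuned-checkpoint"  # hypothetical local checkpoint
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

code_prompt = open("example_code_prompt.txt").read()  # a prompt ending with "result ="
inputs = tokenizer(code_prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Keep only the newly generated tokens, i.e., the model's completion after "result =".
completion = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                              skip_special_tokens=True)
print(completion)
```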
```text
code_evaluation/
├── all_ee_definitions.py   # Event classes copied from schema generation (Step 1)
├── event_scorer.py         # Main evaluation logic
└── utils_typing.py         # Type helper module (attribution to GoLLIE)
```
The core script compares model-generated code prompts with gold ones using Python object introspection.
- Extracts arguments from predicted and gold event objects
- Computes micro/macro F1 across all examples
- Identifies:
- Trigger-level mismatches
- Argument-level hallucinations
- Logs detailed stats (TP / FP / FN per role)
- `compute_f1(...)`: calculates precision, recall, and F1 from match counts
- `extract_objects(...)`: extracts all fields except `mention` to compare arguments
- `micro_ed_scores`: calculates the micro F1 score on the Event Detection task
- `micro_eae_scores`: calculates the micro F1 score on the Event Argument Extraction task
- `micro_e2e_scores`: calculates the micro F1 score on the End-to-End Event Extraction task
- `log_hallucinations_and_mismatches(...)`: logs mismatches such as hallucinated roles
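For reference, the exact-match scores reduce to the standard precision/recall/F1 formulas over true-positive, false-positive, and false-negative counts; the generic sketch below illustrates the computation (it is not necessarily identical to `compute_f1` in `event_scorer.py`):

```python
# Generic exact-match P/R/F1 from match counts (illustrative; see event_scorer.py
# for the repository's own implementation).
def exact_match_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 correctly extracted events, 2 spurious, 4 missed
print(exact_match_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```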
We provide a ready-to-run example in:
demo/e2e_demo.json
This file contains three illustrative cases:
- one fully correct prediction
- one partially correct prediction
- one incorrect prediction
To run the evaluation:
```bash
cd code_evaluation
python event_scorer.py --input_file ./../demo/e2e_demo.json
```