
🧠 Instruction Tuning with Annotation Guidelines for Event Extraction (Findings of ACL 2025)

Efficient and extensible Event Extraction with Code Prompts and Annotation Guidelines, built on top of TextEE.

This repository includes code for converting TextEE-formatted datasets into code prompts, generating annotation guidelines, training LLMs on the resulting prompts, and evaluating their structured outputs.

If you find our work helpful, please cite it:

@inproceedings{srivastava-etal-2025-instruction,
    title = "Instruction-Tuning {LLM}s for Event Extraction with Annotation Guidelines",
    author = "Srivastava, Saurabh  and
      Pati, Sweta  and
      Yao, Ziyu",
    editor = "Che, Wanxiang  and
      Nabende, Joyce  and
      Shutova, Ekaterina  and
      Pilehvar, Mohammad Taher",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2025",
    month = jul,
    year = "2025",
    address = "Vienna, Austria",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-acl.677/",
    pages = "13055--13071",
    ISBN = "979-8-89176-256-5",
    abstract = "In this work, we study the effect of annotation guidelines{--}textual descriptions of event types and arguments, when instruction-tuning large language models for event extraction. We conducted a series of experiments with both human-provided and machine-generated guidelines in both full- and low-data settings. Our results demonstrate the promise of annotation guidelines when there is a decent amount of training data and highlight its effectiveness in improving cross-schema generalization and low-frequency event-type performance."
}

πŸ› οΈ Updates β€’ πŸ“‚ Datasets β€’ βš™οΈ Environment β€’ πŸš€ Running β€’ 🌐 Website β€’ πŸ“„ Paper


Authors:
Saurabh Srivastava, Sweta Pati, Ziyu Yao


🧩 Introduction

PyCode-TextEE extends TextEE, bringing event extraction into the era of prompt-based large language models.

While TextEE standardizes 10+ event extraction datasets into a unified JSON format, making them reproducible and comparable, PyCode-TextEE takes the next leap:

✨ We transform TextEE-formatted data into code-style prompts, a format that is both readable and executable, well suited to LLMs and ideal for structured evaluation. In addition, we annotate the code prompts with annotation guidelines. Below, we provide an example of a code prompt and show how annotation guidelines are integrated within it:

What are Code Prompts and Annotation Guidelines?

  • Code prompting is a technique that enhances reasoning abilities in text+code LLMs by transforming natural language (NL) tasks into code representations. Instead of executing the code, the model uses it as a structured input format to reason and generate answers. Labels such as event classes and arguments are represented as Python classes, and the guidelines or instructions are introduced as docstrings. The model starts generating after the result = line.

  • Annotation Guidelines involve defining how to identify and classify events and their arguments within a text or other data. These guidelines help ensure consistency and quality in the annotation process, which is crucial for training machine learning models for event extraction. The performance of current SoTA models heavily depends on the quantity of human-annotated data, as the model learns the guidelines from these examples.

⚠️ Note that not all datasets release their annotation guidelines. We provide code to generate these annotation guidelines automatically using a few training samples.

An example of a code prompt with annotation guidelines is shown below:

# This is an event extraction task where the goal is to extract structured events from the text. A structured event contains an event trigger word, an event type, the arguments participating in the event, and their roles in the event. For each different event type, please output the extracted information from the text into python-style dictionaries where the first key will be 'mention' with the value of the event trigger. Next, please output the arguments and their roles following the same format. The event type definitions and their argument roles are defined next.

# Here are the event definitions:

@dataclass
class Meet(ContactEvent):
    """A 'Meet(ContactEvent)' is triggered by interactions where individuals or groups come together for a specific purpose, either physically or virtually. This event involves direct interaction, distinguishing it from remote communication events like 'PhoneWrite'. It encompasses formal and informal gatherings such as diplomatic talks, business meetings, press conferences, and forums, but excludes casual or unplanned encounters."""
    mention: str  # The text span that triggers the event.
    entity: List  # Entities are individuals, groups, organizations, or countries participating in the meeting. They represent the participants involved in the event.
    place: List  # The place is the location where the meeting occurs, providing context for the event. It can be a city, building, specific venue, or virtual platform.
    time: List   # The time at which the meeting takes place (e.g., a date or a relative expression such as "10 days").

# This is the text to analyze
text = "The meeting concluded with the delegates voting by show of hands to meet again in 10 days."

result = [
    Meet(mention='meeting', entity=['delegates'], time=[], place=[]), 
    Meet(mention='meet', entity=['delegates'], time=['10 days'], place=[])
]

PyCode-TextEE transforms EE datasets into the above format, which has been shown to perform well with LLMs. For more details, please refer to our paper, Instruction-Tuning LLMs for Event Extraction with Annotation Guidelines.


🚀 What's New in PyCode-TextEE?

  • CodePrompt Format Conversion
    We convert event structures (event triggers, and arguments if available) into Python-like prompts (e.g., Attack(mention="...", attacker=[...], target=[...])) to help LLMs handle structured outputs.

  • Annotation Guideline Generation
    While annotation guidelines have helped LLMs achieve SOTA results for EE, previous approaches assume that these guidelines are available, which is not always true. We take the next step of generating these guidelines automatically from a few training samples.

  • Plug-and-Play with TextEE
    Directly load standardized datasets from TextEE and transform them with one command into training-ready CodePrompts.

  • Evaluation Toolkit for Prompted LLMs
    We provide exact-match evaluation utilities that compute precision, recall, and F1 scores over structured LLM outputs.

  • Code to Reproduce LLaMAEvents
    Includes all data transformations and training scripts used for our paper on utilizing code prompts and annotation guidelines. Code for that will live in LLaMAEvents/.

πŸ› οΈ Updates

  • April 23, 2025: We release PyCode-TextEE, a modular framework for converting standardized event extraction datasets (via TextEE) into code-style prompts, along with exact-match evaluation scripts.
    Feel free to reach out if you'd like to contribute your models, datasets, or ideas!

📂 Supported Datasets

We support 15 datasets for Event Detection (ED), Event Argument Extraction (EAE), and End-to-End (E2E) Event Extraction. All are converted into code-style prompts and support evaluation using our exact-match metric suite.

The table below also shows whether annotation guidelines are included for each dataset (🔘 = guidelines available, ⚪️ = not available).

Dataset | Task(s) | Paper Title | Source | Guidelines
ACE05 | ED, EAE, E2E | The Automatic Content Extraction (ACE) Program | LDC | 🔘
ERE | ED, EAE, E2E | From Light to Rich ERE | LDC | 🔘
MLEE | ED, EAE, E2E | Biological Event Extraction | Bioinformatics | ⚪️
Genia2011 | ED, EAE, E2E | Genia Event Task (2011) | BioNLP 2011 | ⚪️
Genia2013 | ED, EAE, E2E | Genia Event Task (2013) | BioNLP 2013 | ⚪️
M2E2 | ED, EAE, E2E | Cross-media Structured Common Space | ACL 2020 | ⚪️
CASIE | ED, EAE, E2E | CASIE: Cybersecurity Event Extraction | AAAI 2020 | ⚪️
PHEE | ED, EAE, E2E | Pharmacovigilance Event Extraction | EMNLP 2022 | ⚪️
MEE | ED | Multilingual Event Extraction | EMNLP 2022 | ⚪️
FewEvent | ED | Few-Shot Event Detection | WSDM 2020 | ⚪️
MAVEN | ED | Massive General-Domain ED | EMNLP 2020 | ⚪️
SPEED | ED | ED from Social Media for Epidemic Prediction | NAACL 2024 | ⚪️
MUC-4 | EAE | Fourth Message Understanding Conference | MUC 1992 | ⚪️
RAMS | EAE | Multi-Sentence Argument Linking | ACL 2020 | 🔘
WikiEvents | EAE | Conditional Generation for Doc-level EAE | NAACL 2021 | 🔘
GENEVA | EAE | Benchmarking Generalizability for EAE | ACL 2023 | 🔘

βš™οΈ Environment

PyCode-TextEE requires only a minimal set of packages; we recommend using Python 3.9+ with a clean virtual environment (e.g., via venv or conda).

🔹 Install Dependencies

# Clone the repo
git clone https://github.com/Ziyu-Yao-NLP-Lab/PyCode-TextEE.git
cd PyCode-TextEE

# Create a virtual environment (optional)
python3 -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

# Install requirements (optional)
pip install -r requirements.txt

🔹 Core Dependencies

These are the minimal dependencies to run the code.

  • datasets
  • openai # (used for guideline generation)
  • wandb # (optional for experiment tracking)

⚠️ Note

Some datasets (e.g., ACE, ERE) require an LDC license to access the raw files. We provide code for preprocessing them, but not the data itself.

🚀 Running the Code

Below is a step-by-step guide to run PyCode-TextEE.
Our pipeline is divided into 4 main stages:


✣ Step 0: Obtaining the TextEE-Format Dataset

Our code accepts data formatted by TextEE pre-processing. Please follow the instructions in the data directory of the TextEE repo.

After running TextEE, make sure your data is saved in the following structure:

📂 Expected dataset layout:

<your_dataset_dir>/
├── ace05-en/
│   ├── split1/
│   │   ├── train.json
│   │   ├── dev.json
│   │   └── test.json
│   └── ...
├── casie/
└── ...
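
If you want to quickly confirm the layout before converting, a minimal sanity check might look like the following (the path is a placeholder, and the assumption that each split file is JSON Lines is ours; adjust if your copy is a single JSON array):

import json

# Placeholder path; point this at one of the splits produced by TextEE.
path = "<your_dataset_dir>/ace05-en/split1/train.json"

with open(path, "r", encoding="utf-8") as f:
    # Assumption: one JSON object per line (JSON Lines). If the file is a single
    # JSON array instead, replace this with `examples = json.load(f)`.
    examples = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(examples)} training instances from {path}")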

🔹 Step 1: [Optional] Generate Code Schema

If you're working with custom datasets (or want to regenerate schemas for the 15 supported ones), you'll first convert them into TextEE format and generate the corresponding Python-style event definitions.

πŸ“ Directory structure:

PyCode-TextEE/
├── code_schema_generation/
│   ├── generate_schema.py
│   ├── init_prompts/            # Contains per-dataset class schemas (*.txt)
│   ├── python_event_defs/       # Python classes for eval (dataset-wise + all_ee_definitions.py)
│   ├── mapper.json              # Maps cleaned names → class names
│   └── schema.json              # All cleaned event/arg schemas

🛠 To generate the schema:

cd code_schema_generation
python generate_schema.py --dataset_folder <your_dataset_dir>

👾 Example output schema (for ACE05 Attack event):

@dataclass
class Attack(ConflictEvent):
    mention: str
    target: List
    victim: List
    attacker: List
    instrument: List
    place: List
    agent: List
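
The generated definitions are plain Python dataclasses. As a rough, self-contained sketch of how such a class can be declared and instantiated (the Event/ConflictEvent scaffolding, default handling, and example values here are our assumptions, not the exact code in python_event_defs):

from dataclasses import dataclass, field
from typing import List

@dataclass
class Event:
    mention: str  # the trigger span shared by all event types

@dataclass
class ConflictEvent(Event):
    pass

@dataclass
class Attack(ConflictEvent):
    target: List = field(default_factory=list)
    victim: List = field(default_factory=list)
    attacker: List = field(default_factory=list)
    instrument: List = field(default_factory=list)
    place: List = field(default_factory=list)
    agent: List = field(default_factory=list)

# An extracted instance is then written in the same style as the "result = [...]"
# lines in the code prompts (values below are purely illustrative):
result = [
    Attack(mention="bombing", attacker=["insurgents"], place=["the capital"]),
]
print(result)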

Note: We've already generated schemas for all 15 supported datasets. This step is only required for new datasets.


🔹 Step 2: Generating Annotation Guidelines from a Few Training Samples

While code prompts convert EE datasets into a structured format, annotating the schema with guidelines helps LLMs understand event and argument definitions. As shown in our paper, schemas annotated with these guidelines help us achieve SOTA results with LLaMA-3.1-8B. However, not all datasets release their annotation guidelines; we address this in our paper by proposing 5 different ways to generate them. Specifically, we generate guidelines using the following 5 variants:

  • Guideline-P: Uses training samples from an event class e to generate guidelines. We denote such instances as positive samples in our approach.
  • Guideline-PN: In addition to positive training samples, we also utilize 15 negative samples from different event classes to generate guidelines.
  • Guideline-PS: We designate sibling event classes in the event schema as negative samples and utilize them to generate guidelines.
  • Guideline-PN-Int and Guideline-PS-Int: Two more variants that integrate the 5 diverse guideline samples from Guideline-PN and Guideline-PS, respectively, into a single comprehensive guideline.

Note: We've already included the synthesized guidelines and the available human guidelines in the directory guideline_generation/synthesize_guidelines/synthesized_guidelines.

To generate the guidelines, please run the following command:

cd guideline_generation
python synthesize_guidelines/create_dictionaries.py --dataset_name <dataset_name>
python prompting/prompt_llms.py  # generates guidelines P, PN, PS
python prompting/prompt_llm_adv_guidelines.py  # generates the -Int guidelines
cd ..  # navigate back to the home directory

where <dataset_name> refers to the dataset for which the guidelines need to be generated (e.g., ace05-en), and <guideline_type> refers to one of the 5 variants discussed above, i.e., Guideline-P (P), Guideline-PN (PN), Guideline-PS (PS), Guideline-PN-Int (PNI), or Guideline-PS-Int (PSI).

📘 Guideline File Format

After the above commands finish, the guidelines will be stored in <output_file>. Please make sure your guideline file looks like:

{
  "EventName1": {
    "description": [
      "One possible definition.",
      "Another variation of the same."
    ],
    "attributes": {
      "mention": "Trigger span of the event.",
      "arg_1": ["One definition for arg_1", "another definition for arg_1"]
    }
  }
}

This enables randomized sampling during conversion to avoid overfitting to one phrasing, an approach highlighted in our paper.
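
For illustration, a minimal sampling routine over this file format might look like the following (the file name and function names are ours, not the repository's API):

import json
import random

def load_guidelines(path: str) -> dict:
    # The file follows the structure shown above: per-event "description" lists
    # and per-argument "attributes" entries.
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def sample_event_docstring(guidelines: dict, event_name: str) -> str:
    # Pick one of the alternative event definitions at random.
    return random.choice(guidelines[event_name]["description"])

def sample_argument_comment(guidelines: dict, event_name: str, arg: str) -> str:
    value = guidelines[event_name]["attributes"][arg]
    # "mention" is stored as a single string; other arguments may list several phrasings.
    return value if isinstance(value, str) else random.choice(value)

guidelines = load_guidelines("synthesized_guidelines.json")  # placeholder file name
print(sample_event_docstring(guidelines, "EventName1"))
print(sample_argument_comment(guidelines, "EventName1", "arg_1"))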


🔹 Step 3: Obtaining Code Prompts

We first need to make sure that the Python event definitions are importable in the current environment so that the code prompts can be verified.

cd code_schema_generation/python_event_defs  # this directory is already included in the repo ("PyCode-TextEE/code_schema_generation/python_event_defs"), or can be generated using Step 1
export PYCODE_HOME=$(pwd)
export PYTHONPATH=$PYCODE_HOME:$PYTHONPATH
cd ../../  # navigate back to the PyCode home directory
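
As a quick (hypothetical) check that the definitions are now importable, you can try loading the combined module from Step 1:

# Run from anywhere after the exports above; all_ee_definitions.py lives in python_event_defs.
import importlib

module = importlib.import_module("all_ee_definitions")
names = [name for name in dir(module) if not name.startswith("_")]
print(f"Loaded {len(names)} event definitions, e.g.: {names[:5]}")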

Run the following:

cd code_prompts
python prepare_dataset.py \
    --input_dir <your_dataset_directory> \
    --dataset_name <dataset_name> \
    --annotate_schema <True/False> \
    --guideline_file <guideline_file> \
    --add_negative_samples <True/False> \
    --output_dir ./processed_code_prompts/

# If --annotate_schema is unspecified, the schema is left unannotated (the flag defaults to False).
# If --guideline_file is unspecified, the guidelines are generated automatically as described in Step 2.
# --add_negative_samples is used to reproduce our LLaMAEvents results.

βš™οΈ Argument Descriptions

Argument | Description
--input_dir | Path to TextEE-formatted JSONs (default: ../../TextEE/processed_data)
--dataset_name | Name of the dataset to process (e.g., ace05-en)
--annotate_schema | Add class docstrings and inline comments using guidelines (default: False)
--guideline_file | Guideline JSON file for schema annotation (required if annotate_schema=True)
--add_negative_samples | Add negative examples to the training set (default: False)
--output_dir | Where to save the converted code prompts (default: ./processed_code_prompts/)

🧬 Annotated Prompt Example (with Guidelines)

When --annotate_schema=True, we generate prompts like:

@dataclass
class Event(ParentEvent):
    """the event definition"""
    mention: str  # Event trigger definition
    arg_1: List   # Definition of argument 1
    arg_2: List   # Definition of argument 2

This format supports LLM-compatible structure learning and improves interpretability.


💡 Tip

⓵ Skip --guideline_file and --annotate_schema if you're only interested in raw code prompts. If annotate_schema is True but the guideline_file is unspecified or not found, Step 2 will be executed automatically to produce the guideline file.

⓶ Use --add_negative_samples if you want to add a negative sample per instance, similar to DEGREE.


🔹 Step 4: Training Models

To train the model, you can use the following scripts, which use LLaMA models as the default backbone; simply run:

cd training_scripts
python train_completion.py # train a chat completion model with LLaMA-3.1-8B as backbone
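
Conceptually, each code prompt is split into an input (everything up to and including "result = [") and a target completion (the gold event objects), since the model starts generating after the result = line. A rough sketch of that split follows (the "prompt"/"completion" field names are our assumptions, not necessarily what train_completion.py expects):

def to_training_pair(code_prompt: str) -> dict:
    # Split a full code prompt (schema + guidelines + text + gold result) into a
    # completion-style training pair. Hypothetical helper for illustration only.
    anchor = "result = ["
    cut = code_prompt.index(anchor) + len(anchor)
    return {
        "prompt": code_prompt[:cut],       # schema, guidelines, text, and "result = ["
        "completion": code_prompt[cut:],   # the gold event objects the model must generate
    }

example = 'text = "..."\n\nresult = [\n    Meet(mention="meeting", entity=["delegates"]),\n]\n'
pair = to_training_pair(example)
print(pair["prompt"])
print(pair["completion"])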

You can also run the following command to resume training from a checkpoint:

python resume_from_ckpt.py # please specify the checkpoint directory in the script. By default, it will download and run LLaMA-3.1-8B

🧪 Evaluation

Once you've trained your model to generate Python-style event prompts, you can use our evaluation suite in code_evaluation/ to compute standard precision, recall, and F1 scores via exact-match comparison of predicted and gold structured outputs.

πŸ“ Directory Overview

code_evaluation/
├── all_ee_definitions.py     # Event classes copied from schema generation (Step 1)
├── event_scorer.py           # 🔥 Main evaluation logic
└── utils_typing.py           # Attribution to GoLLIE; type helper module

📊 event_scorer.py: Evaluation in a Nutshell

The core script compares model-generated code prompts with gold ones using Python object introspection.
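
Roughly, the comparison can be pictured as introspecting each predicted and gold dataclass instance into its event type, trigger, and argument fields, then matching them exactly. A simplified sketch (the function name is borrowed from the core functions listed below, but the body is our illustration, not the repository's implementation):

from dataclasses import asdict, is_dataclass

def extract_objects(events):
    # Turn each event dataclass into (event_type, trigger, {role: fillers}) so that
    # predictions and gold can be compared by exact match. Illustrative only.
    extracted = []
    for event in events:
        if not is_dataclass(event):
            continue  # skip anything the model generated that is not a known event class
        fields = asdict(event)
        trigger = fields.pop("mention", None)
        extracted.append((type(event).__name__, trigger, fields))
    return extracted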

✅ Key Features:

  • Extracts arguments from predicted and gold event objects
  • Computes micro/macro F1 across all examples
  • Identifies:
    • Trigger-level mismatches
    • Argument-level hallucinations
  • Logs detailed stats (TP / FP / FN per role)

🎯 Core Functions:

  • compute_f1(...): calculates precision, recall, and F1 from match counts (a minimal sketch follows this list)
  • extract_objects(...): extracts all fields except mention to compare arguments
  • micro_ed_scores: calculates the micro F1 score for the Event Detection task
  • micro_eae_scores: calculates the micro F1 score for the Event Argument Extraction task
  • micro_e2e_scores: calculates the micro F1 score for the End-to-End Event Extraction task
  • log_hallucinations_and_mismatches(...): logs mismatches such as hallucinated roles
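
For reference, the metric itself is the standard exact-match precision/recall/F1 computed from match counts; a minimal sketch (the real compute_f1 in event_scorer.py may use a different signature):

def compute_f1(num_matched: int, num_predicted: int, num_gold: int):
    # Exact-match scoring: a prediction counts only if it matches a gold item exactly.
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Example: 8 exact matches out of 10 predictions against 12 gold events.
print(compute_f1(8, 10, 12))  # (0.8, 0.666..., 0.727...)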

🧪 Run the Demo Evaluation

We provide a ready-to-run example in:

demo/e2e_demo.json

This file contains three illustrative cases:

  • ✅ One fully correct prediction
  • 🟡 One partially correct prediction
  • ❌ One incorrect prediction

To run the evaluation:

cd code_evaluation
python event_scorer.py --input_file ./../demo/e2e_demo.json
