# Paper2Code

A system that automatically converts research papers into executable code using Large Language Models (LLMs). It implements a three-stage pipeline (Plan → Analyze → Code) to reproduce the research methodology described in an academic paper.
## Features

- Three-Stage Pipeline: Plan → Analyze → Code methodology for comprehensive code generation
- Multi-Format Support: Handles both JSON and LaTeX (.tex) paper formats
- LLM Integration: Supports various LLM providers including Azure OpenAI and Zhipu AI
- Modular Architecture: Clean separation of concerns with extensible design
- Configuration-Driven: YAML-based configuration for easy customization
- Output Organization: Structured output with planning documents, analysis files, and generated source code
## Project Structure

```
paper-code/
├── src/
│   └── paper2code/
│       ├── __init__.py
│       ├── config.py            # System prompts and configuration templates
│       ├── paper_planning.py    # Planning stage implementation
│       ├── paper_analyzing.py   # Analysis stage implementation
│       ├── paper_coding.py      # Code generation stage implementation
│       ├── register.py          # NAT function registration
│       ├── tex2json.py          # LaTeX to JSON conversion utility
│       └── configs/
│           └── config.yml       # Default configuration file
├── outputs/                     # Generated code and analysis output directory
└── README.md
```
## Pipeline Stages

1. Planning Stage (`paper_planning.py`)
   - Analyzes the research paper's methodology
   - Creates a comprehensive implementation plan
   - Generates the architecture design with data structures and interfaces
   - Produces a task list with dependency analysis

2. Analysis Stage (`paper_analyzing.py`)
   - Conducts a detailed logic analysis for each component
   - Creates detailed implementation guidance
   - Ensures alignment with the paper's methodology
   - Generates an analysis document for each code file

3. Code Generation Stage (`paper_coding.py`)
   - Writes modular, maintainable Python code
   - Follows Google-style coding guidelines
   - Implements complete, reusable code snippets
   - Generates executable source code with proper imports and type hints
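To make the hand-off between stages concrete, here is a minimal sketch of how the three stages chain together. The function names (`run_planning`, `run_analysis`, `run_coding`) are illustrative assumptions; the actual entry points live in the three stage modules.

```python
import asyncio

# Assumed entry-point names for illustration -- the real ones are defined in
# paper_planning.py, paper_analyzing.py, and paper_coding.py.
from paper2code.paper_planning import run_planning
from paper2code.paper_analyzing import run_analysis
from paper2code.paper_coding import run_coding


async def paper_to_code(paper: dict, output_dir: str) -> None:
    """Each stage consumes the artifacts produced by the previous one."""
    plan = await run_planning(paper, output_dir)             # planning.md, task_list.json, ...
    analyses = await run_analysis(paper, plan, output_dir)   # per-file analysis documents
    await run_coding(paper, plan, analyses, output_dir)      # outputs/code/*.py


if __name__ == "__main__":
    asyncio.run(paper_to_code({"title": "...", "sections": []}, "outputs"))
```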
## Core Components

- Configuration System (`config.py`): Contains system prompts and templates for consistent output formatting
- LaTeX Processing (`tex2json.py`): Converts LaTeX papers to JSON format for easier processing (see the sketch below)
- NAT Integration (`register.py`): Registers the workflow with the NVIDIA AI Agent Toolkit framework
- LLM Abstraction: Supports multiple LLM providers through a unified interface
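As a rough illustration of what the LaTeX-to-JSON step involves, here is a minimal sketch. The regexes and output schema are assumptions for illustration; the real `tex2json.py` may parse far more structure (abstract, equations, citations).

```python
import json
import re


def tex_to_json(tex_source: str) -> dict:
    """Split a LaTeX paper into a title plus a list of sections (sketch only)."""
    title_match = re.search(r"\\title\{([^}]*)\}", tex_source)
    sections = []
    # Capture each \section{...} and its body up to the next \section or EOF.
    for match in re.finditer(
        r"\\section\{([^}]*)\}(.*?)(?=\\section\{|\Z)", tex_source, re.DOTALL
    ):
        sections.append({"heading": match.group(1), "text": match.group(2).strip()})
    return {"title": title_match.group(1) if title_match else "", "sections": sections}


if __name__ == "__main__":
    with open("paper.tex", encoding="utf-8") as f:
        print(json.dumps(tex_to_json(f.read()), indent=2))
```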
## Prerequisites

- Python 3.12+
- NVIDIA AI Agent Toolkit
- uv package manager
## Installation

1. Clone the project

   ```bash
   git clone <repository-url>
   cd paper-code
   ```

2. Set up a virtual environment

   ```bash
   uv venv .venv -p python3.12
   source .venv/bin/activate
   ```

3. Install the NVIDIA AI Agent Toolkit

   ```bash
   git clone https://github.com/NVIDIA/NeMo-Agent-Toolkit aiqtoolkit --recursive
   cd aiqtoolkit
   uv pip install ".[langchain]"
   ```

4. Install the Paper2Code package

   ```bash
   cd <this_git_repo_root_path>
   uv pip install -e .
   ```

## Usage

```bash
nat run --config_file src/paper2code/configs/config.yml --input "path/to/paper.tex"
```

## Configuration

The system uses YAML configuration files to define:
- LLM Settings: Model selection and parameters
- Output Directory: Location for generated files
- Workflow Parameters: Customizable processing options
Example configuration (`src/paper2code/configs/config.yml`):

```yaml
general:
  use_uvloop: true
  logging:
    console:
      level: WARN

llms:
  azure_openai_llm:
    _type: azure_openai
    model_name: gpt-4o
    azure_deployment: gpt-4o
  zhipu_llm:
    _type: openai
    base_url: https://open.bigmodel.cn/api/paas/v4
    model_name: glm-4.5

workflow:
  _type: paper2code
  llm_name: azure_openai_llm
  output_directory: outputs
  # file_list_msg: prompt for file list
  # task_list_msg: prompt for task list
  # config_msg: config prompt
```

## Input Formats

The system supports two input formats:
- JSON Format: Direct JSON representation of paper content
- LaTeX Format: Academic papers in LaTeX format (automatically converted to JSON)
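A minimal sketch of how an input file might be dispatched on its extension before the pipeline runs (the `tex_to_json` helper is the illustrative function from the earlier sketch, not necessarily the module's actual API):

```python
import json
from pathlib import Path

from paper2code.tex2json import tex_to_json  # assumed helper name


def load_paper(path: str) -> dict:
    """Return the paper as a dict, converting LaTeX to JSON when needed."""
    p = Path(path)
    if p.suffix == ".json":
        return json.loads(p.read_text(encoding="utf-8"))
    if p.suffix == ".tex":
        return tex_to_json(p.read_text(encoding="utf-8"))
    raise ValueError(f"Unsupported input format: {p.suffix}")
```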
## Example Run

Example output log:

```text
> nat run --config_file src/paper2code/configs/config.yml --input paper.tex
2025-09-06 17:00:49,968 - nat.cli.commands.start - INFO - Starting NAT from config file: 'src/paper2code/configs/config.yml'

Configuration Summary:
--------------------
Workflow Type: paper2code
Number of Functions: 0
Number of LLMs: 2
Number of Embedders: 0
Number of Memory: 0
Number of Object Stores: 0
Number of Retrievers: 0
Number of TTC Strategies: 0
Number of Authentication Providers: 0

2025-09-06 17:00:50,443 - paper2code.paper_planning - INFO - Start planning.
2025-09-06 17:00:50,443 - paper2code.paper_planning - INFO - [PLANNING] (1/4): Overall plan.
2025-09-06 17:01:16,203 - paper2code.paper_planning - INFO - [PLANNING] (2/4): Architecture design.
2025-09-06 17:01:25,494 - paper2code.paper_planning - INFO - [PLANNING] (3/4): Logic design.
2025-09-06 17:01:33,035 - paper2code.paper_planning - INFO - [PLANNING] (4/4): Configuration file generation.
2025-09-06 17:01:38,397 - paper2code.paper_planning - INFO - ✅ Paper planning finished.
2025-09-06 17:01:38,397 - paper2code.paper_analyzing - INFO - Start analyzing.
2025-09-06 17:01:38,397 - paper2code.paper_analyzing - INFO - [ANALYSIS] (1/6): utils.py.
2025-09-06 17:01:59,252 - paper2code.paper_analyzing - INFO - [ANALYSIS] (2/6): dataset_loader.py.
2025-09-06 17:02:21,041 - paper2code.paper_analyzing - INFO - [ANALYSIS] (3/6): model.py.
2025-09-06 17:02:46,550 - paper2code.paper_analyzing - INFO - [ANALYSIS] (4/6): trainer.py.
2025-09-06 17:03:09,380 - paper2code.paper_analyzing - INFO - [ANALYSIS] (5/6): evaluation.py.
2025-09-06 17:03:32,340 - paper2code.paper_analyzing - INFO - [ANALYSIS] (6/6): main.py.
2025-09-06 17:03:52,493 - paper2code.paper_analyzing - INFO - ✅ Paper analyzing finished.
2025-09-06 17:03:52,493 - paper2code.paper_coding - INFO - Start coding.
2025-09-06 17:03:52,494 - paper2code.paper_coding - INFO - [CODING](1/6): outputs/code/utils.py
2025-09-06 17:04:11,233 - paper2code.paper_coding - INFO - [CODING](2/6): outputs/code/dataset_loader.py
2025-09-06 17:04:33,644 - paper2code.paper_coding - INFO - [CODING](3/6): outputs/code/model.py
2025-09-06 17:04:54,248 - paper2code.paper_coding - INFO - [CODING](4/6): outputs/code/trainer.py
2025-09-06 17:05:09,493 - paper2code.paper_coding - INFO - [CODING](5/6): outputs/code/evaluation.py
2025-09-06 17:05:25,770 - paper2code.paper_coding - INFO - [CODING](6/6): outputs/code/main.py
2025-09-06 17:05:40,277 - paper2code.paper_coding - INFO - ✅ Paper coding finished.
2025-09-06 17:05:40,278 - nat.front_ends.console.console_front_end_plugin - INFO -
--------------------------------------------------
Workflow Result:
['Source code generated Successfully in path: outputs/code.']
--------------------------------------------------
2025-09-06 17:05:40,278 - paper2code.register - INFO - Cleaning up paper2code workflow.
```
## Output Structure

The system generates organized output in the specified directory:

```
outputs/
├── planning.md                       # Overall implementation plan
├── file_list.txt                     # System architecture design
├── task_list.json                    # Detailed task breakdown with logic analysis
├── planning_config.yaml              # Configuration parameters
├── [filename]_simple_analysis.txt    # Logic analysis for each component
└── code/                             # Generated source code directory
    ├── main.py
    ├── dataset_loader.py
    ├── model.py
    ├── trainer.py
    └── evaluation.py
```
## LLM Support

The system supports multiple LLM providers:
- Azure OpenAI: GPT-4, GPT-4o models
- Zhipu AI: GLM-4.5 model
- Extensible to other providers through configuration
## Logging and Error Handling

- Comprehensive logging for each processing stage
- Error handling for invalid input formats
- Progress tracking throughout the pipeline
## Generated Code Quality

- Follows the Google Python Style Guide
- Strong type hints and explicit variable declarations
- Complete import statements and error handling
- Modular, maintainable code architecture
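For a sense of the conventions the generator targets, here is a small hand-written example in that style (an illustration of the idiom, not actual generated output):

```python
def normalize_scores(scores: list[float], epsilon: float = 1e-8) -> list[float]:
    """Scale scores linearly to the [0, 1] range.

    Args:
        scores: Raw, unnormalized scores.
        epsilon: Small constant guarding against division by zero.

    Returns:
        Scores rescaled so the minimum maps to 0 and the maximum to 1.

    Raises:
        ValueError: If `scores` is empty.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    low, high = min(scores), max(scores)
    return [(s - low) / (high - low + epsilon) for s in scores]
```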
## Environment Variables

For Azure OpenAI integration, set the following environment variables:

```bash
export AZURE_OPENAI_ENDPOINT="your_endpoint"
export AZURE_OPENAI_API_KEY="your_api_key"
```

For Zhipu AI's OpenAI-compatible API, set the following environment variable:

```bash
export OPENAI_API_KEY="zhipu_api_key"
```

## Adding New LLM Providers

- Extend the configuration in `config.yml` (see the example below)
- Add provider-specific parameters
- Test with sample papers
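For example, any OpenAI-compatible endpoint can typically be added under `llms` with the generic `openai` type, mirroring the `zhipu_llm` entry in the configuration above. The endpoint and model name here are placeholders, not tested values:

```yaml
llms:
  my_provider_llm:
    _type: openai                          # generic OpenAI-compatible client
    base_url: https://api.example.com/v1   # placeholder endpoint
    model_name: my-model-name              # placeholder model

workflow:
  _type: paper2code
  llm_name: my_provider_llm                # point the workflow at the new provider
```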
## Customizing Prompts

Modify the system prompts in `config.py` to:
- Adjust output style and format
- Include domain-specific requirements
- Customize planning and analysis depth
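As an illustration of what such a prompt template might look like (the variable name and wording are assumptions; the actual prompts live in `config.py`):

```python
# Hypothetical template in the spirit of config.py -- the real prompt text differs.
PLANNING_SYSTEM_PROMPT = """\
You are an expert software architect reproducing a research paper as code.
Given the paper below, produce:
1. A concise implementation plan.
2. A file list with one-line responsibilities.
3. A task list ordered by dependency.

Paper:
{paper_content}
"""


def render_planning_prompt(paper_content: str) -> str:
    """Fill the template with the paper text before sending it to the LLM."""
    return PLANNING_SYSTEM_PROMPT.format(paper_content=paper_content)
```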
## Output Customization

Modify the output directory structure and file naming patterns in the configuration to match your project's requirements.
## Performance Considerations

- Parallel Processing: Uses async/await for efficient LLM API calls (see the sketch below)
- Memory Management: Processes papers in chunks for large documents
- Caching: Optimized for repeated processing with similar papers
- Resource Utilization: Configurable timeout and retry mechanisms
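A minimal sketch of this async pattern with a simple timeout-plus-retry wrapper (`call_llm` is a stand-in for whatever client the toolkit provides, and the timeout/retry values are illustrative defaults):

```python
import asyncio


async def call_llm(prompt: str) -> str:
    """Stand-in for the real LLM client call."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"response to: {prompt[:30]}"


async def call_with_retry(prompt: str, timeout: float = 60.0, retries: int = 3) -> str:
    """Apply a per-call timeout and retry with exponential backoff."""
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(call_llm(prompt), timeout=timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == retries - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # 1s, then 2s backoff
    raise RuntimeError("unreachable")


async def analyze_files(prompts: list[str]) -> list[str]:
    """Fan independent LLM calls out concurrently instead of serially."""
    return list(await asyncio.gather(*(call_with_retry(p) for p in prompts)))


if __name__ == "__main__":
    print(asyncio.run(analyze_files(["analyze utils.py", "analyze model.py"])))
```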
## Dependencies

- NVIDIA AI Agent Toolkit: Core framework integration
- LangChain: LLM orchestration and prompt management
- PyYAML: Configuration file parsing
- Python Standard Library: Core functionality
## Extensibility

The system is designed to be extensible. Key areas for enhancement include:
- Additional input format support
- New LLM provider integrations
- Enhanced code quality metrics
- Multi-language support
- Automated testing framework
## License

This project is part of the NVIDIA AI Agent Toolkit ecosystem and follows the same licensing terms.
Built with ❤️ using NVIDIA AI Agent Toolkit and Large Language Models