
Code Lingua Leaderboard

🚨 WIP: Artifacts for the leaderboard are expected to be completed soon 🚨

The Code Lingua leaderboard evaluates LLMs on programming language translation. While other leaderboards assess the ability of LLMs to understand natural language (NL) for code synthesis, the ultimate way to assess whether an LLM understands code syntax and semantics is code translation. Code Lingua serves as such a leaderboard: it compares the ability of LLMs to understand what code implements in a source language and to reproduce the same semantics in a target language.

Requirements

Execute the following to install all requirements:

pip3 install -r requirements.txt
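
If you want to keep the dependencies isolated, one common (and entirely optional) approach is to install them into a virtual environment first; the .venv directory name below is just a convention, not something the repository requires:

python3 -m venv .venv             # create a local virtual environment
source .venv/bin/activate         # activate it in the current shell
pip3 install -r requirements.txt  # install the requirements into it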

Docker

To create a Docker image, execute the following:

docker build -t codetlingua .
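
To work inside the image, you can then start a container from it. The entrypoint is not documented here, so the command below simply assumes the image provides a shell:

docker run -it --rm codetlingua bash   # open an interactive shell in the container (assumes bash is available in the image)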

Dataset

The datasets used in this study are available on HuggingFace. The current version of the leaderboard consists of the following datasets:

  1. CodeNet:
  • PLs: C, C++, Go, Java, Python
  • # Samples / Language: 200
  • # Tests / Sample: 1
  2. AVATAR:
  • PLs: Java, Python
  • # Samples / Language: 250
  • # Tests / Sample: ~50

Closed models API calls

In order to use GPT, Claude and Gemini, the following environment variables must be set before running the code.

  • GPT: OPENAI_API_KEY
  • Claude: ANTHROPIC_KEY
  • Gemini: GEMINI_KEY
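
For example, in a POSIX shell the keys can be exported before invoking the scripts (the values below are placeholders, not real keys):

export OPENAI_API_KEY="sk-..."   # for GPT models
export ANTHROPIC_KEY="..."       # for Claude models
export GEMINI_KEY="..."          # for Gemini models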

The current version has been tested with gpt3.5-turbo, gpt4, gpt-4-0125, gemini-pro (1.0), and claude-3-opus-20240229.

Evaluating a new model?

The artifacts of Code Lingua include multiple modules that can be used to evaluate new LLMs on our benchmarks. You can either use our artifacts to evaluate your model yourself, or file a request so that we can evaluate your model and add it to our leaderboard.

Translation

The first step is to use the model to generate raw translations. Please see the translate.sh script for details on how translations are generated. A sample translation command is provided below:

bash scripts/translate.sh deepseek-coder-1.3b-instruct codenet Java Python deepseek 0.2 10 16 1024 3 0
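
Based on the sample above, the arguments appear to be positional (model, dataset, source language, target language, followed by engine- and decoding-related settings). Assuming that layout, running the same model on AVATAR in the opposite direction (Python to Java) might look like the following; this is a sketch, not a command taken from the repository:

bash scripts/translate.sh deepseek-coder-1.3b-instruct avatar Python Java deepseek 0.2 10 16 1024 3 0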

Sanitization

The raw translations generated by LLMs contain extra template-related tokens and natural language. Please see the sanitize.sh script for details on how to sanitize the generated translations. A sample sanitization command is provided below:

bash scripts/sanitize.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 remove_prompt
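
The sanitization arguments appear to mirror the translation run (output directory, model, dataset, language pair, temperature, and sanitization mode). Under that assumption, the step matching the hypothetical AVATAR Python-to-Java run sketched above would be:

bash scripts/sanitize.sh translations deepseek-coder-1.3b-instruct avatar Python Java 0.2 remove_prompt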

Evaluation

The final step is to evaluate the correctness of the sanitized translations. Please check the evaluate.sh script for details on how to run the test suites against the translations. A sample evaluation command is given below:

bash scripts/evaluate.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 8
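
Putting the three steps together, a complete run for the CodeNet Java-to-Python example used throughout this README simply chains the commands above in order (translate, then sanitize, then evaluate):

bash scripts/translate.sh deepseek-coder-1.3b-instruct codenet Java Python deepseek 0.2 10 16 1024 3 0
bash scripts/sanitize.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 remove_prompt
bash scripts/evaluate.sh translations deepseek-coder-1.3b-instruct codenet Java Python 0.2 8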

Contact Us

The artifacts of the Code Lingua leaderboard are continuously being improved. If you see any inconsistencies, please feel free to open a PR or contact Ali ([email protected]).
