This repository contains the implementation of EpiCoder, a pipeline for code instruction synthesis that encompasses diverse and complex code generation. This work is associated with the paper EpiCoder: Encompassing Diversity and Complexity in Code Generation. The data can be found in EpiCoder-func-380k on Hugging Face.
The EpiCoder pipeline consists of multiple stages, starting with feature extraction from raw code and proceeding through feature evolution, code generation, debugging, and testing. Below is a step-by-step breakdown of the pipeline:
Before running the pipeline, set up the PYTHONPATH:
export PYTHONPATH=$(pwd):$PYTHONPATH
python extract/extract_features.py
Extracts features from raw code.
python extract/extract_frequency_count.py
Counts the frequency of features and identifies the most frequent ones, saving the results as a JSON file.
python extract/extract_separate_frequency.py
Separates and saves feature frequencies and features individually.
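The counting and separating steps can be illustrated with a minimal sketch. It is not the repository's implementation: the sample feature lists, the function names, and the flat list-of-strings feature format are all assumptions made for illustration.

```python
import json
from collections import Counter


def count_feature_frequencies(feature_lists):
    """Count how often each feature appears across all code samples."""
    counter = Counter()
    for features in feature_lists:
        counter.update(features)
    return counter


def top_features(counter, k):
    """Return the k most frequent features as a {feature: count} dict."""
    return dict(counter.most_common(k))


# Hypothetical extracted features, one list per raw code sample.
samples = [
    ["file I/O", "sorting", "recursion"],
    ["sorting", "hashing"],
    ["file I/O", "sorting"],
]

counts = count_feature_frequencies(samples)
top = top_features(counts, k=2)

# Serialize the top features as JSON, as the frequency-count step describes.
top_json = json.dumps(top, indent=2)

# Separate the feature names from their frequencies.
feature_names = list(top.keys())
feature_counts = list(top.values())
print(top_json)
```

Here `k` stands in for whatever cutoff the real script uses, which the text above does not specify.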
python evol/feature_evol.py
Evolves the features based on the extracted data.
python evol/merge_expanded_features.py
Merges the evolved features with the original ones.
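As an illustration of the merge step, the sketch below combines an original feature set with an evolved one. It assumes features are stored as nested dictionaries whose leaves are lists of feature names. That layout and the `merge_features` helper are hypothetical and may not match what `evol/merge_expanded_features.py` actually does.

```python
import copy


def merge_features(original, evolved):
    """Recursively merge two feature trees without mutating either input.

    Inner nodes are dicts; leaves are lists of feature names (assumed format).
    """
    merged = {}
    # Keep the original key order, then append keys that only the evolved tree has.
    keys = list(original) + [k for k in evolved if k not in original]
    for key in keys:
        left, right = original.get(key), evolved.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            merged[key] = merge_features(left, right)
        elif isinstance(left, list) and isinstance(right, list):
            # Concatenate and drop duplicates while preserving order.
            merged[key] = list(dict.fromkeys(left + right))
        else:
            # Only one side has this key (or the types differ): prefer evolved.
            merged[key] = copy.deepcopy(left if right is None else right)
    return merged


original = {
    "data processing": {"sorting": ["quick sort"]},
    "file I/O": ["read file"],
}
evolved = {
    "data processing": {
        "sorting": ["quick sort", "merge sort"],
        "filtering": ["filter by key"],
    },
}
merged = merge_features(original, evolved)
```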
python gen/gen_question.py
Generates questions or tasks based on the feature set.
python gen/gen_code.py
Generates code according to the tasks created.
python debug/run_test_iter0.py
Runs in a Docker environment to identify correct code outputs.
python debug/run_test_with_debug_multi_turn.py
Tests the code with multiple debugging iterations.
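The idea of testing with multiple debugging iterations can be sketched as a loop that runs the tests, feeds the error back to a repair step, and retries. This is only a sketch under stated assumptions: it runs the candidate in a local subprocess instead of the Docker environment the pipeline uses, and `repair` is a hypothetical callable standing in for whatever produces the revised code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_tests(code, test_code, timeout=10):
    """Run code plus its tests in a fresh interpreter; return (passed, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(code + "\n\n" + test_code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False, "timeout"
    return result.returncode == 0, result.stderr


def debug_loop(code, test_code, repair, max_turns=3):
    """Test the code, repairing it after each failure for up to max_turns turns.

    Returns (final_code, passed, repair_turns_used).
    """
    for turn in range(max_turns + 1):
        passed, error = run_tests(code, test_code)
        if passed:
            return code, True, turn
        if turn == max_turns:
            break
        # Hypothetical repair step: receives the failing code and the error text.
        code = repair(code, error)
    return code, False, max_turns


# Toy example: a buggy function and a stub repair function that fixes it.
buggy = "def add(a, b):\n    return a - b"
tests = "assert add(2, 3) == 5"
final_code, passed, turns = debug_loop(
    buggy, tests, repair=lambda code, error: code.replace("a - b", "a + b")
)
```

Samples that still fail after `max_turns` repairs would simply be excluded from the final data in this sketch.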
python utils/get_train_data.py
Saves the correct code samples as a JSON file for further analysis or training.
Run the steps in the order listed above so that code generation and testing proceed smoothly. The Docker environment is used to validate the generated code.
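For convenience, the steps above could be chained by a small driver script such as the hypothetical one below. It assumes every script runs from the repository root without command-line arguments, which the text above does not confirm, so check each script before relying on it. The driver defaults to a dry run that only prints the commands.

```python
import os
import subprocess
import sys
from pathlib import Path

# Script paths in the order given in the step-by-step breakdown.
STAGES = [
    "extract/extract_features.py",
    "extract/extract_frequency_count.py",
    "extract/extract_separate_frequency.py",
    "evol/feature_evol.py",
    "evol/merge_expanded_features.py",
    "gen/gen_question.py",
    "gen/gen_code.py",
    "debug/run_test_iter0.py",
    "debug/run_test_with_debug_multi_turn.py",
    "utils/get_train_data.py",
]


def run_pipeline(repo_root=".", dry_run=True):
    """Run each stage in order, stopping at the first failure.

    Returns the list of commands that were (or would be) executed.
    """
    root = Path(repo_root).resolve()
    env = dict(os.environ)
    # Mirror `export PYTHONPATH=$(pwd):$PYTHONPATH`.
    existing = env.get("PYTHONPATH", "")
    env["PYTHONPATH"] = str(root) + (os.pathsep + existing if existing else "")

    commands = [[sys.executable, stage] for stage in STAGES]
    for command in commands:
        print("would run:" if dry_run else "running:", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(command, cwd=root, env=env, check=True)
    return commands


planned = run_pipeline(dry_run=True)
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the stages for real. The debug stages still expect the Docker environment described above.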
For more details, refer to the EpiCoder paper.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.