
Do Code LLMs Do Static Analysis?

Proposed by:

Quick link

To-do list

To set up your local environment, run the following command. We recommend using a virtual environment for the experiments.

pip install -r requirements.txt

For finetuning or running codellama-related experiments, use:

pip install -r requirements_codellama.txt

For finetuning or running jam-related experiments, use:

pip install -r requirements_jam.txt
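
For example, a full setup inside a virtual environment might look like this (a minimal sketch; the directory name venv is arbitrary):

python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt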

Note

You need an API key from OpenAI to run the GPT models and an API key from Google to run the Gemini models.
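
How each script consumes the keys depends on the script itself; exporting them as environment variables is one common convention (the variable names below are assumptions, not taken from this repository):

export OPENAI_API_KEY="sk-..."
export GOOGLE_API_KEY="..."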

You also need to download the CodeLlama 13B model from Meta on Hugging Face, and you must obtain permission to use the model.
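
If you use the Hugging Face CLI, the download might look like this (a sketch; the model id codellama/CodeLlama-13b-Instruct-hf is an assumption, so substitute whichever CodeLlama 13B variant you were granted access to):

huggingface-cli login
huggingface-cli download codellama/CodeLlama-13b-Instruct-hf --local-dir CodeLlama-13b-Instruct-hf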

For GPT and Gemini, *_example.py denotes a script that uses in-context learning, where * is either gpt or gemini; e.g., gpt_example.py runs the GPT models with in-context learning.

We release all of the data and models, along with the results, on our Hugging Face page. See the link for the models and the link for the datasets and results.

AST generation

Generation

  • For AST generation experiments, visit the directory for the desired model. For example, to run the codellama model, visit srcml/java/codellama.
  • Each directory has a corresponding yaml file for configuration.
  • The parameters for generating srcml are as follows (see the sketch after this list):
function_file_dir: file location of the functions
q90_fid_file: file location of the function IDs for the test set
OUT_FILENAME: filename for the output file
model_id: file location of your codellama model
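
A hypothetical yaml file for srcml/java/codellama might look like the sketch below; every path is a placeholder, so check the keys against the yaml file shipped in that directory:

function_file_dir: data/functions.json
q90_fid_file: data/q90_test_fids.txt
OUT_FILENAME: output/srcml_codellama.txt
model_id: /path/to/CodeLlama-13b-Instruct-hf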

Metrics

Run the following command in the srcml/java/ directory:

python3 srcml_metrics.py

The parameters are as follows:

srcml_tools_filename: filename for srcml generated by the tool
srcml_gpt_filename: filename for srcml generated by LLMs
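
Whether these are yaml keys or in-script variables is not stated here; assuming the same yaml convention as generation (an assumption), a placeholder config might read:

srcml_tools_filename: output/srcml_tool.txt
srcml_gpt_filename: output/srcml_llm.txt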

Callgraph generation

Generation

  • For callgraph generation experiments, visit the directory for the desired model. For example, to run the codellama model, visit callgraph/java/codellama.
  • Each directory has a corresponding yaml file for configuration.
  • The parameters for generating call graphs are as follows (see the sketch after this list):
model_id: file location of your codellama model
OUT_FILENAME: filename for the output file
callgraph_data_file: file location of the functions and their called methods
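
A hypothetical yaml file for callgraph/java/codellama, with placeholder paths, might look like:

model_id: /path/to/CodeLlama-13b-Instruct-hf
OUT_FILENAME: output/callgraph_codellama.txt
callgraph_data_file: data/callgraph_functions.json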

Metrics

For models other than codellama, run the following command in the callgraph/java/ directory:

python3 callgraph_metric.py

The parameters are as follows:

dat: filename for the call graph generated by LLMs

For codellama, run the following command in the callgraph/java/codellama/ directory:

python3 callgraph_metric.py

The parameter is as follows:

llm_out_file: filename of the call graph generated by codellama
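
Assuming the same configuration convention as the generation scripts (an assumption), the two metrics setups might be configured with placeholder values like:

# in callgraph/java/ (models other than codellama)
dat: output/callgraph_llm.txt
# in callgraph/java/codellama/
llm_out_file: output/callgraph_codellama.txt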

Dataflow graph generation

Generation

  • For dataflow graph generation experiments, visit the directory for the desired model. For example, to run the codellama model, visit dataflow/java/codellama.
  • Each directory has a corresponding yaml file for configuration.
  • The parameters for generating dataflow graphs are as follows (see the sketch after this list):
model_id: file location of your codellama model
OUT_FILENAME: filename for the output file
datafile: file location of the input functions
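
A hypothetical yaml file for dataflow/java/codellama, again with placeholder paths, might look like:

model_id: /path/to/CodeLlama-13b-Instruct-hf
OUT_FILENAME: output/dataflow_codellama.txt
datafile: data/dataflow_functions.json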

Metrics

For models other than codellama, run the following command in the dataflow/java/ directory:

python3 metric.py

The parameters are as follows:

tools_filename: filename for the dataflow graph generated by the tool
llm_filename: filename for the dataflow graph generated by LLMs

For codellama, run the following command in the dataflow/java/codellama/ directory:

python3 metrics.py

The parameters are as follows:

tools_filename: filename for the dataflow graph generated by the tool
llm_filename: filename for the dataflow graph generated by LLMs
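
Both metrics scripts take the same two parameters; assuming the yaml convention used elsewhere in the repository (an assumption), a placeholder config might read:

tools_filename: output/dataflow_tool.txt
llm_filename: output/dataflow_codellama.txt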

Codellama

Compile data

To compile the train/val/test data for codellama, visit the data directory for the desired task. For example, to compile data for srcml, visit the srcml_data directory.

The command is as follows:

python3 srcml_data_java.py

Finetune

To finetune codellama, visit the finetuning directory for the desired task. For example, to finetune codellama for srcml, visit the srcml_finetune directory.

The command is as follows:

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='2,3' python3 qlora_with_val_srcml_java.py --model_name_or_path CodeLlama-13b-Instruct-hf/ --ddp_find_unused_parameters False --bf16 --dataset data/srcml_java_train.json --max_steps 240 --output_dir srcml_java_pretrained --per_device_train_batch_size 4

Generate

To generate results after finetuning codellama, visit the generation directory for the desired task. For example, to generate srcml after finetuning codellama, visit the srcml_generate directory.

The command is as follows:

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='2,3' python3 generate_srcml_java.py

Metrics

For metrics on code intelligence tasks, please refer to the Jam section.

For metrics on tasks other than code intelligence, visit the metrics directory for the desired task. For example, to compute metrics for srcml, visit the srcml_metrics directory.

The example command is as follows:

python3 srcml_metrics_java.py

Jam

Compile data

For compiling data, look at the data directory for the desired task. For example, look at jam_codegen for the codegen task.

The command for compiling data is as follows:

python3 prepare_fc_raw.py

Finetuning models

For finetuning models, look at the config directory for the desired task. For example, look at codegen_base.py for the codegen base task.

The example command for finetuning the model is as follows:

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='1' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4001 --nnodes=1 --nproc_per_node=1 train.py config/codegen_base.py --freeze=False --outfilename=ckpt.pt --always_save_checkpoint=True

Generate

For generation, look at sample_*.py for the desired task. For example, look at sample_codegen.py for the codegen task.

The example command is as follows:

CUDA_DEVICE_ORDER='PCI_BUS_ID' CUDA_VISIBLE_DEVICES='1' OMP_NUM_THREADS=2 time torchrun --rdzv-backend=c10d --rdzv-endpoint=localhost:4000 --nnodes=1 --nproc_per_node=1 sample_codegen.py config/codegen_base.py --outfilename=ckpt.pt --prediction_filename=predict_codegen_base.txt

Metrics

For metrics, look at metrics_*.py for the desired metric. For example, look at metrics_bleu.py for the BLEU score.

For statistical tests, look at metrics_*_diff.py for the desired metric. For example, look at metrics_bleu_diff.py for the statistical test on BLEU scores.

The example command is as follows:

python3 bleu.py jam_cgpt_predictions/predict_codegen_base.txt
