[WSDM 2023] This is the official PyTorch implementation for the paper: "Revisiting Code Search in a Two-Stage Paradigm" (arXiv url). This project implements the two-stage paradigm (TOSS) on CodeSearchNet: a fast recall model first retrieves candidate codes from the whole corpus, and a more accurate model then re-ranks them.
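To make the paradigm concrete, here is a minimal sketch of the two-stage flow; the function and method names (`two_stage_search`, `top_k`, `rerank`) are illustrative placeholders, not the repository's API:

```python
def two_stage_search(query, code_corpus, recall_model, rerank_model, topK=100):
    # Stage 1: a fast model (e.g., BM25 or a bi-encoder) scores the whole
    # corpus and keeps only the topK candidate codes.
    candidates = recall_model.top_k(query, code_corpus, k=topK)
    # Stage 2: a slower but more accurate model (e.g., a cross-encoder)
    # re-ranks only those topK candidates and returns the final ordering.
    return rerank_model.rerank(query, candidates)
```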
We used Anaconda to set up a deep learning workspace that supports PyTorch. Run the following script to install all the required packages.
```shell
conda create -n TwoStage python==3.8
conda activate TwoStage
download this_project
cd TwoStage
pip install -r requirements.txt
pip install git+https://github.com/casics/spiral.git
```

We follow the GraphCodeBERT pre-processing procedure for CodeSearchNet. The answer to each query is retrieved from the whole development and testing code corpus instead of from 1,000 candidate codes. We put the dataset files in ~/VisualSearch/GraphCodeBERT_dataset.
```shell
mkdir ~/VisualSearch
mkdir ~/VisualSearch/GraphCodeBERT_dataset
```

Please refer to GraphCodeBERT for data pre-processing: https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT/codesearch#data-preprocess
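After pre-processing, each split should be a JSONL file in the CodeSearchNet style. The path and field names below are assumptions (verify them against your generated files), but a quick way to inspect one example is:

```python
import json
import os

# Hypothetical path: adjust to wherever the pre-processing writes its output.
path = os.path.expanduser("~/VisualSearch/GraphCodeBERT_dataset/python/train.jsonl")

with open(path) as f:
    example = json.loads(next(f))  # read the first JSON object

# CodeSearchNet-style fields (assumed; check your own files):
print(example["docstring"])  # the natural-language description used as the query
print(example["code"])       # the paired code snippet
```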
Except for the text-matching methods, the baselines need checkpoints for model weight initialization. Readers can train the baselines by referring to the URLs provided in the paper and put the checkpoint files in ~/VisualSearch/TwoStageModels/. We provide the checkpoints of GraphCodeBERT and CodeBERT on the Python subset.
Google Drive: https://drive.google.com/file/d/1Tfhibit35WzfLEyl0Uf5hGsi5FSIMwna/view?usp=sharing
```shell
unzip -q "TwoStageModels.zip" -d "~/VisualSearch/"
```
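Once unzipped, the checkpoints should load as ordinary PyTorch state dicts; the file name below is hypothetical, so list the contents of ~/VisualSearch/TwoStageModels/ to find the real one:

```python
import os
import torch

# Hypothetical file name: replace with an actual checkpoint under TwoStageModels/.
ckpt_path = os.path.expanduser("~/VisualSearch/TwoStageModels/graphcodebert_python.bin")

# Checkpoints saved via torch.save(model.state_dict(), ...) load as plain dicts.
state_dict = torch.load(ckpt_path, map_location="cpu")
print(list(state_dict.keys())[:5])  # sanity-check the parameter names
```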
If you want to train GraphCodeBERT, CodeBERT, and CODEnn from scratch, you can refer to the following instructions. For GraphCodeBERT, refer to the official implementation: https://github.com/microsoft/CodeBERT/tree/master/GraphCodeBERT/codesearch#data-preprocess.
For CodeBERT and CODEnn, however, our dataset pre-processing and partitioning differ from the official implementations, so please use our provided training scripts:
- CodeBERT: `./shell/TOSS/train_CodeBert_csn.sh`
- CODEnn: `./shell/TOSS/train_CODEnn_csn.sh`
To reproduce the evaluation results, run:

```shell
language="python"
python run.py --lang ${language} --num_workers 16 --device cuda --code_length 256 --batch_size 256
```

On one V100 GPU, we get the following results:
MRR:

| Model | Top 5 | Top 10 | Top 100 |
|---|---|---|---|
| | 0.5428 | 0.5852 | 0.6884 |
| | 0.7392 | 0.7512 | 0.7589 |
| | 0.7553 | 0.7595 | 0.7607 |
Per Query Time (seconds):

| Model | Top 5 | Top 10 | Top 100 |
|---|---|---|---|
| | 7.6 | 12.5 | 114.3 |
| | 54.0 | 59.2 | 161.1 |
| | 61.3 | 72.4 | 277.9 |
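For reference, the MRR and Recall@K numbers reported throughout this README can be computed from the 1-based rank at which each query's ground-truth code is retrieved. A minimal sketch of these metrics (not the repository's evaluation.py):

```python
import numpy as np

def mrr(ranks):
    """Mean Reciprocal Rank over the 1-based rank of the correct code per query."""
    return float(np.mean(1.0 / np.asarray(ranks, dtype=np.float64)))

def recall_at_k(ranks, k):
    """Fraction of queries whose correct code appears within the top k results."""
    return float(np.mean(np.asarray(ranks) <= k))

# Example: the correct code is ranked 1st, 3rd, and 12th for three queries.
print(mrr([1, 3, 12]))              # (1 + 1/3 + 1/12) / 3 ≈ 0.472
print(recall_at_k([1, 3, 12], 10))  # 2 of the 3 queries hit the top 10 ≈ 0.667
```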
After preparing the model checkpoints of GraphCodeBERT and CodeBERT on Ruby, JavaScript, Go, Python, Java, and PHP, we get the following MRR and Recall results.
MRR
| Model / Method | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| BOW | 0.230 | 0.184 | 0.350 | 0.222 | 0.245 | 0.193 | 0.237 |
| TF | 0.239 | 0.204 | 0.363 | 0.240 | 0.262 | 0.215 | 0.254 |
| Jaccard | 0.220 | 0.191 | 0.345 | 0.243 | 0.235 | 0.182 | 0.236 |
| BM25 | 0.505 | 0.393 | 0.572 | 0.452 | 0.426 | 0.335 | 0.447 |
| CODEnn | 0.342 | 0.355 | 0.495 | 0.178 | 0.108 | 0.141 | 0.270 |
| CodeBERT-bi | 0.679 | 0.620 | 0.882 | 0.667 | 0.676 | 0.628 | 0.692 |
| GraphCodeBERT | 0.703 | 0.644 | 0.897 | 0.692 | 0.691 | 0.649 | 0.713 |
| TOSS | 0.765 | 0.696 | 0.918 | 0.760 | 0.750 | 0.692 | 0.763 |
R@1

| Model / Method | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| BOW | 15.2 | 12.3 | 26.1 | 17.7 | 17.0 | 13.2 | 16.9 |
| TF | 15.2 | 13.7 | 26.7 | 16.1 | 18.1 | 14.9 | 17.4 |
| Jaccard | 14.2 | 12.9 | 25.6 | 16.9 | 15.9 | 11.7 | 16.2 |
| BM25 | 39.7 | 29.2 | 45.7 | 35.6 | 32.0 | 23.9 | 34.4 |
| CodeBERT-bi | 55.7 | 49.1 | 83.2 | 57.4 | 56.0 | 50.4 | 58.6 |
| GraphCodeBERT | 60.7 | 54.7 | 85.0 | 59.3 | 59.0 | 54.6 | 62.2 |
| TOSS | 67.9 | 60.4 | 87.4 | 67.0 | 66.1 | 58.6 | 67.9 |
R@5
| Model / Method | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| BOW | 30.5 | 24.2 | 44.2 | 30.7 | 32.0 | 25.3 | 31.1 |
| TF | 32.5 | 26.6 | 46.5 | 28.1 | 34.5 | 27.8 | 32.7 |
| Jaccard | 30.3 | 24.9 | 44.1 | 30.8 | 31.1 | 24.1 | 30.9 |
| BM25 | 63.2 | 51.0 | 71.0 | 56.4 | 54.8 | 44.2 | 56.8 |
| CodeBERT-bi | 79.3 | 72.3 | 94.7 | 77.9 | 78.4 | 73.8 | 79.4 |
| GraphCodeBERT | 83.1 | 76.5 | 95.7 | 82.1 | 81.0 | 77.5 | 82.7 |
| TOSS | 86.0 | 79.6 | 96.3 | 85.6 | 83.9 | 79.4 | 85.1 |
R@10
| Model / Method | Ruby | JavaScript | Go | Python | Java | PHP | Overall |
|---|---|---|---|---|---|---|---|
| BOW | 37.5 | 30.2 | 51.5 | 36.7 | 38.3 | 31.1 | 37.6 |
| TF | 40.5 | 33.3 | 54.3 | 33.6 | 41.6 | 34.2 | 39.6 |
| Jaccard | 38.9 | 31.8 | 51.3 | 37.1 | 38.1 | 30.6 | 38.0 |
| BM25 | 70.9 | 58.5 | 78.5 | 63.4 | 62.4 | 51.8 | 64.3 |
| CodeBERT-bi | 85.6 | 78.6 | 96.8 | 83.3 | 83.8 | 80.8 | 84.8 |
| GraphCodeBERT | 87.5 | 82.8 | 97.4 | 87.3 | 86.2 | 83.9 | 87.5 |
| TOSS | 87.5 | 81.4 | 96.7 | 87.4 | 85.0 | 81.3 | 86.5 |
`StageClass.py` contains the implementations of the two-stage framework and of each model. `evaluation.py` contains the evaluation functions (MRR, Recall@K, and time).
To add a new first-stage model, you just need to add a class in `StageClass.py`. The template is as follows:
```python
class YourFirstStageModel(BaseStageCodeSearch):
    def __init__(self, dataset="GraphCodeBERT_python", code_pre_process_config=[],
                 query_pre_process_config=[], args=None):
        ...

    def get_time_score(self, topK):
        """
        return score_matrix, index, calculate_time
        """
        ...
```
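For illustration, a minimal sketch of a concrete first-stage model is shown below; the token-overlap scoring, the attribute names `self.code_list` / `self.query_list`, and the `super().__init__` call are assumptions about the base class, so adjust them to the actual code in `StageClass.py`:

```python
import time
import numpy as np

class ToyOverlapModel(BaseStageCodeSearch):
    """Hypothetical first-stage model: scores each query-code pair by token overlap."""

    def __init__(self, dataset="GraphCodeBERT_python", code_pre_process_config=[],
                 query_pre_process_config=[], args=None):
        super().__init__(dataset, code_pre_process_config, query_pre_process_config, args)
        # Assumes the base class exposes raw code/query strings; adjust as needed.
        self.code_token_sets = [set(c.split()) for c in self.code_list]
        self.query_token_sets = [set(q.split()) for q in self.query_list]

    def get_time_score(self, topK):
        start = time.time()
        # score_matrix[i][j] = token overlap between query i and code j.
        score_matrix = np.array(
            [[len(q & c) for c in self.code_token_sets] for q in self.query_token_sets],
            dtype=np.float32,
        )
        # Top-K code indices per query, highest score first.
        index = np.argsort(-score_matrix, axis=1)[:, :topK]
        calculate_time = time.time() - start
        return score_matrix, index, calculate_time
```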
Similarly, to add a new second-stage model, add a class in `StageClass.py`. The template is as follows:

```python
class YourSecondStageModel(BaseStageCodeSearch):
    def __init__(self, dataset="GraphCodeBERT_python", code_pre_process_config=[],
                 query_pre_process_config=[], args=None):
        ...

    def evaluation_two_stage(self, topK, score_stage1, index_stage1, stage1_time):
        ...
```

Please feel free to contact us with any questions.
```bibtex
@inproceedings{hu2023revisiting,
  title={Revisiting Code Search in a Two-Stage Paradigm},
  author={Hu, Fan and Wang, Yanlin and Du, Lun and Li, Xirong and Zhang, Hongyu and Han, Shi and Zhang, Dongmei},
  booktitle={WSDM},
  pages={994--1002},
  year={2023}
}
```
If you encounter any issues when running the code, please feel free to reach us either by creating a new issue on GitHub or by emailing:
- Fan Hu ([email protected])
- Yanlin Wang ([email protected])
