This guide explains how to test the duplicate logic detection functionality both locally and as part of your development workflow.
# Install main dependencies
pip install -r requirements.txt
# Install test dependencies
pip install -r test_requirements.txt
# Or use the test runner to install everything
python run_tests.py install

# Run complete test suite
python run_tests.py all
# Or run specific test types
python run_tests.py unit # Unit tests only
python run_tests.py sample # Sample analysis

The unit tests in tests/test_duplicate_detector.py verify individual components:
- CodeFunction class: Tests function data structure
- DuplicateMatch class: Tests match result structure
- DuplicateLogicDetector: Tests core detection logic
- Integration tests: Tests complete detection workflow
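
As a rough illustration of what a data-structure test could look like, here is a minimal sketch. The real CodeFunction class is not shown in this guide, so `FunctionRecord` is a hypothetical stand-in whose fields (name, file, line_start) are borrowed from the JSON report shown later in this guide. Replace it with the project's actual class and fields.

```python
import unittest
from dataclasses import dataclass


@dataclass(frozen=True)
class FunctionRecord:
    """Hypothetical stand-in for the project's CodeFunction class."""
    name: str
    file: str
    line_start: int


class TestFunctionRecord(unittest.TestCase):
    def test_fields_are_stored(self):
        fn = FunctionRecord("validate_email", "test_samples/original_code.py", 8)
        self.assertEqual(fn.name, "validate_email")
        self.assertEqual(fn.file, "test_samples/original_code.py")
        self.assertEqual(fn.line_start, 8)

    def test_equal_records_compare_equal(self):
        a = FunctionRecord("f", "a.py", 1)
        b = FunctionRecord("f", "a.py", 1)
        self.assertEqual(a, b)


def run_sketch_tests() -> bool:
    """Run the sketch tests and report whether all of them passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFunctionRecord)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling `run_sketch_tests()` runs both cases in-process, which is convenient when trying the pattern outside the project's own test runner.
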
# Run unit tests
python run_tests.py unit
# With verbose output
python run_tests.py unit --verbose
# With coverage report
python run_tests.py unit --coverage

Test with pre-built sample files that contain known duplicates:
# Analyze sample files
python run_tests.py sample

Sample files:
- test_samples/original_code.py - Original functions
- test_samples/duplicate_code.py - Near-identical duplicates
- test_samples/similar_but_different.py - Similar but different logic

Test the detector on real repository scenarios:
# Test on current repository
python run_tests.py integration
# Test specific files
python run_tests.py integration --changed-files file1.py file2.py
# Test different repository
python run_tests.py integration --repository-path /path/to/repo

You can also run the detection script directly:
cd scripts
python duplicate_logic_detector.py \
--repository-path ../test_samples \
--changed-files duplicate_code.py \
--output-format json

- Create test files with intentional duplicates:

# file1.py
def validate_email(email):
    import re
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None

# file2.py
def check_email_format(email_addr):
    import re
    regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(regex, email_addr) is not None

- Run detection:

python run_tests.py integration --changed-files file2.py

Create a test-specific config file:
# test-config.yml
thresholds:
  high_similarity: 0.70 # Lower threshold for testing
  moderate_similarity: 0.50
similarity_weights:
  semantic: 0.50 # Emphasize semantic similarity
  signature: 0.30
  ast_structure: 0.15
  function_calls: 0.05
min_function_lines: 3 # Test smaller functions

Use with:
python scripts/duplicate_logic_detector.py \
--config-file test-config.yml \
--changed-files test_file.py

Functions that should be detected as high-confidence duplicates:
- validate_email() vs check_email_format() - Same regex, similar logic
- calculate_discount() vs compute_discount_amount() - Identical calculation
- process_user_data() vs validate_and_process_users() - Same workflow

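To see why a pair like validate_email() and check_email_format() counts as a near-identical duplicate, it can help to compare the two functions with identifiers stripped away. The sketch below does that using only the standard library (`ast` and `difflib`). It illustrates the idea and is not the detector's actual algorithm; the helper names `node_types` and `structural_similarity` are made up for this example.

```python
import ast
import difflib

SRC_A = r"""
def validate_email(email):
    import re
    pattern = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(pattern, email) is not None
"""

SRC_B = r"""
def check_email_format(email_addr):
    import re
    regex = r'^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$'
    return re.match(regex, email_addr) is not None
"""

SRC_C = r"""
def calculate_tax(amount, rate):
    return amount * rate
"""


def node_types(source: str) -> list[str]:
    """Return AST node type names in walk order, ignoring names and literals."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]


def structural_similarity(source_a: str, source_b: str) -> float:
    """Ratio in [0, 1] of how closely the two node-type sequences match."""
    matcher = difflib.SequenceMatcher(None, node_types(source_a), node_types(source_b))
    return matcher.ratio()


if __name__ == "__main__":
    print(structural_similarity(SRC_A, SRC_B))  # renamed copy: same structure
    print(structural_similarity(SRC_A, SRC_C))  # unrelated function: lower
```

Because only variable and function names differ between the first two sources, their node-type sequences are identical, while the unrelated function scores lower.
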
Functions with similar patterns but different purposes:
- validate_email() vs validate_phone_number() - Similar validation pattern
- calculate_discount() vs calculate_tax() - Similar math operations

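The split between high-confidence and moderate matches can be pictured with the weights and thresholds from the test config earlier in this guide. The sketch below assumes the detector combines per-signal scores as a weighted sum, which this guide does not confirm. The signal values passed in are invented for illustration, and `combined_score` and `classify` are hypothetical helper names.

```python
# Values copied from the test-config.yml example in this guide.
WEIGHTS = {
    "semantic": 0.50,
    "signature": 0.30,
    "ast_structure": 0.15,
    "function_calls": 0.05,
}
HIGH_SIMILARITY = 0.70
MODERATE_SIMILARITY = 0.50


def combined_score(signals: dict[str, float]) -> float:
    """Weighted sum of per-signal scores (assumed combination rule)."""
    return sum(weight * signals.get(name, 0.0) for name, weight in WEIGHTS.items())


def classify(score: float) -> str:
    """Map a combined score onto the two configured thresholds."""
    if score >= HIGH_SIMILARITY:
        return "high"
    if score >= MODERATE_SIMILARITY:
        return "moderate"
    return "below_threshold"


if __name__ == "__main__":
    # Invented signal values for a near-identical pair.
    near_duplicate = {"semantic": 0.9, "signature": 0.8,
                      "ast_structure": 1.0, "function_calls": 1.0}
    # Invented signal values for a similar-but-different pair.
    similar_only = {"semantic": 0.6, "signature": 0.5,
                    "ast_structure": 0.7, "function_calls": 0.4}
    print(classify(combined_score(near_duplicate)))  # high
    print(classify(combined_score(similar_only)))    # moderate
```

Under this assumption, lowering `high_similarity` moves borderline pairs from the moderate bucket into the high one, which is why the test config uses a lower threshold.
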
Example JSON output for the sample files:

{
  "summary": {
    "total_matches": 8,
    "high_confidence": 5,
    "medium_confidence": 2,
    "low_confidence": 1
  },
  "matches": [
    {
      "original_function": {
        "name": "validate_email",
        "file": "test_samples/original_code.py",
        "line_start": 8
      },
      "duplicate_function": {
        "name": "check_email_format",
        "file": "test_samples/duplicate_code.py",
        "line_start": 8
      },
      "similarity_score": 0.92,
      "confidence_level": "very_high",
      "suggestions": [
        "Extract common email validation logic into a shared utility function",
        "Consider using the existing validate_email function instead"
      ]
    }
  ]
}

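When checking a JSON report by script rather than by eye, a few sanity checks go a long way. The sketch below uses only the field names visible in the sample output. It assumes the three confidence buckets in the summary add up to total_matches (true for the sample: 5 + 2 + 1 = 8) and that similarity scores fall between 0 and 1. `check_report` is a hypothetical helper, not part of the project.

```python
import json

# Abbreviated copy of the sample report; the matches list is truncated,
# so its length is deliberately not compared with total_matches.
REPORT_JSON = """
{
  "summary": {"total_matches": 8, "high_confidence": 5,
              "medium_confidence": 2, "low_confidence": 1},
  "matches": [
    {
      "original_function": {"name": "validate_email"},
      "duplicate_function": {"name": "check_email_format"},
      "similarity_score": 0.92
    }
  ]
}
"""


def check_report(report: dict) -> list[str]:
    """Return a list of human-readable problems; empty means the report looks sane."""
    problems = []
    summary = report["summary"]
    bucket_total = (summary["high_confidence"]
                    + summary["medium_confidence"]
                    + summary["low_confidence"])
    if bucket_total != summary["total_matches"]:
        problems.append(
            f"buckets sum to {bucket_total}, expected {summary['total_matches']}"
        )
    for match in report["matches"]:
        score = match["similarity_score"]
        if not 0.0 <= score <= 1.0:
            pair = (match["original_function"]["name"],
                    match["duplicate_function"]["name"])
            problems.append(f"score {score} out of range for {pair}")
    return problems


if __name__ == "__main__":
    print(check_report(json.loads(REPORT_JSON)))  # []
```
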
- ImportError: No module named 'duplicate_logic_detector'
  # Make sure the script exists
  ls scripts/duplicate_logic_detector.py
  # Install dependencies
  python run_tests.py install

- NLTK Data Missing
  python -c "import nltk; nltk.download('punkt'); nltk.download('stopwords')"

- No Matches Found
  - Lower similarity thresholds in config
  - Check file patterns (include/exclude)
  - Verify functions meet minimum complexity/length requirements

- Too Many False Positives
  - Increase similarity thresholds
  - Add exclusion patterns for utility functions
  - Adjust similarity weights

Enable detailed logging:
export DEBUG=1
python run_tests.py sample

Generate detailed coverage reports:
python run_tests.py unit --coverage
# Open htmlcov/index.html in a browser

Add to .github/workflows/test-duplicate-detection.yml:
name: Test Duplicate Detection
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: python run_tests.py install
      - name: Run tests
        run: python run_tests.py all --verbose

Add to .git/hooks/pre-commit:
#!/bin/bash
# Run the unit tests and abort the commit if they fail
if ! python run_tests.py unit; then
    echo "Tests failed. Commit aborted."
    exit 1
fi

Test with larger codebases:
# Clone a Python project
git clone https://github.com/python/cpython.git test_large_repo
cd test_large_repo
# Test duplicate detection
python ../run_tests.py integration \
--repository-path . \
--changed-files Lib/email/utils.py

Time the detection process:
time python run_tests.py sample

Expected performance:
- Small files (< 100 functions): < 5 seconds
- Medium files (100-500 functions): < 30 seconds
- Large files (500+ functions): < 2 minutes
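
These targets can be turned into an automated check. The sketch below is one possible way to do so: `budget_seconds` encodes the three tiers listed above, and `timed` measures a call with `time.perf_counter`. Both names are hypothetical. The tiers overlap at exactly 500 functions, so the sketch arbitrarily places 500 in the medium tier.

```python
import time
from typing import Any, Callable


def budget_seconds(function_count: int) -> float:
    """Time budget for a file, based on the tiers listed in this guide."""
    if function_count < 100:
        return 5.0      # small files
    if function_count <= 500:
        return 30.0     # medium files (500 placed here by assumption)
    return 120.0        # large files: 2 minutes


def timed(func: Callable[..., Any], *args: Any, **kwargs: Any) -> tuple[Any, float]:
    """Call func and return its result together with elapsed wall-clock seconds."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in workload; replace with a call into the detector.
    result, elapsed = timed(sum, range(1000))
    within = elapsed < budget_seconds(50)
    print(f"elapsed={elapsed:.4f}s within_budget={within}")
```
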
When adding new features, include tests:
- Add unit tests in tests/test_duplicate_detector.py
- Add sample cases in test_samples/
- Update documentation in this file
- Run the full test suite before submitting

# Verify all tests pass
python run_tests.py all --coverage