Automatically detect duplicate logic in Python code changes using advanced AST analysis and semantic similarity.
Prevent code duplication, improve code quality, and maintain cleaner codebases with intelligent duplicate detection that goes beyond simple text matching.
- Multi-Strategy Detection: AST analysis, semantic similarity, and function signature matching
- Smart Pattern Recognition: Detects business logic patterns and common code structures
- Actionable PR Comments: Provides suggestions and refactoring recommendations
- Highly Configurable: Adjustable similarity thresholds and file patterns
- Comprehensive Reports: JSON and Markdown reports with detailed analysis
- Fast & Efficient: Uses uv package manager for lightning-fast dependency installation
Add this workflow to .github/workflows/duplicate-detection.yml:

    name: Duplicate Logic Detection
    on:
      pull_request:
        paths: ['**/*.py']
    jobs:
      detect-duplicates:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
            with:
              fetch-depth: 0
          - name: Detect Duplicate Logic
            uses: ArthurMor4is/duplicate-logic-detector-action@v1
            with:
              github-token: ${{ secrets.GITHUB_TOKEN }}

The action accepts the following inputs:

| Parameter | Description | Default |
|---|---|---|
| `github-token` | GitHub token for API access | `${{ github.token }}` |
| `pr-number` | Pull request number | `${{ github.event.number }}` |
| `repository` | Repository name (owner/repo) | `${{ github.repository }}` |
| `base-ref` | Base reference for comparison | `${{ github.base_ref }}` |
| `head-ref` | Head reference for comparison | `${{ github.head_ref }}` |
| `post-comment` | Post findings as PR comment | `true` |
| `fail-on-duplicates` | Fail if high-confidence duplicates found | `false` |
| `similarity-method` | Similarity method to use (`jaccard_tokens`, `sequence_matcher`, `levenshtein_norm`) | `jaccard_tokens` |
| `global-threshold` | Global similarity threshold (0.0-1.0) for all methods | `0.7` |
| `folder-thresholds` | Per-folder thresholds as JSON (e.g., `{"src/shared": 0.1, "src/tests": 0.9}`) | `{}` |


The action sets the following outputs:

| Output | Description |
|---|---|
| `duplicates-found` | Whether any duplicates were detected |
| `match-count` | Total number of matches found |
| `report-path` | Path to the generated report file |


Basic usage:

    - name: Detect Duplicate Logic
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}

Fail the workflow when high-confidence duplicates are found:

    - name: Detect Duplicate Logic
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        fail-on-duplicates: true

Run the analysis without posting a PR comment:

    - name: Detect Duplicate Logic
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        post-comment: false

High-precision detection:

    - name: Detect Duplicate Logic (High Precision)
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        similarity-method: levenshtein_norm  # More thorough analysis
        fail-on-duplicates: true

Custom thresholds:

    - name: Detect Duplicate Logic (Custom Thresholds)
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        global-threshold: 0.8  # Higher threshold for stricter detection
        folder-thresholds: '{"src/shared": 0.1, "src/tests": 0.9}'

Folder-specific thresholds:

    - name: Detect Duplicate Logic (Folder-Specific Thresholds)
      uses: ArthurMor4is/duplicate-logic-detector-action@v1
      with:
        github-token: ${{ secrets.GITHUB_TOKEN }}
        similarity-method: jaccard_tokens
        folder-thresholds: '{"src/shared": 0.1, "src/core": 0.8, "tests": 0.9}'

The action uses configurable similarity analysis to detect duplicate logic patterns:
- Parses Python files to extract function definitions
- Analyzes function signatures and structure
- Identifies code patterns and complexity
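
The parsing step above can be sketched with Python's standard `ast` module. This is a minimal illustration, not the action's actual implementation; the helper name `extract_functions` and the dictionary layout are assumptions made for this example.

```python
import ast


def extract_functions(source: str) -> list[dict]:
    """Return name, argument names, and line span for every function in `source`."""
    tree = ast.parse(source)
    functions = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            functions.append(
                {
                    "name": node.name,
                    "args": [arg.arg for arg in node.args.args],
                    "lineno": node.lineno,
                    "end_lineno": node.end_lineno,
                }
            )
    # ast.walk does not guarantee source order, so sort by starting line.
    return sorted(functions, key=lambda f: f["lineno"])


sample = '''
def validate_email(address):
    return "@" in address

def add(a, b):
    return a + b
'''

for fn in extract_functions(sample):
    print(fn["name"], fn["args"], fn["lineno"], fn["end_lineno"])
```

The line span (`lineno` to `end_lineno`) is what a later filtering step could use to skip very small functions.
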
Choose from three different similarity algorithms:

`jaccard_tokens` (default):
- Best for: General purpose, fast analysis
- Method: Token-based Jaccard similarity coefficient
- Strengths: Fast, good balance of precision/recall
- Use when: You want reliable results with good performance
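
As a rough sketch of token-based Jaccard similarity, the example below tokenizes two snippets with the standard `tokenize` module and divides the size of the shared token set by the size of the combined set. How the action itself tokenizes code is not documented here, so the helpers `code_tokens` and `jaccard_similarity` are assumptions for illustration only.

```python
import io
import tokenize


def code_tokens(source: str) -> set[str]:
    """Collect the set of meaningful token strings in a snippet of Python source."""
    skip = {
        tokenize.COMMENT, tokenize.NL, tokenize.NEWLINE,
        tokenize.INDENT, tokenize.DEDENT, tokenize.ENDMARKER,
    }
    readline = io.StringIO(source).readline
    return {tok.string for tok in tokenize.generate_tokens(readline) if tok.type not in skip}


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the token intersection over size of the token union; 1.0 if both are empty."""
    tokens_a, tokens_b = code_tokens(a), code_tokens(b)
    union = tokens_a | tokens_b
    if not union:
        return 1.0
    return len(tokens_a & tokens_b) / len(union)


first = "def total(items):\n    return sum(items)\n"
second = "def total(values):\n    return sum(values)\n"
print(jaccard_similarity(first, second))
```

Because the comparison works on sets, token order and repetition are ignored, which is what makes this method fast but less sensitive to structure.
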

`sequence_matcher`:
- Best for: Balanced approach between speed and accuracy
- Method: Python's difflib.SequenceMatcher
- Strengths: Good at detecting structural similarities
- Use when: You need more nuanced similarity detection
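
For reference, this is how `difflib.SequenceMatcher` produces a similarity ratio for two strings. The wrapper name `sequence_similarity` is hypothetical, and whether the action compares raw source text or a normalized form is an assumption left open here.

```python
from difflib import SequenceMatcher


def sequence_similarity(a: str, b: str) -> float:
    """Ratio in [0.0, 1.0] based on the matching blocks shared by the two strings."""
    # autojunk=False turns off difflib's popularity heuristic for long inputs.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


first = "return a + b"
second = "return a - b"
print(sequence_similarity(first, second))
```

Unlike a set-based measure, the ratio depends on the order of characters, so reordered statements lower the score.
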

`levenshtein_norm`:
- Best for: High precision, strict duplicate detection
- Method: Normalized Levenshtein distance
- Strengths: Most thorough analysis, best precision
- Use when: You want to catch even subtle duplicates
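
A normalized Levenshtein similarity can be sketched in pure Python as below. Dividing the edit distance by the length of the longer string is one common normalization and is an assumption here; the action may normalize differently. Both function names are illustrative.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / max(len); identical strings (including two empty ones) give 1.0."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest


print(levenshtein_similarity("kitten", "sitting"))
```

The quadratic cost in the length of the inputs is the reason this method is slower than the token-based one.
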

The action also filters candidates to reduce noise:
- Excludes very small functions (< 5 lines)
- Filters out test files and common patterns
- Prioritizes business logic and complex functions
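
The first two filtering rules above could look roughly like this. The 5-line minimum comes from the list; the test-file naming convention (`test_*.py`, `*_test.py`, or a `tests` folder) and both helper names are assumptions, since the exact rules the action applies are not spelled out.

```python
import ast
from pathlib import PurePosixPath

MIN_FUNCTION_LINES = 5  # from the list above: functions under 5 lines are excluded


def is_test_file(path: str) -> bool:
    """Assumed convention: test_*.py, *_test.py, or any file under a tests/ folder."""
    p = PurePosixPath(path)
    return p.name.startswith("test_") or p.name.endswith("_test.py") or "tests" in p.parts


def candidate_functions(path: str, source: str) -> list[str]:
    """Names of functions in `source` that are long enough to be worth comparing."""
    if is_test_file(path):
        return []
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            line_count = node.end_lineno - node.lineno + 1
            if line_count >= MIN_FUNCTION_LINES:
                names.append(node.name)
    return names


source = (
    "def tiny(x):\n"
    "    return x\n"
    "\n"
    "def longer(x):\n"
    "    y = x + 1\n"
    "    z = y * 2\n"
    "    w = z - 3\n"
    "    return w\n"
)
print(candidate_functions("src/utils.py", source))
print(candidate_functions("tests/test_utils.py", source))
```
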
Control when functions are considered duplicates with flexible threshold settings:

Global threshold (`global-threshold`):
- Default: `0.7` (70% similarity)
- Usage: Applies to all files when no folder-specific threshold is set
- Range: `0.0` to `1.0` (0% to 100% similarity)

Folder-specific thresholds (`folder-thresholds`):
- Format: JSON object with folder paths as keys
- Example: `{"src/shared": 0.1, "src/tests": 0.9}`
- Priority: Folder-specific thresholds override global threshold
- Matching: Uses most specific (longest) matching folder path
- Fallback: If no folder threshold matches, uses global threshold
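
The priority, matching, and fallback rules above can be illustrated with a short resolver. This is a sketch under assumptions: the function name `resolve_threshold` is hypothetical, "longest" is measured in path components rather than characters, and the `"src"` key is added only to show the most-specific-match behavior.

```python
import json
from pathlib import PurePosixPath


def resolve_threshold(file_path: str, global_threshold: float,
                      folder_thresholds: dict[str, float]) -> float:
    """Pick the threshold of the longest folder that contains `file_path`."""
    file_parts = PurePosixPath(file_path).parts
    best_len, best_value = 0, global_threshold
    for folder, value in folder_thresholds.items():
        folder_parts = PurePosixPath(folder).parts
        # Compare whole path components so "src/core" does not match "src/core_utils".
        if file_parts[: len(folder_parts)] == folder_parts and len(folder_parts) > best_len:
            best_len, best_value = len(folder_parts), value
    return best_value


thresholds = json.loads('{"src": 0.6, "src/shared": 0.1, "tests": 0.9}')
print(resolve_threshold("src/shared/strings.py", 0.7, thresholds))  # 0.1
print(resolve_threshold("src/api/views.py", 0.7, thresholds))       # 0.6
print(resolve_threshold("scripts/build.py", 0.7, thresholds))       # 0.7
```

The third call shows the fallback: no configured folder contains `scripts/build.py`, so the global threshold applies.
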

    # Strict detection globally
    global-threshold: 0.85

    # Lenient detection globally
    global-threshold: 0.5

    # Mixed approach (lenient for shared code, strict for tests)
    folder-thresholds: '{"src/shared": 0.1, "tests": 0.9, "src/core": 0.8}'

Suggested thresholds by code type:
- Shared Libraries: Low threshold (0.1-0.3) to catch even minor duplications
- Test Files: High threshold (0.8-0.9) to avoid false positives on similar test patterns
- Core Business Logic: Medium-high threshold (0.6-0.8) for important code quality
- Utilities: Medium threshold (0.5-0.7) for general utility functions

Example PR comment:

    ## Duplicate Logic Detection Results

    Found 2 potential duplicates with high confidence:

    ### Match 1: Email Validation
    - **New Function**: `check_email_format` (src/utils.py:15)
    - **Existing Function**: `validate_email` (src/validators.py:8)
    - **Similarity**: 92%
    - **Suggestion**: Consider using the existing `validate_email` function instead

    ### Match 2: Data Processing
    - **New Function**: `process_user_data` (src/handlers.py:25)
    - **Existing Function**: `handle_user_info` (src/services.py:45)
    - **Similarity**: 87%
    - **Suggestion**: Extract common logic into a shared utility function

The action has minimal runtime dependencies for fast execution:
- rich v14.1.0 - Console output and progress bars
For development, testing, and research, additional dependencies are available:
- Testing: pytest, pytest-mock, pytest-cov, pytest-xdist
- Code Quality: black, isort, flake8, mypy, pre-commit
- Research: GitPython, PyGithub, scikit-learn, nltk, numpy, pandas, pyyaml
The action uses modern Python packaging with pyproject.toml and uv for fast dependency management:

    # Clean core dependencies
    dependencies = []

    # Runtime dependencies (action execution)
    [project.optional-dependencies]
    runtime = ["rich==14.1.0"]

    # Research dependencies (experiments)
    research = ["GitPython", "PyGithub", "scikit-learn", ...]

    # Development dependencies
    dev = ["black>=23.0.0", "isort>=5.12.0", ...]
    test = ["pytest>=7.0.0", "pytest-mock>=3.10.0", ...]

To set up a local development environment:

    # Clone the repository
    git clone https://github.com/ArthurMor4is/duplicate-logic-detector-action.git
    cd duplicate-logic-detector-action

    # Install dependencies using uv (recommended)
    uv sync --all-extras

    # Or using traditional pip
    pip install -e ".[dev,test]"

    # Run tests
    make test
    # or
    uv run pytest

    # Run sample analysis
    make test-sample

Note: The config/default-config.yml file is used for development and testing purposes only. The GitHub Action uses built-in configuration optimized for CI/CD workflows.

- Usage Guide - Detailed usage instructions
- Testing Guide - How to test the action
- Examples - Complete workflow examples
Contributions are welcome! Please read our contributing guidelines and submit pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
- Documentation
- Report Issues
- Discussions