Cregit tokenizes source code files in a Git repository and creates a token-level blame view, showing which commit introduced each token.
See readme.org for the original documentation.
This fork includes the following modifications for improved compatibility:
| Component | Original | Updated |
|---|---|---|
| sqlite-jdbc | 3.8.0-SNAPSHOT | 3.8.11.2 |
| sbt | 0.13.7 | 0.13.18 |
- Changed the shebang in `tokenBySha.pl` from `#!/usr/bin/perl` to `#!/usr/bin/env perl` for better cross-platform compatibility.
| Tool | URL | Notes |
|---|---|---|
| srcml | https://www.srcml.org/ | Must be in PATH |
| ctags | https://github.com/universal-ctags | Universal Ctags recommended |
| bfg | https://github.com/Mont9165/bfg-repo-cleaner/tree/blobexec | Use the blobexec branch |
| Perl | - | With modules: DBI, DBD::SQLite, Set::Scalar, HTML::FromText |
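The prerequisites in the table can be checked before building anything. The following is a minimal sketch, not part of cregit: the helper names `missing_tools` and `ctags_flavor` are hypothetical, and the classification assumes that the first line of `ctags --version` starts with either "Universal Ctags" or "Exuberant Ctags".

```shell
#!/bin/sh
# Print the name of every command given as an argument that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Classify a `ctags --version` banner passed as the first argument.
ctags_flavor() {
    case "$1" in
        "Universal Ctags"*) echo universal ;;
        "Exuberant Ctags"*) echo exuberant ;;
        *) echo unknown ;;
    esac
}

# Example use (commented out so that sourcing this file has no side effects):
# missing_tools srcml ctags perl java sbt
# ctags_flavor "$(ctags --version | head -n 1)"
# Loading every required Perl module exits non-zero if one is missing:
# perl -MDBI -MDBD::SQLite -MSet::Scalar -MHTML::FromText -e 1
```

An empty result from `missing_tools` means every listed command was found on PATH.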
# Install dependencies with Homebrew
brew install srcml universal-ctags
# Install Perl modules
cpan DBI DBD::SQLite Set::Scalar HTML::FromText

apt-get install cmake libarchive-dev libxml++2.6-dev libxml2-dev \
libcurl4-openssl-dev libxslt1-dev libboost-all-dev libantlr-dev \
libssl-dev libxerces-c-dev exuberant-ctags libdbi-perl \
libjgit-java libhtml-fromtext-perl libset-scalar-perl libdbd-sqlite3-perl

git clone https://github.com/Mont9165/bfg-repo-cleaner.git --branch blobexec
cd bfg-repo-cleaner
sbt "bfg/assembly"

git clone https://github.com/Mont9165/cregit.git
cd cregit

cd tokenize/srcMLtoken
make
cd ../..

# Build slickGitLog
cd slickGitLog && sbt one-jar && cd ..
# Build persons
cd persons && sbt one-jar && cd ..
# Build remapCommits
cd remapCommits && sbt one-jar && cd ..

export CREGIT_HOME=/path/to/cregit
# Resolve the wildcard now; a quoted "$BFG_JAR" is not glob-expanded later
export BFG_JAR=$(ls /path/to/bfg-repo-cleaner/bfg/target/bfg-*.jar | head -n 1)
export BFG_MEMO_DIR=/tmp/memo
export BFG_TOKENIZE_CMD="$CREGIT_HOME/tokenize/tokenizeSrcMl.pl \
--srcml2token=$CREGIT_HOME/tokenize/srcMLtoken/srcml2token \
--srcml=$(which srcml) \
--ctags=$(which ctags)"
mkdir -p "$BFG_MEMO_DIR"

# Clone the repository you want to tokenize
git clone --mirror https://github.com/example/repo.git /tmp/repo.git

For Java files:
java -jar "$BFG_JAR" \
--blob-exec '$CREGIT_HOME/tokenizeByBlobId/tokenBySha.pl=\.java$' \
--no-blob-protection \
/tmp/repo.git

For C/C++ files:
java -jar "$BFG_JAR" \
--blob-exec '$CREGIT_HOME/tokenizeByBlobId/tokenBySha.pl=\.[ch]$' \
--no-blob-protection \
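/tmp/repo.git

Note: Use single quotes around the `--blob-exec` argument to prevent shell expansion of `$` in the regex pattern. The single quotes also keep `$CREGIT_HOME` literal, so the shell passes that part of the argument through unexpanded.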
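To see which files a `--blob-exec` pattern would select, the regex can be tried against a few file names first. This is a rough sketch under one assumption: that the patched BFG applies the pattern as a regular-expression search on the file name, approximated here with `grep -E`. The helpers `matches_pattern` and `preview_pattern` are hypothetical and not part of cregit or BFG.

```shell
#!/bin/sh
# Succeed if the file name in $2 matches the extended regular expression in $1.
matches_pattern() {
    printf '%s\n' "$2" | grep -Eq -- "$1"
}

# Print a selected/skipped line for each file name, for eyeballing a pattern.
preview_pattern() {
    pattern=$1
    shift
    for name in "$@"; do
        if matches_pattern "$pattern" "$name"; then
            printf 'selected %s\n' "$name"
        else
            printf 'skipped  %s\n' "$name"
        fi
    done
}

# Example use (single quotes keep the trailing $ away from the shell):
# preview_pattern '\.[ch]$' main.c util.h parser.cpp
```

Under this approximation, `\.[ch]$` selects `main.c` and `util.h` but skips `parser.cpp`, so C++ sources with extensions such as `.cpp` or `.hpp` would need a broader pattern.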
Here's a complete example for processing the Apache Camel repository:
#!/bin/bash
set -e
# Configuration
REPO_NAME="camel"
REPO_URL="https://github.com/apache/camel.git"
WORK_DIR="/tmp/cregit-work"
CREGIT_HOME=/path/to/cregit
# Resolve the wildcard now; a quoted "$BFG_JAR" is not glob-expanded later
BFG_JAR=$(ls /path/to/bfg-*.jar | head -n 1)
# Setup
mkdir -p "$WORK_DIR"
export BFG_MEMO_DIR="$WORK_DIR/memo"
export BFG_TOKENIZE_CMD="$CREGIT_HOME/tokenize/tokenizeSrcMl.pl \
--srcml2token=$CREGIT_HOME/tokenize/srcMLtoken/srcml2token \
--srcml=$(which srcml) \
--ctags=$(which ctags)"
mkdir -p "$BFG_MEMO_DIR"
# 1. Clone repository
git clone --mirror "$REPO_URL" "$WORK_DIR/$REPO_NAME.git"
# 2. Keep a copy of the original
cp -r "$WORK_DIR/$REPO_NAME.git" "$WORK_DIR/$REPO_NAME-original.git"
# 3. Tokenize with BFG + cregit
java -jar "$BFG_JAR" \
--blob-exec '$CREGIT_HOME/tokenizeByBlobId/tokenBySha.pl=\.java$' \
--no-blob-protection \
"$WORK_DIR/$REPO_NAME.git"
# 4. Create history database for original repo
java -jar "$CREGIT_HOME"/slickGitLog/target/scala-2.11/slickgitlog_2.11-*-one-jar.jar \
"$WORK_DIR/$REPO_NAME-original.db" \
"$WORK_DIR/$REPO_NAME-original.git"
# 5. Create history database for tokenized repo
java -jar "$CREGIT_HOME"/slickGitLog/target/scala-2.11/slickgitlog_2.11-*-one-jar.jar \
"$WORK_DIR/$REPO_NAME-cregit.db" \
"$WORK_DIR/$REPO_NAME.git"
echo "Done! Tokenized repository is at: $WORK_DIR/$REPO_NAME.git"

Make sure to export the environment variable before running:
export BFG_TOKENIZE_CMD="..."

Use the patched BFG from this fork: https://github.com/Mont9165/bfg-repo-cleaner/tree/blobexec
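A variable that is set but not exported is invisible to the `java` process that runs BFG, which is easy to overlook. The sketch below lists which of the named variables are missing from the exported environment. The helper name `unexported_vars` is hypothetical, and the check relies on `env` printing one `NAME=value` line per variable, which holds for the single-line values used in this document.

```shell
#!/bin/sh
# Print the name of each listed variable that is absent from the exported
# environment (either unset, or set in the shell but never exported).
unexported_vars() {
    for name in "$@"; do
        env | grep -q "^${name}=" || printf '%s\n' "$name"
    done
}

# Example use before starting BFG:
# unexported_vars BFG_MEMO_DIR BFG_TOKENIZE_CMD
```

An empty result means every listed variable will be visible to child processes.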
Install srcml and ensure it's in your PATH:
# macOS
brew install srcml
# Verify installation
which srcml

Install Universal Ctags (rather than Exuberant Ctags) for best results:
# macOS
brew install universal-ctags
# Verify installation
which ctags

GPL-3.0+