- BibGuard now has an online Gradio demo at https://huggingface.co/spaces/thinkwee/BibGuard (try it out!)
BibGuard is your comprehensive quality assurance tool for academic papers. It validates bibliography entries against real-world databases and checks LaTeX submission quality to catch errors before you submit.
AI coding assistants and writing tools often hallucinate plausible-sounding but non-existent references. BibGuard verifies the existence of every entry against multiple databases (arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, Google Scholar) and uses advanced LLMs to ensure cited papers actually support your claims.
- Stop Hallucinations: Instantly flag citations that don't exist or have mismatched metadata
- LaTeX Quality Checks: Detect formatting issues, weak writing patterns, and submission compliance problems
- Safe & Non-Destructive: Your original files are never modified - only detailed reports are generated
- Contextual Relevance: Ensure cited papers actually discuss what you claim (with LLM)
- Efficiency Boost: Drastically reduce time needed to manually verify hundreds of citations
- Multi-Source Verification: Validates metadata against arXiv, CrossRef, DBLP, Semantic Scholar, OpenAlex, and Google Scholar
- AI Relevance Check: Uses LLMs to verify citations match their context (optional)
- Preprint Detection: Warns if >50% of references are preprints (arXiv, bioRxiv, etc.)
- Usage Analysis: Highlights missing citations and unused bib entries
- Duplicate Detector: Identifies duplicate entries with fuzzy matching
- Format Validation: Caption placement, cross-references, citation spacing, equation punctuation
- Writing Quality: Weak sentence starters, hedging language, redundant phrases
- Consistency: Spelling variants (US/UK English), hyphenation, terminology
- AI Artifact Detection: Conversational AI responses, placeholder text, Markdown remnants
- Acronym Validation: Ensures acronyms are defined before use (smart matching)
- Anonymization: Checks for identity leaks in double-blind submissions
- Citation Age: Flags references older than 30 years
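
To make two of the bibliography checks above more concrete, here is a minimal sketch of a preprint-ratio warning and fuzzy duplicate detection. It is not BibGuard's actual implementation: the entry dictionaries, the `PREPRINT_MARKERS` tuple, the function names, and the 0.9 similarity cutoff are all assumptions made for illustration, and the sample entries are made up. Only the 50% threshold comes from the feature list.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical markers; the real tool may detect preprints differently.
PREPRINT_MARKERS = ("arxiv", "biorxiv", "medrxiv")


def normalize(title: str) -> str:
    """Lowercase a title and replace punctuation with single spaces."""
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return " ".join(cleaned.split())


def preprint_ratio(entries: list[dict]) -> float:
    """Fraction of entries whose venue field mentions a preprint server."""
    if not entries:
        return 0.0
    hits = sum(
        any(marker in e.get("venue", "").lower() for marker in PREPRINT_MARKERS)
        for e in entries
    )
    return hits / len(entries)


def find_duplicates(entries: list[dict], cutoff: float = 0.9) -> list[tuple[str, str]]:
    """Return pairs of entry keys whose normalized titles are nearly identical."""
    pairs = []
    for a, b in combinations(entries, 2):
        score = SequenceMatcher(None, normalize(a["title"]), normalize(b["title"])).ratio()
        if score >= cutoff:  # assumed cutoff, tune to taste
            pairs.append((a["key"], b["key"]))
    return pairs


# Made-up sample entries, for illustration only.
entries = [
    {"key": "doe2020", "title": "Graph Search at Scale", "venue": "arXiv preprint"},
    {"key": "doe2021", "title": "Graph search at scale.", "venue": "Example Conference"},
    {"key": "roe2019", "title": "A Survey of Citation Tools", "venue": "Example Journal"},
]

if preprint_ratio(entries) > 0.5:  # threshold from the feature list
    print("Warning: more than half of the references are preprints")
print(find_duplicates(entries))
```

Comparing normalized titles rather than citation keys is what lets a check like this catch the same paper entered twice under different keys.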
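
Clone the repository, install the dependencies, and generate a starter configuration:

    git clone git@github.com:thinkwee/BibGuard.git
    cd BibGuard
    pip install -r requirements.txt
    python main.py --init

The last command creates `config.yaml`. Edit it to set your file paths. There are two modes: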

Single-file mode is best for individual papers:

    files:
      bib: "paper.bib"
      tex: "paper.tex"
      output_dir: "bibguard_output"

Directory mode is best for large projects or a collection of papers. BibGuard recursively searches for all .tex and .bib files:

    files:
      input_dir: "./my_project_dir"
      output_dir: "bibguard_output"

Then run the checks:

    python main.py

Output (in `bibguard_output/`):
- `bibliography_report.md` - Bibliography validation results
- `latex_quality_report.md` - Writing and formatting issues
- `line_by_line_report.md` - All issues sorted by line number
- `*_only_used.bib` - Clean bibliography (used entries only)
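
The usage analysis behind the "used entries only" bibliography can be pictured as a set comparison between the keys cited in the .tex source and the keys defined in the .bib file. The sketch below is an assumption about how such a check might work, not BibGuard's code: the regular expressions are deliberately simplified, and `cited_keys`, `bib_keys`, and `usage_report` are hypothetical names.

```python
import re

# Matches \cite, \citep, \citet, ... with optional [..] arguments (simplified).
CITE_RE = re.compile(r"\\[a-zA-Z]*cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the type and key of a BibTeX entry such as "@article{smith2020,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every key that appears inside a citation command."""
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect entry keys, skipping non-entry blocks like @string."""
    return {
        key
        for kind, key in ENTRY_RE.findall(bib)
        if kind.lower() not in {"comment", "string", "preamble"}
    }


def usage_report(tex: str, bib: str) -> dict[str, list[str]]:
    """Keys cited but not defined, and keys defined but never cited."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {"missing": sorted(cited - defined), "unused": sorted(defined - cited)}


# Made-up inputs, for illustration only.
tex = r"Prior work \citep[see][]{smith2020, lee2021} and \cite{ghost2019}."
bib = (
    "@article{smith2020,\n  title={A}\n}\n"
    "@inproceedings{lee2021,\n  title={B}\n}\n"
    "@book{unused1999,\n  title={C}\n}\n"
)
print(usage_report(tex, bib))
```

A real implementation would also need to handle commented-out citations, `\nocite{*}`, and `\input`-ed files, which this sketch ignores.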
Edit config.yaml to customize checks:

    bibliography:
      check_metadata: true        # Validate against online databases (takes time)
      check_usage: true           # Find unused/missing entries
      check_duplicates: true      # Detect duplicate entries
      check_preprint_ratio: true  # Warn if >50% are preprints
      check_relevance: false      # LLM-based relevance check (requires API key)

    submission:
      # Format checks
      caption: true           # Table/figure caption placement
      reference: true         # Cross-reference integrity
      formatting: true        # Citation spacing, blank lines
      equation: true          # Equation punctuation, numbering
      # Writing quality
      sentence: true          # Weak starters, hedging language
      consistency: true       # Spelling, hyphenation, terminology
      acronym: true           # Acronym definitions (3+ letters)
      # Submission compliance
      ai_artifacts: true      # AI-generated text detection
      anonymization: true     # Double-blind compliance
      citation_quality: true  # Old citations (>30 years)
      number: true            # Percentage formatting

To verify citations match their context using AI:

    bibliography:
      check_relevance: true

    llm:
      backend: "gemini"  # Options: gemini, openai, anthropic, deepseek, ollama, vllm
      api_key: ""        # Or use environment variable (e.g., GEMINI_API_KEY)

Supported backends:
- Gemini (Google): `GEMINI_API_KEY`
- OpenAI: `OPENAI_API_KEY`
- Anthropic: `ANTHROPIC_API_KEY`
- DeepSeek: `DEEPSEEK_API_KEY` (recommended for cost/performance)
- Ollama: Local models (no API key needed)
- vLLM: Custom endpoint
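
The "config value or environment variable" choice for API keys can be illustrated with a small lookup helper. This is a sketch under assumptions, not BibGuard's code: `resolve_api_key` is a hypothetical function, and the precedence shown (config first, then environment) is a guess. Only the environment variable names come from the backend list.

```python
import os

# Environment variable names as given in the backend list.
ENV_VARS = {
    "gemini": "GEMINI_API_KEY",
    "openai": "OPENAI_API_KEY",
    "anthropic": "ANTHROPIC_API_KEY",
    "deepseek": "DEEPSEEK_API_KEY",
}


def resolve_api_key(llm_config: dict, environ=os.environ):
    """Return the key from the config if set, else from the environment.

    Assumed precedence; backends without a listed variable return None.
    """
    backend = llm_config["backend"]
    if llm_config.get("api_key"):
        return llm_config["api_key"]
    env_name = ENV_VARS.get(backend)
    if env_name is None:
        return None  # e.g. ollama runs local models and needs no key
    key = environ.get(env_name)
    if not key:
        raise RuntimeError(
            f"Set {env_name} or fill in llm.api_key for backend '{backend}'"
        )
    return key


# Passing a plain dict instead of os.environ keeps the example reproducible.
print(resolve_api_key({"backend": "openai", "api_key": ""}, {"OPENAI_API_KEY": "sk-test"}))
print(resolve_api_key({"backend": "ollama", "api_key": ""}, {}))
```

Keeping the key in an environment variable rather than in `config.yaml` avoids committing secrets alongside the paper sources.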

Then run BibGuard:

    python main.py

The bibliography report (`bibliography_report.md`) shows, for each entry:
- Verified: Metadata matches online databases
- Issues: Mismatches, missing entries, duplicates
- Statistics: Usage, duplicates, preprint ratio

The LaTeX quality report (`latex_quality_report.md`) is organized by severity:
- 🔴 Errors: Critical issues (e.g., undefined references)
- 🟡 Warnings: Important issues (e.g., inconsistent spelling)
- 🔵 Suggestions: Style improvements (e.g., weak sentence starters)

The line-by-line report (`line_by_line_report.md`) lists all LaTeX issues sorted by line number for easy fixing.
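
The two orderings described above (by severity and by line number) amount to sorting the same list of issues with different keys. The following sketch shows one way to do that. The `Issue` dataclass, the severity names as lowercase strings, and both helper functions are hypothetical and do not reflect BibGuard's internal data model.

```python
from dataclasses import dataclass

# Lower rank means more severe; names mirror the three report levels.
SEVERITY_RANK = {"error": 0, "warning": 1, "suggestion": 2}


@dataclass(frozen=True)
class Issue:
    line: int
    severity: str
    message: str


def sort_by_line(issues):
    """Order issues by line number, most severe first within a line."""
    return sorted(issues, key=lambda i: (i.line, SEVERITY_RANK[i.severity]))


def group_by_severity(issues):
    """Bucket issues by severity, each bucket ordered by line number."""
    groups = {name: [] for name in SEVERITY_RANK}
    for issue in sort_by_line(issues):
        groups[issue.severity].append(issue)
    return groups


# Made-up issues, using the example categories from the report description.
issues = [
    Issue(120, "suggestion", "weak sentence starter"),
    Issue(42, "warning", "inconsistent spelling"),
    Issue(42, "error", "undefined reference"),
]

for issue in sort_by_line(issues):
    print(f"L{issue.line:>4} [{issue.severity}] {issue.message}")
```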
BibGuard is strict, but false positives happen:
- Year Discrepancy (±1 Year):
  - Reason: Delay between preprint (arXiv) and official publication
  - Action: Verify which version you intend to cite
- Author List Variations:
  - Reason: Different databases handle large author lists differently
  - Action: Check if primary authors match
- Venue Name Differences:
  - Reason: Abbreviations vs. full names (e.g., "NeurIPS" vs. "Neural Information Processing Systems")
  - Action: Both are usually correct
- Non-Academic Sources:
  - Reason: Blogs, documentation not indexed by academic databases
  - Action: Manually verify URL and title
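
When triaging a report by hand, the first two cases above can be told apart from genuine mismatches with a tolerant comparison. The sketch below is only an illustration of that reasoning: the record layout, the `triage` function, and the name-parsing heuristic are assumptions, and BibGuard's own matching logic may differ.

```python
def family_name(author: str) -> str:
    """Extract a lowercase family name from 'Last, First' or 'First Last'."""
    author = author.strip()
    if "," in author:
        return author.split(",")[0].strip().lower()
    return author.split()[-1].lower()


def years_compatible(bib_year, db_year, tolerance: int = 1) -> bool:
    """True when the years differ by at most `tolerance` (preprint vs. published)."""
    return abs(int(bib_year) - int(db_year)) <= tolerance


def triage(bib: dict, db: dict) -> list[str]:
    """Describe differences between a .bib entry and a database record."""
    notes = []
    if int(bib["year"]) != int(db["year"]):
        if years_compatible(bib["year"], db["year"]):
            notes.append("year differs by one: check preprint vs. published version")
        else:
            notes.append("year mismatch")
    if family_name(bib["authors"][0]) != family_name(db["authors"][0]):
        notes.append("first author mismatch")
    elif len(bib["authors"]) != len(db["authors"]):
        notes.append("author list length differs: check primary authors")
    return notes


# Made-up records, for illustration only.
bib_entry = {"year": "2017", "authors": ["Doe, Jane", "Roe, Richard"]}
db_record = {"year": 2018, "authors": ["Jane Doe", "Richard Roe", "Alex Poe"]}
print(triage(bib_entry, db_record))
```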

Other command-line options:

    python main.py --help            # Show all options
    python main.py --list-templates  # List conference templates
    python main.py --config my.yaml  # Use custom config file

Contributions welcome! Please open an issue or pull request.

BibGuard uses multiple data sources:
- arXiv API
- CrossRef API
- Semantic Scholar API
- DBLP API
- OpenAlex API
- Google Scholar (via scholarly)
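
As a rough idea of what a lookup against one of these sources involves, the sketch below builds a title query for the public CrossRef REST API and extracts the top hit from a response. It is not taken from BibGuard: the function names are hypothetical, only the standard library is used, and a real client would add rate limiting, retries, and fuzzy comparison of the returned title against the .bib entry.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(title: str, rows: int = 1) -> str:
    """Build a CrossRef bibliographic search URL for a paper title."""
    params = urlencode({"query.bibliographic": title, "rows": rows})
    return f"{CROSSREF_WORKS}?{params}"


def best_match(payload: dict):
    """Pull title and DOI of the first item out of a CrossRef response."""
    items = payload.get("message", {}).get("items", [])
    if not items:
        return None  # nothing found: the reference may not exist
    top = items[0]
    titles = top.get("title") or [""]  # CrossRef returns titles as a list
    return {"title": titles[0], "doi": top.get("DOI")}


def lookup(title: str, timeout: float = 10.0):
    """Query CrossRef over the network and return the best match, if any."""
    with urlopen(build_query_url(title), timeout=timeout) as resp:
        return best_match(json.load(resp))


print(build_query_url("Graph search at scale"))
```

An empty result from a single source is not proof that a paper is fabricated, which is one reason to check several databases before flagging an entry.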
Made with ❤️ for researchers who care about their submission
