index

A tool for tracking and visualizing Google Scholar citation data over time.

Overview

This project has three components:

Ingest (ingest/) - Scrapes citation data from Google Scholar
Model (model/) - Smooths citation rates using a Kalman filter
Visualization (root) - D3.js interactive charts, deployed via GitHub Pages

Installation

pip install -r requirements.txt

Usage

1. Scrape citation data from Google Scholar

Configure your Google Scholar user ID in config.yaml:

user_id: RIi-1pAAAAAJ
request_delay: 5

Run the scraper:

python -m ingest

Output: results/citations.json with raw citation counts per paper per year.

Note: The scraper uses browser cookies for authentication. You must be logged into Google in Chrome, Firefox, or Safari.

2. Tune Kalman filter hyperparameters (optional)

Find optimal smoothing parameters for your data:

python -m model.tune \
  --input results/citations.json \
  --output results/tuned_hyperparams.json \
  --n-grid 50

This performs a grid search over process variance and overdispersion, maximizing marginal likelihood across all papers.

3. Produce smoothed citation rates

python -m model.rates \
  --input results/citations.json \
  --output results/citation_rates.json \
  --forecast-years 0

Output: results/citation_rates.json with observed counts, empirical rates, and Kalman-smoothed rates with uncertainty.

Default parameters (--process-var 0.25 --obs-overdispersion 0.56) were tuned from empirical data.

To update the GitHub Pages visualization, copy the output to the versioned data directory:

cp results/citation_rates.json data/citation_rates.json

4. View the visualization

The visualization is deployed via GitHub Pages at: https://trvrb.github.io/index

For local development, run a server from the repository root:

python -m http.server 8000

Then open http://localhost:8000 in your browser.

The visualization includes:

H-index over time: Tracks how the h-index evolves across years
All paper citations over time: Stacked area chart showing all papers' smoothed citation rates, colored by publication year
Specific paper citations over time: Per-paper view with empirical citations, smoothed rate, and uncertainty bands

Click on any paper in the streamplot to view its details in the line plot.

Project Structure

index/
├── index.html          # Visualization entry point
├── app.js              # D3.js visualization code
├── styles.css          # Styling
├── data/
│   └── citation_rates.json  # Versioned data for GitHub Pages
├── results/            # Local working outputs (gitignored)
├── ingest/             # Google Scholar scraper
└── model/              # Kalman filter smoothing

Data Formats

citations.json

Raw scraped data:

{
  "user_id": "RIi-1pAAAAAJ",
  "scraped_at": "2025-12-25T12:00:00+00:00",
  "papers": [
    {
      "title": "Paper Title",
      "total_citations": 100,
      "citations_by_year": { "2020": 10, "2021": 25, "2022": 42 }
    }
  ]
}

citation_rates.json

Smoothed rates:

{
  "user_id": "RIi-1pAAAAAJ",
  "model": { "type": "kalman", "process_var": 0.25, "obs_overdispersion": 0.56 },
  "papers": [
    {
      "title": "Paper Title",
      "years": [2020, 2021, 2022],
      "observed_citations": [10, 25, 42],
      "empirical_rate": [10.0, 25.0, 42.0],
      "smoothed_rate": [12.3, 23.1, 38.5],
      "smoothed_rate_std": [3.2, 4.1, 5.0]
    }
  ]
}

Kalman Filter Model

The smoothing model treats each paper's citation history as a noisy observation of an underlying citation rate that evolves over time.

State-space formulation

We model the log citation rate as a random walk:

x_t = x_{t-1} + ε_t,  where ε_t ~ N(0, q)

The observed log-counts are noisy measurements of this latent rate:

z_t = x_t + η_t,  where η_t ~ N(0, R_t)

Time-varying observation variance

Rather than assuming constant observation noise, we use a Poisson-inspired model where high-count years are measured more precisely:

R_t = φ / (rate_t + 0.5) + σ²_min

Here φ is an overdispersion factor (φ > 1 indicates more variance than pure Poisson).

Smoothing

We apply the standard Kalman filter (forward pass) followed by the Rauch-Tung-Striebel smoother (backward pass) to obtain smoothed estimates of the log-rate at each year. Exponentiating gives the smoothed citation rate.

Hyperparameter tuning

The two key parameters are:

Process variance (q): Controls how much the underlying rate can change year-to-year. Higher values allow more volatility.
Overdispersion (φ): Scales the observation noise. Higher values down-weight the observations relative to the prior.

These are tuned by maximizing the marginal likelihood p(observations | q, φ) summed across all papers.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

index

Overview

Installation

Usage

1. Scrape citation data from Google Scholar

2. Tune Kalman filter hyperparameters (optional)

3. Produce smoothed citation rates

4. View the visualization

Project Structure

Data Formats

citations.json

citation_rates.json

Kalman Filter Model

State-space formulation

Time-varying observation variance

Smoothing

Hyperparameter tuning

About

Uh oh!

Releases

Packages

Contributors 2

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 16 Commits
data		data
ingest		ingest
model		model
notes		notes
.gitignore		.gitignore
README.md		README.md
app.js		app.js
config.yaml		config.yaml
index.html		index.html
requirements.txt		requirements.txt
styles.css		styles.css

trvrb/index

Folders and files

Latest commit

History

Repository files navigation

index

Overview

Installation

Usage

1. Scrape citation data from Google Scholar

2. Tune Kalman filter hyperparameters (optional)

3. Produce smoothed citation rates

4. View the visualization

Project Structure

Data Formats

citations.json

citation_rates.json

Kalman Filter Model

State-space formulation

Time-varying observation variance

Smoothing

Hyperparameter tuning

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Uh oh!

Languages

Packages