The Devil Wears Data

Because that "pile of stuff" was never just stuff — it was data waiting to be modeled.

TL;DR

I built an open, longitudinal dataset of 14 luxury fashion houses (2018‑2024) — creative directors, runway calendars, sentiment, revenue, and more — then showed how causal inference (synthetic control) can quantify the impact of Pharrell Williams at Louis Vuitton. Clone the repo, open the R notebook, and run make all to reproduce every figure.

Why This Matters

Fashion is a $1.7 trillion industry that still relies on anecdotes for strategic decisions. Data exists, scattered across runway schedules, glossy editorials, and SEC filings, but no public, research-grade panel ties it all together.

This project delivers that missing link so that:

- Researchers can run time‑series and causal inference without months of scraping.
- Students can learn econometrics on a fun, culturally rich topic.
- Brands & analysts can benchmark creative decisions against measurable outcomes.

Dataset Schema

| Category | Variable | Type | Notes |
|---|---|---|---|
| Identity | `house` | factor | 14 maisons in v1.0 |
| Time | `year`, `season` | int / factor | `season ∈ {ss, fw}` |
| Governance | `creative_director`, `director_years`, `director_houses` | chr / int | Tenure + career breadth |
| Geography | `home_base`, `parent_group` | factor | Links to HQ & conglomerate |
| Runway Presence | `paris`, `milan`, `new_york`, `london` | 0/1 | Major fashion weeks |
| Culture | `met_gala` | 0/1 | Red‑carpet signal |
| Outcome | `fashion_magazine_sentiment` | num | VADER compound score |
| Finance † | `seasonal_revenue`, `employees` | num | Publicly reported, where available |

† Financial coverage is currently limited to houses with public filings. See data/codebook.csv for detailed metadata.

Raw CSV lives in data/Capstone Final Draft Data.csv. A rendered codebook is in /docs.
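
To make the schema concrete, here is a minimal Python sketch of how the panel could be read and type-checked with the standard library only. It is an illustration, not code from this repo: the function name `parse_panel`, the decision to treat a blank sentiment cell as missing, and the two sample rows (whose values are made up) are all assumptions. The governance, geography, and finance columns are left out for brevity.

```python
import csv
import io

SEASONS = {"ss", "fw"}
FLAG_COLUMNS = ["paris", "milan", "new_york", "london", "met_gala"]


def parse_panel(handle):
    """Read house-season rows and coerce the column types listed in the schema."""
    rows = []
    for raw in csv.DictReader(handle):
        season = raw["season"].strip().lower()
        if season not in SEASONS:
            raise ValueError(f"unexpected season: {raw['season']!r}")
        sentiment = raw["fashion_magazine_sentiment"].strip()
        row = {
            "house": raw["house"].strip(),
            "year": int(raw["year"]),
            "season": season,
            # Assumption: an empty cell means "no score available".
            "fashion_magazine_sentiment": float(sentiment) if sentiment else None,
        }
        for flag in FLAG_COLUMNS:
            value = int(raw[flag])
            if value not in (0, 1):
                raise ValueError(f"{flag} must be 0 or 1, got {raw[flag]!r}")
            row[flag] = value
        rows.append(row)
    return rows


# Made-up sample rows for illustration only; these are NOT values from the dataset.
SAMPLE = """house,year,season,paris,milan,new_york,london,met_gala,fashion_magazine_sentiment
Louis Vuitton,2023,ss,1,0,0,0,1,0.42
Louis Vuitton,2023,fw,1,0,0,0,1,
"""

panel = parse_panel(io.StringIO(SAMPLE))
print(panel[0]["house"], panel[0]["year"], panel[0]["season"])
```

On the real file, the same function could be called with `open("data/Capstone Final Draft Data.csv", newline="")`, assuming the CSV headers match the variable names in the table.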

Methodology

1 · Sentiment Pipeline

1. Crawl Vogue, BOF, W, Harper’s Bazaar via Google CSE + News API.
2. Extract article text with newspaper3k.
3. Score with VADER; aggregate by house‑season.
4. Store as tidy panel (house × season).

Python implementation lives in scripts/social_media_sentiment.py. A sample notebook shows exploratory plots.
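
The score-and-aggregate steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the contents of the script: `aggregate_by_house_season` and `toy_score` are hypothetical names, the article texts are invented, and `toy_score` is a crude stand-in so the block runs without third-party packages. With the `vaderSentiment` package installed, the scorer would instead be `SentimentIntensityAnalyzer().polarity_scores(text)["compound"]`.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_house_season(articles, score_fn):
    """Average a per-article score within each (house, year, season) cell."""
    cells = defaultdict(list)
    for article in articles:
        key = (article["house"], article["year"], article["season"])
        cells[key].append(score_fn(article["text"]))
    return [
        {
            "house": house,
            "year": year,
            "season": season,
            "fashion_magazine_sentiment": mean(scores),
            "n_articles": len(scores),
        }
        for (house, year, season), scores in sorted(cells.items())
    ]


def toy_score(text):
    """Placeholder scorer (NOT VADER): share of positive minus negative cue words."""
    words = text.lower().split()
    positive = sum(word in {"stunning", "bold"} for word in words)
    negative = sum(word in {"dull", "derivative"} for word in words)
    total = positive + negative
    return 0.0 if total == 0 else (positive - negative) / total


# Invented article snippets, for illustration only.
articles = [
    {"house": "Louis Vuitton", "year": 2023, "season": "fw", "text": "a stunning bold debut"},
    {"house": "Louis Vuitton", "year": 2023, "season": "fw", "text": "dull staging but bold tailoring"},
    {"house": "Dior", "year": 2023, "season": "fw", "text": "derivative and dull"},
]

sentiment_panel = aggregate_by_house_season(articles, toy_score)
for row in sentiment_panel:
    print(row)
```

Passing the scorer in as an argument keeps the aggregation independent of the sentiment model, which would make it easier to swap VADER for a fine-tuned model later.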

2 · Causal Demo

In analysis/synth_louis_vuitton.R I replicate the synthetic control example from the report:

source("analysis/synth_louis_vuitton.R")

The script:

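1. Converts `year` + `season` to a half‑year time index.
2. Builds a donor pool from the 13 other houses.
3. Matches on fashion‑week presence and director tenure in the pre‑2023 periods.
4. Estimates the ATT (average treatment effect on the treated) of Pharrell’s appointment on magazine sentiment.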
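
For readers who want to see the first two steps outside of R, here is a minimal Python sketch. It is not a port of the R script: `half_year_index` and `split_panel` are hypothetical names, the ordering of `ss` before `fw` within a labelled year is an assumption, and so is treating the first season of 2023 as the start of the treatment period.

```python
# Assumption: within a labelled year, "ss" is ordered before "fw".
SEASON_OFFSET = {"ss": 0, "fw": 1}


def half_year_index(year, season, base_year=2018):
    """Map (year, season) to an integer count of half-years since base_year."""
    return 2 * (year - base_year) + SEASON_OFFSET[season]


def split_panel(rows, treated_house, treatment_year=2023):
    """Return (donor houses, pre-treatment periods, post-treatment periods)."""
    cutoff = half_year_index(treatment_year, "ss")
    donors = sorted({r["house"] for r in rows if r["house"] != treated_house})
    periods = sorted({half_year_index(r["year"], r["season"]) for r in rows})
    pre = [t for t in periods if t < cutoff]
    post = [t for t in periods if t >= cutoff]
    return donors, pre, post


# Hypothetical skeleton panel: two houses observed every season, 2018-2024.
rows = [
    {"house": house, "year": year, "season": season}
    for house in ("Louis Vuitton", "Dior")
    for year in range(2018, 2025)
    for season in ("ss", "fw")
]

donors, pre, post = split_panel(rows, treated_house="Louis Vuitton")
print(donors, len(pre), len(post))  # prints ['Dior'] 10 4
```

An integer index avoids fractional years and gives evenly spaced periods, which is the shape most panel tooling expects.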

Reproduced figure:

Impact of Pharrell Williams at Louis Vuitton (2023‑2024)
┌───────── actual (LV) ──────┐
│                             │
└── synthetic control (LV‧) ──┘
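
The gap between the two lines in the figure is the estimated effect. As a rough illustration of where the synthetic line comes from, the sketch below fits non-negative donor weights that sum to one by projected gradient descent, then averages the post-period gap. It is a simplified stand-in and not the method in the R script: it weights all predictors equally, whereas the `Synth` package, as far as I understand it, also tunes predictor importance. All function names and all numbers are hypothetical toy values.

```python
def project_to_simplex(v):
    """Euclidean projection onto {w : w >= 0, sum(w) == 1} (sort-based method)."""
    cumulative, theta = 0.0, 0.0
    for i, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        if value - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def fit_donor_weights(treated, donors, iterations=5000):
    """Minimise the squared predictor distance between treated unit and weighted donors."""
    weights = [1.0 / len(donors)] * len(donors)
    # Conservative step size from a trace bound on the gradient's Lipschitz constant.
    scale = sum(x * x for donor in donors for x in donor)
    step = 1.0 / (2.0 * scale) if scale > 0 else 0.0
    for _ in range(iterations):
        fitted = [
            sum(w * donor[k] for w, donor in zip(weights, donors))
            for k in range(len(treated))
        ]
        residual = [t - f for t, f in zip(treated, fitted)]
        gradient = [-2.0 * sum(r * x for r, x in zip(residual, donor)) for donor in donors]
        weights = project_to_simplex([w - step * g for w, g in zip(weights, gradient)])
    return weights


def synthetic_series(weights, donor_outcomes):
    """Weighted average of donor outcome series, period by period."""
    n_periods = len(donor_outcomes[0])
    return [
        sum(w * series[t] for w, series in zip(weights, donor_outcomes))
        for t in range(n_periods)
    ]


def average_gap(actual, synthetic):
    """ATT estimate: mean of (actual - synthetic) over post-treatment periods."""
    gaps = [a - s for a, s in zip(actual, synthetic)]
    return sum(gaps) / len(gaps)


# Toy numbers only: three donors, three pre-period predictors, two post periods.
donor_predictors = [[0.1, 0.2, 0.3], [0.5, 0.4, 0.6], [0.0, 0.1, -0.1]]
treated_predictors = [0.3, 0.3, 0.45]
donor_post = [[0.3, 0.3], [0.5, 0.5], [0.0, 0.0]]
treated_post = [0.6, 0.7]

weights = fit_donor_weights(treated_predictors, donor_predictors)
effect = average_gap(treated_post, synthetic_series(weights, donor_post))
print([round(w, 3) for w in weights], round(effect, 3))
```

In this toy setup the treated unit is an exact half-and-half blend of the first two donors, so the fitted weights should land close to that blend. With real data the pre-period fit is rarely exact, and the quality of that fit is worth inspecting before reading anything into the gap.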

Repository Structure

├── data/                  # Raw & processed CSVs + codebook
├── scripts/               # Python crawlers & sentiment pipeline
├── analysis/              # R notebooks & .R scripts for causal work
├── docs/                  # PDF report, figures, and slides
├── assets/                # Logos & banner images
└── README.md              # You are here

Quick Start

# 1 · Clone
$ git clone https://github.com/yourusername/devil-wears-data.git
$ cd devil-wears-data

# 2 · Set up R (≥ 4.3) & install deps
$ R -e "install.packages(c('Synth','dplyr','ggplot2','tidyr','readr'), repos='https://cloud.r-project.org')"

# 3 · (Optional) Python sentiment pipeline
$ pip install -r requirements.txt
$ python scripts/social_media_sentiment.py  # writes data/sentiment.csv

A Makefile is included for one‑command reproduction: make all.

Reproducing the Analysis

1. Run the sentiment pipeline (or use the pre‑computed CSV).
2. Open analysis/synth_louis_vuitton.Rmd and knit to HTML.
3. All figures and tables from the capstone paper will be generated under /docs.

Limitations & Roadmap

- Coverage: v1.0 tracks 14 houses; goal is 30+ by 2026.
- Sentiment Model: VADER is fashion‑agnostic; exploring FinBERT‑style fine‑tuning.
- Financials: Scraping PDF annual reports to improve revenue granularity.
- CI/CD: Automate weekly sentiment refresh via GitHub Actions.

Citation

@misc{nguyen2024devil,
  title  = {The Devil Wears Data},
  author = {Nguyen, An},
  year   = {2024},
  note   = {GitHub repository},
  url    = {https://github.com/yourusername/devil-wears-data}
}

License

Released under the MIT License — see LICENSE.

Contact

Have questions or want to collaborate?

👗 An Nguyen · [email protected] · @causalnotcasual

Data is the new cerulean.
