Statistical and graphical tools for detecting and measuring discrimination and bias in data
This is an R package with Python interfaces available.
Discrimination is a key social issue in the United States and in a number of other countries. A wealth of data is available with which one might investigate possible discrimination. But how should such investigations be conducted?
Our DSLD package provides statistical and graphical tools for detecting and measuring discrimination and bias, whether racial, gender, age, or other. It is widely applicable; here are just a few possible use cases:
- Quantitative analysis in instruction and research in the social sciences.
- Corporate HR analysis and research.
- Litigation involving discrimination and related issues.
- Concerned citizenry.
This package is aimed at a broad range of users, from instructors of statistics classes to legal professionals, as it offers a powerful yet intuitive approach to discrimination analysis. It also includes an 80-page Quarto book that serves as a guide to the key statistical principles and their applications.
- Quarto Book: important statistical principles and their applications.
- Research Paper: package implementation details.
From CRAN:
install.packages("dsld")
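Then load the package:

library(dsld)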
DSLD addresses two main types of bias analysis:
Estimation Analysis: Investigates possible discrimination by estimating effects while accounting for confounders. Confounders are variables that may affect the outcome variable other than through the sensitive variable. DSLD provides both analytical and graphical functions for this purpose.
Prediction Analysis: Addresses algorithmic bias in machine learning by excluding sensitive variables while controlling proxy effects. Proxies are variables strongly related to the sensitive variable that could indirectly introduce bias.
The first case examines societal or institutional bias. The second case focuses on algorithmic bias.
| Statistical Analysis | Fair Machine Learning |
|---|---|
| Estimate an effect | Predict an outcome |
| Harm comes from society | Harm comes from algorithm |
| Include sensitive variables | Exclude sensitive variables |
| Adjust for covariates | Limit proxy impact |
We will tour a small subset of dsld's features using the svcensus data included in the package.
The svcensus dataset records wage income across six engineering occupations, with the features 'age', 'education level', 'occupation', 'wage income', 'number of weeks worked', and 'gender'.
> data(svcensus)
> head(svcensus)
age educ occ wageinc wkswrkd gender
1 50.30082 zzzOther 102 75000 52 female
2 41.10139 zzzOther 101 12300 20 male
3 24.67374 zzzOther 102 15400 52 female
4 50.19951 zzzOther 100 0 52 male
5 51.18112 zzzOther 100 160 1 female
6 57.70413 zzzOther 100 0 0 male
We will use only a few features to keep things simple. The Quarto Book provides a more extensive analysis of the examples shown below.
We want to estimate the impact of a sensitive variable [S] on an outcome variable [Y], while accounting for confounders [C]. Let's call such analysis "confounder adjustment."
We are investigating a possible gender pay gap between men and women. Here, [Y] is wage and [S] is gender. We will treat age as a confounder [C], using a linear model. For simplicity, no other confounders (such as occupation) or any other predictors [X] are included in this example.
No interactions
> data(svcensus)
> svcensus <- svcensus[,c(1,4,6)] # subset columns: age, wage, gender
> z <- dsldLinear(svcensus,'wageinc','gender', interactions = FALSE)
> summary(z) # show coefficients of linear model
$`Summary Coefficients`
Covariate Estimate StandardError PValue
1 (Intercept) 31079.9174 1378.08158 0
2 age 489.5728 30.26461 0
3 gendermale 13098.2091 790.44515 0
$`Sensitive Factor Level Comparisons`
Factors Compared Estimates Standard Errors P-Value
Estimate male - female 13098.21 790.4451 0
In the no-interaction case, our linear model can be written as

$$E(W) = \beta_0 + \beta_1 A + \beta_2 M$$

where W is wage income, A is age, and M is an indicator variable, with M = 1 for men and M = 0 for women. The estimated gender effect is the coefficient of M: holding age constant, men earn on average about $13,098 more than women.
Interactions

To allow the age effect to differ by gender, we set interactions=TRUE. The gender gap then depends on age, so we request comparisons at specific ages via the newData argument:

> newData <- data.frame(age=c(36,43))
> z <- dsldLinear(svcensus,'wageinc','gender',interactions=TRUE,newData)
> summary(z)
$female
Covariate Estimate StandardError PValue
1 (Intercept) 30551.4302 2123.44361 0
2 age 502.9624 52.07742 0
$male
Covariate Estimate StandardError PValue
1 (Intercept) 44313.159 1484.82216 0
2 age 486.161 36.02116 0
$`Sensitive Factor Level Comparisons`
Factors Compared New Data Row Estimates Standard Errors
1 female - male 1 -13156.88 710.9696
2 female - male 2 -13039.27 710.7782
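With interactions, we are effectively fitting a separate regression line for each gender, so the estimated gap (female - male) at age A is

$$(30551.43 - 44313.16) + (502.96 - 486.16)\,A \approx -13761.73 + 16.80\,A$$

At age 36 this gives about -$13,157, matching the first comparison row above.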
The gap is thus -$13,157 at age 36 and -$13,039 at age 43, a difference of only $118, suggesting minimal age-gender interaction. We focused only on age as a confounder; other variables, such as occupation, could be included depending on the analysis goals.
We are predicting [Y] from a feature set [X] and a sensitive variable [S]. We want to minimize the effect of [S], along with that of any proxies [O] in [X] that may be correlated with [S]. The inherent trade-off: increasing fairness (minimizing the influence of [S] and [O]) reduces utility, i.e. predictive accuracy. The package provides wrappers for fair machine learning algorithms from several packages.
Goal: Predict wage income while minimizing gender bias by limiting the impact of occupation as a proxy variable.
Setup:
- Outcome [Y]: Wage income
- Sensitive Variable [S]: Gender
- Proxy Variable [O]: Occupation (deweighted to 0.2)
- Method: Fair K-Nearest Neighbors via dsldQeFairKNN()
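A call implementing this setup might look as follows. This is a sketch: the deweightPars argument follows the EDFFair convention for down-weighting a proxy, and the testAcc/corrs fields are assumptions based on the qe* function family; consult ?dsldQeFairKNN for the exact interface.

> z <- dsldQeFairKNN(svcensus, 'wageinc', 'gender',
+                    deweightPars = list(occ = 0.2))  # deweight occupation to 0.2
> z$testAcc   # holdout prediction error (field name assumed from the qe* family)
> z$corrs     # correlation between predictions and gender (field name assumed)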
Fairness/Utility Tradeoff:

| Model | Fairness (correlation with S; lower is better) | Accuracy (test error, $; lower is better) |
|---|---|---|
| K-Nearest Neighbors | 0.1943313 | 25452.08 |
| Fair K-NN (via EDFFair) | 0.0814919 | 26291.38 |
The base K-NN model shows a 0.194 correlation between predicted wage and gender, with a $25,452 prediction error. Using dsldQeFairKNN, the correlation drops to 0.081, but test error increases by $839. This shows the fairness-utility trade-off. Users can test parameter combinations to find their optimal balance; the dsldFairUtils function facilitates this search.
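As a minimal sketch of what such a search involves, reusing the assumed interface above, one could loop over candidate deweighting values by hand:

# A hand-rolled search over deweighting values; dsldFairUtils automates
# this kind of exploration. Interface and field names are assumed, as above.
for (w in c(0.1, 0.2, 0.5, 1.0)) {
  z <- dsldQeFairKNN(svcensus, 'wageinc', 'gender', deweightPars = list(occ = w))
  cat('occ weight:', w, 'test error:', z$testAcc, 'correlation:', z$corrs, '\n')
}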
- Graphical Functions
| Function | Description | Use Case |
|---|---|---|
| dsldFreqPCoord | Frequency-based parallel coordinates | Visualizing multivariate relationships |
| dsldScatterPlot3D | 3D scatter plots with color coding | Exploring 3D data relationships |
| dsldConditDisparity | Conditional disparity plots | Detecting Simpson's Paradox |
| dsldDensityByS | Density plots by sensitive variable | Comparing distributions across groups |
| dsldConfounders | Confounder analysis | Identifying confounding variables |
| dsldIamb | Constraint-based structure learning | Fitting a causal model to data |
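For example, a quick comparison of wage distributions by gender is a single call; the argument order here is an assumption based on the package's usual (data, column, sensitive variable) pattern, so check ?dsldDensityByS:

> dsldDensityByS(svcensus, cName = 'wageinc', sName = 'gender')  # one density curve per gender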
- Analytical Functions
| Function | Description | Use Case |
|---|---|---|
| dsldLinear | Linear regression with sensitive group comparisons | Regression outcome analysis |
| dsldLogit | Logistic regression with sensitive group comparisons | Binary outcome analysis |
| dsldML | Machine learning with sensitive group comparisons | Analysis via nonparametric models (KNN, RF) |
| dsldTakeALookAround | Feature set evaluation | Assessing prediction fairness |
| dsldCHunting | Confounder hunting | Finding variables that predict both Y and S |
| dsldOHunting | Proxy hunting | Finding variables that predict S |
| dsldMatchedAte | Causal inference via matching | Estimating treatment effects |
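For instance, to ask which features could act as proxies for gender, dsldOHunting can be called in what we assume is the package's standard (data, yName, sName) form; see ?dsldOHunting:

> dsldOHunting(svcensus, yName = 'wageinc', sName = 'gender')  # signature assumed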
- Fair Machine Learning Functions
| Function | Description | Package |
|---|---|---|
| dsldFairML | FairML algorithm wrappers | FairML |
| dsldQeFairML | EDFFair algorithm wrappers | EDFFair |
| dsldFairUtils | Grid search and parameter optimization for fair ML | |
Available Algorithms:
- FairML: dsldFrrm, dsldFgrrm, dsldZlm, dsldNclm, dsldZlrm
- EDFFair: dsldQeFairKNN, dsldQeFairRf, dsldQeFairRidgeLin, dsldQeFairRidgeLog
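For example, fairml's fair ridge regression is reachable through dsldFrrm, where unfairness is fairml's tuning parameter (roughly, a bound on how much the sensitive attribute may contribute to the fit). Argument names are assumptions here; see ?dsldFrrm:

> z <- dsldFrrm(svcensus, 'wageinc', 'gender', unfairness = 0.05)  # arguments assumed
> summary(z)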
Authors:
- Norm Matloff
- Aditya Mittal
- Taha Abdullah
- Arjun Ashok
- Shubhada Martha
- Billy Ouattara
- Jonathan Tran
- Brandon Zarate
For issues, contact Aditya Mittal at [email protected]