❗ This is a read-only mirror of the CRAN R package repository. dsld — Data Science Looks at Discrimination. Homepage: https://github.com/matloff/dsld Report bugs for this package: https://github.com/matloff/dsld/issues

DSLD: Data Science Looks at Discrimination

Statistical and graphical tools for detecting and measuring discrimination and bias in data

This is an R package with Python interfaces available.

Overview

Discrimination is a key social issue in the United States and in a number of other countries. A wealth of data is available with which one might investigate possible discrimination. But how might such investigations be conducted?

Our DSLD package provides statistical and graphical tools for detecting and measuring discrimination and bias, be it racial, gender, age-based, or other. It is widely applicable; here are just a few possible use cases:

  • Quantitative analysis in instruction and research in the social sciences.
  • Corporate HR analysis and research.
  • Litigation involving discrimination and related issues.
  • Concerned citizenry.

This package is broadly aimed at users ranging from instructors of statistics classes to legal professionals, as it offers a powerful yet intuitive approach to discrimination analysis. It also includes an 80-page Quarto book serving as a guide to the key statistical principles and their applications.

  • Quarto Book: important statistical principles and applications.
  • Research Paper: package implementation details.

Installation:

From CRAN:

install.packages("dsld")
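
Then load the package in an R session:

library(dsld)    # attach the package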

Analysis categories:

DSLD addresses two main types of bias analysis:

Estimation Analysis: Investigates possible discrimination by estimating effects while accounting for confounders. Confounders are variables that may affect the outcome variable other than through the sensitive variable. DSLD provides both analytical and graphical functions for this purpose.

Prediction Analysis: Addresses algorithmic bias in machine learning by excluding sensitive variables while controlling proxy effects. Proxies are variables strongly related to the sensitive variable that could indirectly introduce bias.

The first case examines societal or institutional bias. The second case focuses on algorithmic bias.

| Statistical Analysis | Fair Machine Learning |
|---|---|
| Estimate an effect | Predict an outcome |
| Harm comes from society | Harm comes from the algorithm |
| Include sensitive variables | Exclude sensitive variables |
| Adjust for covariates | Limit proxy impact |

We will tour a small subset of dsld's features using the svcensus data included in the package.

The data

The svcensus dataset records wage income across six engineering occupations. Its features are 'age', 'education level', 'occupation', 'wage income', 'number of weeks worked', and 'gender'.

> data(svcensus)
> head(svcensus)
       age     educ occ wageinc wkswrkd gender
1 50.30082 zzzOther 102   75000      52 female
2 41.10139 zzzOther 101   12300      20   male
3 24.67374 zzzOther 102   15400      52 female
4 50.19951 zzzOther 100       0      52   male
5 51.18112 zzzOther 100     160       1 female
6 57.70413 zzzOther 100       0       0   male

We will use only a few features to keep things simple. The Quarto Book provides a more extensive analysis of the examples shown below.

Part One: Adjustment for Confounders

We want to estimate the impact of a sensitive variable [S] on an outcome variable [Y], while accounting for confounders [C]. Let's call such analysis "confounder adjustment."

Estimation Example

We are investigating a possible gender pay gap. Here, [Y] is wage income and [S] is gender. We will treat age as a confounder [C], using a linear model. For simplicity, no other confounders (such as occupation) or other predictors [X] are included in this example.

No interactions

> data(svcensus)
> svcensus <- svcensus[,c(1,4,6)]      # subset columns: age, wage, gender
> z <- dsldLinear(svcensus,'wageinc','gender', interactions = FALSE)
> summary(z)                              # show coefficients of linear model

$`Summary Coefficients`
    Covariate   Estimate StandardError PValue
1 (Intercept) 31079.9174    1378.08158      0
2         age   489.5728      30.26461      0
3  gendermale 13098.2091     790.44515      0

$`Sensitive Factor Level Comparisons`
         Factors Compared Estimates Standard Errors P-Value
Estimate    male - female  13098.21        790.4451       0
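
As a quick sanity check (a sketch, not part of the package's API), the no-interactions fit should agree with an ordinary least-squares fit on the same three columns:

summary(lm(wageinc ~ age + gender, data = svcensus))   # coefficients should match the table above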

Our linear model can be written as:

$E(W) = \beta_0 + \beta_1 A + \beta_2 M$

In this no-interaction model, W denotes wage income, A is age, and M is an indicator variable for gender, with M = 1 for men and M = 0 for women.
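
Plugging the fitted coefficients into the model makes this concrete; here is a back-of-the-envelope sketch using the rounded estimates from summary(z), with age 40 chosen arbitrarily for illustration:

b0 <- 31079.92; b1 <- 489.57; b2 <- 13098.21   # estimates from the summary above
b0 + b1*40        # expected wage, woman aged 40: 50662.72
b0 + b1*40 + b2   # expected wage, man aged 40:   63760.93

The two predictions differ by exactly $\beta_2$, whatever the age.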

Thus $\beta_2$ represents the gender wage gap at any age: the model estimates that men earn about $13,098 more than women, at every age. However, the gap might itself vary with age. We can test for such an interaction by fitting separate models for men and women and comparing their predictions, say at ages 36 and 43:

Interactions

> newData <- data.frame(age=c(36,43))   # ages at which to compare predictions
> z <- dsldLinear(svcensus,'wageinc','gender',interactions=TRUE, newData)
> summary(z)

$female
    Covariate   Estimate StandardError PValue
1 (Intercept) 30551.4302    2123.44361      0
2         age   502.9624      52.07742      0

$male
    Covariate  Estimate StandardError PValue
1 (Intercept) 44313.159    1484.82216      0
2         age   486.161      36.02116      0

$`Sensitive Factor Level Comparisons`
  Factors Compared New Data Row Estimates Standard Errors
1    female - male            1 -13156.88        710.9696
2    female - male            2 -13039.27        710.7782

The female-minus-male pay gap is -$13,157 at age 36 and -$13,039 at age 43, a difference of only about $118. This suggests minimal age-gender interaction. We focused only on age as a confounder; other variables, such as occupation, could be included depending on the analysis goals.

Part Two: Discovering/Mitigating Bias in Machine Learning

We are predicting [Y] from a feature set [X] and a sensitive variable [S]. We want to minimize the effect of [S], along with any proxies [O] in [X] that may be correlated with [S]. Increasing fairness (limiting the influence of [S] and [O]) inherently trades off against utility, i.e. predictive accuracy. The package provides wrappers for several fair ML methods from the FairML and EDFFair packages.

Prediction Example

Goal: Predict wage income while minimizing gender bias by limiting the impact of occupation as a proxy variable.

Setup:

  • Outcome [Y]: Wage income
  • Sensitive Variable [S]: Gender
  • Proxy Variable [O]: Occupation (deweighted to 0.2)
  • Method: Fair K-Nearest Neighbors using dsldQeFairKNN()
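
The comparison below can be produced along these lines. This is a sketch: the deweightPars argument name follows the EDFFair wrappers, and the testAcc and corrs components of the returned object are assumptions; consult the package manual for exact usage.

data(svcensus)   # reload the full dataset; occupation was dropped earlier
z <- dsldQeFairKNN(svcensus, 'wageinc', 'gender',
                   deweightPars = list(occ = 0.2))   # deweight occupation to 0.2
z$testAcc   # test-set prediction error (utility)
z$corrs     # correlation between predicted wage and gender (fairness)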
Fairness/Utility Tradeoff:

| Method | Fairness (correlation of prediction with gender) | Accuracy (test error, $) |
|---|---|---|
| K-Nearest Neighbors | 0.1943313 | 25452.08 |
| Fair K-NN (via EDFFair) | 0.0814919 | 26291.38 |

The base K-NN model shows a 0.194 correlation between predicted wage and gender, with a $25,452 prediction error. Using dsldQeFairKNN, the correlation drops to 0.081, but test error increases by about $839. This illustrates the fairness-utility trade-off. Users can test parameter combinations to find their preferred balance; the dsldFairUtils function facilitates this search.
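
For instance, one could sweep the deweighting value by hand (again a sketch, reusing the assumed interface from above; dsldFairUtils automates a more systematic search):

for (w in c(0.1, 0.2, 0.5, 1.0)) {
  z <- dsldQeFairKNN(svcensus, 'wageinc', 'gender', deweightPars = list(occ = w))
  cat('deweight =', w, ' fairness =', z$corrs, ' test error =', z$testAcc, '\n')
}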

Function List

1. Graphical Functions

| Function | Description | Use Case |
|---|---|---|
| dsldFreqPCoord | Frequency-based parallel coordinates | Visualizing multivariate relationships |
| dsldScatterPlot3D | 3D scatter plots with color coding | Exploring 3D data relationships |
| dsldConditDisparity | Conditional disparity plots | Detecting Simpson's Paradox |
| dsldDensityByS | Density plots by sensitive variable | Comparing distributions across groups |
| dsldConfounders | Confounder analysis | Identifying confounding variables |
| dsldIamb | Constraint-based structure learning algorithms | Fitting a causal model to data |
2. Analytical Functions

| Function | Description | Use Case |
|---|---|---|
| dsldLinear | Linear regression with sensitive group comparisons | Regression outcome analysis |
| dsldLogit | Logistic regression with sensitive group comparisons | Binary outcome analysis |
| dsldML | Machine learning with sensitive group comparisons | Analysis via non-parametric models (KNN, RF) |
| dsldTakeALookAround | Feature set evaluation | Assessing prediction fairness |
| dsldCHunting | Confounder hunting | Finding variables that predict both Y and S |
| dsldOHunting | Proxy hunting | Finding variables that predict S |
| dsldMatchedAte | Causal inference via matching | Estimating treatment effects |
3. Fair Machine Learning Functions

| Function | Description | Package |
|---|---|---|
| dsldFairML | FairML algorithm wrappers | FairML |
| dsldQeFairML | EDFFair algorithm wrappers | EDFFair |
| dsldFairUtils | Grid search and parameter optimization for fair ML | |

Available Algorithms:

  • FairML: dsldFrrm, dsldFgrrm, dsldZlm, dsldNclm, dsldZlrm
  • EDFFair: dsldQeFairKNN, dsldQeFairRf, dsldQeFairRidgeLin, dsldQeFairRidgeLog
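
As a rough illustration of the FairML-side wrappers, a call might look like the following. This is a sketch under assumptions: the data/yName/sName argument pattern is assumed to match the rest of dsld, and the unfairness parameter (a value in (0, 1], smaller meaning stricter fairness) comes from fairml's frrm; check the documentation before use.

z <- dsldFrrm(svcensus, 'wageinc', 'gender', unfairness = 0.05)   # assumed interface
summary(z)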

Authors

  • Norm Matloff
  • Aditya Mittal
  • Taha Abdullah
  • Arjun Ashok
  • Shubhada Martha
  • Billy Ouattara
  • Jonathan Tran
  • Brandon Zarate

For issues, contact Aditya Mittal at [email protected]
