Machine learning models may perform differently on different data subgroups. We propose the notion of divergence over itemsets (i.e., conjunctions of simple predicates) as a measure of different classification behavior on data subgroups, and the use of frequent pattern mining techniques for their identification. We quantify the contribution of different attribute values to divergence with the notion of Shapley values to identify both critical and peculiar behaviors of attributes. See our paper and our project page for all the details.
Install using pip with:
pip install divexplorer
or download a wheel or source archive from PyPI.
This notebook gives an example of how to use DivExplorer to find divergent subgroups in datasets and in the predictions of a classifier.
For the code details, see the documentation.
The original paper is:
Looking for Trouble: Analyzing Classifier Behavior via Pattern Divergence. Eliana Pastor, Luca de Alfaro, Elena Baralis. In Proceedings of the 2021 ACM SIGMOD Conference, 2021.
You can find more papers and information in the DivExplorer project page.
DivExplorer works on Pandas DataFrames. Here we load an example one, and discretize one of its attributes into coarser ranges.
import pandas as pd
df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
df_census["AGE_RANGE"] = df_census.apply(lambda row : 10 * (row["A_AGE"] // 10), axis=1)We can then find the data subgroups that have highest income divergence, using the DivergenceExplorer class as follows:
from divexplorer import DivergenceExplorer
fp_diver = DivergenceExplorer(df_census)
subgroups = fp_diver.get_pattern_divergence(
    min_support=0.001,
    attributes=["STATE", "SEX", "EDUCATION", "AGE_RANGE"],
    quantitative_outcomes=["PTOTVAL"])
subgroups.sort_values(by="PTOTVAL_div", ascending=False).head(10)

You can also prune redundant subgroups by specifying:
- a threshold, so that attributes that don't increase the divergence by at least the threshold value are not included in subgroups,
- a minimum t-value, to select only significant subgroups.
from divexplorer import DivergencePatternProcessor
processor = DivergencePatternProcessor(subgroups, "PTOTVAL")
pruned_subgroups = pd.DataFrame(processor.redundancy_pruning(th_redundancy=10000))
pruned_subgroups = pruned_subgroups[pruned_subgroups["PTOTVAL_t"] > 2]
pruned_subgroups.sort_values(by="PTOTVAL_div", ascending=False, ignore_index=True)

For classifiers, it may be of interest to find the subgroups with the highest (or lowest) divergence in metrics such as the false-positive rate. Here is how to do it for the false-positive rate in a COMPAS-derived classifier.
compas_df = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/compas_discretized.csv')

We generate an fp column whose average will give the false-positive rate, like so:
from divexplorer.outcomes import get_false_positive_rate_outcome
y_trues = compas_df["class"]
y_preds = compas_df["predicted"]
compas_df['fp'] = get_false_positive_rate_outcome(y_trues, y_preds)

The fp column has values:
- 1, if the data is a false positive (`class` is 0 and `predicted` is 1);
- 0, if the data is a true negative (`class` is 0 and `predicted` is 0);
- NaN, if the class is positive (`class` is 1).
We use NaN for class-1 data to exclude it from the average, so that the column average is the false-positive rate.
We can then find the most divergent groups as in the previous example, noting that here we use boolean_outcomes rather than quantitative_outcomes because fp is boolean:
fp_diver = DivergenceExplorer(compas_df)
attributes = ['race', '#prior', 'sex', 'age']
FP_fm = fp_diver.get_pattern_divergence(min_support=0.1, attributes=attributes,
                                        boolean_outcomes=['fp'])
FP_fm.sort_values(by="fp_div", ascending=False).head(10)

Note how we specify the attributes that can be used to define subgroups.
The following example, from the example notebook, shows how to use quantitative_outcomes for a quantitative outcome instead:
df_census = pd.read_csv('https://raw.githubusercontent.com/divexplorer/divexplorer/main/datasets/census_income.csv')
explorer = DivergenceExplorer(df_census)
value_subgroups = explorer.get_pattern_divergence(
    min_support=0.001, quantitative_outcomes=["PTOTVAL"])

Returning to our COMPAS example, if we want to analyze what factors contribute to the divergence of a particular subgroup, we can do so via Shapley values:
fp_details = DivergencePatternProcessor(FP_fm, 'fp')
# Select one of the mined patterns (here, the pattern in row 37 of the results):
pattern = fp_details.patterns['itemset'].iloc[37]
fp_details.shapley_value(pattern)

If you get too many subgroups, you can prune redundant ones via redundancy pruning.
This prunes a pattern when it contains attributes that change its divergence by less than th_redundancy:
df_pruned = fp_details.redundancy_pruning(th_redundancy=0.01)
df_pruned.sort_values("fp_div", ascending=False).head(5)

DivExplorer supports multiple frequent-pattern mining algorithms: fpgrowth, apriori, and alt_apriori.
The default is fpgrowth, which is the fastest; you can switch to apriori by specifying algorithm='apriori' in the get_pattern_divergence method. The advantage of the apriori algorithm is that it can be more memory-efficient. The alt_apriori algorithm is also memory-efficient, but slower.
The fpgrowth and apriori algorithms are adapted from the mlxtend library.
The alt_apriori algorithm is derived from the paper:
Sergey Brin, Rajeev Motwani, Jeffrey D. Ullman, and Shalom Tsur. “Dynamic Itemset Counting and Implication Rules for Market Basket Data.” In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data - SIGMOD ’97, 255–64. Tucson, Arizona, United States: ACM Press, 1997. https://doi.org/10.1145/253260.253325.
We extended the above algorithm with options and optimizations that speed up the computation and allow limiting the memory used:
- You can specify the maximum number of items in an itemset with `max_items` (default: `None`, meaning all items). For instance, if finding itemsets with at most 4 items suffices, you can specify `max_items=4`.
- The dataset is shuffled with a random seed, and for each itemset, only the first `max_instances` instances are considered (default: `None`, meaning all instances). For example, if you want to use a support limit of 0.01, choosing `max_instances=10000` guarantees that every frequent itemset will be supported by at least 100 instances, allowing for the computation of somewhat reliable divergence values. If the dataset is very large (in number of instances), this can lead to a considerable speedup.