
giobbu/covariate-shift


Multivariate Shift Detectors

Covariate Shift

We have a Source distribution, $P_{\text{source}}$, and a Target distribution, $P_{\text{target}}$:

  • Source distribution: $(X^{\text{source}}, Y^{\text{source}}) ∼ P_{\text{source}}$

  • Target distribution: $(X^{\text{target}}, Y^{\text{target}}) ∼ P_{\text{target}}$

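Covariate shift occurs when the conditional distribution of the labels given the covariates is the same in both domains, but the marginal distribution of the covariates differs: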

$$P_{\text{target}}(Y \mid X) = P_{\text{source}}(Y \mid X) \quad \text{but} \quad P_{\text{target}}(X) \ne P_{\text{source}}(X)$$
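The definition above can be illustrated with a small simulation. This is only a sketch: the labelling function `sample_labels`, the Gaussian covariates, and all parameter values are assumptions chosen for illustration and are not part of the repository.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_labels(x, rng):
    """Shared conditional P(Y | X): y = sin(x) + Gaussian noise (hypothetical choice)."""
    return np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)


# The covariate distribution P(X) differs between source and target...
x_source = rng.normal(loc=0.0, scale=1.0, size=5000)
x_target = rng.normal(loc=1.5, scale=1.0, size=5000)

# ...while the same labelling mechanism P(Y | X) is used for both.
y_source = sample_labels(x_source, rng)
y_target = sample_labels(x_target, rng)

# The covariate means differ, but the residual noise around sin(x) is the same.
resid_source = y_source - np.sin(x_source)
resid_target = y_target - np.sin(x_target)
print(f"mean of X: source={x_source.mean():.2f}, target={x_target.mean():.2f}")
print(f"residual std: source={resid_source.std():.3f}, target={resid_target.std():.3f}")
```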

Multivariate drift detectors

  • Maximum Mean Discrepancy Two-Sample Test - MMD Test

from source.mmd import MMD_test
"""
x_before (np.ndarray): First sample of shape (n, d) from source distribution.
x_after (np.ndarray): Second sample of shape (m, d) from target distribution.
n_permutations (int): Number of permutations for the permutation test.
sigma (float): Bandwidth parameter for the Gaussian kernel.
"""
sigma = 1.0
n_permutations = 1000
mmd_statistic, mmd_perms, pval = MMD_test(x_before, x_after, sigma, n_permutations=n_permutations)
print(f"MMD Statistic: {mmd_statistic}, p-value: {pval}")
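
For readers who want to see what such a test computes, here is a minimal NumPy sketch of a permutation test based on the biased squared-MMD estimate with a Gaussian kernel. It is not the repository's `MMD_test`; the function names (`gaussian_kernel`, `mmd2_biased`, `mmd_permutation_test`) and the sample data are hypothetical, and the actual implementation may differ, for example in the estimator or the p-value convention.

```python
import numpy as np


def gaussian_kernel(a, b, sigma):
    """Gaussian (RBF) kernel matrix between the rows of a and the rows of b."""
    sq_dists = (np.sum(a**2, axis=1)[:, None]
                + np.sum(b**2, axis=1)[None, :]
                - 2.0 * a @ b.T)
    # Clip tiny negative values caused by floating-point rounding.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))


def mmd2_biased(x, y, sigma):
    """Biased estimate of the squared MMD between samples x and y."""
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2.0 * gaussian_kernel(x, y, sigma).mean())


def mmd_permutation_test(x, y, sigma=1.0, n_permutations=200, seed=0):
    """Return (observed statistic, permuted statistics, p-value)."""
    rng = np.random.default_rng(seed)
    observed = mmd2_biased(x, y, sigma)
    pooled = np.vstack([x, y])
    n = len(x)
    perms = np.empty(n_permutations)
    for i in range(n_permutations):
        # Shuffle the pooled sample and split it into two pseudo-samples.
        idx = rng.permutation(len(pooled))
        perms[i] = mmd2_biased(pooled[idx[:n]], pooled[idx[n:]], sigma)
    # The +1 terms keep the p-value strictly positive.
    p_value = (1 + np.sum(perms >= observed)) / (1 + n_permutations)
    return observed, perms, p_value


# Illustrative data: the second sample has a shifted mean.
rng = np.random.default_rng(1)
x_before = rng.normal(0.0, 1.0, size=(100, 2))
x_after = rng.normal(1.0, 1.0, size=(100, 2))
stat, perms, pval = mmd_permutation_test(x_before, x_after, sigma=1.0)
print(f"MMD^2: {stat:.4f}, p-value: {pval:.4f}")
```

Shuffling the pooled sample simulates the null hypothesis that both samples come from the same distribution, so the p-value is the fraction of permuted statistics at least as large as the observed one.
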
  • Log-Likelihood Ratio Test - LLR Test

from source.ratio import LLR_test
"""
x_before (np.ndarray): First sample of shape (n, d) from source distribution.
x_after (np.ndarray): Second sample of shape (m, d) from target distribution.
bandwidth (float): Bandwidth parameter for KDE.
n_permutations (int): Number of permutations for the permutation test. Default is 1000.
"""
bandwidth = 0.5
n_permutations = 1000
llr_statistic, llr_perms, p_value = LLR_test(x_before, x_after, bandwidth=bandwidth, n_permutations=n_permutations)
print(f'LLR Statistic: {llr_statistic}, p-value: {p_value}')
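
As a rough illustration of the idea behind a KDE-based log-likelihood ratio test, the sketch below fits a Gaussian kernel density estimate to each sample and uses the mean log-density ratio over the second sample as the statistic. This is an assumption about how such a test can be built, not the repository's `LLR_test`; the names `kde_log_density`, `llr_statistic`, and `llr_permutation_test`, and the sample data, are hypothetical.

```python
import numpy as np


def kde_log_density(points, data, bandwidth):
    """Log-density of a Gaussian KDE fitted on `data`, evaluated at `points`."""
    n, d = data.shape
    sq_dists = np.sum((points[:, None, :] - data[None, :, :]) ** 2, axis=2)
    log_kernels = -sq_dists / (2.0 * bandwidth**2)
    # Numerically stable log-sum-exp over the kernel contributions.
    max_log = log_kernels.max(axis=1, keepdims=True)
    log_sum = max_log[:, 0] + np.log(np.exp(log_kernels - max_log).sum(axis=1))
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    return log_sum - log_norm


def llr_statistic(x_before, x_after, bandwidth):
    """Mean of log p_after(x) - log p_before(x) over the points in x_after."""
    return np.mean(kde_log_density(x_after, x_after, bandwidth)
                   - kde_log_density(x_after, x_before, bandwidth))


def llr_permutation_test(x_before, x_after, bandwidth=0.5, n_permutations=200, seed=0):
    """Return (observed statistic, permuted statistics, p-value)."""
    rng = np.random.default_rng(seed)
    observed = llr_statistic(x_before, x_after, bandwidth)
    pooled = np.vstack([x_before, x_after])
    n = len(x_before)
    perms = np.empty(n_permutations)
    for i in range(n_permutations):
        idx = rng.permutation(len(pooled))
        perms[i] = llr_statistic(pooled[idx[:n]], pooled[idx[n:]], bandwidth)
    p_value = (1 + np.sum(perms >= observed)) / (1 + n_permutations)
    return observed, perms, p_value


# Illustrative data: the second sample has a shifted mean.
rng = np.random.default_rng(2)
x_before = rng.normal(0.0, 1.0, size=(100, 2))
x_after = rng.normal(1.0, 1.0, size=(100, 2))
llr, llr_perms, p_value = llr_permutation_test(x_before, x_after, bandwidth=0.5)
print(f"LLR statistic: {llr:.4f}, p-value: {p_value:.4f}")
```

Evaluating the second KDE on its own training points biases the statistic upward, but the same bias appears in every permuted statistic, so the permutation p-value remains valid.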

Streaming batch data simulator

  • Data stream with simulated mean drifts

[Figure: Batch Streaming Animation]

  • Drifts detected with MOVING reference window with LLR-test

Useful for building adaptive learning models in streaming environments: the model is updated or rebuilt as soon as a drift event is detected.

[Figures: Moving Window; P-Value Moving Window]

  • Drifts detected with FIXED reference window with LLR-test

Useful for monitoring automated systems and identifying the full duration of a drift. The reference period should be representative of normal operation.

[Figures: Fixed Window; P-Value Fixed Window]
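
The difference between the two reference strategies can be sketched with a toy stream. Everything here is illustrative: `simulate_stream`, `permutation_pvalue`, and `detect` are hypothetical names, the test statistic is a simple mean-difference norm standing in for the LLR statistic, and the moving reference is assumed to be the previous batch, which may not match the repository's windowing.

```python
import numpy as np


def simulate_stream(n_batches=30, batch_size=100, dim=2,
                    drift_start=10, drift_end=20, shift=2.0, seed=0):
    """Yield batches; batches in [drift_start, drift_end) have a shifted mean."""
    rng = np.random.default_rng(seed)
    for t in range(n_batches):
        mean = shift if drift_start <= t < drift_end else 0.0
        yield rng.normal(mean, 1.0, size=(batch_size, dim))


def mean_shift_statistic(a, b):
    """Stand-in statistic: norm of the difference in sample means."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))


def permutation_pvalue(x_ref, x_new, n_permutations=200, seed=0):
    """Permutation p-value for the stand-in statistic."""
    rng = np.random.default_rng(seed)
    observed = mean_shift_statistic(x_ref, x_new)
    pooled = np.vstack([x_ref, x_new])
    n = len(x_ref)
    count = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(pooled))
        count += mean_shift_statistic(pooled[idx[:n]], pooled[idx[n:]]) >= observed
    return (1 + count) / (1 + n_permutations)


def detect(batches, moving, alpha=0.01):
    """Return the indices of batches flagged as drifted.

    moving=True: the reference is always the previous batch.
    moving=False: the first batch stays the reference throughout.
    """
    reference = batches[0]
    flagged = []
    for t in range(1, len(batches)):
        if permutation_pvalue(reference, batches[t], seed=t) < alpha:
            flagged.append(t)
        if moving:
            reference = batches[t]  # slide the reference window forward
    return flagged


batches = list(simulate_stream())
moving_flags = detect(batches, moving=True)
fixed_flags = detect(batches, moving=False)
print("moving reference:", moving_flags)
print("fixed reference: ", fixed_flags)
```

In this sketch the moving reference reacts to the onset and the end of the drift, while the fixed reference keeps flagging batches for as long as the stream differs from the initial reference period.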

Lambda framework for near real-time covariate monitoring

  1. Offline layer: define the Reference Component

Using data collected offline, perform the following steps:

    1. Define the reference distribution: select a fixed portion of the offline data to construct a stable covariate distribution representing normal conditions.
    2. Simulate streaming data via batch sampling: from the remaining offline data, draw multiple batches to simulate streaming behavior. For each batch, compute the statistic of interest.
    3. Model the null distribution: aggregate the statistics into an empirical distribution (i.e., the distribution under the null hypothesis of no drift) that captures the natural variability of the statistic under normal conditions. This null distribution is passed to the streaming layer for real-time monitoring.

    [Figure: Offline-Layer]

  2. Streaming layer: define the Monitoring Component

For each incoming batch of data in streaming:

    1. Compare the current batch against the reference distribution by computing the statistic of interest.
    2. Check where the computed statistic falls within the null distribution derived in the offline layer. If the statistic exceeds a predefined threshold (e.g., the $(1-\alpha)$ quantile of the null distribution), flag the batch as a potential drift event and trigger an alert.

    [Figure: Streaming-Layer]
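
The two layers can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the repository's implementation: `statistic`, `offline_layer`, and `streaming_layer` are hypothetical names, the statistic is a simple mean-difference norm standing in for MMD or LLR, and the data and parameter values are made up.

```python
import numpy as np


def statistic(x_ref, x_batch):
    """Stand-in statistic: norm of the difference in sample means."""
    return np.linalg.norm(x_ref.mean(axis=0) - x_batch.mean(axis=0))


def offline_layer(offline_data, n_reference, batch_size, n_batches, alpha, seed=0):
    """Build the reference sample, the null statistics, and the alert threshold."""
    rng = np.random.default_rng(seed)
    # Step 1: a fixed portion of the offline data is the reference sample.
    reference = offline_data[:n_reference]
    remaining = offline_data[n_reference:]
    # Step 2: draw batches from the remaining data to simulate streaming.
    null_stats = np.empty(n_batches)
    for i in range(n_batches):
        idx = rng.choice(len(remaining), size=batch_size, replace=False)
        null_stats[i] = statistic(reference, remaining[idx])
    # Step 3: the empirical (1 - alpha) quantile is the alert threshold.
    threshold = np.quantile(null_stats, 1.0 - alpha)
    return reference, null_stats, threshold


def streaming_layer(batch, reference, threshold):
    """Return (statistic, alert) for one incoming batch."""
    value = statistic(reference, batch)
    return value, bool(value > threshold)


# Illustrative offline data collected under normal conditions.
rng = np.random.default_rng(42)
offline_data = rng.normal(0.0, 1.0, size=(5000, 3))
reference, null_stats, threshold = offline_layer(
    offline_data, n_reference=1000, batch_size=200, n_batches=500, alpha=0.05)

# Two incoming batches: one without drift and one with a mean shift.
normal_batch = rng.normal(0.0, 1.0, size=(200, 3))
drifted_batch = rng.normal(1.0, 1.0, size=(200, 3))
normal_result = streaming_layer(normal_batch, reference, threshold)
drifted_result = streaming_layer(drifted_batch, reference, threshold)
print("threshold:", threshold)
print("normal batch:", normal_result)
print("drifted batch:", drifted_result)
```

Because the threshold is computed once offline, the streaming layer only needs one statistic evaluation per batch, which avoids running a permutation test on every incoming batch.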

About

Non-parametric statistical tests for multivariate distribution shift detection.
