# Distribution-Aware Encoder

## Overview
The Distribution-Aware Encoder is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It uses TensorFlow Probability (tfp) for accurate modeling and applies specialized transformations while preserving the statistical properties of the data.

## Features

### Distribution Types Supported
1. **Normal Distribution**
   - For standard normally distributed data
   - Handled via z-score normalization
   - Detection: Kurtosis ≈ 3.0, Skewness ≈ 0

2. **Heavy-Tailed Distribution**
   - For data with heavier tails than normal
   - Handled via Student's t-distribution
   - Detection: Kurtosis > 3.5

3. **Multimodal Distribution**
   - For data with multiple peaks
   - Handled via Gaussian Mixture Models
   - Detection: KDE-based peak detection

4. **Uniform Distribution**
   - For evenly distributed data
   - Handled via min-max scaling
   - Detection: Kurtosis ≈ 1.8

5. **Exponential Distribution**
   - For data with exponential decay
   - Handled via rate-based transformation
   - Detection: Skewness ≈ 2.0

6. **Log-Normal Distribution**
   - For data that is normal after a log transform
   - Handled via logarithmic transformation
   - Detection: Log-transformed kurtosis ≈ 3.0

7. **Discrete Distribution**
   - For data with finitely many distinct values
   - Handled via empirical CDF-based encoding
   - Detection: Unique-value analysis

8. **Periodic Distribution**
   - For data with cyclic patterns
   - Handled via Fourier features (sin/cos)
   - Detection: Autocorrelation analysis

9. **Sparse Distribution**
   - For data with many zeros
   - Handled via separate zero/non-zero transformations
   - Detection: Zero-ratio analysis

10. **Beta Distribution**
    - For bounded data between 0 and 1
    - Handled via beta CDF transformation
    - Detection: Value range and shape analysis

11. **Gamma Distribution**
    - For positive, right-skewed data
    - Handled via gamma CDF transformation
    - Detection: Positive support and skewness

12. **Poisson Distribution**
    - For count data
    - Handled via rate parameter estimation
    - Detection: Integer values and variance ≈ mean

13. **Weibull Distribution**
    - For lifetime/failure data
    - Handled via Weibull CDF
    - Detection: Shape and scale analysis

14. **Cauchy Distribution**
    - For extremely heavy-tailed data
    - Handled via robust location-scale estimation
    - Detection: Undefined moments

15. **Zero-Inflated Distribution**
    - For data with excess zeros
    - Handled via a mixture-model approach
    - Detection: Zero-proportion analysis

16. **Bounded Distribution**
    - For data with known bounds
    - Handled via scaled beta transformation
    - Detection: Value range analysis

17. **Ordinal Distribution**
    - For ordered categorical data
    - Handled via learned mapping
    - Detection: Discrete ordered values

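As a rough illustration of the moment-based heuristics above, the sketch below classifies a sample using the documented kurtosis/skewness thresholds for the normal, heavy-tailed, and uniform cases. The function names, thresholds, and tolerances are illustrative, not the library's exact implementation:

```python
import numpy as np

def moments(x):
    """Population skewness and (non-excess) kurtosis of a 1-D sample."""
    mu, sigma = x.mean(), x.std() + 1e-12
    z = (x - mu) / sigma
    return np.mean(z**3), np.mean(z**4)

def rough_classify(x):
    """Toy classifier using the documented thresholds (illustrative only)."""
    skew, kurt = moments(x)
    if kurt > 3.5:
        return "heavy_tailed"
    if abs(kurt - 3.0) < 0.5 and abs(skew) < 0.5:
        return "normal"
    if abs(kurt - 1.8) < 0.3:
        return "uniform"
    return "unknown"

rng = np.random.default_rng(0)
print(rough_classify(rng.normal(size=50000)))        # normal
print(rough_classify(np.linspace(0.0, 1.0, 10001)))  # uniform
```

Note the check order: heavy tails are tested first, since a heavy-tailed sample would otherwise never fall through to the normal branch.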
## Usage

### Basic Usage
```python
from kdp.processor import PreprocessingModel

# stats and specs are the feature statistics and feature specs
# collected earlier in the pipeline
preprocessor = PreprocessingModel(
    features_stats=stats,
    features_specs=specs,
    use_distribution_aware=True,
)
```

### Advanced Configuration
```python
encoder = DistributionAwareEncoder(
    num_bins=1000,
    epsilon=1e-6,
    detect_periodicity=True,
    handle_sparsity=True,
    adaptive_binning=True,
    mixture_components=3,
    trainable=True,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| num_bins | int | 1000 | Number of bins for quantile encoding |
| epsilon | float | 1e-6 | Small value for numerical stability |
| detect_periodicity | bool | True | Enable periodic pattern detection |
| handle_sparsity | bool | True | Enable special handling for sparse data |
| adaptive_binning | bool | True | Enable adaptive bin boundaries |
| mixture_components | int | 3 | Number of components for mixture models |
| trainable | bool | True | Whether parameters are trainable |

## Key Features

### 1. Automatic Distribution Detection
- Uses statistical moments and tests
- Employs KDE for multimodality detection
- Handles mixed distributions via an ensemble approach

### 2. Adaptive Transformations
- Learns optimal parameters during training
- Adjusts to changes in the data distribution
- Handles complex periodic patterns

### 3. Fourier Feature Generation
- Automatic frequency detection
- Multiple harmonic components
- Phase-aware transformations

### 4. Robust Handling
- Special treatment for zeros
- Outlier-resistant transformations
- Numerical stability safeguards

## Implementation Details

### 1. Periodic Data Handling
```python
import math
import tensorflow as tf

# inputs, scale, freq, phase, and is_multimodal come from the layer's state
# Normalize to the [-pi, pi] range
normalized = inputs * math.pi / scale
# Generate base Fourier features
features = [
    tf.sin(freq * normalized + phase),
    tf.cos(freq * normalized + phase),
]
# Add harmonics if the data is multimodal
if is_multimodal:
    for h in (2, 3, 4):
        features.extend([
            tf.sin(h * freq * normalized + phase),
            tf.cos(h * freq * normalized + phase),
        ])
```
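
The same idea can be shown self-contained in NumPy. The `fourier_features` helper and its default frequency/phase values are illustrative, not the layer's API:

```python
import numpy as np

def fourier_features(x, freq=1.0, phase=0.0, harmonics=(1,)):
    """Map values in [-pi, pi] to one sin/cos pair per harmonic."""
    feats = []
    for h in harmonics:
        feats.append(np.sin(h * freq * x + phase))
        feats.append(np.cos(h * freq * x + phase))
    return np.stack(feats, axis=-1)

x = np.linspace(-np.pi, np.pi, 100)
feats = fourier_features(x, harmonics=(1, 2, 3, 4))  # shape (100, 8)
```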

### 2. Distribution Detection
```python
# Statistical moments
mean = tf.reduce_mean(inputs)
variance = tf.math.reduce_variance(inputs)
skewness = compute_skewness(inputs)  # internal helper
kurtosis = compute_kurtosis(inputs)  # internal helper

# Distribution tests
is_normal = test_normality(inputs)
is_multimodal = detect_multimodality(inputs)
is_periodic = check_periodicity(inputs)
```
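
The multimodality test can be approximated without KDE by counting local maxima in a smoothed histogram. This is a hedged stand-in for `detect_multimodality`, not the library's implementation:

```python
import numpy as np

def count_modes(x, bins=30, min_prominence=0.1):
    """Count local maxima in a moving-average-smoothed histogram."""
    hist, _ = np.histogram(x, bins=bins, density=True)
    smooth = np.convolve(hist, np.ones(3) / 3, mode="same")
    peaks = 0
    for i in range(1, len(smooth) - 1):
        is_peak = smooth[i] > smooth[i - 1] and smooth[i] >= smooth[i + 1]
        if is_peak and smooth[i] > min_prominence * smooth.max():
            peaks += 1
    return peaks

rng = np.random.default_rng(0)
bimodal = np.concatenate([rng.normal(-3, 0.5, 5000), rng.normal(3, 0.5, 5000)])
```

A well-separated two-cluster sample like `bimodal` should yield at least two detected modes.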

### 3. Adaptive Parameters
```python
self.boundaries = self.add_weight(
    name="boundaries",
    shape=(num_bins - 1,),
    initializer="zeros",
    trainable=adaptive_binning,
)

self.mixture_weights = self.add_weight(
    name="mixture_weights",
    shape=(mixture_components,),
    initializer="ones",
    trainable=True,
)
```

## Best Practices

1. **Data Preparation**
   - Clean obvious outliers
   - Handle missing values
   - Ensure numeric data types

2. **Configuration**
   - Enable periodicity detection for time-related features
   - Use adaptive binning for changing distributions
   - Adjust mixture components based on complexity

3. **Performance**
   - Use appropriate batch sizes
   - Enable caching when possible
   - Monitor transformation times

4. **Monitoring**
   - Check distribution detection accuracy
   - Validate transformation quality
   - Watch for numerical instabilities

## Integration with Preprocessing Pipeline

The DistributionAwareEncoder is integrated into the numeric feature processing pipeline:

1. **Feature Statistics Collection**
   - Basic statistics (mean, variance)
   - Distribution characteristics
   - Sparsity patterns

2. **Automatic Distribution Detection**
   - Statistical tests
   - Pattern recognition
   - Threshold-based decisions

3. **Dynamic Transformation**
   - Distribution-specific handling
   - Adaptive parameter adjustment
   - Quality monitoring

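The three stages above can be sketched end to end. The branch conditions and transforms below are simplified stand-ins for the real distribution-specific handlers, chosen only to show the dispatch pattern:

```python
import numpy as np

def skewness(x):
    z = (x - x.mean()) / (x.std() + 1e-12)
    return np.mean(z**3)

def dynamic_transform(x):
    """Toy three-branch dispatch: sparse, right-skewed positive, or default z-score."""
    if np.mean(x == 0) > 0.5:
        # sparse: keep zeros, compress the non-zero magnitudes
        return np.where(x == 0, 0.0, np.sign(x) * np.log1p(np.abs(x)))
    if np.all(x >= 0) and skewness(x) > 1.0:
        # right-skewed positive data: log transform
        return np.log1p(x)
    # default: z-score normalization
    return (x - x.mean()) / (x.std() + 1e-12)
```
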
## Performance Considerations

### Memory Usage
- Adaptive binning weights: O(num_bins)
- GMM parameters: O(mixture_components)
- Periodic components: O(1)

### Computational Complexity
- Distribution detection: O(n)
- Transformation: O(n)
- GMM fitting: O(n * mixture_components)

## Example Use Cases

### 1. Financial Data
```python
# Handle heavy-tailed return distributions
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    handle_sparsity=False,
    mixture_components=2,
)
```

### 2. Temporal Data
```python
# Handle periodic patterns
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    detect_periodicity=True,
    adaptive_binning=True,
)
```

### 3. Sparse Features
```python
# Handle sparse categorical data
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    handle_sparsity=True,
    mixture_components=1,
)
```

## Monitoring and Debugging

### Distribution Detection
```python
# Access distribution information (_estimate_distribution is a private API
# and may change between versions)
dist_info = encoder._estimate_distribution(inputs)
print(f"Detected distribution: {dist_info['type']}")
print(f"Statistics: {dist_info['stats']}")
```

### Transformation Quality
```python
# Monitor transformed output statistics
transformed = encoder(inputs)
print(f"Output mean: {tf.reduce_mean(transformed)}")
print(f"Output variance: {tf.math.reduce_variance(transformed)}")
```
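
For the numerical-instability checks recommended under Best Practices, a small helper (not part of the library, names are hypothetical) can flag common problems in a transformed batch:

```python
import numpy as np

def output_health_issues(x):
    """Return a list of detected problems in a transformed feature array."""
    issues = []
    if np.isnan(x).any():
        issues.append("contains NaN")
    if np.isinf(x).any():
        issues.append("contains Inf")
    if np.var(x) < 1e-12:
        issues.append("near-constant output")
    return issues
```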