Commit 9bfe276

feat(KDP): adding DistributionAwareEncoder layer and numeric preprocessing
1 parent a67c634 commit 9bfe276

8 files changed: +1655 −286

docs/distribution_aware_encoder.md

Lines changed: 323 additions & 0 deletions
@@ -0,0 +1,323 @@
# Distribution-Aware Encoder

## Overview
The Distribution-Aware Encoder is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It uses TensorFlow Probability (tfp) for accurate modeling and applies specialized transformations while preserving the statistical properties of the data.

## Features

### Distribution Types Supported

1. **Normal Distribution**
   - For standard normally distributed data
   - Handled via z-score normalization
   - Detection: Kurtosis ≈ 3.0, Skewness ≈ 0

2. **Heavy-Tailed Distribution**
   - For data with heavier tails than normal
   - Handled via Student's t-distribution
   - Detection: Kurtosis > 3.5

3. **Multimodal Distribution**
   - For data with multiple peaks
   - Handled via Gaussian Mixture Models
   - Detection: KDE-based peak detection

4. **Uniform Distribution**
   - For evenly distributed data
   - Handled via min-max scaling
   - Detection: Kurtosis ≈ 1.8

5. **Exponential Distribution**
   - For data with exponential decay
   - Handled via rate-based transformation
   - Detection: Skewness ≈ 2.0

6. **Log-Normal Distribution**
   - For data that is normal after log transform
   - Handled via logarithmic transformation
   - Detection: Log-transformed kurtosis ≈ 3.0

7. **Discrete Distribution**
   - For data with finite distinct values
   - Handled via empirical CDF-based encoding
   - Detection: Unique values analysis

8. **Periodic Distribution**
   - For data with cyclic patterns
   - Handled via Fourier features (sin/cos)
   - Detection: Autocorrelation analysis

9. **Sparse Distribution**
   - For data with many zeros
   - Handled via separate zero/non-zero transformations
   - Detection: Zero ratio analysis

10. **Beta Distribution**
    - For bounded data between 0 and 1
    - Handled via beta CDF transformation
    - Detection: Value range and shape analysis

11. **Gamma Distribution**
    - For positive, right-skewed data
    - Handled via gamma CDF transformation
    - Detection: Positive support and skewness

12. **Poisson Distribution**
    - For count data
    - Handled via rate parameter estimation
    - Detection: Integer values and variance ≈ mean

13. **Weibull Distribution**
    - For lifetime/failure data
    - Handled via Weibull CDF
    - Detection: Shape and scale analysis

14. **Cauchy Distribution**
    - For extremely heavy-tailed data
    - Handled via robust location-scale estimation
    - Detection: Undefined moments

15. **Zero-Inflated Distribution**
    - For data with excess zeros
    - Handled via mixture model approach
    - Detection: Zero proportion analysis

16. **Bounded Distribution**
    - For data with known bounds
    - Handled via scaled beta transformation
    - Detection: Value range analysis

17. **Ordinal Distribution**
    - For ordered categorical data
    - Handled via learned mapping
    - Detection: Discrete ordered values

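The moment-based heuristics above can be approximated with a few tensor ops. The following is a minimal sketch, not the layer's actual detection code: the `detect_family` helper, its threshold values, and the family labels are illustrative assumptions, and the scalar comparisons assume eager execution.

```python
import tensorflow as tf

def detect_family(x: tf.Tensor, epsilon: float = 1e-6) -> str:
    """Very rough distribution guess from sample moments (illustrative only)."""
    x = tf.cast(x, tf.float32)
    mean = tf.reduce_mean(x)
    std = tf.math.reduce_std(x) + epsilon
    z = (x - mean) / std
    skewness = tf.reduce_mean(z ** 3)
    kurtosis = tf.reduce_mean(z ** 4)  # ≈ 3.0 for normal data, ≈ 1.8 for uniform data
    zero_ratio = tf.reduce_mean(tf.cast(tf.equal(x, 0.0), tf.float32))

    if zero_ratio > 0.5:
        return "sparse / zero-inflated"
    if kurtosis > 3.5:
        return "heavy-tailed"
    if kurtosis < 2.2:
        return "uniform-like"
    if tf.abs(skewness) > 1.0:
        return "skewed (exponential, log-normal, gamma, ...)"
    return "approximately normal"

print(detect_family(tf.random.normal([10_000])))  # usually "approximately normal"
```

The actual encoder combines many more signals before choosing a transformation, such as autocorrelation for periodicity, unique-value counts for discrete data, and value ranges for bounded or beta-like data.
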
## Usage

### Basic Usage
```python
from kdp.processor import PreprocessingModel

preprocessor = PreprocessingModel(
    features_stats=stats,
    features_specs=specs,
    use_distribution_aware=True
)
```

### Advanced Configuration
```python
encoder = DistributionAwareEncoder(
    num_bins=1000,
    epsilon=1e-6,
    detect_periodicity=True,
    handle_sparsity=True,
    adaptive_binning=True,
    mixture_components=3,
    trainable=True
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| num_bins | int | 1000 | Number of bins for quantile encoding |
| epsilon | float | 1e-6 | Small value for numerical stability |
| detect_periodicity | bool | True | Enable periodic pattern detection |
| handle_sparsity | bool | True | Enable special handling for sparse data |
| adaptive_binning | bool | True | Enable adaptive bin boundaries |
| mixture_components | int | 3 | Number of components for mixture models |
| trainable | bool | True | Whether parameters are trainable |

## Key Features

### 1. Automatic Distribution Detection
- Uses statistical moments and tests
- Employs KDE for multimodality detection
- Handles mixed distributions via ensemble approach

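As a rough illustration of the KDE-based multimodality check mentioned above, a smoothed histogram with a simple peak count already separates unimodal from multimodal samples. This is a standalone sketch rather than the layer's implementation; `looks_multimodal`, the bin count, and the noise-floor threshold are assumptions, and it expects eager execution.

```python
import tensorflow as tf

def looks_multimodal(x: tf.Tensor, bins: int = 50) -> bool:
    x = tf.cast(tf.reshape(x, [-1]), tf.float32)
    lo, hi = tf.reduce_min(x), tf.reduce_max(x)
    hist = tf.cast(tf.histogram_fixed_width(x, [lo, hi], nbins=bins), tf.float32)
    # Light smoothing so sampling noise does not create spurious peaks.
    kernel = tf.constant([1.0, 2.0, 3.0, 2.0, 1.0]) / 9.0
    smoothed = tf.nn.conv1d(
        tf.reshape(hist, [1, -1, 1]),
        tf.reshape(kernel, [-1, 1, 1]),
        stride=1,
        padding="SAME",
    )[0, :, 0]
    # A peak is a bin higher than both neighbours and above a noise floor.
    left, centre, right = smoothed[:-2], smoothed[1:-1], smoothed[2:]
    peaks = tf.logical_and(centre > left, centre > right)
    peaks = tf.logical_and(peaks, centre > 0.05 * tf.reduce_max(smoothed))
    return int(tf.reduce_sum(tf.cast(peaks, tf.int32))) >= 2

two_peaks = tf.concat(
    [tf.random.normal([5000], -3.0), tf.random.normal([5000], 3.0)], axis=0
)
print(looks_multimodal(two_peaks))  # True for this bimodal sample
```
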
### 2. Adaptive Transformations
- Learns optimal parameters during training
- Adjusts to data distribution changes
- Handles complex periodic patterns

### 3. Fourier Feature Generation
- Automatic frequency detection
- Multiple harmonic components
- Phase-aware transformations

### 4. Robust Handling
- Special treatment for zeros
- Outlier-resistant transformations
- Numerical stability safeguards

## Implementation Details

### 1. Periodic Data Handling
```python
import math

import tensorflow as tf

# inputs, scale, freq, phase and is_multimodal come from the layer itself
# Normalize to the [-π, π] range
normalized = inputs * math.pi / scale
# Generate Fourier features
features = [
    tf.sin(freq * normalized + phase),
    tf.cos(freq * normalized + phase),
]
# Add harmonics if the data is multimodal
if is_multimodal:
    for h in [2, 3, 4]:
        features.extend([
            tf.sin(h * freq * normalized + phase),
            tf.cos(h * freq * normalized + phase),
        ])
```
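
For a concrete feel of the recipe above, here is a self-contained toy run on an hour-of-day style feature. The values of `scale`, `freq`, and `phase` are illustrative assumptions; the layer estimates or learns them itself.

```python
import math

import tensorflow as tf

inputs = tf.linspace(0.0, 24.0, 100)      # e.g. an hour-of-day feature
scale, freq, phase = 24.0, 1.0, 0.0        # illustrative values only
normalized = inputs * math.pi / scale
features = tf.stack(
    [tf.sin(freq * normalized + phase), tf.cos(freq * normalized + phase)], axis=-1
)
print(features.shape)                      # (100, 2): one sin/cos pair per value
```
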
### 2. Distribution Detection
```python
# Statistical moments
mean = tf.reduce_mean(inputs)
variance = tf.math.reduce_variance(inputs)
skewness = compute_skewness(inputs)   # moment helpers; an illustrative version follows below
kurtosis = compute_kurtosis(inputs)

# Distribution tests
is_normal = test_normality(inputs)
is_multimodal = detect_multimodality(inputs)
is_periodic = check_periodicity(inputs)
```
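
The moment helpers referenced above can be written in a few lines. The following is an illustrative version only; the layer's own implementations may differ.

```python
import tensorflow as tf

def compute_skewness(x: tf.Tensor, epsilon: float = 1e-6) -> tf.Tensor:
    """Sample skewness: E[((x - mean) / std)^3]."""
    mean = tf.reduce_mean(x)
    std = tf.math.reduce_std(x) + epsilon
    return tf.reduce_mean(((x - mean) / std) ** 3)

def compute_kurtosis(x: tf.Tensor, epsilon: float = 1e-6) -> tf.Tensor:
    """Sample kurtosis: E[((x - mean) / std)^4], ≈ 3.0 for a normal sample."""
    mean = tf.reduce_mean(x)
    std = tf.math.reduce_std(x) + epsilon
    return tf.reduce_mean(((x - mean) / std) ** 4)

sample = tf.random.normal([10_000])
print(float(compute_skewness(sample)), float(compute_kurtosis(sample)))  # ≈ 0.0, ≈ 3.0
```
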
### 3. Adaptive Parameters
```python
self.boundaries = self.add_weight(
    name="boundaries",
    shape=(num_bins - 1,),
    initializer="zeros",
    trainable=adaptive_binning
)

self.mixture_weights = self.add_weight(
    name="mixture_weights",
    shape=(mixture_components,),
    initializer="ones",
    trainable=True
)
```
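
For intuition, here is one way such learned weights could be used at call time. This is a sketch, not the layer's actual `call()` logic; the constant tensors simply stand in for the trained variables.

```python
import tensorflow as tf

boundaries = tf.constant([-1.0, 0.0, 1.0])        # stands in for self.boundaries
mixture_weights = tf.constant([1.0, 1.0, 1.0])    # stands in for self.mixture_weights

values = tf.constant([-2.5, -0.3, 0.4, 3.0])
bin_ids = tf.searchsorted(boundaries, values)      # -> [0, 1, 2, 3]
component_probs = tf.nn.softmax(mixture_weights)   # -> [1/3, 1/3, 1/3]
print(bin_ids.numpy(), component_probs.numpy())
```
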

## Best Practices

1. **Data Preparation**
   - Clean obvious outliers
   - Handle missing values before encoding
   - Ensure numeric data types

2. **Configuration**
   - Start with the default parameters, then adjust based on data characteristics
   - Enable periodicity detection for time-related features
   - Use adaptive binning for changing distributions
   - Adjust mixture components based on complexity

3. **Performance**
   - Use appropriate batch sizes
   - Enable caching when possible
   - Monitor transformation times

4. **Monitoring**
   - Check distribution detection accuracy
   - Validate transformation quality
   - Watch for numerical instabilities

## Integration with Preprocessing Pipeline

The DistributionAwareEncoder is integrated into the numeric feature processing pipeline:

1. **Feature Statistics Collection**
   - Basic statistics (mean, variance)
   - Distribution characteristics
   - Sparsity patterns

2. **Automatic Distribution Detection**
   - Statistical tests
   - Pattern recognition
   - Threshold-based decisions

3. **Dynamic Transformation**
   - Distribution-specific handling
   - Adaptive parameter adjustment
   - Quality monitoring

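To make the three stages concrete, here is a tiny, self-contained sketch of the flow (statistics → detection → transformation). The helper names and the simplified sparse handling are illustrative assumptions, not KDP's API, and the plain `if` checks assume eager execution.

```python
import tensorflow as tf

def collect_stats(x: tf.Tensor) -> dict:
    # Stage 1: basic statistics and sparsity pattern
    return {
        "mean": tf.reduce_mean(x),
        "variance": tf.math.reduce_variance(x),
        "zero_ratio": tf.reduce_mean(tf.cast(tf.equal(x, 0.0), tf.float32)),
    }

def detect(stats: dict) -> str:
    # Stage 2: threshold-based decision (heavily simplified)
    return "sparse" if stats["zero_ratio"] > 0.5 else "continuous"

def transform(x: tf.Tensor, family: str, stats: dict, epsilon: float = 1e-6) -> tf.Tensor:
    # Stage 3: distribution-specific handling
    z = (x - stats["mean"]) / tf.sqrt(stats["variance"] + epsilon)
    if family == "sparse":
        # keep zeros as zeros, z-score only the non-zero part
        # (uses overall statistics here purely for brevity)
        return tf.where(tf.equal(x, 0.0), tf.zeros_like(x), z)
    return z

x = tf.random.normal([1_000])
stats = collect_stats(x)
print(detect(stats), transform(x, detect(stats), stats)[:3].numpy())
```
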
## Performance Considerations

### Memory Usage
- Adaptive binning weights: O(num_bins)
- GMM parameters: O(mixture_components)
- Periodic components: O(1)

### Computational Complexity
- Distribution detection: O(n)
- Transformation: O(n)
- GMM fitting: O(n * mixture_components)

## Example Use Cases

### 1. Financial Data
```python
# Handle heavy-tailed return distributions
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    handle_sparsity=False,
    mixture_components=2
)
```

### 2. Temporal Data
```python
# Handle periodic patterns
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    detect_periodicity=True,
    adaptive_binning=True
)
```

### 3. Sparse Features
```python
# Handle sparse numeric data (many zeros)
preprocessor = PreprocessingModel(
    use_distribution_aware=True,
    handle_sparsity=True,
    mixture_components=1
)
```

## Monitoring and Debugging

### Distribution Detection
```python
# Access distribution information
dist_info = encoder._estimate_distribution(inputs)
print(f"Detected distribution: {dist_info['type']}")
print(f"Statistics: {dist_info['stats']}")
```

### Transformation Quality
```python
# Monitor transformed output statistics
transformed = encoder(inputs)
print(f"Output mean: {tf.reduce_mean(transformed)}")
print(f"Output variance: {tf.math.reduce_variance(transformed)}")
```

kdp/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@
     CategoryEncodingOptions,
     OutputModeOptions,
     PreprocessingModel,
+    TabularAttentionPlacementOptions,
     TransformerBlockPlacementOptions,
 )
 from kdp.stats import DatasetStatistics
@@ -25,4 +26,5 @@
     "CategoryEncodingOptions",
     "TransformerBlockPlacementOptions",
     "OutputModeOptions",
+    "TabularAttentionPlacementOptions",
 ]
