Commit c988087

feat(KDP): adding DistributionAwareEncoder layer for numeric feature preprocessing. (#20)
2 parents ea419cf + 2f01e67 commit c988087

24 files changed: +3,189 −694 lines

.isort.cfg

Lines changed: 8 additions & 0 deletions

```ini
[settings]
profile=black
multi_line_output=3
include_trailing_comma=True
force_grid_wrap=0
use_parentheses=True
ensure_newline_before_comments=True
line_length=88
```

.pre-commit-config.yaml

Lines changed: 0 additions & 17 deletions

```diff
@@ -15,23 +15,6 @@ repos:
       # Run the formatter.
       - id: ruff-format
-  - repo: https://github.com/timothycrosley/isort
-    rev: 5.12.0
-    hooks:
-      - id: isort
-        args:
-          [
-            "--profile=black",
-            "--py=311",
-            "--line-length=120",
-            "--multi-line=3",
-            "--trailing-comma",
-            "--force-grid-wrap=0",
-            "--use-parentheses",
-            "--ensure-newline-before-comments",
-            "--project=CORE,src,config,preprocess,train,transform,main,model",
-          ]
-
   - repo: https://github.com/pre-commit/pre-commit-hooks
     rev: v4.1.0
     hooks:
```

.ruff.toml

Whitespace-only changes.

docs/complex_example.md

Lines changed: 5 additions & 0 deletions

```diff
@@ -24,6 +24,7 @@ features = {
     "quantity": NumericalFeature(
         name="quantity",
         feature_type=FeatureType.FLOAT_RESCALED,
+        prefered_distribution="poisson",  # optionally specify a preferred distribution (normal, periodic, etc.)
     ),
 
     # Categorical features
@@ -118,6 +119,10 @@ ppr = PreprocessingModel(
     feature_selection_placement="all_features",  # Choose between (all_features|numeric|categorical)
     feature_selection_units=32,
     feature_selection_dropout=0.15,
+
+    # Distribution-aware configuration
+    use_distribution_aware=True,  # activate the distribution-aware encoder
+    distribution_aware_bins=1000,  # the default; increase for finer-grained data
 )
 
 # Build the preprocessor
```

docs/distribution_aware_encoder.md

Lines changed: 316 additions & 0 deletions

# Distribution-Aware Encoder

## Overview

The Distribution-Aware Encoder is an advanced preprocessing layer that automatically detects and handles various types of data distributions. It uses TensorFlow Probability (tfp) for accurate modeling and applies specialized transformations while preserving the statistical properties of the data.

## Features

### Distribution Types Supported
1. **Normal Distribution**
   - For standard normally distributed data
   - Handled via z-score normalization
   - Detection: kurtosis ≈ 3.0, skewness ≈ 0

2. **Heavy-Tailed Distribution**
   - For data with heavier tails than normal
   - Handled via Student's t-distribution
   - Detection: kurtosis > 3.5

3. **Multimodal Distribution**
   - For data with multiple peaks
   - Handled via Gaussian Mixture Models
   - Detection: KDE-based peak detection

4. **Uniform Distribution**
   - For evenly distributed data
   - Handled via min-max scaling
   - Detection: kurtosis ≈ 1.8

5. **Exponential Distribution**
   - For data with exponential decay
   - Handled via rate-based transformation
   - Detection: skewness ≈ 2.0

6. **Log-Normal Distribution**
   - For data that is normal after a log transform
   - Handled via logarithmic transformation
   - Detection: log-transformed kurtosis ≈ 3.0

7. **Discrete Distribution**
   - For data with finitely many distinct values
   - Handled via empirical CDF-based encoding
   - Detection: unique-values analysis

8. **Periodic Distribution**
   - For data with cyclic patterns
   - Handled via Fourier features (sin/cos)
   - Detection: autocorrelation analysis

9. **Sparse Distribution**
   - For data with many zeros
   - Handled via separate zero/non-zero transformations
   - Detection: zero-ratio analysis

10. **Beta Distribution**
    - For bounded data between 0 and 1
    - Handled via beta CDF transformation
    - Detection: value range and shape analysis

11. **Gamma Distribution**
    - For positive, right-skewed data
    - Handled via gamma CDF transformation
    - Detection: positive support and skewness

12. **Poisson Distribution**
    - For count data
    - Handled via rate parameter estimation
    - Detection: integer values and variance ≈ mean

13. **Cauchy Distribution**
    - For extremely heavy-tailed data
    - Handled via robust location-scale estimation
    - Detection: undefined moments

14. **Zero-Inflated Distribution**
    - For data with excess zeros
    - Handled via a mixture-model approach
    - Detection: zero-proportion analysis
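The moment-based rules above can be sketched with plain NumPy. This is a simplified illustration of three of the detection heuristics, not the layer's actual implementation; `moments` and `classify` and their exact thresholds are hypothetical.

```python
import numpy as np

def moments(x):
    # Standardized third and fourth central moments used by the heuristics above.
    x = np.asarray(x, dtype=np.float64)
    z = (x - x.mean()) / x.std()
    skewness = np.mean(z ** 3)
    kurtosis = np.mean(z ** 4)  # "raw" kurtosis: 3.0 for a normal distribution
    return skewness, kurtosis

def classify(x):
    # Simplified decision rules mirroring the thresholds listed above.
    skew, kurt = moments(x)
    if kurt > 3.5:
        return "heavy_tailed"
    if abs(kurt - 1.8) < 0.3:
        return "uniform"
    if abs(skew) < 0.5 and abs(kurt - 3.0) < 0.5:
        return "normal"
    return "unknown"
```

For a large standard-normal sample, `classify` lands in the "normal" branch; a uniform sample is caught by its characteristic kurtosis of about 1.8.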
## Usage

### Basic Usage

This capability only works with numerical features.

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature

# Define features
features = {
    # Numerical features
    "feature1": NumericalFeature(name="feature1"),
    "feature2": NumericalFeature(name="feature2"),
    # etc.
}

# Initialize the model
model = PreprocessingModel(
    features=features,
    use_distribution_aware=True,
)
```
### Manual Usage

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature, FeatureType

# Define features
features = {
    # Numerical features
    "feature1": NumericalFeature(
        name="feature1",
        feature_type=FeatureType.FLOAT_NORMALIZED,
    ),
    "feature2": NumericalFeature(
        name="feature2",
        feature_type=FeatureType.FLOAT_RESCALED,
        prefered_distribution="log_normal",  # optionally specify a preferred distribution (normal, periodic, etc.)
    ),
    # etc.
}

# Initialize the model
model = PreprocessingModel(
    features=features,
    use_distribution_aware=True,
    distribution_aware_bins=1000,  # the default; increase for finer-grained encoding
)
```
### Advanced Configuration

```python
# The DistributionAwareEncoder layer can also be configured directly:
encoder = DistributionAwareEncoder(
    num_bins=1000,
    epsilon=1e-6,
    detect_periodicity=True,
    handle_sparsity=True,
    adaptive_binning=True,
    mixture_components=3,
    trainable=True,
)
```

## Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| num_bins | int | 1000 | Number of bins for quantile encoding |
| epsilon | float | 1e-6 | Small value for numerical stability |
| detect_periodicity | bool | True | Enable periodic pattern detection (disable for multimodal distributions) |
| handle_sparsity | bool | True | Enable special handling for sparse data |
| adaptive_binning | bool | True | Enable adaptive bin boundaries |
| mixture_components | int | 3 | Number of components for mixture models |
| trainable | bool | True | Whether parameters are trainable |
## Key Features

### 1. Automatic Distribution Detection
- Uses statistical moments and tests
- Employs KDE for multimodality detection
- Handles mixed distributions via an ensemble approach

### 2. Adaptive Transformations
- Learns optimal parameters during training
- Adjusts to changes in the data distribution
- Handles complex periodic patterns

### 3. Fourier Feature Generation
- Automatic frequency detection
- Multiple harmonic components
- Phase-aware transformations

### 4. Robust Handling
- Special treatment for zeros
- Outlier-resistant transformations
- Numerical stability safeguards
## Implementation Details

### 1. Periodic Data Handling

```python
import math
import tensorflow as tf

# Normalize to the [-π, π] range
# (inputs, scale, freq, phase, and is_multimodal are layer state)
normalized = inputs * math.pi / scale

# Generate Fourier features
features = [
    tf.sin(freq * normalized + phase),
    tf.cos(freq * normalized + phase),
]

# Add harmonics if the data is multimodal
if is_multimodal:
    for h in (2, 3, 4):
        features.extend([
            tf.sin(h * freq * normalized + phase),
            tf.cos(h * freq * normalized + phase),
        ])
```
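The same expansion can be packaged as a small, runnable NumPy function. Here `freq`, `phase`, and `harmonics` stand in for state the layer learns; this is an illustration, not the layer's code.

```python
import numpy as np

def fourier_features(inputs, scale, freq=1.0, phase=0.0, harmonics=()):
    # Normalize to the [-π, π] range, then project onto sin/cos features.
    normalized = np.asarray(inputs, dtype=np.float64) * np.pi / scale
    feats = [np.sin(freq * normalized + phase),
             np.cos(freq * normalized + phase)]
    for h in harmonics:  # extra harmonics, e.g. (2, 3, 4) for multimodal data
        feats.append(np.sin(h * freq * normalized + phase))
        feats.append(np.cos(h * freq * normalized + phase))
    return np.stack(feats, axis=-1)
```

With the default arguments, `fourier_features([0.0, 12.0], scale=24.0)` yields a `(2, 2)` array of sin/cos features; adding `harmonics=(2, 3, 4)` widens it to `(2, 8)`.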
### 2. Distribution Detection

```python
# Statistical moments
mean = tf.reduce_mean(inputs)
variance = tf.math.reduce_variance(inputs)
skewness = compute_skewness(inputs)  # internal helper
kurtosis = compute_kurtosis(inputs)  # internal helper

# Distribution tests
is_normal = test_normality(inputs)
is_multimodal = detect_multimodality(inputs)
is_periodic = check_periodicity(inputs)
```
### 3. Adaptive Parameters

```python
self.boundaries = self.add_weight(
    name="boundaries",
    shape=(num_bins - 1,),
    initializer="zeros",
    trainable=adaptive_binning,
)

self.mixture_weights = self.add_weight(
    name="mixture_weights",
    shape=(mixture_components,),
    initializer="ones",
    trainable=True,
)
```
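For intuition, the `num_bins` quantile encoding that these boundary weights parameterize can be approximated offline. This is a static NumPy sketch with a hypothetical `quantile_encode` helper; in the layer the boundaries are trainable.

```python
import numpy as np

def quantile_encode(x, num_bins=1000):
    # Estimate num_bins - 1 boundary values from the empirical quantiles,
    # then map each value to its (rescaled) bin index in [0, 1].
    x = np.asarray(x, dtype=np.float64)
    boundaries = np.quantile(x, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])
    return np.searchsorted(boundaries, x) / num_bins
```

The output is monotone in the input, so rank information is preserved while extreme values are compressed into [0, 1].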
## Best Practices

1. **Data Preparation**
   - Clean obvious outliers
   - Handle missing values
   - Ensure numeric data types

2. **Configuration**
   - Enable periodicity detection for time-related features
   - Use adaptive binning for changing distributions
   - Adjust mixture components based on complexity

3. **Performance**
   - Use appropriate batch sizes
   - Enable caching when possible
   - Monitor transformation times

4. **Monitoring**
   - Check distribution detection accuracy
   - Validate transformation quality
   - Watch for numerical instabilities
## Integration with Preprocessing Pipeline

The DistributionAwareEncoder is integrated into the numeric feature processing pipeline:

1. **Feature Statistics Collection**
   - Basic statistics (mean, variance)
   - Distribution characteristics
   - Sparsity patterns

2. **Automatic Distribution Detection**
   - Statistical tests
   - Pattern recognition
   - Threshold-based decisions

3. **Dynamic Transformation**
   - Distribution-specific handling
   - Adaptive parameter adjustment
   - Quality monitoring
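The three stages above can be sketched end to end in a few lines. This uses pure NumPy and a hypothetical `preprocess_numeric` helper covering just two distribution branches; the real layer implements many more.

```python
import numpy as np

def preprocess_numeric(x):
    x = np.asarray(x, dtype=np.float64)
    # 1. Feature statistics collection
    stats = {"mean": x.mean(), "var": x.var(), "zero_ratio": np.mean(x == 0)}
    # 2. Automatic distribution detection (threshold-based decision)
    z = (x - stats["mean"]) / np.sqrt(stats["var"])
    kurtosis = np.mean(z ** 4)
    dist = "heavy_tailed" if kurtosis > 3.5 else "normal"
    # 3. Dynamic transformation (distribution-specific handling)
    if dist == "heavy_tailed":
        transformed = np.sign(z) * np.log1p(np.abs(z))  # outlier-resistant squashing
    else:
        transformed = z  # plain z-score for normal-looking data
    return transformed, dist
```

A Gaussian sample passes through the z-score branch, while a heavy-tailed sample is routed to the robust squashing transform.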
## Performance Considerations

### Memory Usage
- Adaptive binning weights: O(num_bins)
- GMM parameters: O(mixture_components)
- Periodic components: O(1)

### Computational Complexity
- Distribution detection: O(n)
- Transformation: O(n)
- GMM fitting: O(n * mixture_components)
## Monitoring
### Distribution Detection

```python
# Access distribution information
dist_info = encoder._estimate_distribution(inputs)
print(f"Detected distribution: {dist_info['type']}")
print(f"Statistics: {dist_info['stats']}")
```

### Transformation Quality

```python
# Monitor transformed output statistics
transformed = encoder(inputs)
print(f"Output mean: {tf.reduce_mean(transformed)}")
print(f"Output variance: {tf.math.reduce_variance(transformed)}")
```

0 commit comments
