
Commit ffe8d89

fix(KDP): fixing layers functionality
1 parent e1f453f commit ffe8d89

File tree

6 files changed: +777 −66 lines

docs/distribution_aware_encoder.md

Lines changed: 41 additions & 32 deletions

````diff
@@ -38,13 +38,13 @@ The **Distribution-Aware Encoder** is an advanced preprocessing layer that autom
 7. **Discrete Distribution**
    - For data with finite distinct values
-   - Handled via empirical CDF-based encoding
+   - Handled via rank-based normalization
    - Detection: Unique values analysis

 8. **Periodic Distribution**
    - For data with cyclic patterns
    - Handled via Fourier features (sin/cos)
-   - Detection: Autocorrelation analysis
+   - Detection: Peak spacing analysis

 9. **Sparse Distribution**
    - For data with many zeros
````
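The rank-based normalization this hunk switches to can be illustrated with a minimal NumPy sketch (the function name `rank_normalize` is illustrative, not KDP's actual implementation):

```python
import numpy as np

def rank_normalize(x):
    # Ordinal rank of each value, scaled into the open interval (0, 1).
    # argsort(argsort(x)) yields the rank of each element in sorted order.
    ranks = np.argsort(np.argsort(x))
    return (ranks + 0.5) / len(x)

# Discrete data with a few distinct values; order is preserved,
# the smallest value maps near 0 and the largest near 1.
encoded = rank_normalize(np.array([10.0, 3.0, 3.0, 7.0]))
```

Because only ranks matter, the encoding is insensitive to the scale or spacing of the discrete values.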
````diff
@@ -105,19 +105,18 @@ model = PreprocessingModel( # here
 ```python
 from kdp.processor import PreprocessingModel
-from kdp.features import NumericalFeature
+from kdp.features import NumericalFeature, FeatureType

 # Define features
 features = {
-    # Numerical features
     # Numerical features
     "feature1": NumericalFeature(
         name="feature1",
         feature_type=FeatureType.FLOAT_NORMALIZED
     ),
     "feature2": NumericalFeature(
         name="feature2",
-        feature_type=FeatureType.FLOAT_RESCALED
+        feature_type=FeatureType.FLOAT_RESCALED,
         prefered_distribution="log_normal" # here we could specify a prefered distribution (normal, periodic, etc)
     )
     # etc ..
````
````diff
@@ -150,11 +149,12 @@ encoder = DistributionAwareEncoder(
 |-----------|------|---------|-------------|
 | num_bins | int | 1000 | Number of bins for quantile encoding |
 | epsilon | float | 1e-6 | Small value for numerical stability |
-| detect_periodicity | bool | True | Enable periodic pattern detection | Remove this parameter when having multimodal functions/distributions
+| detect_periodicity | bool | True | Enable periodic pattern detection |
 | handle_sparsity | bool | True | Enable special handling for sparse data |
 | adaptive_binning | bool | True | Enable adaptive bin boundaries |
 | mixture_components | int | 3 | Number of components for mixture models |
 | trainable | bool | True | Whether parameters are trainable |
+| prefered_distribution | DistributionType | None | Manually specify distribution type |

 ## Key Features
````

````diff
@@ -282,35 +282,44 @@ The DistributionAwareEncoder is integrated into the numeric feature processing p
 - Transformation: O(n)
 - GMM fitting: O(n * mixture_components)

-## Best Practices
-
-1. **Data Preparation**
-   - Clean outliers if not meaningful
-   - Handle missing values before encoding
-   - Ensure numeric data type
+## Testing and Validation

-2. **Configuration**
-   - Start with default parameters
-   - Adjust based on data characteristics
-   - Monitor distribution detection results
+For information on how we test and validate the Distribution-Aware Encoder, see the [Distribution-Aware Encoder Testing](distribution_aware_encoder_testing.md) documentation.

-3. **Performance Optimization**
-   - Use appropriate batch sizes
-   - Enable caching for repeated processing
-   - Adjust mixture components based on data
+## Example Usage in Preprocessing Pipeline

-### Distribution Detection
 ```python
-# Access distribution information
-dist_info = encoder._estimate_distribution(inputs)
-print(f"Detected distribution: {dist_info['type']}")
-print(f"Statistics: {dist_info['stats']}")
-```
+# Example with automatic distribution detection
+from kdp.processor import PreprocessingModel
+from kdp.features import NumericalFeature, FeatureType

-### Transformation Quality
-```python
-# Monitor transformed output statistics
-transformed = encoder(inputs)
-print(f"Output mean: {tf.reduce_mean(transformed)}")
-print(f"Output variance: {tf.math.reduce_variance(transformed)}")
+# Define features
+features = {
+    # Default automatic distribution detection
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    # Manually setting a gamma distribution
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+        prefered_distribution="gamma"
+    ),
+}
+
+# Create preprocessing model with distribution-aware encoding
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+    output_mode="concat",
+    use_distribution_aware=True
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
 ```
````
New file

Lines changed: 132 additions & 0 deletions
# Testing the Distribution-Aware Encoder

## Overview

The `DistributionAwareEncoder` is a sophisticated layer that automatically detects and handles various data distributions. To ensure its reliability, we've implemented comprehensive testing that verifies its functionality across different distribution types.

## Key Improvements

We've made several improvements to the `DistributionAwareEncoder` class:

1. **Fixed Multimodality Detection**: Corrected the implementation of the `_detect_multimodality` method to properly handle peak detection and periodicity checking.

2. **Enhanced Discrete Distribution Handling**: Improved the `_handle_discrete` method to work reliably in both eager and graph execution modes, replacing the `StaticHashTable` approach with a more compatible implementation.

3. **Graph Mode Compatibility**: Ensured all methods work correctly in TensorFlow's graph execution mode, which is essential for production deployment.
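As a rough illustration of the kind of peak detection involved, a histogram-based modality check might look like the following (a simplified sketch; `count_histogram_peaks` is a hypothetical helper, not the `_detect_multimodality` implementation):

```python
import numpy as np

def count_histogram_peaks(x, bins=30):
    # Count interior local maxima of the histogram as a crude
    # estimate of the number of modes in the data.
    counts, _ = np.histogram(x, bins=bins)
    peaks = 0
    for i in range(1, bins - 1):
        if counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]:
            peaks += 1
    return peaks

rng = np.random.default_rng(42)
# Two well-separated clusters -> at least two histogram peaks.
bimodal = np.concatenate([rng.normal(-5, 0.5, 500), rng.normal(5, 0.5, 500)])
n_peaks = count_histogram_peaks(bimodal)
```

A real detector would additionally smooth the histogram and threshold peak heights to avoid counting sampling noise as extra modes.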
## Testing Strategy

Our testing approach for the `DistributionAwareEncoder` includes:

### 1. Distribution-Specific Tests

We test each supported distribution type individually:

- **Normal Distribution**: Verifies correct handling of normally distributed data
- **Heavy-Tailed Distribution**: Tests Student's t-distribution handling
- **Multimodal Distribution**: Checks detection and transformation of bimodal data
- **Uniform Distribution**: Validates uniform distribution handling
- **Discrete Distribution**: Tests handling of data with finite distinct values
- **Sparse Distribution**: Verifies special handling for data with many zeros
- **Periodic Distribution**: Tests detection and transformation of cyclic patterns
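Fixtures for per-distribution tests like these can be generated with NumPy along the following lines (sample sizes and parameters are illustrative, not the test suite's actual values):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

samples = {
    "normal": rng.normal(0.0, 1.0, n),
    "heavy_tailed": rng.standard_t(df=3, size=n),           # Student's t
    "multimodal": np.concatenate([rng.normal(-3, 0.5, n // 2),
                                  rng.normal(3, 0.5, n // 2)]),
    "uniform": rng.uniform(0.0, 1.0, n),
    "discrete": rng.integers(0, 5, n).astype(float),        # finite distinct values
    "sparse": np.where(rng.uniform(size=n) < 0.8, 0.0,      # ~80% zeros
                       rng.normal(size=n)),
    "periodic": np.sin(2.0 * np.pi * np.arange(n) / 24.0),  # cyclic pattern
}
```

Each array can then be fed through the encoder and the output checked for finiteness and range, as in the sample test below.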
### 2. Graph Mode Compatibility Test

We verify that the encoder works correctly in TensorFlow's graph execution mode by:

1. Creating a simple model with the encoder
2. Compiling the model
3. Training it for one epoch
4. Verifying no errors occur during graph compilation and execution

## Sample Test Code

Here's an example of how we test the `DistributionAwareEncoder`:

```python
import numpy as np
import pytest
import tensorflow as tf

from kdp.custom_layers import DistributionAwareEncoder, DistributionType


@pytest.fixture
def encoder():
    """Create a DistributionAwareEncoder instance for testing."""
    return DistributionAwareEncoder(num_bins=10, detect_periodicity=True, handle_sparsity=True)


def test_normal_distribution(encoder):
    """Test that normal distribution is correctly identified and transformed."""
    # Generate normal distribution data
    np.random.seed(42)
    data = np.random.normal(0, 1, (100, 1))

    # Transform the data
    transformed = encoder(data)

    # Check that the output is finite and in a reasonable range
    assert np.all(np.isfinite(transformed))
    assert -2.0 <= np.min(transformed) <= 2.0
    assert -2.0 <= np.max(transformed) <= 2.0
```

## Running the Tests

To run the tests, use the following command:

```bash
poetry run pytest tests/test_distribution_encoder.py -v
```

## Best Practices for Using the Distribution-Aware Encoder

1. **Data Preparation**:
   - Clean obvious outliers if they're not meaningful
   - Handle missing values before encoding
   - Ensure numeric data type

2. **Configuration**:
   - Start with default parameters
   - Adjust based on your data characteristics
   - Monitor distribution detection results

3. **Performance Optimization**:
   - Use appropriate batch sizes
   - Enable caching for repeated processing
   - Adjust mixture components based on data complexity

4. **Distribution Monitoring**:
   - For debugging, you can access the detected distribution:
   ```python
   # Access distribution information
   dist_info = encoder._estimate_distribution(inputs)
   print(f"Detected distribution: {dist_info['type']}")
   ```

## Integration with Preprocessing Pipeline

The `DistributionAwareEncoder` is fully integrated into the KDP preprocessing pipeline. To use it, simply enable it in your `PreprocessingModel`:

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature, FeatureType

# Define features
features = {
    "feature1": NumericalFeature(
        name="feature1",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "feature2": NumericalFeature(
        name="feature2",
        feature_type=FeatureType.FLOAT_RESCALED,
        prefered_distribution="log_normal"  # Manually specify distribution if needed
    )
}

# Initialize the model with distribution-aware encoding
model = PreprocessingModel(
    features=features,
    use_distribution_aware=True,
    distribution_aware_bins=1000  # Adjust bin count for finer data resolution
)
```
Comments (0)