
Commit ffe8d89

fix(KDP): fixing layers functionality
1 parent e1f453f commit ffe8d89

File tree

6 files changed: +777 −66 lines

docs/distribution_aware_encoder.md

Lines changed: 41 additions & 32 deletions

````diff
@@ -38,13 +38,13 @@ The **Distribution-Aware Encoder** is an advanced preprocessing layer that autom
 7. **Discrete Distribution**
    - For data with finite distinct values
-   - Handled via empirical CDF-based encoding
+   - Handled via rank-based normalization
    - Detection: Unique values analysis

 8. **Periodic Distribution**
    - For data with cyclic patterns
    - Handled via Fourier features (sin/cos)
-   - Detection: Autocorrelation analysis
+   - Detection: Peak spacing analysis

 9. **Sparse Distribution**
    - For data with many zeros
````
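The rank-based normalization this hunk switches to can be illustrated with a minimal NumPy sketch (the function name `rank_normalize` is illustrative, not KDP's actual implementation):

```python
import numpy as np

def rank_normalize(x):
    # Ordinal rank of each value, scaled into the open interval (0, 1).
    # argsort(argsort(x)) yields the rank of each element in sorted order.
    ranks = np.argsort(np.argsort(x))
    return (ranks + 0.5) / len(x)

# Discrete data with a few distinct values; order is preserved,
# the smallest value maps near 0 and the largest near 1.
encoded = rank_normalize(np.array([10.0, 3.0, 3.0, 7.0]))
```

Because only ranks matter, the encoding is insensitive to the scale or spacing of the discrete values.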
````diff
@@ -105,19 +105,18 @@ model = PreprocessingModel( # here
 ```python
 from kdp.processor import PreprocessingModel
-from kdp.features import NumericalFeature
+from kdp.features import NumericalFeature, FeatureType

 # Define features
 features = {
-    # Numerical features
     # Numerical features
     "feature1": NumericalFeature(
         name="feature1",
         feature_type=FeatureType.FLOAT_NORMALIZED
     ),
     "feature2": NumericalFeature(
         name="feature2",
-        feature_type=FeatureType.FLOAT_RESCALED
+        feature_type=FeatureType.FLOAT_RESCALED,
         prefered_distribution="log_normal" # here we could specify a prefered distribution (normal, periodic, etc)
     )
     # etc ..
````
````diff
@@ -150,11 +149,12 @@ encoder = DistributionAwareEncoder(
 |-----------|------|---------|-------------|
 | num_bins | int | 1000 | Number of bins for quantile encoding |
 | epsilon | float | 1e-6 | Small value for numerical stability |
-| detect_periodicity | bool | True | Enable periodic pattern detection | Remove this parameter when having multimodal functions/distributions
+| detect_periodicity | bool | True | Enable periodic pattern detection |
 | handle_sparsity | bool | True | Enable special handling for sparse data |
 | adaptive_binning | bool | True | Enable adaptive bin boundaries |
 | mixture_components | int | 3 | Number of components for mixture models |
 | trainable | bool | True | Whether parameters are trainable |
+| prefered_distribution | DistributionType | None | Manually specify distribution type |

 ## Key Features
````

````diff
@@ -282,35 +282,44 @@ The DistributionAwareEncoder is integrated into the numeric feature processing p
 - Transformation: O(n)
 - GMM fitting: O(n * mixture_components)

-## Best Practices
-
-1. **Data Preparation**
-   - Clean outliers if not meaningful
-   - Handle missing values before encoding
-   - Ensure numeric data type
+## Testing and Validation

-2. **Configuration**
-   - Start with default parameters
-   - Adjust based on data characteristics
-   - Monitor distribution detection results
+For information on how we test and validate the Distribution-Aware Encoder, see the [Distribution-Aware Encoder Testing](distribution_aware_encoder_testing.md) documentation.

-3. **Performance Optimization**
-   - Use appropriate batch sizes
-   - Enable caching for repeated processing
-   - Adjust mixture components based on data
+## Example Usage in Preprocessing Pipeline

-### Distribution Detection
 ```python
-# Access distribution information
-dist_info = encoder._estimate_distribution(inputs)
-print(f"Detected distribution: {dist_info['type']}")
-print(f"Statistics: {dist_info['stats']}")
-```
+# Example with automatic distribution detection
+from kdp.processor import PreprocessingModel
+from kdp.features import NumericalFeature, FeatureType

-### Transformation Quality
-```python
-# Monitor transformed output statistics
-transformed = encoder(inputs)
-print(f"Output mean: {tf.reduce_mean(transformed)}")
-print(f"Output variance: {tf.math.reduce_variance(transformed)}")
+# Define features
+features = {
+    # Default automatic distribution detection
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    # Manually setting a gamma distribution
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+        prefered_distribution="gamma"
+    ),
+}
+
+# Create preprocessing model with distribution-aware encoding
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+    output_mode="concat",
+    use_distribution_aware=True
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
 ```
````
New file

Lines changed: 132 additions & 0 deletions
# Testing the Distribution-Aware Encoder

## Overview

The `DistributionAwareEncoder` is a sophisticated layer that automatically detects and handles various data distributions. To ensure its reliability, we've implemented comprehensive testing that verifies its functionality across different distribution types.

## Key Improvements

We've made several improvements to the `DistributionAwareEncoder` class:

1. **Fixed Multimodality Detection**: Corrected the implementation of the `_detect_multimodality` method to properly handle peak detection and periodicity checking.

2. **Enhanced Discrete Distribution Handling**: Improved the `_handle_discrete` method to work reliably in both eager and graph execution modes, replacing the `StaticHashTable` approach with a more compatible implementation.

3. **Graph Mode Compatibility**: Ensured all methods work correctly in TensorFlow's graph execution mode, which is essential for production deployment.
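As a rough illustration of the kind of peak detection involved, a histogram-based modality check might look like the following (a simplified sketch; `count_histogram_peaks` is a hypothetical helper, not the `_detect_multimodality` implementation):

```python
import numpy as np

def count_histogram_peaks(x, bins=30):
    # Count interior local maxima of the histogram as a crude
    # estimate of the number of modes in the data.
    counts, _ = np.histogram(x, bins=bins)
    peaks = 0
    for i in range(1, bins - 1):
        if counts[i] > counts[i - 1] and counts[i] >= counts[i + 1]:
            peaks += 1
    return peaks

rng = np.random.default_rng(42)
# Two well-separated clusters -> at least two histogram peaks.
bimodal = np.concatenate([rng.normal(-5, 0.5, 500), rng.normal(5, 0.5, 500)])
n_peaks = count_histogram_peaks(bimodal)
```

A real detector would additionally smooth the histogram and threshold peak heights to avoid counting sampling noise as extra modes.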
## Testing Strategy

Our testing approach for the `DistributionAwareEncoder` includes:

### 1. Distribution-Specific Tests

We test each supported distribution type individually:

- **Normal Distribution**: Verifies correct handling of normally distributed data
- **Heavy-Tailed Distribution**: Tests Student's t-distribution handling
- **Multimodal Distribution**: Checks detection and transformation of bimodal data
- **Uniform Distribution**: Validates uniform distribution handling
- **Discrete Distribution**: Tests handling of data with finite distinct values
- **Sparse Distribution**: Verifies special handling for data with many zeros
- **Periodic Distribution**: Tests detection and transformation of cyclic patterns
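Fixtures for per-distribution tests like these can be generated with NumPy along the following lines (sample sizes and parameters are illustrative, not the test suite's actual values):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000

samples = {
    "normal": rng.normal(0.0, 1.0, n),
    "heavy_tailed": rng.standard_t(df=3, size=n),           # Student's t
    "multimodal": np.concatenate([rng.normal(-3, 0.5, n // 2),
                                  rng.normal(3, 0.5, n // 2)]),
    "uniform": rng.uniform(0.0, 1.0, n),
    "discrete": rng.integers(0, 5, n).astype(float),        # finite distinct values
    "sparse": np.where(rng.uniform(size=n) < 0.8, 0.0,      # ~80% zeros
                       rng.normal(size=n)),
    "periodic": np.sin(2.0 * np.pi * np.arange(n) / 24.0),  # cyclic pattern
}
```

Each array can then be fed through the encoder and the output checked for finiteness and range, as in the sample test below.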
### 2. Graph Mode Compatibility Test

We verify that the encoder works correctly in TensorFlow's graph execution mode by:

1. Creating a simple model with the encoder
2. Compiling the model
3. Training it for one epoch
4. Verifying no errors occur during graph compilation and execution

## Sample Test Code

Here's an example of how we test the `DistributionAwareEncoder`:

```python
import numpy as np
import pytest
import tensorflow as tf

from kdp.custom_layers import DistributionAwareEncoder, DistributionType


@pytest.fixture
def encoder():
    """Create a DistributionAwareEncoder instance for testing."""
    return DistributionAwareEncoder(num_bins=10, detect_periodicity=True, handle_sparsity=True)


def test_normal_distribution(encoder):
    """Test that normal distribution is correctly identified and transformed."""
    # Generate normal distribution data
    np.random.seed(42)
    data = np.random.normal(0, 1, (100, 1))

    # Transform the data
    transformed = encoder(data)

    # Check that the output is finite and in a reasonable range
    assert np.all(np.isfinite(transformed))
    assert -2.0 <= np.min(transformed) <= 2.0
    assert -2.0 <= np.max(transformed) <= 2.0
```

## Running the Tests

To run the tests, use the following command:

```bash
poetry run pytest tests/test_distribution_encoder.py -v
```

## Best Practices for Using the Distribution-Aware Encoder

1. **Data Preparation**:
   - Clean obvious outliers if they're not meaningful
   - Handle missing values before encoding
   - Ensure numeric data type

2. **Configuration**:
   - Start with default parameters
   - Adjust based on your data characteristics
   - Monitor distribution detection results

3. **Performance Optimization**:
   - Use appropriate batch sizes
   - Enable caching for repeated processing
   - Adjust mixture components based on data complexity

4. **Distribution Monitoring**:
   - For debugging, you can access the detected distribution:
   ```python
   # Access distribution information
   dist_info = encoder._estimate_distribution(inputs)
   print(f"Detected distribution: {dist_info['type']}")
   ```

## Integration with Preprocessing Pipeline

The `DistributionAwareEncoder` is fully integrated into the KDP preprocessing pipeline. To use it, simply enable it in your `PreprocessingModel`:

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature, FeatureType

# Define features
features = {
    "feature1": NumericalFeature(
        name="feature1",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "feature2": NumericalFeature(
        name="feature2",
        feature_type=FeatureType.FLOAT_RESCALED,
        prefered_distribution="log_normal"  # Manually specify distribution if needed
    )
}

# Initialize the model with distribution-aware encoding
model = PreprocessingModel(
    features=features,
    use_distribution_aware=True,
    distribution_aware_bins=1000  # Adjust bin count for finer data resolution
)
```
Comments (0)