Commit 462bfc1

feat(KDP): adding feature selection mechanism to the preprocessor (docs, tests) (#19)
2 parents 4b1c510 + 23b36ce commit 462bfc1

9 files changed: +1295 −18 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -166,4 +166,4 @@ kdp/data/fake_data.csv
 my_tests/*

 # derivative files
-data.csv
+*.csv
```

docs/complex_example.md

Lines changed: 19 additions & 16 deletions
```diff
@@ -87,12 +87,15 @@ df = pd.DataFrame({
     ] * 20
 })

-# Save to CSV
+# Format data
 df.to_csv("sample_data.csv", index=False)
+test_batch = tf.data.Dataset.from_tensor_slices(dict(df.head(3))).batch(3)

 # Create preprocessor with both transformer blocks and attention
 ppr = PreprocessingModel(
     path_data="sample_data.csv",
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,  # Force stats generation; recommended to set to True
     features_specs=features,
     output_mode=OutputModeOptions.CONCAT,
```

````diff
@@ -111,32 +114,32 @@ ppr = PreprocessingModel(
     tabular_attention_dropout=0.1,  # Attention dropout rate
     tabular_attention_embedding_dim=16,  # Embedding dimension

-    # Other parameters
-    overwrite_stats=True,  # Force stats generation; recommended to set to True
+    # Feature selection configuration
+    feature_selection_placement="all_features",  # Choose between (all_features|numeric|categorical)
+    feature_selection_units=32,
+    feature_selection_dropout=0.15,
 )

 # Build the preprocessor
 result = ppr.build_preprocessor()
 ```

-Now if one wants to plot, use the Neural Network for predictions or just get the statistics, use the following:
+Now if one wants to plot a block diagram of the model, get the output of the NN, or get the feature importance weights, use the following:

 ```python
 # Plot the model architecture
 ppr.plot_model("complex_model.png")

-# Get predictions with an example test batch from the example data
-test_batch = tf.data.Dataset.from_tensor_slices(dict(df.head(3))).batch(3)
-predictions = result["model"].predict(test_batch)
-print("Output shape:", predictions.shape)
-
-# Print feature statistics
-print("\nFeature Statistics:")
-for feature_type, features in ppr.get_feature_statistics().items():
-    if isinstance(features, dict):
-        print(f"\n{feature_type}:")
-        for feature_name, stats in features.items():
-            print(f"  {feature_name}: {list(stats.keys())}")
+# Transform data using direct model prediction
+transformed_data = ppr.model.predict(test_batch)
+
+# Transform data using batch_predict
+transformed_data = ppr.batch_predict(test_batch)
+transformed_batches = list(transformed_data)  # For better visualization
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+print("Feature importances:", feature_importances)
 ```
````

docs/feature_selection.md

Lines changed: 170 additions & 0 deletions
# Feature Selection in Keras Data Processor

The Keras Data Processor includes a sophisticated feature selection mechanism based on the Gated Residual Variable Selection Network (GRVSN) architecture. This document explains the components, usage, and benefits of this feature.

## Overview

The feature selection mechanism uses a combination of gated units and residual networks to automatically learn the importance of different features in your data. It can be applied to numeric features, categorical features, or both.

## Components

### 1. GatedLinearUnit

The `GatedLinearUnit` is the basic building block; it implements a gated activation function:

```python
import tensorflow as tf
# GatedLinearUnit ships with KDP (the exact import path depends on your KDP version)

gl = GatedLinearUnit(units=64)
x = tf.random.normal((32, 100))
y = gl(x)
```

Key features:

- Applies a linear transformation followed by a sigmoid gate
- Selectively filters input data based on learned weights
- Helps control information flow through the network
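To make the gating concrete, here is a minimal sketch of what such a layer computes. `SimpleGLU` is a hypothetical illustration of the technique, not KDP's actual class:

```python
import tensorflow as tf

class SimpleGLU(tf.keras.layers.Layer):
    """Illustrative gated linear unit: output = linear(x) * sigmoid(gate(x))."""

    def __init__(self, units: int, **kwargs):
        super().__init__(**kwargs)
        self.linear = tf.keras.layers.Dense(units)
        self.gate = tf.keras.layers.Dense(units, activation="sigmoid")

    def call(self, inputs):
        # The sigmoid gate scales each unit of the linear projection into [0, 1],
        # so the layer learns how much of each signal to let through.
        return self.linear(inputs) * self.gate(inputs)
```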
### 2. GatedResidualNetwork

The `GatedResidualNetwork` combines gated linear units with residual connections:

```python
grn = GatedResidualNetwork(units=64, dropout_rate=0.2)
x = tf.random.normal((32, 100))
y = grn(x)
```

Key features:

- Uses ELU activation for non-linearity
- Includes dropout for regularization
- Adds residual connections to help with gradient flow
- Applies layer normalization for stability
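Putting those four ingredients together, a gated residual network can be sketched roughly as follows. `SimpleGRN` builds on the `SimpleGLU` sketch above and is an illustration, not KDP's exact wiring:

```python
class SimpleGRN(tf.keras.layers.Layer):
    """Illustrative gated residual network: ELU -> dense -> dropout -> gate,
    added to a projected residual and layer-normalized."""

    def __init__(self, units: int, dropout_rate: float = 0.2, **kwargs):
        super().__init__(**kwargs)
        self.elu_dense = tf.keras.layers.Dense(units, activation="elu")
        self.dense = tf.keras.layers.Dense(units)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.gate = SimpleGLU(units)  # from the previous sketch
        self.project = tf.keras.layers.Dense(units)  # match widths for the residual
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs, training=False):
        x = self.elu_dense(inputs)
        x = self.dense(x)
        x = self.dropout(x, training=training)
        # Residual connection: project the input so its width matches `units`.
        residual = self.project(inputs)
        return self.norm(residual + self.gate(x))
```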
### 3. VariableSelection

The `VariableSelection` layer is the main feature selection component:

```python
vs = VariableSelection(nr_features=3, units=64, dropout_rate=0.2)
x1 = tf.random.normal((32, 100))
x2 = tf.random.normal((32, 200))
x3 = tf.random.normal((32, 300))
selected_features, weights = vs([x1, x2, x3])
```

Key features:

- Processes each feature independently using GRNs
- Calculates feature importance weights using softmax
- Returns both the selected features and their weights
- Supports different input dimensions for each feature
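The mechanism can be sketched like this: each feature passes through its own GRN, while a shared GRN over the concatenated inputs feeds a softmax that produces the importance weights. `SimpleVariableSelection` is a hypothetical illustration consistent with the list above, built on the earlier sketches:

```python
class SimpleVariableSelection(tf.keras.layers.Layer):
    """Illustrative variable selection: per-feature GRNs plus a softmax over features."""

    def __init__(self, nr_features: int, units: int, dropout_rate: float = 0.2, **kwargs):
        super().__init__(**kwargs)
        self.feature_grns = [SimpleGRN(units, dropout_rate) for _ in range(nr_features)]
        self.weight_grn = SimpleGRN(units, dropout_rate)
        self.softmax = tf.keras.layers.Dense(nr_features, activation="softmax")

    def call(self, inputs, training=False):
        # Project every feature (possibly of different widths) to `units` dimensions.
        processed = [grn(x, training=training) for grn, x in zip(self.feature_grns, inputs)]
        stacked = tf.stack(processed, axis=1)  # (batch, nr_features, units)

        # Importance weights computed from the concatenated raw inputs.
        flat = tf.concat(inputs, axis=-1)
        weights = self.weight_grn(flat, training=training)
        weights = self.softmax(weights)  # (batch, nr_features), rows sum to 1

        # Weighted sum over features: one fused representation plus the weights.
        selected = tf.reduce_sum(stacked * weights[:, :, None], axis=1)
        return selected, weights
```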
## Usage in Preprocessing Model

### Configuration

Configure feature selection in your preprocessing model:

```python
model = PreprocessingModel(
    # ... other parameters ...
    feature_selection_placement="all_features",  # or "numeric" or "categorical"
    feature_selection_units=64,
    feature_selection_dropout=0.2
)
```
### Placement Options

The `FeatureSelectionPlacementOptions` enum provides several options for where to apply feature selection:

1. `NONE`: Disable feature selection
2. `NUMERIC`: Apply only to numeric features
3. `CATEGORICAL`: Apply only to categorical features
4. `ALL_FEATURES`: Apply to all features
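The string values used above map to these enum members. Assuming the enum is exposed by `kdp.processor` (an assumption here, not confirmed by this commit), it can be passed directly:

```python
# Assumed import location; the string form ("all_features") works the same way.
from kdp.processor import FeatureSelectionPlacementOptions, PreprocessingModel

model = PreprocessingModel(
    # ... other parameters ...
    feature_selection_placement=FeatureSelectionPlacementOptions.ALL_FEATURES,
)
```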
### Accessing Feature Weights

After processing, feature weights are available in the `processed_features` dictionary:

```python
# Process your data
processed = model.transform(data)

# Access feature weights
numeric_weights = processed["numeric_feature_weights"]
categorical_weights = processed["categorical_feature_weights"]
```
## Benefits

1. **Automatic Feature Selection**: The model learns which features are most important for your task.
2. **Interpretability**: Feature weights provide insight into feature importance.
3. **Improved Performance**: By focusing on relevant features, the model can achieve better performance.
4. **Regularization**: Dropout and residual connections help prevent overfitting.
5. **Flexibility**: Can be applied to different feature types and combinations.
## Integration with Other Features

The feature selection mechanism integrates seamlessly with other preprocessing components:

1. **Transformer Blocks**: Can be used before or after transformer blocks
2. **Tabular Attention**: Complements tabular attention by focusing on important features
3. **Custom Preprocessors**: Works with any custom preprocessing steps
## Example

Here's a complete example of using feature selection:

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature, CategoricalFeature, FeatureType

# Define features
features = {
    "numeric_1": NumericalFeature(
        name="numeric_1",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "numeric_2": NumericalFeature(
        name="numeric_2",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "category_1": CategoricalFeature(
        name="category_1",
        feature_type=FeatureType.STRING_CATEGORICAL
    )
}

# Create model with feature selection
model = PreprocessingModel(
    # ... other parameters ...
    features_specs=features,
    feature_selection_placement="all_features",  # or "numeric" or "categorical"
    feature_selection_units=64,
    feature_selection_dropout=0.2
)

# Build and use the model
preprocessor = model.build_preprocessor()
processed_data = model.transform(data)  # data can be a pd.DataFrame, a Python dict, or a tf.data.Dataset

# Analyze feature importance
for feature_name in features:
    weights = processed_data[f"{feature_name}_weights"]
    print(f"Feature {feature_name} importance: {weights.mean()}")
```
## Testing

The feature selection components include comprehensive unit tests that verify:

1. Output shapes and types
2. Gating mechanism behavior
3. Residual connections
4. Dropout behavior
5. Feature weight properties
6. Serialization/deserialization

Run the tests using:

```bash
python -m pytest test/test_feature_selection.py -v
```
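For a flavor of what such a test checks, here is a minimal sketch of a property test written against the illustrative `SimpleVariableSelection` sketch above, not KDP's internal test suite:

```python
import tensorflow as tf

def test_variable_selection_weights_sum_to_one():
    # Three features with different widths, as in the VariableSelection example.
    vs = SimpleVariableSelection(nr_features=3, units=64)
    inputs = [tf.random.normal((8, d)) for d in (100, 200, 300)]
    selected, weights = vs(inputs)

    # Output shapes: one fused representation plus one weight per feature.
    assert selected.shape == (8, 64)
    assert weights.shape == (8, 3)

    # Softmax weights are non-negative and sum to 1 for each example.
    tf.debugging.assert_non_negative(weights)
    tf.debugging.assert_near(tf.reduce_sum(weights, axis=-1), tf.ones(8))
```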

docs/imgs/complex_model.png

35 KB