Commit 5c3a974

feat(KDP): adding numerical embedding layers (#26)
2 parents 1d06b76 + 4181bb3 commit 5c3a974

10 files changed (+1032, -145 lines)

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Advanced Numerical Embeddings in KDP

Keras Data Processor (KDP) now provides advanced numerical embedding techniques to better capture complex numerical relationships in your data. This release introduces two embedding approaches:

---

## AdvancedNumericalEmbedding

**Purpose:**
Processes individual numerical features with tailored embedding layers. This layer performs adaptive binning, applies MLP transformations per feature, and can incorporate dropout and batch normalization.

**Key Parameters:**
- **`embedding_dim`**: Dimension of each feature's embedding.
- **`mlp_hidden_units`**: Number of hidden units in the MLP applied to each feature.
- **`num_bins`**: Number of bins used for discretizing continuous inputs.
- **`init_min` and `init_max`**: Initialization boundaries for binning.
- **`dropout_rate`**: Dropout rate for regularization.
- **`use_batch_norm`**: Flag to apply batch normalization.

**Usage Example:**
```python
from kdp.custom_layers import AdvancedNumericalEmbedding
import tensorflow as tf

layer = AdvancedNumericalEmbedding(
    embedding_dim=8,
    mlp_hidden_units=16,
    num_bins=10,
    init_min=[-3.0, -2.0, -4.0],
    init_max=[3.0, 2.0, 4.0],
    dropout_rate=0.1,
    use_batch_norm=True,
)

# Input shape: (batch_size, num_features)
x = tf.random.normal((32, 3))
# Output shape: (32, 3, 8)
output = layer(x, training=False)
```
---

## GlobalAdvancedNumericalEmbedding

**Purpose:**
Combines a set of numerical features into a single, compact representation. It does so by applying an internal advanced numerical embedding on the concatenated input and then performing a global pooling over all features.

**Key Parameters (prefixed with `global_`):**
- **`global_embedding_dim`**: Global embedding dimension (final pooled vector size).
- **`global_mlp_hidden_units`**: Hidden units in the global MLP.
- **`global_num_bins`**: Number of bins for discretization.
- **`global_init_min` and `global_init_max`**: Global initialization boundaries.
- **`global_dropout_rate`**: Dropout rate.
- **`global_use_batch_norm`**: Whether to apply batch normalization.
- **`global_pooling`**: Pooling method to use ("average" or "max").

**Usage Example:**
```python
from kdp.custom_layers import GlobalAdvancedNumericalEmbedding
import tensorflow as tf

global_layer = GlobalAdvancedNumericalEmbedding(
    global_embedding_dim=8,
    global_mlp_hidden_units=16,
    global_num_bins=10,
    global_init_min=[-3.0, -2.0, -4.0],
    global_init_max=[3.0, 2.0, 4.0],
    global_dropout_rate=0.1,
    global_use_batch_norm=True,
    global_pooling="average",
)

# Input shape: (batch_size, num_features)
x = tf.random.normal((32, 3))
# Global output shape: (32, 8)
global_output = global_layer(x, training=False)
```
---

## When to Use Which?

- **AdvancedNumericalEmbedding:**
  Use this when you need to process each numerical feature individually, preserving its distinct characteristics via per-feature embeddings.

- **GlobalAdvancedNumericalEmbedding:**
  Choose this option when you want to merge multiple numerical features into a unified global embedding using a pooling mechanism. This is particularly useful when the overall interaction across features matters more than the individual feature details (see the shape comparison sketch below).
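For orientation, here is a minimal sketch (relying on the default arguments shown in the examples above) that contrasts the two output shapes on the same batch:

```python
import tensorflow as tf

from kdp.custom_layers import AdvancedNumericalEmbedding, GlobalAdvancedNumericalEmbedding

x = tf.random.normal((32, 3))  # batch of 32 rows, 3 numerical features

# Per-feature embeddings: one 8-dimensional vector per feature.
per_feature = AdvancedNumericalEmbedding(embedding_dim=8)(x, training=False)  # shape (32, 3, 8)

# A single pooled embedding summarizing all features together.
pooled = GlobalAdvancedNumericalEmbedding(global_embedding_dim=8)(x, training=False)  # shape (32, 8)
```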
## Advanced Configuration

Both layers offer additional parameters to fine-tune the embedding process. You can adjust dropout rates, batch normalization, and binning strategies to best suit your data. For more detailed information, please refer to the API documentation.
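As an illustration only, those knobs can be combined as follows; the values below are placeholders, not recommendations:

```python
from kdp.custom_layers import GlobalAdvancedNumericalEmbedding

# Illustrative tuning: finer binning, stronger dropout, max pooling.
tuned_layer = GlobalAdvancedNumericalEmbedding(
    global_embedding_dim=16,
    global_mlp_hidden_units=32,
    global_num_bins=20,           # finer discretization of the inputs
    global_dropout_rate=0.25,     # stronger regularization
    global_use_batch_norm=False,  # disable batch normalization
    global_pooling="max",         # keep the strongest activation per embedding dimension
)
```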
---

This document highlights the key differences and usage examples for the new advanced numerical embeddings available in KDP.

docs/complex_example.md

Lines changed: 21 additions & 0 deletions
@@ -123,6 +123,27 @@ ppr = PreprocessingModel(
     # Distribution aware configuration
     use_distribution_aware=True,  # here we activate the distribution aware encoder
     distribution_aware_bins=1000,  # that's the default value, but you can change it for finer data
+
+    # Add advanced numerical embedding
+    use_advanced_numerical_embedding=True,
+    embedding_dim=32,  # Match embedding size with categorical features
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=-3.0,
+    init_max=3.0,
+    dropout_rate=0.1,
+    use_batch_norm=True,
+
+    # Add global numerical embedding
+    use_global_numerical_embedding=True,
+    global_embedding_dim=32,  # Match embedding dimensions
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=-3.0,
+    global_init_max=3.0,
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average",
 )

 # Build the preprocessor

docs/example_usages.md

Lines changed: 63 additions & 0 deletions
@@ -362,3 +362,66 @@ feature_importances = ppr.get_feature_importances()
 ```
 Here is the plot of the model:
 ![Complex Model](imgs/numerical_example_model_with_distribution_aware.png)
+
+
+## Example 5: Numerical features with numerical embedding
+
+Numerical embedding is a technique that allows us to embed numerical features into a higher-dimensional space.
+This can be useful for capturing non-linear relationships within and between numerical features.
+
+```python
+from kdp.features import NumericalFeature, FeatureType
+from kdp.processor import PreprocessingModel, OutputModeOptions
+
+
+# Define features
+features = {
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+    ),
+
+    "custom_float": NumericalFeature(
+        name="custom_float",
+        feature_type=FeatureType.FLOAT,
+        preprocessors=[
+            tf.keras.layers.Rescaling,
+            tf.keras.layers.Normalization,
+            DistributionAwareEncoder,
+        ],
+    ),
+}
+
+# Now we can create a preprocessing model with the features
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+
+    # Add numerical embedding
+    # Use advanced numerical embedding for individual features
+    use_advanced_numerical_embedding=True,
+    # Use global numerical embedding for all features
+    use_global_numerical_embedding=True,
+
+    output_mode=OutputModeOptions.CONCAT,
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
+
+# Transform data using direct model prediction
+transformed_data = ppr.model.predict(test_batch)
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+```
+Here is the plot of the model:
+![Complex Model](imgs/numerical_example_model_with_advanced_numerical_embedding.png)

docs/imgs/complex_example.png

442 KB
181 KB

kdp/custom_layers.py

Lines changed: 154 additions & 18 deletions
@@ -1981,15 +1981,27 @@ class AdvancedNumericalEmbedding(layers.Layer):
 
     def __init__(
         self,
-        embedding_dim: int,
-        mlp_hidden_units: int,
-        num_bins: int,
-        init_min,
-        init_max,
-        dropout_rate: float = 0.0,
-        use_batch_norm: bool = False,
+        embedding_dim: int = 8,
+        mlp_hidden_units: int = 16,
+        num_bins: int = 10,
+        init_min: float | list[float] = -3.0,
+        init_max: float | list[float] = 3.0,
+        dropout_rate: float = 0.1,
+        use_batch_norm: bool = True,
         **kwargs,
     ):
+        """Initialize the AdvancedNumericalEmbedding layer.
+
+        Args:
+            embedding_dim: Dimension of the output embedding for each feature.
+            mlp_hidden_units: Number of hidden units in the MLP.
+            num_bins: Number of bins for discretization.
+            init_min: Minimum value(s) for initialization. Can be a single float or list of floats.
+            init_max: Maximum value(s) for initialization. Can be a single float or list of floats.
+            dropout_rate: Dropout rate for regularization.
+            use_batch_norm: Whether to use batch normalization.
+            **kwargs: Additional layer arguments.
+        """
         super().__init__(**kwargs)
         self.embedding_dim = embedding_dim
         self.mlp_hidden_units = mlp_hidden_units
@@ -2046,17 +2058,22 @@ def build(self, input_shape):
             init_min_tensor = tf.fill([self.num_features], init_min_tensor)
         if init_max_tensor.shape.ndims == 0:
             init_max_tensor = tf.fill([self.num_features], init_max_tensor)
-        # Convert tensors to numpy arrays, which are acceptable by tf.constant_initializer.
-        init_min_value = (
-            init_min_tensor.numpy()
-            if hasattr(init_min_tensor, "numpy")
-            else init_min_tensor
-        )
-        init_max_value = (
-            init_max_tensor.numpy()
-            if hasattr(init_max_tensor, "numpy")
-            else init_max_tensor
-        )
+
+        if tf.executing_eagerly():
+            init_min_value = init_min_tensor.numpy()
+            init_max_value = init_max_tensor.numpy()
+        else:
+            # Fallback: if not executing eagerly, force conversion to list
+            init_min_value = (
+                init_min_tensor.numpy().tolist()
+                if hasattr(init_min_tensor, "numpy")
+                else self.init_min
+            )
+            init_max_value = (
+                init_max_tensor.numpy().tolist()
+                if hasattr(init_max_tensor, "numpy")
+                else self.init_max
+            )
 
         self.learned_min = self.add_weight(
             name="learned_min",
@@ -2117,6 +2134,9 @@ def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         # Combine branches via a per-feature, per-dimension gate.
         gate = tf.nn.sigmoid(self.gate)  # (num_features, embedding_dim)
         output = gate * cont + (1 - gate) * disc  # (batch, num_features, embedding_dim)
+        # If only one feature is provided, squeeze the features axis.
+        if self.num_features == 1:
+            return tf.squeeze(output, axis=1)  # New shape: (batch, embedding_dim)
         return output
 
     def get_config(self):
@@ -2133,3 +2153,119 @@ def get_config(self):
             }
         )
         return config
+
+
+class GlobalAdvancedNumericalEmbedding(tf.keras.layers.Layer):
+    """
+    Global AdvancedNumericalEmbedding processes concatenated numeric features.
+    It applies an inner AdvancedNumericalEmbedding over the flattened input and then
+    performs global pooling (average or max) to produce a compact representation.
+    """
+
+    def __init__(
+        self,
+        global_embedding_dim: int = 8,
+        global_mlp_hidden_units: int = 16,
+        global_num_bins: int = 10,
+        global_init_min: float | list[float] = -3.0,
+        global_init_max: float | list[float] = 3.0,
+        global_dropout_rate: float = 0.1,
+        global_use_batch_norm: bool = True,
+        global_pooling: str = "average",
+        **kwargs,
+    ):
+        """Initialize the GlobalAdvancedNumericalEmbedding layer.
+
+        Args:
+            global_embedding_dim: Dimension of the final global embedding.
+            global_mlp_hidden_units: Number of hidden units in the global MLP.
+            global_num_bins: Number of bins for discretization.
+            global_init_min: Minimum value(s) for initialization. Can be a single float or list of floats.
+            global_init_max: Maximum value(s) for initialization. Can be a single float or list of floats.
+            global_dropout_rate: Dropout rate for regularization.
+            global_use_batch_norm: Whether to use batch normalization.
+            global_pooling: Pooling method to use ("average" or "max").
+            **kwargs: Additional layer arguments.
+        """
+        super().__init__(**kwargs)
+        self.global_embedding_dim = global_embedding_dim
+        self.global_mlp_hidden_units = global_mlp_hidden_units
+        self.global_num_bins = global_num_bins
+
+        # Ensure initializer parameters are Python scalars, lists, or numpy arrays.
+        if not isinstance(global_init_min, (list, tuple, np.ndarray)):
+            try:
+                global_init_min = float(global_init_min)
+            except Exception:
+                raise ValueError(
+                    "global_init_min must be a Python scalar, list, tuple or numpy array"
+                )
+        if not isinstance(global_init_max, (list, tuple, np.ndarray)):
+            try:
+                global_init_max = float(global_init_max)
+            except Exception:
+                raise ValueError(
+                    "global_init_max must be a Python scalar, list, tuple or numpy array"
+                )
+        self.global_init_min = global_init_min
+        self.global_init_max = global_init_max
+        self.global_dropout_rate = global_dropout_rate
+        self.global_use_batch_norm = global_use_batch_norm
+        self.global_pooling = global_pooling
+
+        # Use the existing advanced numerical embedding block
+        self.inner_embedding = AdvancedNumericalEmbedding(
+            embedding_dim=self.global_embedding_dim,
+            mlp_hidden_units=self.global_mlp_hidden_units,
+            num_bins=self.global_num_bins,
+            init_min=self.global_init_min,
+            init_max=self.global_init_max,
+            dropout_rate=self.global_dropout_rate,
+            use_batch_norm=self.global_use_batch_norm,
+            name="global_numeric_embedding",
+        )
+        if self.global_pooling == "average":
+            self.global_pooling_layer = tf.keras.layers.GlobalAveragePooling1D(
+                name="global_avg_pool"
+            )
+        elif self.global_pooling == "max":
+            self.global_pooling_layer = tf.keras.layers.GlobalMaxPooling1D(
+                name="global_max_pool"
+            )
+        else:
+            raise ValueError(f"Unsupported pooling method: {self.global_pooling}")
+
+    def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
+        """
+        Expects inputs with shape (batch, ...) and flattens them (except for the batch dim).
+        Then, the inner embedding produces a 3D output (batch, num_features, embedding_dim),
+        which is finally pooled to yield (batch, embedding_dim).
+        """
+        # If inputs have more than 2 dimensions, flatten them (except for batch dimension).
+        if len(inputs.shape) > 2:
+            inputs = tf.reshape(inputs, (tf.shape(inputs)[0], -1))
+        # Pass through the inner advanced embedding.
+        x_embedded = self.inner_embedding(inputs, training=training)
+        # Global pooling over numeric features axis.
+        x_pooled = self.global_pooling_layer(x_embedded)
+        return x_pooled
+
+    def compute_output_shape(self, input_shape):
+        # Regardless of the input shape, the output shape is (batch_size, embedding_dim)
+        return (input_shape[0], self.global_embedding_dim)
+
+    def get_config(self):
+        config = super().get_config()
+        config.update(
+            {
+                "global_embedding_dim": self.global_embedding_dim,
+                "global_mlp_hidden_units": self.global_mlp_hidden_units,
+                "global_num_bins": self.global_num_bins,
+                "global_init_min": self.global_init_min,
+                "global_init_max": self.global_init_max,
+                "global_dropout_rate": self.global_dropout_rate,
+                "global_use_batch_norm": self.global_use_batch_norm,
+                "global_pooling": self.global_pooling,
+            }
+        )
+        return config
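As a quick, illustrative sanity check (not part of this commit's diff), the sketch below round-trips the new layer through `get_config()` / `from_config()` and verifies the pooled output shape; it assumes the layer is importable from `kdp.custom_layers`, as shown in the documentation above.

```python
import tensorflow as tf

from kdp.custom_layers import GlobalAdvancedNumericalEmbedding

# Build a layer, serialize its config, and rebuild an equivalent layer from it.
layer = GlobalAdvancedNumericalEmbedding(global_embedding_dim=8, global_pooling="max")
restored = GlobalAdvancedNumericalEmbedding.from_config(layer.get_config())

# Five concatenated numeric features are pooled down to one 8-dim vector per row.
x = tf.random.normal((4, 5))
assert restored(x, training=False).shape == (4, 8)
```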
