Commit 5c3a974

feat(KDP): adding numerical embedding layers (#26)
2 parents 1d06b76 + 4181bb3 commit 5c3a974

10 files changed (+1032, -145 lines)

Lines changed: 95 additions & 0 deletions
@@ -0,0 +1,95 @@
# Advanced Numerical Embeddings in KDP

Keras Data Processor (KDP) now provides advanced numerical embedding techniques to better capture complex numerical relationships in your data. This release introduces two embedding approaches:

---

## AdvancedNumericalEmbedding

**Purpose:**
Processes individual numerical features with tailored embedding layers. This layer performs adaptive binning, applies MLP transformations per feature, and can incorporate dropout and batch normalization.

**Key Parameters:**
- **`embedding_dim`**: Dimension of each feature's embedding.
- **`mlp_hidden_units`**: Number of hidden units in the MLP applied to each feature.
- **`num_bins`**: Number of bins used for discretizing continuous inputs.
- **`init_min` and `init_max`**: Initialization boundaries for binning.
- **`dropout_rate`**: Dropout rate for regularization.
- **`use_batch_norm`**: Flag to apply batch normalization.

**Usage Example:**
```python
from kdp.custom_layers import AdvancedNumericalEmbedding
import tensorflow as tf

layer = AdvancedNumericalEmbedding(
    embedding_dim=8,
    mlp_hidden_units=16,
    num_bins=10,
    init_min=[-3.0, -2.0, -4.0],
    init_max=[3.0, 2.0, 4.0],
    dropout_rate=0.1,
    use_batch_norm=True,
)

# Input shape: (batch_size, num_features)
x = tf.random.normal((32, 3))
# Output shape: (32, 3, 8)
output = layer(x, training=False)
```
---

## GlobalAdvancedNumericalEmbedding

**Purpose:**
Combines a set of numerical features into a single, compact representation. It does so by applying an internal advanced numerical embedding on the concatenated input and then performing a global pooling over all features.

**Key Parameters (prefixed with `global_`):**
- **`global_embedding_dim`**: Global embedding dimension (final pooled vector size).
- **`global_mlp_hidden_units`**: Hidden units in the global MLP.
- **`global_num_bins`**: Number of bins for discretization.
- **`global_init_min` and `global_init_max`**: Global initialization boundaries.
- **`global_dropout_rate`**: Dropout rate.
- **`global_use_batch_norm`**: Whether to apply batch normalization.
- **`global_pooling`**: Pooling method to use ("average" or "max").

**Usage Example:**
```python
from kdp.custom_layers import GlobalAdvancedNumericalEmbedding
import tensorflow as tf

global_layer = GlobalAdvancedNumericalEmbedding(
    global_embedding_dim=8,
    global_mlp_hidden_units=16,
    global_num_bins=10,
    global_init_min=[-3.0, -2.0, -4.0],
    global_init_max=[3.0, 2.0, 4.0],
    global_dropout_rate=0.1,
    global_use_batch_norm=True,
    global_pooling="average",
)

# Input shape: (batch_size, num_features)
x = tf.random.normal((32, 3))
# Global output shape: (32, 8)
global_output = global_layer(x, training=False)
```
---

## When to Use Which?

- **AdvancedNumericalEmbedding:**
  Use this when you need to process each numerical feature individually, preserving its distinct characteristics via per-feature embeddings.

- **GlobalAdvancedNumericalEmbedding:**
  Choose this option when you want to merge multiple numerical features into a unified global embedding using a pooling mechanism. This is particularly useful when the overall interaction across features matters more than the individual feature details (see the shape comparison sketch below).
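For orientation, here is a minimal sketch (relying on the default arguments shown in the examples above) that contrasts the two output shapes on the same batch:

```python
import tensorflow as tf

from kdp.custom_layers import AdvancedNumericalEmbedding, GlobalAdvancedNumericalEmbedding

x = tf.random.normal((32, 3))  # batch of 32 rows, 3 numerical features

# Per-feature embeddings: one 8-dimensional vector per feature.
per_feature = AdvancedNumericalEmbedding(embedding_dim=8)(x, training=False)  # shape (32, 3, 8)

# A single pooled embedding summarizing all features together.
pooled = GlobalAdvancedNumericalEmbedding(global_embedding_dim=8)(x, training=False)  # shape (32, 8)
```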
## Advanced Configuration

Both layers offer additional parameters to fine-tune the embedding process. You can adjust dropout rates, batch normalization, and binning strategies to best suit your data. For more detailed information, please refer to the API documentation.
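As an illustration only, those knobs can be combined as follows; the values below are placeholders, not recommendations:

```python
from kdp.custom_layers import GlobalAdvancedNumericalEmbedding

# Illustrative tuning: finer binning, stronger dropout, max pooling.
tuned_layer = GlobalAdvancedNumericalEmbedding(
    global_embedding_dim=16,
    global_mlp_hidden_units=32,
    global_num_bins=20,           # finer discretization of the inputs
    global_dropout_rate=0.25,     # stronger regularization
    global_use_batch_norm=False,  # disable batch normalization
    global_pooling="max",         # keep the strongest activation per embedding dimension
)
```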
---

This document highlights the key differences and usage examples for the new advanced numerical embeddings available in KDP.

docs/complex_example.md

Lines changed: 21 additions & 0 deletions
@@ -123,6 +123,27 @@ ppr = PreprocessingModel(
     # Distribution aware configuration
     use_distribution_aware=True,  # here we activate the distribution aware encoder
     distribution_aware_bins=1000,  # that's the default value, but you can change it for finer data
+
+    # Add advanced numerical embedding
+    use_advanced_numerical_embedding=True,
+    embedding_dim=32,  # Match embedding size with categorical features
+    mlp_hidden_units=16,
+    num_bins=10,
+    init_min=-3.0,
+    init_max=3.0,
+    dropout_rate=0.1,
+    use_batch_norm=True,
+
+    # Add global numerical embedding
+    use_global_numerical_embedding=True,
+    global_embedding_dim=32,  # Match embedding dimensions
+    global_mlp_hidden_units=16,
+    global_num_bins=10,
+    global_init_min=-3.0,
+    global_init_max=3.0,
+    global_dropout_rate=0.1,
+    global_use_batch_norm=True,
+    global_pooling="average",
 )

 # Build the preprocessor

docs/example_usages.md

Lines changed: 63 additions & 0 deletions
@@ -362,3 +362,66 @@ feature_importances = ppr.get_feature_importances()
 ```
 Here is the plot of the model:
 ![Complex Model](imgs/numerical_example_model_with_distribution_aware.png)
+
+
+## Example 5: Numerical features with numerical embedding
+
+Numerical embedding is a technique that allows us to embed numerical features into a higher-dimensional space.
+This can be useful for capturing non-linear relationships within and between numerical features.
+
+```python
+from kdp.features import NumericalFeature, FeatureType
+from kdp.processor import PreprocessingModel, OutputModeOptions
+
+
+# Define features
+features = {
+    "basic_float": NumericalFeature(
+        name="basic_float",
+        feature_type=FeatureType.FLOAT,
+    ),
+
+    "rescaled_float": NumericalFeature(
+        name="rescaled_float",
+        feature_type=FeatureType.FLOAT_RESCALED,
+        scale=2.0,
+    ),
+
+    "custom_float": NumericalFeature(
+        name="custom_float",
+        feature_type=FeatureType.FLOAT,
+        preprocessors=[
+            tf.keras.layers.Rescaling,
+            tf.keras.layers.Normalization,
+            DistributionAwareEncoder,
+        ],
+    ),
+}
+
+# Now we can create a preprocessing model with the features
+ppr = PreprocessingModel(
+    path_data="sample_data.csv",
+    features_specs=features,
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,
+
+    # Add numerical embedding
+    # Use advanced numerical embedding for individual features
+    use_advanced_numerical_embedding=True,
+    # Use global numerical embedding for all features
+    use_global_numerical_embedding=True,
+
+    output_mode=OutputModeOptions.CONCAT,
+)
+
+# Build the preprocessor
+result = ppr.build_preprocessor()
+
+# Transform data using direct model prediction
+transformed_data = ppr.model.predict(test_batch)
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+```
+Here is the plot of the model:
+![Complex Model](imgs/numerical_example_model_with_advanced_numerical_embedding.png)

docs/imgs/complex_example.png

442 KB
181 KB

kdp/custom_layers.py

Lines changed: 154 additions & 18 deletions
@@ -1981,15 +1981,27 @@ class AdvancedNumericalEmbedding(layers.Layer):
 
     def __init__(
         self,
-        embedding_dim: int,
-        mlp_hidden_units: int,
-        num_bins: int,
-        init_min,
-        init_max,
-        dropout_rate: float = 0.0,
-        use_batch_norm: bool = False,
+        embedding_dim: int = 8,
+        mlp_hidden_units: int = 16,
+        num_bins: int = 10,
+        init_min: float | list[float] = -3.0,
+        init_max: float | list[float] = 3.0,
+        dropout_rate: float = 0.1,
+        use_batch_norm: bool = True,
         **kwargs,
     ):
+        """Initialize the AdvancedNumericalEmbedding layer.
+
+        Args:
+            embedding_dim: Dimension of the output embedding for each feature.
+            mlp_hidden_units: Number of hidden units in the MLP.
+            num_bins: Number of bins for discretization.
+            init_min: Minimum value(s) for initialization. Can be a single float or list of floats.
+            init_max: Maximum value(s) for initialization. Can be a single float or list of floats.
+            dropout_rate: Dropout rate for regularization.
+            use_batch_norm: Whether to use batch normalization.
+            **kwargs: Additional layer arguments.
+        """
         super().__init__(**kwargs)
         self.embedding_dim = embedding_dim
         self.mlp_hidden_units = mlp_hidden_units
@@ -2046,17 +2058,22 @@ def build(self, input_shape):
             init_min_tensor = tf.fill([self.num_features], init_min_tensor)
         if init_max_tensor.shape.ndims == 0:
             init_max_tensor = tf.fill([self.num_features], init_max_tensor)
-        # Convert tensors to numpy arrays, which are acceptable by tf.constant_initializer.
-        init_min_value = (
-            init_min_tensor.numpy()
-            if hasattr(init_min_tensor, "numpy")
-            else init_min_tensor
-        )
-        init_max_value = (
-            init_max_tensor.numpy()
-            if hasattr(init_max_tensor, "numpy")
-            else init_max_tensor
-        )
+
+        if tf.executing_eagerly():
+            init_min_value = init_min_tensor.numpy()
+            init_max_value = init_max_tensor.numpy()
+        else:
+            # Fallback: if not executing eagerly, force conversion to list
+            init_min_value = (
+                init_min_tensor.numpy().tolist()
+                if hasattr(init_min_tensor, "numpy")
+                else self.init_min
+            )
+            init_max_value = (
+                init_max_tensor.numpy().tolist()
+                if hasattr(init_max_tensor, "numpy")
+                else self.init_max
+            )
 
         self.learned_min = self.add_weight(
             name="learned_min",
@@ -2117,6 +2134,9 @@ def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
         # Combine branches via a per-feature, per-dimension gate.
         gate = tf.nn.sigmoid(self.gate)  # (num_features, embedding_dim)
         output = gate * cont + (1 - gate) * disc  # (batch, num_features, embedding_dim)
+        # If only one feature is provided, squeeze the features axis.
+        if self.num_features == 1:
+            return tf.squeeze(output, axis=1)  # New shape: (batch, embedding_dim)
         return output
 
     def get_config(self):
@@ -2133,3 +2153,119 @@ def get_config(self):
             }
         )
         return config
+
+
+class GlobalAdvancedNumericalEmbedding(tf.keras.layers.Layer):
+    """
+    Global AdvancedNumericalEmbedding processes concatenated numeric features.
+    It applies an inner AdvancedNumericalEmbedding over the flattened input and then
+    performs global pooling (average or max) to produce a compact representation.
+    """
+
+    def __init__(
+        self,
+        global_embedding_dim: int = 8,
+        global_mlp_hidden_units: int = 16,
+        global_num_bins: int = 10,
+        global_init_min: float | list[float] = -3.0,
+        global_init_max: float | list[float] = 3.0,
+        global_dropout_rate: float = 0.1,
+        global_use_batch_norm: bool = True,
+        global_pooling: str = "average",
+        **kwargs,
+    ):
+        """Initialize the GlobalAdvancedNumericalEmbedding layer.
+
+        Args:
+            global_embedding_dim: Dimension of the final global embedding.
+            global_mlp_hidden_units: Number of hidden units in the global MLP.
+            global_num_bins: Number of bins for discretization.
+            global_init_min: Minimum value(s) for initialization. Can be a single float or list of floats.
+            global_init_max: Maximum value(s) for initialization. Can be a single float or list of floats.
+            global_dropout_rate: Dropout rate for regularization.
+            global_use_batch_norm: Whether to use batch normalization.
+            global_pooling: Pooling method to use ("average" or "max").
+            **kwargs: Additional layer arguments.
+        """
+        super().__init__(**kwargs)
+        self.global_embedding_dim = global_embedding_dim
+        self.global_mlp_hidden_units = global_mlp_hidden_units
+        self.global_num_bins = global_num_bins
+
+        # Ensure initializer parameters are Python scalars, lists, or numpy arrays.
+        if not isinstance(global_init_min, (list, tuple, np.ndarray)):
+            try:
+                global_init_min = float(global_init_min)
+            except Exception:
+                raise ValueError(
+                    "global_init_min must be a Python scalar, list, tuple or numpy array"
+                )
+        if not isinstance(global_init_max, (list, tuple, np.ndarray)):
+            try:
+                global_init_max = float(global_init_max)
+            except Exception:
+                raise ValueError(
+                    "global_init_max must be a Python scalar, list, tuple or numpy array"
+                )
+        self.global_init_min = global_init_min
+        self.global_init_max = global_init_max
+        self.global_dropout_rate = global_dropout_rate
+        self.global_use_batch_norm = global_use_batch_norm
+        self.global_pooling = global_pooling
+
+        # Use the existing advanced numerical embedding block
+        self.inner_embedding = AdvancedNumericalEmbedding(
+            embedding_dim=self.global_embedding_dim,
+            mlp_hidden_units=self.global_mlp_hidden_units,
+            num_bins=self.global_num_bins,
+            init_min=self.global_init_min,
+            init_max=self.global_init_max,
+            dropout_rate=self.global_dropout_rate,
+            use_batch_norm=self.global_use_batch_norm,
+            name="global_numeric_embedding",
+        )
+        if self.global_pooling == "average":
+            self.global_pooling_layer = tf.keras.layers.GlobalAveragePooling1D(
+                name="global_avg_pool"
+            )
+        elif self.global_pooling == "max":
+            self.global_pooling_layer = tf.keras.layers.GlobalMaxPooling1D(
+                name="global_max_pool"
+            )
+        else:
+            raise ValueError(f"Unsupported pooling method: {self.global_pooling}")
+
+    def call(self, inputs: tf.Tensor, training: bool = False) -> tf.Tensor:
+        """
+        Expects inputs with shape (batch, ...) and flattens them (except for the batch dim).
+        Then, the inner embedding produces a 3D output (batch, num_features, embedding_dim),
+        which is finally pooled to yield (batch, embedding_dim).
+        """
+        # If inputs have more than 2 dimensions, flatten them (except for batch dimension).
+        if len(inputs.shape) > 2:
+            inputs = tf.reshape(inputs, (tf.shape(inputs)[0], -1))
+        # Pass through the inner advanced embedding.
+        x_embedded = self.inner_embedding(inputs, training=training)
+        # Global pooling over numeric features axis.
+        x_pooled = self.global_pooling_layer(x_embedded)
+        return x_pooled
+
+    def compute_output_shape(self, input_shape):
+        # Regardless of the input shape, the output shape is (batch_size, embedding_dim)
+        return (input_shape[0], self.global_embedding_dim)
+
+    def get_config(self):
+        config = super().get_config()
+        config.update(
+            {
+                "global_embedding_dim": self.global_embedding_dim,
+                "global_mlp_hidden_units": self.global_mlp_hidden_units,
+                "global_num_bins": self.global_num_bins,
+                "global_init_min": self.global_init_min,
+                "global_init_max": self.global_init_max,
+                "global_dropout_rate": self.global_dropout_rate,
+                "global_use_batch_norm": self.global_use_batch_norm,
+                "global_pooling": self.global_pooling,
+            }
+        )
+        return config
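As a quick, illustrative sanity check (not part of this commit's diff), the sketch below round-trips the new layer through `get_config()` / `from_config()` and verifies the pooled output shape; it assumes the layer is importable from `kdp.custom_layers`, as shown in the documentation above.

```python
import tensorflow as tf

from kdp.custom_layers import GlobalAdvancedNumericalEmbedding

# Build a layer, serialize its config, and rebuild an equivalent layer from it.
layer = GlobalAdvancedNumericalEmbedding(global_embedding_dim=8, global_pooling="max")
restored = GlobalAdvancedNumericalEmbedding.from_config(layer.get_config())

# Five concatenated numeric features are pooled down to one 8-dim vector per row.
x = tf.random.normal((4, 5))
assert restored(x, training=False).shape == (4, 8)
```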
