Commit 02a137a

docs(KDP): improving documentation
1 parent 7146afe commit 02a137a

4 files changed: +672 −15 lines

docs/advanced_numerical_embeddings.md

Lines changed: 193 additions & 1 deletion
@@ -88,8 +88,200 @@ global_output = global_layer(x, training=False)
## Advanced Configuration

Both layers offer additional parameters to fine-tune the embedding process. You can adjust dropout rates, batch normalization, and binning strategies to best suit your data. For more detailed information, please refer to the API documentation.

---

This document highlights the key differences and usage examples for the new advanced numerical embeddings available in KDP.

# 🌐 Global Numerical Embedding

## 📚 Overview

Global Numerical Embedding is a powerful technique for processing numerical features collectively rather than individually. It transforms batches of numerical features through a unified embedding approach, capturing relationships across the entire numerical feature space.

## 🔑 Key Benefits

- **Cross-Feature Learning**: Captures relationships between different numerical features
- **Unified Representation**: Creates a consistent embedding space for all numerical data
- **Dimensionality Control**: Transforms variable numbers of features into fixed-size embeddings
- **Performance Enhancement**: Typically improves performance on complex tabular datasets

## 💻 Usage

### Basic Configuration

Enable Global Numerical Embedding by setting the appropriate parameters in your `PreprocessingModel`:

```python
from kdp.processor import PreprocessingModel
from kdp.features import FeatureType

# Define features
features_specs = {
    "feature1": FeatureType.FLOAT_NORMALIZED,
    "feature2": FeatureType.FLOAT_NORMALIZED,
    "feature3": FeatureType.FLOAT_RESCALED,
    # more numerical features...
}

# Initialize with Global Numerical Embedding
preprocessor = PreprocessingModel(
    features_specs=features_specs,
    use_global_numerical_embedding=True,  # Enable the feature
    global_embedding_dim=16,              # Output dimension per feature
    global_pooling="average",             # Pooling strategy
)

# Build the model
result = preprocessor.build_preprocessor()
```

### Advanced Configuration

Fine-tune Global Numerical Embedding with additional parameters:

```python
preprocessor = PreprocessingModel(
    features_specs=features_specs,
    use_global_numerical_embedding=True,
    global_embedding_dim=32,       # Embedding dimension size
    global_mlp_hidden_units=64,    # Units in the MLP layer
    global_num_bins=20,            # Number of bins for discretization
    global_init_min=-3.0,          # Minimum initialization bound
    global_init_max=3.0,           # Maximum initialization bound
    global_dropout_rate=0.2,       # Dropout rate for regularization
    global_use_batch_norm=True,    # Whether to use batch normalization
    global_pooling="concat",       # Pooling strategy
)
```

## ⚙️ Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `global_embedding_dim` | int | 8 | Dimension of each feature embedding |
| `global_mlp_hidden_units` | int | 16 | Number of units in the MLP layer |
| `global_num_bins` | int | 10 | Number of bins for discretization |
| `global_init_min` | float | -3.0 | Minimum initialization bound |
| `global_init_max` | float | 3.0 | Maximum initialization bound |
| `global_dropout_rate` | float | 0.1 | Dropout rate for regularization |
| `global_use_batch_norm` | bool | True | Whether to use batch normalization |
| `global_pooling` | str | "average" | Pooling strategy ("average", "max", or "concat") |

## 🧩 Architecture

The Global Numerical Embedding layer processes numerical features through several steps:

1. **Normalization**: Numerical features are normalized to a standard range
2. **Binning**: Features are discretized into bins
3. **Embedding**: Each bin is mapped to a learned embedding vector
4. **MLP Processing**: A small MLP network processes each embedded feature
5. **Pooling**: Features are aggregated using the specified pooling strategy
6. **Output**: A fixed-size embedding representing all numerical features

```
┌─────────────┐     ┌───────────┐     ┌────────────┐     ┌─────────┐
│  Numerical  │     │ Discretize│     │ Embedding  │     │   MLP   │
│  Features   │────▶│  to Bins  │────▶│   Lookup   │────▶│ Network │
└─────────────┘     └───────────┘     └────────────┘     └────┬────┘
                                                              │
                                                              ▼
┌─────────────┐     ┌───────────┐                        ┌─────────┐
│   Output    │     │  Pooling  │                        │ Feature │
│  Embedding  │◀────│ Operation │◀───────────────────────│ Vectors │
└─────────────┘     └───────────┘                        └─────────┘
```

## 🧠 Pooling Strategies

The `global_pooling` parameter controls how feature embeddings are combined:

- **"average"**: Compute the mean across all feature embeddings (default)
201+
- **"max"**: Take the maximum value for each dimension across all embeddings
202+
- **"concat"**: Concatenate all feature embeddings (increases output dimension)
203+
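
As a rough illustration, here is a minimal sketch, assuming the embedded features arrive as a tensor of shape `(batch, num_features, embedding_dim)`, of how the three strategies reduce to plain TensorFlow operations:

```python
import tensorflow as tf

# Hypothetical embedded features: batch of 4 rows, 10 features, 16-dim embeddings
x = tf.random.normal([4, 10, 16])

avg = tf.reduce_mean(x, axis=1)             # "average" -> shape (4, 16)
mx = tf.reduce_max(x, axis=1)               # "max"     -> shape (4, 16)
cat = tf.reshape(x, [tf.shape(x)[0], -1])   # "concat"  -> shape (4, 160)
```

With "concat" the output dimension grows with the number of features (here 10 × 16 = 160), which is worth keeping in mind when sizing downstream layers.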

## 🚀 When to Use

Global Numerical Embedding is particularly effective when:

- Your dataset has many numerical features (>5)
- Features have complex relationships with each other
- You want to reduce the dimensionality of your numerical features
- You're working with tabular data where feature interactions matter

## 📊 Comparison with Individual Processing

| Aspect | Global Embedding | Individual Processing |
|--------|------------------|----------------------|
| Feature Interactions | Captures cross-feature relationships | Processes each feature independently |
| Output Dimension | Fixed size with "average"/"max" pooling ("concat" grows with feature count) | Scales with number of features |
| Parameter Efficiency | Shares parameters across features | Separate parameters per feature |
| Computational Cost | Higher overhead for few features, more efficient for many | Linear in feature count |
| Model Performance | Often better for complex datasets | Simpler, may miss interactions |
223+
## πŸ” Implementation Details
224+
225+
The Global Numerical Embedding implementation is based on the `GlobalNumericalEmbedding` layer:
226+
227+
```python
import numpy as np
import tensorflow as tf

# Sample internal implementation (simplified sketch, not the exact KDP source)
class GlobalNumericalEmbedding(tf.keras.layers.Layer):
    def __init__(self, global_embedding_dim=8, global_num_bins=10,
                 global_init_min=-3.0, global_init_max=3.0,
                 global_pooling="average", **kwargs):
        super().__init__(**kwargs)
        self.embedding_dim = global_embedding_dim
        self.num_bins = global_num_bins
        self.pooling = global_pooling
        # Fixed bin edges spanning the assumed (standardized) input range
        edges = np.linspace(global_init_min, global_init_max, global_num_bins - 1)
        self.discretize = tf.keras.layers.Discretization(bin_boundaries=edges.tolist())

    def build(self, input_shape):
        # Shared embedding table: one vector per bin
        self.embedding = tf.keras.layers.Embedding(self.num_bins, self.embedding_dim)
        # Small MLP applied to each embedded feature
        self.mlp = tf.keras.layers.Dense(self.embedding_dim, activation="relu")

    def call(self, inputs):
        bins = self.discretize(inputs)       # 1. Discretize: (batch, features) ints
        embedded = self.embedding(bins)      # 2. Lookup: (batch, features, dim)
        hidden = self.mlp(embedded)          # 3. MLP on each embedded feature
        if self.pooling == "average":        # 4./5. Pool across the feature axis
            return tf.reduce_mean(hidden, axis=1)
        if self.pooling == "max":
            return tf.reduce_max(hidden, axis=1)
        # "concat": flatten to (batch, features * dim)
        return tf.reshape(hidden, [tf.shape(hidden)[0], -1])
```
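
Under the same assumptions, a quick shape check of the sketched layer:

```python
layer = GlobalNumericalEmbedding(global_embedding_dim=8, global_pooling="average")
outputs = layer(tf.random.normal([32, 5]))  # 32 rows, 5 numerical features
print(outputs.shape)  # (32, 8); "concat" pooling would yield (32, 40)
```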

## 📝 Examples

### Basic Example

```python
# Simple example with default parameters
preprocessor = PreprocessingModel(
    features_specs={"feature1": FeatureType.FLOAT, "feature2": FeatureType.FLOAT},
    use_global_numerical_embedding=True,
)
```

### Advanced Example with Custom Pooling

```python
# Using concatenation pooling for maximum information preservation
preprocessor = PreprocessingModel(
    features_specs=features_specs,
    use_global_numerical_embedding=True,
    global_embedding_dim=16,
    global_pooling="concat",   # Will concatenate all feature embeddings
    global_dropout_rate=0.2,   # Increased regularization
)
```
271+
272+
### Combined with Other Advanced Features
273+
274+
```python
# Combining Global Numerical Embedding with other advanced features
preprocessor = PreprocessingModel(
    features_specs=features_specs,
    # Global Numerical Embedding
    use_global_numerical_embedding=True,
    global_embedding_dim=16,
    # Distribution-Aware Encoding
    use_distribution_aware=True,
    # Tabular Attention
    tabular_attention=True,
    tabular_attention_placement="MULTI_RESOLUTION",
)
```

docs/index.md

Lines changed: 63 additions & 0 deletions
@@ -20,6 +20,12 @@ Say goodbye to tedious data preparation tasks and hello to streamlined, efficien
- 🧠 **Enhanced with Transformer Blocks**: Incorporate transformer blocks into your preprocessing model to boost feature interaction analysis and uncover complex patterns, enhancing predictive model accuracy.

- 📈 **Distribution-Aware Encoding**: Automatically detect underlying data distributions and apply specialized transformations to preserve statistical properties and improve model performance.

- 🌐 **Global Numerical Embedding**: Transform batches of numerical features with a unified embedding approach, capturing relationships across the entire feature space and enhancing model performance on tabular data.

- 👁️ **Tabular Attention Mechanisms**: Implement powerful attention-based learning on tabular data with standard and multi-resolution approaches to capture complex feature interactions.

- ⚙️ **Easy Integration**: Designed to seamlessly integrate as the first layers in your TensorFlow Keras models, facilitating a smooth transition from raw data to trained model, accelerating your workflow significantly.

## 🚀 Getting started:
@@ -141,6 +147,63 @@ ppr = PreprocessingModel(
)
```


### 🌐 Global Numerical Embedding

The Global Numerical Embedding layer offers a powerful way to process numerical features collectively, capturing relationships across your entire numerical feature space. This is particularly useful for tabular data with many numerical columns.

- **Unified Embedding**: Process all numerical features together through a shared embedding space
- **Advanced Pooling**: Aggregate information across features with various pooling strategies
- **Adaptive Binning**: Intelligently discretize continuous values for more effective embedding

Example configuration:

```python
numerical_embedding_config = {
    'use_global_numerical_embedding': True,
    'global_embedding_dim': 16,      # Embedding dimension size
    'global_mlp_hidden_units': 32,   # Units in the MLP layer
    'global_num_bins': 15,           # Number of bins for discretization
    'global_init_min': -2.0,         # Minimum initialization bound
    'global_init_max': 2.0,          # Maximum initialization bound
    'global_dropout_rate': 0.1,      # Dropout rate for regularization
    'global_use_batch_norm': True,   # Whether to use batch normalization
    'global_pooling': 'average',     # Pooling strategy ('average', 'max', or 'concat')
}

ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_spec,
    **numerical_embedding_config,
)
```
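
As in the earlier examples, the configured preprocessor can then be built, assuming the same `build_preprocessor` entry point shown above:

```python
result = ppr.build_preprocessor()
```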

### 👁️ Tabular Attention Configuration

Leverage attention mechanisms specifically designed for tabular data to capture complex feature interactions. See 👀 [Tabular Attention](tabular_attention.md) for detailed information.

- **Standard Attention**: Apply uniform attention across all features
- **Multi-Resolution Attention**: Use different attention mechanisms for numerical and categorical data
- **Placement Options**: Control where attention is applied in your feature space

Example configuration:

```python
attention_config = {
    'tabular_attention': True,
    'tabular_attention_heads': 4,                   # Number of attention heads
    'tabular_attention_dim': 64,                    # Attention dimension
    'tabular_attention_dropout': 0.1,               # Dropout rate
    'tabular_attention_placement': 'ALL_FEATURES',  # Where to apply attention
    'tabular_attention_embedding_dim': 32,          # For multi-resolution attention
}

ppr = PreprocessingModel(
    path_data="data/my_data.csv",
    features_specs=features_spec,
    **attention_config,
)
```

### 🏗️ Custom Preprocessors

Tailor your preprocessing steps with custom preprocessors for each feature type. Define specific preprocessing logic that fits your data characteristics or domain-specific requirements, see 👀 [Custom Preprocessors](features.md#🚀-custom-preprocessing-steps).
