Commit 462bfc1

feat(KDP): adding feature selection mechanism to the preprocessor (docs, tests) (#19)
2 parents 4b1c510 + 23b36ce commit 462bfc1

9 files changed: +1295 −18 lines changed

.gitignore

Lines changed: 1 addition & 1 deletion
```diff
@@ -166,4 +166,4 @@ kdp/data/fake_data.csv
 my_tests/*

 # derivative files
-data.csv
+*.csv
```

docs/complex_example.md

Lines changed: 19 additions & 16 deletions
```diff
@@ -87,12 +87,15 @@ df = pd.DataFrame({
     ] * 20
 })

-# Save to CSV
+# Format data
 df.to_csv("sample_data.csv", index=False)
+test_batch = tf.data.Dataset.from_tensor_slices(dict(df.head(3))).batch(3)

 # Create preprocessor with both transformer blocks and attention
 ppr = PreprocessingModel(
     path_data="sample_data.csv",
+    features_stats_path="features_stats.json",
+    overwrite_stats=True,  # Force stats generation; recommended to set to True
     features_specs=features,
     output_mode=OutputModeOptions.CONCAT,
```

````diff
@@ -111,32 +114,32 @@ ppr = PreprocessingModel(
     tabular_attention_dropout=0.1,  # Attention dropout rate
     tabular_attention_embedding_dim=16,  # Embedding dimension

-    # Other parameters
-    overwrite_stats=True,  # Force stats generation; recommended to set to True
+    # Feature selection configuration
+    feature_selection_placement="all_features",  # Choose between (all_features|numeric|categorical)
+    feature_selection_units=32,
+    feature_selection_dropout=0.15,
 )

 # Build the preprocessor
 result = ppr.build_preprocessor()
 ```

-Now if one wants to plot, use the Neural Network for predictions or just get the statistics, use the following:
+Now if one wants to plot a block diagram of the model, get the output of the NN, or get the feature importance weights, use the following:

 ```python
 # Plot the model architecture
 ppr.plot_model("complex_model.png")

-# Get predictions with an example test batch from the example data
-test_batch = tf.data.Dataset.from_tensor_slices(dict(df.head(3))).batch(3)
-predictions = result["model"].predict(test_batch)
-print("Output shape:", predictions.shape)
-
-# Print feature statistics
-print("\nFeature Statistics:")
-for feature_type, features in ppr.get_feature_statistics().items():
-    if isinstance(features, dict):
-        print(f"\n{feature_type}:")
-        for feature_name, stats in features.items():
-            print(f"  {feature_name}: {list(stats.keys())}")
+# Transform data using direct model prediction
+transformed_data = ppr.model.predict(test_batch)
+
+# Transform data using batch_predict
+transformed_data = ppr.batch_predict(test_batch)
+transformed_batches = list(transformed_data)  # For better visualization
+
+# Get feature importances
+feature_importances = ppr.get_feature_importances()
+print("Feature importances:", feature_importances)
 ```
````

docs/feature_selection.md

Lines changed: 170 additions & 0 deletions
# Feature Selection in Keras Data Processor

The Keras Data Processor includes a sophisticated feature selection mechanism based on the Gated Residual Variable Selection Network (GRVSN) architecture. This document explains the components, usage, and benefits of this feature.

## Overview

The feature selection mechanism uses a combination of gated units and residual networks to automatically learn the importance of different features in your data. It can be applied to numeric features, categorical features, or both.

## Components

### 1. GatedLinearUnit

The `GatedLinearUnit` is the basic building block; it implements a gated activation function:

```python
import tensorflow as tf
# GatedLinearUnit ships with KDP (the exact import path depends on your KDP version)

gl = GatedLinearUnit(units=64)
x = tf.random.normal((32, 100))
y = gl(x)
```

Key features:

- Applies a linear transformation followed by a sigmoid gate
- Selectively filters input data based on learned weights
- Helps control information flow through the network
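To make the gating concrete, here is a minimal sketch of what such a layer computes. `SimpleGLU` is a hypothetical illustration of the technique, not KDP's actual class:

```python
import tensorflow as tf

class SimpleGLU(tf.keras.layers.Layer):
    """Illustrative gated linear unit: output = linear(x) * sigmoid(gate(x))."""

    def __init__(self, units: int, **kwargs):
        super().__init__(**kwargs)
        self.linear = tf.keras.layers.Dense(units)
        self.gate = tf.keras.layers.Dense(units, activation="sigmoid")

    def call(self, inputs):
        # The sigmoid gate scales each unit of the linear projection into [0, 1],
        # so the layer learns how much of each signal to let through.
        return self.linear(inputs) * self.gate(inputs)
```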
### 2. GatedResidualNetwork

The `GatedResidualNetwork` combines gated linear units with residual connections:

```python
grn = GatedResidualNetwork(units=64, dropout_rate=0.2)
x = tf.random.normal((32, 100))
y = grn(x)
```

Key features:

- Uses ELU activation for non-linearity
- Includes dropout for regularization
- Adds residual connections to help with gradient flow
- Applies layer normalization for stability
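Putting those four ingredients together, a gated residual network can be sketched roughly as follows. `SimpleGRN` builds on the `SimpleGLU` sketch above and is an illustration, not KDP's exact wiring:

```python
class SimpleGRN(tf.keras.layers.Layer):
    """Illustrative gated residual network: ELU -> dense -> dropout -> gate,
    added to a projected residual and layer-normalized."""

    def __init__(self, units: int, dropout_rate: float = 0.2, **kwargs):
        super().__init__(**kwargs)
        self.elu_dense = tf.keras.layers.Dense(units, activation="elu")
        self.dense = tf.keras.layers.Dense(units)
        self.dropout = tf.keras.layers.Dropout(dropout_rate)
        self.gate = SimpleGLU(units)  # from the previous sketch
        self.project = tf.keras.layers.Dense(units)  # match widths for the residual
        self.norm = tf.keras.layers.LayerNormalization()

    def call(self, inputs, training=False):
        x = self.elu_dense(inputs)
        x = self.dense(x)
        x = self.dropout(x, training=training)
        # Residual connection: project the input so its width matches `units`.
        residual = self.project(inputs)
        return self.norm(residual + self.gate(x))
```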
### 3. VariableSelection

The `VariableSelection` layer is the main feature selection component:

```python
vs = VariableSelection(nr_features=3, units=64, dropout_rate=0.2)
x1 = tf.random.normal((32, 100))
x2 = tf.random.normal((32, 200))
x3 = tf.random.normal((32, 300))
selected_features, weights = vs([x1, x2, x3])
```

Key features:

- Processes each feature independently using GRNs
- Calculates feature importance weights using softmax
- Returns both the selected features and their weights
- Supports different input dimensions for each feature
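The mechanism can be sketched like this: each feature passes through its own GRN, while a shared GRN over the concatenated inputs feeds a softmax that produces the importance weights. `SimpleVariableSelection` is a hypothetical illustration consistent with the list above, built on the earlier sketches:

```python
class SimpleVariableSelection(tf.keras.layers.Layer):
    """Illustrative variable selection: per-feature GRNs plus a softmax over features."""

    def __init__(self, nr_features: int, units: int, dropout_rate: float = 0.2, **kwargs):
        super().__init__(**kwargs)
        self.feature_grns = [SimpleGRN(units, dropout_rate) for _ in range(nr_features)]
        self.weight_grn = SimpleGRN(units, dropout_rate)
        self.softmax = tf.keras.layers.Dense(nr_features, activation="softmax")

    def call(self, inputs, training=False):
        # Project every feature (possibly of different widths) to `units` dimensions.
        processed = [grn(x, training=training) for grn, x in zip(self.feature_grns, inputs)]
        stacked = tf.stack(processed, axis=1)  # (batch, nr_features, units)

        # Importance weights computed from the concatenated raw inputs.
        flat = tf.concat(inputs, axis=-1)
        weights = self.weight_grn(flat, training=training)
        weights = self.softmax(weights)  # (batch, nr_features), rows sum to 1

        # Weighted sum over features: one fused representation plus the weights.
        selected = tf.reduce_sum(stacked * weights[:, :, None], axis=1)
        return selected, weights
```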
## Usage in Preprocessing Model

### Configuration

Configure feature selection in your preprocessing model:

```python
model = PreprocessingModel(
    # ... other parameters ...
    feature_selection_placement="all_features",  # or "numeric" or "categorical"
    feature_selection_units=64,
    feature_selection_dropout=0.2
)
```
### Placement Options

The `FeatureSelectionPlacementOptions` enum provides several options for where to apply feature selection:

1. `NONE`: Disable feature selection
2. `NUMERIC`: Apply only to numeric features
3. `CATEGORICAL`: Apply only to categorical features
4. `ALL_FEATURES`: Apply to all features
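The string values used above map to these enum members. Assuming the enum is exposed by `kdp.processor` (an assumption here, not confirmed by this commit), it can be passed directly:

```python
# Assumed import location; the string form ("all_features") works the same way.
from kdp.processor import FeatureSelectionPlacementOptions, PreprocessingModel

model = PreprocessingModel(
    # ... other parameters ...
    feature_selection_placement=FeatureSelectionPlacementOptions.ALL_FEATURES,
)
```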
### Accessing Feature Weights

After processing, feature weights are available in the `processed_features` dictionary:

```python
# Process your data
processed = model.transform(data)

# Access feature weights
numeric_weights = processed["numeric_feature_weights"]
categorical_weights = processed["categorical_feature_weights"]
```
## Benefits

1. **Automatic Feature Selection**: The model learns which features are most important for your task.
2. **Interpretability**: Feature weights provide insight into feature importance.
3. **Improved Performance**: By focusing on relevant features, the model can achieve better performance.
4. **Regularization**: Dropout and residual connections help prevent overfitting.
5. **Flexibility**: Can be applied to different feature types and combinations.
## Integration with Other Features

The feature selection mechanism integrates seamlessly with other preprocessing components:

1. **Transformer Blocks**: Can be used before or after transformer blocks
2. **Tabular Attention**: Complements tabular attention by focusing on important features
3. **Custom Preprocessors**: Works with any custom preprocessing steps
## Example

Here's a complete example of using feature selection:

```python
from kdp.processor import PreprocessingModel
from kdp.features import NumericalFeature, CategoricalFeature, FeatureType

# Define features
features = {
    "numeric_1": NumericalFeature(
        name="numeric_1",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "numeric_2": NumericalFeature(
        name="numeric_2",
        feature_type=FeatureType.FLOAT_NORMALIZED
    ),
    "category_1": CategoricalFeature(
        name="category_1",
        feature_type=FeatureType.STRING_CATEGORICAL
    )
}

# Create model with feature selection
model = PreprocessingModel(
    # ... other parameters ...
    features_specs=features,
    feature_selection_placement="all_features",  # or "numeric" or "categorical"
    feature_selection_units=64,
    feature_selection_dropout=0.2
)

# Build and use the model
preprocessor = model.build_preprocessor()
processed_data = model.transform(data)  # data can be a pd.DataFrame, a Python dict, or a tf.data.Dataset

# Analyze feature importance
for feature_name in features:
    weights = processed_data[f"{feature_name}_weights"]
    print(f"Feature {feature_name} importance: {weights.mean()}")
```
## Testing

The feature selection components include comprehensive unit tests that verify:

1. Output shapes and types
2. Gating mechanism behavior
3. Residual connections
4. Dropout behavior
5. Feature weight properties
6. Serialization/deserialization

Run the tests using:

```bash
python -m pytest test/test_feature_selection.py -v
```
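For a flavor of what such a test checks, here is a minimal sketch of a property test written against the illustrative `SimpleVariableSelection` sketch above, not KDP's internal test suite:

```python
import tensorflow as tf

def test_variable_selection_weights_sum_to_one():
    # Three features with different widths, as in the VariableSelection example.
    vs = SimpleVariableSelection(nr_features=3, units=64)
    inputs = [tf.random.normal((8, d)) for d in (100, 200, 300)]
    selected, weights = vs(inputs)

    # Output shapes: one fused representation plus one weight per feature.
    assert selected.shape == (8, 64)
    assert weights.shape == (8, 3)

    # Softmax weights are non-negative and sum to 1 for each example.
    tf.debugging.assert_non_negative(weights)
    tf.debugging.assert_near(tf.reduce_sum(weights, axis=-1), tf.ones(8))
```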

docs/imgs/complex_model.png

35 KB