Commit 7b76a99

refactor(KDP): improving auto configuration functionality and UX
1 parent 1916c40 commit 7b76a99

4 files changed: +499 -337 lines changed

docs/auto_configuration.md

Lines changed: 25 additions & 49 deletions
@@ -11,69 +11,45 @@ The automatic model configuration system leverages statistical analysis to:
 3. **Optimize global settings** - Recommends global parameters for improved model performance
 4. **Generate code** - Provides ready-to-use Python code implementing the recommendations

-## 🛠️ How It Works
-
-The system works in two main phases:
-
-### 1. Statistics Collection
-
-First, the `DatasetStatistics` class analyzes your dataset to compute various statistical properties:
-
-- **Numerical features**: Mean, variance, distribution shape metrics (estimated skewness/kurtosis)
-- **Categorical features**: Vocabulary size, cardinality, unique values
-- **Text features**: Vocabulary statistics, average sequence length
-- **Date features**: Cyclical patterns, temporal variance
-
-### 2. Configuration Recommendation
-
-Then, the `ModelAdvisor` analyzes these statistics to recommend:
-
-- **Feature-specific transformations**: Based on the detected distribution type
-- **Advanced encoding options**: Such as distribution-aware encoding for complex distributions
-- **Attention mechanisms**: Tabular attention or multi-resolution attention when appropriate
-- **Global parameters**: Overall architecture suggestions based on the feature mix
-
 ## 🚀 Using the Configuration Advisor

-### Method 1: Using the Python API
+The simplest way to use the automatic configuration system is through the `auto_configure` function:

```python
-from kdp.stats import DatasetStatistics
-from kdp.processor import PreprocessingModel
+from kdp import auto_configure

-# Initialize statistics calculator
-stats_calculator = DatasetStatistics(
-    path_data="data/my_dataset.csv",
-    features_specs=features_specs # Optional, will be inferred if not provided
-)
+# Analyze your dataset and get recommendations
+config = auto_configure("data/my_dataset.csv")

-# Calculate statistics
-stats = stats_calculator.main()
+# Get the ready-to-use code snippet
+print(config["code_snippet"])

-# Generate recommendations
-recommendations = stats_calculator.recommend_model_configuration()
+# Get feature-specific recommendations
+print(config["recommendations"])

-# Use the recommendations to build a model
-# You can directly use the generated code snippet or access specific recommendations
-print(recommendations["code_snippet"])
+# Get computed statistics (if save_stats=True)
+print(config["statistics"])
```

-### Method 2: Using the Command-Line Tool
+### Advanced Usage

-KDP provides a command-line tool to analyze datasets and generate recommendations:
+You can customize the analysis with additional parameters:

```bash
-python scripts/analyze_dataset.py --data path/to/data.csv --output recommendations.json
```python
+config = auto_configure(
+    data_path="data/my_dataset.csv",
+    features_specs={
+        "age": "NumericalFeature",
+        "category": "CategoricalFeature",
+        "text": "TextFeature"
+    },
+    batch_size=100_000,
+    save_stats=True,
+    stats_path="my_stats.json",
+    overwrite_stats=False
+)
```

-Options:
-- `--data`, `-d`: Path to CSV data file or directory (required)
-- `--output`, `-o`: Path to save recommendations (default: recommendations.json)
-- `--stats`, `-s`: Path to save/load feature statistics (default: features_stats.json)
-- `--batch-size`, `-b`: Batch size for processing (default: 50000)
-- `--overwrite`, `-w`: Overwrite existing statistics file
-- `--feature-types`, `-f`: JSON file specifying feature types (optional)
-
 ## 🔮 Distribution Detection

 The system can detect and recommend specific configurations for various distribution types:
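
The updated docs note that `config["statistics"]` is only available when `save_stats=True`, so code that consumes the result may want to treat that key as optional. The following is a minimal sketch of such a consumer. `summarize_config` is a hypothetical helper that is not part of KDP, and the contents of the `example` dict are invented for illustration (the real structure of the recommendations is not shown in this commit).

```python
from typing import Any, Dict


def summarize_config(config: Dict[str, Any]) -> str:
    """Build a short text report from an auto_configure-style result dict."""
    lines = []

    # The diff shows "recommendations" and "code_snippet" are always present.
    recommendations = config.get("recommendations", {})
    lines.append(f"recommendation entries: {len(recommendations)}")

    # "statistics" is only added when save_stats=True, so read it with .get().
    stats = config.get("statistics")
    status = "included" if stats is not None else "not included"
    lines.append(f"statistics: {status}")

    snippet = config.get("code_snippet", "")
    lines.append(f"code snippet: {len(snippet.splitlines())} line(s)")
    return "\n".join(lines)


# Hand-written stand-in for a real result; the inner structure is an assumption.
example = {
    "recommendations": {"features": {"age": {}}, "global_config": {}},
    "code_snippet": "line one\nline two",
}
print(summarize_config(example))
```

Reading the optional key with `.get()` avoids a `KeyError` when the caller passed `save_stats=False`.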

kdp/__init__.py

Lines changed: 2 additions & 0 deletions
@@ -16,6 +16,7 @@
     TransformerBlockPlacementOptions,
 )
 from kdp.stats import DatasetStatistics
+from kdp.auto_config import auto_configure

 __all__ = [
     "ProcessingStep",
@@ -33,4 +34,5 @@
     "TransformerBlockPlacementOptions",
     "OutputModeOptions",
     "TabularAttentionPlacementOptions",
+    "auto_configure",
 ]
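
The change above both imports `auto_configure` at the package root and lists it in `__all__`. As a rough illustration of what the `__all__` entry controls, the sketch below builds a throwaway in-memory module and runs a star import against it. The module name `demo_pkg` and both functions are placeholders invented for this example; nothing here touches the real `kdp` package.

```python
import sys
import types

# Throwaway module standing in for a package root (hypothetical name).
demo = types.ModuleType("demo_pkg")


def auto_configure(data_path):
    """Placeholder for the real function; returns its argument unchanged."""
    return data_path


def internal_helper():
    """A name that is defined in the module but not listed in __all__."""
    return "not exported"


demo.auto_configure = auto_configure
demo.internal_helper = internal_helper
demo.__all__ = ["auto_configure"]  # mirrors the entry added in this commit
sys.modules["demo_pkg"] = demo

# A star import copies only the names listed in __all__.
namespace = {}
exec("from demo_pkg import *", namespace)

exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

The explicit `from kdp import auto_configure` used in the docs depends only on the new import line; the `__all__` entry matters for star imports and for tools that read `__all__` as the public API.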

kdp/auto_config.py

Lines changed: 92 additions & 0 deletions
@@ -0,0 +1,92 @@
+"""
+Automatic model configuration module that provides a simple interface for
+analyzing datasets and generating optimal preprocessing configurations.
+"""
+
+from pathlib import Path
+from typing import Dict, Any, Optional, Union
+
+from loguru import logger
+
+from kdp.stats import DatasetStatistics
+from kdp.model_advisor import ModelAdvisor
+
+
+def auto_configure(
+    data_path: Union[str, Path],
+    features_specs: Optional[Dict[str, Any]] = None,
+    batch_size: int = 50_000,
+    save_stats: bool = True,
+    stats_path: Optional[Union[str, Path]] = None,
+    overwrite_stats: bool = False,
+) -> Dict[str, Any]:
+    """
+    Automatically analyze a dataset and generate optimal preprocessing configurations.
+
+    This is a high-level function that handles all the complexity of analyzing your dataset
+    and recommending the best preprocessing strategies. It will:
+    1. Calculate comprehensive statistics about your features
+    2. Analyze the distributions and characteristics of each feature
+    3. Generate specific recommendations for preprocessing each feature
+    4. Provide global configuration recommendations
+    5. Generate ready-to-use code implementing the recommendations
+
+    Args:
+        data_path: Path to your dataset (CSV file or directory of CSVs)
+        features_specs: Optional dictionary specifying feature types and configurations
+        batch_size: Batch size for processing large datasets (default: 50000)
+        save_stats: Whether to save the computed statistics (default: True)
+        stats_path: Optional path to save/load statistics (default: features_stats.json)
+        overwrite_stats: Whether to overwrite existing statistics file (default: False)
+
+    Returns:
+        Dictionary containing:
+        - feature-specific recommendations
+        - global configuration recommendations
+        - ready-to-use code snippet
+        - computed statistics (if save_stats=True)
+
+    Example:
+        >>> config = auto_configure("data/my_dataset.csv")
+        >>> print(config["code_snippet"]) # Get ready-to-use code
+        >>> print(config["recommendations"]) # Get feature-specific recommendations
+    """
+    # Convert paths to Path objects
+    data_path = Path(data_path)
+    if stats_path is None:
+        stats_path = Path("features_stats.json")
+    else:
+        stats_path = Path(stats_path)
+
+    # Initialize statistics calculator
+    stats_calculator = DatasetStatistics(
+        path_data=str(data_path),
+        features_specs=features_specs,
+        features_stats_path=stats_path,
+        overwrite_stats=overwrite_stats,
+        batch_size=batch_size,
+    )
+
+    # Calculate statistics
+    logger.info("Calculating dataset statistics...")
+    stats = stats_calculator.main()
+
+    # Generate recommendations
+    logger.info("Generating preprocessing recommendations...")
+    advisor = ModelAdvisor(stats)
+    recommendations = advisor.analyze_feature_stats()
+
+    # Generate code snippet
+    logger.info("Generating code snippet...")
+    code_snippet = advisor.generate_code_snippet()
+
+    # Prepare output
+    output = {
+        "recommendations": recommendations,
+        "code_snippet": code_snippet,
+    }
+
+    if save_stats:
+        output["statistics"] = stats
+
+    return output
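
To make the control flow of the new function easier to follow without installing KDP, here is a minimal sketch of the same orchestration pattern using stand-in classes. `StubStatistics`, `StubAdvisor`, `resolve_stats_path`, and `auto_configure_sketch` are all hypothetical names, and the canned values they return are invented. One thing the sketch makes visible: in the diff as shown, `save_stats` is not passed to `DatasetStatistics`; it only decides whether the `"statistics"` key appears in the returned dictionary.

```python
from pathlib import Path
from typing import Any, Dict, Optional, Union


class StubStatistics:
    """Stand-in for DatasetStatistics; returns canned values, reads no data."""

    def __init__(self, path_data: str, features_stats_path: Path, batch_size: int):
        self.path_data = path_data
        self.features_stats_path = features_stats_path
        self.batch_size = batch_size

    def main(self) -> Dict[str, Any]:
        # Invented statistics, only here so the advisor has something to read.
        return {"numeric_stats": {"age": {"mean": 40.0}}}


class StubAdvisor:
    """Stand-in for ModelAdvisor with the two methods called in the diff."""

    def __init__(self, stats: Dict[str, Any]):
        self.stats = stats

    def analyze_feature_stats(self) -> Dict[str, Any]:
        return {"features": sorted(self.stats["numeric_stats"])}

    def generate_code_snippet(self) -> str:
        return "# generated code would go here"


def resolve_stats_path(stats_path: Optional[Union[str, Path]] = None) -> Path:
    """Same defaulting rule as the diff: fall back to features_stats.json."""
    return Path("features_stats.json") if stats_path is None else Path(stats_path)


def auto_configure_sketch(
    data_path: Union[str, Path],
    save_stats: bool = True,
    stats_path: Optional[Union[str, Path]] = None,
    batch_size: int = 50_000,
) -> Dict[str, Any]:
    """Follow the same steps as auto_configure, but against the stubs above."""
    calculator = StubStatistics(
        path_data=str(Path(data_path)),
        features_stats_path=resolve_stats_path(stats_path),
        batch_size=batch_size,
    )
    stats = calculator.main()

    advisor = StubAdvisor(stats)
    output = {
        "recommendations": advisor.analyze_feature_stats(),
        "code_snippet": advisor.generate_code_snippet(),
    }
    # As in the diff, save_stats only gates this key in the returned dict.
    if save_stats:
        output["statistics"] = stats
    return output


with_stats = auto_configure_sketch("data/my_dataset.csv")
without_stats = auto_configure_sketch("data/my_dataset.csv", save_stats=False)
print(sorted(with_stats))
print(sorted(without_stats))
```

Whether the statistics file is written to disk depends on `DatasetStatistics` itself, which this commit page does not show.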