@@ -11,69 +11,45 @@ The automatic model configuration system leverages statistical analysis to:
 3. **Optimize global settings** - Recommends global parameters for improved model performance
 4. **Generate code** - Provides ready-to-use Python code implementing the recommendations
 
-## 🛠️ How It Works
-
-The system works in two main phases:
-
-### 1. Statistics Collection
-
-First, the `DatasetStatistics` class analyzes your dataset to compute various statistical properties:
-
-- **Numerical features**: Mean, variance, distribution shape metrics (estimated skewness/kurtosis)
-- **Categorical features**: Vocabulary size, cardinality, unique values
-- **Text features**: Vocabulary statistics, average sequence length
-- **Date features**: Cyclical patterns, temporal variance
-
-### 2. Configuration Recommendation
-
-Then, the `ModelAdvisor` analyzes these statistics to recommend:
-
-- **Feature-specific transformations**: Based on the detected distribution type
-- **Advanced encoding options**: Such as distribution-aware encoding for complex distributions
-- **Attention mechanisms**: Tabular attention or multi-resolution attention when appropriate
-- **Global parameters**: Overall architecture suggestions based on the feature mix
-
 ## 🚀 Using the Configuration Advisor
 
-### Method 1: Using the Python API
+The simplest way to use the automatic configuration system is through the `auto_configure` function:
 
 ```python
-from kdp.stats import DatasetStatistics
-from kdp.processor import PreprocessingModel
+from kdp import auto_configure
 
-# Initialize statistics calculator
-stats_calculator = DatasetStatistics(
-    path_data="data/my_dataset.csv",
-    features_specs=features_specs  # Optional, will be inferred if not provided
-)
+# Analyze your dataset and get recommendations
+config = auto_configure("data/my_dataset.csv")
 
-# Calculate statistics
-stats = stats_calculator.main()
+# Get the ready-to-use code snippet
+print(config["code_snippet"])
 
-# Generate recommendations
-recommendations = stats_calculator.recommend_model_configuration()
+# Get feature-specific recommendations
+print(config["recommendations"])
 
-# Use the recommendations to build a model
-# You can directly use the generated code snippet or access specific recommendations
-print(recommendations["code_snippet"])
+# Get computed statistics (if save_stats=True)
+print(config["statistics"])
 ```
 
-### Method 2: Using the Command-Line Tool
+### Advanced Usage
 
-KDP provides a command-line tool to analyze datasets and generate recommendations:
+You can customize the analysis with additional parameters:
 
-```bash
-python scripts/analyze_dataset.py --data path/to/data.csv --output recommendations.json
+```python
+config = auto_configure(
+    data_path="data/my_dataset.csv",
+    features_specs={
+        "age": "NumericalFeature",
+        "category": "CategoricalFeature",
+        "text": "TextFeature"
+    },
+    batch_size=100_000,
+    save_stats=True,
+    stats_path="my_stats.json",
+    overwrite_stats=False
+)
 ```
 
-Options:
-- `--data`, `-d`: Path to CSV data file or directory (required)
-- `--output`, `-o`: Path to save recommendations (default: recommendations.json)
-- `--stats`, `-s`: Path to save/load feature statistics (default: features_stats.json)
-- `--batch-size`, `-b`: Batch size for processing (default: 50000)
-- `--overwrite`, `-w`: Overwrite existing statistics file
-- `--feature-types`, `-f`: JSON file specifying feature types (optional)
-
 ## 🔮 Distribution Detection
 
 The system can detect and recommend specific configurations for various distribution types: