Monotone Weight of Evidence (WOE) transformer and logistic regression modeling with a scikit-learn-compatible API, optimized for performance and stability.
- WOE Transformation: Convert categorical and numerical features to Weight of Evidence encoding
- Automated Feature Selection: Multiple algorithms for optimal feature selection
- Automated Feature Generation: Automatically create and select high-quality ratio and interaction features
- Binning Strategies: Smart binning with monotonicity constraints
- Sklearn Compatibility: Follows scikit-learn's API standards
- Performance Optimized: Parallel processing and vectorized operations
- SQL Export: Generate SQL for model deployment
- Scorecard Generation: Create credit scorecards with customizable scaling
- Install the package:

```bash
pip install woe-scoring
```

- Use `WOETransformer`:

```python
import pandas as pd
from woe_scoring import WOETransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

cat_cols = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
]

encoder = WOETransformer(
    max_bins=8,
    min_pct_group=0.1,
    diff_woe_threshold=0.1,
    cat_features=cat_cols,
    special_cols=special_cols,
    n_jobs=-1,
    merge_type="chi2",
    generate_features=True,  # Enable feature generation
    max_generated_features=10,
)
encoder.fit(train, train["Survived"])
encoder.save_to_file("train_dict.json")
encoder.load_woe_iv_dict("train_dict.json")
encoder.refit(train, train["Survived"])
enc_train = encoder.transform(train)
enc_test = encoder.transform(test)
model = LogisticRegression()
model.fit(enc_train, train["Survived"])
test_proba = model.predict_proba(enc_test)[:, 1]
```
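To sanity-check the encoded features, you can evaluate the hold-out probabilities with scikit-learn (a minimal follow-on to the snippet above, not part of woe-scoring itself):

```python
from sklearn.metrics import roc_auc_score

# ROC AUC of the WOE-encoded logistic regression on the hold-out set
print(f"Test ROC AUC: {roc_auc_score(test['Survived'], test_proba):.3f}")
```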
- Use `CreateModel`:

```python
import pandas as pd
from woe_scoring import CreateModel
from sklearn.model_selection import train_test_split
df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

model = CreateModel(
    max_vars=5,
    special_cols=special_cols,
    selection_method="sfs",
    model_type="sklearn",
    gini_threshold=5.0,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced",
    cv=3,
)
model.fit(train, train["Survived"])
test_proba = model.predict_proba(test[model.feature_names_])
print(model.coef_, model.intercept_)
print(model.feature_names_)
```

The `WOETransformer` converts categorical and numerical features into Weight of Evidence (WOE) values. WOE measures the predictive power of a feature by comparing the distribution of events to the distribution of non-events within each bin.
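For a binary target, the per-bin WOE and the feature-level Information Value (IV) follow standard formulas. A minimal pandas sketch for intuition (the transformer computes these internally with its own edge-case handling; sign conventions vary between references):

```python
import numpy as np
import pandas as pd

def woe_iv(feature: pd.Series, target: pd.Series):
    """Per-bin WOE = ln(%events / %non-events); IV = sum of weighted WOE gaps."""
    counts = pd.crosstab(feature, target)        # rows: bins, columns: target 0/1
    pct_event = counts[1] / counts[1].sum()      # share of events in each bin
    pct_non_event = counts[0] / counts[0].sum()  # share of non-events in each bin
    woe = np.log(pct_event / pct_non_event)      # undefined for bins with zero counts
    iv = ((pct_event - pct_non_event) * woe).sum()
    return woe, iv

woe, iv = woe_iv(train["Sex"], train["Survived"])
```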
Constructor parameters:

```python
WOETransformer(
    max_bins=10,                # Maximum number of bins for each feature
    min_pct_group=0.05,         # Minimum percentage of each bin
    n_jobs=1,                   # Number of parallel jobs
    prefix="WOE_",              # Prefix for transformed features
    merge_type="chi2",          # Bin merging strategy ('chi2', 'woe', 'monotonic')
    cat_features=None,          # List of categorical features
    special_cols=None,          # Columns to exclude from transformation
    cat_features_threshold=0,   # Threshold for auto-identifying categorical features
    diff_woe_threshold=0.05,    # Minimum WOE difference between bins
    safe_original_data=False,   # Whether to keep original features
    generate_features=False,    # Whether to generate new features
    max_generated_features=20,  # Max number of generated features to select
)
```

Key methods:

- `fit(data, target)`: Calculates optimal bins and WOE values
- `transform(data)`: Converts features to WOE values
- `save_to_file(path)`: Saves binning information to a JSON file
- `load_woe_iv_dict(path)`: Loads binning information from a JSON file
- `refit(data, target)`: Updates WOE values for existing bins with new data
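One common use of `refit` is periodic recalibration: keep the learned bin boundaries but refresh the WOE values on newer labeled data, mirroring the load/refit/transform sequence from the quick-start (here `recent` is a hypothetical newer DataFrame with the same columns):

```python
# Reuse previously learned bins, but recompute WOE values on fresh data
encoder.load_woe_iv_dict("train_dict.json")
encoder.refit(recent, recent["Survived"])
enc_recent = encoder.transform(recent)
```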
The CreateModel class combines feature selection, model training, and model evaluation:
```python
CreateModel(
    selection_method='rfe',   # Feature selection method ('rfe', 'sfs', 'iv')
    model_type='sklearn',     # Model implementation ('sklearn', 'statsmodel')
    max_vars=None,            # Maximum number of features to select
    special_cols=None,        # Columns to include as-is
    unused_cols=None,         # Columns to exclude
    n_jobs=1,                 # Number of parallel jobs
    gini_threshold=5.0,       # Minimum Gini score to keep a feature
    iv_threshold=0.05,        # Minimum IV threshold for feature selection
    corr_threshold=0.5,       # Correlation threshold for feature selection
    min_pct_group=0.05,       # Minimum percentage for each group
    random_state=None,        # Random seed for reproducibility
    class_weight='balanced',  # Class weighting strategy
    direction='forward',      # Direction for sequential feature selection
    cv=3,                     # Cross-validation folds
    l1_exp_scale=4,           # Exponent scale for L1 regularization
    l1_grid_size=20,          # Grid size for L1 regularization search
    scoring='roc_auc'         # Performance metric
)
```

Key methods:

- `fit(data, target)`: Selects features and fits the model
- `predict(data)`: Makes binary predictions
- `predict_proba(data)`: Returns probability predictions
- `save_reports(path)`: Saves model reports
- `generate_sql(encoder)`: Generates SQL for model deployment
- `save_scorecard(encoder, path, ...)`: Creates a credit scorecard
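After `fit`, the selected feature subset is exposed via `feature_names_` (as in the quick-start), so scoring and reporting look like the following sketch (the `output_dir` path is illustrative):

```python
# Score the hold-out set using only the selected features
labels = model.predict(test[model.feature_names_])
proba = model.predict_proba(test[model.feature_names_])

# Persist model reports for review
model.save_reports("output_dir")
```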
WOE-Scoring can automatically generate and select high-quality features from your data:
```python
encoder = WOETransformer(
    generate_features=True,      # Enable feature generation
    max_generated_features=20,   # Select the top 20 new features
    n_jobs=-1,
)
encoder.fit(X, y)
```

This process (a conceptual sketch follows the list):
- Creates ratio features from all pairs of numeric columns
- Calculates statistical aggregations (mean) for numeric columns grouped by categorical columns
- Calculates the Gini score for all new features
- Selects the top `max_generated_features`
- Adds them to the dataset and proceeds with WOE binning
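For intuition, the generated candidates have roughly this shape (a conceptual pandas illustration with hypothetical column names, not the library's internal code):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26], "Fare": [7.25, 71.3, 7.9], "Pclass": [3, 1, 3]})

# Ratio feature from a pair of numeric columns
df["Fare_div_Age"] = df["Fare"] / df["Age"]

# Aggregation feature: mean of a numeric column grouped by a categorical column
df["Fare_mean_by_Pclass"] = df.groupby("Pclass")["Fare"].transform("mean")
```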
```python
# First fit the WOE transformer and model
encoder = WOETransformer()
encoder.fit(train, train["target"])
train_woe = encoder.transform(train)
model = CreateModel()
model.fit(train_woe, train["target"])
# Generate SQL query for scoring
sql_query = model.generate_sql(encoder)
```
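Assuming `generate_sql` returns the query as a plain string (as the variable name suggests), it can be written straight to a file for the deployment pipeline:

```python
# Persist the generated scoring query
with open("model_score.sql", "w") as f:
    f.write(sql_query)
```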
```python
# Save a credit scorecard to Excel
model.save_scorecard(
    encoder=encoder,
    path="output_dir",
    base_scorecard_points=600,  # Base score
    odds=50,                    # Base odds
    points_to_double_odds=20,   # Points to double the odds
)
```
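These three parameters follow the standard points-to-double-odds scaling: `factor = PDO / ln(2)`, `offset = base_points - factor * ln(base_odds)`, and a score is `offset + factor * ln(odds)`. A quick check of the values above (standard scorecard formulas; the library's exact convention is assumed to match):

```python
import math

pdo, base_points, base_odds = 20, 600, 50
factor = pdo / math.log(2)                           # ~28.85 points per unit of ln(odds)
offset = base_points - factor * math.log(base_odds)  # anchors odds of 50:1 at 600 points

print(round(offset + factor * math.log(base_odds)))      # 600 at the base odds
print(round(offset + factor * math.log(2 * base_odds)))  # 620: doubling the odds adds 20
```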
```python
# Specify categorical features and their treatment
encoder = WOETransformer(
    cat_features=["education", "marital_status", "occupation"],
    max_bins=5,               # Max bins for categorical features
    diff_woe_threshold=0.1,   # Merge bins with similar WOE values
    min_pct_group=0.05,       # Minimum population percentage per bin
)
```

The library is optimized for performance with:
- Vectorized operations for fast transformation
- Parallel processing for binning and feature selection
- Efficient memory usage for large datasets
- Optimized algorithms for binning and feature selection
This project is licensed under the MIT License - see the LICENSE file for details.