WOE-Scoring

Monotone Weight Of Evidence (WOE) Transformer and LogisticRegression model with scikit-learn API. Optimized for performance and stability.

Features

  • WOE Transformation: Convert categorical and numerical features to Weight of Evidence encoding
  • Automated Feature Selection: Multiple algorithms for optimal feature selection
  • Automated Feature Generation: Automatically create and select high-quality ratio and interaction features
  • Binning Strategies: Smart binning with monotonicity constraints
  • Sklearn Compatibility: Follows scikit-learn's API standards
  • Performance Optimized: Parallel processing and vectorized operations
  • SQL Export: Generate SQL for model deployment
  • Scorecard Generation: Create credit scorecards with customizable scaling

Installation

pip install woe-scoring

Quickstart

  1. Install the package:
pip install woe-scoring
  2. Use WOETransformer:
import pandas as pd
from woe_scoring import WOETransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

cat_cols = [
    "Pclass",
    "Sex",
    "SibSp",
    "Parch",
    "Embarked",
]

encoder = WOETransformer(
    max_bins=8,
    min_pct_group=0.1,
    diff_woe_threshold=0.1,
    cat_features=cat_cols,
    special_cols=special_cols,
    n_jobs=-1,
    merge_type="chi2",
    generate_features=True,  # Enable feature generation
    max_generated_features=10,
)

encoder.fit(train, train["Survived"])
encoder.save_to_file("train_dict.json")

encoder.load_woe_iv_dict("train_dict.json")
encoder.refit(train, train["Survived"])

enc_train = encoder.transform(train)
enc_test = encoder.transform(test)

model = LogisticRegression()
model.fit(enc_train, train["Survived"])
test_proba = model.predict_proba(enc_test)[:, 1]
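
To sanity-check the encoding, score the held-out probabilities with standard scikit-learn metrics (a standard check, not part of the library's API):

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(test["Survived"], test_proba)
print(f"Test AUC: {auc:.3f}, Gini: {2 * auc - 1:.3f}")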
  3. Use CreateModel:
import pandas as pd
from woe_scoring import CreateModel
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic_data.csv")
train, test = train_test_split(
    df, test_size=0.3, random_state=42, stratify=df["Survived"]
)

special_cols = [
    "PassengerId",
    "Survived",
    "Name",
    "Ticket",
    "Cabin",
]

model = CreateModel(
    max_vars=5,
    special_cols=special_cols,
    selection_method="sfs",
    model_type="sklearn",
    gini_threshold=5.0,
    n_jobs=-1,
    random_state=42,
    class_weight="balanced",
    cv=3,
)
model.fit(train, train["Survived"])
test_proba = model.predict_proba(test[model.feature_names_])

print(model.coef_, model.intercept_)
print(model.feature_names_)

Detailed Documentation

WOETransformer

The WOETransformer converts categorical and numerical features into Weight of Evidence (WOE) values. WOE measures the predictive power of a feature by comparing the distribution of events and non-events.
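
As a point of reference, the conventional WOE of a bin is the log ratio of its share of non-events to its share of events (sign conventions vary between implementations). A minimal sketch of that calculation, not the library's internal code:

import numpy as np
import pandas as pd

def woe_per_bin(binned: pd.Series, target: pd.Series) -> pd.Series:
    # Share of events (target == 1) and non-events (target == 0) in each bin
    event_share = target.groupby(binned).sum() / target.sum()
    non_event_share = (1 - target).groupby(binned).sum() / (1 - target).sum()
    # Conventional WOE: log ratio of non-event share to event share
    return np.log(non_event_share / event_share)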

WOETransformer(
    max_bins=10,               # Maximum number of bins for each feature
    min_pct_group=0.05,        # Minimum percentage of each bin
    n_jobs=1,                  # Number of parallel jobs
    prefix="WOE_",             # Prefix for transformed features
    merge_type="chi2",         # Bin merging strategy ('chi2', 'woe', 'monotonic')
    cat_features=None,         # List of categorical features
    special_cols=None,         # Columns to exclude from transformation
    cat_features_threshold=0,  # Threshold for auto-identifying categorical features
    diff_woe_threshold=0.05,   # Minimum WOE difference between bins
    safe_original_data=False,  # Whether to keep original features
    generate_features=False,   # Whether to generate new features
    max_generated_features=20  # Max number of generated features to select
)

Key Methods

  • fit(data, target): Calculates optimal bins and WOE values
  • transform(data): Converts features to WOE values
  • save_to_file(path): Saves binning information to a JSON file
  • load_woe_iv_dict(path): Loads binning information from a JSON file
  • refit(data, target): Updates WOE values for existing bins with new data

CreateModel

The CreateModel class combines feature selection, model training, and model evaluation:

CreateModel(
    selection_method='rfe',    # Feature selection method ('rfe', 'sfs', 'iv')
    model_type='sklearn',      # Model implementation ('sklearn', 'statsmodel')
    max_vars=None,             # Maximum number of features to select
    special_cols=None,         # Columns to include as-is
    unused_cols=None,          # Columns to exclude
    n_jobs=1,                  # Number of parallel jobs
    gini_threshold=5.0,        # Minimum Gini score to keep a feature
    iv_threshold=0.05,         # Minimum IV threshold for feature selection
    corr_threshold=0.5,        # Correlation threshold for feature selection
    min_pct_group=0.05,        # Minimum percentage for each group
    random_state=None,         # Random seed for reproducibility
    class_weight='balanced',   # Class weighting strategy
    direction='forward',       # Direction for sequential feature selection
    cv=3,                      # Cross-validation folds
    l1_exp_scale=4,            # Exponent scale for L1 regularization
    l1_grid_size=20,           # Grid size for L1 regularization search
    scoring='roc_auc'          # Performance metric
)
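
For example, switching to IV-based selection exercises the iv_threshold and corr_threshold parameters documented above (values and comments here are illustrative):

model = CreateModel(
    selection_method="iv",   # Rank candidate features by Information Value
    iv_threshold=0.1,        # Minimum IV for a feature to be kept
    corr_threshold=0.5,      # Correlation threshold for feature selection
    max_vars=10,
    n_jobs=-1,
    random_state=42,
)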

Key Methods

  • fit(data, target): Selects features and fits model
  • predict(data): Makes binary predictions
  • predict_proba(data): Returns probability predictions
  • save_reports(path): Saves model reports
  • generate_sql(encoder): Generates SQL for model deployment
  • save_scorecard(encoder, path, ...): Creates credit scorecard
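
save_reports is the one method not demonstrated elsewhere in this README; given the signature above, usage on a fitted model is simply (the path is a placeholder):

model.save_reports("reports_dir")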

Advanced Usage

Automated Feature Generation

WOE-Scoring can automatically generate and select high-quality features from your data:

encoder = WOETransformer(
    generate_features=True,    # Enable feature generation
    max_generated_features=20, # Select top 20 new features
    n_jobs=-1
)
encoder.fit(X, y)

This process (see the sketch after the list):

  1. Creates ratio features from all pairs of numeric columns
  2. Calculates statistical aggregations (mean) for numeric columns grouped by categorical columns
  3. Calculates the Gini score for all new features
  4. Selects the top max_generated_features
  5. Adds them to the dataset and proceeds with WOE binning
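
A simplified sketch of steps 1, 3, and 4, to illustrate the idea only (the library's internal implementation differs, and all names here are hypothetical):

import numpy as np
import pandas as pd
from itertools import combinations
from sklearn.metrics import roc_auc_score

def ratio_features(df: pd.DataFrame, num_cols: list) -> pd.DataFrame:
    # Step 1: ratio features from all pairs of numeric columns
    out = pd.DataFrame(index=df.index)
    for a, b in combinations(num_cols, 2):
        out[f"{a}_div_{b}"] = df[a] / df[b].replace(0, np.nan)
    return out

def top_by_gini(features: pd.DataFrame, target: pd.Series, top_n: int) -> list:
    # Steps 3-4: univariate Gini (|2 * AUC - 1|) per feature, keep the best
    ginis = {}
    for col in features.columns:
        mask = features[col].notna()
        ginis[col] = abs(2 * roc_auc_score(target[mask], features[col][mask]) - 1)
    return sorted(ginis, key=ginis.get, reverse=True)[:top_n]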

Generating SQL for Deployment

# First fit the WOE transformer and model
encoder = WOETransformer()
encoder.fit(train, train["target"])
train_woe = encoder.transform(train)

model = CreateModel()
model.fit(train_woe, train["target"])

# Generate SQL query for scoring
sql_query = model.generate_sql(encoder)
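
Assuming generate_sql returns the query as a plain string, it can be written straight to a file for deployment (the filename is a placeholder):

with open("model_scoring.sql", "w") as f:
    f.write(sql_query)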

Creating a Scorecard

# Save a credit scorecard to Excel
model.save_scorecard(
    encoder=encoder,
    path="output_dir",
    base_scorecard_points=600,  # Base score
    odds=50,                    # Base odds
    points_to_double_odds=20    # Points to double the odds
)
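
These three parameters follow the standard points-to-double-odds (PDO) scaling used in credit scorecards. Under that convention (the library's exact formula may differ), a score is derived from a predicted bad-rate like this:

import numpy as np

def pdo_score(p_bad, base_points=600, base_odds=50, pdo=20):
    # Standard PDO scaling: adding `pdo` points doubles the good:bad odds
    factor = pdo / np.log(2)
    offset = base_points - factor * np.log(base_odds)
    odds_good = (1 - p_bad) / p_bad  # good:bad odds implied by the bad probability
    return offset + factor * np.log(odds_good)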

Customizing Binning for Categorical Features

# Specify categorical features and their treatment
encoder = WOETransformer(
    cat_features=["education", "marital_status", "occupation"],
    max_bins=5,                 # Max bins for categorical features
    diff_woe_threshold=0.1,     # Merge bins with similar WOE values
    min_pct_group=0.05          # Minimum population percentage per bin
)

Performance Optimization

The library is optimized for performance with:

  • Vectorized operations for fast transformation
  • Parallel processing for binning and feature selection
  • Efficient memory usage for large datasets
  • Optimized algorithms for binning and feature selection

License

This project is licensed under the MIT License - see the LICENSE file for details.
