Welcome to FeatureClus, a Python library designed to simplify feature selection for clustering models. This tool helps you select the most relevant features that enhance clustering performance, ensuring you avoid the "curse of dimensionality" and make your clustering algorithms more efficient and interpretable. 🧠
The feature selection process is driven by evaluating how each feature impacts the clustering results. FeatureClus uses an isolated data shift for each feature to assess its importance. The process follows these steps:
- MinMaxScaler: First, we scale the features using MinMaxScaler to normalize the data.
- PCA (80% variance): Next, we apply Principal Component Analysis (PCA) to reduce dimensionality, retaining 80% of the variance.
- DBSCAN Clustering: After reducing the dimensionality, DBSCAN is used to perform clustering.
- Silhouette Score Calculation: For each feature, we calculate the silhouette score to evaluate the quality of the clusters. The silhouette score represents how similar an object is to its own cluster compared to other clusters.
- Data Shift and Feature Importance: By applying isolated shifts to each feature and recalculating the silhouette score, we measure how the score changes. The absolute difference in the silhouette score after shifting each feature is used to rank the features by importance.
This method ensures that the features are evaluated for their individual contribution to the clustering process, allowing you to focus on the most impactful features.
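The steps above can be sketched with plain scikit-learn. This is a minimal illustration of the shift-and-rescore idea, not the FeatureClus internals: the pipeline (MinMaxScaler → PCA at 80% variance → DBSCAN → silhouette) follows the description above, but the exact shift scheme is an assumption here — we perturb a random half of the rows of one feature at a time so the shift is not simply undone by rescaling.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

def pipeline_score(X):
    """MinMaxScaler -> PCA (80% variance) -> DBSCAN -> silhouette score."""
    Z = PCA(n_components=0.8).fit_transform(MinMaxScaler().fit_transform(X))
    labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(Z)
    if len(set(labels) - {-1}) < 2:   # silhouette needs at least 2 clusters
        return float("nan")
    return silhouette_score(Z, labels)

rng = np.random.default_rng(0)
X, _ = make_blobs(n_samples=500, centers=4, n_features=6, random_state=0)
baseline = pipeline_score(X)

# Shift one feature at a time (an "isolated" shift) and rank features by
# the absolute change in silhouette score relative to the baseline.
importance = {}
rows = rng.random(len(X)) < 0.5       # perturb a random half of the rows
for j in range(X.shape[1]):
    shifted = X.copy()
    shifted[rows, j] += 25            # one example shift magnitude
    importance[j] = abs(pipeline_score(shifted) - baseline)

ranking = sorted(importance, key=importance.get, reverse=True)
print("features ranked by importance:", ranking)
```

Features whose perturbation changes the silhouette score the most sit at the top of `ranking`; features the clustering barely depends on produce little or no change.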
- 🔍 Feature Ranking: Ranks features based on the absolute change in silhouette score after applying isolated shifts to each feature.
- 📈 Cluster Evaluation Metrics: Calculates the silhouette score to assess the clustering quality and the influence of each feature.
- 💻 Easy-to-Use API: A simple, intuitive API that can be easily integrated into your machine learning pipeline.
To install the library, run the following command:
```bash
pip install featclus
```

Here is a quick example of how to use FeatureClus:
```python
import pandas as pd
from featureclus import FeatureSelection
from sklearn.datasets import make_blobs

# Sample DataFrame
data, labels = make_blobs(n_samples=10000, centers=7, n_features=15, random_state=42)
df = pd.DataFrame(data, columns=[f"Feature_{i}" for i in range(15)])

# Initialize the FeatureSelection model
model = FeatureSelection(data=df, shifts=[1, 25, 50, 75, 100], n_jobs=-1)

# Inspect how each feature contributes to clustering
metrics = model.get_metrics()
```

`get_metrics()` returns metrics that assess how each feature contributes to clustering.
The library can also select the top `n_features` features based on their importance to the clustering results.
If you find this feature selection tool helpful and would like to support its continued development, consider buying me a coffee. Your support helps maintain and improve this project!
- ⭐ Star this repository
- 🍴 Fork it and contribute
- 📢 Share it with others who might find it useful
- 🐛 Report issues or suggest new features
Your support, in any form, is greatly appreciated! 🙏
This project is licensed under the MIT License. See the LICENSE file for more details.
Happy clustering! 🎉