Description
Describe the workflow you want to enable
I want users to be able to incorporate feature importance into distance-based algorithms such as clustering (e.g., KMeans) and nearest neighbors (e.g., KNeighborsClassifier). scikit-learn currently lets users define custom distance metrics, but it has no built-in weighted distance metric, which matters when some features are more important than others.
Example Workflow:
1. A user has a dataset where some features are more relevant than others (e.g., in customer segmentation, age and income might be more important than the number of children).
2. The user wants to use a clustering algorithm like KMeans or a nearest neighbors algorithm like KNeighborsClassifier but needs to account for the varying importance of features.
3. The user specifies a vector of weights corresponding to the importance of each feature.
4. The algorithm uses the weighted Euclidean distance metric to compute distances, ensuring that more important features have a greater influence on the results.
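For illustration, here is a rough sketch of how this workflow can be approximated today for KMeans, which has no metric parameter. It relies on the fact that multiplying feature i by sqrt(w_i) makes the ordinary Euclidean distance equal to the weighted one. The synthetic data, the weight values, and the choice to standardize first are all assumptions made for the example, not part of the proposal.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a customer table: age, income, number of children.
X = np.column_stack([
    rng.normal(40, 12, size=200),
    rng.normal(50_000, 15_000, size=200),
    rng.integers(0, 5, size=200),
])

# Illustrative weights (assumed): age and income count fully, children much less.
weights = np.array([1.0, 1.0, 0.1])

# Standardize first so the weights, not the raw units, set each feature's influence.
X_std = StandardScaler().fit_transform(X)

# Scaling column i by sqrt(w_i) turns squared Euclidean distance into
# sum_i w_i * (x_i - y_i)^2, i.e. the weighted Euclidean distance.
X_weighted = X_std * np.sqrt(weights)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_weighted)
labels = kmeans.labels_
```

The user has to carry the scaling step around manually, which is the friction a built-in weighted metric would remove.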
Describe your proposed solution
I propose adding a Weighted Euclidean Distance Metric to scikit-learn as a built-in distance metric. This will allow users to specify feature weights directly, making it easier to incorporate feature importance into distance-based algorithms.
Key Components of the Solution:
- New Class:
  - Add a WeightedEuclideanDistance class to the sklearn.metrics.pairwise module.
  - This class will accept a vector of weights during initialization.
  - It will compute the weighted Euclidean distance between two points using the formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i \, (x_i - y_i)^2}$$

where $w_i$ are the user-defined weights.
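A minimal sketch of what such a class could look like, written as a plain Python callable with NumPy. This is hypothetical: the class does not exist in scikit-learn, and the `__call__` interface and the non-negativity check on the weights are assumptions added for the example.

```python
import numpy as np


class WeightedEuclideanDistance:
    """Hypothetical sketch of the proposed metric; not an existing scikit-learn class."""

    def __init__(self, weights):
        weights = np.asarray(weights, dtype=float)
        # Assumption: weights must be non-negative so the square root is defined.
        if weights.ndim != 1 or np.any(weights < 0):
            raise ValueError("weights must be a 1-D array of non-negative values")
        self.weights = weights

    def __call__(self, x, y):
        # d(x, y) = sqrt(sum_i w_i * (x_i - y_i)^2)
        diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
        return float(np.sqrt(np.sum(self.weights * diff ** 2)))


metric = WeightedEuclideanDistance([1.0, 0.25, 2.0])
d = metric([1.0, 2.0, 3.0], [4.0, 6.0, 3.0])
# Worked by hand: 1.0 * 9 + 0.25 * 16 + 2.0 * 0 = 13, so d = sqrt(13).
```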
Describe alternatives you've considered, if relevant
No response
Additional context
Why This Feature is Needed:
- Feature Importance: In many real-world datasets, not all features are equally important. For example, in medical diagnosis, certain biomarkers might be more relevant than others. A weighted distance metric allows users to account for this.
- Ease of Use: While scikit-learn allows users to define custom distance metrics, this requires writing additional code and can be error-prone. A built-in weighted distance metric would simplify the process and make it more accessible to users.
- Alignment with scikit-learn’s Design: Scikit-learn emphasizes ease of use and consistency. Adding a weighted distance metric aligns with this philosophy by providing a simple, consistent way to incorporate feature importance into distance-based algorithms.
Existing Alternatives:
- Custom Distance Functions: Users can define custom distance functions using the metric parameter. However, this requires additional coding and lacks the convenience of a built-in solution.
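- Feature Weighting via Preprocessing: Users can scale features by their importance before applying distance-based algorithms (multiplying feature i by sqrt(w_i) reproduces the weighted Euclidean distance). However, this approach is less intuitive, and the same scaling must be applied to every new sample at prediction time.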
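To make the two alternatives above concrete, the sketch below runs both with KNeighborsClassifier and compares the neighbor distances they return. The data, labels, and weights are made up for the example, and `weighted_euclidean` is a hypothetical helper name. A callable metric is evaluated in Python for each pair of points, so it is expected to be much slower than a built-in metric.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_new = rng.normal(size=(10, 3))

# Illustrative weights (assumed): the third feature matters much less.
weights = np.array([1.0, 1.0, 0.1])


def weighted_euclidean(a, b):
    # Hypothetical helper: sqrt(sum_i w_i * (a_i - b_i)^2)
    return np.sqrt(np.sum(weights * (a - b) ** 2))


# Alternative 1: pass a custom callable through the metric parameter.
knn_callable = KNeighborsClassifier(
    n_neighbors=5, metric=weighted_euclidean, algorithm="brute"
).fit(X, y)
dist_callable, _ = knn_callable.kneighbors(X_new)

# Alternative 2: scale features by sqrt(w) and use the default Euclidean metric.
scale = np.sqrt(weights)
knn_scaled = KNeighborsClassifier(n_neighbors=5).fit(X * scale, y)
dist_scaled, _ = knn_scaled.kneighbors(X_new * scale)  # new data must be scaled too

same_distances = np.allclose(dist_callable, dist_scaled)
```

Both routes give the same neighbor distances, but each puts bookkeeping on the user: either writing the metric by hand or remembering to scale every input.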
Potential Impact:
- This feature would benefit users in domains like bioinformatics, finance, and customer analytics, where feature importance is critical.
- It would also make scikit-learn more competitive with other machine learning libraries that offer weighted distance metrics.