Add Weighted Euclidean Distance Metric #30732

Open
@Mohdnihal03

Description

Describe the workflow you want to enable

I want users to be able to incorporate feature importance easily into distance-based algorithms such as clustering (e.g., KMeans) and nearest neighbors (e.g., KNeighborsClassifier). Currently, scikit-learn allows users to define custom distance metrics, but there is no built-in support for weighted distance metrics, which are essential when certain features are more important than others.

Example Workflow:
A user has a dataset where some features are more relevant than others (e.g., in customer segmentation, age and income might be more important than the number of children).

The user wants to use a clustering algorithm like KMeans or a nearest neighbors algorithm like KNeighborsClassifier but needs to account for the varying importance of features.

The user specifies a vector of weights corresponding to the importance of each feature.

The algorithm uses the weighted Euclidean distance metric to compute distances, ensuring that more important features have a greater influence on the results.

Describe your proposed solution

I propose adding a Weighted Euclidean Distance Metric to scikit-learn as a built-in distance metric. This will allow users to specify feature weights directly, making it easier to incorporate feature importance into distance-based algorithms.

Key Components of the Solution:

  1. New Class:
  • Add a WeightedEuclideanDistance class to the sklearn.metrics.pairwise module.
    This class will accept a vector of weights during initialization.
  • It will compute the weighted Euclidean distance between two points using the formula:
    $$d(x, y) = \sqrt{\sum_{i=1}^{n} w_i \, (x_i - y_i)^2}$$
    where $w_i$ are the user-defined weights and $n$ is the number of features.
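
To make the formula concrete, here is a minimal pure-Python sketch of the computation. The function name `weighted_euclidean` and the non-negativity check on the weights are assumptions made for illustration; they are not part of scikit-learn or of the proposed `WeightedEuclideanDistance` class.

```python
import math


def weighted_euclidean(x, y, w):
    """Hypothetical helper: sqrt(sum_i w_i * (x_i - y_i)^2)."""
    if not (len(x) == len(y) == len(w)):
        raise ValueError("x, y and w must have the same length")
    # Assumption: weights are non-negative, so the sum under the root is >= 0.
    if any(wi < 0 for wi in w):
        raise ValueError("weights must be non-negative")
    return math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))


# Unit weights reduce to the ordinary Euclidean distance.
d_plain = weighted_euclidean([0.0, 0.0], [3.0, 4.0], [1.0, 1.0])

# Up-weighting the first feature and ignoring the second changes the result.
d_weighted = weighted_euclidean([0.0, 0.0], [3.0, 4.0], [4.0, 0.0])
```

With unit weights the sum under the root is 9 + 16 = 25, so `d_plain` is 5. With weights (4, 0) only the first feature contributes, 4 * 9 = 36, so `d_weighted` is 6.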

Describe alternatives you've considered, if relevant

No response

Additional context

Why This Feature is Needed:

  • Feature Importance: In many real-world datasets, not all features are equally important. For example, in medical diagnosis, certain biomarkers might be more relevant than others. A weighted distance metric allows users to account for this.
  • Ease of Use: While scikit-learn allows users to define custom distance metrics, this requires writing additional code and can be error-prone. A built-in weighted distance metric would simplify the process and make it more accessible to users.
  • Alignment with scikit-learn’s Design: Scikit-learn emphasizes ease of use and consistency. Adding a weighted distance metric aligns with this philosophy by providing a simple, consistent way to incorporate feature importance into distance-based algorithms.

Existing Alternatives:

  • Custom Distance Functions: Users can define custom distance functions using the metric parameter. However, this requires additional coding and lacks the convenience of a built-in solution.
  • Feature Weighting via Preprocessing: Users can scale features by their importance before applying distance-based algorithms. However, this approach is less intuitive and can be cumbersome.
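
For reference, the preprocessing alternative gives exactly the same distances as the weighted formula when each feature is multiplied by the square root of its weight, since (sqrt(w_i) * (x_i - y_i))^2 = w_i * (x_i - y_i)^2. The sketch below checks this on one pair of points; the helper names `euclidean` and `scale_by_sqrt_weights` and the sample values are made up for illustration.

```python
import math


def euclidean(x, y):
    """Plain (unweighted) Euclidean distance."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def scale_by_sqrt_weights(x, w):
    """Hypothetical preprocessing step: multiply feature i by sqrt(w_i)."""
    return [math.sqrt(wi) * xi for xi, wi in zip(x, w)]


w = [4.0, 1.0, 0.25]   # illustrative feature weights
x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 7.0]

# Weighted distance computed directly from the formula.
direct = math.sqrt(sum(wi * (xi - yi) ** 2 for xi, yi, wi in zip(x, y, w)))

# Same distance obtained by scaling the features, then using plain Euclidean.
via_scaling = euclidean(scale_by_sqrt_weights(x, w), scale_by_sqrt_weights(y, w))
```

Both values equal sqrt(12). This is why scaling the feature columns once up front also works with estimators such as KMeans that use plain Euclidean distance; a built-in metric would mainly remove the need to remember the square root.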

Potential Impact:

  • This feature would benefit users in domains like bioinformatics, finance, and customer analytics, where feature importance is critical.
  • It would also make scikit-learn more competitive with other machine learning libraries that offer weighted distance metrics.
