Scikit-Learn
Summary
Ch 1. Basic Concepts
Ch 1.1 Loading Data
sklearn.datasets (Dataset Loading & Fetching)
sklearn.model_selection (Data Splitting)
Ch 1.2 Data Preprocessing
sklearn.preprocessing (Feature Scaling & Transformation)
Ch 1.3 Feature Selection
sklearn.feature_selection (Feature Selection Methods)
Ch 1.4 Model Training
sklearn.linear_model (Regression & Classification Models)
sklearn.ensemble (Ensemble Models)
Ch 1.5 Model Evaluation
sklearn.metrics (Performance Metrics)
Ch 2. Clustering
Ch 2.1 Clustering KMeans
sklearn.cluster (K-Means Clustering)
Other Common Methods in KMeans:
Ch 2.2 Clustering MeanShift
sklearn.cluster (MeanShift Clustering)
Other Common Methods in MeanShift:
Ch 2.3 Clustering DBSCAN
sklearn.cluster (DBSCAN Clustering)
Other Common Methods in DBSCAN:
Ch 2.4 Clustering GMM
sklearn.mixture (Gaussian Mixture Model Clustering)
Other Common Methods in GMM:
Ch 3. Classifying
Ch 3.1 Classifying KNN
sklearn.neighbors (K-Nearest Neighbors Classification)
Other Common Methods in KNN:
Ch 3.2 Classifying Naive Bayes
sklearn.naive_bayes (Naive Bayes Classification)
Other Common Methods in Naive Bayes:
Ch 3.3 Classifying Logistic Reg
sklearn.linear_model (Logistic Regression)
Other Common Methods in Logistic Regression:
Ch 3.4 Classifying SVM
sklearn.svm (Support Vector Machine Classification)
Other Common Methods in SVM:
Ch 3.5 Classifying Decision Tree
sklearn.tree (Decision Tree Classification)
Other Common Methods in Decision Tree:
Ch 3.6 Classifying MLP
sklearn.neural_network (Multi-Layer Perceptron Classification)
Other Common Methods in MLP:
Ch 4. Regression
Ch 4.1 Regression KNN
sklearn.neighbors (K-Nearest Neighbors Regression)
Other Common Methods in KNN Regression:
Ch 4.2 Regression LR
sklearn.linear_model (Linear Regression)
Other Common Methods in Linear Regression:
Ch 4.3 Regression SVM
sklearn.svm (Support Vector Machine Regression)
Other Common Methods in SVM Regression:
Ch 4.4 Regression Decision Tree
Ch 4.5 Regression MLP
Ch 5. Dimensionality Reduction
Ch 1. Basic Concepts
Ch 1.1 Loading Data
sklearn.datasets (Dataset Loading & Fetching)
To work with datasets, import: import sklearn.datasets as ds
data = ds.load_iris()
o Returns: A Bunch (dictionary-like) object with .data, .target, and .feature_names.
data = ds.load_digits()
o Returns: A dataset for handwritten digit recognition.
data = ds.load_wine()
o Returns: A dataset for wine classification.
data = ds.load_breast_cancer()
o Returns: A dataset for diagnosing breast cancer.
data = ds.fetch_openml(name, version)
o Returns: A dataset from OpenML in Pandas or NumPy format.
o Parameters:
name (str) – Name of the dataset.
version (int, default=None) – Version of the dataset (if multiple exist).
X, y = ds.make_classification(n_samples, n_features, n_classes)
o Returns: A synthetic dataset for classification.
o Parameters:
n_samples (int, default=100) – Number of generated samples.
n_features (int, default=20) – Total number of input features.
n_classes (int, default=2) – Number of distinct target classes.
Example:
iris = ds.load_iris()
sklearn.model_selection (Data Splitting)
To split datasets, import: import sklearn.model_selection as ms
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size, random_state)
o Returns: Training and test sets.
o Parameters:
test_size (float, default=0.25) – Proportion of data for testing.
random_state (int, default=None) – Ensures reproducibility.
Example:
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=42)
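Putting the two modules together, a minimal runnable sketch (the return_X_y=True shortcut is part of the same loaders, though not shown above):
import sklearn.datasets as ds
import sklearn.model_selection as ms

X, y = ds.load_iris(return_X_y=True)                                  # features (150, 4) and labels (150,)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=42)  # reproducible 80/20 split
print(X_train.shape, X_test.shape)                                    # (120, 4) (30, 4)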
Ch 1.2 Data Preprocessing
sklearn.preprocessing (Feature Scaling & Transformation)
To preprocess data, import: import sklearn.preprocessing as pp
scaler = pp.StandardScaler()
o Returns: A StandardScaler instance for normalizing features.
X_scaled = scaler.fit_transform(X)
o Returns: The scaled feature matrix.
encoder = pp.OneHotEncoder()
o Returns: An encoder instance for categorical variables.
X_encoded = encoder.fit_transform(X_categorical)
o Returns: A transformed sparse matrix representing one-hot encoded categorical values.
o Parameters:
X_categorical (array-like) – Input categorical features to encode.
Example:
X_scaled = pp.StandardScaler().fit_transform(X)
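A short self-contained sketch covering both transformers; the toy arrays here are made up for illustration:
import numpy as np
import sklearn.preprocessing as pp

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
X_scaled = pp.StandardScaler().fit_transform(X)       # zero mean, unit variance per column

X_categorical = np.array([['red'], ['green'], ['red']])
X_encoded = pp.OneHotEncoder().fit_transform(X_categorical)
print(X_encoded.toarray())                            # one column per category: [[0,1],[1,0],[0,1]]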
Ch 1.3 Feature Selection
sklearn.feature_selection (Feature Selection Methods)
To select important features, import: import sklearn.feature_selection as fs
selector = fs.SelectKBest(score_func, k)
o Returns: A selector instance that picks the top k features.
o Parameters:
score_func (callable, e.g., f_classif) – Scoring function to evaluate features.
k (int, default=10) – Number of top features to select.
X_selected = selector.fit_transform(X, y)
o Returns: The dataset with selected features.
Example:
X_selected = fs.SelectKBest(fs.f_classif, k=5).fit_transform(X, y)
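A minimal sketch on the iris data (get_support is a standard selector method not covered above):
import sklearn.datasets as ds
import sklearn.feature_selection as fs

X, y = ds.load_iris(return_X_y=True)
selector = fs.SelectKBest(fs.f_classif, k=2)          # keep the 2 highest-scoring features
X_selected = selector.fit_transform(X, y)
print(selector.get_support())                         # boolean mask of the kept features
print(X_selected.shape)                               # (150, 2)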
Ch 1.4 Model Training
sklearn.linear_model (Regression & Classification Models)
To train models, import: import sklearn.linear_model as lm
model = lm.LogisticRegression()
o Returns: A logistic regression model instance.
model.fit(X_train, y_train)
o Returns: The fitted estimator itself (self), which is why calls can be chained. Trains the model.
sklearn.ensemble (Ensemble Models)
To use ensemble models, import: import sklearn.ensemble as en
model = en.RandomForestClassifier(n_estimators)
o Returns: A random forest classifier.
o Parameters:
n_estimators (int, default=100) – Number of trees in the forest.
Example:
model = lm.LogisticRegression().fit(X_train, y_train)
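A runnable sketch fitting both model families on the same split (max_iter=200 is a hedge so the default lbfgs solver converges on iris):
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.linear_model as lm
import sklearn.ensemble as en

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
log_model = lm.LogisticRegression(max_iter=200).fit(X_train, y_train)
rf_model = en.RandomForestClassifier(n_estimators=100).fit(X_train, y_train)
print(log_model.predict(X_test[:5]))                  # predicted classes for 5 held-out samples
print(rf_model.predict(X_test[:5]))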
Ch 1.5 Model Evaluation
sklearn.metrics (Performance Metrics)
To evaluate models, import: import sklearn.metrics as mt
score = mt.accuracy_score(y_true, y_pred)
o Returns: The accuracy of classification.
o Parameters:
y_true (array-like) – True labels.
y_pred (array-like) – Predicted labels.
cm = mt.confusion_matrix(y_true, y_pred)
o Returns: The confusion matrix.
o Parameters:
y_true (array-like) – True labels.
y_pred (array-like) – Predicted labels.
r2 = mt.r2_score(y_true, y_pred)
o Returns: The R² score for regression.
Example:
score = mt.accuracy_score(y_test, model.predict(X_test))
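A self-contained sketch that trains a model and evaluates it with the metrics above:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.linear_model as lm
import sklearn.metrics as mt

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
model = lm.LogisticRegression(max_iter=200).fit(X_train, y_train)
y_pred = model.predict(X_test)
print(mt.accuracy_score(y_test, y_pred))              # fraction of correct predictions
print(mt.confusion_matrix(y_test, y_pred))            # rows = true classes, columns = predicted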
Ch 2. Clustering
Ch 2.1 Clustering KMeans
sklearn.cluster (K-Means Clustering)
To use KMeans, first import: import sklearn.cluster as cl
kmn = cl.KMeans(n_clusters, init, n_init, max_iter, random_state)
o Returns: A KMeans clustering model instance.
o Parameters:
n_clusters (int, default=8) – The number of clusters to form.
init ({'k-means++', 'random'} or ndarray, default='k-means++') – Initialization method for cluster centers.
n_init (int, default=10) – Number of times KMeans runs with different centroid seeds.
max_iter (int, default=300) – Maximum iterations per run.
random_state (int, default=None) – Controls random number generation for centroid initialization.
Example: kmn = cl.KMeans(n_clusters=3, init='k-means++', n_init=10, max_iter=300, random_state=42)
Other Common Methods in KMeans:
kmn.fit(X) – Trains the KMeans model.
labels = kmn.predict(X) – Assigns clusters to data points.
labels = kmn.fit_predict(X) – Directly assigns cluster labels while training.
labels = kmn.labels_ – Retrieves assigned cluster labels.
centroids = kmn.cluster_centers_ – Retrieves centroid coordinates.
inertia = kmn.inertia_ – Sum of squared distances of samples to their closest cluster center (lower means tighter clusters).
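A minimal end-to-end sketch (ds.make_blobs is a synthetic-data generator not covered above):
import sklearn.datasets as ds
import sklearn.cluster as cl

X, _ = ds.make_blobs(n_samples=300, centers=3, random_state=42)   # 3 well-separated groups
kmn = cl.KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmn.fit_predict(X)                           # train and assign in one call
print(kmn.cluster_centers_)                           # (3, 2) centroid coordinates
print(kmn.inertia_)                                   # sum of squared distances to centroids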
Ch 2.2 Clustering MeanShift
sklearn.cluster (MeanShift Clustering)
To use MeanShift, first import: import sklearn.cluster as cl
ms = cl.MeanShift(bandwidth, bin_seeding, cluster_all)
o Returns: A MeanShift clustering model instance.
o Parameters:
bandwidth (float, default=None) – Kernel size controlling cluster granularity.
bin_seeding (bool, default=False) – If True, uses initial bins for seed selection.
cluster_all (bool, default=True) – If True, assigns all points to a cluster.
Example: ms = cl.MeanShift(bandwidth=2.0, bin_seeding=True, cluster_all=True)
Other Common Methods in MeanShift:
ms.fit(X) – Trains the MeanShift model.
labels = ms.predict(X) – Predicts cluster assignments.
labels = ms.fit_predict(X) – Assigns cluster labels during training.
centroids = ms.cluster_centers_ – Retrieves estimated cluster centers.
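A minimal sketch; cl.estimate_bandwidth is a helper (not covered above) that picks a bandwidth from the data instead of hand-tuning it:
import sklearn.datasets as ds
import sklearn.cluster as cl

X, _ = ds.make_blobs(n_samples=300, centers=3, random_state=42)
bw = cl.estimate_bandwidth(X, quantile=0.2)           # data-driven kernel size
ms = cl.MeanShift(bandwidth=bw, bin_seeding=True)
labels = ms.fit_predict(X)
print(len(ms.cluster_centers_))                       # number of clusters found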
Ch 2.3 Clustering DBSCAN
sklearn.cluster (DBSCAN Clustering)
To use DBSCAN, first import: import sklearn.cluster as cl
db = cl.DBSCAN(eps, min_samples, metric, algorithm)
o Returns: A DBSCAN clustering model instance.
o Parameters:
eps (float, default=0.5) – Maximum distance between two samples for one to be considered in the neighborhood of the other.
min_samples (int, default=5) – Minimum points required to form a dense region.
metric (str, default='euclidean') – Distance metric to compute point similarity.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used for nearest neighbor search.
Example: db = cl.DBSCAN(eps=0.3, min_samples=10, metric='euclidean', algorithm='auto')
Other Common Methods in DBSCAN:
db.fit(X) – Trains the DBSCAN model.
labels = db.fit_predict(X) – Assigns cluster labels during training.
labels = db.labels_ – Retrieves assigned cluster labels.
core_samples = db.core_sample_indices_ – Gets indices of core points.
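A minimal sketch on ds.make_moons (not covered above), a shape where DBSCAN succeeds and KMeans struggles:
import sklearn.datasets as ds
import sklearn.cluster as cl

X, _ = ds.make_moons(n_samples=300, noise=0.05, random_state=42)  # two interleaving half-circles
db = cl.DBSCAN(eps=0.3, min_samples=10)
labels = db.fit_predict(X)
print(set(labels))                                    # cluster ids; -1 marks noise points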
Ch 2.4 Clustering GMM
sklearn.mixture (Gaussian Mixture Model Clustering)
To use GMM, first import: import sklearn.mixture as mix
gmm = mix.GaussianMixture(n_components, covariance_type, tol, max_iter, random_state)
o Returns: A Gaussian Mixture Model clustering instance.
o Parameters:
n_components (int, default=1) – Number of mixture components (clusters).
covariance_type ({'full', 'tied', 'diag', 'spherical'}, default='full') – Specifies the form of the covariance matrix.
tol (float, default=1e-3) – Convergence threshold.
max_iter (int, default=100) – Maximum iterations for Expectation-Maximization (EM).
random_state (int, default=None) – Controls random number generation.
Example:
gmm = mix.GaussianMixture(n_components=3, covariance_type='full', max_iter=200, random_state=42)
Other Common Methods in GMM:
gmm.fit(X) – Trains the Gaussian Mixture Model.
labels = gmm.predict(X) – Assigns cluster labels to data points.
probs = gmm.predict_proba(X) – Returns probabilities of each point belonging to a cluster.
means = gmm.means_ – Retrieves cluster means.
covariances = gmm.covariances_ – Retrieves covariance matrices for clusters.
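A minimal sketch showing the soft (probabilistic) assignments that distinguish GMM from KMeans:
import sklearn.datasets as ds
import sklearn.mixture as mix

X, _ = ds.make_blobs(n_samples=300, centers=3, random_state=42)
gmm = mix.GaussianMixture(n_components=3, covariance_type='full', random_state=42)
gmm.fit(X)
print(gmm.predict(X)[:5])                             # hard cluster labels for 5 points
print(gmm.predict_proba(X)[0].round(3))               # soft assignment for the first point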
Ch 3. Classifying
Ch 3.1 Classifying KNN
sklearn.neighbors (K-Nearest Neighbors Classification)
To use KNN, first import: import sklearn.neighbors as nb
knn = nb.KNeighborsClassifier(n_neighbors, weights, algorithm, metric)
o Returns: A KNN classification model instance.
o Parameters:
n_neighbors (int, default=5) – Number of nearest neighbors to consider.
weights ({'uniform', 'distance'} or callable, default='uniform') – Determines how neighbors are weighted.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used to compute nearest neighbors.
metric (str, default='minkowski') – Distance metric for neighbor calculation.
Example:
knn = nb.KNeighborsClassifier(n_neighbors=3, weights='distance', algorithm='auto', metric='euclidean')
Other Common Methods in KNN:
knn.fit(X_train, y_train) – Trains the KNN model.
y_pred = knn.predict(X_test) – Predicts class labels for new data.
probs = knn.predict_proba(X_test) – Returns class probabilities for each sample.
score = knn.score(X_test, y_test) – Returns the model accuracy.
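A minimal end-to-end sketch on iris:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.neighbors as nb

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
knn = nb.KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print(knn.score(X_test, y_test))                      # held-out accuracy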
Ch 3.2 Classifying Naive Bayes
sklearn.naive_bayes (Naive Bayes Classification)
To use Naive Bayes, first import: import sklearn.naive_bayes as nb
nbc = nb.GaussianNB()
o Returns: A Gaussian Naive Bayes classifier instance.
Example:
nbc = nb.GaussianNB()
nbc = nb.MultinomialNB(alpha, fit_prior)
o Returns: A multinomial Naive Bayes classifier for discrete features.
o Parameters:
alpha (float, default=1.0) – Additive smoothing parameter.
fit_prior (bool, default=True) – Whether to learn class priors from the data.
Example:
nbc = nb.MultinomialNB(alpha=0.5, fit_prior=True)
Other Common Methods in Naive Bayes:
nbc.fit(X_train, y_train) – Trains the Naive Bayes model.
y_pred = nbc.predict(X_test) – Predicts class labels.
probs = nbc.predict_proba(X_test) – Returns class probabilities.
score = nbc.score(X_test, y_test) – Computes model accuracy.
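A minimal sketch with GaussianNB (iris has continuous features, so the Gaussian variant fits):
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.naive_bayes as nb

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
nbc = nb.GaussianNB().fit(X_train, y_train)
print(nbc.predict_proba(X_test[:3]).round(3))         # per-class probabilities
print(nbc.score(X_test, y_test))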
Ch 3.3 Classifying Logistic Reg
sklearn.linear_model (Logistic Regression)
To use Logistic Regression, first import: import sklearn.linear_model as lm
lr = lm.LogisticRegression(penalty, C, solver, max_iter, random_state)
o Returns: A Logistic Regression model instance.
o Parameters:
penalty ({'l1', 'l2', 'elasticnet', 'none'}, default='l2') – Regularization type.
C (float, default=1.0) – Inverse of regularization strength.
solver ({'newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'}, default='lbfgs') – Algorithm for optimization.
max_iter (int, default=100) – Maximum number of iterations.
random_state (int, default=None) – Controls random number generation.
Example:
lr = lm.LogisticRegression(penalty='l2', C=1.0, solver='lbfgs', max_iter=200, random_state=42)
Other Common Methods in Logistic Regression:
lr.fit(X_train, y_train) – Trains the Logistic Regression model.
y_pred = lr.predict(X_test) – Predicts class labels.
probs = lr.predict_proba(X_test) – Returns class probabilities.
coefficients = lr.coef_ – Retrieves learned feature coefficients.
score = lr.score(X_test, y_test) – Computes model accuracy.
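A minimal sketch on the breast-cancer data; scaling first is a hedge so lbfgs converges within max_iter:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.preprocessing as pp
import sklearn.linear_model as lm

X, y = ds.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
scaler = pp.StandardScaler().fit(X_train)             # fit scaling on training data only
lr = lm.LogisticRegression(max_iter=200).fit(scaler.transform(X_train), y_train)
print(lr.coef_.shape)                                 # (1, 30): one weight per feature
print(lr.score(scaler.transform(X_test), y_test))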
Ch 3.4 Classifying SVM
sklearn.svm (Support Vector Machine Classification)
To use SVM, first import: import sklearn.svm as svm
svm_model = svm.SVC(C, kernel, gamma, degree, probability, random_state)
o Returns: An SVM classifier instance.
o Parameters:
C (float, default=1.0) – Inverse regularization strength (higher values penalize training misclassifications more heavily, at the risk of overfitting).
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf') – Kernel function used in decision boundary.
gamma ({'scale', 'auto'} or float, default='scale') – Kernel coefficient for non-linear models.
degree (int, default=3) – Degree of the polynomial kernel (used only if kernel='poly').
probability (bool, default=False) – If True, enables probability estimates.
random_state (int, default=None) – Controls random number generation.
Example:
svm_model = svm.SVC(C=1.0, kernel='rbf', gamma='scale', probability=True, random_state=42)
Other Common Methods in SVM:
svm_model.fit(X_train, y_train) – Trains the SVM model.
y_pred = svm_model.predict(X_test) – Predicts class labels.
probs = svm_model.predict_proba(X_test) – Returns class probabilities (only if probability=True).
score = svm_model.score(X_test, y_test) – Computes model accuracy.
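A minimal sketch; probability=True is required here because predict_proba is used:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.svm as svm

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
svm_model = svm.SVC(kernel='rbf', probability=True, random_state=42).fit(X_train, y_train)
print(svm_model.predict_proba(X_test[:3]).round(3))   # per-class probabilities
print(svm_model.score(X_test, y_test))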
Ch 3.5 Classifying Decision Tree
sklearn.tree (Decision Tree Classification)
To use Decision Trees, first import: import sklearn.tree as tr
dt = tr.DecisionTreeClassifier(criterion, max_depth, min_samples_split, random_state)
o Returns: A Decision Tree classifier instance.
o Parameters:
criterion ({'gini', 'entropy', 'log_loss'}, default='gini') – Function used to measure the quality of splits.
max_depth (int, default=None) – Maximum depth of the tree.
min_samples_split (int or float, default=2) – Minimum number of samples required to split an internal node.
random_state (int, default=None) – Controls random number generation for reproducibility.
Example:
dt = tr.DecisionTreeClassifier(criterion='entropy', max_depth=5, min_samples_split=4, random_state=42)
Other Common Methods in Decision Tree:
dt.fit(X_train, y_train) – Trains the Decision Tree model.
y_pred = dt.predict(X_test) – Predicts class labels.
probs = dt.predict_proba(X_test) – Returns class probabilities.
score = dt.score(X_test, y_test) – Computes model accuracy.
feature_importance = dt.feature_importances_ – Retrieves feature importance scores.
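A minimal sketch; max_depth=3 is an arbitrary cap to keep the tree small:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.tree as tr

X, y = ds.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
dt = tr.DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)
print(dt.feature_importances_.round(3))               # importance of the 4 iris features
print(dt.score(X_test, y_test))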
Ch 3.6 Classifying MLP
sklearn.neural_network (Multi-Layer Perceptron Classification)
To use MLP, first import: import sklearn.neural_network as nn
mlp = nn.MLPClassifier(hidden_layer_sizes, activation, solver, alpha, learning_rate, max_iter, random_state)
o Returns: An MLP classifier instance.
o Parameters:
hidden_layer_sizes (tuple, default=(100,)) – Number of neurons in hidden layers.
activation ({'identity', 'logistic', 'tanh', 'relu'}, default='relu') – Activation function for hidden layers.
solver ({'lbfgs', 'sgd', 'adam'}, default='adam') – Optimization algorithm for weight updates.
alpha (float, default=0.0001) – L2 regularization parameter.
learning_rate ({'constant', 'invscaling', 'adaptive'}, default='constant') – Learning rate schedule.
max_iter (int, default=200) – Maximum number of training iterations.
random_state (int, default=None) – Controls random number generation.
Example:
mlp = nn.MLPClassifier(hidden_layer_sizes=(50, 50), activation='relu', solver='adam', max_iter=300, random_state=42)
Other Common Methods in MLP:
mlp.fit(X_train, y_train) – Trains the MLP model.
y_pred = mlp.predict(X_test) – Predicts class labels.
probs = mlp.predict_proba(X_test) – Returns class probabilities.
score = mlp.score(X_test, y_test) – Computes model accuracy.
loss_curve = mlp.loss_curve_ – Retrieves loss values over training iterations.
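A minimal sketch on the digits data (training may emit a convergence warning if max_iter is too small):
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.neural_network as nn

X, y = ds.load_digits(return_X_y=True)                # 64 pixel features per image
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
mlp = nn.MLPClassifier(hidden_layer_sizes=(50,), max_iter=300, random_state=42)
mlp.fit(X_train, y_train)
print(mlp.score(X_test, y_test))
print(len(mlp.loss_curve_))                           # one loss value per training iteration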
Ch 4. Regression
Ch 4.1 Regression KNN
sklearn.neighbors (K-Nearest Neighbors Regression)
To use KNN for regression, first import: import sklearn.neighbors as nb
knn_reg = nb.KNeighborsRegressor(n_neighbors, weights, algorithm, metric)
o Returns: A KNN regression model instance.
o Parameters:
n_neighbors (int, default=5) – Number of nearest neighbors to consider.
weights ({'uniform', 'distance'} or callable, default='uniform') – Determines how neighbors are weighted.
algorithm ({'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto') – Algorithm used to compute nearest neighbors.
metric (str, default='minkowski') – Distance metric for neighbor calculation.
Example:
knn_reg = nb.KNeighborsRegressor(n_neighbors=3, weights='distance', algorithm='auto', metric='euclidean')
Other Common Methods in KNN Regression:
knn_reg.fit(X_train, y_train) – Trains the KNN regressor.
y_pred = knn_reg.predict(X_test) – Predicts target values.
score = knn_reg.score(X_test, y_test) – Computes the R² regression score.
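A minimal sketch on ds.load_diabetes (a regression dataset not covered above):
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.neighbors as nb

X, y = ds.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
knn_reg = nb.KNeighborsRegressor(n_neighbors=5, weights='distance').fit(X_train, y_train)
print(knn_reg.score(X_test, y_test))                  # R² on held-out data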
Ch 4.2 Regression LR
sklearn.linear_model (Linear Regression)
To use Linear Regression, first import: import sklearn.linear_model as lm
lr_reg = lm.LinearRegression(fit_intercept, normalize, copy_X, n_jobs)
o Returns: A Linear Regression model instance.
o Parameters:
fit_intercept (bool, default=True) – If False, forces model to pass through the origin.
normalize (bool, default=False) – Deprecated and removed in scikit-learn 1.2; scale features with StandardScaler instead.
copy_X (bool, default=True) – If False, allows modifying X during fitting.
n_jobs (int, default=None) – Number of parallel computations (if None, uses one core).
Example:
lr_reg = lm.LinearRegression(fit_intercept=True, copy_X=True, n_jobs=-1)
Other Common Methods in Linear Regression:
lr_reg.fit(X_train, y_train) – Trains the Linear Regression model.
y_pred = lr_reg.predict(X_test) – Predicts target values.
coefficients = lr_reg.coef_ – Retrieves learned feature coefficients.
intercept = lr_reg.intercept_ – Retrieves the model intercept.
score = lr_reg.score(X_test, y_test) – Computes the R² regression score.
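A minimal sketch on the diabetes data:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.linear_model as lm

X, y = ds.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
lr_reg = lm.LinearRegression().fit(X_train, y_train)
print(round(float(lr_reg.intercept_), 2))             # learned intercept
print(lr_reg.coef_.round(2))                          # one coefficient per feature
print(lr_reg.score(X_test, y_test))                   # R²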
Ch 4.3 Regression SVM
sklearn.svm (Support Vector Machine Regression)
To use SVM for regression, first import: import sklearn.svm as svm
svr = svm.SVR(kernel, C, epsilon, gamma, degree)
o Returns: An SVM regression model instance.
o Parameters:
kernel ({'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'}, default='rbf') – Kernel function used for regression.
C (float, default=1.0) – Regularization strength (higher values fit the training data more closely).
epsilon (float, default=0.1) – Defines margin within which no penalty is given.
gamma ({'scale', 'auto'} or float, default='scale') – Kernel coefficient for non-linear models.
degree (int, default=3) – Degree of the polynomial kernel (used only if kernel='poly').
Example:
svr = svm.SVR(kernel='rbf', C=1.0, epsilon=0.1, gamma='scale')
Other Common Methods in SVM Regression:
svr.fit(X_train, y_train) – Trains the SVR model.
y_pred = svr.predict(X_test) – Predicts target values.
score = svr.score(X_test, y_test) – Computes the R² regression score.
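A minimal sketch; C=10.0 is an arbitrary choice, and scaling is a hedge since SVR is sensitive to feature scale:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.preprocessing as pp
import sklearn.svm as svm

X, y = ds.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
scaler = pp.StandardScaler().fit(X_train)
svr = svm.SVR(kernel='rbf', C=10.0, epsilon=0.1).fit(scaler.transform(X_train), y_train)
print(svr.score(scaler.transform(X_test), y_test))    # R²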
Ch 4.4 Regression Decision Tree
sklearn.tree (Decision Tree Regression)
To use Decision Trees for regression, first import: import sklearn.tree as tr
dt_reg = tr.DecisionTreeRegressor(criterion, max_depth, min_samples_split, random_state)
o Returns: A Decision Tree regression model instance.
o Parameters:
criterion ({'squared_error', 'friedman_mse', 'absolute_error', 'poisson'}, default='squared_error') – Function used to measure the quality of a split.
max_depth (int, default=None) – Maximum depth of the tree.
min_samples_split (int or float, default=2) – Minimum number of samples required to split an internal node.
random_state (int, default=None) – Controls random number generation for reproducibility.
Example:
dt_reg = tr.DecisionTreeRegressor(criterion='squared_error', max_depth=5, min_samples_split=4, random_state=42)
Other Common Methods in Decision Tree Regression:
dt_reg.fit(X_train, y_train) – Trains the Decision Tree regressor.
y_pred = dt_reg.predict(X_test) – Predicts target values.
score = dt_reg.score(X_test, y_test) – Computes the R² regression score.
feature_importance = dt_reg.feature_importances_ – Retrieves feature importance scores.
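A minimal sketch; max_depth=4 is an arbitrary cap to limit overfitting:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.tree as tr

X, y = ds.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
dt_reg = tr.DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_train, y_train)
print(dt_reg.feature_importances_.round(3))
print(dt_reg.score(X_test, y_test))                   # R²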
Ch 4.5 Regression MLP
sklearn.neural_network (Multi-Layer Perceptron Regression)
To use MLP for regression, first import: import sklearn.neural_network as nn
mlp_reg = nn.MLPRegressor(hidden_layer_sizes, activation, solver, alpha, learning_rate, max_iter, random_state)
o Returns: An MLP regression model instance.
o Parameters:
hidden_layer_sizes (tuple, default=(100,)) – Number of neurons in hidden layers.
activation ({'identity', 'logistic', 'tanh', 'relu'}, default='relu') – Activation function for hidden layers.
solver ({'lbfgs', 'sgd', 'adam'}, default='adam') – Optimization algorithm for weight updates.
alpha (float, default=0.0001) – L2 regularization parameter.
learning_rate ({'constant', 'invscaling', 'adaptive'}, default='constant') – Learning rate schedule.
max_iter (int, default=200) – Maximum number of training iterations.
random_state (int, default=None) – Controls random number generation.
Example:
mlp_reg = nn.MLPRegressor(hidden_layer_sizes=(50, 50), activation='relu', solver='adam', max_iter=300, random_state=42)
Other Common Methods in MLP Regression:
mlp_reg.fit(X_train, y_train) – Trains the MLP model.
y_pred = mlp_reg.predict(X_test) – Predicts target values.
score = mlp_reg.score(X_test, y_test) – Computes the R² regression score.
loss_curve = mlp_reg.loss_curve_ – Retrieves loss values over training iterations.
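A minimal sketch; max_iter=2000 is a hedge since small networks on this data can be slow to converge:
import sklearn.datasets as ds
import sklearn.model_selection as ms
import sklearn.neural_network as nn

X, y = ds.load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, random_state=42)
mlp_reg = nn.MLPRegressor(hidden_layer_sizes=(50, 50), max_iter=2000, random_state=42)
mlp_reg.fit(X_train, y_train)
print(mlp_reg.score(X_test, y_test))                  # R²
print(mlp_reg.loss_curve_[-1])                        # final training loss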
Ch 5. Dimensionality Reduction
sklearn.decomposition (Principal Component Analysis - PCA)
To use PCA, first import: import sklearn.decomposition as dc
pca = dc.PCA(n_components, svd_solver, whiten, random_state)
o Returns: A PCA transformation instance.
o Parameters:
n_components (int, float, default=None) – Number of principal components to retain.
svd_solver ({'auto', 'full', 'arpack', 'randomized'}, default='auto') – Algorithm used for Singular Value Decomposition.
whiten (bool, default=False) – If True, normalizes transformed features.
random_state (int, default=None) – Controls random number generation.
Example:
pca = dc.PCA(n_components=2, svd_solver='auto', whiten=True, random_state=42)
sklearn.manifold (t-SNE for Non-Linear Dimensionality Reduction)
To use t-SNE, first import: import sklearn.manifold as mf
tsne = mf.TSNE(n_components, perplexity, learning_rate, n_iter, random_state)
o Returns: A t-SNE transformation instance.
o Parameters:
n_components (int, default=2) – Number of dimensions to reduce data to.
perplexity (float, default=30.0) – Controls balance between local and global aspects of data.
learning_rate (float, default=200.0) – Step size during optimization.
n_iter (int, default=1000) – Number of optimization iterations.
random_state (int, default=None) – Controls random number generation.
Example:
tsne = mf.TSNE(n_components=2, perplexity=30, learning_rate=200, n_iter=1000, random_state=42)
Other Common Methods in Dimensionality Reduction:
pca.fit(X) – Computes the principal components.
X_transformed = pca.transform(X) – Applies PCA transformation.
X_transformed = tsne.fit_transform(X) – Applies t-SNE transformation.
explained_variance = pca.explained_variance_ratio_ – Retrieves variance explained by each component.
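A minimal sketch applying both reducers to the 64-dimensional digits data:
import sklearn.datasets as ds
import sklearn.decomposition as dc
import sklearn.manifold as mf

X, y = ds.load_digits(return_X_y=True)
pca = dc.PCA(n_components=2).fit(X)
X_pca = pca.transform(X)
print(pca.explained_variance_ratio_)                  # variance captured by each component

tsne = mf.TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X)                        # t-SNE has no separate transform step
print(X_tsne.shape)                                   # (1797, 2)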
Evaluation Metrics in Scikit-Learn
Clustering Metrics
Used to assess the quality of clusters when no ground truth labels are available.
silhouette_score(X, labels) – Measures how well-separated clusters are. Ranges from -1 (poor clustering) to 1 (well-clustered).
davies_bouldin_score(X, labels) – Lower values indicate better-defined clusters.
adjusted_rand_score(labels_true, labels_pred) – Compares predicted labels to ground truth (if available). Values near 0 indicate random labeling and 1 is a perfect match (negative values are worse than chance).
from sklearn.metrics import silhouette_score, davies_bouldin_score, adjusted_rand_score
sil_score = silhouette_score(X, labels)
db_score = davies_bouldin_score(X, labels)
ari_score = adjusted_rand_score(y_true, labels)
Classification Metrics
Used to evaluate model performance when the ground truth labels are available.
accuracy_score(y_true, y_pred) – Measures overall correctness (ratio of correct predictions).
precision_score(y_true, y_pred, average='macro') – Measures how many predicted positives are actually correct.
recall_score(y_true, y_pred, average='macro') – Measures how many actual positives were correctly identified.
f1_score(y_true, y_pred, average='macro') – Harmonic mean of precision and recall, balancing both.
confusion_matrix(y_true, y_pred) – Displays true positive, false positive, false negative, and true negative values.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
acc = accuracy_score(y_test, y_pred)
prec = precision_score(y_test, y_pred, average='macro')
rec = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
cm = confusion_matrix(y_test, y_pred)
Regression Metrics
Used to measure the error between predicted and actual continuous values.
r2_score(y_true, y_pred) – Measures how well the model explains variance (ranges from -∞ to 1).
mean_squared_error(y_true, y_pred) – Penalizes large errors by squaring them (lower is better).
mean_absolute_error(y_true, y_pred) – Measures average absolute differences (less sensitive to large errors).
median_absolute_error(y_true, y_pred) – Measures median of absolute errors, robust to outliers.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error, median_absolute_error
r2 = r2_score(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
med_ae = median_absolute_error(y_test, y_pred)
Dimensionality Reduction Metrics
Used to assess how much information is retained after reducing dimensions.
explained_variance_ratio_ (PCA) – Measures the proportion of variance retained per component.
Reconstruction error (PCA) – Measures information loss from compression. PCA exposes no such attribute; compute it via inverse_transform (see the sketch at the end of this section).
kl_divergence_ (t-SNE) – Measures the difference between original and reduced distributions (lower is better).
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(X)
explained_variance = pca.explained_variance_ratio_
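Because PCA exposes no reconstruction-error attribute, a common sketch is to compute it by hand via inverse_transform; the mean-squared formulation below is one reasonable choice, not a library API:
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA(n_components=2).fit(X)
X_reconstructed = pca.inverse_transform(pca.transform(X))    # project back to 64 dimensions
reconstruction_error = np.mean((X - X_reconstructed) ** 2)   # information lost by compression
print(reconstruction_error)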