Data Set

The document outlines a data preprocessing and dimensionality reduction workflow using Python's pandas, Seaborn, and scikit-learn libraries. It includes one-hot encoding, PCA, and t-SNE techniques to analyze and visualize a dataset, ultimately concluding that these methods help simplify modeling and reveal underlying patterns. The final dataset combines principal components with the original data for further analysis.


import pandas as pd

dataset = pd.get_dummies(dataset, drop_first=True)
dataset.head()

- This code performs one-hot encoding on the categorical variables in the pandas DataFrame
'dataset', converting each one into binary dummy variables; drop_first=True drops one level per
variable to avoid redundant (perfectly collinear) columns.
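A minimal sketch of what get_dummies does, using a hypothetical two-column frame (the real abalone data has a categorical 'Sex' column; the toy values below are illustrative only):

```python
import pandas as pd

# Hypothetical toy frame; 'Sex' is categorical, 'Length' is numeric.
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"],
                    "Length": [0.45, 0.53, 0.33, 0.44]})

# drop_first=True drops one dummy level per categorical column
# (here 'Sex_F'), avoiding the dummy-variable trap.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())  # ['Length', 'Sex_I', 'Sex_M']
```

Numeric columns pass through unchanged; only object/categorical columns are expanded.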

X = dataset.drop(columns='Rings (+1.5=Years)')
X.head()


- This code creates a new DataFrame X by dropping the column 'Rings (+1.5=Years)' from the
original dataset.

import seaborn as sns

sns.heatmap(X.corr(),
            annot=True,
            fmt='.1g',
            center=0,
            cmap='coolwarm',
            linewidths=1,
            linecolor='black')

- This code creates a heatmap using the Seaborn library to visualize the pairwise correlations
among all the columns of the DataFrame ‘X’.
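The values the heatmap displays come from DataFrame.corr(), the matrix of pairwise Pearson correlations. A tiny illustration on a hypothetical frame (not the abalone data):

```python
import pandas as pd

# Toy frame: 'b' is a perfect positive multiple of 'a',
# 'c' decreases exactly as 'a' increases.
toy = pd.DataFrame({"a": [1, 2, 3, 4],
                    "b": [2, 4, 6, 8],
                    "c": [4, 3, 2, 1]})

corr = toy.corr()
print(corr.loc["a", "b"], corr.loc["a", "c"])  # 1.0 -1.0
```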

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

X_scaled = scaler.fit_transform(X)  # rescale the data to the [0, 1] range

X_scaled = pd.DataFrame(X_scaled, columns = X.columns)

X_scaled.head()

- This code imports MinMaxScaler from the scikit-learn library, fits it to ‘X’, rescales every
column into the [0, 1] range, and wraps the result back into a DataFrame with the original
column names.
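Min-max scaling maps each column x to (x - min) / (max - min). A short sketch (toy values, not the abalone data) confirming that MinMaxScaler matches the formula:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One toy column: min = 2, max = 10, so the range is 8.
x = np.array([[2.0], [4.0], [6.0], [10.0]])

scaled = MinMaxScaler().fit_transform(x)

# Manual computation of (x - min) / (max - min).
manual = (x - x.min()) / (x.max() - x.min())
print(scaled.ravel())  # [0.   0.25 0.5  1.  ]
```

Because the transform is per-column, each feature ends up on the same [0, 1] footing before PCA.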

from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

model = PCA(random_state=1503).fit(X_scaled)

plt.plot(model.explained_variance_ratio_, linewidth=4)
plt.xlabel('Component')
plt.ylabel('Explained Variance')
plt.show()

- This code performs PCA (Principal Component Analysis) on the scaled data ‘X_scaled’ using
scikit-learn's ‘PCA’ class with a specified random state, then plots the explained variance
ratio of each component (a scree plot) to help decide how many components to keep.
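The elbow in the scree plot is often read together with the cumulative explained variance. A self-contained sketch (X_demo is a synthetic stand-in for the scaled abalone features) showing how to pick the smallest number of components covering 95% of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with an intrinsic dimension of ~2:
# 2 latent factors, 6 noisy linear combinations of them.
rng = np.random.default_rng(1503)
latent = rng.normal(size=(200, 2))
signal = np.hstack([latent, latent @ rng.normal(size=(2, 6))])
X_demo = signal + 0.05 * rng.normal(size=(200, 8))

model = PCA(random_state=1503).fit(X_demo)
cumulative = np.cumsum(model.explained_variance_ratio_)

# Smallest number of components explaining at least 95% of the variance.
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print(n_keep)
```

The ratios always sum to 1 over all components, so the cumulative curve rises monotonically to 1.0.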

model = PCA(n_components=3, random_state=1108).fit(X_scaled)

- This code performs principal component analysis (PCA) on the scaled data 'X_scaled' and creates
a PCA object with 3 principal components using the 'PCA' function from the scikit-learn library.

model_interpretation = pd.DataFrame(model.components_, columns=X.columns)
model_interpretation

- This code creates a pandas DataFrame named ‘model_interpretation’ holding the PCA loadings:
each row is a principal component, and each column shows how strongly that original feature
contributes to it.

components = model.transform(X_scaled)
components = pd.DataFrame(components,
                          columns=['small size and weight', 'sex', 'big size and weight'])
components.head()

- This code uses the trained PCA model to transform the scaled dataset ‘X_scaled’ into a new
dataset called 'components', which contains the principal components of X_scaled.
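Under the hood, transform() centres the data on the fitted mean and projects it onto the principal axes (the rows of components_). A small sketch verifying this equivalence on synthetic data (X_demo is illustrative, not the real X_scaled):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for X_scaled; the pipeline above keeps 3 components.
rng = np.random.default_rng(1108)
X_demo = rng.random((100, 5))

model = PCA(n_components=3, random_state=1108).fit(X_demo)
projected = model.transform(X_demo)

# transform() == centre on the fitted mean, then project onto the axes.
manual = (X_demo - model.mean_) @ model.components_.T
print(np.allclose(projected, manual))  # True
```

This equivalence holds with the default whiten=False; whitening would additionally rescale each component.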

final_dataset = pd.concat([components,dataset], axis = 1)

final_dataset.head()

- This code creates a new dataframe named ‘final_dataset’ by concatenating the dataframe
‘components’ with the original dataset ‘dataset’ column-wise (axis = 1).

from sklearn.manifold import TSNE

model = TSNE(n_components=2, random_state=1108)
components = model.fit_transform(X)
components

- This code uses the t-distributed stochastic neighbor embedding (TSNE) algorithm to reduce the
dimensionality of the dataset X to 2 dimensions.
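Note that the notes apply t-SNE to the unscaled X; scaling first (as was done for PCA) is often recommended so no single feature dominates the distances. A minimal, self-contained sketch on synthetic clusters (X_demo and the cluster centres are illustrative, not the abalone data):

```python
import numpy as np
from sklearn.manifold import TSNE

# Two well-separated synthetic clusters in 6 dimensions.
rng = np.random.default_rng(1108)
cluster_a = rng.normal(loc=0.0, size=(50, 6))
cluster_b = rng.normal(loc=8.0, size=(50, 6))
X_demo = np.vstack([cluster_a, cluster_b])

# perplexity (default 30) must stay below the number of samples.
embedding = TSNE(n_components=2, random_state=1108,
                 perplexity=30).fit_transform(X_demo)
print(embedding.shape)  # (100, 2)
```

Unlike PCA, t-SNE has no transform() for new data; fit_transform embeds only the samples it was given.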

plt.scatter(components[:, 0],
            components[:, 1],
            cmap='hsv',
            c=dataset["Rings (+1.5=Years)"])
plt.title("t-SNE scatter plot")
plt.show()

- This code draws a scatter plot of the two t-SNE components, colouring each point by its
'Rings (+1.5=Years)' value so that any age-related structure in the embedding becomes visible.

CONCLUSION: From the dimension reduction challenge above, we can infer that the original dataset had
a large number of features that were potentially correlated with each other, leading to the possibility of
multicollinearity issues. By using dimension reduction techniques such as Principal Component Analysis
(PCA) or t-SNE, we were able to reduce the number of features in the dataset while still preserving most
of the variance in the data. This can help simplify the modeling process and potentially improve the
model's performance. In addition, by visualizing the data in a scatter plot using t-SNE, we can observe
the grouping or separation of data points in a lower-dimensional space, which can provide insights into
the underlying patterns or relationships in the data.
