dataset = pd.get_dummies(dataset, drop_first=True)
dataset.head()
- This code performs one-hot encoding on the categorical variables in the pandas DataFrame
'dataset', converting each category into a binary dummy variable; drop_first=True drops the first
level of each categorical column so the resulting dummies are not perfectly collinear.
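As a quick illustration (the 'Sex' column and its 'M'/'F'/'I' codes are assumptions about this abalone dataset, shown on toy data rather than the real frame), drop_first=True keeps only two of the three sex indicators because the third is implied:
import pandas as pd
demo = pd.DataFrame({'Sex': ['M', 'F', 'I', 'M']})   # toy data, not the actual dataset
pd.get_dummies(demo, drop_first=True)                # produces Sex_I and Sex_M; Sex_F is dropped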
X = dataset.drop(columns = 'Rings (+1.5=Years)')
X.head()
- This code creates a new DataFrame X by dropping the column 'Rings (+1.5=Years)' from the
original dataset.
import seaborn as sns
sns.heatmap(X.corr(),
annot = True,
fmt = '.1g',
center = 0,
cmap = 'coolwarm',
linewidths = 1,
linecolor='black')
- This code creates a heatmap using the Seaborn library to visualize the pairwise correlations
between the columns of the DataFrame 'X'; the annotations and diverging colormap make strongly
correlated feature pairs easy to spot.
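To go beyond eyeballing the heatmap, a minimal sketch like the following (the 0.9 cutoff is an arbitrary assumption) lists the most strongly correlated feature pairs, which is where multicollinearity concerns would come from:
import numpy as np
corr = X.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))   # keep each pair only once
pairs = upper.stack().sort_values(ascending=False)                  # strongest pairs first
print(pairs[pairs > 0.9])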
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)  # rescale the data to the [0, 1] range
X_scaled = pd.DataFrame(X_scaled, columns = X.columns)
X_scaled.head()
- This code imports the MinMaxScaler class from the scikit-learn library, fits it to X, and rescales
every feature to the [0, 1] range; the result is wrapped back into a DataFrame with the original
column names.
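For reference, MinMaxScaler applies (x - min) / (max - min) to each column; a quick sanity check (checking only the first column, as an example) that X_scaled matches that formula:
col = X.columns[0]                                       # first feature column
manual = (X[col] - X[col].min()) / (X[col].max() - X[col].min())
print((manual - X_scaled[col]).abs().max())              # should be ~0 up to floating-point error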
from sklearn.decomposition import PCA   # imports needed for this step
import matplotlib.pyplot as plt

model = PCA(random_state=1503).fit(X_scaled)
plt.plot(model.explained_variance_ratio_, linewidth = 4)
plt.xlabel('Component')
plt.ylabel('Explained Variance')
plt.show()
- This code fits PCA (Principal Component Analysis) on the scaled data 'X_scaled' using
scikit-learn's 'PCA' class with a fixed random state, then plots the explained variance ratio of
each component (a scree plot) to help decide how many components to keep.
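A common companion to this scree plot, sketched here as an optional check (the 90% threshold is an arbitrary choice), is the cumulative explained variance, which says how many components are needed to retain a given share of the variance:
import numpy as np
cumulative = np.cumsum(model.explained_variance_ratio_)   # running total of explained variance
print(cumulative)
print(int(np.argmax(cumulative >= 0.90)) + 1, 'components reach 90% of the variance')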
model = PCA(n_components=3,
random_state=1108).fit(X_scaled)
- This code refits principal component analysis (PCA) on the scaled data 'X_scaled', this time
creating a PCA object that keeps only 3 principal components (n_components=3) using scikit-learn's
'PCA' class.
model_interpretation = pd.DataFrame(model.components_,
columns = X.columns)
model_interpretation
- This code creates a Pandas DataFrame named 'model_interpretation' that holds the PCA loadings:
each row corresponds to one principal component and each column shows how strongly the
corresponding original feature contributes to it, which is what allows the components to be
interpreted.
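To make the loadings easier to read, a small optional sketch (keeping the top 3 features per component is an arbitrary choice) that lists the features with the largest absolute weights in each component:
for i, row in model_interpretation.iterrows():
    top = row.abs().sort_values(ascending=False).head(3)   # 3 strongest loadings for this component
    print(f'Component {i}:', list(top.index))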
components = model.transform(X_scaled)
components = pd.DataFrame(components,
columns =['small size and weight', 'sex','big size and weight'])
components.head()
- This code uses the trained PCA model to transform the scaled dataset 'X_scaled' into a new
DataFrame called 'components', which contains the three principal components under the
interpretable names chosen from the loadings above.
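A useful property to verify, as a quick optional check not in the original code, is that the principal components are (near-)uncorrelated with each other, unlike the original features:
print(components.corr().round(3))   # off-diagonal entries should be approximately 0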
final_dataset = pd.concat([components,dataset], axis = 1)
final_dataset.head()
- This code creates a new DataFrame named 'final_dataset' by concatenating the DataFrame
'components' with the original dataset 'dataset' column-wise (axis = 1).
from sklearn.manifold import TSNE
model = TSNE(n_components = 2,
random_state = 1108)
components = model.fit_transform(X)
components
- This code uses the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the
dimensionality of the dataset X to 2 dimensions.
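t-SNE is stochastic and sensitive to its perplexity parameter and to feature scaling; a hedged variant to experiment with (perplexity=50 is just an alternative to the default of 30, and running it on X_scaled rather than X is an assumption, not what the original code does):
model_alt = TSNE(n_components = 2, perplexity = 50, random_state = 1108)
components_alt = model_alt.fit_transform(X_scaled)   # 2-D embedding of the scaled features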
plt.scatter(components[:,0],
components[:,1],
cmap='hsv',
c = dataset["Rings (+1.5=Years)"])
plt.title("t-SNE scatter plot")
plt.show()
- This code plots the two t-SNE components as a scatter plot, coloring each point by its
'Rings (+1.5=Years)' value, so any age-related grouping in the two-dimensional projection becomes
visible.
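An optional tweak, not in the original plot, is to keep a handle on the scatter and add a colorbar so the ring counts behind the colors are readable:
sc = plt.scatter(components[:,0], components[:,1], cmap='hsv', c = dataset["Rings (+1.5=Years)"])
plt.colorbar(sc, label='Rings (+1.5=Years)')   # legend for the ring counts
plt.title("t-SNE scatter plot")
plt.show()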
CONCLUSION: From the dimension reduction challenge above, we can infer that the original dataset had
a large number of features that were potentially correlated with each other, leading to the possibility of
multicollinearity issues. By using dimension reduction techniques such as Principal Component Analysis
(PCA) or t-SNE, we were able to reduce the number of features in the dataset while still preserving most
of the variance in the data. This can help simplify the modeling process and potentially improve the
model's performance. In addition, by visualizing the data in a scatter plot using t-SNE, we can observe
the grouping or separation of data points in a lower-dimensional space, which can provide insights into
the underlying patterns or relationships in the data.