Write a snippet to download data from different repositories
External source
import pandas as pd
url = "https://forstoringfiles.000webhostapp.com/vault/uploads/Iris.csv"
housing = pd.read_csv(url)
housing.head()
github
import pandas as pd
# Load the dataset
url = "https://raw.githubusercontent.com/ageron/handson-
ml/master/datasets/housing/housing.csv"
housing = pd.read_csv(url)
housing.head()
Write a snippet to load and read the data
import pandas as pd
url = "https://raw.githubusercontent.com/ageron/handson-
ml/master/datasets/housing/housing.csv"
housing = pd.read_csv(url)
housing.head()
Code for custom transformers
Although Scikit-Learn provides many useful transformers, you will need to write your
own for tasks such as custom cleanup operations or combining specific attributes. You
will want your transformer to work seamlessly with Scikit-Learn functionalities (such
as pipelines), and since Scikit-Learn relies on duck typing (not inheritance), all you
need is to create a class and implement three methods: fit() (returning self), transform(),
and fit_transform(). You can get the last one for free by simply adding TransformerMixin
as a base class. Also, if you add BaseEstimator as a base class (and avoid *args and
**kwargs in your constructor) you will get two extra methods (get_params() and
set_params()) that will be useful for automatic hyperparameter tuning. For example,
here is a small transformer class that adds the combined attributes we discussed
earlier:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin

rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6

class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    def __init__(self, add_bedrooms_per_room=True):  # no *args or **kwargs
        self.add_bedrooms_per_room = add_bedrooms_per_room
    def fit(self, X, y=None):
        return self  # nothing else to do
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household,
                         bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]

attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
housing_extra_attribs = attr_adder.transform(housing.values)
In this example the transformer has one hyperparameter, add_bedrooms_per_room, set
to True by default (it is often helpful to provide sensible defaults). This hyperparameter
will allow you to easily find out whether adding this attribute helps the Machine Learning
algorithms or not. More generally, you can add a hyperparameter to gate any data
preparation step that you are not 100% sure about. The more you automate these data
preparation steps, the more combinations you can automatically try out, making it
much more likely that you will find a great combination (and saving you a lot of time).
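As a quick illustration of how the flag gates the extra attribute, here is a minimal sketch, assuming the CombinedAttributesAdder class above and a small synthetic array standing in for housing.values (so that column indices 3 to 6 exist):

import numpy as np

# Hypothetical stand-in for housing.values: 5 rows, 9 columns
X = np.random.rand(5, 9) + 1.0  # +1 avoids division by zero in the ratios

with_ratio = CombinedAttributesAdder(add_bedrooms_per_room=True).transform(X)
without_ratio = CombinedAttributesAdder(add_bedrooms_per_room=False).transform(X)

print(with_ratio.shape)     # (5, 12): two ratio columns plus bedrooms_per_room
print(without_ratio.shape)  # (5, 11): only the two always-on ratio columns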
Code for transformer pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

housing_num_tr = num_pipeline.fit_transform(housing_num)
The Pipeline constructor in Scikit-Learn takes a list of name/estimator pairs to define a
sequence of steps, where all but the last estimator must be transformers with a
fit_transform() method. When the pipeline's fit() method is called, it sequentially applies
fit_transform() on all transformers and then calls fit() on the final estimator. The pipeline
inherits the methods of the final estimator. To handle both categorical and numerical
columns within a single transformer, Scikit-Learn's ColumnTransformer can be used.
Introduced in version 0.20, ColumnTransformer works well with Pandas DataFrames to
apply appropriate transformations to each column of the dataset.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
Here's how to use the ColumnTransformer: first, import the ColumnTransformer class.
Then, get lists of the numerical and categorical column names. Construct a
ColumnTransformer with a list of tuples, where each tuple contains a name, a
transformer, and a list of column names (or indices) that the transformer applies to. In
this example, numerical columns use the pre-defined num_pipeline, and categorical
columns use a OneHotEncoder. Finally, apply the ColumnTransformer to the housing
data; this applies each transformer to the appropriate columns and concatenates the
outputs along the second axis (the transformers must therefore return the same number
of rows).
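To verify what the combined pipeline produced, you can inspect the output shape and the fitted sub-transformers. A minimal sketch, assuming the full_pipeline fitted above (named_transformers_ is ColumnTransformer's standard attribute for accessing fitted transformers by name):

# One row per input row; the column count grows by the added ratio
# attributes plus one one-hot column per ocean_proximity category
print(housing_prepared.shape)

# The fitted OneHotEncoder records the categories it found
cat_encoder = full_pipeline.named_transformers_["cat"]
print(cat_encoder.categories_)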
Train and test data code
import pandas as pd
from sklearn.model_selection import train_test_split
url = "https://raw.githubusercontent.com/ageron/handson-
ml/master/datasets/housing/housing.csv"
housing = pd.read_csv(url)
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
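The purely random split above works well for large datasets, but on smaller ones it risks sampling bias. A common refinement, sketched below assuming the housing DataFrame loaded above, is to stratify the split on an income category so both sets stay representative of the income distribution:

import numpy as np

# Bucket median_income into 5 income categories
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])

# Stratified split: each category keeps the same proportion in both sets
strat_train_set, strat_test_set = train_test_split(
    housing, test_size=0.2, stratify=housing["income_cat"], random_state=42)

# Drop the helper column once the split is done
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)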
Explain performance measures: explain RMSE and MSE
When building and evaluating machine learning models, especially regression models,
performance measurement is critical. It helps us understand how well the model
predicts continuous target variables (e.g., housing prices, sales figures).
Here are two popular performance measures:
Mean Square Error (MSE):
Mean Square Error (MSE) is a common measure used to evaluate the accuracy of a model. It
measures the average of the squares of the errors, which are the differences between the
observed and predicted values. The formula for MSE is:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Where:
n is the number of observations.
y_i is the actual value of the i-th observation.
\hat{y}_i is the predicted value of the i-th observation.
∑ denotes the summation over all observations from i=1 to n.

Root Mean Square Error (RMSE):
RMSE is simply the square root of the MSE: RMSE = \sqrt{MSE}. Taking the square root
expresses the error in the same units as the target variable (e.g., dollars for housing
prices), which makes it easier to interpret. Like MSE, it penalizes large errors more
heavily than small ones, and it is the typical performance measure for regression tasks.
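A short sketch of computing both measures with Scikit-Learn, using hypothetical actual and predicted values (mean_squared_error is the standard metric function):

import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # hypothetical actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # hypothetical predictions

mse = mean_squared_error(y_true, y_pred)  # average of squared errors
rmse = np.sqrt(mse)                       # back in the target's units
print(mse, rmse)  # 0.375 0.612...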
Code snippet for building a model for linear regression, decision tree, and random forest
Linear Regression
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)
Decision tree
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)
Random forest
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)
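Tying these models back to the performance measures above, here is a sketch of measuring each fitted model's RMSE on the training set (assuming the housing_prepared and housing_labels objects from the pipeline section):

import numpy as np
from sklearn.metrics import mean_squared_error

for name, model in [("linear", lin_reg), ("tree", tree_reg), ("forest", forest_reg)]:
    predictions = model.predict(housing_prepared)
    rmse = np.sqrt(mean_squared_error(housing_labels, predictions))
    # Note: this is training error; use cross-validation for a fair estimate
    print(name, rmse)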
Explain fine-tuning of the model and give a code snippet to find parameters (grid search)
Fine-tuning a Model
Fine-tuning involves adjusting a model's hyperparameters to optimize its performance for a
specific task. These hyperparameters are settings that control the model's behavior but aren't
directly learned from the data. Examples include the number of trees in a Random Forest or
the learning rate in a neural network. By tweaking these hyperparameters, you can
significantly improve a model's ability to fit your data and make accurate predictions.
Finding Best Parameters with Grid Search
Manually trying out different hyperparameter combinations can be tedious and time-
consuming. Grid search automates this process by systematically evaluating a predefined set
of hyperparameter values. Here's how it works:
1. Define the Hyperparameter Grid: You specify a dictionary (param_grid) where
each key represents a hyperparameter and the corresponding value is a list of values to
try. In the example, the grid explores different combinations of n_estimators
(number of trees) and max_features (number of features considered) for the Random
Forest model.
2. Create the Model: You define the machine learning model you want to fine-tune
(e.g., RandomForestRegressor).
3. Perform Grid Search: Scikit-Learn's GridSearchCV class is used to perform the grid
search. You provide the model, the hyperparameter grid (param_grid), the number of
folds for cross-validation (cv), a scoring metric (scoring), and optionally, a flag to
return training scores (return_train_score).
4. Fit the Grid Search: Call the fit method of grid_search on your training data
(housing_prepared) and target labels (housing_labels). This trains the model with
all defined hyperparameter combinations using cross-validation and evaluates their
performance based on the scoring metric.
5. Access Results: After fitting, you can access the best hyperparameter combination
using grid_search.best_params_. This provides a dictionary containing the
hyperparameter names and their corresponding best values identified by the grid
search.
6. Retrieve Best Model: The grid_search.best_estimator_ attribute stores the
model instance trained with the best hyperparameters found during the search.
Code Snippet (Python):
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestRegressor

# Define the hyperparameter grid (refer to the explanation above)
param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

# Create a RandomForestRegressor model (refer to the explanation above)
forest_reg = RandomForestRegressor()

# Create a GridSearchCV object (refer to the explanation above)
grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                           scoring='neg_mean_squared_error',
                           return_train_score=True)

# Fit the grid search to the training data
grid_search.fit(housing_prepared, housing_labels)

# Access the best hyperparameters (refer to the explanation above)
print(grid_search.best_params_)

# Access the best model (refer to the explanation above)
print(grid_search.best_estimator_)
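Beyond the best combination, GridSearchCV also records every combination it tried. A sketch of listing each one's cross-validated RMSE (cv_results_ is the standard results attribute; the scores are negated MSEs because of the neg_mean_squared_error scoring):

import numpy as np

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    # mean_test_score is a negative MSE, so negate it and take the square root
    print(np.sqrt(-mean_score), params)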
Data visualization and gaining insights code snippet
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from pandas.plotting import scatter_matrix
# Load the dataset
housing = pd.read_csv('housing.csv')
# Visualizing Geographical Data
# Note: pandas' plot() creates its own figure, so pass figsize directly
# instead of calling plt.figure() first (which would leave a blank figure)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
             s=housing["population"]/100, label="population", figsize=(10, 7),
             c="median_house_value", cmap="jet", colorbar=True)
plt.legend()
plt.show()
# Calculate Correlation Matrix (numeric columns only; ocean_proximity is text)
corr_matrix = housing.select_dtypes(include="number").corr()
# Display Correlation Matrix as a Heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()
# Visualize Correlations with Scatter Matrix
attributes = ["median_house_value", "median_income", "total_rooms",
"housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))
plt.show()
# Zoom in on the most promising correlation: median_income vs median_house_value
housing.plot(kind="scatter", x="median_income", y="median_house_value",
             alpha=0.1, figsize=(10, 7))
plt.title('Correlation between Median Income and Median House Value')
plt.show()