
Machine Learning Problem-Solving

Steps
In order to solve machine learning problems, we follow a structured approach that
helps ensure accuracy, clarity, and effectiveness. Here are the main steps involved:

1. Look at the Big Picture

Understand the overall problem you’re solving. Define your objective clearly —
what does success look like?

2. Get the Data

Collect relevant and quality data from reliable sources. Without data, there’s no
machine learning.

3. Explore and Visualize the Data

Analyze and visualize data to uncover patterns, trends, and anomalies. This step
helps you understand what you’re working with.

4. Prepare the Data

Clean, transform, and format the data. Handle missing values, normalize features,
and split the data into training and testing sets.

5. Select a Model and Train It

Choose a suitable machine learning algorithm and train it using your data. This is
where your model learns from patterns.
6. Fine-Tune Your Model

Optimize hyperparameters, try different techniques, and improve performance through iteration.

7. Present Your Solution

Explain your model’s results using visuals, metrics, and clear language so
stakeholders can understand and make decisions.

8. Launch, Monitor, and Maintain

Deploy the model in the real world, monitor its performance, and update it
regularly as new data arrives.
Datasets for Machine Learning
Machine Learning requires quality datasets for training and testing models. Some
popular sources include:

• OpenML.org – A collaborative platform offering a wide range of datasets with metadata and tools for benchmarking.
• UCI Machine Learning Repository – One of the oldest and most widely used
sources for machine learning datasets.
• Kaggle – A data science community offering large-scale, real-world datasets
and competitions.
Quick Training with Scikit-learn (CSV
Data)
This guide walks you through training a model on your own CSV dataset using
scikit-learn.

1. Import Libraries

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

2. Load Your CSV File

data = pd.read_csv('data.csv') # Replace with your actual CSV file path

3. Separate Features and Label

X = data.iloc[:, :-1] # All columns except the last as features


y = data.iloc[:, -1] # Last column as label

4. Train the Model

model = RandomForestClassifier()
model.fit(X, y)

The model is now trained on your complete dataset.


5. Inference

predictions = model.predict(X) # Predict on the same/new data (for demonstration)


print(predictions) # Display predictions
Evaluating Performance of ML Models
When we build a machine learning model, especially a classification model (which
predicts categories like “spam” or “not spam”, “dog” or “cat”), it’s important to
measure how well the model is performing.

One of the most basic ways to evaluate a classification model is accuracy.

What is Accuracy?

In simple terms, accuracy tells us how often the model was right.

If you gave your model 100 questions to answer, and it got 90 of them correct, its
accuracy would be:

Accuracy = Correct Predictions / Total Predictions = 90 / 100 = 90%

So, accuracy is just the fraction of predictions the model got right.

Evaluating Accuracy
Accuracy = (Correct Predictions) / (Total Predictions)

This is the simplest and most intuitive way to check if the model is doing a good
job.
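Here is a minimal sketch of measuring accuracy with scikit-learn. The dataset is synthetic and the RandomForestClassifier is just a placeholder for whatever model you are evaluating; the key part is holding out a test set and comparing predictions to true labels with accuracy_score:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data; replace with your own features and labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))  # correct predictions / total predictions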
Gurgaon House Price Prediction Model

Why Are We Building This?

Gurgaon, a rapidly growing city in India, has seen a sharp rise in real estate
development over the past decade. With its proximity to Delhi, booming IT hubs,
and modern infrastructure, Gurgaon has become a major attraction for both
homebuyers and investors. However, the real estate market here is highly dynamic
and often difficult to assess without proper data-driven tools.

We’re building a Gurgaon house price prediction model to help:

• Understand how factors like location, size, number of rooms, and amenities
affect property prices in Gurgaon.
• Assist buyers in identifying fair prices based on historical trends.
• Help sellers estimate an appropriate asking price.
• Empower real estate agents and platforms to improve recommendations and
negotiations.

How Will We Build It?

While we don’t have access to a large, clean dataset of house prices in Gurgaon
right now, we will use a well-known and cleaned dataset—the California housing
dataset—as a proxy. This will allow us to build, test, and evaluate a working model
with real-world variables like:

• Median income of the area


• Proximity to the city center
• Number of rooms
• Latitude and longitude
• Population density
We’ll treat this as a simulation: suppose the California data is Gurgaon data, and
suppose we are building this model for a neighborhood where both you and I live
or work nearby.

Once the model is developed and understood, we can later adapt the same
approach to real Gurgaon data when available, using the same techniques and
logic.
Revisiting Steps to solve this problem

Understanding the Problem

Before we build the model, we need to understand what kind of machine learning
problem we are solving.

First, we’ll check if this is a supervised or unsupervised learning task. Since we have historical data with house prices (our target), and we want to predict the price of a house based on input features, this is a supervised learning problem.

Next, we observe that the model is predicting one continuous label — the price of
a house — based on several input features. This makes it a univariate regression
problem.
Measuring Errors (RMSE & MAE)
After training our regression model, we need to evaluate how good its predictions
are. Two common metrics used for this are MAE and RMSE.

1. Mean Absolute Error (MAE)


MAE stands for Mean Absolute Error. It calculates the average of the absolute
differences between the predicted and actual values.

Formula: MAE = (1/n) × sum of |actual - predicted|

• It treats all errors equally, no matter their size.


• MAE is based on the Manhattan norm (also called L1 norm), which measures
distance by summing absolute values.

2. Root Mean Squared Error (RMSE)


RMSE stands for Root Mean Squared Error. It calculates the square root of the
average of squared differences between predicted and actual values.

Formula: RMSE = sqrt((1/n) × sum of (actual - predicted)²)

• RMSE gives more weight to larger errors because it squares them.


• RMSE is based on the Euclidean norm (also called L2 norm), which measures
straight-line distance.

Summary
• Use MAE when all errors should be treated equally.
• Use RMSE when larger errors should be penalized more.
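As a quick sketch, both metrics can be computed with scikit-learn (the arrays below are made-up example values, not real predictions):

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

actual = np.array([200000, 150000, 320000, 275000])     # made-up house prices
predicted = np.array([210000, 140000, 300000, 290000])  # made-up predictions

mae = mean_absolute_error(actual, predicted)             # average of |actual - predicted|
rmse = np.sqrt(mean_squared_error(actual, predicted))    # square root of the mean squared error

print("MAE: ", mae)
print("RMSE:", rmse)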
Analyzing the Data (EDA)
Exploratory Data Analysis (EDA) is the process of examining a dataset to
summarize its main characteristics, often using visual methods or quick commands.
The goal is to understand the structure of the data, detect patterns, spot
anomalies, and get a feel for what kind of preprocessing or modeling might be
needed.

For our project, we’ll perform EDA on the California housing dataset (which we are
treating as if it represents Gurgaon data). Here are some key commands we’ll use:

1. df.head()

• Displays the first 5 rows of the dataset.


• Useful for getting a quick overview of what the data looks like — column
names, data types, and sample values.

2. df.info()

• Gives a summary of the dataset.


• Shows the number of entries, column names, data types, and how many non-
null values each column has.
• Helps us identify missing values or incorrect data types.

3. df.describe()

• Provides statistical summaries for numeric columns.

• Shows:

• Count: Total number of non-null entries
• Mean: Average value
• Std: Standard deviation
• Min: The smallest value (the 0th percentile)
• 25%: The 1st quartile (Q1) — 25% of the data is below this value
• 50%: The median or 2nd quartile (Q2) — half of the data is below this value
• 75%: The 3rd quartile (Q3) — 75% of the data is below this value
• Max: The largest value (the 100th percentile)

Percentiles divide the data into 100 equal parts. Quartiles divide the data into
4 equal parts (Q1 = 25th percentile, Q2 = 50th, Q3 = 75th). So:

• Min is the 0th percentile


• Max is the 100th percentile

This helps us understand how the values are spread out and if there are outliers.

4. df['column_name'].value_counts()

• Shows the count of each unique value in a specific column.


• Useful for categorical columns to see how values are distributed.
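As a minimal sketch, here are those commands put together, assuming the data has been loaded into a DataFrame named df (using the housing.csv file introduced later):

import pandas as pd

df = pd.read_csv("housing.csv")

print(df.head())                              # first 5 rows
df.info()                                     # dtypes and non-null counts
print(df.describe())                          # statistical summary of numeric columns
print(df["ocean_proximity"].value_counts())   # distribution of a categorical column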
Creating a Test Set
When building machine learning models, one of the most important steps is
splitting your dataset into training and test sets. This ensures your model is
evaluated on data it has never seen before, which is critical for assessing its ability
to generalize.

The Problem of Data Snooping Bias

Data snooping bias occurs when information from the test set leaks into the
training process. This can lead to overly optimistic performance metrics and
models that don’t perform well in real-world scenarios.

To avoid this, the test set must be isolated before any data exploration, feature
selection, or model training begins.

Random Sampling: A Basic Approach

A simple method to split the data is to randomly shuffle it and then divide it:

import numpy as np

def shuffle_and_split_data(data, test_ratio):
    np.random.seed(42)  # Set the seed for reproducibility
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

Setting the random seed (e.g., with np.random.seed(42) ) ensures consistency across runs — this is crucial for debugging and comparing models fairly.

However, pure random sampling might not always be reliable, especially if the dataset contains important patterns that are not evenly distributed.

Stratified Sampling

To ensure that important characteristics of the population are well represented in both the training and test sets, we use stratified sampling.

What is a Stratum?
A stratum is a subgroup of the data defined by a specific attribute. Stratified sampling ensures that each of these subgroups is proportionally represented.

For example, in the California housing dataset, median income is a strong predictor of house prices. Instead of randomly sampling, we can create strata based on income levels (e.g., binning median income into categories) and ensure the test set maintains the same distribution of income levels as the full dataset.

Creating Income Categories

import numpy as np
import pandas as pd

# Load the dataset
data = pd.read_csv("housing.csv")

# Create income categories
data["income_cat"] = pd.cut(data["median_income"],
                            bins=[0, 1.5, 3.0, 4.5, 6.0, np.inf],
                            labels=[1, 2, 3, 4, 5])

This code creates a new column income_cat that categorizes the median_income
into five bins. Each bin represents a range of income levels, allowing us to stratify
our sampling based on these categories.

We can plot these income categories to visualize the distribution:

import matplotlib.pyplot as plt


data["income_cat"].value_counts().sort_index().plot.bar(rot=0, grid=True)
plt.title("Income Categories Distribution")
plt.xlabel("Income Category")
plt.ylabel("Number of Instances")
plt.show()

Stratified Shuffle Split in Scikit-Learn

Scikit-learn provides a built-in way to perform stratified sampling using StratifiedShuffleSplit.

Here’s how you can use it:

from sklearn.model_selection import StratifiedShuffleSplit

# Assume income_cat is a column in the dataset created from median_income
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(data, data["income_cat"]):
    strat_train_set = data.loc[train_index]
    strat_test_set = data.loc[test_index]

This ensures that the income distribution in both sets is similar to that of the full
dataset, reducing sampling bias and making your model evaluation more reliable.
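As a quick check (reusing the data and strat_test_set variables from above), you can compare the income-category proportions in the test set with those in the full dataset; with stratified sampling they should be nearly identical:

# Proportion of each income category in the full dataset
print(data["income_cat"].value_counts(normalize=True).sort_index())

# Proportion of each income category in the stratified test set
print(strat_test_set["income_cat"].value_counts(normalize=True).sort_index())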
Data Visualization
Before handling missing values or training models, it’s important to visualize the
data to uncover patterns, relationships, and potential issues.

Geographical Scatter Plot

Visualize the geographical distribution of the data:

df.plot(kind="scatter", x="longitude", y="latitude", grid=True, alpha=0.2)


plt.show()

• alpha=0.2 makes overlapping points more visible.


• This helps reveal data clusters and high-density areas like coastal regions.

Correlation Matrix

To understand relationships between numerical features, compute the correlation matrix:

corr_matrix = df.corr(numeric_only=True)  # numeric_only skips text columns like ocean_proximity

Check how strongly each attribute correlates with the target:

corr_matrix["median_house_value"].sort_values(ascending=False)

This helps identify useful predictors. For example, median_income usually shows a
strong positive correlation with house prices.
Scatter Matrix

Plot selected features to see pairwise relationships:

from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median


_age"]
scatter_matrix(df[attributes], figsize=(12, 8))
plt.show()

This gives an overview of which features are linearly related and may be good
predictors.

Focused Income vs Price Plot

Plot median_income vs median_house_value directly:

df.plot(kind="scatter", x="median_income", y="median_house_value", alpha=0.1,


grid=True)
Further Preprocessing & Handling
Missing Data
Before feeding your data into a machine learning algorithm, you need to clean and
prepare it.

Prepare Data for Training

It’s best to write transformation functions instead of applying them manually. This
ensures:

• Reproducibility on any dataset


• Reusability across projects
• Compatibility with live systems
• Easier experimentation

Start by creating a clean copy and separating the predictors and labels:

housing = strat_train_set.drop("median_house_value", axis=1)

housing_labels = strat_train_set["median_house_value"].copy()

Handling Missing Data

Some features, like total_bedrooms , contain missing values. You can:

1. Drop rows with missing values


2. Drop the entire column
3. Impute missing values (recommended)

We’ll go with option 3, using SimpleImputer from Scikit-Learn, which allows consistent handling across all datasets (train, test, new data):

import numpy as np
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="median")

housing_num = housing.select_dtypes(include=[np.number])  # keep only the numeric columns
imputer.fit(housing_num)

This computes the median for each numerical column and stores it in
imputer.statistics_ :

>>> imputer.statistics_
array([-118.51 , 34.26 , 29. , 2125. , 434. , 1167. , 408. , 3.5385])

Now apply the learned medians to transform the data:

X = imputer.transform(housing_num)

Other available strategies:

• "mean" – replaces with mean value


• "most_frequent" – for the most common value (can handle categorical)
• "constant" – fill with a fixed value using fill_value=...
Scikit-Learn Design Principles
Scikit-Learn has a simple and consistent API that makes it easy to use and
understand. Below are the key design principles behind it:

1. Consistency

All objects follow a standard interface, which makes learning and using different
tools in Scikit-Learn easier.

2. Estimators

Any object that learns from data is called an estimator.

• Use the .fit() method to train an estimator.


• In supervised learning, pass both X (features) and y (labels) to .fit(X, y) .
• Hyperparameters (like strategy='mean' in SimpleImputer ) are set when
creating the object.

Example:

imputer = SimpleImputer(strategy="median")
imputer.fit(data)
3. Transformers

Some estimators can also transform data. These are called transformers.

• Use .transform() to apply the transformation after fitting.


• Use .fit_transform() to do both in one step.

Example:

X_transformed = imputer.fit_transform(data)

4. Predictors

Models that can make predictions are predictors.

• Use .predict() to make predictions on new data.


• Use .score() to evaluate performance (e.g., accuracy or R²).

Example:

from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
score = model.score(X_test, y_test)  # R² for regression models

5. Inspection

• Hyperparameters can be accessed directly: model.param_name


• Learned parameters are stored with an underscore: model.coef_ ,
imputer.statistics_
6. No Extra Classes

• Inputs and outputs are basic structures like NumPy arrays or Pandas
DataFrames.
• No need to learn custom data types.

7. Composition

You can combine steps into a Pipeline, chaining transformers and a final predictor.

Example:

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("model", LinearRegression())
])
pipeline.fit(X, y)

8. Sensible Defaults

Most tools in Scikit-Learn work well with default settings, so you can get started
quickly.

Note on DataFrames

Even if you input a Pandas DataFrame, the output of transformers like transform() will be a NumPy array. You can convert it back like this:

X = imputer.transform(housing_num)
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)
Handling Categorical and Text
Attributes in Scikit-Learn
Most machine learning algorithms work best with numerical data. But real-world
datasets often contain categorical or text attributes. Let’s understand how to
handle these in Scikit-Learn using the ocean_proximity column from the
California housing dataset as an example.

1. Categorical Attributes

Text columns like "ocean_proximity" are not free-form text but limited to a fixed
set of values (e.g., "NEAR BAY" , "INLAND" ). These are known as categorical
attributes.

Example:

housing_cat = housing[["ocean_proximity"]]
housing_cat.head()

2. Ordinal Encoding

Scikit-Learn’s OrdinalEncoder can convert categories to numbers:

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)

This will output a 2D NumPy array with numerical category codes.


To see the mapping:

ordinal_encoder.categories_
# Output: [array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'], dtype=object)]

⚠️ Caution: Ordinal encoding implies an order between categories, which may not
be true here. For example, it treats INLAND (1) as closer to <1H OCEAN (0) than
NEAR OCEAN (4) , which might not make sense.

3. One-Hot Encoding

For unordered categories, one-hot encoding is a better choice. It creates one binary column per category.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

This gives a sparse matrix (efficient storage for mostly zeros).

To convert it to a regular NumPy array:

housing_cat_1hot.toarray()

Or directly get a dense array:

cat_encoder = OneHotEncoder(sparse_output=False)  # on scikit-learn < 1.2 use sparse=False instead
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)

To check category order:

cat_encoder.categories_
4. Summary

| Method | Use When | Output Type |
| --- | --- | --- |
| OrdinalEncoder | Categories have an order | 2D NumPy array |
| OneHotEncoder | Categories are unordered | Sparse or dense array |

Using the right encoding ensures your model learns correctly from categorical
features.
Feature Scaling and Transformation
Feature scaling is a crucial preprocessing step. Most machine learning algorithms
perform poorly when input features have vastly different scales.

In the California housing dataset, for example:

• total_rooms ranges from 6 to over 39,000


• median_income ranges from 0 to 15

If you don’t scale these features, models will give more importance to
total_rooms simply because it has larger values.

Why Scaling Is Needed

• Many models (like Linear Regression, KNN, SVMs, Gradient Descent-based algorithms) assume features are on a similar scale.
• Without scaling, features with larger ranges can dominate model behavior.
• Scaling makes training more stable and faster.

Min-Max Scaling (Normalization)

This method rescales the data to a specific range, usually [0, 1] or [-1, 1] .

Formula:

scaled_value = (x - min) / (max - min)

Use Scikit-Learn’s MinMaxScaler :

from sklearn.preprocessing import MinMaxScaler


min_max_scaler = MinMaxScaler(feature_range=(-1, 1))
housing_num_min_max_scaled = min_max_scaler.fit_transform(housing_num)

• Use feature_range=(-1, 1) for models like neural networks.


• Sensitive to outliers — extreme values can distort the scale.

Standardization (Z-score Scaling)

This method centers the data around 0 and scales it based on standard deviation.

Formula:

standardized_value = (x - mean) / std

Use Scikit-Learn’s StandardScaler :

from sklearn.preprocessing import StandardScaler

std_scaler = StandardScaler()
housing_num_std_scaled = std_scaler.fit_transform(housing_num)

• Resulting features have zero mean and unit variance


• Less affected by outliers than min-max scaling
• Recommended for most ML algorithms, especially when using gradient
descent
Transformation Pipelines
As datasets grow more complex, data preprocessing often involves multiple steps
such as imputing missing values, scaling features, encoding categorical variables,
etc. These steps must be applied in the correct order and consistently across
training, validation, test, and future production data.

To streamline this process, Scikit-Learn provides the Pipeline class — a powerful utility for chaining data transformations.

Building a Numerical Pipeline

A typical pipeline for numerical attributes might include:

1. Imputation of missing values (e.g., with median).


2. Feature scaling (e.g., with standardization).

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
])

How It Works
• The pipeline takes a list of steps as (name, transformer) pairs.
• Names must be unique and should not contain double underscores __ .
• All intermediate steps must be transformers (i.e., must implement
fit_transform() ).
• The final step can be either a transformer or a predictor.

Using make_pipeline

If you don’t want to name the steps manually, you can use make_pipeline() :

from sklearn.pipeline import make_pipeline

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())

• This automatically names the steps using the class names in lowercase.
• If the same class appears multiple times, a number is appended (e.g.,
standardscaler-1 ).

Applying the Pipeline

Call fit_transform() to apply all transformations in sequence:

housing_num_prepared = num_pipeline.fit_transform(housing_num)
print(housing_num_prepared[:2].round(2))

Example output:

array([[-1.42, 1.01, 1.86, 0.31, 1.37, 0.14, 1.39, -0.94],

[ 0.60, -0.70, 0.91, -0.31, -0.44, -0.69, -0.37, 1.17]])

• Each row corresponds to a transformed sample.


• Each column corresponds to a scaled feature.

Retrieving Feature Names


To turn the result back into a DataFrame with feature names:

df_housing_num_prepared = pd.DataFrame(
    housing_num_prepared,
    columns=num_pipeline.get_feature_names_out(),
    index=housing_num.index
)

Pipeline as a Transformer or Predictor

• If the last step is a transformer, the pipeline behaves like a transformer ( fit_transform() , transform() ).
• If the last step is a predictor (e.g., a model), the pipeline behaves like an
estimator ( fit() , predict() ).

This flexibility makes Pipeline the standard way to handle data preprocessing
and modeling in Scikit-Learn projects.
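As a small sketch of the predictor case, reusing the housing_num and housing_labels variables from earlier (the step names are arbitrary):

from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# The last step is a predictor, so the whole pipeline acts like an estimator
full_model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("standardize", StandardScaler()),
    ("lin_reg", LinearRegression()),
])

full_model.fit(housing_num, housing_labels)    # fit_transform on each transformer, then fit the model
predictions = full_model.predict(housing_num)  # transform on each transformer, then predict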
Data Preprocessing - Final Pipeline
In this section, we will consolidate everything we’ve done so far into one final
script using Scikit-Learn pipelines. This includes:

1. Creating a stratified test set


2. Handling missing values
3. Encoding categorical variables
4. Scaling numerical features
5. Combining everything using Pipeline and ColumnTransformer

This will ensure clean, modular, and reproducible code — perfect for production
and education.

Final Preprocessing Code using Scikit-Learn Pipelines

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedShuffleSplit


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
# from sklearn.preprocessing import OrdinalEncoder  # Uncomment if you prefer ordinal encoding

# 1. Load the data


housing = pd.read_csv("housing.csv")

# 2. Create a stratified test set based on income category


housing["income_cat"] = pd.cut(
housing["median_income"],
bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
labels=[1, 2, 3, 4, 5]
)

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index].drop("income_cat", axis=1)
    strat_test_set = housing.loc[test_index].drop("income_cat", axis=1)

# Work on a copy of training data


housing = strat_train_set.copy()

# 3. Separate predictors and labels


housing_labels = housing["median_house_value"].copy()
housing = housing.drop("median_house_value", axis=1)

# 4. Separate numerical and categorical columns


num_attribs = housing.drop("ocean_proximity", axis=1).columns.tolist()
cat_attribs = ["ocean_proximity"]

# 5. Pipelines
# Numerical pipeline
num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical pipeline
cat_pipeline = Pipeline([
    # ("ordinal", OrdinalEncoder()),  # Use this if you prefer ordinal encoding
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Full pipeline
full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", cat_pipeline, cat_attribs),
])

# 6. Transform the data


housing_prepared = full_pipeline.fit_transform(housing)
# housing_prepared is now a NumPy array ready for training

print(housing_prepared.shape)
Training and Evaluating ML Models
Now that our data is preprocessed, let’s move on to training machine learning
models and evaluating their performance. We’ll start with:

• Linear Regression
• Decision Tree Regressor
• Random Forest Regressor

We’ll first test them on the training data and then use cross-validation to get a
better estimate of their true performance.

1. Train and Test Models on the Training Set

from sklearn.linear_model import LinearRegression


from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

# Linear Regression
lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

# Decision Tree
tree_reg = DecisionTreeRegressor(random_state=42)
tree_reg.fit(housing_prepared, housing_labels)

# Random Forest
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(housing_prepared, housing_labels)

# Predict using training data


lin_preds = lin_reg.predict(housing_prepared)
tree_preds = tree_reg.predict(housing_prepared)
forest_preds = forest_reg.predict(housing_prepared)

# Calculate RMSE
# Note: on scikit-learn >= 1.4 you can use sklearn.metrics.root_mean_squared_error instead,
# since the squared=... argument of mean_squared_error is deprecated there.
lin_rmse = mean_squared_error(housing_labels, lin_preds, squared=False)
tree_rmse = mean_squared_error(housing_labels, tree_preds, squared=False)
forest_rmse = mean_squared_error(housing_labels, forest_preds, squared=False)

print("Linear Regression RMSE:", lin_rmse)


print("Decision Tree RMSE:", tree_rmse)
print("Random Forest RMSE:", forest_rmse)

A Warning About Training RMSE

Training RMSE only shows how well the model fits the training data. It does not
tell us how well it will perform on unseen data. In fact, the Decision Tree and
Random Forest may overfit, leading to very low training error but poor
generalization.

2. Cross-Validation: A Better Evaluation Strategy

Cross-validation helps us evaluate how a model generalizes to new data without needing to touch the test set.

What is Cross-Validation?
Instead of training the model once and evaluating on a holdout set, k-fold cross-
validation splits the training data into k folds (typically 10), trains the model on k-1
folds, and validates it on the remaining fold. This process repeats k times.

We’ll use cross_val_score from sklearn.model_selection .


Cross-Validation on Decision Tree

from sklearn.model_selection import cross_val_score
import pandas as pd

# Evaluate Decision Tree with cross-validation
tree_rmses = -cross_val_score(
    tree_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)

# WARNING: Scikit-Learn's scoring uses utility functions (higher is better),
# so RMSE is returned as negative. We use minus (-) to convert it back to positive RMSE.
print("Decision Tree CV RMSEs:", tree_rmses)
print("\nCross-Validation Performance (Decision Tree):")
print(pd.Series(tree_rmses).describe())
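The same evaluation can be applied to the other models. For example, a sketch for the Random Forest (this takes noticeably longer, since it trains ten forests):

# Evaluate Random Forest with cross-validation (reuses forest_reg from above)
forest_rmses = -cross_val_score(
    forest_reg,
    housing_prepared,
    housing_labels,
    scoring="neg_root_mean_squared_error",
    cv=10
)

print("\nCross-Validation Performance (Random Forest):")
print(pd.Series(forest_rmses).describe())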
Model Persistence and Inference with
Joblib in a Random Forest Pipeline
Let’s now summarize how to train a Random Forest model on California housing data, save the model and preprocessing pipeline using joblib , and reuse the model later for inference on new data ( input.csv ). This approach avoids retraining the model every time, improving performance and enabling reproducibility.

Why These Steps?

1. Why Train Once and Save?


• Training models repeatedly is time-consuming and computationally
expensive.
• Saving the model ( model.pkl ) and preprocessing pipeline ( pipeline.pkl )
ensures you can quickly load and run inference anytime in the future.

2. Why Use a Preprocessing Pipeline?


• Raw data needs to be cleaned, scaled, and encoded before model training.
• A Pipeline automates this transformation and ensures identical
preprocessing during inference.

3. Why Use Joblib?


• joblib efficiently serializes large NumPy arrays (like in sklearn models).
• Faster and more suitable than pickle for scikit-learn objects.
4. Why the If-Else Logic?
• The program checks if a saved model exists.

• If not, it trains and saves the model.


• If it does, it skips training and only runs inference, saving time.

Full Code

import os
import pandas as pd
import numpy as np
import joblib

from sklearn.model_selection import StratifiedShuffleSplit


from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor

MODEL_FILE = "model.pkl"
PIPELINE_FILE = "pipeline.pkl"

def build_pipeline(num_attribs, cat_attribs):
    num_pipeline = Pipeline([
        ("imputer", SimpleImputer(strategy="median")),
        ("scaler", StandardScaler())
    ])
    cat_pipeline = Pipeline([
        ("onehot", OneHotEncoder(handle_unknown="ignore"))
    ])
    full_pipeline = ColumnTransformer([
        ("num", num_pipeline, num_attribs),
        ("cat", cat_pipeline, cat_attribs)
    ])
    return full_pipeline
if not os.path.exists(MODEL_FILE):
    # TRAINING PHASE
    housing = pd.read_csv("housing.csv")
    housing['income_cat'] = pd.cut(housing["median_income"],
                                   bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                                   labels=[1, 2, 3, 4, 5])
    split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
    for train_index, _ in split.split(housing, housing['income_cat']):
        housing = housing.loc[train_index].drop("income_cat", axis=1)

    housing_labels = housing["median_house_value"].copy()
    housing_features = housing.drop("median_house_value", axis=1)

    num_attribs = housing_features.drop("ocean_proximity", axis=1).columns.tolist()
    cat_attribs = ["ocean_proximity"]

    pipeline = build_pipeline(num_attribs, cat_attribs)
    housing_prepared = pipeline.fit_transform(housing_features)

    model = RandomForestRegressor(random_state=42)
    model.fit(housing_prepared, housing_labels)

    # Save model and pipeline
    joblib.dump(model, MODEL_FILE)
    joblib.dump(pipeline, PIPELINE_FILE)

    print("Model trained and saved.")

else:
    # INFERENCE PHASE
    model = joblib.load(MODEL_FILE)
    pipeline = joblib.load(PIPELINE_FILE)

    input_data = pd.read_csv("input.csv")
    transformed_input = pipeline.transform(input_data)
    predictions = model.predict(transformed_input)
    input_data["median_house_value"] = predictions
    input_data.to_csv("output.csv", index=False)
    print("Inference complete. Results saved to output.csv")

Summary

With this setup, our ML pipeline is:

• Efficient – No retraining needed if the model exists.


• Reproducible – Same preprocessing logic every time.
• Production-ready – Can be deployed or reused across multiple systems.
Conclusion: California Housing Price
Prediction Project
In this project, we built a complete machine learning pipeline to predict California
housing prices using various regression algorithms. We started by:

• Loading and preprocessing the dataset ( housing.csv ) with careful treatment of missing values, scaling, and encoding using a custom pipeline.

• Stratified splitting was used to maintain income category distribution between train and test sets.

• We trained and evaluated multiple algorithms including:

• Linear Regression
• Decision Tree Regressor
• Random Forest Regressor

• Through cross-validation, we found that Random Forest performed the best, offering the lowest RMSE and most stable results.

Finally, we built a script that:

• Trains the Random Forest model and saves it using joblib .


• Uses an if-else logic to skip retraining if the model exists.
• Applies the trained model to new data ( input.csv ) to predict
median_house_value , storing results in output.csv .

This pipeline ensures that predictions are accurate, efficient, and ready for
production deployment.
Introduction to Neural Networks

Inspiration from Nature


Birds inspired humans to build airplanes. The tiny hooks on burrs sticking to a
dog’s fur led to the invention of Velcro. And just like that, nature has always been
humanity’s greatest engineer.

So, when it came to making machines that could think, learn, and solve problems,
where did we look?

To the human brain.

That’s how neural networks were born — machines inspired by neurons in our
brains, built to recognize patterns, make decisions, and even learn from experience.

What is AI, ML, and DL?

Before we dive into neural networks, let’s untangle these buzzwords.

| Term | Stands for | Think of it as… |
| --- | --- | --- |
| AI | Artificial Intelligence | The big umbrella: making machines “smart” |
| ML | Machine Learning | A subset of AI: machines that learn from data |
| DL | Deep Learning | A type of ML: uses neural networks |

Let’s simplify:

• AI is the dream: “Can we make machines intelligent?”


• ML is the method: “Let’s give machines data and let them learn.”
• DL is the tool: “Let’s use neural networks that learn in layers — like the brain.”
So, when we talk about neural networks, we’re entering the world of deep
learning, which is a part of machine learning, which itself is a part of AI.

So, What Are Neural Networks?

Imagine a bunch of simple decision-makers called neurons, connected together in layers.

Each neuron:

• Takes some input (like a number)


• Applies a little math (weights + bias)
• Passes the result through a rule (called an activation function)
• Sends the output to the next layer

By connecting many of these neurons, we get a neural network.

And what’s amazing?

Even though each neuron is simple, when combined, the network becomes
powerful — like how a bunch of ants can build a complex colony.

Why Are Neural Networks Useful?

Because they can learn patterns, even when we don’t fully understand the patterns
ourselves.

Examples:

• Recognize cats in photos


• Convert speech to text
• Translate languages
• Predict stock prices
• Generate art
• Power AI like ChatGPT
Structure of a Neural Network

Here’s the basic anatomy of a neural network:

Input Layer → Hidden Layers → Output Layer


(Data) (Neurons doing math) (Prediction)

Each layer is just a bunch of neurons working together. The more hidden layers,
the “deeper” the network. Hence: Deep Learning.

Wait — Why Not Use Simple Code Instead?

Good question.

Sometimes, a simple formula or rule is enough (like area = length × width ).

But what about:

• Recognizing handwritten digits?


• Understanding language?
• Diagnosing diseases from X-rays?

There are no easy formulas for these. Neural networks learn the formula by
themselves from lots of examples.

How Do Neural Networks Learn?

Let’s say the network tries to predict y = x² + x .

1. It starts with random guesses (bad predictions)


2. It checks how wrong it is (loss)
3. It adjusts the internal settings (weights) to be a little better
4. Repeat, repeat, repeat…
Over time, the network figures out the relationship between x and y.

This process is called training — and it’s where the magic happens.
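As a tiny illustration of this loop, here is a minimal sketch using TensorFlow/Keras (introduced later in these notes) to learn y = x² + x from examples; the layer sizes and epoch count are arbitrary choices:

import numpy as np
import tensorflow as tf

# Toy data for y = x^2 + x
x = np.linspace(-3, 3, 200).reshape(-1, 1).astype("float32")
y = x ** 2 + x

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(1,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                    # one continuous output
])
model.compile(optimizer="adam", loss="mse")      # the loss measures how wrong the guesses are
model.fit(x, y, epochs=200, verbose=0)           # adjust the weights a little, repeat, repeat...

print(model.predict(np.array([[2.0]], dtype="float32")))  # should be roughly 2² + 2 = 6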

Are They Really Like the Brain?

Kind of — but very simplified.

• A biological brain neuron connects to 1000s of others


• It processes chemicals, spikes, timings
• It adapts and rewires itself

A neural network is a mathematical model — inspired by the brain, but way


simpler. Still, the results are powerful.

Summary

| Concept | Meaning |
| --- | --- |
| AI | Making machines act smart |
| ML | Letting machines learn from data |
| DL | Using multi-layered neural networks to learn complex stuff |
| Neural Network | A network of artificial neurons that learns from data |
Perceptron – The Simplest Neural
Network

What is a Perceptron?

The perceptron is the basic building block of a neural network. It’s a simple
computational model that takes several inputs, applies weights to them, adds a
bias, and produces an output. It’s essentially a decision-making unit.

Real-life Analogy

Imagine you’re trying to decide whether to go outside based on:

• Is it sunny?
• Is it the weekend?
• Are you free today?

You assign importance (weights) to each factor:

• Sunny: 0.6
• Weekend: 0.3
• Free: 0.8

You combine these factors to make a decision: Go or Not Go.

This is what a perceptron does.


Perceptron Formula

A perceptron takes inputs (x1, x2, …, xn), multiplies each by its corresponding
weight (w1, w2, …, wn), adds a bias (b), and passes the result through an activation
function.

y = f(w1*x1 + w2*x2 + … + wn*xn + b)

Where:

• xi: input features


• wi: weights
• b: bias
• f: activation function (e.g., step function)

Step-by-step Example: Binary Classification

Let’s say we want a perceptron to learn this simple table:

| Input (x1, x2) | Output (y) |
| --- | --- |
| (0, 0) | 0 |
| (0, 1) | 0 |
| (1, 0) | 0 |
| (1, 1) | 1 |

This is the behavior of a logical AND gate.

We will use:

• Inputs: x1, x2
• Weights: w1, w2
• Bias: b
• Activation Function: Step function

Step function:
def step(x):
    return 1 if x >= 0 else 0

Code: Simple Perceptron from Scratch

def step(x):
    return 1 if x >= 0 else 0

def perceptron(x1, x2, w1, w2, b):
    z = x1 * w1 + x2 * w2 + b
    return step(z)

# Try different weights and bias to match the AND logic
print(perceptron(0, 0, 1, 1, -1.5))  # Expected: 0
print(perceptron(0, 1, 1, 1, -1.5))  # Expected: 0
print(perceptron(1, 0, 1, 1, -1.5))  # Expected: 0
print(perceptron(1, 1, 1, 1, -1.5))  # Expected: 1

This matches the AND logic perfectly.

Summary

• A perceptron is the simplest form of a neural network.


• It performs a weighted sum of inputs, adds a bias, and passes the result
through an activation function to make a decision.
• It can model simple binary functions like AND, OR, etc.
Common Terms in Deep Learning
Before diving into neural networks, let’s clarify some common terms you’ll
encounter:

Perceptron

A perceptron is the simplest type of neural network — just one neuron. It takes
inputs, multiplies them by weights, adds a bias, and applies an activation function
to make a decision (e.g., classify 0 or 1).

Neural Network

A neural network is a collection of interconnected layers of perceptrons (neurons).


Each layer transforms its inputs using weights, biases, and activation functions.
Deep neural networks have multiple hidden layers and can model complex
patterns.

Hyperparameters

These are settings we configure before training a model. They are not learned
from the data. Examples include:

• Learning rate: How much to adjust weights during training

• Number of epochs: How many times the model sees the entire training
dataset

• Batch size: How many samples to process before updating weights


• Number of layers or neurons: How many neurons are in each layer of the
network

Learning Rate (η)

This controls how much we adjust the weights after each training step. A learning
rate that’s too high may overshoot the solution; too low may make training very
slow.

Training

This is the process where the model learns patterns from data by updating
weights based on errors between predicted and actual outputs.

Backpropagation

Backpropagation is the algorithm used to update weights in a neural network. It


calculates the gradient of the loss function with respect to each weight by applying
the chain rule, allowing the model to learn from its mistakes.

Inference

Inference is when the trained model is used to make predictions on new, unseen
data.
Activation Function

This function adds non-linearity to the output of neurons, helping networks model
complex patterns.

Common activation functions:

• ReLU : Rectified Linear Unit


• Sigmoid : squashes output between 0 and 1
• Tanh : squashes output between -1 and 1

Perceptrons typically use a step function as the activation, but modern neural
nets often use ReLU .

Epoch

One epoch means one full pass over the entire training dataset. Multiple epochs
are used so the model can keep refining its understanding.
Training a Perceptron with scikit-learn

1. Import Libraries

from sklearn.linear_model import Perceptron


from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

2. Create and Split Data

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

3. Initialize the Perceptron

clf = Perceptron(
max_iter=1000, # Maximum number of epochs
eta0=0.1, # Learning rate
random_state=42, # For reproducibility
tol=1e-3, # Stop early if improvement is smaller than this
shuffle=True # Shuffle data each epoch
)
4. Train the Model

clf.fit(X_train, y_train)

Under the hood, this performs the following steps:

• Loops through the data up to max_iter times (epochs)


• Computes predictions
• If a prediction is wrong, updates weights

5. Evaluate the Model

accuracy = clf.score(X_test, y_test)


print(f"Accuracy: {accuracy:.2f}")

Important Hyperparameters Recap:

| Hyperparameter | Description |
| --- | --- |
| max_iter | Number of epochs (passes over training data) |
| eta0 | Learning rate |
| tol | Tolerance for stopping early |
| shuffle | Whether to shuffle data between epochs |
| random_state | Seed for reproducibility |


TensorFlow vs Keras vs PyTorch
Deep learning has transformed industries—from self-driving cars to language
models like ChatGPT. But behind the scenes, there are powerful libraries that make
all of this possible: TensorFlow, Keras, and PyTorch. Today, we’ll explore:

• Why and how each library was created


• How they differ in philosophy and design
• What a “backend” means in Keras
• Which one might be right for you

A Brief History

1. TensorFlow
• Released: 2015 by Google Brain
• Language: Python (but has C++ core)
• Goal: Provide an efficient, production-ready, and scalable library for deep
learning.
• TensorFlow is a computational graph framework, meaning it represents
computations as nodes in a graph.
• Open sourced in 2015, TensorFlow quickly became the go-to library for many
companies and researchers.

Fun Fact: TensorFlow was a spiritual successor to Google’s earlier tool called
DistBelief.

2. Keras
• Released: 2015 by François Chollet
• Goal: Make deep learning simple, intuitive, and user-friendly.
• Keras was originally just a high-level wrapper over Theano and TensorFlow,
making it easier to build models with fewer lines of code.
• Keras introduced the idea of writing deep learning models like stacking Lego
blocks.

Keras was not a full deep learning engine—it needed a “backend” to actually do
the math

3. PyTorch
• Released: 2016 by Facebook AI Research (FAIR)
• Language: Python-first, with a strong integration to NumPy
• Goal: Make deep learning flexible, dynamic, and easier for research.
• PyTorch uses dynamic computation graphs, meaning the graph is built on the
fly, allowing for more intuitive debugging and flexibility.

PyTorch gained massive popularity in academia and research because of its


Pythonic nature and simplicity.

In a nutshell…

| Feature | TensorFlow | Keras | PyTorch |
| --- | --- | --- | --- |
| Developed By | Google | François Chollet (Google) | Facebook (Meta) |
| Level | Low-level & high-level | High-level only | Low-level (with some high-level APIs) |
| Computation Graph | Static (TensorFlow 1.x), Hybrid (2.x) | Depends on backend | Dynamic |
| Ease of Use | Medium (Better in 2.x) | Very high | High |
| Debugging | Harder (in 1.x), Easier in 2.x | Easy | Very easy (Pythonic) |
| Production-ready | Yes | Yes (via TensorFlow backend) | Gaining ground |
| Research usage | Moderate | Moderate | Very high |

What Is a “Backend” in Keras?

Keras itself doesn’t perform computations like matrix multiplications or gradient descent. It’s more like a user interface or a frontend. It delegates the heavy lifting to a backend engine.

Supported Backends Over Time:


• Theano (now discontinued)
• TensorFlow (default backend now)
• CNTK (Microsoft, also discontinued)
• PlaidML (experimental support)

You can think of Keras as the steering wheel, while TensorFlow or Theano was the
engine under the hood.

In TensorFlow 2.0 and later, Keras is fully integrated as tf.keras , eliminating the
need to manage separate backends.
TensorFlow vs PyTorch: The Real Battle

| Area | TensorFlow | PyTorch |
| --- | --- | --- |
| Ease of Deployment | TensorFlow Serving, TFX, TensorFlow Lite | TorchServe, ONNX, some catching up |
| Mobile Support | Excellent (TF Lite, TF.js) | Improving but limited |
| Dynamic Graphs | TF 1.x: No, TF 2.x: Yes (via Autograph) | Native support |
| Community | Large, especially in production | Massive in academia |
| Performance | Highly optimized | Also strong, especially for GPUs |

Which One Should You Learn First?

• Beginner? → Start with Keras via TensorFlow 2.x ( tf.keras ). It’s simple and
production-ready.
• Researcher or experimenting a lot? → Use PyTorch.
• Looking for deployment & scalability? → TensorFlow is very robust.
Installing TensorFlow 2.0
Today we will install TensorFlow 2.0, which is a powerful library for machine
learning and deep learning tasks. TensorFlow 2.0 simplifies the process of building
and training models, making it more user-friendly compared to its predecessor.
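The usual way to get it is via pip; recent releases of the tensorflow package already ship the 2.x API, so no special version pin is needed (a virtual environment is optional but recommended):

pip install tensorflow

python -c "import tensorflow as tf; print(tf.__version__)"   # should print a 2.x version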
Understanding and Visualizing the
MNIST Dataset with TensorFlow
The MNIST dataset is like the “Hello World” of deep learning. It contains 70,000
grayscale images of handwritten digits (0–9), each of size 28x28 pixels. Before we
train a neural network, it’s important to understand what we’re working with.

Today we’ll:

• Load the MNIST dataset using TensorFlow


• Visualize a few digit samples
• Understand the data format

Step 1: Load the Dataset

TensorFlow provides a built-in method to load MNIST, so no extra setup is needed.

import tensorflow as tf
import matplotlib.pyplot as plt

# Load dataset
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()

# Check the shape


print("Training data shape:", x_train.shape)
print("Training labels shape:", y_train.shape)

Output:
Training data shape: (60000, 28, 28)

Training labels shape: (60000,)

• We have 60,000 training images and 10,000 test images.


• Each image is 28x28 pixels.
• Each label is a number from 0 to 9.

Step 2: Visualize Sample Digits

Let’s look at a few images to get a feel for the dataset:

# Plot first 10 images with their labels
plt.figure(figsize=(10, 2))
for i in range(10):
    plt.subplot(1, 10, i + 1)
    plt.imshow(x_train[i], cmap="gray")
    plt.axis("off")
    plt.title(str(y_train[i]))
plt.tight_layout()
plt.show()

This will display the first 10 handwritten digits with their corresponding labels
above them.

What You Should Notice

• The digits vary in writing style, which makes this dataset great for teaching
computers to generalize.
• All images are normalized 28x28 grayscale—no color channels.
• The label is not embedded in the image; it’s provided separately.
Your First Neural Network
TensorFlow is an open-source deep learning library developed by Google. It’s
widely used in industry and academia for building and training machine learning
models. TensorFlow 2.0 brought significant improvements in ease of use, especially
with eager execution and tight integration with Keras.

In this tutorial, we’ll create a simple neural network that learns to classify
handwritten digits using the MNIST dataset. This dataset contains 28x28 grayscale
images of digits from 0 to 9.

Key Features of TensorFlow 2.0

• Eager execution by default (no more complex session graphs!)


• Keras as the official high-level API ( tf.keras )
• Better debugging and simplicity
• Great for both beginners and professionals

What is a Neural Network?

A neural network is a collection of layers that learn to map input data to outputs.
Think of layers as filters that extract meaningful patterns. Each layer applies
transformations using weights and activation functions.

Installation

pip install tensorflow


Importing Required Libraries

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

Loading and Preparing the Data

# Load the data


(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Normalize the input data


x_train = x_train / 255.0
x_test = x_test / 255.0

# One-hot encode the labels


y_train = to_categorical(y_train, 10)
y_test = to_categorical(y_test, 10)

Building a Simple Neural Network

model = Sequential([
Flatten(input_shape=(28, 28)), # 28x28 images to 784 input features
Dense(128, activation='relu'), # Hidden layer with 128 neurons
Dense(10, activation='softmax') # Output layer for 10 classes
])
Compiling the Model

model.compile(
optimizer='adam',

loss='categorical_crossentropy',
metrics=['accuracy']
)

Training the Model

model.fit(x_train, y_train, epochs=5, batch_size=32)

Evaluating the Model

test_loss, test_acc = model.evaluate(x_test, y_test)


print(f"Test accuracy: {test_acc:.4f}")
Introduction to LLMs (Large Language
Models)
Large Language Models (LLMs) are a breakthrough in artificial intelligence that
have revolutionized how machines understand and generate human language.
These models are capable of performing a wide range of tasks such as translation,
summarization, question answering, and even creative writing — all by learning
from massive text datasets.

In this section, we will build a foundational understanding of what LLMs are, why
they matter in data science, and how they differ from traditional machine learning
models.

What is an LLM?

A Large Language Model is a type of AI model that uses deep learning, specifically
transformer architectures, to process and generate natural language. These models
are “large” because they contain billions (or even trillions) of parameters —
tunable weights that help the model make predictions.

At their core, LLMs are trained to predict the next word in a sentence, given the
words that came before. With enough data and training, they learn complex
language patterns, world knowledge, and even reasoning skills.

Why are LLMs Important?

• Versatility: One LLM can perform dozens of tasks without needing task-
specific training.
• Zero-shot and few-shot learning: LLMs can handle tasks they’ve never
explicitly seen before, based on prompts or examples.
• Human-like generation: They produce text that is often indistinguishable from
human writing.
• Foundation for AI applications: They power modern tools like ChatGPT,
Copilot, Bard, Claude, and more.

How are LLMs Different from Traditional ML Models?

| Feature | Traditional ML Models | LLMs |
| --- | --- | --- |
| Input | Structured data | Natural language (text) |
| Training | Task-specific | General pretraining on large text |
| Parameters | Thousands to millions | Billions to trillions |
| Adaptability | Limited | Highly adaptable via prompting |
| Knowledge representation | Feature-engineered | Implicit via word embeddings |

Where are LLMs Used?

LLMs are widely used across industries:

• Customer support: Chatbots and automated help desks


• Education: AI tutors, personalized learning
• Healthcare: Clinical documentation and patient interaction
• Software Development: Code generation and bug detection
• Creative fields: Story writing, poetry, music lyrics
History of LLMs
Understanding the history of Large Language Models (LLMs) helps us appreciate
how far we’ve come in natural language processing (NLP) and the innovations that
made today’s AI systems possible.

This section walks through the key milestones — from early statistical models to
the modern transformer revolution.

Early NLP Approaches

Before LLMs, language tasks were handled using:

• Rule-based systems: Manually written logic for grammar and syntax.


• Statistical models: Such as n-gram models, which predicted the next word
based on a fixed window of previous words.
• Bag-of-words and TF-IDF: Used for basic text classification but ignored word
order and context.

These models worked for simple tasks, but failed to capture deeper meaning,
semantics, or long-range dependencies in language.

The Rise of Neural Networks

With the rise of deep learning, models began learning richer representations:

• Word Embeddings like Word2Vec (2013) and GloVe (2014) mapped words to
continuous vector spaces.
• Recurrent Neural Networks (RNNs) and LSTMs enabled models to process
sequences, but they struggled with long texts and parallel processing.
Transformers: The Game Changer

In 2017, Google introduced the Transformer architecture in the paper “Attention Is All You Need.”

Key features of transformers:

• Self-attention mechanism allows the model to weigh the importance of different words, regardless of their position.
• Enables parallelization, making training on massive datasets feasible.

This led to a new generation of LLMs:

| Model | Year | Key Contribution |
| --- | --- | --- |
| BERT | 2018 | Bidirectional context understanding |
| GPT-1 | 2018 | Introduced unidirectional generation |
| GPT-2 | 2019 | Generated coherent long-form text |
| T5 | 2020 | Unified text-to-text framework |
| GPT-3 | 2020 | 175B parameters, capable of few-shot learning |
| ChatGPT / GPT-3.5 / GPT-4 | 2022–2023 | Conversational abilities, better reasoning |
| Claude, Gemini, LLaMA, Mistral | 2023+ | Open-source and scalable alternatives |

Pretraining & Finetuning

The modern LLM pipeline consists of:

1. Pretraining on a large corpus of general text (e.g., books, Wikipedia, web pages).
2. Finetuning for specific tasks (e.g., summarization, coding help).
3. Reinforcement Learning from Human Feedback (RLHF) — used to make
models safer and more helpful (e.g., ChatGPT).

Summary

The evolution of LLMs is a story of scale, data, and architecture. The shift from
handcrafted rules to deep neural transformers has allowed machines to understand
and generate language with remarkable fluency.
How LLMs Work
In this section, we break down the inner workings of Large Language Models
(LLMs). While these models seem like magic from the outside, they are grounded in
fundamental machine learning and deep learning principles — especially the
transformer architecture.

We’ll go through how LLMs process text, represent meaning, and generate
coherent outputs.

Tokenization: Breaking Text into Units

LLMs do not process text as raw strings. Instead, they break input text into smaller
units called tokens. Tokens can be:

• Whole words (for simple models)


• Subwords (e.g., “un” + “believ” + “able”)
• Characters (rare for LLMs, used in specific domains)

This process helps reduce the vocabulary size and handle unknown or rare words
efficiently.

Popular tokenizers include Byte-Pair Encoding (BPE) and SentencePiece.
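As a quick sketch of tokenization in practice, assuming the Hugging Face transformers library is installed (the "gpt2" checkpoint is just an example of a model with a BPE-based tokenizer):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")       # GPT-2 uses a Byte-Pair Encoding (BPE) tokenizer

tokens = tokenizer.tokenize("Unbelievable tokenization!")
print(tokens)                                            # subword pieces; the exact split depends on the vocabulary
print(tokenizer.encode("Unbelievable tokenization!"))    # the corresponding integer token IDs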

Embeddings: Converting Tokens to Vectors

Once text is tokenized, each token is mapped to a high-dimensional vector through an embedding layer. These embeddings capture relationships between words based on context.

For example, the words “king” and “queen” will be closer in the embedding space
than unrelated words like “banana” or “car”.
The Transformer Architecture

The core of LLMs is the transformer, introduced in 2017. It replaced earlier models
like RNNs and LSTMs by allowing for better performance and scalability.

Key components of a transformer:

• Self-Attention Mechanism: Enables the model to focus on different parts of the input when processing each token. For example, in the sentence “The cat sat on the mat,” the word “sat” may attend more to “cat” and “mat” than to “the”.
• Multi-Head Attention: Allows the model to capture different types of
relationships simultaneously.
• Feedforward Networks: Add depth and complexity to the model.
• Positional Encoding: Since transformers process all tokens in parallel, they
need a way to encode the order of tokens.

These components are stacked in layers — more layers typically mean more
modeling power.

Training LLMs: Predicting the Next Token

LLMs are trained using a simple but powerful objective: predict the next token
given the previous tokens.

For example:

• Input: “The sun rises in the”


• Output: “east”

The model adjusts its internal weights using a large dataset and gradient descent
to minimize prediction error. Over billions of examples, the model learns grammar,
facts, reasoning patterns, and even basic common sense.
This process is known as causal language modeling in models like GPT. Other
models like BERT use masked language modeling, where random tokens are
hidden and the model must predict them.

Generation: Producing Human-like Text

Once trained, the model can generate text by predicting one token at a time:

1. Start with an input prompt.


2. Predict the next token based on context.
3. Append the new token to the prompt.
4. Repeat until a stopping condition is met.

Several sampling strategies control the output:

• Greedy decoding: Always choose the most likely next token.


• Beam search: Explore multiple token sequences in parallel.
• Top-k / top-p sampling: Add randomness for more creative or diverse outputs.
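Below is a minimal greedy-decoding sketch of this loop, assuming the Hugging Face transformers library and PyTorch are installed; the small "gpt2" checkpoint is used purely as an example:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The sun rises in the", return_tensors="pt").input_ids

for _ in range(5):                                   # generate 5 tokens, one at a time
    logits = model(input_ids).logits                 # scores for every vocabulary token
    next_id = logits[0, -1].argmax()                 # greedy: pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)  # append and repeat

print(tokenizer.decode(input_ids[0]))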

Limitations of LLMs

Despite their capabilities, LLMs have limitations:

• No true understanding: They learn patterns, not meaning.


• Hallucinations: They can generate plausible but false information.
• Bias: Trained on large web corpora, they can inherit societal biases.
• Compute-intensive: Training and running LLMs requires significant hardware
resources.
Introduction to RAG-based Systems
As Large Language Models (LLMs) become central to modern AI applications, a key
limitation remains: they don’t know anything beyond their training data. They
cannot access up-to-date information or internal company documents unless
explicitly provided.

This is where RAG (Retrieval-Augmented Generation) systems come in. RAG bridges the gap between language generation and external knowledge, making LLMs more accurate, dynamic, and context-aware.

What is a RAG System?

Retrieval-Augmented Generation (RAG) is an AI architecture that combines:

1. A retriever – to search a knowledge base for relevant documents or facts.


2. A generator (LLM) – to synthesize a response using both the retrieved content
and the input question.

Rather than generating answers purely from internal memory (which may be
outdated or incomplete), a RAG system fetches real documents and grounds the
model’s output in that information.

Why Use RAG?

RAG addresses several key challenges of LLMs:

| Problem in LLMs | How RAG Helps |
| --- | --- |
| Hallucination (made-up facts) | Anchors generation in real data |
| Outdated knowledge | Uses fresh, external sources like databases or websites |
| Limited context window | Dynamically injects only the most relevant info |
| Domain-specific needs | Connects model to private corpora, PDFs, etc. |

Basic Workflow of a RAG System

1. User query →
2. Retriever fetches relevant documents (e.g., from a vector database) →
3. Documents + query are passed to the LLM →
4. LLM generates a grounded, accurate response.

This loop allows the model to act more like a researcher with access to a
searchable library.

Components of a RAG Pipeline

• Embedding Model: Converts text into dense vectors to enable similarity search.
• Vector Store: A searchable index (e.g., FAISS, Weaviate, Pinecone) where
document embeddings are stored.
• Retriever: Queries the vector store with the input to find top-k most similar
documents.
• LLM: Uses the retrieved content and prompt to generate a final response.
• Optional Reranker: Improves retrieval quality by reordering results.
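A minimal end-to-end sketch of these components, assuming the sentence-transformers and faiss-cpu packages are installed; the documents, the embedding model name, and the final generate_answer() helper are hypothetical placeholders:

import faiss
from sentence_transformers import SentenceTransformer

# Tiny in-memory knowledge base (placeholder documents)
docs = [
    "Our refund policy allows returns within 30 days.",
    "Support is available Monday to Friday, 9am to 6pm.",
    "Premium plans include priority email support.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")          # embedding model
doc_vectors = embedder.encode(docs).astype("float32")       # shape: (n_docs, dim)

index = faiss.IndexFlatL2(doc_vectors.shape[1])             # vector store
index.add(doc_vectors)

query = "When can I get a refund?"
query_vec = embedder.encode([query]).astype("float32")
_, top_k = index.search(query_vec, 2)                       # retriever: top-2 most similar docs

context = "\n".join(docs[i] for i in top_k[0])
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
# answer = generate_answer(prompt)   # hypothetical call to the LLM of your choice
print(prompt)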
Example Use Cases

• Enterprise chatbots: Pull answers from internal documents and manuals.


• Customer support: Query knowledge bases in real-time.
• Academic research tools: Generate summaries grounded in actual papers.
• Healthcare assistants: Retrieve clinical guidelines or patient history for
personalized advice.
