AIML Internship Report

ABSTRACT

This report details the development and implementation of a machine learning-based
model for predicting rainfall, utilizing historical weather data. The primary objective of
this project was to explore the potential of machine learning algorithms in forecasting
rainfall events, which can be pivotal for sectors such as agriculture, water management,
and disaster risk reduction. The dataset used for training the model includes key
meteorological features such as temperature, humidity, atmospheric pressure, wind speed,
and previous rainfall measurements.

Several machine learning models were explored, including Linear Regression,
Random Forest, and Support Vector Machines (SVM), to identify the best approach for
rainfall prediction. Data preprocessing was a critical component, involving steps like
handling missing data, normalizing features, and encoding categorical variables to prepare
the data for modeling.

Among the models tested, Random Forest was found to deliver the most accurate
predictions in terms of both classification accuracy and the ability to handle complex, non-
linear relationships between the input variables. Performance evaluation was carried out
using various metrics such as accuracy, precision, recall, and Root Mean Squared Error
(RMSE), with Random Forest outperforming other algorithms in all these categories.

The findings from this study underscore the effectiveness of machine learning in
weather forecasting, particularly for predicting rainfall. This project demonstrates that with
proper data preprocessing and model selection, machine learning can provide reliable
rainfall predictions that can aid in resource planning, disaster preparedness, and climate
monitoring. Furthermore, this work opens avenues for future exploration, such as
integrating additional weather variables or utilizing deep learning techniques for even
higher accuracy in prediction tasks.

INDEX

S.NO  CONTENTS

1.    Introduction
2.    Python and why it's used?
3.    Machine Learning Algorithms
4.    TensorFlow
5.    SciKit-Learn
6.    Training & Testing Data
7.    Project

INTRODUCTION
Machine Learning is the science of getting computers to learn without being explicitly
programmed. It is closely related to computational statistics, which focuses on making
predictions using computers; in its application across business problems, machine learning
is also referred to as predictive analytics. Machine learning focuses on the development of
computer programs that can access data and use it to learn for themselves. The process of
learning begins with observations or data, such as examples, direct experience, or
instruction, in order to look for patterns in data and make better decisions in the future
based on the examples that we provide. The primary aim is to allow computers to learn
automatically, without human intervention or assistance, and to adjust their actions accordingly.

History of Machine Learning

The name machine learning was coined in 1959 by Arthur Samuel. Tom M. Mitchell
provided a widely quoted, more formal definition of the algorithms studied in the machine
learning field: "A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P if its performance at tasks in T, as measured
by P, improves with experience E." This follows Alan Turing's proposal in his paper
"Computing Machinery and Intelligence", in which the question "Can machines think?" is
replaced with the question "Can machines do what we (as thinking entities) can do?". In
Turing's proposal, the characteristics that could be possessed by a thinking machine and
the various implications of constructing one are set out.

Types Of Machine Learning

The types of machine learning algorithms differ in their approach, the type of data they
input and output, and the type of task or problem that they are intended to solve. Broadly,
machine learning can be categorized into four categories.

I. Supervised Learning
II. Unsupervised Learning
III. Reinforcement Learning
IV. Semi-supervised Learning

Machine learning enables analysis of massive quantities of data. While it generally
delivers faster, more accurate results in order to identify profitable opportunities or
dangerous risks, it may also require additional time and resources to train properly.

SUPERVISED LEARNING
Supervised learning is a type of learning in which we are given a data set and we
already know what the correct output should look like, with the idea that there is a
relationship between the input and the output. Basically, it is the task of learning a
function that maps an input to an output based on example input-output pairs. It infers a
function from labeled training data consisting of a set of training examples.
Supervised learning problems are categorized into regression and classification problems.

Key Characteristics:
1. Labeled Data: Supervised learning requires labeled data, where each example is
accompanied by a target output.
2. Training: The algorithm learns from the labeled data during the training phase.
3. Prediction: The trained model makes predictions on new, unseen data.
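
As a brief illustration of these characteristics, the short sketch below (with made-up toy data) trains a scikit-learn classifier on labeled examples and then predicts on a new input; the values are purely illustrative.

from sklearn.linear_model import LogisticRegression

# Labeled data: each row of X is an input, each entry of y is its target label
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)                   # training phase: the algorithm learns from labeled data
print(model.predict([[1, 1]]))    # prediction on new, unseen data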

UNSUPERVISED LEARNING
Unsupervised learning is a type of learning that allows us to approach problems with
little or no idea of what the results should look like. We can derive structure by clustering
the data based on relationships among the variables in the data. With unsupervised learning
there is no feedback based on the prediction results. Basically, it is a type of self-organized
learning that helps in finding previously unknown patterns in a data set without pre-existing
labels.

Key Characteristics:
1. Unlabeled Data: Unsupervised learning uses unlabeled data, where no target output is
provided.
2. Discovery: The algorithm discovers patterns, relationships, or groupings in the data.
3. No Supervision: No human expertise or labeled data is required.
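
A minimal sketch of this idea with scikit-learn is shown below: k-means groups a handful of unlabeled points into two clusters without any target labels (the coordinates are invented for illustration).

from sklearn.cluster import KMeans

# Unlabeled data: no target output is provided
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0], [1.0, 0.6], [9.0, 11.0]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)    # the algorithm discovers groupings on its own
print(labels)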

REINFORCEMENT LEARNING

Reinforcement learning is a learning method that interacts with its environment by
producing actions and discovering errors or rewards. Trial-and-error search and delayed
reward are the most relevant characteristics of reinforcement learning. This method allows
machines and software agents to automatically determine the ideal behavior within a
specific context in order to maximize performance. Simple reward feedback is required
for the agent to learn which action is best.

Key Characteristics:
1. Agent: The decision-making entity.
2. Environment: The external system with which the agent interacts.
3. Actions: The agent's decisions.
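
The toy sketch below illustrates this loop: an agent repeatedly picks actions, a hypothetical environment returns rewards, and a Q-table is updated from that feedback. The environment, states, and rewards here are invented purely for illustration.

import random

n_states, n_actions = 5, 2
Q = [[0.0] * n_actions for _ in range(n_states)]
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount factor, exploration rate

def step(state, action):
    # Hypothetical environment: reward 1 only for action 1 taken in the last state
    reward = 1.0 if (state == n_states - 1 and action == 1) else 0.0
    next_state = min(state + 1, n_states - 1)
    return next_state, reward

state = 0
for _ in range(100):
    # Epsilon-greedy choice: mostly exploit the best known action, sometimes explore
    if random.random() < epsilon:
        action = random.randrange(n_actions)
    else:
        action = max(range(n_actions), key=lambda a: Q[state][a])
    next_state, reward = step(state, action)
    # Q-learning update: move the estimate toward reward + discounted future value
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = 0 if state == n_states - 1 else next_state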

SEMI SUPERVISED LEARNING

Semi-supervised learning falls somewhere in between supervised and unsupervised
learning, since it uses both labeled and unlabeled data for training – typically a small
amount of labeled data and a large amount of unlabeled data. Systems that use this
method are able to considerably improve learning accuracy. Usually, semi-supervised
learning is chosen when labeling the acquired data requires skilled and relevant resources,
whereas acquiring unlabeled data generally doesn't require additional resources.

Applications Of Machine Learning

Machine learning is one of the most exciting technologies that one could come
across. As is evident from the name, it gives the computer the ability that makes it more
similar to humans: the ability to learn. Machine learning is actively being used today,
perhaps in many more places than one would expect. We probably use a learning algorithm
dozens of times a day without even knowing it. Applications of machine learning include:
• Web Search Engines: One of the reasons why search engines like Google and Bing
work so well is that the system has learnt how to rank pages through a complex
learning algorithm.
• Photo Tagging Applications: Be it Facebook or any other photo tagging application,
the ability to tag friends makes it even more engaging. It is all possible because of
a face recognition algorithm that runs behind the application.
• Spam Detectors: Mail agents like Gmail or Hotmail do a lot of hard work for
us in classifying mails and moving spam mails to the spam folder. This is
achieved by a spam classifier running in the back end of the mail application.

OBJECTIVES
Main objectives of training were to learn:
• How to determine and measure program complexity,
• Python Programming
• ML libraries: Scikit-learn, NumPy, Matplotlib, Pandas.
• Statistical mathematics for the algorithms.
• Solving statistical and mathematical problems.
• Supervised and Unsupervised Learning
• Classification and Regression
• ML Algorithms

2.Python – The New Generation Language
Python is a widely used general-purpose, high-level programming language. It was
initially designed by Guido van Rossum, first released in 1991, and is developed by the
Python Software Foundation. It was designed mainly with an emphasis on code readability,
and its syntax allows programmers to express concepts in fewer lines of code. Python is
dynamically typed and garbage-collected. It supports multiple programming paradigms,
including procedural, object-oriented, and functional programming. Python is often described
as a "batteries included" language due to its comprehensive standard library.

Feature of Python:

• Interpreted -
In Python there are no separate compilation and execution steps as in C/C++. It directly
runs the program from the source code. Internally, Python converts the source code into
an intermediate form called bytecode, which is then translated into the native language of
the specific machine to run it.
• Platform Independent -
Python programs can be developed and executed on multiple operating system
platforms. Python can be used on Linux, Windows, Macintosh, Solaris and many more.
• Multi-Paradigm -
Python is a multi-paradigm programming language. Object-oriented programming and
structured programming are fully supported, and many of its features support functional
programming and aspect-oriented programming.
• Simple -
Python is a very simple language. It is very easy to learn as it is close to the English
language. In Python, more emphasis is on the solution to the problem rather than on the
syntax.
• Rich Library Support -
Python standard library is very vast. It can help to do various things involving regular
expressions, documentation generation, unit testing, threading, databases, web
browsers, CGI, email, XML, HTML, WAV files, cryptography, GUI and many more.
• Free and Open Source -
Firstly, Python is freely available. Secondly, it is open-source. This means that its
source code is available to the public. We can download it, change it, use it, and
distribute it.

Why Python Is a Perfect Language for Machine
Learning?

1. A great library ecosystem -

A great choice of libraries is one of the main reasons Python is the most popular
programming language used for AI. A library is a module or a group of modules
published by different sources which include pre-written pieces of code that allow
users to reach some functionality or perform different actions. Python libraries
provide base-level items so developers don't have to code them from the very
beginning every time. ML requires continuous data processing, and Python's
libraries let us access, handle and transform data. These are some of the most
widespread libraries you can use for ML and AI:
o Scikit-learn for handling basic ML algorithms like clustering, linear and
logistic regression, classification, and others.
o Pandas for high-level data structures and analysis. It allows merging and
filtering of data, as well as gathering it from other external sources like
Excel, for instance.
o Keras for deep learning. It allows fast calculations and prototyping, as
it uses the GPU in addition to the CPU of the computer.
o TensorFlow for working with deep learning by setting up, training, and
utilizing artificial neural networks with massive datasets.
o Matplotlib for creating 2D plots, histograms, charts, and other forms of
visualization.
o NLTK for working with computational linguistics, natural
language recognition, and processing.
o Scikit-image for image processing.
In the PyPI repository, we can discover and compare more Python libraries.

2. A low entry barrier -

Working in the ML and AI industry means dealing with a lot of data that we
need to process in the most convenient and effective way. The low entry barrier
allows more data scientists to quickly pick up Python and start using it for AI
development without wasting too much effort on learning the language.

3. Flexibility-
Python for machine learning is a great choice, as this language is very flexible:
▪ It offers an option to choose between OOP and scripting.
▪ There's also no need to recompile the source code; developers can implement
any changes and quickly see the results.
▪ Programmers can combine Python and other languages to reach their goals.

4. Good Visualization Options -

For AI developers, it's important to highlight that in artificial intelligence, deep
learning, and machine learning, it's vital to be able to represent data in a human-
readable format. Libraries like Matplotlib allow data scientists to build charts,
histograms, and plots for better data comprehension, effective presentation, and
visualization. Different application programming interfaces also simplify the
visualization process and make it easier to create clear reports.

5. Community Support-
It’s always very helpful when there’s strong community support built around the
programming language. Python is an open-source language which means that
there’s a bunch of resources open for programmers starting from beginners and
ending with pros. A lot of Python documentation is available online as well as in
Python communities and forums, where programmers and machine learning
developers discuss errors, solve problems, and help each other out. Python
programming language is absolutely free as is the variety of useful libraries and
tools.

6. Growing Popularity-
As a result of the advantages discussed above, Python is becoming more and more
popular among data scientists. According to StackOverflow, the popularity of
Python was predicted to keep growing until at least 2020. This means it is easier to
search for developers and to replace team members if required. Also, the cost of their
work may not be as high as when using a less popular programming language.

Data Preprocessing , Analyzing and Visualization

Machine Learning algorithms don't work so well with raw data. Before we
can feed such data to an ML algorithm, we must preprocess it by applying some
transformations. With data preprocessing, we convert raw data into a clean data
set. To perform data preprocessing, there are seven common techniques -

1. Rescaling Data -
For data with attributes of varying scales, we can rescale the attributes to possess the same
scale. We rescale attributes into the range 0 to 1 and call it normalization. We use the
MinMaxScaler class from scikit-learn. This gives us values between 0 and 1.

2. Standardizing Data -
With standardizing, we can take attributes with a Gaussian distribution and different
means and standard deviations and transform them into a standard Gaussian
distribution with a mean of 0 and a standard deviation of 1.

3. Normalizing Data -
In this task, we rescale each observation to a length of 1 (a unit norm). For this, we use
the Normalizer class.

4. Binarizing Data -
Using a binary threshold, it is possible to transform our data by marking the values
above it 1 and those equal to or below it, 0. For this purpose, we use the Binarizer class.

5. Mean Removal-
We can remove the mean from each feature to center it on zero.

6. One Hot Encoding -


When dealing with a few scattered numerical values, we may not need to store them as such;
then we can perform one-hot encoding. For k distinct values, we can transform the
feature into a k-dimensional vector with a single value of 1 and 0 for the rest of the values.

7. Label Encoding -
Some labels can be words or numbers. Usually, training data is labelled with words to
make it readable. Label encoding converts word labels into numbers so that algorithms
can work on them.
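
A short sketch of a few of these techniques with scikit-learn is shown below; the tiny array and labels are made up purely for illustration.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, Binarizer, LabelEncoder

X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])

print(MinMaxScaler().fit_transform(X))              # rescaling each attribute into the 0-1 range
print(Binarizer(threshold=250.0).fit_transform(X))  # binarizing: values above the threshold become 1

labels = ["rain", "no rain", "rain"]
print(LabelEncoder().fit_transform(labels))         # label encoding: word labels become integers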
Types of Data Visualization:
1. Univariate Plots: Histograms, box plots, and density plots.

2. Multivariate Plots: Scatter plots, heatmaps, and parallel coordinates.
3. Interactive Visualization: Dashboards and interactive plots.
Tools for Data Preprocessing, Analysis, and Visualization:
1. Python Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn.
2. R Libraries: dplyr, tidyr, ggplot2, caret.
3. Data Visualization Tools: Tableau, Power BI, D3.js.

Example Code (Python):


import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
# Load data
df = pd.read_csv('data.csv')
# Handle missing values
df.fillna(df.mean(), inplace=True)
# Standardize features
scaler = StandardScaler()
df_scaled = scaler.fit_transform(df)
plt.scatter(df_scaled[:, 0], df_scaled[:, 1])
plt.show()

3.MACHINE LEARNING ALGORITHMS

There are many types of machine learning algorithms suited to different use cases. As
we work with datasets, a machine learning algorithm works in two stages. We usually split
the data roughly 80%-20% between the training and testing stages. Under supervised learning,
we split a dataset into training data and test data in Python ML. The following are some of
the main algorithms used in Python machine learning.

1. Linear Regression -
Linear regression is one of the supervised Machine learning algorithms in Python that
observes continuous features and predicts an outcome. Depending on whether it runs on a
single variable or on many features, we can call it simple linear regression or multiple
linear regression.
This is one of the most popular Python ML algorithms and is often under-appreciated. It
assigns optimal weights to variables to create a line, ax + b, to predict the output. We often
use linear regression to estimate real values, like the number of calls or the cost of houses,
based on continuous variables. The regression line is the best-fitting line Y = a*X + b,
denoting the relationship between the independent and dependent variables.

Linear Regression Equation:


Y = β0 + β1X + ε
where:
- Y: outcome variable
- X: predictor variable
- β0: intercept
- β1: slope
- ε: error term
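
As a small illustration of how this equation is fitted in practice, the snippet below estimates β0 and β1 with scikit-learn on a few invented points that roughly follow y = 2x.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])    # single predictor variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])   # roughly y = 2x plus noise

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)       # estimates of the intercept β0 and slope β1
print(model.predict([[6]]))                # prediction for a new input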

2. Logistic Regression -
Logistic regression is a supervised classification algorithm in Python that finds its use in
estimating discrete values like 0/1, yes/no, and true/false, based on a given set of independent
variables. We use a logistic function to predict the probability of an event, and this gives us
an output between 0 and 1. Although it says 'regression', this is actually a classification
algorithm. Logistic regression fits data into a logit function and is also called logit regression.
Logistic Regression Equation:
p = 1 / (1 + e^(-z))
where:
- p: probability of the event occurring
- e: base of the natural logarithm
- z: weighted sum of input variables

Types of Logistic Regression:


1. Binary Logistic Regression: Used for binary classification problems.
2. Multinomial Logistic Regression: Used for multi-class classification problems.
3. Ordinal Logistic Regression: Used for ordinal classification problems.
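
The small sketch below ties the logistic function to scikit-learn's LogisticRegression on invented binary data: predict_proba returns the probability p, and predict applies the 0.5 threshold.

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[2.0]]))   # applies p = 1 / (1 + e^(-z)) to the learned weighted sum z
print(clf.predict([[2.0]]))         # class decision at the 0.5 threshold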

3. Decision Tree -
A decision tree falls under supervised machine learning algorithms in Python and is used for
both classification and regression, although mostly for classification. This model
takes an instance, traverses the tree, and compares important features against a determined
conditional statement. Whether it descends to the left child branch or the right depends on
the result. Usually, more important features are closer to the root.
Decision Tree, as a machine learning algorithm in Python, can work on both categorical and
continuous dependent variables. Here, we split a population into two or more homogeneous
sets. Tree models where the target variable can take a discrete set of values are called
classification trees; in these tree structures, leaves represent class labels and branches
represent conjunctions of features that lead to those class labels. Decision trees where the
target variable can take continuous values (typically real numbers) are called regression
trees.

How Decision Trees Work?


The process of creating a decision tree involves:
1. Selecting the Best Attribute: Using a metric like Gini impurity, entropy, or
information gain, the best attribute to split the data is selected.
2. Splitting the Dataset: The dataset is split into subsets based on the selected attribute.
3. Repeating the Process: The process is repeated recursively for each subset, creating
a new internal node or leaf node until a stopping criterion is met (e.g., all instances in
a node belong to the same class or a predefined depth is reached).

Python code for Decision Tree (using TensorFlow Decision Forests):

import tensorflow_decision_forests as tfdf
import pandas as pd

# Load dataset ("your_data.csv" and "target_column" are placeholders for your own file and label)
data = pd.read_csv("your_data.csv")
train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(data, label="target_column")

# Train a tree-based model (RandomForestModel builds an ensemble of decision trees)
model = tfdf.keras.RandomForestModel()
model.fit(train_ds)

# Inspect the trained model
model.summary()

4. Support Vector Machine -

SVM is a supervised classification algorithm and one of the most important machine learning
algorithms in Python. It plots a line that divides the different categories of your data. In
this ML algorithm, we calculate the vector to optimize the line, ensuring that the
closest point in each group lies farthest from the others. While you will almost always find
this to be a linear vector, it can be other than that. An SVM model is a representation of
the examples as points in space, mapped so that the examples of the separate categories are
divided by a clear gap that is as wide as possible. In addition to performing linear
classification, SVMs can efficiently perform non-linear classification using what is called
the kernel trick, implicitly mapping their inputs into high-dimensional feature spaces. When
data are unlabeled, supervised learning is not possible, and an unsupervised learning
approach is required, which attempts to find natural clustering of the data into groups and
then maps new data to these formed groups.
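
A minimal sketch of an SVM classifier with scikit-learn is shown below; the points are invented so that the two categories are separated by a clear gap.

from sklearn import svm

X = [[0, 0], [1, 1], [2, 2], [8, 8], [9, 9], [10, 10]]
y = [0, 0, 0, 1, 1, 1]

clf = svm.SVC(kernel="linear")      # swap in kernel="rbf" for a non-linear decision boundary
clf.fit(X, y)
print(clf.predict([[3, 3], [7, 7]]))
print(clf.support_vectors_)         # the points that define the maximum-margin separating line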

5. Naive Bayes Algorithm -

Naive Bayes is a classification method based on Bayes' theorem. It assumes
independence between predictors: a Naive Bayes classifier assumes that a feature in a
class is unrelated to any other feature. Consider a fruit: it is an apple if it is round, red, and
2.5 inches in diameter. A Naive Bayes classifier will say that these characteristics
independently contribute to the probability of the fruit being an apple, even if the features
depend on each other. For very large data sets, it is easy to build a Naive Bayes model. Not
only is this model very simple, it often performs better than many highly sophisticated
classification methods. Naive Bayes classifiers are highly scalable, requiring a number of
parameters linear in the number of variables (features/predictors) in a learning problem.
Maximum-likelihood training can be done by evaluating a closed-form expression, which
takes linear time, rather than by expensive iterative approximation as used for many other
types of classifiers.

The Naive Bayes classification formula (Bayes' theorem) is:

P(A|B) = P(B|A) * P(A) / P(B)
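
A small sketch of a Gaussian Naive Bayes classifier is given below, using invented fruit-like features (roundness, redness, diameter in inches) just to mirror the apple example above.

from sklearn.naive_bayes import GaussianNB

X = [[0.9, 0.8, 2.5], [0.8, 0.9, 2.4], [0.3, 0.2, 6.0], [0.4, 0.1, 5.5]]
y = ["apple", "apple", "other", "other"]

clf = GaussianNB().fit(X, y)
print(clf.predict([[0.85, 0.7, 2.6]]))   # each feature contributes independently to the class probability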

6. KNN Algorithm -
This is a Python machine learning algorithm for classification and regression, though mostly
for classification. It is a supervised learning algorithm that uses a distance function, usually
Euclidean, to compare a new point with the stored training examples and assigns it to the
group of its closest points. It classifies new cases using a majority vote of its k nearest
neighbours: the class it assigns is the one most common among the K nearest neighbours,
as measured by the distance function. k-NN is a type of instance-based learning, or lazy
learning, where the function is only approximated locally and all computation is deferred
until classification. k-NN is a special case of a variable-bandwidth, kernel density "balloon"
estimator with a uniform kernel.

Euclidean Distance
This is simply the Cartesian distance between two points in the plane/hyperplane. Euclidean
distance can also be visualized as the length of the straight line that joins the two points
under consideration. This metric gives the net displacement between the two states of an object.

distance(x, Xi) = sqrt( Σ (xj − Xij)² ), where the sum runs over j = 1, ..., d

o It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification, performs
an action on the dataset.
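
A minimal k-NN sketch with scikit-learn is shown below (toy points, purely illustrative); note that fit simply stores the data, matching the lazy-learner behaviour described above.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 1], [1, 2], [2, 1], [6, 6], [7, 7], [6, 7]]
y = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance is the default metric
knn.fit(X, y)                               # "lazy": the training set is just stored
print(knn.predict([[2, 2], [7, 6]]))        # majority vote among the 3 nearest neighbours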

7. K-Means Algorithm -
k-Means is an unsupervised algorithm that solves the clustering problem. It classifies
data using a number of clusters; the data points inside a cluster are homogeneous and
heterogeneous to peer groups. k-means clustering is a method of vector quantization,
originally from signal processing, that is popular for cluster analysis in data mining.
k-means clustering aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean, which serves as a prototype of the cluster.
k-means clustering is rather easy to apply to even large data sets, particularly when using
heuristics such as Lloyd's algorithm. It is often used as a preprocessing step for other
algorithms, for example to find a starting configuration. The problem is computationally
difficult (NP-hard). k-means originates from signal processing and still finds use in that
domain. In cluster analysis, the k-means algorithm can be used to partition the input data
set into k partitions (clusters). k-means clustering has also been used as a feature learning
(or dictionary learning) step, in either (semi-)supervised learning or unsupervised learning.

A Python Code Implementation:

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

# Generate synthetic data with three natural clusters
X, Y = make_blobs(n_samples=500, n_features=2, centers=3, random_state=23)

# Fit k-means with k = 3 and plot the points coloured by the assigned cluster
kmeans = KMeans(n_clusters=3, n_init=10, random_state=23)
labels = kmeans.fit_predict(X)

plt.figure(0)
plt.grid(True)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.show()

8. Random Forest -

A random forest is an ensemble of decision trees. In order to classify every new object
based on its attributes, the trees vote for a class: each tree provides a classification, and the
classification with the most votes wins in the forest. Random forests (or random decision
forests) are an ensemble learning method for classification, regression and other tasks that
operates by constructing a multitude of decision trees at training time and outputting the
class that is the mode of the classes (classification) or the mean prediction (regression) of
the individual trees.
Algorithm for Random Forest Work:
1. Step 1: Select random K data points from the training set.
2. Step 2: Build the decision trees associated with the selected data points (Subsets).
3. Step 3: Choose the number N for decision trees that you want to build.
4. Step 4: Repeat Step 1 and 2.
5. Step 5: For new data points, find the prediction of each decision tree, and assign the
new data point to the category that wins the majority of votes.
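
The compact sketch below mirrors these steps with scikit-learn's RandomForestClassifier on synthetic data; the ensemble of trees votes and the majority class wins (the parameters are illustrative).

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)   # N = 100 trees
forest.fit(X, y)
print(forest.predict(X[:5]))          # each prediction is a majority vote over the trees
print(forest.feature_importances_)    # how much each feature contributed to the splits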

Key Machine Learning Concepts:


Feature Engineering: The process of transforming raw data into meaningful inputs for the
model. This often includes selecting, scaling, and encoding variables.

Model Evaluation Metrics: Metrics such as accuracy, precision, recall, and F1-score evaluate
a model’s effectiveness. Cross-validation further ensures model reliability by testing it on
multiple data splits.

Overfitting and Regularization: Overfitting occurs when a model performs well on training
data but poorly on unseen data. Regularization techniques, like L1 and L2 penalties, help
prevent overfitting by discouraging complex model structures.
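
As a brief sketch of these ideas, the snippet below cross-validates an L2-regularized logistic regression on synthetic data; the dataset and parameter values are purely illustrative.

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

clf = LogisticRegression(penalty="l2", C=1.0)   # C controls the strength of the L2 penalty
scores = cross_val_score(clf, X, y, cv=5)       # accuracy on 5 different train/test splits
print(scores.mean(), scores.std())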

4.Machine Learning Using TensorFlow

TensorFlow is an open-source machine learning platform that enables users to build, train, and
deploy machine learning models efficiently. Its architecture is designed for both research and
production, offering flexibility, scalability, and a robust set of tools for developing
sophisticated machine learning solutions.

Overview of TensorFlow
TensorFlow is an end-to-end platform for machine learning and deep learning tasks. Originally
developed for internal use by Google, it was released as an open-source framework in 2015,
which helped it gain popularity across industry and academia. TensorFlow allows users to
construct and execute complex machine learning models efficiently, leveraging computational
resources such as GPUs and TPUs.
• Ecosystem: TensorFlow offers a complete ecosystem for machine learning, including
data preprocessing, model building, training, and deployment tools.
• Flexibility: TensorFlow supports a range of machine learning applications, from
traditional models to cutting-edge neural networks.
• Community Support: The TensorFlow community contributes a wealth of resources,
such as tutorials, pre-trained models, and additional libraries.

Key Components of TensorFlow


TensorFlow provides several core components that simplify the process of building and
training machine learning models.
Tensors
• Definition: Tensors are multidimensional arrays that serve as the primary data
structure in TensorFlow. They can represent scalars, vectors, matrices, and higher-
dimensional arrays.
• Operations: TensorFlow performs computations on tensors, including basic
arithmetic operations, transformations, and matrix manipulations. These operations
are optimized for parallel computing on CPUs, GPUs, and TPUs.
Computational Graphs
• Graph Structure: TensorFlow uses computational graphs to define operations as
nodes and data as edges. This graph-based approach enables efficient computation by
organizing and optimizing tasks before executing them.
• Eager Execution: In TensorFlow 2.x, eager execution allows operations to be
evaluated immediately, similar to standard Python, which improves code readability
and debugging.

TensorFlow Tools and APIs for Machine Learning


TensorFlow includes several powerful APIs and tools for building, training, and deploying
machine learning models.
tf.keras:

tf.keras is a high-level API within TensorFlow that simplifies the construction, training, and
evaluation of machine learning models. It supports both Sequential and Functional APIs:
• Sequential API: Used for building simple linear stacks of layers, which is ideal for
straightforward neural network architectures.
• Functional API: Allows for the creation of complex, non-linear models such as multi-
input, multi-output, and custom layer models.

TensorFlow Data Pipelines:


TensorFlow’s tf.data API enables efficient handling of large datasets and pre-processing data
before feeding it into the model. The tf.data.Dataset class provides methods for:
• Data Loading: Reading data from various sources, including files, databases, and
APIs.
• Data Transformation: Applying operations like shuffling, batching, and mapping to
prepare the data.
• Data Augmentation: Applying real-time transformations to images or text, which can
help reduce overfitting and improve model robustness.

TensorBoard:
TensorBoard is TensorFlow's visualization tool for tracking and analysing model performance.
It provides a graphical interface to view:
• Training Metrics: Visualize loss, accuracy, and other key metrics during training.
• Model Graphs: Display computational graphs to understand model architecture.
• Hyperparameter Tuning: Track different hyperparameter settings and their impacts
on performance.
TensorBoard is an essential tool for diagnosing issues, monitoring training, and refining model
parameters.

TensorFlow Hub:
TensorFlow Hub is a library for accessing and reusing pre-trained models. By using pre-trained
models, developers can:
• Transfer Learning: Improve model accuracy and reduce training time by leveraging
models that have been trained on large datasets.
• Domain-Specific Models: Access specialized models (e.g., for image classification,
text embeddings) and integrate them with custom applications.
• Fine-Tuning: Modify the final layers of pre-trained models to adapt them for specific
tasks.

Building Models with TensorFlow


TensorFlow simplifies model creation, training, and tuning, allowing users to focus on
improving performance and solving specific challenges.
Data Handling with tf.data API:
The tf.data API in TensorFlow provides an efficient framework for loading, transforming, and
preparing data for machine learning tasks. It supports:
• Data Loading: Load data from various sources, such as CSV files, databases, and
image folders.
• Batching and Shuffling: Batch data for faster training and shuffle to prevent patterns
that may cause overfitting.

• Data Augmentation: Apply transformations such as rotation, scaling, or cropping to
images, which helps reduce overfitting and improves model robustness.
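
A small sketch of such a pipeline is shown below, built from in-memory arrays; the shapes, values, and transformation are made up just to show the shuffle/map/batch pattern.

import numpy as np
import tensorflow as tf

features = np.random.rand(100, 4).astype("float32")
labels = np.random.randint(0, 2, size=(100,))

dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=100)            # shuffle to avoid order-related patterns
    .map(lambda x, y: (x * 2.0, y))      # per-example transformation
    .batch(32)                           # batch for faster training
)

for batch_x, batch_y in dataset.take(1):
    print(batch_x.shape, batch_y.shape)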

Model Building with tf.keras:


tf.keras simplifies the process of building machine learning models, offering two primary
methods:
• Sequential API: Allows the creation of linear stacks of layers, ideal for feed-forward
models.

Python code:
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

• Functional API: Supports more complex architectures, such as multi-input or multi-output
models, and allows for custom layer connections; a sketch is shown below.
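
Python code (a sketch of the Functional API; the two inputs and layer sizes are arbitrary, chosen only to show non-linear model wiring):

import tensorflow as tf

input_a = tf.keras.Input(shape=(16,))
input_b = tf.keras.Input(shape=(8,))

x = tf.keras.layers.Dense(32, activation="relu")(input_a)
y = tf.keras.layers.Dense(32, activation="relu")(input_b)
merged = tf.keras.layers.concatenate([x, y])              # custom layer connections
output = tf.keras.layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[input_a, input_b], outputs=output)
model.summary()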

Model Training and Optimization:


In TensorFlow, models are trained using the following components:
• Loss Functions: Measure the error between predicted and actual values. Common
loss functions include Mean Squared Error for regression and Cross-Entropy for
classification.
• Optimizers: Adjust model weights to minimize the loss function. Popular optimizers
include Stochastic Gradient Descent (SGD) and Adam, which provide efficient and
adaptive updates.
• Model Fitting: The model.fit() function trains on batches of data, with options
for setting epochs, batch size, and validation data.
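
The short sketch below puts these three pieces together: a loss function and optimizer are set in compile, and fit trains on batches with a validation split (random data, purely to show the calls).

import numpy as np
import tensorflow as tf

X = np.random.rand(200, 4)
y = np.random.randint(0, 10, size=(200,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",                         # adaptive optimizer
              loss="sparse_categorical_crossentropy",   # cross-entropy loss for classification
              metrics=["accuracy"])
model.fit(X, y, epochs=3, batch_size=32, validation_split=0.2)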

5.Scikit-Learn in Machine Learning
Scikit-Learn is a popular, open-source machine learning library in Python, known for its
simplicity, efficiency, and versatility. It provides an extensive set of tools for data
preprocessing, model training, model evaluation, and hyperparameter tuning, making it
suitable for both beginners and experts.

1. Key Features of Scikit-Learn:


• Ease of Use: Scikit-Learn is designed with a clean API, making it easy to build and
evaluate machine learning models.
• Wide Range of Algorithms: It offers a broad collection of supervised and
unsupervised learning algorithms, including linear regression, decision trees, support
vector machines, k-means clustering, and more.
• Integration with Python Libraries: Scikit-Learn integrates well with popular Python
libraries like NumPy, SciPy, and Matplotlib, enhancing its capabilities in data analysis
and visualization.

2. Core Components of Scikit-Learn:


2.1 Data Preprocessing
Scikit-Learn provides various utilities for data preprocessing, which are essential for preparing
data before training a model:
• Standardization and Normalization: Transform features to have a standard mean and
variance, or scale them to a specific range.
• Encoding Categorical Variables: Convert categorical data into numerical format
using methods like One-Hot Encoding.
• Imputation: Handle missing data by filling in missing values with mean, median, or
other strategies.

2.2 Model Training and Selection


Scikit-Learn offers a range of algorithms for different tasks:
• Classification: Algorithms like logistic regression, k-nearest neighbors, support vector
machines, and random forests for tasks such as image classification and text
categorization.
• Regression: Linear regression, decision trees, and ensemble methods like gradient
boosting for predicting continuous outputs.
• Clustering: Unsupervised learning methods like k-means clustering and DBSCAN for
discovering patterns and grouping similar data points.

2.3 Model Evaluation


Scikit-Learn provides evaluation metrics that help assess model performance, including:
• Classification Metrics: Accuracy, precision, recall, F1-score, ROC-AUC for
measuring classification results.
• Regression Metrics: Mean squared error (MSE), mean absolute error (MAE), and R²
score for regression tasks.

2.4 Hyperparameter Tuning
Scikit-Learn includes tools for optimizing model parameters:
• Grid Search: Systematically test combinations of parameters to find the best model
configuration.
• Randomized Search: Efficiently sample a random subset of parameters for faster
tuning.

Example Workflow in Scikit-Learn:


A typical Scikit-Learn workflow involves these steps:
1. Data Loading: Load data into a Pandas DataFrame or directly use Scikit-Learn’s
sample datasets.
2. Data Preprocessing: Use Scikit-Learn transformers to clean and prepare the data.
3. Model Selection and Training: Instantiate and train a model on the training data.
4. Model Evaluation: Evaluate model performance using appropriate metrics.
5. Hyperparameter Tuning: Optimize model parameters for better performance.
6. Prediction: Use the trained model to make predictions on new data.
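
A condensed sketch of this workflow is shown below, using the built-in Iris dataset so the example stays self-contained; the grid of parameters is illustrative.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                            # 1. data loading
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([("scale", StandardScaler()),                # 2. preprocessing
                 ("svc", SVC())])                            # 3. model selection
grid = GridSearchCV(pipe, {"svc__C": [0.1, 1, 10]}, cv=5)    # 5. hyperparameter tuning
grid.fit(X_train, y_train)                                   # training

pred = grid.predict(X_test)                                  # 6. prediction
print(accuracy_score(y_test, pred))                          # 4. evaluation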

6.Model Building, Training, and Testing in Machine Learning

Model building, training, and testing are core stages in the machine learning workflow.
These steps are crucial in developing models that can generalize well to new, unseen data
and provide valuable predictions. Below, we break down each phase in the model
development process.

Selecting the Right Model:


The first step in building a machine learning model is to select the appropriate algorithm based
on the problem you are trying to solve. Some common types of algorithms used are:
• Supervised Learning Models:
o Classification: Used for problems where the target variable is categorical (e.g.,
logistic regression, decision trees, support vector machines).
o Regression: Used for predicting continuous values (e.g., linear regression,
decision tree regressor, random forest regressor).
• Unsupervised Learning Models:
o Clustering: Used to group similar data points together (e.g., k-means,
hierarchical clustering).
o Dimensionality Reduction: Used to reduce the number of features in a dataset
while retaining as much information as possible (e.g., PCA).

The choice of model depends on factors such as the data type, the task (prediction,
classification, etc.), and the quality of the data (size, noise, missing values).

Model Training

Model training involves using labelled data to adjust the model's parameters so that it can
make accurate predictions on new data. This is the process where the model "learns" from
the data.

Preparing the Data:


Before training the model, it is essential to prepare the data:
• Data Splitting: Typically, the dataset is split into training, validation, and test sets.
Common splits are 70% for training, 15% for validation, and 15% for testing.
o Training Set: Used to train the model and adjust the parameters.
o Validation Set: Used to tune hyperparameters and select the best model.
o Test Set: Used to evaluate the model's final performance after training.
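
A small sketch of such a 70/15/15 split using two calls to train_test_split is given below; X and y stand in for any feature matrix and label vector.

import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.random.rand(100, 3), np.random.randint(0, 2, 100)

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.30, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.50, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # roughly 70, 15, and 15 examples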

Training the Model:


During training, the model learns to map input features to the correct output labels by
iteratively adjusting the model's parameters (e.g., weights in neural networks). The process
involves:
• Forward Propagation: Input data is passed through the model, and predictions are
generated.
• Loss Calculation: The model’s predictions are compared to the actual labels using a
loss function (e.g., Mean Squared Error for regression, Cross-Entropy for
classification).
• Backpropagation (for neural networks): The model adjusts its weights by
calculating the gradient of the loss with respect to each weight and updating them to
minimize the loss using an optimizer (e.g., Stochastic Gradient Descent).

Monitoring and Tuning:


While training, it's important to monitor the model's performance:
• Overfitting and Underfitting: These are common issues:
o Overfitting occurs when the model becomes too complex and fits the training
data too closely, leading to poor generalization on unseen data.
o Underfitting happens when the model is too simple and fails to capture
underlying patterns in the data.
• Hyperparameter Tuning: Tuning hyperparameters such as the learning rate, batch
size, and number of epochs can significantly impact the model’s performance.

Model Testing

Testing the model is a critical step in evaluating its ability to generalize to new, unseen data.
This phase involves using the test set that was not used during training to assess how well the
model performs in real-world scenarios.
Evaluating Performance:
The model's performance is evaluated using different metrics depending on the type of task:

• For Classification:
o Accuracy: The percentage of correct predictions made by the model.
o Precision and Recall: Precision measures how many of the positive predictions
were correct, while recall measures how many of the actual positives were
identified.
o F1-Score: The harmonic mean of precision and recall, providing a balanced
metric when dealing with imbalanced classes.
o ROC-AUC: The Receiver Operating Characteristic Curve and the Area Under
the Curve are used to assess the model’s ability to discriminate between classes.
• For Regression:
o Mean Squared Error (MSE): Measures the average of the squared differences
between the predicted and actual values.
o R² Score: Represents the proportion of variance in the dependent variable
explained by the model.
o Mean Absolute Error (MAE): Measures the average absolute difference
between predicted and actual values.

Model Validation:
During testing, it is also important to assess the model’s ability to generalize to unseen data. If
a model performs well on the training set but poorly on the test set, it is an indication of
overfitting. Techniques such as cross-validation can help mitigate this by ensuring the
model’s performance is consistent across multiple subsets of the dataset.

7.PROJECT REPORT

Objective:

Rainfall Prediction Using Machine Learning

Dataset Link:
Rainfall-Prediction-Using-Machine-Learning/Rainfall csv at main ·
nithyaprakash2003/Rainfall-Prediction-Using-Machine-Learning

Steps involved:

• We import the data set into a Jupyter Notebook using Python statements.
• We first import the libraries and then start manipulating and processing the
data.
• We then remove NaN values.
• Then we detect the outliers and remove them.
• Split the data into training and testing data.

Python Code for Importing Libraries:

# importing the required libraries
import numpy as np
import pandas as pd
import pickle
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn import model_selection
from sklearn import linear_model
from sklearn import ensemble
import xgboost

# load the dataset
data = pd.read_csv("rainfall in india 1901-2015.csv")
# data.head()

# fill missing values with the column means
data = data.fillna(data.mean(numeric_only=True))

# keep the yearly and monthly rainfall columns grouped by subdivision and select Tamil Nadu
group = data.groupby('SUBDIVISION')[['YEAR','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC']]
data = group.get_group('TAMIL NADU')
# data.head()

df=data.melt(['YEAR']).reset_index()
# df.head()

df= df[['YEAR','variable','value']].reset_index().sort_values(by=['YEAR','index'])
# df.head()

df.columns=['Index','Year','Month','Avg_Rainfall']
Month_map={'JAN':1,'FEB':2,'MAR' :3,'APR':4,'MAY':5,'JUN':6,'JUL':7,'AUG':8,'SEP':9,
'OCT':10,'NOV':11,'DEC':12}
df['Month']=df['Month'].map(Month_map)
# df.head(12)

df.drop(columns="Index",inplace=True)

X=np.asanyarray(df[['Year','Month']]).astype('int')
y=np.asanyarray(df['Avg_Rainfall']).astype('int')

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

from sklearn.ensemble import RandomForestRegressor


random_forest_model = RandomForestRegressor(max_depth=100, max_features='sqrt', min_samples_leaf=4,
                                            min_samples_split=10, n_estimators=800)
random_forest_model.fit(X_train, y_train)

# y_predict = random_forest_model.predict(X_test)

# print('MAE:', metrics.mean_absolute_error(y_test,y_predict))
# print('MSE:', metrics.mean_squared_error(y_test, y_predict))

# print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_predict)))


# print("-----------Training Accuracy------------")
# print(round(random_forest_model.score(X_train,y_train),3)*100)
# print("-----------Testing Accuracy------------")
# print(round(random_forest_model.score(X_test,y_test),3)*100)

file = open("model.pkl","wb")
pickle.dump(random_forest_model,file)
file.close()
# print(y_predict)

Data Exploration and Preprocessing

Data exploration and preprocessing are critical steps in any machine learning project, as they
help to understand the dataset’s structure, identify patterns, detect anomalies, and prepare the
data for model training. Proper preprocessing can significantly improve model accuracy and
robustness.

We get info about the data set using the data.info() statement, which lists the columns, their data types, and the counts of non-null values.
Filling Nan Values with Mean:
# filling na values with mean

data = data.fillna(data.mean(numeric_only=True))
data.isnull().any()

DATA VISUALIZATION:

Python code for Data Visualization


Code:
plt.figure(figsize=(15,8))
data.groupby("YEAR").sum()['ANNUAL'].plot(kind="line", color="r", marker=".")
plt.xlabel("YEARS", size=12)
plt.ylabel("RAINFALL IN MM", size=12)
plt.grid(axis="both", linestyle="-.")
plt.title("Rainfall over Years")
plt.show()

DATA MODELLING:

Code:
group = data.groupby('SUBDIVISION')[['YEAR','JAN','FEB','MAR','APR','MAY','JUN','JUL','AUG','SEP','OCT','NOV','DEC']]
data = group.get_group('TAMIL NADU')
data.head()

df = data.melt(['YEAR']).reset_index()
df = df[['YEAR','variable','value']].reset_index().sort_values(by=['YEAR','index'])
df.head()

Month_map = {'JAN':1,'FEB':2,'MAR':3,'APR':4,'MAY':5,'JUN':6,'JUL':7,'AUG':8,'SEP':9,'OCT':10,'NOV':11,'DEC':12}
df['Month'] = df['Month'].map(Month_map)
df.head(12)

TRAINING AND TESTING DATASET:
To split train and test data:
Code:

# splitting the dataset into training and testing


from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=10)

MODELS:

SUPPORT VECTOR MACHINES:

We fit a Support Vector Machine model on the training data for comparison.

Code:
from sklearn import preprocessing
from sklearn import svm
svm_regr = svm.SVC(kernel='rbf')
svm_regr.fit(X_train, y_train)

RANDOM FOREST MODEL:
Code:

from sklearn.ensemble import RandomForestRegressor


random_forest_model = RandomForestRegressor(max_depth=100, max_features='sqrt',
min_samples_leaf=4,
min_samples_split=10, n_estimators=800)
random_forest_model.fit(X_train, y_train)

y_train_predict=random_forest_model.predict(X_train)
y_test_predict=random_forest_model.predict(X_test)

Python Flask Code:

We use Flask as the web framework to serve the trained model.

from flask import Flask,render_template,url_for,request,jsonify


from flask_cors import cross_origin
import pandas as pd
import numpy as np
import datetime
import pickle

app = Flask(__name__, template_folder="template")


model = pickle.load(open("cat.pkl", "rb"))
print("Model Loaded")

@app.route("/",methods=['GET'])
@cross_origin()
def home():
return render_template("index.html")

@app.route("/predict",methods=['GET', 'POST'])

36
@cross_origin()
def predict():
if request.method == "POST":
# DATE
date = request.form['date']
day = float(pd.to_datetime(date, format="%Y-%m-%dT").day)
month = float(pd.to_datetime(date, format="%Y-%m-%dT").month)
# MinTemp
minTemp = float(request.form['mintemp'])
# MaxTemp
maxTemp = float(request.form['maxtemp'])
# Rainfall
rainfall = float(request.form['rainfall'])
# Evaporation
evaporation = float(request.form['evaporation'])
# Sunshine
sunshine = float(request.form['sunshine'])
# Wind Gust Speed
windGustSpeed = float(request.form['windgustspeed'])
# Pressure 9am
pressure9am = float(request.form['pressure9am'])
# Pressure 3pm
pressure3pm = float(request.form['pressure3pm'])
# Temperature 9am
temp9am = float(request.form['temp9am'])
# Temperature 3pm
temp3pm = float(request.form['temp3pm'])

# Wind Dir 3pm


winddDir3pm = float(request.form['winddir3pm'])
# Wind Gust Dir
windGustDir = float(request.form['windgustdir'])
# Rain Today
rainToday = float(request.form['raintoday'])

input_lst = [location , minTemp , maxTemp , rainfall , evaporation , sunshine ,


windGustDir , windGustSpeed , winddDir9am ,
winddDir3pm , windSpeed9am , windSpeed3pm ,
humidity9am , humidity3pm , pressure9am ,
pressure3pm , cloud9am , cloud3pm , temp9am , temp3pm ,
rainToday , month , day]
pred = model.predict(input_lst)
output = pred
if output == 0:
return render_template("after_sunny.html")
else:
return render_template("after_rainy.html")
return render_template("predictor.html")
if __name__=='__main__':
app.run(debug=True)

Frontend Code(HTML):

<!DOCTYPE html>
<html lang="en" dir="ltr">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Rainy Brain</title>
<link rel="stylesheet" href={{url_for('static',filename='style1.css')}}>
<link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/font-
awesome/5.14.0/css/all.min.css">
</head>
<body>

<section>
<input type="checkbox" id="check">
<header>
<h2><a href="#" class="logo">Rainy Brain</a></h2>
<div class="navigation">
<a href="#">Home</a>
<a href="#about">About Rainy Brain</a>
<a href="#dashboard">Dashboard</a>
<a href="#info">Developer</a>
<a href="/predict">Predictor</a>
</div>
<label for="check">
<i class="fas fa-bars menu-btn"></i>
<i class="fas fa-times close-btn"></i>
</label>
</header>
<div class="content" style="margin-top: 8%;">
<div class="info">
<h2>Plant Trees <br><span>Save Rain</span></h2>
<p> "Plant trees to bring the rains and get rid of the summer's heat.” - Trees help
reduce and moderate the temperature and climate, which is why it is so important that we
have more of them</p>
<a href="#about" class="info-btn">More info</a>
</div>
<section id="about">
<h2>About Rainy Brain</h2>
<p class="about-content" style="text-align: center;">Rainy Brain is a web app which has
a Machine Learning model running at the back. The purpose of developing this app is to
predict whether it will rain the next day or not.
Occasionally, tropical cyclones can bring heavy
rainfall to tropical coastal regions, which is also likely to reach further inland.
</p>
</section>
<section id="dashboard">

<h2>Dashboard</h2>
<p>This dashboard is made using software called PowerBI, which is a product of Microsoft.
Here I have just attached images of the dashboard because PowerBI needs an
organizational account. To see the visualizations interactively, I am attaching my <a
href="../static/rain.pbix" style="color: black; font-weight: bold;">PowerBI</a>
dashboard file, which requires the PowerBI software to open. The usage of dashboards like
these is to bring a better understanding of the dataset and also to bring some
beautiful insights.</p>
<img class="dashboard-image" src="../static/dashboard.png" alt="1">
<div>
<img src="../static/1.png" alt="1">
<img src="../static/3.png" alt="3">
<img src="../static/4.png" alt="4">
<img src="../static/5.png" alt="5">
<img src="../static/6.png" alt="6">
<img src="../static/7.png" alt="7">
<img src="../static/8.png" alt="8">
<img src="../static/9.png" alt="9">
</div>
</section>
<section id="info">
<div>
<img src="../static/abc.png" alt="Shashank">
</div>
<div>
<h2>Developers</h2><br>
<ul style="list-style-type:circle">
<li style="font-size:25px">Shashank Kumar (20MCA0142)</li>
<li style="font-size:25px">Vishal Kumar (20MCA0238)</li>
<li style="font-size:25px">Devansh Tiwari (20MCA0007)</li>
<li style="font-size:25px">Prachi Jha (20MCA0143)</li>
<footer>
<p>
Developed with Shashank & Team
</p>
<div class="media-icons">
<a href="https://www.linkedin.com/in/shashank-kumar-sk269/"><i class="fab fa-
linkedin"></i></a>
<a href="https://github.com/kumar-shashank"><i class="fab fa-github"></i></a>
<a href="https://twitter.com/"><i class="fab fa-twitter"></i></a>
</div>
</footer>
</body>
</html>

Output:

The output shows a user interface page built using HTML and CSS, with Python Flask as the web framework.

We get the predicted rainfall in mm for a state, where the inputs are the month and the year.

CONCLUSION :

In conclusion, the summer internship on machine learning using TensorFlow offered
an insightful journey into the field of artificial intelligence, with a focus on developing
predictive models. TensorFlow's powerful framework and extensive libraries streamlined the
process of building, training, and deploying machine learning models, providing invaluable
experience in both theoretical knowledge and practical application. The internship covered
essential machine learning concepts, including data preprocessing, model selection,
evaluation, and optimization techniques. Furthermore, TensorFlow's high-level APIs allowed
for efficient data handling and model building, enabling rapid prototyping and easy
experimentation with various model architectures.
The internship culminated in a rainfall prediction project, which applied machine
learning techniques to forecast rainfall based on historical weather data. This project
highlighted the real-world impact of machine learning by addressing an important problem in
the agricultural and environmental sectors, where accurate rainfall prediction is critical for
crop planning and water resource management. Using a neural network model built with
TensorFlow, the project achieved significant accuracy by leveraging data features such as
temperature, humidity, wind speed, and historical rainfall patterns.
