FINAL YEAR PROJECT REPORT
OF
BACHELOR OF TECHNOLOGY
Babu Banarasi Das
Northern India Institute of Technology
Affiliated to Dr. A.P.J. Abdul Kalam Technical University (AKTU Code : 056)
Approved by All India Council for Technical Education (AICTE)
Sector II, Dr Akhilesh Das Nagar, Faizabad Road, Lucknow (UP) – India,
226028
CROP PREDICTION
B.Tech Final Project Report Submitted
In Partial Fulfillment of the Requirements
For the Degree of
BACHELOR OF TECHNOLOGY
in
Information Technology
By
Khushbu Sharma (1705613014)
Jyoti Singh (1705613013)
Under the Supervision of
Dr. Praveen Kumar Shukla (HOD)
Department of Information Technology
Babu Banarasi Das
Northern India Institute of Technology
Affiliated to Dr. A.P.J. Abdul Kalam Technical University (AKTU Code : 056)
Approved by All India Council for Technical Education (AICTE)
Sector II, Dr Akhilesh Das Nagar, Faizabad Road, Lucknow (UP) – India, 226028
July-2021
DECLARATION
We hereby declare that the work presented in this report entitled “Crop Prediction
System” was carried out by us. We have not submitted the matter embodied in this
report for the award of any other degree or diploma of any other University or
Institute.
We have given due credit to the original authors/sources for all words, ideas,
diagrams, graphics, computer programs, experiments and results that are not our
original contribution. We have used quotation marks to identify verbatim
sentences and given credit to the original authors/sources.
We affirm that no portion of our work is plagiarized, and the experiments and results
reported in the report are not manipulated. In the event of a complaint of plagiarism
and the manipulation of the experiments and results, we shall be fully responsible and
answerable.
Signature of Student 1:
Name of Student 1: Khushbu Sharma
University Roll No.: 1705613014
Branch: Information Technology
Institute: Babu Banarasi Das Northern India Institute of Technology
Signature of Student 2:
Name of Student 2: Jyoti Singh
University Roll No.: 1705613013
Branch: Information Technology
Institute: Babu Banarasi Das Northern India Institute of Technology
CERTIFICATE
Certified that Khushbu Sharma (1705613014) and Jyoti Singh (1705613013) have
carried out the Project Work presented in this report, entitled “Crop Prediction
System”, for the B.Tech. Final Year in the Academic Session 2020-21 at
Babu Banarasi Das Northern India Institute of Technology (AKTU Code:
056), Lucknow, under my supervision. The report embodies the results of original
work and studies carried out by the students themselves, and the contents of the
project do not form the basis for the award of any other degree to the candidates
or to anybody else.
Date:
Mrs. Neelam Chakravarti                      Dr. Praveen Kumar Shukla
Designation: Assistant Professor             Designation: Head of Department
Department: Information Technology           Department: Information Technology
ABSTRACT
In this project report, we present a Crop Prediction System as a solution to
farmers' problems and as an aid to maintaining crop production. The aim of this
project is to help farmers grow suitable crops and achieve better yields.
Nowadays, the agriculture sector is a major contributor to the Indian economy.
In a country like India, with a large population and an ever-increasing demand
for food, advances in the agriculture sector are required to meet the needs;
crop yield prediction therefore remains a challenging task in this domain.
Various parameters affect the yield of a crop, such as rainfall, temperature,
fertilizers, pesticides, soil pH, pressure, apparent maximum temperature,
apparent minimum temperature, latitude, longitude and other atmospheric
conditions. All of these parameters are used to help predict crops in the future
on the basis of previous datasets, and to help solve the problems farmers face
during cultivation.
This project lets farmers know the yield of their crop before cultivating the
agricultural field and helps them make appropriate decisions. For this kind of
data analysis in crop prediction, different techniques and algorithms are used;
with their help, we can predict the crop yield before cultivation, which is very
beneficial for the farmer. Machine learning approaches have been applied to the
agricultural data to identify the best-performing technique, and machine
learning technology is used to overcome the issues of crop congestion and low
crop production.
Four modules are used for crop prediction: linear regression model 1, linear
regression model 2, a tuned regression model and a graph model. Linear
regression model 1 applies the linear regression algorithm without shuffling the
dataset, while linear regression model 2 applies it with shuffling. The tuned
regression model uses various machine learning algorithms to improve the
accuracy of crop production prediction, and the graph model plots different
graphs from the dataset.
The Python language and machine learning technology are used to implement this
system. HTML is used for the frontend, and Flask is used as the web framework
through which the user can view the last predicted value.
ACKNOWLEDGEMENT
The satisfaction and euphoria that accompany the development of any
task would be incomplete without the mention of the people who make it
possible, whose constant guidance and encouragement crowned our efforts with
success.
Our first vote of thanks goes to our parents; with every gesture, every
word and every pat on the back, they have always inspired us to aim for better
things and to attain them.
We wish to place on record our whole hearted gratitude to Mrs. Neelam
Chakravarti, our project guide for making available every facility that we
required during the course of our project. With friendly advice and guidance at
every step, her presence was a welcome sight throughout the project.
We are also deeply indebted to our Head of Department Dr. Praveen
Kumar Shukla for his constant presence, supervision and advice that paid off
in the culmination of this project and has helped us a great deal with the project
with their constant words of encouragement and advice. Actually, this project
report is just an excuse to convey our feelings about how much we appreciate
the amount of concern and caring that our teachers exhibit in all our pursuits
ranging from anything as simple as the routine lab programs to something as
taxing as a project.
Thanks to all of you….
Finally, we would like to conclude by offering our heartfelt thanks and
prayers to the Almighty, without whose will nothing is possible in this
world, and to all our dear friends for their support and help.
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENT
LIST OF TABLES
LIST OF FIGURES
Chapter No. 1: Introduction
1.1: Introduction
1.2: Scope and Objective
Chapter No. 2: Literature Survey
2.1: Tools and Technology Used
2.1.1: Introduction of Python
2.1.2: Visualization and Prediction of Crop Production Data Using Python
2.1.3: Introduction of Machine Learning
2.1.4: Visualization and Prediction of Crop Production Data Using Machine Learning
2.1.5: Algorithms Used in the Project
2.1.6: Flask
Chapter No. 3: Software Requirements Specification
3.1: Hardware Requirements
3.2: Software Requirements
3.3: Feasibility Study
3.3.1: Economic Feasibility
3.3.2: Technical Feasibility
3.3.3: Social Feasibility
Chapter No. 4: System Analysis and Design
4.1: System Design
4.2: Sequence Diagram of Crop Yield
4.3: Use Case Diagram for Predicting the Crop Yield
Chapter No. 5: Implementation
5.1: Coding
5.2: Images
Chapter No. 6: Software Testing
6.1: Testing Levels
Chapter No. 7: Conclusion
Chapter No. 8: Future Enhancement
Chapter No. 9: References
CHAPTER No.1
INTRODUCTION
Agriculture is the backbone of every economy. In a country like
India, which has an ever-increasing demand for food due to its rising
population, advances in the agriculture sector are required to meet the needs.
Since ancient times, agriculture has been considered the main and foremost
occupation practised in India. Ancient people cultivated crops on their own
land and thereby met their own needs. Natural crops were cultivated and used by
many creatures, such as human beings, animals and birds, and the green produce
of the land led to healthy lives and general welfare. Since the invention of
new technologies and techniques, however, the agriculture field has slowly been
degrading. Because of these abundant inventions, people have concentrated on
cultivating artificial, hybrid products, which leads to an unhealthy life.
Nowadays, many people lack awareness about cultivating crops at the right time
and in the right place. Because of these cultivation techniques, the seasonal
climatic conditions are also changing, degrading fundamental assets such as
soil, water and air, which leads to food insecurity. Despite analysis of issues
and problems such as weather, temperature and several other factors, there is
still no complete solution or technology to overcome the situations we face.
In India, there are several ways to increase economic growth in the field of
agriculture.
There are multiple ways to increase and improve the crop yield and the quality
of the crops. Data mining is also useful for predicting crop yield. Generally,
data mining is the process of analysing data from different perspectives and
summarizing it into useful information. Data mining software is an analytical
tool that allows users to analyse data from many different dimensions or
angles, categorize it, and summarize the relationships identified. Technically,
data mining is the process of finding correlations or patterns among dozens of
fields in large relational databases; the patterns, associations and
relationships among all this data can provide useful information. The problem
statement of this project is to predict the production of crop yield in the
future on the basis of a previous dataset. Today the agriculture sector is a
major contributor to the Indian economy.
In a country like India, which has an ever-increasing demand for food due to
its rising population, advances in the agriculture sector are required to meet
the needs; crop yield prediction therefore remains a challenging task in this
domain. Various parameters affect the yield of a crop, such as rainfall,
temperature, fertilizers, pesticides, soil pH and other atmospheric conditions.
This project will help farmers know the yield of their crop before cultivating
the agricultural field and help them make appropriate decisions. For this kind
of data analysis in crop prediction, different techniques and algorithms are
used; with their help, we can predict the crop yield before cultivation.
Machine learning approaches have been applied to the agricultural data to
identify the best-performing technique.
1.2 Scope and objective:
The proposed system aims at predicting or forecasting the crop yield by
learning from past data of the farming land. By considering various factors
such as soil conditions, rainfall, temperature and yield, the system builds a
prediction model using machine learning techniques. Here we make use of
different machine learning techniques such as random forest, polynomial
regression and decision trees. Performance is evaluated based on prediction
accuracy.
Challenges are the major factors that can have a negative impact on the
current project. Some of the challenges faced during crop yield prediction are:
1. Choosing an appropriate dataset and then tuning the parameters, which makes
the project more efficient at achieving the desired results.
2. Training the model while taking limited computational efficiency and power
into consideration.
3. An increased error rate due to the dynamically changing environment.
The scope of the project is to determine the crop yield of an area from a
dataset with features that are important to crop production, such as
temperature, moisture, rainfall, and the production of the crop in previous
years. To predict a continuous value, regression models are used. Regression
is a supervised technique: during training and construction of the regression
model, the coefficients are preprocessed and fitted to the training data. The
main focus here is to reduce the cost function by finding the best-fit line;
the output function facilitates error measurement, and during training the
error between the predicted and actual values is reduced in order to minimize
the error function.
In related work, Python has been used with Naïve Bayes and KNN to predict the
class of an analysed soil dataset. The soil is categorized into high, medium
and low; this gives the farmer and the soil analyst prior knowledge about the
land, so they can decide which crop best suits it, and the results in turn
help in predicting the crop yield [6]. Agriculture is a tedious process that
involves numerous estimations and influencing factors. In one research work,
the authors focused on modelling some of the important inputs in the collected
dataset and deriving a strong relationship between the variables; SVM and
k-means algorithms were used to study pollution from the atmosphere and to
classify soil and plants [7]. Fuzzy Cognitive Maps, a soft computing technique
for modelling expert knowledge, have been used to predict the yield of cotton
plants; representing knowledge visually, effectively, simply and structurally
makes this technique advantageous and a convenient approach for predicting and
improving cotton yield. Another author proposes a model that focuses on
removing noise factors to obtain efficient predictions, considering factors
such as differences in agricultural policies and practices; prior
distributions are developed to accommodate the spatial dependence arising
between different regions, and basis expansions and dimension-reduction
schemes are incorporated to evaluate the improved predictions. Previous
agricultural information is used to achieve precision farming and to suggest
the best crops to the farmer. This research work also paves the way for
improved yield prediction and an increase in farmers' income.
Chapter No. 2:
Literature Survey
The agriculture field plays an important role in the Indian economy. About 70
percent of households in India depend fundamentally on farming. Agriculture
contributed about 17% of India's Gross Value Added during 2015-2016, yet there
is a consistent decline in agriculture's contribution to Gross Value Added.
Food is essential for life, and we rely principally upon agricultural yields,
so farmers play a particularly essential role. A comparison of related work is
given below. One work applies regression algorithms to the analysis of crops:
decision tree and classification algorithms are used to examine more than 362
datasets and provide a classification. The training dataset is divided into
organic, inorganic and land categories for predicting the type of soil, and
the accuracy of production is found using machine learning algorithms such as
linear regression, multiple linear regression, polynomial regression and
random forest. Agriculture is considered the science of plant cultivation and
has played a key role in the well-being of humans.
2.1: Tools and Technology used
In this project, Python tools and machine learning technology are used.
Machine learning is among the most promising directions in the software
development field. It helps to conveniently automate various work processes
(including big data processing), enhance the precision of business
predictions, optimize the supply chain, and so on. ML is also a foundation for
applications that recognize voice signals (sounds, speech), faces and other
objects that cannot be identified with single-line mathematical formulas and
simple Boolean expressions.
There are many tools that assist in building machine-learning solutions in
the Python programming language; this section highlights the most famous and
efficient ML tools, as well as other important aspects of ML.
ML technology can be of great use in various spheres of business and industry.
It can be utilized in banking and finance, commerce and e-commerce, or health
and entertainment. Nevertheless, the tasks that machine-learning software is
meant to handle can be subdivided into three major categories (there are more
categories, but the following three cover the overwhelming majority of case
studies):
Supervised learning. The input in supervised learning consists of data
together with the results of its processing. Such pairs are called
'examples', and the further behaviour of the software algorithm is shaped by
analysing them. The more examples learned, the more precise the software,
which is reasonable to expect. Supervised learning is the basis for
classification tasks (indicating the one correct answer among N possibilities
based on previous experience) and regression tasks (indicating a precise
answer that is not a discrete value, based on previous experience).
Unsupervised learning. In this category the software sets up connections and
defines templates in the externally collected data autonomously. Unsupervised
learning solves clusterization tasks.
Reinforcement learning. Here, the input is used to obtain a supervisor's
reaction. If the chosen answer is correct, the supervisor reacts positively;
in the case of a negative reaction, the software looks for other solutions to
the set task.
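The first two categories can be illustrated in a few lines with scikit-learn. This is a minimal sketch on invented toy arrays (the numbers have no connection to the project's dataset): a supervised regressor learns from (input, label) pairs, while an unsupervised clusterer is given only the inputs and finds structure on its own.

```python
# Supervised vs. unsupervised learning in scikit-learn (illustrative data only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.cluster import KMeans

# Supervised: each example is an (input, known output) pair.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])            # labels act as the "supervisor"
reg = LinearRegression().fit(X, y)
print(reg.predict([[5.0]]))                    # regression: a continuous answer

# Unsupervised: only inputs are given; the algorithm groups them itself.
pts = np.array([[0, 0], [0, 1], [10, 10], [10, 11]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(pts)
print(km.labels_)                              # two clusters found in the data
```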
2.1.1: Introduction of Python
The Python language is very easy to use and learn for beginners and
newcomers. Python is one of the most accessible programming languages
available because it has a simplified, uncomplicated syntax that gives more
emphasis to natural language. Owing to its ease of learning and use, Python
code can be written and executed much faster than in many other programming
languages.
When Guido van Rossum was creating Python in the late 1980s, he made sure to
design it as a general-purpose language. One of the main reasons for the
popularity of Python is the simplicity of its syntax, so that it can be easily
read and understood even by amateur developers.
One can also quickly experiment by changing the code base, because Python is
an interpreted language, which makes it even more popular among all kinds of
developers.
Python was created more than 30 years ago, which is a lot of time for any
programming-language community to grow and mature enough to support developers
ranging from beginner to expert levels. Plenty of documentation, guides and
video tutorials for the Python language are available that learners and
developers of any skill level or age can use to receive the support required
to enhance their knowledge of Python. Many students get introduced to computer
science through Python, the same language that is used for in-depth research
projects.
If a programming language lacks developer support or documentation, it does
not grow much. Python has no such problem, because it has been around for a
very long time, and the Python developer community is one of the most
incredibly active programming-language communities.
This means that if somebody has an issue with the Python language, they can
get instant support from developers of all levels, ranging from beginner to
expert, in the community. Getting help on time plays a vital role in the
development of a project, which might otherwise face delays.
Programming languages grow faster when a corporate sponsor backs them. For
example, PHP is backed by Facebook, Java by Oracle and Sun, and Visual Basic
and C# by Microsoft. The Python programming language is heavily backed by
Facebook, Amazon Web Services, and especially Google.
Google adopted the Python language back in 2006 and has used it for many
applications and platforms since then. A lot of institutional effort and money
has been devoted by Google to the training and success of the Python language;
Google has even created a dedicated portal only for Python. The list of
support tools and documentation keeps growing for Python in the developer
world.
Ask any Python developer, and they will wholeheartedly agree that the Python
language is efficient, reliable and highly productive. Python can be used in
nearly any kind of environment, and one will not face performance-loss issues
irrespective of the platform one is working on.
Another great thing about the versatility of the Python language is that it
can be used in many kinds of environments, such as mobile applications,
desktop applications, web development, hardware programming and many more. The
versatility of Python makes it all the more attractive given its high number
of applications.
Cloud computing, machine learning and big data are some of the hottest trends
in the computer science world right now, helping many organizations transform
and improve their processes and workflows.
Python is one of the most popular tools used for data science and analytics.
Many data-processing workloads in organizations are powered by Python alone,
and much research and development takes place in Python thanks to its many
applications, including the ease of analysing and organizing the usable data.
Not only this, but hundreds of Python libraries are being used in thousands of
machine learning projects every day, such as TensorFlow for neural networks
and OpenCV for computer vision.
2.1.2: Visualization and Prediction of Crop Production Data Using Python
Numpy: NumPy stands for ‘Numerical Python’ or ‘Numeric Python’. It is an
open-source module of Python which provides fast mathematical computation on
arrays and matrices. Since arrays and matrices are an essential part of the
machine learning ecosystem, NumPy, along with machine learning modules like
Scikit-learn, Pandas, Matplotlib, TensorFlow, etc., completes the Python
machine learning ecosystem. NumPy provides the essential multi-dimensional,
array-oriented computing functionalities designed for high-level mathematical
functions and scientific computation. NumPy can be imported into a notebook
with the statement import numpy as np. NumPy's main object is the homogeneous
multidimensional array: a table of elements of the same type (homogeneous),
i.e., all integers, strings or characters, usually integers. In NumPy,
dimensions are called axes, and the number of axes is called the rank. There
are several ways to create an array in NumPy, such as np.array, np.zeros,
np.ones, etc., each of which provides some flexibility.
Some of the important attributes of a NumPy object which are used in
this project:
1. ndim: displays the dimension of the array
2. shape: returns a tuple of integers indicating the size of the array
3. size: returns the total number of elements in the NumPy array
4. dtype: returns the type of elements in the array, i.e., int, character
5. itemsize: returns the size in bytes of each item
6. reshape: reshapes the NumPy array
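The attributes listed above can be seen on a small made-up array; the values below are purely illustrative.

```python
# Demonstrating the NumPy attributes used in this project on a toy array.
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])   # a 2 x 3 integer array

print(a.ndim)       # number of dimensions (axes): 2
print(a.shape)      # size along each axis: (2, 3)
print(a.size)       # total number of elements: 6
print(a.dtype)      # element type, e.g. int64 (platform-dependent)
print(a.itemsize)   # bytes per element, e.g. 8 for int64
b = a.reshape(3, 2) # same data, rearranged into a 3 x 2 shape
print(b.shape)      # (3, 2)
```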
Pandas: Pandas is one of the most widely used Python libraries in data
science. It provides high-performance, easy-to-use structures and data
analysis tools. Unlike the NumPy library, which provides objects for
multi-dimensional arrays, pandas provides an in-memory 2D table object called
a DataFrame. It is like a spreadsheet with column names and row labels. Hence,
with 2D tables, pandas is capable of providing many additional
functionalities, like creating pivot tables, computing columns based on other
columns and plotting graphs.
Some commonly used data structures in pandas are:
1. Series objects: 1D array, similar to a column in a spreadsheet
2. DataFrame objects: 2D table, similar to a spreadsheet
3. Panel objects: dictionary of DataFrames, similar to a sheet in MS Excel
(removed in recent pandas versions)
A pandas Series object is created using the pd.Series function. Each row is
provided with an index and by default is assigned numerical values starting
from 0. Like NumPy, pandas also provides basic mathematical functionalities
like addition, subtraction, conditional operations and broadcasting. A pandas
DataFrame object represents a spreadsheet with cell values, column names and
row index labels. A DataFrame can be visualized as a dictionary of Series.
DataFrame rows and columns are simple and intuitive to access. Pandas also
provides SQL-like functionality to filter and sort rows based on conditions.
The following pandas functions are used in this project:
1. head(): returns the top 5 rows of the DataFrame object
2. tail(): returns the bottom 5 rows of the DataFrame
3. info(): prints a summary of the DataFrame
4. describe(): gives a nice overview of the main aggregated values over each
column
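These functions can be sketched on a small invented crop-style DataFrame; the column names and numbers below are illustrative assumptions, not the project's actual dataset.

```python
# The pandas functions listed above, applied to a toy DataFrame.
import pandas as pd

df = pd.DataFrame({
    "crop": ["Rice", "Wheat", "Maize", "Rice", "Wheat", "Maize"],
    "rainfall_mm": [1100, 650, 800, 1200, 700, 820],
    "yield_t_per_ha": [2.5, 3.1, 2.8, 2.7, 3.0, 2.9],
})

print(df.head())        # first 5 rows
print(df.tail())        # last 5 rows
df.info()               # column names, dtypes, non-null counts
print(df.describe())    # count, mean, std, min, quartiles, max per numeric column

# SQL-like filtering and sorting, as mentioned above
wet = df[df["rainfall_mm"] > 750].sort_values("yield_t_per_ha")
print(wet)
```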
Matplotlib: Matplotlib is a 2D plotting library which produces
publication-quality figures in a variety of hardcopy formats and interactive
environments. Matplotlib can be used in Python scripts, the Python and IPython
shells, Jupyter Notebook, web application servers and GUI toolkits.
matplotlib.pyplot is a collection of functions that make matplotlib work like
MATLAB; the majority of plotting commands in pyplot have MATLAB analogues with
similar arguments.
Seaborn: This library sits on top of matplotlib. In a sense, it has some
flavours of matplotlib, while from the visualization point of view it is much
better than matplotlib and has added features as well.
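As a small illustration of both libraries, the sketch below draws a MATLAB-style line plot with pyplot and a regression plot with seaborn. The rainfall and yield numbers are made up, and the non-interactive Agg backend is chosen here (an assumption) so the script also runs on a machine without a display.

```python
# Plotting with matplotlib.pyplot and seaborn (illustrative data only).
import matplotlib
matplotlib.use("Agg")            # headless backend: write files, no window
import matplotlib.pyplot as plt
import seaborn as sns

rainfall = [600, 700, 800, 900, 1000]
crop_yield = [2.1, 2.4, 2.9, 3.0, 3.4]

plt.plot(rainfall, crop_yield, "o-")     # MATLAB-like plotting command
plt.xlabel("Rainfall (mm)")
plt.ylabel("Yield (t/ha)")
plt.title("Rainfall vs yield")
plt.savefig("rainfall_yield.png")        # hardcopy output
plt.close()

# Seaborn adds higher-level statistical plots on the same foundation:
sns.regplot(x=rainfall, y=crop_yield)    # scatter plus fitted regression line
plt.savefig("rainfall_yield_seaborn.png")
plt.close()
```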
2.1.3: Introduction of Machine Learning:
Machine Learning (ML) deals with problems in which the relation between input
and output variables is not known or hard to obtain. The term "learning" here
denotes the automatic acquisition of structural descriptions from examples of
what is being described. Unlike traditional statistical methods, ML does not
make assumptions about the correct structure of the model that describes the
data. This characteristic is very useful for modelling complex non-linear
behaviours, such as a function for crop yield prediction. ML techniques have
been applied most successfully to Crop Yield Prediction (CYP).
A supervised learning algorithm works with a target/outcome variable (the
dependent variable) which is to be predicted from a given set of predictors
(independent variables). Using this set of variables, we generate a function
that maps inputs to desired outputs. The training process continues until the
model achieves the desired level of accuracy on the training data. Examples of
supervised learning: regression, decision tree, random forest, KNN, logistic
regression, etc.
Random Forest Classifier: Random forest is one of the most popular and
powerful supervised machine learning algorithms. It is capable of performing
both classification and regression tasks, and operates by constructing a
multitude of decision trees at training time and outputting the class that is
the mode of the classes (classification) or the mean prediction (regression)
of the individual trees. The more trees in a forest, the more robust the
prediction. Random decision forests correct for decision trees' habit of
overfitting to their training set. The datasets considered here include
rainfall, precipitation, production and temperature. To construct the random
forest, a collection of decision trees is built by considering two-thirds of
the records in the dataset; these decision trees are then applied to the
remaining records for accurate classification. The resulting trained model can
be applied to the test data for correct prediction of crop yield based on the
input attributes. The RF algorithm was used to study the performance of this
approach on the dataset. An advantage of the random forest algorithm is that
overfitting is less of an issue than with decision tree machine learning
algorithms, and there is no need to prune the random forest. Random forest
machine learning algorithms can also be grown in parallel.
This algorithm runs efficiently on large databases and has higher
classification accuracy. There are three main parameters in the random forest
algorithm:
n_tree: as the name suggests, the number of trees to grow. The more trees,
the more computationally expensive it is to build the model.
m_try: how many variables to consider at a node split. The default value is
p/3 for regression and sqrt(p) for classification; very small values of m_try
should be avoided.
node size: how many observations we want in the terminal nodes. This parameter
is directly related to tree depth: the higher the number, the lower the tree
depth, and with very low tree depth the tree might fail to recognize useful
signals in the data.
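A hedged sketch of how such a random forest might look in scikit-learn follows. The weather/yield rows are synthetic, and the mapping of the three parameters above onto scikit-learn names (n_estimators, max_features, min_samples_leaf) is our reading, not taken from the project code; the out-of-bag score reflects the "two-thirds of the records" idea, since each tree trains on a bootstrap sample and is evaluated on the rows it did not see.

```python
# Random forest regression on synthetic weather/yield data (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform([400, 15], [1400, 35], size=(200, 2))   # rainfall (mm), temp (C)
y = 0.002 * X[:, 0] + 0.05 * X[:, 1] + rng.normal(0, 0.1, 200)  # synthetic yield

rf = RandomForestRegressor(
    n_estimators=100,      # "n_tree": number of trees to grow
    max_features=1.0,      # "m_try": variables considered at each split
    min_samples_leaf=2,    # "node size": observations per terminal node
    oob_score=True,        # held-out bootstrap rows estimate generalization
    random_state=0,
).fit(X, y)

print(rf.oob_score_)                # out-of-bag R^2 on the unseen rows
print(rf.predict([[900.0, 25.0]]))  # predicted yield for one new field
```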
2.1.4: Visualization and Prediction of Crop Production
using machine learning:
The implementation is divided into the following modules.
A. Data Gathering
B. Data Cleaning
C. Data Exploration
D. Prediction using Machine learning
E. Web application
1. Data Gathering: The dataset is prepared by collecting crop and rainfall
data from the Indian government data repository (data.gov.in), which hosts a
large number of relevant datasets.
2. Data Cleaning: One of the most significant steps in any machine learning
project is data cleaning. There are several methods of statistical analysis
and data visualization that can be used to explore the data and identify the
appropriate data cleaning operations to be conducted. There are also some very
simple data cleaning operations that should be performed on every single
dataset before jumping to advanced methods. They are so important that, if
missed, models can break or report excessively optimistic results. In our
dataset, we removed all the null values and checked whether all the datatypes
were valid.
3. Data Exploration: Also known as EDA, exploratory data analysis is a very
important phase of researching and investigating data sets and summarizing
their significant characteristics, often using different methods of data
visualization. It makes it simpler for a data analyst to find recurring
trends, spot anomalies, test hypotheses and draw conclusions, and to decide
how best to monitor the data sources so as to obtain results with greater
precision.
4. Prediction using Machine Learning: We tested around six different
algorithms and found the decision tree regressor to be the most effective. A
decision tree is one of the most commonly used algorithms for supervised
learning. In the form of a tree structure, a decision tree generates
regression or classification models: it breaks a dataset down into smaller
subsets and, based on those subsets, a decision tree is created. The final
product is a tree containing decision nodes and leaf nodes. A decision node
has two or more branches, each representing a value of the tested attribute;
a leaf node represents a value of the final numerical target. The topmost node
in the tree is the root node, which corresponds to the best predictor. Both
continuous and categorical data can be processed by decision trees.
Machine learning prediction has the following steps:
Step 1: Initialize a dataset containing information on rainfall and the
wholesale price index.
Step 2: From the dataset, select all rows and columns 1, 2 and 3 as “X”,
the independent variables.
Step 3: From the dataset, select all rows and column 4 as “y”,
the dependent variable.
Step 4: Fit the X and y variables with a decision tree regressor.
Step 5: Update the UI with the predicted values.
5. Web Application: The predicted WPI is converted into a price per quintal
using the formula (WPI x Base Price)/100 and displayed in a visually
understandable web application created using the Flask framework. Flask is a
popular, extensible web micro-framework for building beautiful web
applications with Python.
2.1.5: Algorithms used in the project:
Linear regression: Linear regression is a procedure used to analyse a
response variable Y that changes with the value of an explanatory variable X.
Estimating the value of the response variable from a given value of the
explanatory variable is referred to as prediction. Here the relationship
between two variables, a dependent variable (Y) and an independent variable
(X), is modelled with a best-fit straight line, commonly called the
regression line.
The regression equation is shown below,
Y = a + (b*X) + e
Where,
Y – Dependent variable
X – Independent variable
a – Intercept
b – Slope
e – Residual (error)
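As a hedged sketch on synthetic data (not the project's dataset), the regression equation above can be fitted with scikit-learn, whose score() method returns the R² value used as the accuracy figure in this section:

```python
# Minimal sketch: fit Y = a + b*X + e on synthetic data and read the
# R^2 score. The data here only illustrates the API, not the project's
# actual 25 percent result.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))             # independent variable X
y = 3.0 + 2.0 * X[:, 0] + rng.normal(0, 1, 50)   # a=3, b=2, e = noise

model = LinearRegression().fit(X, y)
a, b = model.intercept_, model.coef_[0]  # estimated intercept and slope
r2 = model.score(X, y)                   # coefficient of determination
```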
In this project, when we applied linear regression to our dataset, the accuracy
of the crop prediction was 25 percent, which is not suitable for crop
prediction. We therefore applied the next algorithm, multiple linear
regression, and checked its accuracy.
Multiple Linear regression: Multiple linear regression (MLR), also known
simply as multiple regression, is a statistical technique that uses several
explanatory variables to predict the outcome of a response variable. The goal
of multiple linear regression (MLR) is to model the linear regression between
the explanatory (independent) variables and response (dependent) variable.
Simple linear regression is a function that allows an analyst or statistician to
make predictions about one variable based on the information that is known
about another variable. Linear regression can only be used when one has two
continuous variables—an independent variable and a dependent variable. The
independent variable is the parameter that is used to calculate the dependent
variable or outcome. A multiple regression model extends to several
explanatory variables.
Multiple linear regression (MLR) is used to determine a mathematical
relationship among a number of random variables. In other terms, MLR
examines how multiple independent variables are related to one dependent
variable. Once each of the independent factors has been determined to predict
the dependent variable, the information on the multiple variables can be used to
create an accurate prediction on the level of effect they have on the outcome
variable. The model creates a relationship in the form of a straight line (linear)
that best approximates all the individual data points.
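A minimal MLR sketch with scikit-learn: the same LinearRegression estimator accepts several explanatory columns. The names rainfall and price_index are hypothetical, chosen only for illustration:

```python
# Multiple linear regression sketch: X now holds two explanatory
# columns, so the model estimates one coefficient per column.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
rainfall = rng.uniform(50, 200, 100)      # explanatory variable 1
price_index = rng.uniform(90, 110, 100)   # explanatory variable 2
X = np.column_stack([rainfall, price_index])
# synthetic response: 0.1*rainfall + 0.5*price_index plus noise
y = 0.1 * rainfall + 0.5 * price_index + rng.normal(0, 2, 100)

mlr = LinearRegression().fit(X, y)
coefs = mlr.coef_  # one estimated coefficient per explanatory variable
```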
In this project, when we applied multiple linear regression to our dataset, the
accuracy of the crop prediction was again 25 percent, which is not suitable for
crop prediction. We therefore applied the next algorithm, polynomial
regression, and checked its accuracy.
Polynomial regression: In the last section we saw that two variables in the
data set were correlated, but what happens if we know our data is
correlated while the relationship does not look linear? Depending on
what the data looks like, we can perform a polynomial regression to fit a
polynomial equation to it. When the data is curved, a simple linear
regression line will not fit well: it is very difficult to fit a straight
line to such data with a low value of error. We can instead use
polynomial regression to fit a polynomial curve and achieve a minimum
error or minimum cost function. For such data, the equation of the
polynomial regression would be:
y = θ₀ + θ₁x + θ₂x²
The general equation of polynomial regression is:
Y = θ₀ + θ₁X + θ₂X² + … + θₘXᵐ + residual error
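In scikit-learn this can be sketched by expanding X into polynomial features and fitting an ordinary linear model on the expanded columns, matching the general equation above; the data below is synthetic:

```python
# Polynomial regression sketch: PolynomialFeatures builds [1, x, x^2]
# and LinearRegression estimates theta_0, theta_1, theta_2.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
x = rng.uniform(-3, 3, size=(80, 1))
# true curve: y = 1 + 2x + 0.5x^2 plus residual error
y = 1.0 + 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(0, 0.3, 80)

poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(x, y)
r2_poly = poly_model.score(x, y)  # R^2 of the quadratic fit
```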
In this project, when we applied polynomial regression to our dataset, the
accuracy of the crop prediction was 31 percent, which is still not suitable for
crop prediction. We therefore applied the next algorithm, random forest, and
checked its accuracy.
Random forest classifier: You must have at least once solved a probability
problem in high school in which you were asked to find the probability of
drawing a ball of a specific colour from a bag containing balls of different
colours, given the number of balls of each colour. Random forests are simple
to understand if we keep this analogy in mind.
Random forests (RF) are basically a bag containing n Decision Trees (DT)
having a different set of hyper-parameters and trained on different subsets of
data. Let’s say I have 100 decision trees in my Random forest bag!! As I just
said, these decision trees have a different set of hyper-parameters and a different
subset of training data, so the decision or the prediction given by these trees can
vary a lot. Let's assume I have somehow trained all these 100 trees on their
respective subsets of data. Now I ask all hundred trees in my bag what their
prediction is on my test data. Since we need only one decision per test
example, we take a simple vote: we go with whatever the majority of the trees
have predicted for that example.
Random forest adds additional randomness to the model, while growing the
trees. Instead of searching for the most important feature while splitting a node,
it searches for the best feature among a random subset of features. This results
in a wide diversity that generally results in a better model. Therefore, in random
forest, only a random subset of the features is taken into consideration by the
algorithm for splitting a node. You can even make trees more random by
additionally using random thresholds for each feature rather than searching for
the best possible thresholds (like a normal decision tree does). As an analogy,
suppose Andrew wants to decide where to go on a one-year vacation, so he asks
the people who know him best for suggestions. The first friend he seeks out
asks him about the likes and dislikes of his past travels and, based on the
answers, gives Andrew some advice. This is a typical decision tree approach:
Andrew's friend created rules to guide his recommendation by using Andrew's
answers. Afterwards, Andrew asks more and more of his friends for advice, and
they again ask him different questions from which they can derive
recommendations. Finally, Andrew chooses the places that were recommended to
him the most, which is the typical random forest approach. Another great
quality of the random forest algorithm is that it is very easy to measure the
relative importance of each feature for the prediction.
Sklearn provides a great tool for this that measures a feature's importance by
looking at how much the tree nodes that use that feature reduce impurity across
all trees in the forest. It computes this score automatically for each feature
after training and scales the results so that the importances sum to one.
If you don’t know how a decision tree works or what a leaf or node is, here is a
good description from Wikipedia: “In a decision tree each internal node
represents a ‘test’ on an attribute (e.g. whether a coin flip comes up heads or
tails), each branch represents the outcome of the test, and each leaf node
represents a class label (decision taken after computing all attributes). A node
that has no children is a leaf.”
By looking at the feature importance you can decide which features to possibly
drop because they don’t contribute enough (or sometimes nothing at all) to the
prediction process. This is important because a general rule in machine learning
is that the more features you have the more likely your model will suffer from
overfitting and vice versa.
In this project, when we applied the random forest classifier to our dataset,
the accuracy of the crop prediction was 99.99 percent, which indicates that
random forest is the best algorithm for this project. Therefore, this project
finally applies the random forest model.
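A minimal sketch of a random forest classifier and the scikit-learn feature-importance tool described above, on toy stand-in data rather than the project's crop dataset:

```python
# Random forest sketch: 100 trees vote on each example, and
# feature_importances_ scores each column (scaled to sum to 1).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
# the class depends mostly on feature 0; features 2 and 3 are noise
y = (X[:, 0] + 0.2 * X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

importances = forest.feature_importances_  # one score per feature
```

Inspecting importances here would show feature 0 dominating, illustrating how uninformative columns can be identified and possibly dropped.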
KNN Algorithm: The k-nearest neighbors (KNN) algorithm is a simple,
easy-to-implement supervised machine learning algorithm that can
be used to solve both classification and regression problems. The
KNN algorithm assumes that similar things exist in close proximity.
In other words, similar things are near each other.
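The "similar things are near each other" idea can be sketched on toy 2-D data: a query point is labelled by a majority vote among its k nearest training points:

```python
# KNN sketch: label a point by the majority class of its 3 nearest
# neighbours, using two well-separated toy clusters.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[0, 0], [0, 1], [1, 0],                 # class 0 near the origin
              [5, 5], [5, 6], [6, 5]], dtype=float)   # class 1 near (5, 5)
y = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# a query near the first cluster gets that cluster's label
label = knn.predict([[0.5, 0.5]])[0]
```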
In this project, when we applied KNN to our dataset, the accuracy of the crop
prediction was 75 percent, which also makes it a good candidate; but comparing
accuracies, random forest remains the best for this project. Therefore, this
project finally applies the random forest model.
2.1.6: Flask:
Flask is a web framework that provides libraries to build lightweight
web applications in Python. It was developed by Armin Ronacher, who leads an
international group of Python enthusiasts (POCCO). It is based on the WSGI
toolkit and the Jinja2 template engine, and is therefore considered a
micro-framework.
Flask is flexible. It doesn’t require you to use any particular project or code
layout. However, when first starting, it’s helpful to use a more structured
approach. This means that the tutorial will require a bit of boilerplate up front,
but it’s done to avoid many common pitfalls that new developers encounter, and
it creates a project that’s easy to expand on. Once you become more
comfortable with Flask, you can step out of this structure and take full
advantage of Flask’s flexibility.
Flask is a Python framework that allows us to build web applications. Flask's
framework is more explicit than Django's and is also easier to learn because it
needs less base code to implement a simple web application. A web application
framework is a collection of modules and libraries that helps the developer
write applications without writing low-level code such as protocol handling
and thread management. Flask is based on the WSGI (Web Server Gateway
Interface) toolkit and the Jinja2 template engine.
If a user visits the http://localhost:5000/hello URL, the output of the
hello_world() function will be rendered in the browser. The add_url_rule()
function of an application object can also be used to bind a URL to a
function.
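The URL binding described above can be sketched in a minimal Flask app, showing both the @app.route decorator and add_url_rule(). The price-per-quintal helper applies the formula from this chapter; its name and the base_price value are hypothetical:

```python
# Minimal Flask sketch: @app.route and add_url_rule() both bind a URL
# to a view function.
from flask import Flask

app = Flask(__name__)

@app.route('/hello')
def hello_world():
    return 'Hello, World!'

# the same binding, done explicitly instead of with the decorator
app.add_url_rule('/hi', 'hi', hello_world)

def price_per_quintal(wpi, base_price):
    # Price per Quintal = (WPI x Base Price) / 100
    return wpi * base_price / 100

# app.run() would serve the app at http://localhost:5000/hello
```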