A PROJECT REPORT
ON
SENTIMENTAL ANALYSIS USING AI-DEEP LEARNING
A project report submitted in partial fulfillment of the requirements for the
award of the degree of
BACHELOR OF
TECHNOLOGY IN
COMPUTER SCIENCE & ENGINEERING
By
A. Supraja      172P1A0505
D. Indira       172P1A0517
H. Neelesh      172P1A0534
G. Yashaswini   172P1A0523
C. Vyshnavi     172P1A0513
UNDER THE ESTEEMED GUIDANCE OF
P. Narasimhaiah
Assistant Professor
DEPARTMENT OF COMPUTER SCIENCE AND
ENGINEERING AN ISO 9001:2015 CERTIFIED
INSTITUTION
CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY
(Sponsored by Bharathi Educational Society)
(Affiliated to J.N.T.U.A., Anantapuramu, Approved by AICTE, New
Delhi) Recognized by UGC Under the Sections 2(f)&12(B) of
UGC Act,1956
(Accredited by NAAC)
Vidyanagar, Proddatur-516 360, Y.S.R.(Dist.), A.P.
2017-2021
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
AN ISO 9001:2015 CERTIFIED INSTITUTION
CHAITANYA BHARATHI INSTITUTE OF TECHNOLOGY
(Sponsored by Bharathi Educational Society)
(Affiliated to J.N.T.U.A. Anantapuramu, Approved by AICTE, New
Delhi) Recognized by UGC under the Sections 2(f) &12(B) of UGC
Act, 1956 Accredited by NAAC
Vidyanagar, Proddatur-516 360, Y.S.R. (Dist.), A.P.
CERTIFICATE
This is to certify that the project work entitled “SENTIMENTAL ANALYSIS
USING AI-DEEP LEARNING” is a bonafide work of A. SUPRAJA (172P1A0505),
D. INDIRA (172P1A0517), H. NEELESH (172P1A0534), G. YASHASWINI (172P1A0523),
and C. VYSHNAVI (172P1A0513), submitted to Chaitanya Bharathi Institute of
Technology, Proddatur in partial fulfillment of the requirements for the award
of the degree of Bachelor of Technology in COMPUTER SCIENCE AND
ENGINEERING. The work reported herein does not form part of
any other thesis on which a degree has been awarded earlier.
This is to further certify that they have worked for a period of one
semester to prepare their work under our supervision and guidance.
INTERNAL GUIDE HEAD OF THE DEPARTMENT
P. NARASIMHAIAH G. SREENIVASA REDDY
Assistant Professor Associate Professor
PROJECT CO-ORDINATOR
N. SRINIVASAN, M.Tech., (Ph.D.)
INTERNAL EXAMINER EXTERNAL EXAMINER
DECLARATION BY THE CANDIDATES
We, A. Supraja, D. Indira, H. Neelesh, G. Yashaswini, and C. Vyshnavi,
bearing the respective Roll Nos. 172P1A0505, 172P1A0517, 172P1A0534,
172P1A0523, and 172P1A0513, hereby declare that the Project Report entitled
“SENTIMENTAL ANALYSIS USING AI-DEEP LEARNING”, carried out under the guidance of
P. NARASIMHAIAH, Assistant Professor, Department of CSE, is submitted in
partial fulfillment of the requirements for the award of the degree of
Bachelor of Technology in Computer Science & Engineering.
This is a record of bonafide work carried out by us, and the results
embodied in this Project Report have not been reproduced or copied from
any source. The results embodied in this Project Report have not been submitted
to any other University or Institute for the award of any other Degree or
Diploma.
A.SUPRAJA 172P1A0505
D.INDIRA 172P1A0517
H.NEELESH 172P1A0534
G.YASHASWINI 172P1A0523
C.VYSHNAVI 172P1A0513
Dept. of Computer Science &
Engineering
Chaitanya Bharathi Institute of
Technology
Vidyanagar,Proddatur,Y.S.R.(Dist.)
ACKNOWLEDGEMENT
An endeavor over a long period can be successful only with the
advice and support of many well-wishers. We take this opportunity to
express our gratitude and appreciation to all of them.
We are extremely thankful to our beloved Chairman Sri
V. Jayachandra Reddy, who took keen interest and encouraged us in every
effort throughout this course.
We owe our gratitude to our principal Dr. G. Sreenivasula Reddy,
M.Tech., Ph.D., for permitting us to use the facilities available to accomplish
the project successfully.
We express our heartfelt thanks to G. Sreenivasa Reddy, B.Tech., Ph.D.,
Head of the Department of CSE, for his kind attention and valuable guidance
throughout this course.
We also express our deep sense of gratitude towards N. Srinivasan,
Project Co-ordinator, Dept. of Computer Science and Engineering,
for the support and guidance in completing our project.
We express our profound respect and gratitude to our project guide
P. Narasimhaiah, B.Tech., for the valuable support and guidance in completing
the project successfully.
We are highly thankful to Mr. M. Naresh Raju, Try Logic Soft Solutions
AP Pvt. Limited, Hyderabad, who has been kind enough to guide us in the
preparation and execution of this project.
We also thank all the teaching and non-teaching staff of the Dept.
of Computer Science and Engineering for their support throughout
our B.Tech. course.
We express our heartfelt thanks to our parents for their valuable
support and encouragement in the completion of our course. We also express our
heartfelt regards to our friends for being supportive in the completion of our
project.
TABLE OF CONTENTS
CONTENT PAGE NO
ABSTRACT                                          i
LIST OF FIGURES                                   ii
1. INTRODUCTION                                   1
   1.1 Domain Description                         1
   1.2 About Project                              6
       1.2.1 Problem Definition                   6
       1.2.2 Proposed Solution                    6
   1.3 Objective                                  8
2. SENTIMENTAL ANALYSIS SURVEY                    9
   2.1 Theoretical Background                     9
   2.2 Existing System                            9
   2.3 Proposed System                            9
   2.4 Advantages of Proposed System              10
   2.5 Feasibility Study                          10
       2.5.1 Operational Feasibility              10
       2.5.2 Technical Feasibility                11
             2.5.2.1 Survey of Technology         11
             2.5.2.2 Feasibility of Technology    11
       2.5.3 Economic Feasibility                 11
3. SYSTEM ANALYSIS                                12
   3.1 Specifications                             12
   3.2 Software Requirements                      13
   3.3 Hardware Requirements                      13
   3.4 Module Description                         13
4. DESIGN                                         17
   4.1 Block Diagram                              17
   4.2 Data Flow Diagrams                         18
       4.2.1 Context Level DFD                    19
       4.2.2 Top Level DFD                        20
       4.2.3 Detailed Level DFD                   20
   4.3 Unified Modelling Language                 21
       4.3.1 Use Case Diagram                     22
       4.3.2 Sequence Diagram                     24
       4.3.3 Collaboration Diagram                25
       4.3.4 Activity Diagram                     26
5. IMPLEMENTATION                                 28
6. TESTING                                        47
   Black Box Testing                              47
   White Box Testing                              48
7. OUTPUT SCREENS                                 49
8. CONCLUSION                                     55
9. FUTURE ENHANCEMENT                             56
10. BIBLIOGRAPHY                                  57
ABSTRACT
We have carried out this major project on SENTIMENT ANALYSIS USING AI-DEEP LEARNING.
This project addresses the problem of sentiment analysis, or opinion mining,
in social media such as Facebook and Twitter, that is, classifying tweets or
people's opinions according to the sentiment expressed in them: positive,
negative, or neutral.
Social media contains information in very large amounts. Extracting
information from social media has several uses in various fields. In
biomedicine and healthcare, for example, extracting information from social
media provides a number of benefits, such as knowledge about the latest
technology and updates on the current situation in the medical field.
Due to the enormous amount of data and opinions being produced, shared,
and transferred every day across the internet and other media, sentiment
analysis has become one of the most active research fields in natural language
processing.
Sentiment analysis, or opinion mining, is done using Machine Learning
models and Deep Learning models. Deep learning has made great breakthroughs in
the fields of speech and image recognition.
In the implemented system, tweets or people's opinions are collected and
sentiment analysis is performed on them; based on the results of this analysis,
a few suggestions can be provided to the user.
In our project, we use NLP for processing the data, and Deep Learning
and Machine Learning algorithms to build a model.
LIST OF FIGURES
Figure Number   Name of Figure                                  Page Number
1.1.1           Image for Deep Learning                         1
1.1.2           Image for Machine Learning                      2
1.1.3           Image for Computer Vision                       3
1.1.4           Image for Autonomous Vehicles                   4
1.1.5           Image for Bots based on Deep Learning           5
4.1.1           Block Diagram for Sentimental Analysis          17
4.2.1           Context Level DFD for Sentimental Analysis      19
4.2.2           Top Level DFD for Sentimental Analysis          20
4.2.3.1         Detailed Level DFD for Sentimental Analysis     21
4.3.1           UseCase Diagram                                 23
4.3.2           Sequence Diagram                                24
4.3.3           Collaboration Diagram                           25
4.3.4           Activity Diagram                                26
4.3.5           Data Dictionary                                 27
5.2.1           Image for Random Forest Algorithm               39
CHAPTER 1
INTRODUCTION
1.1 DOMAIN DESCRIPTION
What is Deep Learning?
Deep Learning is a class of Machine Learning which performs
much better on unstructured data. Deep learning techniques are
outperforming current machine learning techniques. Deep learning enables
computational models to learn features progressively from data at
multiple levels. The popularity of deep learning amplified as the
amount of available data increased, as well as with the advancement of
hardware that provides powerful computers.
Fig 1.1.1 Image for Deep Learning
Deep learning has emerged as a powerful machine learning
technique that learns multiple layers of representations or features of
the data and produces state-of-the-art prediction results. Along with
the success of deep learning in many other application domains,
deep learning has also become popular for sentiment analysis in recent years.
What is Machine Learning?
Machine learning is an application of artificial intelligence (AI)
that provides systems the ability to automatically learn and improve
from experience without being explicitly programmed. Machine learning
focuses on the development of computer programs that can access data and
use it to learn for themselves.
Fig 1.1.2 Image for Machine Learning
Machine Learning (ML) is coming into its own, with a growing
recognition that ML can play a key role in a wide range of critical
applications, such as data mining, natural language processing, image
recognition, and expert systems. ML provides potential solutions in all
these domains and more, and is set to be a pillar of our future
civilization.
“A computer program is said to learn from experience E with
respect to some task T
and some performance measure P, if its performance on T, as measured
by P, improves with experience E.” -- Tom Mitchell, Carnegie Mellon
University
• Some machine learning methods:
Machine learning algorithms are often categorized as supervised
or unsupervised.
Supervised machine learning:
Supervised machine learning algorithms can apply what has
been learned in the past to new data using labelled examples to
predict future events.
Unsupervised machine learning:
Unsupervised machine learning algorithms are used when the
information used to train is neither classified nor labelled.
Deep Learning Examples in Real Life:
1. Computer Vision
High-end gamers interact with deep learning modules on a
very frequent basis. Deep neural networks power bleeding-edge
object detection, image classification, image restoration, and image
segmentation.
Fig 1.1.3 Image for Computer Vision
So much so that they even power the recognition of hand-written
digits on a computer system. In short, deep learning rides on
extraordinary neural networks to empower machines to replicate the
mechanism of the human visual system.
2. Autonomous Vehicles
The next time you are lucky enough to witness an
autonomous vehicle driving down, understand that there are several
AI models working simultaneously. While some models pin-point
pedestrians, others are adept at identifying street signs. A single car
can be
informed by millions of AI models while driving down the road. Many
have considered AI- powered car drives safer than human riding.
Fig 1.1.4 Image for Autonomous Vehicles
3. Automated Translation
Automated translation did exist before the addition of deep
learning, but deep learning is helping machines make enhanced
translations with an accuracy that was missing in the past. Deep learning
also helps with translation derived from images, something totally new
that would not have been possible using traditional text-based
interpretation.
4. Bots Based on Deep Learning
Take a moment to digest this: Nvidia researchers have developed
an AI system that helps robots learn from human demonstrative actions.
Housekeeping robots that perform actions based on artificial intelligence
inputs from several sources are rather common. Just as human brains process
actions based on past experiences and sensory inputs, deep-learning
infrastructures help robots execute tasks depending on varying AI
opinions.
Fig 1.1.5 Image for Bots based on Deep Learning
5. Sentiment-based News Aggregation
Carolyn Gregoire writes in her Huffington Post piece: “the world
isn't falling apart, but it can sure feel like it.” And we couldn't agree
more. Without naming names, you cannot scroll down any of your social media
feeds without stumbling across a couple of global disasters, with the
exception of Instagram perhaps.
News aggregators are now using deep learning modules to filter
out negative news and show you only the positive things happening
around you. This is especially helpful given how blatantly sensationalist
a section of our media can be.
Machine Learning Examples in real life:
1. Image Recognition
Image recognition is a well-known and widespread example of
machine learning in the real world. It can identify an object in a digital
image based on the intensity of the pixels in black-and-white or colour
images.
Real-world examples of image recognition:
• Label an x-ray as cancerous or not
• Assign a name to a photographed face (aka “tagging” on social media)
2. Medical Diagnosis
Machine learning can help with the diagnosis of diseases.
Many physicians use chatbots with speech recognition capabilities to
discern patterns in symptoms.
In the case of rare diseases, the joint use of facial
recognition software and machine learning helps scan patient photos
and identify phenotypes that correlate with rare
genetic diseases.
3. Sentimental Analysis
Sentimental analysis is a top notch machine learning
application that refers to
sentiment classification , opinion mining , and analyzing emotions using
this model, machines groom themselves to analyze sentiments based
on the words. They can identify if the words
are said in a positive, negative or neutral notion. Also they can define
the magnitude of these words.
1.2 ABOUT PROJECT
The growth of the internet due to social networks such as
Facebook, Twitter, LinkedIn, Instagram, etc. has led to significant user
interaction and has empowered users to express their opinions about
products, services, events, and their preferences, among other things. It
has also provided opportunities for users to share their wisdom and
experiences with each other.
The rapid development of social networks is causing explosive
growth of digital content. It has turned online opinions, blogs, tweets,
and posts into a very valuable asset for corporates to gain insights from
the data and plan their strategy. Business organizations need to process
and study these sentiments to investigate the data and to gain business
insights. The traditional approach of manually extracting complex features,
identifying which features are relevant, and deriving patterns from this
huge volume of information is very time consuming and requires significant
human effort. However, Deep Learning can exhibit excellent performance via
Natural Language Processing (NLP) techniques to perform sentiment analysis
on this massive information. The core idea of Deep Learning techniques is
to identify complex features extracted from this vast amount of data,
without much external intervention, using deep neural networks. These
algorithms automatically learn new complex features. Both automatic feature
extraction and the availability of resources are very important when
comparing the traditional machine learning approach and deep learning
techniques. Here, the goal is to classify the opinions and sentiments
expressed by users.
Sentiment analysis is a set of techniques/algorithms used to detect the
sentiment (positive, negative, or neutral) of a given text. It is a very
powerful application of natural language processing (NLP) and finds usage
in a large number of industries. It refers to the use of NLP, text
analysis, computational linguistics, and biometrics to systematically
identify, extract, quantify, and study affective states and subjective
information. Sentiment analysis sometimes goes beyond the categorization
of texts to find opinions and categorizes them as positive or negative,
desirable or undesirable. The figure below describes the architecture of
sentiment classification on texts. In it, we modify the provided reviews by
applying specific filters, and we use the prepared dataset by applying the
parameters and implement our proposed model for evaluation.
Another challenge of microblogging is the incredible breadth
of topics covered. It is not an exaggeration to say that people
tweet about anything and everything.
Therefore, to be able to build systems that mine sentiment about any
given topic, we need a method for quickly identifying data that can
be used for training. In this project, we explore one method for
building such data: using hashtags (e.g., #bestfeeling, #epicfail,
#news) to identify positive, negative, and neutral tweets to use for
training three-way sentiment classifiers.
The online medium has become a significant way for people to express
their opinions, and with social media there is an abundance of opinion
information available. Using sentiment analysis, the polarity of opinions,
such as positive, negative, or neutral, can be found by analyzing the text
of the opinion. Sentiment analysis has been useful for companies to get
their customers' opinions on their products, for predicting the outcomes of
elections, and for getting opinions from movie reviews. The information
gained from sentiment analysis is useful for companies making future
decisions. Many traditional approaches in sentiment analysis use the
bag-of-words method. The bag-of-words technique does not consider language
morphology, and it can incorrectly classify two phrases as having the same
meaning because they share the same bag of words. The relationship between
the collection of words is considered instead of the relationship between
individual words. When determining the overall sentiment, the sentiment of
each word is determined and combined using a function. Bag of words also
ignores word order, which leads to phrases with negation in them being
incorrectly classified.
1.3 OBJECTIVE
To address this problem, we first collect data from various sources
such as websites, PDFs, and Word documents. After collecting the data, we
convert it into a CSV file and then break the data into individual
sentences. Then, using Natural Language Processing, we remove the stop
words, i.e., the useless words in a sentence or the extra data that are of
no use. For example, “the”, “a”, “an”, and “in” are some examples of stop
words in English. After that, the Naive Bayes algorithm is used to train
the model. An ANN algorithm works in the backend to generate a pickle file.
A confusion matrix is used as the validation technique, and accuracy is
used to evaluate the model.
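The following is a minimal sketch of this pipeline using NLTK stop-word removal and a scikit-learn Naive Bayes classifier. The sample sentences, labels, and file names are illustrative assumptions, not the project's actual dataset or code.

# Sketch of the objective's pipeline: stop-word removal, Naive Bayes training,
# evaluation with a confusion matrix and accuracy, and saving the model as a pickle file.
# Requires: nltk.download('stopwords')
import pickle
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical labelled sentences (1 = positive, 0 = negative).
sentences = ["the product is really good", "an awful experience",
             "great service", "not worth the money"]
labels = [1, 0, 1, 0]

stop_words = set(stopwords.words("english"))  # e.g. "the", "a", "an", "in"
cleaned = [" ".join(w for w in s.split() if w not in stop_words) for s in sentences]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(cleaned)

model = MultinomialNB()
model.fit(X, labels)

predictions = model.predict(X)
print(confusion_matrix(labels, predictions))   # validation technique
print(accuracy_score(labels, predictions))     # accuracy measure

# Persist the trained model as a pickle file, as described above.
with open("sentiment_model.pkl", "wb") as f:
    pickle.dump((vectorizer, model), f)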
CHAPTER 2
SENTIMENTAL ANALYSIS SURVEY
2.1 THEORETICAL BACKGROUND
This project, Sentimental Analysis, addresses the problem of sentiment
analysis or opinion mining in social media such as Facebook and Twitter,
that is, classifying tweets or people's opinions according to the sentiment
expressed in them: positive, negative, or neutral.
Social media contains information in very large amounts. Extracting
information from social media has several uses in various fields. In
biomedicine and healthcare, extracting information from social media
provides a number of benefits, such as knowledge about the latest
technology and updates on the current situation in the medical field.
2.2 EXISTING SYSTEM WITH DRAWBACKS
Existing approaches to sentiment analysis are knowledge-based
techniques (the lexicon-based approach), statistical methods, and hybrid
approaches.
Knowledge-based techniques make use of prebuilt lexicon sources
containing the polarity of sentiment words, such as SentiWordNet (SWN),
for determining the polarity of a tweet. The lexicon-based approach
suffers from poor recognition of sentiment.
Statistical methods involve machine learning (such as SVM) and deep
learning approaches; both approaches require labeled training data for
polarity detection.
The hybrid approach to sentiment analysis exploits both statistical
methods and knowledge-based methods for polarity detection. It inherits
high accuracy from machine learning (statistical methods) and stability
from the lexicon-based approach.
2.3 PROPOSED SYSTEM WITH FEATURES
In the proposed system, sentiment analysis is done using natural
language processing, which defines a relation between a user-posted tweet
and the opinion and, in addition, the suggestions of people.
Truly listening to a customer's voice requires deeply understanding
what they have expressed in natural language. NLP is the best way to
understand the natural language used and uncover the sentiment behind it.
NLP makes speech analysis easier. Without NLP and access to the right
data, it is difficult to discover and collect the insight necessary for
driving business decisions. Deep Learning algorithms are used to build a
model.
2.4 ADVANTAGES OF PROPOSED SYSTEM
Advanced techniques like natural language processing are used for the
sentiment analysis, which makes our project very accurate.
NLP defines a relation between a user-posted tweet and the opinion
and, in addition, the suggestions of people.
NLP is the best way to understand the natural language used by people
and uncover the sentiment behind it. NLP makes speech analysis easier.
2.5 FEASIBILITY STUDY:
As the name implies, a feasibility analysis is used to determine
the viability of an idea, such as ensuring a project is legally and
technically feasible as well as economically justifiable. It tells us
whether a project is worth the investment—in some cases, a project
may not be doable. There can be many reasons for this, including
requiring too many resources, which not only prevents those
resources from performing other tasks but also may cost more than
an organization would earn back by taking on a project that isn't
profitable.
2.5.1 Operational Feasibility:
The number of people working on this project is 3 to 4. These
persons should have knowledge of technologies from the domain of
Artificial Intelligence (AI), namely an understanding of Machine
Learning (ML) and its types, and of the working of Natural Language
Processing (NLP).
2.5.2 Technical Feasibility:
Technical feasibility is the study which assesses the details of how you
intend to deliver a product or service to customers. Think materials, labour,
transportation, where your business will be located, and the technology that will
be necessary to bring all this together. It's the logistical or tactical plan of how
your business will produce, store, deliver and track
its products or services.
2.5.2.1 Survey of Technology:
For our project we have chosen Artificial Intelligence (AI)
technology, as we found that by using this technology we can complete our
project and get our desired output for the users.
2.5.2.2 Feasibility of Technology:
For our project, from Machine Learning (ML) we have chosen an
unsupervised machine learning task to train our data on GloVe, i.e., the
Global Vectors for Word Representation dataset. After training on this
dataset, we give our inputs to the model and it displays the top N
sentences.
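As an illustration, here is a minimal sketch of loading pre-trained GloVe word vectors into a Python dictionary. The file name "glove.6B.100d.txt" is an assumption; any downloaded GloVe file follows the same word-then-numbers format.

# Sketch: load pre-trained GloVe vectors into a {word: vector} dictionary.
import numpy as np

embeddings = {}
with open("glove.6B.100d.txt", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        parts = line.split()
        word, vector = parts[0], np.asarray(parts[1:], dtype="float32")
        embeddings[word] = vector

print(len(embeddings), "word vectors loaded")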
2.5.3 Economic Feasibility:
Our project is economically feasible, as it requires only a small to
medium amount of resources, which cost a correspondingly modest amount.
CHAPTER 3
SYSTEM ANALYSIS
System analysis is conducted for the purpose of studying a system or its
parts in order to identify its objectives. It is a problem-solving technique
that improves the system and ensures that all the components of the
system work efficiently to accomplish their purpose.
3.1 SPECIFICATION
Functional Requirements:
The following are the functional requirements of our project:
• A training dataset has to be created on which training is performed.
• A testing dataset has to be created on which testing is performed.
Non-Functional Requirements:
• Maintainability: Maintainability is used to make future maintenance
easier and to meet new requirements.
• Robustness: Robustness is the quality of being able to withstand
stress, pressures, or changes in procedure or circumstance.
• Reliability: Reliability is the ability of a person or system to
perform and maintain its functions in routine circumstances.
• Size: The size of a particular application plays a major role; if the
size is smaller, efficiency will be higher.
• Speed: If the speed is high, then it is good. Since the number of
lines in our code is small, the speed is high.
3.2 SOFTWARE REQUIREMENTS
One of the most difficult tasks is the selection of software: once the
system requirements are known, it must be determined whether a particular
software package fits those requirements.
PROGRAMMING LANGUAGE    PYTHON
TECHNOLOGY              PYCHARM
OPERATING SYSTEM        WINDOWS 10
BROWSER                 GOOGLE CHROME
Table 3.2.1 Software Requirements
3.3 HARDWARE REQUIREMENTS
The selection of hardware is very important in the existence and
proper working of any software. In the selection of hardware, the size and
the capacity requirements are also important.
PROCESSOR       INTEL CORE
RAM CAPACITY    2GB
HARD DISK       1TB
I/O             KEYBOARD, MONITOR, MOUSE
Table 3.3.1 Hardware Requirements
3.4 MODULE DESCRIPTION
For predicting the literacy rate of India, our project has been divided
into the following modules:
1. Data Analysis & Pre-processing
2. Model Training & Testing
3. Accuracy Measures
4. Prediction & Visualization
1. Data Analysis & Pre-processing
Data analysis is done by collecting raw data from different literacy
websites. Data pre-processing involves transforming raw data into an
understandable format. Real-world data is often incomplete, inconsistent,
and/or lacking in certain behaviours or trends, and is likely to contain
many errors. Data pre-processing is a proven method of resolving such
issues. We use the pandas module for data analysis and pre-processing.
Pandas:
In order to be able to work with the data in Python, we need to read
the CSV file into a pandas DataFrame. A DataFrame is a way to represent
and work with tabular data. Tabular data has rows and columns, just like
our CSV file.
2. Model Training & Testing
For literacy rate prediction, we perform "converting into a 2-D array"
and "scaling using normalization" operations on the data for further
processing. We use fit_transform to center the data so that it has zero
mean and unit standard deviation. Then, we divide the data into x_train
and y_train. Our model will get the 0-th element from x_train and try to
predict the 0-th element from y_train. Finally, we reshape the x_train
data to match the requirements for training using Keras. Now we need to
train our model using the above data.
The algorithm that we have used is Linear Regression
Linear Regression:
Linear Regression is a machine learning algorithm based on supervised
learning. It is a statistical approach for modeling the relationship
between a dependent variable and a given set of independent variables.
Here, for simplicity, we refer to the dependent variable as the response
and to the independent variables as features.
Simple Linear Regression is an approach for predicting a response using
a single feature. It is assumed that the two variables are linearly
related. Hence, we try to find a linear function that predicts the
response value (y) as accurately as possible as a function of the feature
or independent variable (x).
For predicting the literacy rate of any given year, first we need to
predict the population for that year. Then the predicted population is
given as input to the model that predicts the literacy rate.
For the algorithm that predicts population, the year is taken as the
independent variable, and the predicted population is taken as the
independent variable for the literacy prediction algorithm.
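A rough sketch of the training step described above follows: scale the data to zero mean and unit standard deviation with fit_transform, reshape it into 2-D column vectors, and fit a linear regression. The column names "year" and "literacy_rate" are assumptions made only for this illustration.

# Sketch of the training step: scaling, reshaping, and fitting linear regression.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

df = pd.read_csv("data.csv")                               # hypothetical dataset
x_train = df["year"].values.reshape(-1, 1)                 # independent variable as a 2-D array
y_train = df["literacy_rate"].values.reshape(-1, 1)        # dependent variable

scaler_x, scaler_y = StandardScaler(), StandardScaler()
x_scaled = scaler_x.fit_transform(x_train)                 # zero mean, unit standard deviation
y_scaled = scaler_y.fit_transform(y_train)

model = LinearRegression()
model.fit(x_scaled, y_scaled)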
Testing:
In testing, we now predict the data. Here we have two steps: predict
the literacy rate and plot it to compare with the real results. We use
fit_transform to scale the data and then reshape it for prediction. We
predict the data and rescale the predicted values to match their real
range. Then we plot the real and predicted literacy rates on a graph and
calculate the accuracy.
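The sketch below illustrates this testing step under the same assumptions: predict on scaled inputs and then inverse-transform the output back to its real range so it can be compared with actual values. The small arrays are illustrative stand-ins for the project's train/test split.

# Sketch of the testing step: scale, predict, then rescale predictions to real values.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

years = np.array([[2001], [2006], [2011], [2016]], dtype=float)
literacy = np.array([[64.8], [69.3], [74.0], [77.7]])          # hypothetical values

sx, sy = StandardScaler(), StandardScaler()
model = LinearRegression().fit(sx.fit_transform(years), sy.fit_transform(literacy))

x_test = sx.transform(np.array([[2021]], dtype=float))
predicted = sy.inverse_transform(model.predict(x_test))        # rescale back to real values
print("Predicted literacy rate for 2021:", predicted.ravel()[0])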
We use the Sklearn and NumPy Python modules for training and testing.
Sklearn:
It features various classification, regression, and clustering
algorithms, including support vector machines, random forests, gradient
boosting, k-means, and DBSCAN, and is designed to interoperate with the
Python numerical and scientific library NumPy.
Numpy:
NumPy is the core library for scientific computing in Python. It
provides a high-performance multidimensional array object and tools for
working with these arrays. It is used for numerical calculations.
3. Accuracy Measures
The accuracy of the model has to be evaluated to figure out the
correctness of the prediction. The proposed model achieved 87% accuracy.
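One common way to measure the accuracy of a regression model is the R-squared score from scikit-learn, sketched below with illustrative numbers (not the project's actual results).

# Hedged sketch of an accuracy measure for a regression model.
from sklearn.metrics import r2_score

y_actual = [64.8, 69.3, 74.0, 77.7]       # illustrative true values
y_predicted = [65.1, 69.0, 73.6, 78.2]    # illustrative model predictions
print("Accuracy (R^2):", r2_score(y_actual, y_predicted))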
4. Prediction & Visualization
Using the proposed model, predictions are made for the coming years.
Graphs are used to visualize state-wise literacy rate predictions. We use
the Matplotlib Python module for visualization.
Matplotlib:
It is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API
for embedding plots into applications using general-purpose GUI toolkits
like Tkinter, wxPython, Qt, or GTK+. There is also a procedural "pylab"
interface based on a state machine (like OpenGL), designed to closely
resemble that of MATLAB, though its use is discouraged.[3] SciPy makes use
of Matplotlib.
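The following is a minimal Matplotlib sketch for visualising actual versus predicted values; the numbers are placeholders, not project results.

# Sketch: plot actual vs. predicted literacy rates with Matplotlib.
import matplotlib.pyplot as plt

years = [2001, 2006, 2011, 2016, 2021]
actual = [64.8, 69.3, 74.0, 77.7]                 # no actual value yet for the predicted year
predicted = [65.1, 69.0, 73.6, 78.2, 81.0]

plt.plot(years[:4], actual, marker="o", label="Actual")
plt.plot(years, predicted, marker="x", linestyle="--", label="Predicted")
plt.xlabel("Year")
plt.ylabel("Literacy rate (%)")
plt.legend()
plt.show()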
CHAPTER 4
DESIGN
4.1 BLOCK DIAGRAM
The block diagram is typically used for a higher level, less detailed
description aimed more at understanding the overall concepts and less at
understanding the details of implementation.
Figure 4.1.1 Block Diagram for Sentimental Analysis
4.2 DATA FLOW DIAGRAMS:
A data flow diagram (DFD) is a graphical representation of the "flow"
of data through an information system, modelling its process aspects.
Often they are a preliminary step used to create an overview of the
system, which can later be elaborated. DFDs can also be used for the
visualization of data processing (structured design).
A DFD shows what kinds of information will be input to and output from
the system, where the data will come from and go to, and where the data
will be stored. It does not show information about the timing of processes
or about whether processes will operate in sequence or in parallel. A DFD
is also called a "bubble chart".
DFD Symbols:
In a DFD, there are four symbols:
• A square defines a source or destination of system data.
• An arrow indicates dataflow. It is the pipeline through which
information flows.
• A circle or bubble represents a process that transforms incoming
dataflow into outgoing dataflow.
• An open rectangle is a data store: data at rest, or a temporary
repository of data.
Dataflow: Data move in a specific direction from an origin to a destination.
Process: People, procedures, or devices that use or produce (transform)
data. The physical component is not identified.
Sources: External sources or destinations of data, which may be programs,
organizations, or other entities.
Data store: Here data is stored or referenced by a process in the system.
In our project, we built the data flow diagrams at the very beginning
of business process modelling in order to model the functions that our
project has to carry out and the interactions between those functions,
with a focus on the data exchanged between processes.
4.2.1 Context level DFD:
A context-level data flow diagram is created using the Structured
Systems Analysis and Design Method (SSADM). This level shows the overall
context of the system and its operating environment and shows the whole
system as just one process. It does not usually show data stores, unless
they are "owned" by external systems, e.g. accessed by but not maintained
by this system; however, these are often shown as external entities. The
context-level DFD is shown in Fig. 4.2.1.
Figure 4.2.1 Context Level DFD for Sentimental Analysis
The context-level data flow diagram shows the data flow from the
application to the database and to the system.
4.2.2 Top level DFD:
A data flow diagram can be used to indicate the clear progress of a
business venture. In the process of producing a data flow diagram, level
one provides an overview of the major functional areas of the undertaking.
After presenting the values for the most important fields of discussion,
it gives room for level two to be drawn.
Figure 4.2.2 Top Level DFD for Sentimental Analysis
After starting and executing the application, training and testing of
the dataset can be done as shown in the above figure.
4.2.3 Detailed Level Diagram
This level explains each process of the system in a detailed manner.
The first detailed-level DFD (generation of individual fields) shows how
data flows through the individual processes/fields in it.
The second detailed-level DFD (generation of the detailed process of
the individual fields) shows how data flows through the system to form a
detailed description of the individual processes.
Figure 4.2.3.1 Detailed level DFD for Sentimental Analysis
After starting and executing the application, training of the dataset
is done by dividing it into a 2-D array and scaling it using normalization
algorithms, and then testing is done.
Figure 4.2.3.2 Detailed level DFD for Sentimental Analysis
After starting and executing the application, training of the dataset
is done using linear regression, and then testing is done.
4.3 UNIFIED MODELLING LANGUAGE DIAGRAMS:
The Unified Modelling Language (UML) is a standard language for
specifying, visualizing, constructing, and documenting a software system
and its components. The UML focuses on the conceptual and physical
representation of the system. It captures the decisions and understandings
about systems that must be constructed. A UML system is represented using
five different views that describe the system from distinctly different
perspectives. Each view is defined by a set of diagrams, as follows.
• User Model View
i. This view represents the system from the user's perspective.
ii. The analysis representation describes a usage scenario from the
end-user's perspective.
• Structural Model View
i. In this model, the data and functionality are viewed from inside
the system.
ii. This model view models the static structures.
• Behavioral Model View
It represents the dynamic (behavioral) parts of the system, depicting
the interactions or collaborations between the various structural elements
described in the user model and structural model views.
• Implementation Model View
In this view, the structural and behavioral parts of the system are
represented as they are to be built.
• Environmental Model View
In this view, the structural and behavioral aspects of the environment
in which the system is to be implemented are represented.
4.3.2 Use Case Diagram:
Use case diagrams are one of the five diagrams in the UML for modeling
the dynamic aspects of systems (activity diagrams, sequence diagrams,
state chart diagrams, and collaboration diagrams are the four other kinds
of diagrams in the UML for modeling the dynamic aspects of systems). Use
case diagrams are central to modeling the behavior of a system, a
sub-system, or a class. Each one shows a set of use cases, actors, and
their relationships.
Figure 4.3.1 USECASE DIAGRAM
4.3.3 Sequence Diagram:
A sequence diagram is an interaction diagram that focuses on the time
ordering of messages. It shows a set of objects and the messages exchanged
between these objects. This diagram illustrates the dynamic view of a
system.
Figure 4.3.2 Sequence Diagram
4.3.4 Collaboration Diagram:
A collaboration diagram is an interaction diagram that emphasizes the
structural organization of the objects that send and receive messages.
Collaboration diagrams and sequence diagrams are isomorphic.
Figure 4.3.3 Collaboration Diagram
4.3.5 Activity Diagram:
An activity diagram shows the flow from activity to activity within a
system; it emphasizes the flow of control among objects.
Figure 4.3.4 Activity Diagram
4.3.1 DATA DICTIONARY
Fig 4.3.5 Data Dictionary
CHAPTER 5
IMPLEMENTATION
Implementation is the stage of the project when the theoretical design
is turned into a working system. Thus it can be considered the most
critical stage in achieving a successful new system and in giving the user
confidence that the new system will work and be effective.
The implementation stage involves careful planning, investigation of
the existing system and its constraints on implementation, design of
methods to achieve the changeover, and evaluation of changeover methods.
The project is implemented by accessing it simultaneously from more
than one system, and from more than one window on one system. The
application is implemented on the Internet Information Services 5.0 web
server under Windows XP and accessed from various clients.
5.1 TECHNOLOGIES USED
What is Python?
Python is an interpreted, high-level programming language for
general-purpose programming created by Guido van Rossum and first released
in 1991. Python has a design philosophy that emphasizes code readability
and a syntax that allows programmers to express concepts in fewer lines of
code, notably using significant whitespace. It provides constructs that
enable clear programming on both small and large scales. Python features a
dynamic type system and automatic memory management. It supports multiple
programming paradigms, including object-oriented, imperative, functional,
and procedural, and has a large and comprehensive standard library.
Python interpreters are available for many operating systems. CPython,
the reference implementation of Python, is open source software and has a
community-based development model, as do nearly all of its variant
implementations. Python is managed by the non-profit Python Software
Foundation.
Python is a general-purpose, dynamic, high-level, interpreted
programming language. It supports an object-oriented programming approach
to developing applications. It is simple and easy to learn and provides
lots of high-level data structures.
• Windows XP
• Python Programming
• Open source libraries: Pandas, NumPy, SciPy, matplotlib, OpenCV
Python Versions
Python 2.0 was released on 16 October 2000 and had many major new
features, including a cycle-detecting garbage collector and support for
Unicode. With this release, the development process became more
transparent and community-backed.
Python 3.0 (initially called Python 3000 or py3k) was released on 3 December 2008
after a long testing period. It is a major revision of the language that is not completely
backward-compatible with previous versions. However, many of its major features have been
back-ported to the Python 2.6.x and 2.7.x version series, and releases of Python 3 include the
2to3 utility, which automates the translation of Python 2 code to Python 3.
Python 2.7's end-of-life date (a.k.a. EOL, sunset date) was initially set at 2015, then
postponed to 2020 out of concern that a large body of existing code could not easily be
forward-ported to Python 3. In January 2017, Google announced work on a Python 2.7 to Go
transcompiler to improve performance under concurrent workloads.
Python 3.6 had changes regarding UTF-8 (in Windows, PEP 528 and PEP 529), and
Python 3.7.0b1 (PEP 540) adds a new "UTF-8 Mode" (and overrides the POSIX locale).
Why Python?
• Python is a scripting language like PHP, Perl, and Ruby.
• No licensing, distribution, or development fees.
• It can be used for desktop applications.
• It runs on Linux and Windows.
• Excellent documentation.
• Thriving developer community.
• Good job opportunities.
Libraries of Python:
Python's large standard library, commonly cited as one of its greatest
strengths, provides tools suited to many tasks. For Internet-facing applications, many
standard formats and protocols such as MIME and HTTP are supported. It includes modules
for creating graphical user interfaces, connecting to relational databases, generating
pseudorandom numbers, doing arithmetic with arbitrary-precision decimals, manipulating
regular expressions, and unit testing.
Some parts of the standard library are covered by specifications (for example, the Web
Server Gateway Interface (WSGI) implementation wsgiref follows PEP 333), but most
modules are not.
They are specified by their code, internal documentation, and test suites (if supplied).
However, because most of the standard library is cross-platform Python code, only a few
modules need altering or rewriting for variant implementations.
As of March 2018, the Python Package Index (PyPI), the official repository for
third-party Python software, contains over 130,000 packages with a wide range of
functionality, including:
• Graphical user interfaces
• Web frameworks
• Multimedia
• Databases
• Networking
• Test frameworks
• Automation
• Web scraping
• Documentation
• System administration
5.2 MACHINE LEARNING
Machine Learning is an application of artificial intelligence (AI) that provides systems
the ability to automatically learn and improve from experience without being explicitly
programmed. Machine learning focuses on the development of computer programs that can
access data and use it to learn for themselves.
Basics of Python machine learning:
• You'll know how to use Python and its libraries to explore your data with the help of
matplotlib and Principal Component Analysis (PCA).
• You'll preprocess your data with normalization and split your data into training
and test sets.
• Next, you'll work with the well-known K-Means algorithm to construct an unsupervised
model, fit this model to your data, predict values, and validate the model that you have
built.
• As an extra, you'll also see how you can use Support Vector Machines (SVM) to
construct another model to classify your data.
Why Machine Learning?
• It was born from pattern recognition and the theory that computers can learn without
being programmed to perform specific tasks.
• It is a method of data analysis that automates analytical model building.
Machine learning tasks are typically classified into broad categories, depending
on whether there is a learning "signal" or "feedback" available to the learning system. They are:
Supervised learning: The computer is presented with example inputs and their desired
outputs, given by a "teacher", and the goal is to learn a general rule that maps inputs to
outputs. As special cases, the input signal can be only partially available, or restricted to
special feedback.
Semi-supervised learning: The computer is given only an incomplete training set, with
some (often many) of the target outputs missing.
Active learning: The computer can only obtain training labels for a limited set of instances
(based on a budget), and it also has to optimize its choice of objects to acquire labels for.
When used interactively, these can be presented to the user for labelling.
Reinforcement learning: Training data (in the form of rewards and punishments) is given
only as feedback to the program's actions in a dynamic environment, such as driving a
vehicle or playing a game against an opponent.
Unsupervised learning: No labels are given to the learning algorithm, leaving it on its
own to find structure in its input. Unsupervised learning can be a goal in itself (discovering
hidden patterns in data) or a means towards an end (feature learning).
In regression, also a supervised problem, the outputs are continuous rather than
discrete.
Regression: The analysis or measure of the association between one variable (the dependent
variable) and one or more other variables (the independent variables), usually formulated in
an equation in which the independent variables have parametric coefficients, which may
enable future values of the dependent variable to be predicted.
Figure 4.1.1 Regression Structure
What is Regression Analysis?
Regression analysis is a form of predictive modelling technique which investigates the
relationship between a dependent (target) variable and independent variables (predictors).
This technique is used for forecasting, time series modelling, and finding the causal-effect
relationship between variables, for example the relationship between rash driving and the
number of road accidents by a driver.
Types of Regression:
1. Linear Regression
2. Logistic Regression
3. Polynomial Regression
4. Stepwise Regression
5. Ridge Regression
6. Lasso Regression
7. Elastic Net Regression
1. Linear Regression: It is one of the most widely known modelling techniques. Linear
regression is usually among the first few topics which people pick while learning predictive
modelling. In this technique, the dependent variable is continuous, the independent
variable(s) can be continuous or discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between the dependent variable (Y) and one
or more independent variables (X) using a best-fit straight line (also known as the
regression line).
2. Logistic Regression: Logistic regression is used to find the probability of
event = Success and event = Failure. We should use logistic regression when the dependent
variable is binary (0/1, True/False, Yes/No) in nature. Here the value of Y ranges from 0 to
1, and it can be represented by the following equations:
odds = p / (1 - p) = probability of event occurrence / probability of event non-occurrence
ln(odds) = ln(p / (1 - p))
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk
3. Polynomial Regression: A regression equation is a polynomial regression
equation if the power of the independent variable is more than 1. The equation below
represents a polynomial equation:
y = a + b*x^2
4. Stepwise Regression: This form of regression is used when we deal with multiple
independent variables. In this technique, the selection of independent variables is done with
the help of an automatic process, which involves no human intervention.
This feat is achieved by observing statistical values like R-squared, t-stats, and the AIC
metric to discern significant variables. Stepwise regression basically fits the regression model
by adding/dropping covariates one at a time based on a specified criterion. Some of the most
commonly used stepwise regression methods are listed below:
• Standard stepwise regression does two things: it adds and removes predictors as
needed at each step.
• Forward selection starts with the most significant predictor in the model and adds a
variable at each step.
• Backward elimination starts with all predictors in the model and removes the least
significant variable at each step.
The aim of this modelling technique is to maximize the prediction power with the
minimum number of predictor variables. It is one of the methods for handling the higher
dimensionality of a data set.
5. Ridge Regression: Ridge Regression is a technique used when the data suffers
from multicollinearity (independent variables are highly correlated). In multicollinearity,
even though the least squares (OLS) estimates are unbiased, their variances are large, which
deviates the observed value far from the true value. By adding a degree of bias to the
regression estimates, ridge regression reduces the standard errors.
Above, we saw the equation for linear regression. Remember? It can be represented as:
y = a + b*x
This equation also has an error term. The complete equation becomes:
y = a + b*x + e (error term), [the error term is the value needed to correct for the
prediction error between the observed and predicted value]
y = a + b1x1 + b2x2 + ... + e, for multiple independent variables.
In a linear equation, prediction errors can be decomposed into two sub-components.
The first is due to bias and the second is due to variance. Prediction error can occur due to
either of these two components or both. Here, we will discuss the error caused by variance.
Ridge regression solves the multicollinearity problem through the shrinkage
parameter λ (lambda). Look at the equation below.
In this equation, we have two components. The first one is the least squares term and
the other one is lambda times the summation of β² (beta squared), where β is the coefficient.
This is added to the least squares term in order to shrink the parameters so that they have
very low variance.
Important Points:
• The assumptions of this regression are the same as those of least squares regression,
except that normality is not to be assumed.
• It shrinks the value of coefficients but does not reach zero, which means it provides no
feature selection.
• This is a regularization method and uses L2 regularization.
6. Lasso Regression: Similar to Ridge Regression, Lasso (Least Absolute Shrinkage
and Selection Operator) also penalizes the absolute size of the regression coefficients. In
addition, it is capable of reducing the variability and improving the accuracy of linear
regression models. Look at the equation below:
Lasso regression differs from ridge regression in that it uses absolute values in
the penalty function, instead of squares. This leads to penalizing (or equivalently
constraining the sum of the absolute values of) the estimates, which causes some of the
parameter estimates to turn out exactly zero. The larger the penalty applied, the further the
estimates get shrunk towards absolute zero. This results in variable selection out of the
given n variables.
Important Points:
• The assumptions of this regression are the same as those of least squares regression,
except that normality is not to be assumed.
• It shrinks coefficients to zero (exactly zero), which certainly helps in feature selection.
• This is a regularization method and uses L1 regularization.
• If a group of predictors is highly correlated, lasso picks only one of them and shrinks
the others to zero.
7. Elastic Net Regression: Elastic Net is a hybrid of the Lasso and Ridge Regression
techniques. It is trained with L1 and L2 priors as regularizers. Elastic-net is useful when
there are multiple features which are correlated. Lasso is likely to pick one of these at
random, while elastic-net is likely to pick both.
A practical advantage of trading off between Lasso and Ridge is that it allows
Elastic-Net to inherit some of Ridge's stability under rotation.
Important Points:
• It encourages a group effect in the case of highly correlated variables.
• There are no limitations on the number of selected variables.
• It can suffer from double shrinkage.
Beyond these 7 most commonly used regression techniques, you can also look at
other models like Bayesian, Ecological, and Robust regression.
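The sketch below compares several of the regression variants discussed above on the same synthetic data using scikit-learn; the alpha and l1_ratio values are arbitrary illustrative choices, not tuned parameters.

# Sketch: fit Linear, Ridge, Lasso, and Elastic Net on synthetic data and compare coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)   # third feature is pure noise

for model in (LinearRegression(), Ridge(alpha=1.0),
              Lasso(alpha=0.1), ElasticNet(alpha=0.1, l1_ratio=0.5)):
    model.fit(X, y)
    # Lasso/Elastic Net shrink the noise coefficient towards (or exactly to) zero.
    print(type(model).__name__, np.round(model.coef_, 2))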
Classification
A classification problem is when the output variable is a category, such as "red" or
"blue", or "disease" and "no disease". A classification model attempts to draw some
conclusion from observed values. Given one or more inputs, a classification model will try
to predict the value of one or more outcomes, for example, when filtering emails, "spam" or
"not spam", or when looking at transaction data, "fraudulent" or "authorized". In short,
classification either predicts categorical class labels or classifies data (constructs a model)
based on the training set and the values (class labels) of the classifying attributes, and uses
this model to classify new data.
There are a number of classification models. Classification models include
1. Logistic regression
2. Decision tree
3. Random forest
4. Naive Bayes.
1. Logistic Regression
Logistic regression is a supervised learning classification algorithm used to predict
the probability of a target variable. The nature of the target or dependent variable is
dichotomous, which means there are only two possible classes. In simple words, the
dependent variable is binary in nature, having data coded as either 1 (stands for success/yes)
or 0 (stands for failure/no).
Mathematically, a logistic regression model predicts P(Y=1) as a function of X. It
is one of the simplest ML algorithms that can be used for various classification problems
such as spam detection, diabetes prediction, cancer detection, etc.
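A minimal scikit-learn sketch of logistic regression on a binary target is shown below; the toy dataset is an assumption made only for illustration.

# Sketch: logistic regression predicting P(Y=1) as a function of X.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])          # dichotomous target: 0 = failure/no, 1 = success/yes

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))         # [P(Y=0), P(Y=1)] for a new input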
2. Decision Tree
Decision Trees are a type of Supervised Machine Learning (that is you explain what
the input is and what the corresponding output is in the training data) where the data is
continuously split according to a certain parameter. The tree can be explained by two entities,
namely decision nodes and leaves.
3. Random Forest
Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique. It can be used for both Classification and Regression problems
in ML. It is based on the concept of ensemble learning, which is a process of combining
multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, "Random Forest is a classifier that contains a number of
decision trees on various subsets of the given dataset and takes the average to improve the
predictive accuracy of that dataset." Instead of relying on one decision tree, the random
forest takes the prediction from each tree and, based on the majority votes of the
predictions, predicts the final output.
The greater the number of trees in the forest, the higher the accuracy, and the more the
problem of overfitting is prevented.
The diagram below explains the working of the Random Forest algorithm:
Fig 5.2.1 Image for Random Forest Algorithm
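A short sketch of a Random Forest classifier follows: an ensemble of decision trees whose majority vote gives the final prediction. The iris dataset is used only as an illustrative stand-in, not the project's data.

# Sketch: Random Forest classification with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)   # 100 trees in the forest
forest.fit(X_train, y_train)
print("Test accuracy:", forest.score(X_test, y_test))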
4. Naive Bayes
Naive Bayes is a classification technique based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature.
A Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with its simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c),
P(x), and P(x|c):
P(c|x) = P(x|c) * P(c) / P(x)
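A tiny worked illustration of this formula follows; the probabilities are made-up numbers for a word x appearing in a hypothetical "positive" class c.

# Worked numeric illustration of Bayes' theorem: P(c|x) = P(x|c) * P(c) / P(x).
p_c = 0.5           # prior probability of the class, P(c)
p_x_given_c = 0.2   # likelihood of the feature given the class, P(x|c)
p_x = 0.25          # evidence, P(x)

p_c_given_x = p_x_given_c * p_c / p_x
print("Posterior P(c|x) =", p_c_given_x)   # 0.4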
5.3 Deep Learning
Deep Learning is a class of Machine Learning which performs much better on
unstructured data. Deep learning techniques are outperforming current machine learning
techniques. It enables computational models to learn features progressively from data at
multiple levels. The popularity of deep learning amplified as the amount of available data
increased, as well as with the advancement of hardware that provides powerful computers.
Deep learning has emerged as a powerful machine learning technique that learns
multiple layers of representations or features of the data and produces state-of-the-art
prediction results. Along with the success of deep learning in many other application
domains, deep learning has also become popular for sentiment analysis in recent years.
Deep Learning Algorithms:
There are two types of algorithms:
1. Structured Algorithms
2. Unstructured Algorithms
1. Structured Algorithms
One of the structured algorithms is the Artificial Neural Network.
Artificial Neural Networks are computational models inspired by the human brain.
Many of the recent advancements in the field of Artificial Intelligence, including voice
recognition, image recognition, and robotics, have been made using Artificial Neural
Networks. Artificial Neural Networks are biologically inspired simulations performed on the
computer to carry out certain specific tasks such as:
• Clustering
• Classification
• Pattern Recognition
2. Unstructured Algorithms
One of the unstructured algorithms is the Deep Neural Network.
A deep neural network (DNN) is an artificial neural network (ANN) with multiple
layers between the input and output layers. There are different types of neural networks, but
they always consist of the same components: neurons, synapses, weights, biases, and
functions. These components function similarly to the human brain and can be trained like
any other ML algorithm.
For example, a DNN that is trained to recognize dog breeds will go over the given
image and calculate the probability that the dog in the image is a certain breed. The user can
review the results and select which probabilities the network should display (above a certain
threshold, etc.) and return the proposed label. Each mathematical manipulation as such is
considered a layer, and complex DNNs have many layers, hence the name "deep" networks.
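Below is a hedged sketch of a small deep neural network built with Keras (which the report mentions for training). The layer sizes and the 100-dimensional input are illustrative assumptions, not the project's actual architecture.

# Sketch: a small DNN with multiple layers between input and output, for binary sentiment output.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(100,)),            # e.g. a 100-dimensional feature/embedding vector
    layers.Dense(64, activation="relu"),   # hidden layer 1
    layers.Dense(32, activation="relu"),   # hidden layer 2
    layers.Dense(1, activation="sigmoid")  # output probability (e.g. positive vs. negative)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()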
Modules in Python
Module: A module allows you to logically organize your Python code.
Grouping related code into a module makes the code easier to understand and use. A
module is a Python object with arbitrarily named attributes that you can bind and
reference.
Pandas: -
Pandas is a Python package providing fast, flexible, and expressive data
structures designed to make working with “relational” or “labelled” data both easy and
intuitive. It aims to be the fundamental high-level building block for doing practical,
real world data analysis in Python. Additionally, it has the broader goal of becoming the
most powerful and flexible open source data analysis / manipulation tool available in
any language.
It is already well on its way toward this goal.
Pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel
spread sheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and
column labels
• Any other form of observational / statistical data sets. The data actually need not be labelled at all to be placed into a pandas data structure.
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users, DataFrame provides everything that R's data.frame provides and much more. pandas is built on top of NumPy and is intended to integrate well within a scientific computing environment with many other 3rd-party libraries. A few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating-point as well as non-floating-point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher-dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
• Powerful, flexible groupby functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining of data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labelling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files,
databases, and saving / loading data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging, etc.
Many of these principles are here to address the shortcomings frequently
experienced using other languages / scientific research environments. For data
scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analysing / modelling it, then organizing the results of the analysis into a
form suitable for plotting or tabular display. pandas is the ideal tool for all of these
tasks.
• pandas is fast. Many of the low-level algorithmic bits have been extensively optimized in Cython code. However, as with anything else, generalization usually sacrifices performance; if you focus on one feature for your application, you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in Python.
• pandas has been used extensively in production in financial applications.
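As a short sketch of how pandas might be used to load and inspect a review dataset (the file name Restaurant_Reviews.tsv and its column names are assumptions for illustration, not necessarily the dataset used in this project):

import pandas as pd

# Load a tab-separated file of reviews and labels (names assumed for illustration).
df = pd.read_csv('Restaurant_Reviews.tsv', sep='\t')

print(df.head())                    # first few rows: 'Review' text and 'Liked' label
print(df.isna().sum())              # missing (NaN) values per column
print(df['Liked'].value_counts())   # balance of positive/negative labels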
NumPy: -
NumPy, which stands for Numerical Python, is a library consisting of
multidimensional array objects and a collection of routines for processing those arrays.
Using NumPy, mathematical and logical operations on arrays can be performed. This section explains the basics of NumPy, such as its architecture and environment, and also discusses the various array functions, types of indexing, etc. An introduction to Matplotlib is also provided, with examples for better understanding.
NumPy is a Python package; its name stands for 'Numerical Python'. Numeric, the ancestor of NumPy, was developed by Jim Hugunin. Another package, Numarray, was also developed with some additional functionalities. In 2005, Travis Oliphant created the NumPy package by incorporating the features of Numarray into Numeric. There are many contributors to this open-source project.
Operations using NumPy: -
Using NumPy, a developer can perform the following operations —
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has in-built functions for linear
algebra and random number generation.
NumPy – A Replacement for MATLAB
NumPy is often used along with packages like SciPy (Scientific Python) and Matplotlib (plotting library). This combination is widely used as a replacement for MATLAB, a popular platform for technical computing. The Python alternative to MATLAB is now seen as a more modern and complete programming language, and it is open source, which is an added advantage of NumPy.
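A brief sketch of the kinds of NumPy operations listed above (the values are arbitrary examples):

import numpy as np

a = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2-D array
b = np.array([0.5, -1.0])

print(a + 1)                # element-wise arithmetic
print(a > 2)                # element-wise logical comparison
print(a.dot(b))             # linear algebra: matrix-vector product
print(np.linalg.inv(a))     # matrix inverse
print(np.random.rand(3))    # random number generation
print(np.fft.fft(b))        # Fourier transform of a small array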
Scikit-learn: -
Scikit-learn (formerly scikits.learn) is a free software machine learning library
for the Python programming language. It features various classification, regression and
clustering algorithms including support vector machines, random forests, gradient
boosting, k-means and DBSCAN, and is designed to interoperate with the Python
numerical and scientific libraries NumPy and SciPy.
The scikit-learn project started as scikits.learn, a Google Summer of Code project by David Cournapeau. Its name stems from the notion that it is a “SciKit” (SciPy Toolkit), a separately-developed and distributed third-party extension to SciPy.
The original codebase was later rewritten by other developers. In 2010 Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort and Vincent Michel, all from INRIA, took leadership of the project and made the first public release on February 1st, 2010. Of the various scikits, scikit-learn as well as scikit-image were described as “well-maintained and popular” in November 2012. Scikit-learn is largely written in Python, with some core algorithms written in Cython to achieve performance. Support vector machines are implemented by a Cython wrapper around LIBSVM; logistic regression and linear support vector machines by a similar wrapper around LIBLINEAR.
Some popular groups of models provided by scikit-learn include:
• Ensemble methods: for combining the predictions of multiple
supervised models.
• Feature extraction: for defining attributes in image and text data.
• Feature selection: for identifying meaningful attributes from which to
create supervised models.
• Parameter Tuning: for getting the most out of supervised models.
• Manifold Learning: For summarizing and depicting complex multi-
dimensional data.
• Supervised Models: a vast array not limited to generalized linear models, discriminant analysis, naive Bayes, lazy methods, neural networks, support vector machines and decision trees.
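As a minimal sketch of applying scikit-learn to review text (the tiny inline dataset is purely illustrative and not the project's data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy reviews, for illustration only: 1 = positive, 0 = negative.
reviews = ["food is good", "service was great", "food is bad", "terrible experience"]
labels = [1, 1, 0, 0]

vectorizer = CountVectorizer()            # feature extraction: bag of words
X = vectorizer.fit_transform(reviews)

clf = MultinomialNB()                     # a supervised model from scikit-learn
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["food is bad"])))  # expected: [0] (negative)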
Matplotlib:-
Matplotlib is a plotting library for the Python programming language and its
numerical mathematics extension NumPy. It provides an object-oriented API for
embedding plots into applications using general-purpose GUI toolkits like Tkinter,
wxPython, Qt, or GTK+. There is also a procedural "pylab" interface based on a state
machine (like OpenGL), designed to closely resemble that of MATLAB, though its use
is discouraged. SciPy makes use of matplotlib.
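A small example of the object-oriented Matplotlib API, plotting hypothetical counts of positive and negative reviews (the numbers are invented for illustration):

import matplotlib.pyplot as plt

labels = ['Positive', 'Negative']
counts = [620, 380]                 # hypothetical counts, for illustration only

fig, ax = plt.subplots()
ax.bar(labels, counts)
ax.set_xlabel('Predicted sentiment')
ax.set_ylabel('Number of reviews')
ax.set_title('Distribution of predicted sentiments')
plt.show()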
CHAPTER 6
TESTING
Testing is the process of checking functionality by executing a program with the intent of finding errors. A good test case is one that has a high probability of finding an as-yet-undiscovered error, and a successful test is one that uncovers such an error. Software testing is usually performed for one of two reasons:
• Defect Detection
• Reliability estimation
6.1 BLACK BOX TESTING:
The base of the black box testing strategy lies in the selection of appropriate data
as per functionality and testing it against the functional specifications in order to check
for normal and abnormal behavior of the system. Nowadays, it is becoming common to route the testing work to a third party, as the developer of the system knows too much of the internal logic and coding of the system, which makes it unsuitable for the developer to test the application. The following are the different techniques involved in black box testing:
• Decision Table Testing
• All pairs testing
• State transition tables testing
• Equivalence Partitioning
Software testing is used in association with verification and validation. Verification is the checking or testing of items, including software, for conformance and consistency with an associated specification. Software testing is just one kind of verification, which also uses techniques such as reviews, inspections and walk-throughs. Validation is the process of checking that what has been specified is what the user actually wanted.
• Validation: Are we doing the right job?
• Verification: Are we doing the job right?
In order to achieve consistency in the Testing style, it is imperative to have and
follow a set of testing principles. This enhances the efficiency of testing within SQA
team members and thus contributes to increased productivity. The purpose of this chapter is to provide an overview of testing and the techniques used. Here, after the model is trained on the training dataset, testing is performed.
6.2 WHITE BOX TESTING:
White box testing [10] requires access to source code. Though white box testing [10] can
be performed any time in the life cycle after the code is developed, it is a good practice to
perform white box testing [10] during unit testing phase.
In white box testing, the flow of specific inputs through the code, the expected output and the functionality of conditional loops are tested.
At SDEI, three levels of software testing are performed at various SDLC phases:
• UNIT TESTING: in which each unit (basic component) of the software is tested to verify that the detailed design for the unit has been correctly implemented.
• INTEGRATION TESTING: in which progressively larger groups of tested software components corresponding to elements of the architectural design are integrated and tested until the software works as a whole.
• SYSTEM TESTING: in which the software is integrated into the overall product and tested to show that all requirements are met.
A further level of testing is also done, in accordance with requirements:
• REGRESSION TESTING: refers to the repetition of earlier successful tests to ensure that changes made in the software have not introduced new bugs/side effects.
• ACCEPTANCE TESTING: testing to verify that the product meets customer-specified requirements. The acceptance test suite is run against the supplied input data, and the results obtained are compared with the expected results of the client. A correct match was obtained.
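As an illustration of unit testing in this context (the helper clean_text and its behaviour are hypothetical, not the project's actual code), a simple pytest-style test might look like this:

# test_preprocess.py -- hypothetical unit test for a text-cleaning helper
def clean_text(text):
    # assumed helper: lower-case a review and strip surrounding whitespace
    return text.lower().strip()

def test_clean_text_lowercases_and_strips():
    assert clean_text("  Food is Bad  ") == "food is bad"

def test_clean_text_keeps_already_clean_input():
    assert clean_text("good service") == "good service"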
CHAPTER 7
OUTPUT SCREENS
We use Machine Learning models to evaluate the reviews. At the back end, a Deep Learning algorithm, the ANN (Artificial Neural Network), is used to build and evaluate the model.
The following algorithms are used:
1. Naive Bayes
It is a classification technique based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that the
presence of a particular feature in a class is unrelated to the presence of any other feature.
Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.
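Under this independence assumption, the classifier scores each class c for a review with features x1, …, xn as (standard form, written in LaTeX):

P(c \mid x_1, \ldots, x_n) \propto P(c) \prod_{i=1}^{n} P(x_i \mid c)

and the class with the highest score is returned as the predicted sentiment.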
2. Random Forest
Random Forest is a popular machine learning algorithm that belongs to the supervised learning technique. It can be used for both classification and regression in ML. It is based on the concept of ensemble learning, which is the process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.
As the name suggests, “Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset”. Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority votes of predictions, it predicts the final output.
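A sketch of a Random Forest classifier on features standing in for vectorized reviews (the random data below is only a placeholder so the example is self-contained):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder features and labels standing in for vectorized reviews.
rng = np.random.default_rng(0)
X = rng.random((100, 20))            # 100 "reviews", 20 features each
y = rng.integers(0, 2, size=100)     # 0 = negative, 1 = positive

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # 100 decision trees
forest.fit(X_train, y_train)
print(forest.score(X_test, y_test))  # mean accuracy on the held-out split (majority vote of trees)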
At the back end, the ANN algorithm is used:
3. ANN (Artificial Neural Network)
As described in Section 5.3, Artificial Neural Networks are computational models inspired by the human brain and are used for tasks such as clustering, classification and pattern recognition. Here, the trained ANN serves as the back-end classifier that predicts the sentiment of a review.
Output screen: “Sentiment Analysis – Predicting whether a review is Positive or Negative”, shown for the sample input review “Food is bad”.