Relational Algebra
Relational algebra is a procedural query language. It gives a step-by-step process to obtain
the result of a query. It uses operators to perform queries.
Types of Relational Operations
1. Select Operation:
o The select operation selects tuples that satisfy a given predicate.
o It is denoted by sigma (σ).
Notation: σ p(r)
Where:
σ is used for selection,
r is used for the relation, and
p is a propositional logic formula which may use connectives such as AND, OR, and
NOT, together with relational operators such as =, ≠, ≥, <, >, ≤.
For example: LOAN Relation
BRANCH_NAME LOAN_NO AMOUNT
Downtown L-17 1000
Redwood L-23 2000
Perryride L-15 1500
Downtown L-14 1500
Mianus L-13 500
Roundhill L-11 900
Perryride L-16 1300
Input:
σ BRANCH_NAME = "Perryride" (LOAN)
Output:
BRANCH_NAME LOAN_NO AMOUNT
Perryride L-15 1500
Perryride L-16 1300
2. Project Operation:
o This operation shows the list of those attributes that we wish to appear in the
result. The rest of the attributes are eliminated from the table.
o It is denoted by ∏.
Notation: ∏ A1, A2, ..., An (r)
Where
A1, A2, ..., An are attribute names of relation r.
Example: CUSTOMER RELATION
NAME STREET CITY
Jones Main Harrison
Smith North Rye
Hays Main Harrison
Curry North Rye
Johnson Alma Brooklyn
Brooks Senator Brooklyn
Input:
∏ NAME, CITY (CUSTOMER)
Output:
NAME CITY
Jones Harrison
Smith Rye
Hays Harrison
Curry Rye
Johnson Brooklyn
Brooks Brooklyn
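Although relational algebra is usually written on paper, the select and project operators map naturally onto pandas DataFrame operations. The sketch below is a minimal illustration using hypothetical DataFrames that mirror the LOAN and CUSTOMER relations above; σ becomes a boolean filter and ∏ becomes column selection followed by duplicate removal.

import pandas as pd

# Hypothetical DataFrames mirroring the LOAN and CUSTOMER relations above
loan = pd.DataFrame({
    "BRANCH_NAME": ["Downtown", "Redwood", "Perryride", "Downtown", "Mianus", "Roundhill", "Perryride"],
    "LOAN_NO": ["L-17", "L-23", "L-15", "L-14", "L-13", "L-11", "L-16"],
    "AMOUNT": [1000, 2000, 1500, 1500, 500, 900, 1300],
})
customer = pd.DataFrame({
    "NAME": ["Jones", "Smith", "Hays", "Curry", "Johnson", "Brooks"],
    "STREET": ["Main", "North", "Main", "North", "Alma", "Senator"],
    "CITY": ["Harrison", "Rye", "Harrison", "Rye", "Brooklyn", "Brooklyn"],
})

# Select: sigma BRANCH_NAME = "Perryride" (LOAN)
print(loan[loan["BRANCH_NAME"] == "Perryride"])

# Project: pi NAME, CITY (CUSTOMER); relational projection also removes duplicate rows
print(customer[["NAME", "CITY"]].drop_duplicates())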
3. Union Operation:
o Suppose there are two relations R and S. The union operation contains all the
tuples that are either in R or S or in both R and S.
o It eliminates duplicate tuples. It is denoted by ∪.
Notation: R ∪ S
A union operation must satisfy the following conditions:
o R and S must have the same number of attributes (with compatible domains).
o Duplicate tuples are eliminated automatically.
Example:
DEPOSITOR RELATION
CUSTOMER_NAME ACCOUNT_NO
Johnson A-101
Smith A-121
Mayes A-321
Turner A-176
Johnson A-273
Jones A-472
Lindsay A-284
BORROW RELATION
CUSTOMER_NAME LOAN_NO
Jones L-17
Smith L-23
Hayes L-15
Jackson L-14
Curry L-93
Smith L-11
Williams L-17
Input:
∏ CUSTOMER_NAME (BORROW) ∪ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Johnson
Smith
Hayes
Turner
Jones
Lindsay
Jackson
Curry
Williams
Mayes
4. Set Intersection:
o Suppose there are two relations R and S. The set intersection operation contains
all tuples that are in both R and S.
o It is denoted by ∩.
Notation: R ∩ S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) ∩ ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Smith
Jones
5. Set Difference:
o Suppose there are two relations R and S. The set difference operation contains
all tuples that are in R but not in S.
o It is denoted by minus (-).
Notation: R - S
Example: Using the above DEPOSITOR table and BORROW table
Input:
∏ CUSTOMER_NAME (BORROW) - ∏ CUSTOMER_NAME (DEPOSITOR)
Output:
CUSTOMER_NAME
Jackson
Hayes
Williams
Curry
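The set operators translate directly to pandas as well. The sketch below uses hypothetical single-column DataFrames that mirror the CUSTOMER_NAME projections of the DEPOSITOR and BORROW relations above; duplicates are dropped to match set semantics.

import pandas as pd

depositor = pd.DataFrame({"CUSTOMER_NAME": ["Johnson", "Smith", "Mayes", "Turner", "Johnson", "Jones", "Lindsay"]})
borrow = pd.DataFrame({"CUSTOMER_NAME": ["Jones", "Smith", "Hayes", "Jackson", "Curry", "Smith", "Williams"]})

# Union: all names from both relations, duplicates eliminated
union = pd.concat([borrow, depositor]).drop_duplicates()

# Intersection: names present in both relations
intersection = borrow.merge(depositor, on="CUSTOMER_NAME").drop_duplicates()

# Difference (BORROW - DEPOSITOR): names that borrow but do not deposit
difference = borrow[~borrow["CUSTOMER_NAME"].isin(depositor["CUSTOMER_NAME"])].drop_duplicates()

print(union, intersection, difference, sep="\n\n")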
6. Cartesian product
o The Cartesian product is used to combine each row in one table with each row
in the other table. It is also known as a cross product.
o It is denoted by X.
Notation: E X D
Example:
EMPLOYEE
EMP_ID EMP_NAME EMP_DEPT
1 Smith A
2 Harry C
3 John B
DEPARTMENT
DEPT_NO DEPT_NAME
A Marketing
B Sales
C Legal
Input:
EMPLOYEE X DEPARTMENT
Output:
EMP_ID EMP_NAME EMP_DEPT DEPT_NO DEPT_NAME
1 Smith A A Marketing
1 Smith A B Sales
1 Smith A C Legal
2 Harry C A Marketing
2 Harry C B Sales
2 Harry C C Legal
3 John B A Marketing
3 John B B Sales
3 John B C Legal
7. Rename Operation:
The rename operation is used to rename the output relation. It is denoted by rho (ρ).
Example: We can use the rename operator to rename STUDENT relation to STUDENT1.
ρ(STUDENT1, STUDENT)
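To round off the operators, the sketch below mirrors the EMPLOYEE and DEPARTMENT relations above and shows the Cartesian product with pandas; the rename step is only an illustration, since in pandas ρ amounts to binding a relation to a new name or renaming attributes. Note that merge(how="cross") needs pandas 1.2 or newer.

import pandas as pd

employee = pd.DataFrame({
    "EMP_ID": [1, 2, 3],
    "EMP_NAME": ["Smith", "Harry", "John"],
    "EMP_DEPT": ["A", "C", "B"],
})
department = pd.DataFrame({
    "DEPT_NO": ["A", "B", "C"],
    "DEPT_NAME": ["Marketing", "Sales", "Legal"],
})

# Cartesian product: EMPLOYEE X DEPARTMENT pairs every employee row with every department row
product = employee.merge(department, how="cross")
print(product)  # 3 x 3 = 9 rows

# Rename: in pandas, rho corresponds to binding the result to a new name,
# optionally renaming attributes with DataFrame.rename(columns={...})
emp = employee.rename(columns={"EMP_NAME": "NAME"})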
Python Tutorial | Python Programming Language
Python is a widely used programming language that offers several unique features and
advantages compared to languages like Java and C++. Our Python tutorial thoroughly
explains Python basics and advanced concepts, starting with installation, conditional
statements, loops, built-in data structures, Object-Oriented Programming, Generators,
Exception Handling, Python RegEx, and many other concepts. This tutorial is designed for
beginners and working professionals.
In the late 1980s, Guido van Rossum dreamed of developing Python. The first version of
Python 0.9.0 was released in 1991. Since its release, Python started gaining popularity.
According to reports, Python is now the most popular programming language among
developers because of its high demands in the tech realm.
What is Python
Python is a general-purpose, dynamically typed, high-level, compiled and interpreted,
garbage-collected, and purely object-oriented programming language that supports
procedural, object-oriented, and functional programming.
Features of Python:
Easy to use and Read - Python's syntax is clear and easy to read, making it an ideal
language for both beginners and experienced programmers. This simplicity can
lead to faster development and reduce the chances of errors.
Dynamically Typed - The data types of variables are determined during run-time.
We do not need to specify the data type of a variable during writing codes.
High-level - High-level language means human readable code.
Compiled and Interpreted - Python code is first compiled into bytecode and
then interpreted line by line. When we download Python from python.org, we get the
default implementation of Python, known as CPython. CPython is
considered to be both compiled and interpreted.
Garbage Collected - Memory allocation and de-allocation are automatically
managed. Programmers do not specifically need to manage the memory.
Purely Object-Oriented - It treats everything as an object, including numbers
and strings.
Cross-platform Compatibility - Python can be easily installed on Windows, macOS,
and various Linux distributions, allowing developers to create software that runs
across different operating systems.
Rich Standard Library - Python comes with several standard libraries that provide
ready-to-use modules and functions for various tasks, ranging from web
development and data manipulation to machine learning and networking.
Open Source - Python is an open-source, cost-free programming language. It is
utilized in several sectors and disciplines as a result.
Python has many web-based assets, open-source projects, and a vibrant community. Learning
the language, working together on projects, and contributing to the Python ecosystem are all
made very easy for developers.
Because of its straightforward language framework, Python is easier to understand and write
code in. This makes it a fantastic programming language for novices. Additionally, it assists
seasoned programmers in writing clear and error-free code.
Python has many third-party libraries that can be used to make its functionality easier. These
libraries cover many domains, for example, web development, scientific computing, data
analysis, and more.
Java vs. Python
Python is an excellent choice for rapid development and scripting tasks. Whereas Java
emphasizes a strong type system and object-oriented programming.
Here are some basic programs that illustrate key differences between them.
Printing 'Hello World'
Python Code:
print("Hello, World!")
Output:
Hello, World!
In Python, printing 'Hello, World!' takes just one line of simple code.
Java Code:
public class HelloWorld {
public static void main(String[] args) {
System.out.println("Hello, World!");
}
}
Output:
Hello, World!
In Java, we need to declare a class, a main method, and other structural elements.
While both programs give the same output, we can notice the syntax difference in the print
statement.
In Python, it is easy to learn and write code, while Java requires more code to
perform certain tasks.
Python is dynamically typed, meaning we do not need to declare the variable type,
whereas Java is statically typed, meaning we need to declare the variable type.
Python is suitable for various domains such as Data Science, Machine Learning,
Web development, and more. Whereas Java is suitable for web development,
mobile app development (Android), and more.
Python Basic Syntax
There is no use of curly braces or semicolons in Python programming language. It is an
English-like language. But Python uses indentation to define a block of code. Indentation is
nothing but adding whitespace before the statement when it is needed.
For example -
def func():
    # every statement indented under 'def' belongs to the function body
    statement_1 = "first statement"
    statement_2 = "second statement"
    print(statement_1, statement_2)
In the above example, the statements indented to the same level belong to the
function. Generally, four spaces are used for one level of indentation.
Instead of the semicolon used in other languages, Python ends its statements with a
newline character.
Python is a case-sensitive language, which means that uppercase and lowercase letters are
treated differently. For example, 'name' and 'Name' are two different variables in Python.
In Python, comments can be added using the '#' symbol. Any text written after the '#' symbol
is considered a comment and is ignored by the interpreter. This trick is useful for adding notes
to the code or temporarily disabling a code block. It also helps in understanding the code
better by some other developers.
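For instance, a minimal sketch illustrating both comments and case sensitivity:

# This is a comment: everything after '#' is ignored by the interpreter
name = "Alice"   # lowercase 'name'
Name = "Bob"     # 'Name' is a different variable because Python is case-sensitive
print(name, Name)   # prints: Alice Bob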
'if', 'else', 'for', 'while', 'try', 'except', and 'finally' are a few reserved keywords in Python that
cannot be used as variable names. These terms have fixed meanings and are reserved for
particular purposes in the language. If you try to use one of these keywords as a variable
name, the interpreter will raise a syntax error.
History of Python
Python was created by Guido van Rossum. In the late 1980s, Guido van Rossum, a Dutch
programmer, began working on Python while at the Centrum Wiskunde & Informatica (CWI)
in the Netherlands. He wanted to create a successor to the ABC programming language that
would be easy to read and efficient.
In February 1991, the first public version of Python, version 0.9.0, was released. This marked
the official birth of Python as an open-source project. The language was named after the
British comedy series "Monty Python's Flying Circus".
Python development has gone through several stages. In January 1994, Python 1.0 was
released as a usable and stable programming language. This version included many of the
features that are still present in Python today.
From the 1990s to the 2000s, Python gained popularity for its simplicity, readability, and
versatility. In October 2000, Python 2.0 was released. Python 2.0 introduced list
comprehensions, garbage collection, and support for Unicode.
In December 2008, Python 3.0 was released. Python 3.0 introduced several backward-
incompatible changes to improve code readability and maintainability.
Throughout the 2010s, Python's popularity increased, particularly in fields like data science,
machine learning, and web development. Its rich ecosystem of libraries and frameworks made
it a favourite among developers.
The Python Software Foundation (PSF) was established in 2001 to promote, protect, and
advance the Python programming language and its community.
Why learn Python?
Python provides many useful features to the programmer. These features make it the most
popular and widely used language. We have listed below a few essential features of Python.
Easy to use and Learn: Python has a simple and easy-to-understand syntax, unlike
traditional languages like C, C++, Java, etc., making it easy for beginners to learn.
Expressive Language: It allows programmers to express complex concepts in just a
few lines of code or reduces Developer's Time.
Interpreted Language: Python does not require compilation, allowing rapid
development and testing. It uses Interpreter instead of Compiler.
Object-Oriented Language: It supports object-oriented programming, making
writing reusable and modular code easy.
Open-Source Language: Python is open-source and free to use, distribute and
modify.
Extensible: Python can be extended with modules written in C, C++, or other
languages.
Large Standard Library: Python's standard library contains many modules and
functions that can be used for various tasks, such as string manipulation, web
programming, and more.
GUI Programming Support: Python provides several GUI frameworks, such as
Tkinter and PyQt, allowing developers to create desktop applications easily.
Integrated: Python can easily integrate with other languages and technologies,
such as C/C++, Java, and .NET.
Embeddable: Python code can be embedded into other applications as a scripting
language.
Dynamic Memory Allocation: Python automatically manages memory allocation,
making it easier for developers to write complex programs without worrying about
memory management.
Wide Range of Libraries and Frameworks: Python has a vast collection of libraries
and frameworks, such as NumPy, Pandas, Django, and Flask, that can be used to
solve a wide range of problems.
Versatility: Python is a universal language in various domains such as web
development, machine learning, data analysis, scientific computing, and more.
Large Community: Python has a vast and active community of developers
contributing to its development and offering support. This makes it easy for
beginners to get help and learn from experienced developers.
Career Opportunities: Python is a highly popular language in the job market.
Learning Python can open up several career opportunities in data science, artificial
intelligence, web development, and more.
High Demand: With the growing demand for automation and digital
transformation, the need for Python developers is rising. Many industries seek
skilled Python developers to help build their digital infrastructure.
Increased Productivity: Python has a simple syntax and powerful libraries that can
help developers write code faster and more efficiently. This can increase
productivity and save time for developers and organizations.
Big Data and Machine Learning: Python has become the go-to language for big
data and machine learning. Python has become popular among data scientists and
machine learning engineers with libraries like NumPy, Pandas, Scikit-learn,
TensorFlow, and more.
Where is Python used?
Python is a general-purpose, popular programming language, and it is used in almost every
technical field. The various areas of Python use are given below.
Data Science: Data Science is a vast field, and Python is an important language for
this field because of its simplicity, ease of use, and availability of powerful data
analysis and visualization libraries like NumPy, Pandas, and Matplotlib.
Desktop Applications: PyQt and Tkinter are useful libraries that can be used in GUI
- Graphical User Interface-based desktop applications. Other languages may be better
suited to this field, but Python can still be used, alone or with them, to build desktop applications.
Console-based Applications: Python is also commonly used to create command-
line or console-based applications because of its ease of use and support for
advanced features such as input/output redirection and piping.
Mobile Applications: While Python is not commonly used for creating mobile
applications, it can still be combined with frameworks like Kivy or BeeWare to
create cross-platform mobile applications.
Software Development: Python is considered one of the best languages for
building software, and it works well for projects ranging from small-scale
to large-scale software.
Artificial Intelligence: AI is an emerging Technology, and Python is a perfect
language for artificial intelligence and machine learning because of the availability
of powerful libraries such as TensorFlow, Keras, and PyTorch.
Web Applications: Python is commonly used in web development on the backend
with frameworks like Django and Flask and on the front end with tools like
JavaScript, HTML, and CSS.
Enterprise Applications: Python can be used to develop large-scale enterprise
applications with features such as distributed computing, networking, and parallel
processing.
3D CAD Applications: Python can be used for 3D computer-aided design (CAD)
applications through libraries such as Blender.
Machine Learning: Python is widely used for machine learning due to its simplicity,
ease of use, and availability of powerful machine learning libraries.
Computer Vision or Image Processing Applications: Python can be used for
computer vision and image processing applications through powerful libraries such
as OpenCV and Scikit-image.
Speech Recognition: Python can be used for speech recognition applications
through libraries such as SpeechRecognition and PyAudio.
Scientific computing: Libraries like NumPy, SciPy, and Pandas provide advanced
numerical computing capabilities for tasks like data analysis, machine learning, and
more.
Education: Python's easy-to-learn syntax and availability of many resources make it
an ideal language for teaching programming to beginners.
Testing: Python is used for writing automated tests, providing frameworks like
unittest and pytest that help write test cases and generate reports.
Gaming: Python has libraries like Pygame, which provide a platform for developing
games using Python.
IoT: Python is used in IoT for developing scripts and applications for devices like
Raspberry Pi, Arduino, and others.
Networking: Python is used in networking for developing scripts and applications
for network automation, monitoring, and management.
DevOps: Python is widely used in DevOps for automation and scripting of
infrastructure management, configuration management, and deployment
processes.
Finance: Python has libraries like Pandas, Scikit-learn, and Statsmodels for financial
modeling and analysis.
Audio and Music: Python has libraries like Pyaudio, which is used for audio
processing, synthesis, and analysis, and Music21, which is used for music analysis
and generation.
Performance Metrics in Machine Learning
Evaluating the performance of a Machine learning model is one of the important
steps while building an effective ML model. To evaluate the performance or
quality of the model, different metrics are used, and these metrics are known
as performance metrics or evaluation metrics. These performance metrics help
us understand how well our model has performed for the given data. In this way,
we can improve the model's performance by tuning the hyper-parameters. Each
ML model aims to generalize well on unseen/new data, and performance metrics
help determine how well the model generalizes on the new dataset.
In machine learning, each task or problem is divided
into classification and Regression. Not all metrics can be used for all types of
problems; hence, it is important to know and understand which metrics should be
used. Different evaluation metrics are used for both Regression and Classification
tasks. In this topic, we will discuss metrics used for classification and regression
tasks.
1. Performance Metrics for Classification
In a classification problem, the category or classes of data is identified based on
training data. The model learns from the given dataset and then classifies the new
data into classes or groups based on the training. It predicts class labels as the
output, such as Yes or No, 0 or 1, Spam or Not Spam, etc. To evaluate the
performance of a classification model, different metrics are used, and some of
them are as follows:
o Accuracy
o Confusion Matrix
o Precision
o Recall
o F-Score
o AUC(Area Under the Curve)-ROC
I. Accuracy
The accuracy metric is one of the simplest classification metrics to implement,
and it can be determined as the ratio of the number of correct predictions to the total
number of predictions.
It can be formulated as:
Accuracy = (Number of correct predictions) / (Total number of predictions)
To implement an accuracy metric, we can compare ground truth and predicted
values in a loop, or we can also use the scikit-learn module for this.
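As a concrete sketch (the labels below are made up for illustration), accuracy can be computed either with a simple loop or with scikit-learn's accuracy_score:

from sklearn.metrics import accuracy_score

# Hypothetical ground-truth and predicted labels
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Manual computation: correct predictions / total predictions
manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# scikit-learn equivalent
print(manual, accuracy_score(y_true, y_pred))  # both print 0.75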
When to Use Accuracy?
It is good to use the Accuracy metric when the target variable classes in the data are
approximately balanced. For example, if 60% of the images in a fruit dataset are of
Apple and 40% are of Mango, the classes are roughly balanced, so accuracy is a
reasonable measure of how well the model distinguishes Apple from Mango.
When not to use Accuracy?
It is recommended not to use the Accuracy measure when the target variable
majorly belongs to one class. For example, Suppose there is a model for a disease
prediction in which, out of 100 people, only five people have a disease, and 95
people don't have one. In this case, if our model predicts every person as having no
disease (a useless prediction), the Accuracy measure will still be 95%,
which is misleading.
II. Confusion Matrix
A confusion matrix is a tabular representation of prediction outcomes of any
binary classifier, which is used to describe the performance of the classification
model on a set of test data when true values are known.
The confusion matrix is simple to implement, but the terminologies used in this
matrix might be confusing for beginners.
A typical confusion matrix for a binary classifier has the following layout (it can be
extended to classifiers with more than two classes):
                 Predicted: No      Predicted: Yes
Actual: No       True Negative      False Positive
Actual: Yes      False Negative     True Positive
We can determine the following from such a matrix:
o In the matrix, columns are for the predicted values, and rows
specify the actual values. Here actual and predicted each take two
possible classes, Yes or No. So, if we are predicting the presence
of a disease in a patient, Yes in the Prediction column means the
patient has the disease, and No means the patient doesn't have the
disease.
o In this example, the total number of predictions is 165, out of
which the model predicted Yes 110 times and No 55 times.
o However, in reality, there are 60 cases in which patients don't have the
disease and 105 cases in which patients have the disease.
In general, the table is divided into four terminologies, which are as follows:
1. True Positive (TP): The model predicted the positive class, and the actual value is also
positive.
2. True Negative (TN): The model predicted the negative class, and the actual value is also
negative.
3. False Positive (FP): The model predicted the positive class, but the actual value is
negative.
4. False Negative (FN): The model predicted the negative class, but the actual value is
positive.
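A short sketch, using the same kind of hypothetical labels as above, showing how scikit-learn's confusion_matrix returns these four counts:

from sklearn.metrics import confusion_matrix

# Hypothetical labels: 1 = "has the disease", 0 = "no disease"
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# In scikit-learn's layout, rows are actual classes and columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")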
III. Precision
The precision metric is used to overcome the limitation of Accuracy. Precision
determines the proportion of positive predictions that were actually
correct. It is calculated as the number of true positives divided by the total number
of positive predictions (true positives plus false positives):
Precision = TP / (TP + FP)
IV. Recall or Sensitivity
It is similar to the Precision metric; however, it aims to calculate the
proportion of actual positives that were identified correctly. It is calculated
as the number of true positives divided by the total number of actual
positives, whether correctly predicted as positive or incorrectly predicted as
negative (true positives plus false negatives).
The formula for calculating Recall is given below:
Recall = TP / (TP + FN)
When to use Precision and Recall?
From the above definitions of Precision and Recall, we can say that recall
determines the performance of a classifier with respect to a false negative,
whereas precision gives information about the performance of a classifier with
respect to a false positive.
So, if we want to minimize false negatives, Recall should be as close to
100% as possible, and if we want to minimize false positives, Precision should
be as close to 100% as possible.
In simple words, if we maximize precision, it will minimize the FP errors, and if
we maximize recall, it will minimize the FN error.
V. F-Scores
F-score or F1 Score is a metric to evaluate a binary classification model on the
basis of predictions that are made for the positive class. It is calculated with the
help of Precision and Recall. It is a type of single score that represents both
Precision and Recall. So, the F1 Score can be calculated as the harmonic mean
of both precision and Recall, assigning equal weight to each of them.
The formula for calculating the F1 score is given below:
F1 Score = 2 × (Precision × Recall) / (Precision + Recall)
When to use F-Score?
As the F-score makes use of both precision and recall, it should be used when both
of them are important for evaluation, but one (precision or recall) is slightly more
important to consider than the other, for example when false negatives are
comparatively more costly than false positives, or vice versa.
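The sketch below, again with hypothetical labels, computes precision, recall, and the F1 score with scikit-learn:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(precision, recall, f1)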
2. Performance Metrics for Regression
Regression is a supervised learning technique that aims to find the relationships
between the dependent and independent variables. A predictive regression model
predicts a numeric (continuous) value. The metrics used for regression are different
from the classification metrics. It means we cannot use the Accuracy metric
(explained above) to evaluate a regression model; instead, the performance of a
Regression model is reported as errors in the prediction. Following are the
popular metrics that are used to evaluate the performance of Regression models.
o Mean Absolute Error
o Mean Squared Error
o R2 Score
o Adjusted R2
I. Mean Absolute Error (MAE)
Mean Absolute Error or MAE is one of the simplest metrics; it measures the
absolute difference between the actual and predicted values, where absolute means
taking each difference as positive:
MAE = (1/N) Σ |y − ŷ|
To understand MAE, let's take the example of Linear Regression, where the model
draws a best-fit line between the dependent and independent variables. To measure
the error in prediction, we calculate the difference between the actual
values and the predicted values; to find the absolute error for the
complete dataset, we take the mean of these absolute differences.
MAE is much more robust to outliers. One limitation of MAE is that
it is not differentiable at zero, which complicates gradient-based optimizers such
as Gradient Descent. To overcome this limitation, another metric can be
used, which is Mean Squared Error or MSE.
II. Mean Squared Error
Mean Squared Error or MSE is one of the most widely used metrics for Regression
evaluation. It measures the average of the squared differences between the predicted
values and the actual values:
MSE = (1/N) Σ (y − ŷ)²
Since the errors are squared, MSE only takes non-negative values, and it is usually
positive and non-zero.
Moreover, because the differences are squared, large errors are penalized heavily,
which can overstate how bad the model is when outliers are present.
MSE is often preferred over other regression metrics as it is
differentiable and hence can be optimized more easily.
III. R Squared Score
R squared error is also known as the Coefficient of Determination, and it is another
popular metric for Regression model evaluation. The R-squared metric
compares our model with a constant baseline to determine the
performance of the model. The constant baseline is obtained by taking the
mean of the target values and drawing a horizontal line at that mean.
The R squared score is always less than or equal to 1, regardless of whether
the values are large or small.
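The following sketch, using made-up actual and predicted values, computes MAE, MSE, and the R squared score with scikit-learn:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values from a regression model
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

print("MAE:", mean_absolute_error(y_true, y_pred))  # mean of |y - y_hat|
print("MSE:", mean_squared_error(y_true, y_pred))   # mean of (y - y_hat)^2
print("R2 :", r2_score(y_true, y_pred))             # 1 - SS_res / SS_tot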
What is Model Tuning?
Model tuning is the experimental process of finding the optimal values
of hyperparameters to maximize model performance. Hyperparameters
are the set of variables whose values cannot be estimated by the model
from the training data. These values control the training process. Model
tuning is also known as hyperparameter optimization.
Why is it Essential to Tune an ML Model?
The purpose of tuning a model is to ensure that it performs at its best.
This process involves adjusting various elements of the model to
achieve optimal results. By fine-tuning the model, you can maximize
its performance and get the highest rate of performance possible.
How do you Tune a Model?
A robust evaluation criterion should be identified and set before model
tuning to optimize the tuning parameters towards the specific goal.
Model tuning can be done manually or using automated methods.
Manual model tuning: In this method, hyperparameter values are set
based on intuition or past experience. The model is then trained and
evaluated to determine the performance using the respective set of
hyperparameters. Adjustments are made, and this process continues
until an optimal value for each hyperparameter is found.
Automated model tuning: In this method, optimal hyperparameter
values are found using algorithms. Here, we define a hyperparameter
search space from which the optimal set of hyperparameter values is
selected. Some of the popular algorithms for
automated hyperparameter tuning are:
Grid search: The user defines a set of values for each hyperparameter
to form a grid. Different combinations of these hyperparameter values
are tried and the combination which yields the best result is selected as
the final set of optimal hyperparameters. The process is very resource-
intensive as the algorithm trains one model for each set of possible
hyperparameter combinations.
Random Search: The user here also defines a set of hyperparameter
values but here the algorithm will only try random combinations of
hyperparameter values rather than every possible combination. The
combination that yields the best result from this is selected as the
optimal set of hyperparameters.
Bayesian Search: Bayesian hyperparameter tuning, also known as
Bayesian optimization, keeps track of past evaluation results
and uses this information to make informed decisions when selecting
the next hyperparameter values to try. Bayesian hyperparameter tuning is
efficient because it chooses hyperparameter values in an informed
manner.
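As an illustrative sketch (the estimator, dataset, and search space below are arbitrary choices, not prescribed by the text), grid search and random search can be run with scikit-learn's GridSearchCV and RandomizedSearchCV:

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hyperparameter search space (assumed values for illustration)
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.001, 0.01, 0.1, 1]}

# Grid search: tries every combination in the grid
grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print("Grid search best params:", grid.best_params_)

# Random search: samples a fixed number of random combinations
rand = RandomizedSearchCV(SVC(), param_grid, n_iter=8, cv=5, random_state=0)
rand.fit(X, y)
print("Random search best params:", rand.best_params_)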
Bias and Variance in Machine Learning
Machine learning is a branch of Artificial Intelligence, which allows
machines to perform data analysis and make predictions. However, if
the machine learning model is not accurate, it can make prediction
errors, and these prediction errors are usually known as Bias and
Variance. In machine learning, these errors will always be present as
there is always a slight difference between the model predictions and
actual predictions. The main aim of ML/data science analysts is to
reduce these errors in order to get more accurate results. In this topic,
we are going to discuss bias and variance, Bias-variance trade-off,
Underfitting and Overfitting. But before starting, let's first understand
what errors in Machine learning are.
Errors in Machine Learning?
In machine learning, an error is a measure of how accurately an
algorithm can make predictions for the previously unknown dataset. On
the basis of these errors, the machine learning model is selected that
can perform best on the particular dataset. There are mainly two types
of errors in machine learning, which are:
o Reducible errors: These errors can be reduced to improve the
model accuracy. Such errors can further be classified into bias
and Variance.
o Irreducible errors: These errors will always be present in the
model
regardless of which algorithm has been used. These errors are caused by
unknown variables whose influence on the output cannot be removed.
What is Bias?
In general, a machine learning model analyses the data, finds patterns in
it, and makes predictions. While training, the model learns these patterns
in the dataset and applies them to test data for prediction. While
making predictions, a difference occurs between prediction values
made by the model and actual values/expected values, and this
difference is known as bias errors or Errors due to bias. It can be
defined as an inability of machine learning algorithms such as Linear
Regression to capture the true relationship between the data points.
Each algorithm begins with some amount of bias because bias occurs
from assumptions in the model, which makes the target function simple
to learn. A model has either:
o Low Bias: A low bias model will make fewer assumptions about
the form of the target function.
o High Bias: A model with a high bias makes more assumptions,
and the model becomes unable to capture the important features
of our dataset. A high bias model also cannot perform well on
new data.
Generally, a linear algorithm has a high bias, as high bias makes it learn
fast. The simpler the algorithm, the more bias it is likely to
introduce, whereas a nonlinear algorithm often has low bias.
Some examples of machine learning algorithms with low bias are
Decision Trees, k-Nearest Neighbours and Support Vector
Machines. At the same time, an algorithm with high bias is Linear
Regression, Linear Discriminant Analysis and Logistic Regression.
Ways to reduce High Bias:
High bias mainly occurs when the model is too simple. Below are some
ways to reduce high bias:
o Increase the input features as the model is underfitted.
o Decrease the regularization term.
o Use more complex models, such as including some polynomial
features.
What is a Variance Error?
Variance specifies the amount of variation in the prediction if
different training data were used. In simple words, variance tells
how much a random variable differs from its expected
value. Ideally, a model should not vary too much from one training
dataset to another, which means the algorithm should be good in
understanding the hidden mapping between inputs and output
variables. Variance errors are either of low variance or high variance.
Low variance means there is a small variation in the prediction of the
target function with changes in the training data set. At the same
time, High variance shows a large variation in the prediction of the
target function with changes in the training dataset.
Some examples of machine learning algorithms with low variance
are, Linear Regression, Logistic Regression, and Linear
discriminant analysis. At the same time, algorithms with high
variance are decision tree, Support Vector Machine, and K-nearest
neighbours.
Ways to Reduce High Variance:
o Reduce the input features or the number of parameters if the model is
overfitted.
o Do not use an overly complex model.
o Increase the training data.
o Increase the regularization term.
Different Combinations of Bias-Variance
There are four possible combinations of bias and variance:
1. Low-Bias, Low-Variance: The combination of low bias and low variance
describes an ideal machine learning model. However, it is rarely achievable in
practice.
2. Low-Bias, High-Variance: With low bias and high variance,
model predictions are inconsistent but accurate on average. This
case occurs when the model learns with a large number of
parameters and hence leads to overfitting.
3. High-Bias, Low-Variance: With high bias and low variance,
predictions are consistent but inaccurate on average. This case
occurs when a model does not learn well from the training dataset
or uses too few parameters. It leads
to underfitting problems in the model.
4. High-Bias, High-Variance:
With high bias and high variance, predictions are inconsistent and
also inaccurate on average.
How to identify High variance or High Bias?
High variance can be identified if the model has:
o Low training error and high test error.
High Bias can be identified if the model has:
o High training error and the test error is almost similar to training
error.
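A minimal sketch of this diagnosis, fitting polynomial models of increasing complexity to synthetic data; the dataset, degrees, and noise level are assumptions chosen purely for illustration:

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Hypothetical noisy data generated from a sine curve
rng = np.random.RandomState(0)
X = rng.uniform(0, 6, 80).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=80)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):  # simple, balanced, and very flexible models
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_err = mean_squared_error(y_train, model.predict(X_train))
    test_err = mean_squared_error(y_test, model.predict(X_test))
    # High bias: both errors high; high variance: low training error, much higher test error
    print(f"degree={degree:2d}  train MSE={train_err:.3f}  test MSE={test_err:.3f}")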
Bias-Variance Trade-Off
While building the machine learning model, it is really important to
take care of bias and variance in order to avoid overfitting and
underfitting in the model. If the model is very simple with fewer
parameters, it may have low variance and high bias. Whereas, if the
model has a large number of parameters, it will have high variance and
low bias. So, it is required to make a balance between bias and variance
errors, and this balance between the bias error and variance error is
known as the Bias-Variance trade-off.
For an accurate prediction of the model, algorithms need a low variance
and low bias. But this is not possible because bias and variance are
related to each other:
o If we decrease the variance, it will increase the bias.
o If we decrease the bias, it will increase the variance.
Bias-Variance trade-off is a central issue in supervised learning.
Ideally, we need a model that accurately captures the regularities in
training data and simultaneously generalizes well with the unseen
dataset. Unfortunately, doing this is not possible simultaneously.
A high-variance algorithm may perform well with training data,
but it may lead to overfitting to noisy data, whereas a high-bias
algorithm generates a much simpler model that may not even capture
important regularities in the data. So, we need to find a sweet spot
between bias and variance to make an optimal model.
Hence, the Bias-Variance trade-off is about finding the sweet spot to
make a balance between bias and variance errors.
What is Exploratory Data Analysis?
Exploratory Data Analysis is a primary process in the field of data science.
It involves describing the data using different statistical and
visualization methods, which supports the subsequent steps of analysing the
data.
This article will give a brief description of the exploratory data analysis,
its methods, process and uses.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is a process of analysing and exploring
the data to extract useful characteristics, find patterns and trends, detect
outliers, and find prominent relationships between different variables.
This is the very first step before further data analysis and before
applying statistics to the data set. EDA often accounts for around 80% of the
work done in the data science process.
EDA goes beyond the simple task of summarising data; it is about
uncovering hidden insights that might not be immediately apparent.
Through careful examination of data distributions, relationships
among variables, and trends over time, analysts can unearth valuable
nuggets of information that can shape the direction of their research.
Analysts need to clean and preprocess data to ensure accuracy and
consistency. They need to use multiple visualisation techniques to explore
different aspects of the data. Finally, they ought to interpret
findings critically, considering the context and potential biases within
the data.
Objectives of Exploratory Data Analysis
o Exploratory Data Analysis includes data cleaning. It helps data
scientists clean the data, which includes different processes like
removing duplicates, null values, outliers, and
unnecessary features.
o It includes basic statistics on the data set, such as determining
the central tendency and variability. It is also used to calculate the mean,
median, mode, standard deviation, etc.
o It lists all the important factors, informs the choice of a predictive model,
defines its parameters, and more.
o Exploratory data analysis also supports feature engineering, in
which a data scientist explores different variables and creates new
features to extract insights and useful information from
them. Using feature engineering, data features can be scaled and
normalised, derived variables can be created, and categorical
variables can be encoded.
o Exploratory data analysis also uncovers relationships and
dependencies between variables. It allows visualisation of the data
by creating different charts and graphs like scatter plots, bar graphs,
etc., which reveal the insights and relations between variables.
Importance of Exploratory Data Analysis in Data Science
Exploratory Data Analysis is a primary step that is used to prepare the
data for further processes in data science, including data manipulation,
visualisation, making predictive models, etc. It helps in finding errors and
detecting patterns in the dataset. It helps in building a solid foundation for data
science projects.
EDA helps data analysts and scientists know whether they are
proceeding in the right direction. It helps customers confirm that
they are asking the right questions. It answers minimal but necessary
questions like the correlation, standard deviation, mean, median, mode,
dependent features, and unnecessary attributes in the data set. After the
successful completion of the process of exploratory data analysis, the data
scientists move forward with the further process in a smooth manner by
making predictive models and analysing the data more deeply.
Tools Used for Exploratory Data Analysis
Exploratory Data Analysis can be performed using different tools. These
are:
Python: Python is an easy-to-use yet powerful object-oriented
programming language that provides a platform to solve many different
problems, including machine learning, deep learning, data science, and
many more. When talking about exploratory data analysis, Python
provides different libraries with simple, easy-to-read and understandable
syntax that can help to perform the tasks of EDA efficiently.
Python provides built-in data structures and functions that can be used
to locate and handle missing values in the data set and to examine
the basic structure and required features for the data analysis. It helps
in deciding the best-suited machine learning models. It additionally
offers versatile libraries that support machine learning by building
predictive models.
Another useful tool for exploratory data analysis is the R
programming language. It is an open-source programming language that
offers an environment for statistical computing. It provides dedicated statistical
functions to examine the characteristics of the data set.
EDA has applications across diverse industries, including business
analytics, healthcare, finance, and marketing. In business
analytics, EDA helps in understanding customer behaviour and market
trends. In healthcare, EDA aids in disease surveillance and
epidemiological research. In finance, EDA supports risk assessment
and portfolio management. In marketing, EDA informs segmentation
and targeting strategies.
Significance of EDA
EDA is important for a number of reasons:
o Insight Generation: It offers information that a cursory review of the
data might miss, enabling analysts to make more informed
decisions.
o Error Detection: By assisting in the early identification of data
quality issues during the analysis process, EDA lowers the
possibility of making erroneous conclusions.
o Generation of Hypotheses: EDA may lead to the development of
hypotheses that can later be tested with formal statistical techniques.
o Communication: Findings are frequently conveyed to stakeholders
through the use of visualisations created during EDA, which helps
to make complex data easier to access and comprehend.
Types of Exploratory Data Analysis
Exploratory data analysis (EDA) encompasses various strategies, each
serving a particular purpose in understanding and analysing datasets. Here are
a few common types of EDA:
1. Univariate Analysis:
It focuses on analysing a single variable at a time. Techniques
include histograms, box plots, and summary statistics such as the mean,
median, and mode. It helps in understanding the distribution
and central tendency of individual variables.
2. Bivariate Analysis:
It examines the relationship between two variables. It makes use of
techniques such as scatter plots, correlation analysis, and
contingency tables. It helps in identifying patterns, associations, and
dependencies between variables.
3. Multivariate Analysis:
Multivariate analysis is used to examine relationships between multiple
variables simultaneously using different techniques, including multiple
regression analysis, principal component analysis (PCA), and cluster
analysis. It allows for a deeper exploration of complex
relationships and interactions among variables.
4. Temporal Analysis:
It specialises in analysing data over time. It makes use of
techniques including time series plots, trend analysis, and seasonality
decomposition, which help in identifying patterns and trends that unfold over
time.
5. Spatial Analysis:
It analyses data across geographical areas using techniques that include spatial
mapping, spatial autocorrelation analysis, and hotspot analysis. It is
useful for understanding spatial patterns and relationships in data,
such as geographical clusters or trends.
6. Textual Analysis:
Textual analysis is used to analyse text data to extract meaningful
insights. It uses techniques such as sentiment analysis, topic
modeling, and text mining, which are useful for analysing
textual data such as customer reviews, social media posts, or survey
responses.
7. Interactive Visualisation:
It utilises interactive visualisations to explore data dynamically.
Techniques include interactive dashboards, drill-down charts, and
linked visualisations. It enables a more engaging and exploratory
analysis experience, allowing users to interactively explore data
from different perspectives.
8. Statistical Modeling:
This type of exploratory data analysis involves fitting statistical models to
the data to test hypotheses or make predictions by using different
techniques, including linear regression, logistic regression, and machine
learning algorithms. It helps in quantifying relationships between
variables and making predictions based on data patterns.
Each type of EDA serves a specific purpose and can be used alone or in
combination to gain a comprehensive understanding of the dataset and
extract actionable insights.
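As a brief sketch of how univariate, bivariate, and multivariate EDA look in practice with pandas (the file name and column names below are hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")          # hypothetical dataset

# Univariate analysis: summary statistics and the distribution of one column
print(df.describe())
df["price"].hist()                    # assumes a numeric 'price' column

# Bivariate analysis: relationship between two variables
df.plot.scatter(x="area", y="price")  # assumes a numeric 'area' column

# Multivariate view: pairwise correlations between the numeric columns
print(df.select_dtypes("number").corr())
plt.show()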
Python Data Analytics
Data Analysis can help us to obtain useful information from data and
can provide a solution to our queries. Further, based on the observed
patterns we can predict the outcomes of different business policies.
Understanding the Basics of Data Analytics
Data
The data we work with during analysis is mostly in the
csv (comma-separated values) format. Usually, the first row in a csv
file serves as the header.
Packages Available
There is a variety of Python libraries that facilitate easy implementation
without writing long code.
Examples of some of the packages are-
1. Scientific computing libraries such as NumPy, Pandas & SciPy.
2. Visualization libraries such as Matplotlib and seaborn.
3. Algorithm libraries such as scikit-learn and statsmodels.
Importing & Exporting Datasets
The two essential things that we must take care of while importing the
datasets are-
1. Format- It refers to the way a file is encoded. The examples of
prominent formats are .csv,.xlsx,.json, etc.
2. Path of File- The path of the file refers to the location of the file
where it is stored. It can be available either in any of the drives or
some online source.
It can be done in the following way-
Example -
import pandas as pd
path=" "
df = pd.read_csv(path)
If the dataset doesn't contain a header, we can specify it in the following
way-
df = pd.read_csv(path, header=None)
To look at the first five and last five rows of the dataset, we can make
use of df.head() and df.tail() respectively.
Let's have a look at how we can export the data. For example, to save a DataFrame
to an Excel file:
path = " "
df.to_excel(path)
Data Wrangling
Data wrangling is the process of converting data from a raw format
into one in which it can be used for analysis.
Let us see what this part encompasses-
How to deal with missing values?
Missing values - Some entries are left blank because the
information is unavailable. A missing value is usually represented with NaN, ?, or
0.
Let us discuss how we can deal with them-
A common option is to replace missing numerical values with the mean of the
attribute and missing categorical values with the mode.
Sometimes we have to drop the missing values instead; this can be done
using:
df.dropna()
If we want to drop a row, we have to specify the axis as 0. If we want
to drop a column, we have to specify the axis as 1.
Moreover, if we want these changes to directly occur in the dataset, we
will specify one more parameter inplace = True.
Now let's see how the values can be replaced-
The syntax is -
df.replace(missing_value, new_value)
Here we will make a variable and store the mean of the attribute (whose missing
values we want to replace) in it:
import numpy as np
mean = df["attribute name"].mean()
df["attribute name"] = df["attribute name"].replace(np.nan, mean)
How to proceed with data formatting?
It refers to the process of bringing the data in a comprehensible format.
For example - Changing a variable name to make it understandable.
Normalization of Data
The features present in the dataset have values that can result in a biased
prediction. Therefore, we must bring them to a range where they are
comparable.
To do the same, we can use the following techniques on an attribute-
1. Simple Feature Scaling: Xn = Xold / Xmax
2. Min-Max approach: Xn = (Xold - Xmin) / (Xmax - Xmin)
3. Z-score: Xn = (Xold - µ) / σ
µ - average value
σ - standard deviation
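A small sketch applying these three techniques to a hypothetical 'price' attribute with pandas:

import pandas as pd

df = pd.DataFrame({"price": [5000, 10000, 15000, 20000, 50000]})  # hypothetical attribute
x = df["price"]

df["simple_scaled"] = x / x.max()                      # Simple feature scaling
df["min_max"] = (x - x.min()) / (x.max() - x.min())    # Min-Max approach
df["z_score"] = (x - x.mean()) / x.std()               # Z-score standardisation

print(df)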
How to convert categorical variables into numeric variables?
Under this, we proceed with a process called "One-Hot Encoding". Let's
say there's an attribute that holds categorical values. We will make
dummy variables from the possible categories and assign them 0 or 1 based
on their occurrence in the attribute.
To convert categorical variables to dummy variables (0 or 1), we will use:
pd.get_dummies(df["attribute-name"])
This will generate the expected results.
Binning in Python
It refers to the process of converting numeric variables into categorical
variables.
Let's say we have taken an attribute 'price' from a dataset. We can
divide its data into three categories based on the range and then denote
them with names such as low-price, mid-price, and high price.
We can obtain the ranges using the linspace() method:
bins = np.linspace(min(df["attribute-name"]), max(df["attribute-name"]), 4)
cat_names = ["low-price", "mid-price", "high-price"]
df["bin_name"] = pd.cut(df["attribute-name"], bins, labels=cat_names)
Exploratory Data Analysis
Statistics
We can find out the statistical summary of our dataset
using describe() method. It can be used as df.describe(). The
categorical variables can be summarized using value_counts() method.
Correlation
Correlation measures the extent to which two variables are
interdependent.
To get a visual idea of the kind of correlation that exists between two
variables, we can plot a graph and interpret how a rise in the
value of one attribute affects the other attribute.
Concerning statistics, we can obtain the correlation using Pearson
Correlation. It gives us the correlation coefficient and the P-value.
Let us have a look at the criteria-
CORRELATION COEFFICIENT      RELATIONSHIP
Close to +1                  Large positive
Close to -1                  Large negative
Close to 0                   No relationship exists
P-VALUE                      CERTAINTY
P-value < 0.001              Strong
P-value < 0.05               Moderate
P-value < 0.1                Weak
P-value > 0.1                No certainty
We can use it in our code via the scipy.stats package.
Let's say we want to calculate the correlation between two attributes,
attribute1 and attribute2:
from scipy import stats
pearson_coef, p_value = stats.pearsonr(df["attribute1"], df["attribute2"])
Further to check the correlation between all the variables, we can create
a heatmap.
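A minimal sketch combining the Pearson correlation above with a seaborn heatmap; the DataFrame here is made up for illustration:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical DataFrame with two numeric attributes
df = pd.DataFrame({"attribute1": [1, 2, 3, 4, 5],
                   "attribute2": [2, 4, 5, 4, 6]})

pearson_coef, p_value = stats.pearsonr(df["attribute1"], df["attribute2"])
print(pearson_coef, p_value)

# Heatmap of the correlations between all numeric variables
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.show()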
Overfitting and Underfitting
Overfitting - It is the condition when the model fits the noise in the data
rather than the underlying function.
Underfitting - It is the condition when the model is too simple to fit
the data.
Introduction to Text Analytics and Text Mining
In the modern data-driven world, text in the form of documents, social media
posts, emails, and reports constitutes a huge part of the available data. To harness
the value embedded in these vast textual assets, fields like text analysis, text
analytics, and text mining have emerged, providing sophisticated methods for
extracting and interpreting meaningful insights from text.
Text Analysis
Text analysis is the systematic process of inspecting and interpreting textual data
to extract meaningful information and insights. This field uses diverse computational
techniques to convert unstructured text into structured data, making it
easier to analyse and derive actionable conclusions. Text analysis is
important in several domains, including business, healthcare, social
media, and more.
Key Components of Text Analysis
Tokenization:
o Definition: Tokenization involves breaking down a text into smaller
units referred to as tokens, which can be words, phrases, or
other meaningful elements.
o Purpose: It is the first step in text preprocessing, helping to simplify
text for further analysis.
Parsing:
o Definition: Parsing is the process of analysing the grammatical structure of
sentences.
o Purpose: This helps in understanding the syntactic relationships
between words, such as subject, verb, and object.
Named Entity Recognition (NER):
o Definition: NER identifies and classifies key elements in the text,
such as names of people, organisations, dates, and places.
o Purpose: It enables the organisation and retrieval of specific information
from large text corpora.
Sentiment Analysis:
o Definition: Sentiment analysis determines the emotional tone behind
a sequence of words, identifying it as positive, negative, or
neutral.
o Purpose: It helps in understanding public opinion, customer feedback, and
the overall sentiment expressed in text.
Topic Modeling:
o Definition: Topic modeling algorithms discover hidden topics
within a large collection of documents.
o Purpose: This technique is used to identify the main
subjects discussed in a corpus without predefined labels.
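A minimal sketch of tokenization, named entity recognition, and sentiment analysis; it assumes spaCy (with its small English model en_core_web_sm) and TextBlob are installed, and the sample sentence is made up for illustration:

import spacy
from textblob import TextBlob

text = "Apple opened a new office in London, and customers seem very happy about it."

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

print([token.text for token in doc])                 # Tokenization
print([(ent.text, ent.label_) for ent in doc.ents])  # Named Entity Recognition
print(TextBlob(text).sentiment)                      # Sentiment (polarity, subjectivity)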
Applications of Text Analysis
Customer Feedback Analysis:
o Use Case: Businesses analyse customer reviews and feedback
to understand customer satisfaction and identify areas for improvement.
o Benefit: Improved customer support and product development
based on direct customer insights.
Social Media Monitoring:
o Use Case: Companies monitor social media platforms to gauge
public opinion, track brand mentions, and identify trending topics.
o Benefit: Enhanced marketing strategies and real-time response to
public sentiment.
Fraud Detection:
o Use Case: Financial institutions examine transactional text
data to detect patterns indicative of fraudulent
activities.
o Benefit: Increased security and prevention of financial crimes.
Healthcare:
o Use Case: Analysing medical records and literature to discover
trends in patient data and research findings.
o Benefit: Improved patient care and accelerated clinical
research.
Legal Document Analysis:
o Use Case: Law firms and legal departments examine
contracts, case law, and other legal documents to
extract relevant information.
o Benefit: Enhanced legal research and efficient handling of
legal matters.
Tools and Technologies in Text Analysis
Natural Language Processing:
o Description: NLP is a branch of artificial intelligence that helps
computer systems understand, interpret, and generate human
language.
o Use: NLP techniques are essential to text analysis,
enabling tasks like tokenization, parsing, and sentiment analysis.
Machine Learning:
o Description: Machine learning algorithms learn from
text data to make predictions and decisions.
o Use: These algorithms are used for text classification, clustering, and
topic modeling.
Text Mining Software:
o Description: Software tools such as Apache
OpenNLP, NLTK (Natural Language Toolkit), and spaCy provide
libraries and frameworks for text analysis.
o Use: These tools simplify the implementation of text analysis
methods and techniques.
Sentiment Analysis Platforms:
o Description: Platforms like Lexalytics, MonkeyLearn, and TextBlob
offer specialised services for sentiment analysis.
o Use: These platforms help businesses and researchers understand
the emotional tone of large text datasets.
Text Analytics
Text analytics is a broad field that involves analysing textual data using
various methodologies from statistics, machine learning, and linguistics. It
focuses on transforming unstructured text into structured data
that can be quantified and analysed, thereby deriving actionable insights.
Key Techniques in Text Analytics
Natural Language Processing:
o Definition: NLP is a branch of artificial intelligence that allows
computer systems to understand, interpret, and generate human
language.
o Purpose: NLP techniques are crucial to text analytics, facilitating
tasks such as language modeling, syntactic
parsing, and semantic understanding.
Machine Learning:
o Definition: Machine learning involves training algorithms to
learn from and make predictions or decisions based on data.
o Purpose: In text analytics, machine learning is used for
tasks such as text classification, sentiment analysis, and clustering.
Information Retrieval:
o Definition: Information retrieval is the process of locating relevant
information within large datasets.
o Purpose: This approach is critical for search engines
and document management systems to retrieve pertinent
documents from large text corpora.
Text Classification:
o Definition: Text classification is the process of assigning predefined
categories to text documents.
o Purpose: Common applications include spam detection in emails,
topic categorisation of news articles, and sentiment
analysis.
Clustering:
o Definition: Clustering groups similar texts together based
on their features without predefined labels.
o Purpose: This method helps in discovering natural groupings in
data, such as grouping similar customer reviews or social
media posts.
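As a closing sketch, text classification can be implemented with scikit-learn by combining TF-IDF features with a Naive Bayes classifier; the tiny corpus, labels, and test sentences below are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny hypothetical training corpus with predefined categories
texts = ["win a free prize now", "limited offer, claim your reward",
         "meeting scheduled for monday", "please review the attached report"]
labels = ["spam", "spam", "ham", "ham"]

# Text classification: TF-IDF features + Naive Bayes classifier
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["claim your free reward", "see the report before the meeting"]))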