EDA Unit 4 Notes
Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling Several Batches - Scatterplots and Resistant Lines - Transformations
Contents
Relationship between Two Variables
Percentage Tables
Analyzing Contingency Tables
Handling Several Batches
Scatter Plots and Resistant Lines
Transformations
Two Marks Questions with Answers
4.1 Relationship between Two Variables
• The term bivariate analysis refers to the analysis of two variables. It is a statistical technique applied to a pair of variables (features/attributes) of data to determine the empirical relationship between them. In other words, it is meant to determine any concurrent relations (usually over and above a simple correlation analysis).
• Bivariate analysis is performed to find the relationship between each variable in the dataset and the target variable of interest, or by taking two variables and finding the relationship between them. For example, the box plot and the violin plot.
• Bivariate analysis can be as simple as creating a scatter plot by plotting one variable against another on a Cartesian plane (think X and Y axes); this can sometimes give a picture of what the data is trying to show. If the data seems to fit a line or curve, then there is a relationship or correlation between the two variables. For example, one might choose to plot caloric intake versus weight.
• There are three common ways to perform bivariate analysis :
1. Scatter plots - These give an idea of the patterns that can be formed using the two variables.
2. Correlation coefficients - The coefficient helps to know if the data in question are related. When the correlation coefficient is zero, the variables are not related. If the correlation coefficient is +1 or -1, the variables are perfectly correlated.
3. Simple linear regression - This uses a wide range of tools to determine how the data points could be related. The points may follow an exponential curve. The regression analysis gives the equation for a line or curve. It also helps to find the correlation coefficient.
• In the context of supervised learning, bivariate analysis can help determine the essential predictors when it is done keeping one of the variables as the dependent variable (Y) and the others as independent variables (X1, X2, ... and so on), hence plotting all (Y, Xi) pairs. So essentially, it is a way of feature selection and feature prioritization.
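The three approaches above can be sketched in a few lines of pandas and NumPy; the caloric-intake figures below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: caloric intake vs. weight (illustrative values only)
df = pd.DataFrame({
    "calories": [1800, 2000, 2200, 2500, 2800, 3000],
    "weight":   [60.0, 63.5, 66.0, 71.0, 76.5, 80.0],
})

# Correlation coefficient: a value near +1 or -1 means a strong relationship,
# a value near 0 means the variables are not (linearly) related
r = df["calories"].corr(df["weight"])

# Simple linear regression: fit a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(df["calories"], df["weight"], 1)

print(f"Pearson r = {r:.3f}")
print(f"weight ~ {slope:.4f} * calories + {intercept:.2f}")
```

A scatter plot of the same two columns (`df.plot.scatter(x='calories', y='weight')`) gives the visual check described above.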
Comparison of correlation and causality
• It is a widespread fallacy to assume that if one variable is observed to vary with a change in the values of another empirically, then either of them is "causing" the other to change or leading the other variable to change. In bivariate analysis, it might be observed that one variable (especially an Xi) is causing Y to change. Still, in actuality, it might just be an indicator and not the actual driver.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
Types of variables and bivariate analysis
• The type of bivariate analysis depends on the kinds of attributes and variables that are used to analyze the data. The variables may be ordinal, categorical or numeric. The independent variable may be categorical, like a brand of pencil; in this case, probit regression or logit regression is used. If the dependent and the independent variables are both ordinal, which means that they have a ranking or position, then the rank correlation coefficient is measured.
• In case the dependent attribute is ordinal, then the ordered probit or the ordered logit is used. It is possible that the dependent attribute could be interval or ratio, like a temperature scale; this is where regression is measured. Below are the kinds of bivariate data correlation.
1. Numerical and Numerical : In this kind, both the variables of the bivariate data, the dependent and the independent variable, have a numerical value.
2. Categorical and Categorical : When both the variables in the bivariate data are in the static form, the data is interpreted and statements and predictions are made about it. During the research, the analysis helps to determine the cause and impact and to conclude that the given variable is categorical.
3. Numerical and Categorical : This is when one of the variables is numerical and the other is categorical. Bivariate analysis is a kind of statistical analysis in which two variables are observed against each other. One of the variables will be dependent and the other independent. The variables are denoted by X and Y. The changes are analyzed between the two variables to understand to what extent the change has occurred.
• Further, there are two types of variables in data - categorical and continuous (numerical). Therefore, for bivariate analysis, there are three possible combinations for analysis that could be carried out, namely: categorical and categorical, categorical and continuous, and continuous and continuous.
1. Categorical and categorical variables combination :
• This is used in case both the variables being analyzed are categorical. In the case of classification models, say, for example, classifying a credit card transaction as fraud or not as the Y variable and then checking if the customer is in his hometown, away, or outside the country. Another example can be age vs. gender and then counting the number of customers who fall in that category. It is important to note that the visualization / summary shows the count or some mathematical or logical aggregation of a 3rd variable / metric, like revenue or cost, in all such analyses. It can be done using crosstabs (heat maps) or pivots in Python.
• Crosstabs : Used to count between categories or get summaries between two categories. The Pandas library has this functionality.
• Pivots : Another useful functionality that can be applied to Pandas dataframes is to get Excel-like pivot tables. This can work for 2+ categorical variables when placed in the proper hierarchy.
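As a sketch of the pivot approach (the data and column names here are invented for illustration), `pandas.pivot_table` gives the Excel-like behavior described above:

```python
import pandas as pd

# Hypothetical customer data (illustrative values)
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35", "26-35", "36-45"],
    "gender":    ["M", "F", "M", "F", "F", "M"],
    "revenue":   [120, 80, 200, 150, 90, 300],
})

# Pivot: total revenue for each age_group x gender cell
pivot = pd.pivot_table(df, index="age_group", columns="gender",
                       values="revenue", aggfunc="sum")
print(pivot)
```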
2. Categorical and continuous (numerical) variables combination :
• In this type, the plotting of a numerical variable's variation within each class is performed. For example, how age varies in each segment, or how the income and expenses of a household vary by loan repayment status.
• Categorical plot for aggregates of continuous variables : Used to get totals or counts of a numerical variable, e.g. revenue for each month. Also, this can be used for counts of another categorical variable instead of the numerical one.
• Plots for distribution of continuous (numerical) variables : Used to see the range and statistics of a numerical variable across categories.
• Plots used are - box plot, violin plot, swarm plot.
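A minimal box-plot sketch across categories, assuming matplotlib is available; the loan-repayment data is invented for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: household income by loan repayment status
df = pd.DataFrame({
    "repaid": ["yes", "yes", "yes", "no", "no", "no"],
    "income": [52000, 61000, 58000, 31000, 40000, 35000],
})

# Box plot of the continuous variable across the categories
ax = df.boxplot(column="income", by="repaid")
ax.set_title("Income by repayment status")
plt.suptitle("")  # drop pandas' automatic super-title
plt.savefig("income_by_status.png")

# The same comparison numerically: median income per category
print(df.groupby("repaid")["income"].median())
```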
3. Continuous and continuous variable combination :
© This is the most common use case of bivariate analysis and is used for showing the
empirical relationship between two numerical (continuous) variables. This is usually
more applicable in regression cases.
• In case there are large datasets with 30 - 70+ features (variables), there might not be sufficient time to run each pair of variables through bivariate analysis one by one. One could use the pair plot or PairGrid from the seaborn library in such cases. It makes a grid where each cell is a bivariate graph, and PairGrid also allows customizations.
4.2 Percentage Tables
• A bivariate table addresses the joint distribution of two variables. A bivariate table is a table that illustrates the relationship between two variables by displaying the distribution of one variable across the categories of a second variable.
• To detect association within bivariate tables, one can calculate percentages within the categories of the independent variable, compare percentages across the categories of the independent variable, or perform a Chi-square test of independence to formally determine the statistical significance.
• Cross-tabulation : a technique used to explore the relationship between two variables that have been organized in a table. The column variable is a variable whose categories comprise the columns of a bivariate table. The row variable is a variable whose categories comprise the rows of a bivariate table. A cell is the intersection of a row and a column in a bivariate table. Marginals are the row and column totals in a bivariate table.
• A bivariate table displays the distribution of one variable across the categories of another variable. The independent variable usually goes in the columns, while the dependent variable goes in the rows. Rows and columns intersect at cells. The row and column totals of a bivariate table are called marginals.
« Bivariate relationships come in several different flavors. When the variation in the
dependent variable can be attributed only to the independent variable, the relationship is
said to be direct. When a third variable affects both the independent and dependent
variables, the relationship is said to be spurious. When the independent variable affects the
dependent variable only by way of a mediating variable (sort of like a chain reaction), it is
said to be an intervening relationship.
• The common percent is an extremely useful measure.
• One can use a percent with any kind of data : nominal, ordinal, interval and ratio. A percent is a standardized measure; percent means "per 100 cases." Because the percent is standardized, one can use it to compare results from different population bases that have different sizes or total case-bases. For example, one could compare the percentage of home computer ownership among people living in villages, urban areas and metro cities.
• The first step in calculating a percent is to isolate the case-base of interest.
• The second step is to identify the category of interest.
• The third step is to locate the number of cases in the category of interest ONLY for the group.
• The fourth step is to take the frequency in the category of interest and divide that frequency by the total number of cases in the case-base of interest.
Constructing a bivariate table :
• Percentages can be computed in different ways, namely:
1. Column percentages - Column totals as base.
2. Row percentages - Row totals as base.
• Typically, percentages are provided for the independent variables.
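Both percentage bases can be computed directly with crosstab's `normalize` option; the survey data below is hypothetical:

```python
import pandas as pd

# Hypothetical survey: home computer ownership by area
df = pd.DataFrame({
    "area":    ["village", "village", "urban", "urban", "metro", "metro",
                "village", "urban", "metro", "metro"],
    "owns_pc": ["no", "no", "yes", "no", "yes", "yes",
                "yes", "yes", "yes", "no"],
})

# Column percentages: each column (independent-variable category) sums to 100
col_pct = pd.crosstab(df["owns_pc"], df["area"], normalize="columns") * 100

# Row percentages: each row sums to 100
row_pct = pd.crosstab(df["owns_pc"], df["area"], normalize="index") * 100

print(col_pct.round(1))
```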
Elaboration
• Elaboration is a process designed to further explore bivariate relationships by introducing additional variables called control variables.
Limitations of elaboration
• Elaboration can be useful, but it also has its limitations. First, it tends to be a little bit tedious, especially if done by hand. Second, it is not the most precise form of analysis. Elaboration allows one to compare the distribution of one variable across the categories of another, but there are other measures of association that do a better job of quantifying the relationship between two variables.
Crosstab function in Python
• The crosstab() function is used to compute a simple cross-tabulation of two (or more) factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.

pandas.crosstab(parameters)

index : array-like, Series, or list of arrays/Series. Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series. Values to group by in the columns.
values : array-like, optional. Array of values to aggregate according to the factors. Requires 'aggfunc' to be specified.
rownames : sequence, default None. If passed, must match the number of row arrays passed.
colnames : sequence, default None. If passed, must match the number of column arrays passed.
aggfunc : function, optional. If specified, requires 'values' to be specified as well.
margins : bool, default False. Add row/column margins (subtotals).
margins_name : str, default 'All'. Name of the row/column that will contain the totals when margins is True.
dropna : bool, default True. Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'} or {0, 1}, default False. Normalize by dividing all values by the sum of values.
# Program to build crosstab and build bar chart
Example program - 1

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Grade': ['FirstClass', 'Outstanding', 'Outstanding',
                             'Distinction', 'FirstClass', 'SecondClass',
                             'PassClass', 'PassClass', 'Distinction', 'Distinction'],
                   'Age': [18, 18, 18, 19, 19, 20, 18, 18, 19, 19],
                   'Gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F']})
print(df)

# Find frequency of each letter grade
grdcrosstb = pd.crosstab(index=df['Grade'], columns='count')
print(grdcrosstb)

# Creating bar plot
barplot = grdcrosstb.plot.bar(rot=0)
Example program - 1 Output

         Grade  Age Gender
0   FirstClass   18      M
1  Outstanding   18      M
2  Outstanding   18      F
3  Distinction   19      F
4   FirstClass   19      F
5  SecondClass   20      M
6    PassClass   18      M
7    PassClass   18      F
8  Distinction   19      M
9  Distinction   19      F

col_0        count
Grade
Distinction      3
FirstClass       2
Outstanding      2
PassClass        2
SecondClass      1
[Bar chart: count of students in each grade - Distinction, FirstClass, Outstanding, PassClass, SecondClass]
# Building tables with crosstab function
Example program - 2

import pandas as pd

res_names = ['Vaishali', 'Rupali', 'Lucky', 'Gorge', 'BlueNyle', 'Goodluck', 'SPs']
purchase_type = ['Food', 'Food', 'Food', 'Drink', 'Food', 'Drink', 'Drink']
price = [12, 25, 32, 10, 15, 22, 18]

print('Restaurant Names: {}'.format(res_names))
print('Purchase Type: {}'.format(purchase_type))
print('Price: {}'.format(price))

rescrtb1 = pd.crosstab(index=[res_names], columns=[purchase_type])
print(rescrtb1)

rescrtb2 = pd.crosstab(index=[res_names], columns=[purchase_type],
                       values=price, aggfunc=sum)
print(rescrtb2)

rescrtb3 = pd.crosstab(index=[res_names],
                       columns=[purchase_type],
                       values=price,
                       aggfunc=lambda x: x.sum()**2,  # setting a custom agg function
                       rownames=['Restaurants'],      # giving a title to my rows
                       colnames=['Food Types'],       # giving a title to my columns
                       margins=True)                  # adding margins (subtotals on the ends)
print(rescrtb3)
Example program - 2 Output

Restaurant Names: ['Vaishali', 'Rupali', 'Lucky', 'Gorge', 'BlueNyle', 'Goodluck', 'SPs']
Purchase Type: ['Food', 'Food', 'Food', 'Drink', 'Food', 'Drink', 'Drink']
Price: [12, 25, 32, 10, 15, 22, 18]

col_0     Drink  Food
row_0
BlueNyle      0     1
Goodluck      1     0
Gorge         1     0
Lucky         0     1
Rupali        0     1
SPs           1     0
Vaishali      0     1

col_0     Drink  Food
row_0
BlueNyle    NaN  15.0
Goodluck   22.0   NaN
Gorge      10.0   NaN
Lucky       NaN  32.0
Rupali      NaN  25.0
SPs        18.0   NaN
Vaishali    NaN  12.0

Food Types    Drink    Food      All
Restaurants
BlueNyle        NaN   225.0    225.0
Goodluck      484.0     NaN    484.0
Gorge         100.0     NaN    100.0
Lucky           NaN  1024.0   1024.0
Rupali          NaN   625.0    625.0
SPs           324.0     NaN    324.0
Vaishali        NaN   144.0    144.0
All          2500.0  7056.0  17956.0
Example program - 3

import pandas as pd

# Dictionary
data = {'Name': ['Dashrath', 'Ram', 'Bharat', 'Laxman', 'Maruti', 'Shatru', 'Bibhi'],
        'Math_score': [52, 87, 49, 74, 28, 59, 48]}

# Create a DataFrame
df1 = pd.DataFrame(data, columns=['Name', 'Math_score'])

# Calculating percentage
df1['percent'] = (df1['Math_score'] / df1['Math_score'].sum()) * 100

# Show the dataframe
print(df1)
Example program - 3 Output

       Name  Math_score    percent
0  Dashrath          52  13.098237
1       Ram          87  21.914358
2    Bharat          49  12.342569
3    Laxman          74  18.639798
4    Maruti          28   7.052897
5    Shatru          59  14.861461
6     Bibhi          48  12.090680
4.3 Analyzing Contingency Tables

• A contingency table is a technique for exploring two or even more variables. It is basically a tally of counts between two or more categorical variables.
• A contingency table depicts the distribution of one variable in rows and another variable in columns. It is used to study the correlation between the two variables. It is a multiway table which describes a dataset in which each observation belongs to one category for each of several variables.
• Contingency tables, also called crosstabs or two-way tables, are used in statistics to summarize the relationship between several categorical variables.
• The contingency coefficient is a coefficient of association which tells whether two variables or datasets are independent or dependent of each other; it is also known as Pearson's coefficient.
• For example, suppose that there is data of 200 people on whether they prefer Potato Vada or Misal and whether they prefer Curd or Coke. This data can be assembled into a contingency table, which might look like :

          Potato Vada   Misal
   Curd        40         85
   Coke        50         25
• When determining whether two variables are associated, it can be helpful to look at a contingency table of proportions. Contingency tables are often given in frequencies and can be converted to proportions by dividing each frequency by the total number of observations.
Marginal proportions
• A marginal proportion in a contingency table is the proportion of observations in a single category of one variable. If given a contingency table of proportions, the marginal proportion can be calculated by taking the row and column sums. If given a contingency table of frequencies, the marginal proportion can be calculated by dividing the row or column sum by the total number of observations.
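Using the Potato Vada / Misal table from the example above, the conversion to proportions and the marginal proportions can be computed as:

```python
import pandas as pd

# The 200-person preference table from the text
table = pd.DataFrame({"Potato Vada": [40, 50], "Misal": [85, 25]},
                     index=["Curd", "Coke"])

n = table.values.sum()             # total observations: 200

# Convert frequencies to proportions
props = table / n

# Marginal proportions: row and column sums of the proportion table
row_marginals = props.sum(axis=1)  # Curd: 125/200, Coke: 75/200
col_marginals = props.sum(axis=0)  # Potato Vada: 90/200, Misal: 110/200
print(row_marginals)
print(col_marginals)
```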
• Contingency tables are quite a popular method during EDA (Exploratory Data Analysis) as they show the relationship between two categorical fields. They are often used to see the distribution of a categorical field with respect to the class label in classification tasks. They are also used in statistical tests like the chi-squared test, which tests for the association between two categorical variables.
# Create a contingency table using Python
Example program - 4

import pandas as pd

# Create a dataframe
df = pd.DataFrame({
    'JeansId': [1, 2, 3, 4, 5, 6],
    'color': ['Red', 'Blue', 'Gray', 'Blue', 'Red', 'Red'],
    'size': ['M', 'S', 'L', 'L', 'S', 'M']
})

# Display the dataframe
print(df)

# Contingency table between 'color' and 'size'
print(pd.crosstab(df['color'], df['size']))

# Contingency table with row and column totals
print(pd.crosstab(df['color'], df['size'], margins=True))
Example program - 4 Output

   JeansId color size
0        1   Red    M
1        2  Blue    S
2        3  Gray    L
3        4  Blue    L
4        5   Red    S
5        6   Red    M

size   L  M  S
color
Blue   1  0  1
Gray   1  0  0
Red    0  2  1

size   L  M  S  All
color
Blue   1  0  1    2
Gray   1  0  0    1
Red    0  2  1    3
All    2  2  2    6
4.4 Handling Several Batches

• Many analytics applications require frequent batch processing, which allows them to process data in batches at varying intervals. Batch systems must be built to scale for all sizes of data and scale seamlessly to the size of the dataset being processed at various job runs.
• Such data received from a batch system can be referred to as big data, as it is a large data set. While dealing with such data sets, the techniques below need to be applied.
• Batch processing is a technique for processing large amounts of data in a repeatable manner. When computational resources are available, the batch technique allows users to process data with little or no user interaction. Simply described, batch processing is the method through which a computer processes batches of work in a nonstop, sequential manner. It also helps in dividing the work into smaller chunks for debugging.
1. Reduce memory usage by optimizing data types :
• When using Pandas to load data from a file, it will automatically infer data types unless told otherwise. Most of the time this works fine, but the inferred type is not necessarily optimized. For example, if a numerical column contains missing values, the inferred type will automatically be float. In particular cases, specifying the data types can lead to an important reduction of the memory used.
2. Split data into chunks :
• When data is too large to fit into memory, one can use Pandas' chunksize option to split the data into chunks instead of dealing with one big block. Using this option creates an iterator object that can be used to go through the different chunks and perform filtering or analysis just like one would do when loading the full dataset.
• Chunking can be used from initial exploratory analysis to model training and requires very little extra setup.
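A minimal chunking sketch; the in-memory CSV stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Small CSV stand-in for a file too large to fit in memory (illustrative)
csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize=4 returns an iterator of DataFrames with up to 4 rows each
for chunk in pd.read_csv(csv, chunksize=4):
    # Filter and aggregate each chunk exactly as with a full DataFrame
    total += chunk.loc[chunk["value"] % 2 == 0, "value"].sum()

print(total)  # sum of the even numbers 0..9 -> 20
```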
3. Take advantage of lazy evaluation :
• Lazy evaluation refers to the strategy which delays evaluation of an expression until the value is actually needed. Lazy evaluation is an important concept (used especially in functional programming).
• Lazy evaluation is the basis on which distributed computation frameworks are built. Although they were designed to work on clusters, one can still take advantage of them to handle large datasets on a personal computer.
• The main difference with respect to Pandas is that they do not load the data directly in memory. Instead, what happens during the read command is that they scan the data, infer dtypes and split it into partitions (so far nothing new). Computational graphs are built for these partitions independently, and they are executed only when really needed (hence lazy).
• Steps to work with Python batch processing use Joblib. Joblib is a set of Python utilities for lightweight pipelining. It contains unique optimizations for NumPy arrays and is built to be quick and resilient on large data. It is released under the Berkeley Software Distribution (BSD) license.
• Some of the key features of Joblib are :
• Transparent disk-caching of functions and lazy re-evaluation : A memoize or make-like feature for Python functions that works with any Python object, including very big NumPy arrays. By expressing the operations as a collection of steps with well-defined inputs and outputs (Python functions), one can separate persistence and flow-execution logic from the domain logic or algorithmic code. Joblib can save their computation to disc and repeat it only if required.
• Simple parallel computing : Joblib makes it easy to write readable parallel code and easily debug it.
• Fast compressed persistence : An alternative to pickle that works well with Python objects that contain a lot of data (joblib.dump & joblib.load).
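A minimal sketch of Joblib's parallel API, assuming joblib is installed; the function and inputs are invented for illustration:

```python
from joblib import Parallel, delayed

# A CPU-bound function to run over a batch of inputs
def square(x):
    return x * x

# Run the batch across 2 worker processes; the API reads like a
# list comprehension wrapped in Parallel(...), and results keep input order
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```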
Benefits of Python Batch Processing
• Speed and Low Costs : Batch processing can lessen a company's dependency on other pricey pieces of technology, making it a comparatively low-cost option that saves money and time. Batch procedures are executed most efficiently and feasibly without the risk of user mistakes. As a result, managers have more time to focus on day-to-day operations and can analyze data more quickly and accurately.
• Offline Features : Batch processing systems work in a stand-alone mode, and keep running at the end of the day. To prevent overloading a system and disturbing regular tasks, managers can restrict when a process begins. The software can be configured to execute specific batches overnight, which is a practical option for firms that don't want jobs like automated downloads to disturb their daily operations.
• Efficiency : When computers or other resources are readily accessible, batch processing allows a corporation to handle jobs. Companies can plan batch operations for activities that aren't as urgent and prioritize time-sensitive jobs. Batch systems can also run in the background to reduce processor burden.
• Simplicity : Batch processing is a less sophisticated system that does not require particular hardware or system support for data entry. It requires less maintenance after it is set up than a stream processing system.
• Improved Data Quality : Batch processing reduces the chances of mistakes by automating most or all components of a processing operation and minimizing user contact. Precision and accuracy are enhanced, achieving a greater level of data quality.
4.5 Scatter Plots and Resistant Lines

4.5.1 Bivariate Analysis using Scatter Plot

• A scatter plot can be used to visually inspect whether there is an association between two quantitative variables. If there is a pattern in the plot, the variables are associated; if there is no pattern, the variables are not associated. For example, a plot of children's ages against their weights shows a pair of associated variables; children who are older tend to weigh more.

Example program - 5
# Scatter plot for bivariate analysis - 2 variables
import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame({'HoursWalked': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
                                   3, 4, 4, 5, 5, 6, 6, 6, 7, 8],
                   'HeartbeatScore': [75, 66, 68, 74, 78, 72, 85, 82, 90, 82,
                                      80, 88, 85, 90, 92, 94, 94, 88, 91, 96]})

# View first five rows of DataFrame
df.head()
print(df)

# Create scatterplot of hours vs. score
plt.scatter(df.HoursWalked, df.HeartbeatScore)
plt.title('Hours Walked vs. Heartbeat Score')
plt.xlabel('Hours Walked')
plt.ylabel('Heartbeat Score')
Example program - 5 Output

    HoursWalked  HeartbeatScore
0             1              75
1             1              66
2             1              68
3             2              74
4             2              78
5             2              72
6             3              85
7             3              82
8             3              90
9             3              82
10            3              80
11            4              88
12            4              85
13            5              90
14            5              92
15            6              94
16            6              94
17            6              88
18            7              91
19            8              96

[Scatter plot: Hours Walked vs. Heartbeat Score, showing the score rising with hours walked]
4.5.2 Bivariate Analysis using Resistant Lines
• Resistant lines help to find the trend in the data and to identify support and resistance levels in the data.
• Resistance lines are technical indication tools used to determine the trend of a specific variable. They are very useful in predicting the probable movement of two variables. Resistance lines are usually drawn on a high-to-low basis. They help estimate resistance and support levels. A resistance line in an uptrend movement marks the support area, and a resistance line in a downtrend movement marks the resistance area.
• Support and resistance levels are popular measures in technical analysis for stock trading. Resistance levels reflect price ranges which a certain stock has trouble exceeding, while support levels are those below which a stock's price tends not to fall.
• Support and resistance levels are used in technical analysis to predict reversals in price trends. A falling price might be likelier to stop falling when it nears a support level. Conversely, a rising stock price might be likelier to stop increasing when it nears a resistance level. Support and resistance levels are not infallible, and determining such price ranges is no simple task.
• Support levels become resistance levels once broken, and likewise resistance levels become support levels when they are broken. As such, the same price level can be either support or resistance depending on price action.
[Figure: price vs. time, showing a rising price breaking through a resistance level, which then becomes a new support level]
• In the above figure it can be seen that a rising price "breaks through" a previous level to find a new range of resistance, after which the previous resistance level becomes a new support level. This figure illustrates how support and resistance levels can predict price reversals.
* For identifying the support and resistance levels, below are some common points to
consider :
© Horizontal vs. Diagonal
© Intraday vs. Long Term
© Major vs. Minor levels
© Multiple Re-tests
• How one chooses to incorporate each of these considerations greatly influences the nature of how support and resistance levels are calculated. For example, diagonal support and resistance can be powerful in helping predict small pullbacks during an uptrend. In this case, "breaking support" can be identified as a potential trend reversal.
Calculating Support, Resistance and Trendlines
• There are various ways to calculate support and resistance. Below are two primary means :
1. Long-term support and resistance levels, drawn as horizontal lines.
2. Shorter-term trendlines, drawn as either diagonal or horizontal lines.
• Long-term levels are used to help predict large price reversals marking the start and completion of price movements on longer timelines such as the daily or weekly charts. Trendlines are more useful to predict intraday movements or shorter daily movements.
• K-Means clustering can be used to identify long-term support and resistance levels. For trendlines, a combination of linear regression and minima-maxima calculation is used. Each offers different benefits but, as with many technical indicators, they are more powerful when used together.
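As a hedged illustration of the K-Means idea (a sketch, not the exact procedure described above), one can cluster closing prices in one dimension and read each cluster's min/max as a candidate support/resistance band; the prices below are invented:

```python
import numpy as np

# Illustrative closing prices oscillating around two levels (~100 and ~110)
prices = np.array([99.5, 100.2, 100.8, 99.9, 100.4,
                   109.8, 110.3, 109.6, 110.1, 110.4])

# Tiny 1-D k-means (k=2) on the prices
k = 2
centers = np.array([prices.min(), prices.max()], dtype=float)
for _ in range(20):
    # Assign each price to its nearest center
    labels = np.argmin(np.abs(prices[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of its assigned prices
    centers = np.array([prices[labels == j].mean() for j in range(k)])

# Each cluster's price range is a candidate support/resistance band
for j in range(k):
    band = prices[labels == j]
    print(f"level ~{centers[j]:.1f}: band {band.min():.1f}-{band.max():.1f}")
```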
4.6 Transformations
* Visualization is an important tool for insight generation, but it is rare that one gets the data
in exactly the correct and required form. One will often need to create some new variables
or summaries, rename variables or reorder observations for the data to be easier to manage.
• A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary data structure of Pandas.
• The Pandas DataFrame.transform() function calls func on self, producing a DataFrame with transformed values that has the same axis length as self.
• func : The function used for transforming the data.
• axis : 0 or 'index', 1 or 'columns'; default 0.
• *args : Positional arguments to be passed to func.
• **kwargs : Keyword arguments to be passed to func.
Example program - 6

import pandas as pd

df = pd.DataFrame({"Data1": [12, 4, 5, None, 1],
                   "Data2": [7, 2, 54, 3, None],
                   "Data3": [20, 16, 11, 3, 8],
                   "Data4": [14, 3, None, 2, 6]},
                  index=["Row_1", "Row_2", "Row_3", "Row_4", "Row_5"])

# The lambda function adds 10 to each element of the given DataFrame
result = df.transform(func=lambda x: x + 10)
print(result)
Example program - 6 Output

       Data1  Data2  Data3  Data4
Row_1   22.0   17.0     30   24.0
Row_2   14.0   12.0     26   13.0
Row_3   15.0   64.0     21    NaN
Row_4    NaN   13.0     13   12.0
Row_5   11.0    NaN     18   16.0
Review Questions with Answers
1. What are common ways to perform bivariate analysis ? (Refer section 4.1)
2. Explain various types of bivariate analysis. (Refer section 4.1)
3. Discuss with example percentage tables. (Refer section 4.2)
4. What are the uses of contingency tables ? (Refer section 4.3)
5. How are resistant lines used in bivariate analysis ? (Refer section 4.5)
4.7 Two Marks Questions with Answers

Q.1 What are the possible kinds of correlation between two variables in bivariate analysis ?
Ans. : Bivariate data correlation :
1. Numerical and Numerical : In this kind of variable both the variables of the bivariate
data which includes the dependent and the independent variable have a numerical value,
2. Categorical and Categorical : When both the variables in the bivariate data are in the static form, the data is interpreted and statements and predictions are made about it. During the research, the analysis helps to determine the cause and impact and to conclude that the given variable is categorical.
3. Numerical and Categorical : This is when one of the variables is numerical and the
other is categorical. Bivariate analysis is a kind of statistical analysis when two variables
are observed against each other. One of the variables will be dependent and the other is
independent. The variables are denoted by X and Y. The changes are analyzed between
the two variables to understand to what extent the change has occurred.
Q.2 Explain elaboration.
Ans. : Elaboration is a process designed to further explore bivariate relationships by introducing additional variables called control variables.