EDA Unit 4 Notes
Relationships between Two Variables - Percentage Tables - Analyzing Contingency Tables - Handling Several Batches - Scatterplots and Resistant Lines - Transformations
Contents
Relationship between Two Variables
Percentage Tables
Analyzing Contingency Tables
Handling Several Batches
Scatter Plots and Resistant Lines
Transformations
Two Marks Questions with Answers
4.1 Relationship between Two Variables
• The term bivariate analysis refers to the analysis of two variables. It is a statistical technique applied to a pair of variables (features/attributes) of data to determine the empirical relationship between them. In other words, it is meant to determine any concurrent relations (usually over and above a simple correlation analysis).
• Bivariate analysis is performed to find the relationship between each variable in the dataset and the target variable of interest, or by taking two variables and finding the relationship between them. For example, the box plot and the violin plot.
• Bivariate analysis can be as simple as creating a scatter plot by plotting one variable against another on a Cartesian plane (think X and Y axes); this can sometimes give a picture of what the data is trying to show. If the data seems to fit a line or curve, then there is a relationship or correlation between the two variables. For example, one might choose to plot caloric intake versus weight.
• There are three common ways to perform bivariate analysis :
1. Scatter plots - These give an idea of the patterns that can be formed using the two variables.
2. Correlation coefficients - The coefficient helps to know if the data in question are related. When the correlation coefficient is zero, the variables are not related. If the correlation coefficient is +1 or -1, the variables are perfectly correlated.
3. Simple linear regression - This uses a wide range of tools to determine how the data points could be related. The points may follow an exponential curve. The regression analysis gives the equation for a line or curve. It also helps to find the correlation coefficient.
• In the context of supervised learning, bivariate analysis can help determine the essential predictors when it is done keeping one of the variables as the dependent variable (Y) and the others as independent variables (X1, X2, ... and so on), hence plotting all (Y, Xi) pairs. So essentially, it is a way of feature selection and feature prioritization.
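The three approaches above can be sketched in a few lines of pandas and NumPy; the caloric-intake figures below are hypothetical, purely for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: caloric intake vs. weight (illustrative values only)
df = pd.DataFrame({
    "calories": [1800, 2000, 2200, 2500, 2800, 3000],
    "weight":   [60.0, 63.5, 66.0, 71.0, 76.5, 80.0],
})

# Correlation coefficient: a value near +1 or -1 means a strong relationship,
# a value near 0 means the variables are not (linearly) related
r = df["calories"].corr(df["weight"])

# Simple linear regression: fit a degree-1 polynomial (a straight line)
slope, intercept = np.polyfit(df["calories"], df["weight"], 1)

print(f"Pearson r = {r:.3f}")
print(f"weight ~ {slope:.4f} * calories + {intercept:.2f}")
```

A scatter plot of the same two columns (`df.plot.scatter(x='calories', y='weight')`) gives the visual check described above.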
Comparison of correlation and causality
• It is a widespread fallacy to assume that if one variable is observed to vary with a change in the values of another empirically, then either of them is "causing" the other to change or leading the other variable to change. In bivariate analysis, it might be observed that one variable (especially an Xi) is causing Y to change. Still, in actuality, it might just be an indicator and not the actual driver.
TECHNICAL PUBLICATIONS® - an up-thrust for knowledge
Types of variables and bivariate analysis
• The type of bivariate analysis depends on the kinds of attributes and variables that are used to analyze the data. The variables may be ordinal, categorical or numeric. The independent variable may be categorical, like a brand of pencil; in this case, probit regression or logit regression is used. If the dependent and the independent variables are both ordinal, which means that they have a ranking or position, then the rank correlation coefficient is measured.
• In case the dependent attribute is ordinal, then the ordered probit or the ordered logit is used. It is possible that the dependent attribute could be interval or ratio, like a temperature scale; this is where regression is measured. Below are the kinds of bivariate data correlation.
1. Numerical and Numerical : In this kind, both the variables of the bivariate data, the dependent and the independent variable, have a numerical value.
2. Categorical and Categorical : When both the variables in the bivariate data are in the static form, the data is interpreted and statements and predictions are made about it. During the research, the analysis helps to determine the cause and impact and to conclude that the given variable is categorical.
3. Numerical and Categorical : This is when one of the variables is numerical and the other is categorical. Bivariate analysis is a kind of statistical analysis in which two variables are observed against each other. One of the variables will be dependent and the other independent. The variables are denoted by X and Y. The changes are analyzed between the two variables to understand to what extent the change has occurred.
• Further, there are two types of variables in data - categorical and continuous (numerical). Therefore, for bivariate analysis, there are three possible combinations for analysis that could be carried out, namely: categorical and categorical, categorical and continuous, and continuous and continuous.
1. Categorical and categorical variables combination :
• This is used in case both the variables being analyzed are categorical. In the case of classification models, say, for example, classifying a credit card transaction as fraud or not as the Y variable and then checking if the customer is in his hometown, away, or outside the country. Another example can be age vs. gender and then counting the number of customers who fall in that category. It is important to note that the visualization / summary shows the count or some mathematical or logical aggregation of a 3rd variable / metric, like revenue or cost, in all such analyses. It can be done using crosstabs (heat maps) or pivots in Python.
• Crosstabs : Used to count between categories or get summaries between two categories. The Pandas library has this functionality.
• Pivots : Another useful functionality that can be applied to Pandas dataframes is to get Excel-like pivot tables. This can work for 2+ categorical variables when placed in the proper hierarchy.
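As a sketch of the pivot approach (the data and column names here are invented for illustration), `pandas.pivot_table` gives the Excel-like behavior described above:

```python
import pandas as pd

# Hypothetical customer data (illustrative values)
df = pd.DataFrame({
    "age_group": ["18-25", "18-25", "26-35", "26-35", "26-35", "36-45"],
    "gender":    ["M", "F", "M", "F", "F", "M"],
    "revenue":   [120, 80, 200, 150, 90, 300],
})

# Pivot: total revenue for each age_group x gender cell
pivot = pd.pivot_table(df, index="age_group", columns="gender",
                       values="revenue", aggfunc="sum")
print(pivot)
```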
2. Categorical and continuous (numerical) variables combination :
• In this type, the plotting of a numerical variable's variation within each class is performed. For example, how age varies in each segment, or how the income and expenses of a household vary by loan repayment status.
• Categorical plot for aggregates of continuous variables : Used to get totals or counts of a numerical variable, e.g. revenue for each month. Also, this can be used for counts of another categorical variable instead of the numerical one.
• Plots for distribution of continuous (numerical) variables : Used to see the range and statistics of a numerical variable across categories.
• Plots used are - box plot, violin plot, swarm plot.
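A minimal box-plot sketch across categories, assuming matplotlib is available; the loan-repayment data is invented for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical data: household income by loan repayment status
df = pd.DataFrame({
    "repaid": ["yes", "yes", "yes", "no", "no", "no"],
    "income": [52000, 61000, 58000, 31000, 40000, 35000],
})

# Box plot of the continuous variable across the categories
ax = df.boxplot(column="income", by="repaid")
ax.set_title("Income by repayment status")
plt.suptitle("")  # drop pandas' automatic super-title
plt.savefig("income_by_status.png")

# The same comparison numerically: median income per category
print(df.groupby("repaid")["income"].median())
```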
3. Continuous and continuous variable combination :
© This is the most common use case of bivariate analysis and is used for showing the
empirical relationship between two numerical (continuous) variables. This is usually
more applicable in regression cases.
• In case there are large datasets with 30 - 70+ features (variables), there might not be sufficient time to run each pair of variables through bivariate analysis one by one. One could use the pair plot or PairGrid from the seaborn library in such cases. It makes a grid where each cell is a bivariate graph, and PairGrid also allows customizations.
4.2 Percentage Tables
• A bivariate table addresses the joint distribution of two variables. A bivariate table is a table that illustrates the relationship between two variables by displaying the distribution of one variable across the categories of a second variable.
• To detect association within bivariate tables, one can calculate percentages within the categories of the independent variable, compare percentages across the categories of the independent variable, or perform a Chi-square test of independence to formally determine the statistical significance.
• Cross-tabulation : a technique used to explore the relationship between two variables that have been organized in a table. The column variable is a variable whose categories comprise the columns of a bivariate table. The row variable is a variable whose categories comprise the rows of a bivariate table. A cell is the intersection of a row and a column in a bivariate table. Marginals are the row and column totals in a bivariate table.
• A bivariate table displays the distribution of one variable across the categories of another variable. The independent variable usually goes in the columns, while the dependent variable goes in the rows. Rows and columns intersect at cells. The row and column totals of a bivariate table are called marginals.
« Bivariate relationships come in several different flavors. When the variation in the
dependent variable can be attributed only to the independent variable, the relationship is
said to be direct. When a third variable affects both the independent and dependent
variables, the relationship is said to be spurious. When the independent variable affects the
dependent variable only by way of a mediating variable (sort of like a chain reaction), it is
said to be an intervening relationship.
• The common percent is an extremely useful measure.
• One can use a percent with any kind of data : nominal, ordinal, interval and ratio. A percent is a standardized measure; percent means "per 100 cases." Because the percent is standardized, one can use it to compare results from different population bases that have different sizes or total case-bases. For example, one could compare the percentage of home computer ownership among people living in villages, urban areas and metro cities.
• The first step in calculating a percent is to isolate the case-base of interest.
• The second step is to identify the category of interest.
• The third step is to locate the number of cases in the category of interest ONLY for the group.
• The fourth step is to take the frequency in the category of interest and divide that frequency by the total number of cases in the case-base of interest.
Constructing a bivariate table :
• Percentages can be computed in different ways, namely:
1. Column percentages - Column totals as base.
2. Row percentages - Row totals as base.
• Typically, percentages are provided for the independent variables.
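Both percentage bases can be computed directly with crosstab's `normalize` option; the survey data below is hypothetical:

```python
import pandas as pd

# Hypothetical survey: home computer ownership by area
df = pd.DataFrame({
    "area":    ["village", "village", "urban", "urban", "metro", "metro",
                "village", "urban", "metro", "metro"],
    "owns_pc": ["no", "no", "yes", "no", "yes", "yes",
                "yes", "yes", "yes", "no"],
})

# Column percentages: each column (independent-variable category) sums to 100
col_pct = pd.crosstab(df["owns_pc"], df["area"], normalize="columns") * 100

# Row percentages: each row sums to 100
row_pct = pd.crosstab(df["owns_pc"], df["area"], normalize="index") * 100

print(col_pct.round(1))
```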
Elaboration
• Elaboration is a process designed to further explore bivariate relationships by introducing additional variables called control variables.
Limitations of elaboration
• Elaboration can be useful, but it also has its limitations. First, it tends to be a little bit tedious, especially if done by hand. Second, it is not the most precise form of analysis. Elaboration allows one to compare the distribution of one variable across the categories of another, but there are other measures of association that do a better job of quantifying the relationship between two variables.
Crosstab function in Python
• The crosstab() function is used to compute a simple cross-tabulation of two (or more) factors. By default it computes a frequency table of the factors, unless an array of values and an aggregation function are passed.

pandas.crosstab(parameters)

index : array-like, Series, or list of arrays/Series. Values to group by in the rows.
columns : array-like, Series, or list of arrays/Series. Values to group by in the columns.
values : array-like, optional. Array of values to aggregate according to the factors. Requires 'aggfunc' to be specified.
rownames : sequence, default None. If passed, must match the number of row arrays passed.
colnames : sequence, default None. If passed, must match the number of column arrays passed.
aggfunc : function, optional. If specified, requires 'values' to be specified as well.
margins : bool, default False. Add row/column margins (subtotals).
margins_name : str, default 'All'. Name of the row/column that will contain the totals when margins is True.
dropna : bool, default True. Do not include columns whose entries are all NaN.
normalize : bool, {'all', 'index', 'columns'} or {0, 1}, default False. Normalize by dividing all values by the sum of values.
# Program to build crosstab and build bar chart
Example program - 1

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'Grade': ['FirstClass', 'Outstanding', 'Outstanding',
                             'Distinction', 'FirstClass', 'SecondClass',
                             'PassClass', 'PassClass', 'Distinction', 'Distinction'],
                   'Age': [18, 18, 18, 19, 19, 20, 18, 18, 19, 19],
                   'Gender': ['M', 'M', 'F', 'F', 'F', 'M', 'M', 'F', 'M', 'F']})
print(df)

# Find frequency of each letter grade
grdcrosstb = pd.crosstab(index=df['Grade'], columns='count')
print(grdcrosstb)

# Creating bar plot
barplot = grdcrosstb.plot.bar(rot=0)
Example program - 1 Output

         Grade  Age Gender
0   FirstClass   18      M
1  Outstanding   18      M
2  Outstanding   18      F
3  Distinction   19      F
4   FirstClass   19      F
5  SecondClass   20      M
6    PassClass   18      M
7    PassClass   18      F
8  Distinction   19      M
9  Distinction   19      F

col_0        count
Grade
Distinction      3
FirstClass       2
Outstanding      2
PassClass        2
SecondClass      1
[Bar chart: count of students in each grade - Distinction, FirstClass, Outstanding, PassClass, SecondClass]
# Building tables with crosstab function
Example program - 2

import pandas as pd

res_names = ['Vaishali', 'Rupali', 'Lucky', 'Gorge', 'BlueNyle', 'Goodluck', 'SPs']
purchase_type = ['Food', 'Food', 'Food', 'Drink', 'Food', 'Drink', 'Drink']
price = [12, 25, 32, 10, 15, 22, 18]

print('Restaurant Names: {}'.format(res_names))
print('Purchase Type: {}'.format(purchase_type))
print('Price: {}'.format(price))

rescrtb1 = pd.crosstab(index=[res_names], columns=[purchase_type])
print(rescrtb1)

rescrtb2 = pd.crosstab(index=[res_names], columns=[purchase_type],
                       values=price, aggfunc=sum)
print(rescrtb2)

rescrtb3 = pd.crosstab(index=[res_names],
                       columns=[purchase_type],
                       values=price,
                       aggfunc=lambda x: x.sum()**2,  # setting a custom agg function
                       rownames=['Restaurants'],      # giving a title to my rows
                       colnames=['Food Types'],       # giving a title to my columns
                       margins=True)                  # adding margins (subtotals on the ends)
print(rescrtb3)
Example program - 2 Output

Restaurant Names: ['Vaishali', 'Rupali', 'Lucky', 'Gorge', 'BlueNyle', 'Goodluck', 'SPs']
Purchase Type: ['Food', 'Food', 'Food', 'Drink', 'Food', 'Drink', 'Drink']
Price: [12, 25, 32, 10, 15, 22, 18]

col_0     Drink  Food
row_0
BlueNyle      0     1
Goodluck      1     0
Gorge         1     0
Lucky         0     1
Rupali        0     1
SPs           1     0
Vaishali      0     1

col_0     Drink  Food
row_0
BlueNyle    NaN  15.0
Goodluck   22.0   NaN
Gorge      10.0   NaN
Lucky       NaN  32.0
Rupali      NaN  25.0
SPs        18.0   NaN
Vaishali    NaN  12.0

Food Types    Drink    Food      All
Restaurants
BlueNyle        NaN   225.0    225.0
Goodluck      484.0     NaN    484.0
Gorge         100.0     NaN    100.0
Lucky           NaN  1024.0   1024.0
Rupali          NaN   625.0    625.0
SPs           324.0     NaN    324.0
Vaishali        NaN   144.0    144.0
All          2500.0  7056.0  17956.0
Example program - 3

import pandas as pd

# Dictionary
data = {'Name': ['Dashrath', 'Ram', 'Bharat', 'Laxman', 'Maruti', 'Shatru', 'Bibhi'],
        'Math_score': [52, 87, 49, 74, 28, 59, 48]}

# Create a DataFrame
df1 = pd.DataFrame(data, columns=['Name', 'Math_score'])

# Calculating percentage
df1['percent'] = (df1['Math_score'] / df1['Math_score'].sum()) * 100

# Show the dataframe
print(df1)
Example program - 3 Output

       Name  Math_score    percent
0  Dashrath          52  13.098237
1       Ram          87  21.914358
2    Bharat          49  12.342569
3    Laxman          74  18.639798
4    Maruti          28   7.052897
5    Shatru          59  14.861461
6     Bibhi          48  12.090680
4.3 Analyzing Contingency Tables

• A contingency table is a technique for exploring two or even more variables. It is basically a tally of counts between two or more categorical variables.
• A contingency table depicts the distribution of one variable in rows and another variable in columns. It is used to study the correlation between the two variables. It is a multiway table which describes a dataset in which each observation belongs to one category for each of several variables.
• Contingency tables, also called crosstabs or two-way tables, are used in statistics to summarize the relationship between several categorical variables.
• The contingency coefficient is a coefficient of association which tells whether two variables or datasets are independent or dependent of each other; it is also known as Pearson's coefficient.
• For example, suppose that there is data of 200 people on whether they prefer Potato Vada or Misal and whether they prefer Curd or Coke. This data can be assembled into a contingency table, which might look like :

          Potato Vada   Misal
   Curd        40         85
   Coke        50         25
• When determining whether two variables are associated, it can be helpful to look at a contingency table of proportions. Contingency tables are often given in frequencies and can be converted to proportions by dividing each frequency by the total number of observations.
Marginal proportions
• A marginal proportion in a contingency table is the proportion of observations in a single category of one variable. If given a contingency table of proportions, the marginal proportion can be calculated by taking the row and column sums. If given a contingency table of frequencies, the marginal proportion can be calculated by dividing the row or column sum by the total number of observations.
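Using the Potato Vada / Misal table from the example above, the conversion to proportions and the marginal proportions can be computed as:

```python
import pandas as pd

# The 200-person preference table from the text
table = pd.DataFrame({"Potato Vada": [40, 50], "Misal": [85, 25]},
                     index=["Curd", "Coke"])

n = table.values.sum()             # total observations: 200

# Convert frequencies to proportions
props = table / n

# Marginal proportions: row and column sums of the proportion table
row_marginals = props.sum(axis=1)  # Curd: 125/200, Coke: 75/200
col_marginals = props.sum(axis=0)  # Potato Vada: 90/200, Misal: 110/200
print(row_marginals)
print(col_marginals)
```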
• Contingency tables are quite a popular method during EDA (Exploratory Data Analysis) as they show the relationship between two categorical fields. They are often used to see the distribution of a categorical field with respect to the class label in classification tasks. They are also used in statistical tests like the chi-squared test, which tests for the association between two categorical variables.
# Create a contingency table using Python
Example program - 4

import pandas as pd

# Create a dataframe
df = pd.DataFrame({
    'JeansId': [1, 2, 3, 4, 5, 6],
    'color': ['Red', 'Blue', 'Gray', 'Blue', 'Red', 'Red'],
    'size': ['M', 'S', 'L', 'L', 'S', 'M']
})

# Display the dataframe
print(df)

# Contingency table between 'color' and 'size'
print(pd.crosstab(df['color'], df['size']))

# Contingency table with row and column totals
print(pd.crosstab(df['color'], df['size'], margins=True))
Example program - 4 Output

   JeansId color size
0        1   Red    M
1        2  Blue    S
2        3  Gray    L
3        4  Blue    L
4        5   Red    S
5        6   Red    M

size   L  M  S
color
Blue   1  0  1
Gray   1  0  0
Red    0  2  1

size   L  M  S  All
color
Blue   1  0  1    2
Gray   1  0  0    1
Red    0  2  1    3
All    2  2  2    6
4.4 Handling Several Batches

• Many analytics applications require frequent batch processing, which allows them to process data in batches at varying intervals. Batch systems must be built to scale for all sizes of data and scale seamlessly to the size of the dataset being processed at various job runs.
• Such data received from a batch system can be referred to as big data, as it is a large data set. While dealing with such data sets, the techniques below need to be applied.
• Batch processing is a technique for processing large amounts of data in a repeatable manner. When computational resources are available, the batch technique allows users to process data with little or no user interaction. Simply described, batch processing is the method through which a computer processes batches of work in a nonstop, sequential manner. It also helps in dividing the work into smaller chunks for debugging.
1. Reduce memory usage by optimizing data types :
• When using Pandas to load data from a file, it will automatically infer data types unless told otherwise. Most of the time this works fine, but the inferred type is not necessarily optimized. For example, if a numerical column contains missing values, the inferred type will automatically be float. In particular cases, specifying the data types can lead to an important reduction of the memory used.
2. Split data into chunks :
• When data is too large to fit into memory, one can use Pandas' chunksize option to split the data into chunks instead of dealing with one big block. Using this option creates an iterator object that can be used to go through the different chunks and perform filtering or analysis just like one would do when loading the full dataset.
• Chunking can be used from initial exploratory analysis to model training and requires very little extra setup.
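A minimal chunking sketch; the in-memory CSV stands in for a file too large to load at once:

```python
import io
import pandas as pd

# Small CSV stand-in for a file too large to fit in memory (illustrative)
csv = io.StringIO("value\n" + "\n".join(str(i) for i in range(10)))

total = 0
# chunksize=4 returns an iterator of DataFrames with up to 4 rows each
for chunk in pd.read_csv(csv, chunksize=4):
    # Filter and aggregate each chunk exactly as with a full DataFrame
    total += chunk.loc[chunk["value"] % 2 == 0, "value"].sum()

print(total)  # sum of the even numbers 0..9 -> 20
```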
3. Take advantage of lazy evaluation :
• Lazy evaluation refers to the strategy which delays evaluation of an expression until the value is actually needed. Lazy evaluation is an important concept (used especially in functional programming).
• Lazy evaluation is the basis on which distributed computation frameworks are built. Although they were designed to work on clusters, one can still take advantage of them to handle large datasets on a personal computer.
• The main difference with respect to Pandas is that they do not load the data directly in memory. Instead, what happens during the read command is that they scan the data, infer dtypes and split it into partitions (so far nothing new). Computational graphs are built for these partitions independently, and they are executed only when really needed (hence lazy).
• Steps to work with Python batch processing use Joblib. Joblib is a set of Python utilities for lightweight pipelining. It contains unique optimizations for NumPy arrays and is built to be quick and resilient on large data. It is released under the Berkeley Software Distribution (BSD) license.
• Some of the key features of Joblib are :
• Transparent disk-caching of functions and lazy re-evaluation : A memoize or make-like feature for Python functions that works with any Python object, including very big NumPy arrays. By expressing the operations as a collection of steps with well-defined inputs and outputs (Python functions), one can separate persistence and flow-execution logic from the domain logic or algorithmic code. Joblib can save their computation to disc and repeat it only if required.
• Simple parallel computing : Joblib makes it easy to write readable parallel code and easily debug it.
• Fast compressed persistence : An alternative to pickle that works well with Python objects that contain a lot of data (joblib.dump & joblib.load).
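A minimal sketch of Joblib's parallel API, assuming joblib is installed; the function and inputs are invented for illustration:

```python
from joblib import Parallel, delayed

# A CPU-bound function to run over a batch of inputs
def square(x):
    return x * x

# Run the batch across 2 worker processes; the API reads like a
# list comprehension wrapped in Parallel(...), and results keep input order
results = Parallel(n_jobs=2)(delayed(square)(i) for i in range(8))
print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```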
Benefits of Python Batch Processing
• Speed and Low Costs : Batch processing can lessen a company's dependency on other pricey pieces of technology, making it a comparatively low-cost option that saves money and time. Batch procedures are executed most efficiently and feasibly without the risk of user mistakes. As a result, managers have more time to focus on day-to-day operations and can analyze data more quickly and accurately.
• Offline Features : Batch processing systems work in a stand-alone mode, and keep running at the end of the day. To prevent overloading a system and disturbing regular tasks, managers can restrict when a process begins. The software can be configured to execute specific batches overnight, which is a practical option for firms that don't want jobs like automated downloads to disturb their daily operations.
• Efficiency : When computers or other resources are readily accessible, batch processing allows a corporation to handle jobs. Companies can plan batch operations for activities that aren't as urgent and prioritize time-sensitive jobs. Batch systems can also run in the background to reduce processor burden.
• Simplicity : Batch processing is a less sophisticated system that does not require particular hardware or system support for data entry. It requires less maintenance after it is set up than a stream processing system.
• Improved Data Quality : Batch processing reduces the chances of mistakes by automating most or all components of a processing operation and minimizing user contact. Precision and accuracy are enhanced, achieving a greater level of data quality.
4.5 Scatter Plots and Resistant Lines

4.5.1 Bivariate Analysis using Scatter Plot

• A scatter plot can be used to visually inspect whether there is an association between two quantitative variables. If there is a pattern in the plot, the variables are associated; if there is no pattern, the variables are not associated. For example, a plot of children's ages against their weights shows a pair of associated variables; children who are older tend to weigh more.

Example program - 5
# Scatter plot for bivariate analysis - 2 variables
import pandas as pd
import matplotlib.pyplot as plt

# Create DataFrame
df = pd.DataFrame({'HoursWalked': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3,
                                   3, 4, 4, 5, 5, 6, 6, 6, 7, 8],
                   'HeartbeatScore': [75, 66, 68, 74, 78, 72, 85, 82, 90, 82,
                                      80, 88, 85, 90, 92, 94, 94, 88, 91, 96]})

# View first five rows of DataFrame
df.head()
print(df)

# Create scatterplot of hours vs. score
plt.scatter(df.HoursWalked, df.HeartbeatScore)
plt.title('Hours Walked vs. Heartbeat Score')
plt.xlabel('Hours Walked')
plt.ylabel('Heartbeat Score')
Example program - 5 Output

    HoursWalked  HeartbeatScore
0             1              75
1             1              66
2             1              68
3             2              74
4             2              78
5             2              72
6             3              85
7             3              82
8             3              90
9             3              82
10            3              80
11            4              88
12            4              85
13            5              90
14            5              92
15            6              94
16            6              94
17            6              88
18            7              91
19            8              96

[Scatter plot: Hours Walked vs. Heartbeat Score, showing the score rising with hours walked]
4.5.2 Bivariate Analysis using Resistant Lines
• Resistant lines help to find the trend in the data and to identify support and resistance levels in the data.
• Resistance lines are technical indication tools used to determine the trend of a specific variable. They are very useful in predicting the probable movement of two variables. Resistance lines are usually drawn on a high-to-low basis. They help estimate resistance and support levels. A resistance line in an uptrend movement marks the support area, and a resistance line in a downtrend movement marks the resistance area.
• Support and resistance levels are popular measures in technical analysis for stock trading. Resistance levels reflect price ranges which a certain stock has trouble exceeding, while support levels are those below which a stock's price tends not to fall.
• Support and resistance levels are used in technical analysis to predict reversals in price trends. A falling price might be likelier to stop falling when it nears a support level. Conversely, a rising stock price might be likelier to stop increasing when it nears a resistance level. Support and resistance levels are not infallible, and determining such price ranges is no simple task.
• Support levels become resistance levels once broken, and likewise resistance levels become support levels when they are broken. As such, the same price level can be either support or resistance depending on price action.
[Figure: price vs. time, showing a rising price breaking through a resistance level, which then becomes a new support level]
• In the above figure it can be seen that a rising price "breaks through" a previous level to find a new range of resistance, after which the previous resistance level becomes a new support level. This figure illustrates how support and resistance levels can predict price reversals.
* For identifying the support and resistance levels, below are some common points to
consider :
© Horizontal vs. Diagonal
© Intraday vs. Long Term
© Major vs. Minor levels
© Multiple Re-tests
• How one chooses to incorporate each of these considerations greatly influences the nature of how support and resistance levels are calculated. For example, diagonal support and resistance can be powerful in helping predict small pullbacks during an uptrend. In this case, "breaking support" can be identified as a potential trend reversal.
Calculating Support, Resistance and Trendlines
• There are various ways to calculate support and resistance. Below are two primary means :
1. Long-term support and resistance levels, drawn as horizontal lines.
2. Shorter-term trendlines, drawn as either diagonal or horizontal lines.
• Long-term levels are used to help predict large price reversals marking the start and completion of price movements on longer timelines such as the daily or weekly charts. Trendlines are more useful to predict intraday movements or shorter daily movements.
• K-Means clustering can be used to identify long-term support and resistance levels. For trendlines, a combination of linear regression and minima-maxima calculation is used. Each offers different benefits but, as with many technical indicators, they are more powerful when used together.
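As a hedged illustration of the K-Means idea (a sketch, not the exact procedure described above), one can cluster closing prices in one dimension and read each cluster's min/max as a candidate support/resistance band; the prices below are invented:

```python
import numpy as np

# Illustrative closing prices oscillating around two levels (~100 and ~110)
prices = np.array([99.5, 100.2, 100.8, 99.9, 100.4,
                   109.8, 110.3, 109.6, 110.1, 110.4])

# Tiny 1-D k-means (k=2) on the prices
k = 2
centers = np.array([prices.min(), prices.max()], dtype=float)
for _ in range(20):
    # Assign each price to its nearest center
    labels = np.argmin(np.abs(prices[:, None] - centers[None, :]), axis=1)
    # Move each center to the mean of its assigned prices
    centers = np.array([prices[labels == j].mean() for j in range(k)])

# Each cluster's price range is a candidate support/resistance band
for j in range(k):
    band = prices[labels == j]
    print(f"level ~{centers[j]:.1f}: band {band.min():.1f}-{band.max():.1f}")
```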
4.6 Transformations
* Visualization is an important tool for insight generation, but it is rare that one gets the data
in exactly the correct and required form. One will often need to create some new variables
or summaries, rename variables or reorder observations for the data to be easier to manage.
• A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). Arithmetic operations align on both row and column labels. It can be thought of as a dict-like container for Series objects. This is the primary data structure of Pandas.
• The Pandas DataFrame.transform() function calls func on self, producing a DataFrame with transformed values that has the same axis length as self.
• func : The function used for transforming the data.
• axis : 0 or 'index', 1 or 'columns'; default 0.
• *args : Positional arguments to be passed to func.
• **kwargs : Keyword arguments to be passed to func.
Example program - 6

import pandas as pd

df = pd.DataFrame({"Data1": [12, 4, 5, None, 1],
                   "Data2": [7, 2, 54, 3, None],
                   "Data3": [20, 16, 11, 3, 8],
                   "Data4": [14, 3, None, 2, 6]},
                  index=["Row_1", "Row_2", "Row_3", "Row_4", "Row_5"])

# The lambda function adds 10 to each element of the given DataFrame
result = df.transform(func=lambda x: x + 10)
print(result)
Example program - 6 Output

       Data1  Data2  Data3  Data4
Row_1   22.0   17.0     30   24.0
Row_2   14.0   12.0     26   13.0
Row_3   15.0   64.0     21    NaN
Row_4    NaN   13.0     13   12.0
Row_5   11.0    NaN     18   16.0
Review Questions with Answers
1. What are common ways to perform bivariate analysis ? (Refer section 4.1)
2. Explain various types of bivariate analysis. (Refer section 4.1)
3. Discuss with example percentage tables. (Refer section 4.2)
4. What are the uses of contingency tables ? (Refer section 4.3)
5. How are resistant lines used in bivariate analysis ? (Refer section 4.5)
4.7 Two Marks Questions with Answers

Q.1 What are the possible kinds of correlation between two variables in bivariate analysis ?
Ans. : Bivariate data correlation :
1. Numerical and Numerical : In this kind of variable both the variables of the bivariate
data which includes the dependent and the independent variable have a numerical value,
2. Categorical and Categorical : When both the variables in the bivariate data are in the static form, the data is interpreted and statements and predictions are made about it. During the research, the analysis helps to determine the cause and impact and to conclude that the given variable is categorical.
3. Numerical and Categorical : This is when one of the variables is numerical and the
other is categorical. Bivariate analysis is a kind of statistical analysis when two variables
are observed against each other. One of the variables will be dependent and the other is
independent. The variables are denoted by X and Y. The changes are analyzed between
the two variables to understand to what extent the change has occurred.
Q.2 Explain elaboration.
Ans. : Elaboration is a process designed to further explore bivariate relationships by introducing additional variables called control variables.