Data Wrangling

Uploaded by

sameer raza

Data Binning & Data Wrangling

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
class(iris)
This returns "data.frame". A plain data frame is not the most convenient way to view a data set; a tibble gives a more compact, readable view.
Tabular representation: other ways of viewing the data

> tbl_df(iris)
# A tibble: 150 x 5
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3            1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa

(tbl_df() is deprecated in current dplyr; as_tibble(iris) gives the same view.)
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
To view a data set in a spreadsheet-style (Excel-like) table, use

View(iris)

> glimpse(iris)

Rows: 150

Columns: 5

$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1,

5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5...

$ Sepal.Width <dbl> 3.

Statistics / Category wise

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Finding the mean, minimum, and maximum category-wise.

In the telco data, finding the mean, minimum, and maximum by churn status:

Grouping *****

# first grouping by churn

> bychurn = tl %>% group_by(churn) ….
After grouping, the data set looks the same as the original file when printed, but internally it is partitioned by churn.
The summary measures by churn:
> bychurn%>%summarise(meanage = mean(age),minage = min(age),maxage=max(age))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 2 x 4
  churn meanage minage maxage
  <chr>   <dbl>  <int>  <int>
1 No       43.6     19     77
2 Yes      36.5     18     69
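
The telco file (tl) is not available here, so the same group-then-summarise pattern can be sketched on the built-in iris data. This is only an illustration: Species stands in for churn, Sepal.Length stands in for age, and the names by_species, stats, mean_len, min_len and max_len are made up for the example.

```r
library(dplyr)

# iris stands in for the telco data; Species plays the role of churn
by_species <- iris %>% group_by(Species)

# One row per group, holding the mean, minimum and maximum of Sepal.Length
stats <- by_species %>%
  summarise(mean_len = mean(Sepal.Length),
            min_len  = min(Sepal.Length),
            max_len  = max(Sepal.Length))

print(stats)
```

The result has one row per Species, just as the telco output has one row per churn value.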

Grouping by education level (ed) and churn, the measures are as follows:

by_ed_churn = tl%>% group_by(ed,churn)

then finding the statistics

t=by_ed_churn %>% summarise(mean(age))

> data.frame(t)
                            ed churn mean.age.
1               College degree    No  41.30986
2               College degree   Yes  36.27174
3 Did not complete high school    No  47.15116
4 Did not complete high school   Yes  37.12500
5           High school degree    No  43.92411

Summarising the Data *****

> summarise(tl, avg=mean(age))

avg

1 41.684

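To get the mean of every variable at once (non-numeric variables give NA), use

> summarise_each(tl, funs(mean))

(summarise_each() and funs() are deprecated in current dplyr; summarise(tl, across(everything(), mean)) is the current form.)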

region tenure age marital address income ed employ retire gender reside tollfree equip callcard

1 NA 35.526 41.684 NA 11.551 77.535 NA 10.987 NA NA 2.331 NA NA NA
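
A minimal sketch of the same idea with across(), run on iris because the telco file is not available. Restricting to numeric columns with where(is.numeric) is a choice made for this example (it avoids the NA entries and warnings); the name means is made up.

```r
library(dplyr)

# Mean of every numeric column; the non-numeric column (Species) is skipped
means <- iris %>% summarise(across(where(is.numeric), mean))

print(means)
```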

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Slicing and Dicing of the data

Selecting the records that satisfy given conditions:

> freq = filter(tl,age>68 & income > 100)

The same can be achieved in base R with tl[tl$age > 68 & tl$income > 100,]

> freq

region tenure age marital address income ed employ retire gender reside tollfree

1 Zone 3 37 76 Unmarried 38 117 College degree 21 No Female 1 No

2 Zone 2 71 70 Unmarried 30 153 Some college 34 Yes Female 1 Yes

3 Zone 1 47 69 Married 17 228 Some college 39 No Female 4 Yes

4 Zone 1 65 70 Married 9 115 High school degree 39
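
The equivalence between filter() and base R subsetting can be checked on a built-in data set. This is a sketch on mtcars, not on the telco file; the thresholds and the names heavy_fast and heavy_fast_base are chosen only for illustration.

```r
library(dplyr)

# dplyr: column names are used directly; both conditions must hold
heavy_fast <- filter(mtcars, hp > 200 & wt > 3.5)

# Base R: columns must be qualified with the data frame name,
# and the trailing comma keeps all columns
heavy_fast_base <- mtcars[mtcars$hp > 200 & mtcars$wt > 3.5, ]

# Both approaches return the same records
nrow(heavy_fast) == nrow(heavy_fast_base)
```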

Slicing of the data … Selecting a specific range of the records

> slice(tl,21:32)

Selecting the records

Selecting all the records except those from 30 to 40

> slice(tl,-(30:40))

Selecting top n records

> top_n(tl,2,ed)

If the data is grouped then it will take the top n rows from every group
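
A short sketch of these row selections on mtcars. slice_max() is used here as the current replacement for top_n(), which is an assumption about what a reader on a recent dplyr would want; all object names are made up for the example.

```r
library(dplyr)

rows_3_to_5  <- slice(mtcars, 3:5)       # a specific range of records
without_1_10 <- slice(mtcars, -(1:10))   # every record except 1 to 10

# The two records with the highest mpg
top2 <- slice_max(mtcars, mpg, n = 2, with_ties = FALSE)

# On grouped data the top record is taken within every group
top_per_cyl <- mtcars %>%
  group_by(cyl) %>%
  slice_max(mpg, n = 1, with_ties = FALSE)
```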

* Selecting the Columns of a data Set ***

# 1 One can select the variables by using the variable names or positions

> select(tl , age, income ,4 , 7 )

tl is the telco file; the variables selected are age, income, and the 4th and 7th variables.

# 2 Selecting the data set without some variables

> select(tl,-age)

> select(tl,-c(age,income,tenure))

Selecting those variables whose names start with "e"

> select(tl,starts_with("e"))
ed employ equip equipmon equipten ebill

1 College degree 5 No 0.00 0.00 No

Selecting the variables in positions 3 to 5

> select(tl,3:5)
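
The column selections above can be tried on iris, which is used here only because the telco file is not available. The object names are made up for the example.

```r
library(dplyr)

by_name_pos <- select(iris, Species, 1, 2)                  # by name and by position
no_species  <- select(iris, -Species)                       # drop one variable
no_sepal    <- select(iris, -c(Sepal.Length, Sepal.Width))  # drop several variables
petal_cols  <- select(iris, starts_with("Petal"))           # by name prefix
cols_3_to_5 <- select(iris, 3:5)                            # a range of positions

names(cols_3_to_5)
```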

Selecting the unique / distinct elements

> ir = select(iris,3)

> unique(ir)

Petal.Length

1 1.4

3 1.3

4 1.5

6 1.7

12 1.6

14 1.1

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Distinct Records

Keeping one copy of each record, so that no record is repeated

> ag = c(34,45,34); incm = c(120,130,120); tble = data.frame(cbind(ag,incm))

> tble

ag incm

1 34 120

2 45 130

3 34 120

> distinct(tble)

ag incm

1 34 120

2 45 130

Selecting the distinct records of the file iris

iris can first be viewed as a tibble with tibble(iris) or as_tibble(iris).

> distinct(iris)

Logical TRUE / FALSE flags marking the elements that repeat an earlier one

> incm = c(120,130,120)

> duplicated(incm)

[1] FALSE FALSE  TRUE

Finding the repeated elements

> incm[duplicated(incm)]

The items that remain after removing the repeated elements

> incm[!duplicated(incm)]
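
Putting the three steps together on the same vector (base R only; the names flags, repeated and kept are made up for the example):

```r
incm <- c(120, 130, 120)

flags    <- duplicated(incm)   # FALSE FALSE TRUE: the third value repeats the first
repeated <- incm[flags]        # 120
kept     <- incm[!flags]       # 120 130

print(flags)
print(repeated)
print(kept)
```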

Finding the unique records of iris based on Sepal.Length

The records whose Sepal.Length values are different

> iris%>%distinct(Sepal.Length) # keeps only the Sepal.Length column; the other columns are dropped

Sepal.Length

1 5.1

> iris%>%distinct(Sepal.Length,.keep_all = TRUE) # keeps all columns of the first record for each distinct Sepal.Length

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa
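
The difference between the two calls is easiest to see by comparing the shapes of the results. A small sketch (the names only_col and full_row are made up):

```r
library(dplyr)

only_col <- iris %>% distinct(Sepal.Length)                    # one column
full_row <- iris %>% distinct(Sepal.Length, .keep_all = TRUE)  # all five columns

# Same number of rows, different number of columns
dim(only_col)
dim(full_row)
```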

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Random Numbers

Simulating the flipping of coins

A coin is tossed ten times; simulate the outcomes

> sample(c("H","T"),10,replace = TRUE)

The coin is biased: it gives heads 80% of the time

> sample(c("H","T"), 10, prob = c(.8 ,.2) ,replace = TRUE)

chest <- c(rep("gold", 20),
           rep("silver", 30),
           rep("bronze", 50))

# Draw 10 coins from the chest
sample(x = chest,
       size = 10)

# 5 samples from Normal dist with mean 0 and sd 1

rnorm(n = 5, mean = 0, sd = 1)

# 5 samples from Uniform dist with bounds at 0 and 1

runif(n = 5, min = 0, max = 1)

## [1] 0.0019 0.8019 0.1661 0.3628 0.9268

# 10 samples from Uniform dist with bounds at -100 and +100

runif(n = 10, min = -100, max = 100)

## [1] -10.8 -37.7 2.2 -38.4 -34.6 46.2 -68.8 5.3 92.9 -14.4
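
Each call above gives different values on every run. The notes do not mention it, but if repeatable results are wanted, base R's set.seed() can be called first. A minimal sketch, where the seed 42 is arbitrary and the object names are made up:

```r
set.seed(42)   # arbitrary seed chosen for this sketch
flips1 <- sample(c("H", "T"), 10, replace = TRUE)

set.seed(42)   # resetting the seed reproduces the same draws
flips2 <- sample(c("H", "T"), 10, replace = TRUE)

identical(flips1, flips2)

# Uniform draws always stay inside the requested bounds
u <- runif(n = 5, min = -100, max = 100)
range(u)
```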

Random sample selection

> smpl = sample_frac(tl,.10,replace = T)

> dim(smpl)

[1] 100 42

> smpl = sample_n(tl,10,replace = T)

> dim(smpl)

[1] 10 42
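
In current dplyr, sample_n() and sample_frac() are superseded by slice_sample(). A sketch of the equivalent calls on mtcars (32 rows, 11 columns), used here because the telco file is not available; the object names are made up.

```r
library(dplyr)

# A fixed number of records, drawn with replacement
ten_rows <- slice_sample(mtcars, n = 10, replace = TRUE)

# A fraction of the records: 25% of 32 rows
quarter <- slice_sample(mtcars, prop = 0.25)

dim(ten_rows)
dim(quarter)
```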

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
************* Computing the new variables and appending them ***************

For example, in telco, computing the present age and the revised income:

> mutate(tl, nage = age + 4 , revincome = income * 1.2)

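Another function

dplyr::mutate_each(iris, funs(min_rank))

Apply a window function to each column. (mutate_each() and funs() are deprecated in current dplyr; mutate(iris, across(everything(), min_rank)) is the current form.)

Window functions include, for example, min_rank(), lead(), lag(), and cumsum().

dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)

Compute one or more new columns. Drop original columns.

Computing cumulative sums, etc.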

> mutate(tl, cumsum (age))
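
The variants above can be compared side by side on built-in data. This is a sketch only: the column names sepal_area and cum_mpg and the object names are made up, and across() is used as the current replacement for mutate_each().

```r
library(dplyr)

# mutate() appends the new variable to the existing ones
with_new <- mutate(iris, sepal_area = Sepal.Length * Sepal.Width)

# transmute() keeps only the newly computed variable
only_new <- transmute(iris, sepal = Sepal.Length + Sepal.Width)

# A window function (min_rank) applied to every numeric column
ranked <- mutate(iris, across(where(is.numeric), min_rank))

# A running total stored under an explicit name
running <- mutate(mtcars, cum_mpg = cumsum(mpg))

names(with_new)
names(only_new)
```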

Sorting the records of a file based on the values of a variable.

For example, let us sort the records of mtcars by mpg.
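
The notes give no code for this step. A minimal sketch, assuming dplyr's arrange() and desc() are what is intended (the names asc and desc_order are made up):

```r
library(dplyr)

asc        <- arrange(mtcars, mpg)         # lowest mpg first
desc_order <- arrange(mtcars, desc(mpg))   # highest mpg first

head(asc, 3)
head(desc_order, 3)
```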

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Table T1       Table T2
X1   X2        X1   X3
101  P1        101  M
102  P2        102  N
103  P3        103  N
104  P4        106  M
105  P5        107  M

M = Married, N = Not Married

> X1 = c(101,102,103,104,105); X2 = c("P1","P2","P3","P4","P5"); X3 = c("M","N","N","M","M")

> T1 = data.frame (X1,X2) ;

> T1

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

> X1 = c(101,102,103,106,107);X3 = c("M","N","N","M","M")

> T2=data.frame(X1,X3);

> T2

X1 X3

1 101 M

2 102 N

3 103 N

4 106 M

5 107 M

Joins combine two tables on a common key column (here X1)

Left Join

Keeps every row of the left table and adds the matching columns from the right table; rows with no match get NA.

> left_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

4 104 P4 <NA>

5 105 P5 <NA>

Right Join

> right_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

4 106 <NA> M

5 107 <NA> M

Inner Join

> # Common records in both the tables

> inner_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

Full Join

> # all the records in both the tables

> full_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N
4 104 P4 <NA>

5 105 P5 <NA>

6 106 <NA> M

7 107 <NA> M

Semi Join

Retains only the records of the first table whose key (X1) also appears in the second table; no columns of the second table are added

> semi_join(T1,T2,by = "X1")

X1 X2

1 101 P1

2 102 P2

3 103 P3

Anti - Join

Does the converse of Semi Join

It retains the records which are in the first table and not present in the second table

> anti_join(T1,T2,by = "X1")

X1 X2

1 104 P4

2 105 P5
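
The six joins can be compared in one place by counting the rows each returns for T1 and T2 as defined above. The vector name counts is made up for the example.

```r
library(dplyr)

T1 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T2 <- data.frame(X1 = c(101, 102, 103, 106, 107),
                 X3 = c("M", "N", "N", "M", "M"))

# Number of rows returned by each kind of join
counts <- c(left  = nrow(left_join(T1, T2, by = "X1")),
            right = nrow(right_join(T1, T2, by = "X1")),
            inner = nrow(inner_join(T1, T2, by = "X1")),
            full  = nrow(full_join(T1, T2, by = "X1")),
            semi  = nrow(semi_join(T1, T2, by = "X1")),
            anti  = nrow(anti_join(T1, T2, by = "X1")))

print(counts)
```

The counts agree with the outputs listed above: 5, 5, 3, 7, 3 and 2 rows.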

Operations On Sets

Set operations work on tables that have the same columns

Consider the following Tables

> X1 = c(101,102,103,104,105); X2 =c("P1","P2","P3","P4","P5")

> T3 = data.frame(X1,X2)

> T3

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

> X1 = c(101,102,108,109);X2 =c("P1","P2","P3","P6")

> T4 = data.frame(X1,X2)

> T4

X1 X2

1 101 P1

2 102 P2

3 108 P3

4 109 P6

Operations ….

> intersect(T3,T4)

X1 X2

1 101 P1

2 102 P2

> union(T3,T4)

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

6 108 P3

7 109 P6

> setdiff(T3,T4)

X1 X2

1 103 P3

2 104 P4

3 105 P5
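
One point the outputs above do not show is that setdiff() depends on the order of its arguments. A small sketch with the same T3 and T4 (the object names are made up):

```r
library(dplyr)

T3 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T4 <- data.frame(X1 = c(101, 102, 108, 109),
                 X2 = c("P1", "P2", "P3", "P6"))

common  <- intersect(T3, T4)   # records present in both tables
either  <- union(T3, T4)       # records present in at least one table
only_T3 <- setdiff(T3, T4)     # records in T3 but not in T4
only_T4 <- setdiff(T4, T3)     # records in T4 but not in T3

print(only_T4)
```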

bind_rows() and bind_cols() append rows and columns, respectively
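
A minimal sketch of both functions, reusing the T3 and T4 values from the set operations. The data frame extra and the names stacked and wide are made up for the example.

```r
library(dplyr)

T3 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T4 <- data.frame(X1 = c(101, 102, 108, 109),
                 X2 = c("P1", "P2", "P3", "P6"))

# Stack the rows of both tables; repeated records are kept, unlike union()
stacked <- bind_rows(T3, T4)

# Add columns side by side; rows are matched by position, not by a key
extra <- data.frame(X3 = c("M", "N", "N", "M", "M"))
wide  <- bind_cols(T3, extra)

dim(stacked)
dim(wide)
```

Because bind_cols() matches rows by position, a join is the safer choice whenever the two tables may be ordered differently.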

Selecting the records with a specific value of a variable. For example, in the telco file we may wish to select the records where churn is Yes, or the records where churn is No.

The same pattern, shown on a data set ban with the variable default:

> newban = filter(ban,(default == "Yes") )

> newban$default

[1] "Yes" "Yes"

> newban = filter(ban,(default == "No") )

> newban$default

[1] "No" "No" "No"

