
Data Binning & Data Wrangling

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
class(iris)
This is a data.frame. Data frames are not the best way to view a data set; tables (tibbles) give a
better view.
Tabular Representation … other forms of viewing the data

> tbl_df(iris)

# A tibble: 150 x 5

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

<dbl> <dbl> <dbl> <dbl> <fct>

1 5.1 3.5 1.4 0.2 setosa

2 4.9 3 1.4 0.2 setosa

3 4.7 3.2 1.3 0.2 setosa

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
To view a data set in an Excel-like table, one can use

View(iris)

> glimpse(iris)

Rows: 150

Columns: 5

$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1,

5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5...

$ Sepal.Width <dbl> 3.

Statistics / Category wise

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Finding the mean, minimum, and maximum category-wise;

in telco, finding the mean, minimum, and maximum churner-wise.

******************** Grouping *************************

# first grouping by churn


> bychurn = tl %>% group_by(churn)
By appearance the data set is the same as the original file, but after grouping it gets
segregated by churn.
The summary measures by churn:
> bychurn%>%summarise(meanage = mean(age),minage = min(age),maxage=max(age))

`summarise()` ungrouping output (override with `.groups` argument)

# A tibble: 2 x 4

churn meanage minage maxage

<chr> <dbl> <int> <int>

1 No 43.6 19 77

2 Yes 36.5 18 69
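
The same grouping pattern can be reproduced on a built-in data set; a minimal sketch on iris, since the telco file tl is not bundled with R:

```r
library(dplyr)

# group_by() + summarise(): one row of summary statistics per group
stats <- iris %>%
  group_by(Species) %>%
  summarise(mean_len = mean(Petal.Length),
            min_len  = min(Petal.Length),
            max_len  = max(Petal.Length))
stats
```

summarise() collapses each group to a single row, so the result has one row per Species.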

Grouping by education qualification and churn, the measures are as follows:

by_ed_churn = tl%>% group_by(ed,churn)

then finding the statistics

t=by_ed_churn %>% summarise(mean(age))

data.frame(t)

> data.frame(t)

ed churn mean.age.

1 College degree No 41.30986

2 College degree Yes 36.27174

3 Did not complete high school No 47.15116

4 Did not complete high school Yes 37.12500

5 High school degree No 43.92411

**** Summarising the Data *********

> summarise(tl, avg=mean(age))

avg

1 41.684

To get the mean of each variable, wherever possible, one can use

> summarise_each(tl, funs(mean))

region tenure age marital address income ed employ retire gender reside tollfree equip callcard

1 NA 35.526 41.684 NA 11.551 77.535 NA 10.987 NA NA 2.331 NA NA NA
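
Note that summarise_each() and funs() are deprecated in current dplyr; the same column-wise mean is now written with across(). A sketch on iris:

```r
library(dplyr)

# across() applies a function to each selected column inside summarise()
col_means <- summarise(iris, across(where(is.numeric), mean))
col_means   # one row: the mean of each of the four numeric columns
```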


*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Slicing and Dicing of the data

Selecting the records satisfying conditions ;

> freq = filter(tl, age > 68 & income > 100)    # the same can be achieved by tl[tl$age > 68 & tl$income > 100, ]

> freq

region tenure age marital address income ed employ retire gender reside tollfree

1 Zone 3 37 76 Unmarried 38 117 College degree 21 No Female 1 No

2 Zone 2 71 70 Unmarried 30 153 Some college 34 Yes Female 1 Yes

3 Zone 1 47 69 Married 17 228 Some college 39 No Female 4 Yes

4 Zone 1 65 70 Married 9 115 High school degree 39
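
The same filtering can be tried on iris, which ships with R (the thresholds here are illustrative):

```r
library(dplyr)

# filter() keeps the rows satisfying all conditions; base-R equivalent:
# iris[iris$Petal.Length > 6 & iris$Sepal.Width > 3, ]
big <- filter(iris, Petal.Length > 6 & Sepal.Width > 3)
big
```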

Slicing of the data … Selecting a specific range of the records

> slice(tl,21:32)

Selecting the records

Selecting all the records except the records from 30 to 40: slice(tl, -(30:40))

Selecting top n records

> top_n(tl, 2, ed)

If the data is grouped then it will take the top n rows from every group
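
A runnable illustration of this per-group behaviour, using iris (note that top_n() keeps ties, so a group may return more than n rows; slice_max() is its modern replacement):

```r
library(dplyr)

# with grouped data, top_n() returns the top-n rows within each group
top2 <- iris %>%
  group_by(Species) %>%
  top_n(2, Petal.Length)
top2
```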

************* Selecting the Columns of a data Set ***************

# 1 One can select the variables by using the variable names or positions

> select(tl, age, income, 4, 7)    # tl is the telco file; the variables selected are age, income, and the

4th and 7th variables.

# 2 Selecting the data set with out some variables

> select(tl,-age)

> select(tl,-c(age,income,tenure))

Selecting those variables which start with "e"

> select(tl,starts_with("e"))
ed employ equip equipmon equipten ebill

1 College degree 5 No 0.00 0.00 No

Selecting the records with only the variables from 3 to 5: select(tl, 3:5)
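
The selection helpers above can be tried directly on iris; a small sketch:

```r
library(dplyr)

head(select(iris, Sepal.Length, 5), 2)     # by name and by position
ncol(select(iris, -Species))               # drop a column: 4 remain
names(select(iris, starts_with("Petal")))  # columns whose name starts "Petal"
names(select(iris, 3:5))                   # a range of positions, 3 to 5
```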

Selecting the unique / distinct elements

> ir = select(iris,3)

> unique(ir)

Petal.Length

1 1.4

3 1.3

4 1.5

6 1.7

12 1.6

14 1.1

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Distinct Records

Selecting those records which have no repetitions

> ag = c(34,45,34); incm = c(120,130,120); tble = data.frame(cbind(ag,incm))

> tble

ag incm

1 34 120

2 45 130

3 34 120

> distinct(tble)

ag incm

1 34 120

2 45 130

Selecting distinct records … of the file iris

> as_tibble(iris)    # view iris as a tibble

> distinct(iris)

Logical TRUE / FALSE flags, marking where repetitions are present

> incm = c(120,130,120)


> duplicated(incm)

[1] FALSE FALSE TRUE

Finding the repeated elements

incm[duplicated(incm)]

The items that remain after removing the repeated elements, i.e. removing / deleting the

repeated items

> incm[!duplicated(incm)]

Finding the unique records of iris, based on Sepal.Length

The records where the Sepal.Length values are different

> iris%>%distinct(Sepal.Length) # keeps only Sepal.Length; the other columns are dropped

Sepal.Length

1 5.1

> iris%>%distinct(Sepal.Length,.keep_all = T)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species

1 5.1 3.5 1.4 0.2 setosa
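
The duplicated() / distinct() behaviour can be verified end-to-end with the small tble example:

```r
library(dplyr)

tble <- data.frame(ag = c(34, 45, 34), incm = c(120, 130, 120))

duplicated(tble$incm)                    # FALSE FALSE TRUE
tble$incm[!duplicated(tble$incm)]        # 120 130: repeats removed
distinct(tble)                           # the two unique rows
distinct(tble, incm, .keep_all = TRUE)   # unique incm, all columns kept
```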

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Random Numbers

Simulating flipping of coins

A coin is tossed ten times; list the possible outcomes:

> sample(c("H","T"), 10, replace = TRUE)

The coin is biased: 80% of the time it gives heads.

> sample(c("H","T"), 10, prob = c(.8, .2), replace = TRUE)


chest <- c(rep("gold", 20),

rep("silver", 30),

rep("bronze", 50))

# Draw 10 coins from the chest

sample(x = chest,

size = 10)

# 5 samples from the standard Normal distribution
rnorm(n = 5, mean = 0, sd = 1)

# 5 samples from Uniform dist with bounds at 0 and 1

runif(n = 5, min = 0, max = 1)

## [1] 0.0019 0.8019 0.1661 0.3628 0.9268

# 10 samples from Uniform dist with bounds at -100 and +100

runif(n = 10, min = -100, max = 100)

## [1] -10.8 -37.7 2.2 -38.4 -34.6 46.2 -68.8 5.3 92.9 -14.4

Random sample selection

> smpl = sample_frac(tl,.10,replace = T)

> dim(smpl)

[1] 100 42

> smpl = sample_n(tl,10,replace = T)

> dim(smpl)

[1] 10 42
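
A reproducible version of the same sampling calls, on iris (set.seed() fixes the draw; sample_n() and sample_frac() are superseded by slice_sample() in current dplyr):

```r
library(dplyr)

set.seed(1)                    # make the draws reproducible
s1 <- sample_frac(iris, .10)   # 10% of the 150 rows
s2 <- sample_n(iris, 10)       # exactly 10 rows
dim(s1)                        # 15 rows, 5 columns
dim(s2)                        # 10 rows, 5 columns
```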

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
************* Computing the new variables and appending them ***************

For example in telco computing the present age and the revised income

> mutate(tl, nage = age + 4 , revincome = income * 1.2)

Another function

dplyr::mutate_each(iris, funs(min_rank))

Apply a window function to each column. The window functions include min_rank, dense_rank, cume_dist, lag, lead, and cumsum.

dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)

Compute one or more new columns. Drop original columns.

Computing the sum etc

> mutate(tl, cumsum(age))
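
A runnable sketch of mutate() on iris (the column names area and cum_len are illustrative):

```r
library(dplyr)

# mutate() appends new columns; transmute() would keep only the new ones
out <- mutate(iris,
              area    = Sepal.Length * Sepal.Width,  # a computed variable
              cum_len = cumsum(Sepal.Length))        # a running total
ncol(out)                                            # 7: the 5 originals + 2 new
```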

Sorting the records of a file based on the values of a variable;

for example, let us sort the records of mtcars based on mpg.
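
The sort itself is done with arrange() (wrap the variable in desc() for descending order):

```r
library(dplyr)

# arrange() sorts the rows of a data frame by one or more variables
sorted <- arrange(mtcars, mpg)   # ascending mpg
head(sorted$mpg, 3)              # the smallest mpg values come first
arrange(mtcars, desc(mpg))       # largest mpg on top when descending
```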

*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
T1              T2
X1  X2          X1  X3         M : Married
101 P1          101 M          N : Not Married
102 P2          102 N
103 P3          103 N
104 P4          106 M
105 P5          107 M

> X1 = c(101,102,103,104,105); X2 = c("P1","P2","P3","P4","P5"); X3 = c("M","N","N","M","M")

> T1 = data.frame (X1,X2) ;

> T1

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

> X1 = c(101,102,103,106,107);X3 = c("M","N","N","M","M")

> T2=data.frame(X1,X3);

> T2

X1 X3

1 101 M

2 102 N

3 103 N

4 106 M

5 107 M

Joins ::: are for combining tables


Left Join

This keeps every row of the left table and joins on the matching rows of the right table.

> left_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

4 104 P4 <NA>

5 105 P5 <NA>

Right Join

> right_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

4 106 <NA> M

5 107 <NA> M

Inner Join

> # Common records in both the tables

> inner_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N

Full Join

> # all the records in both the tables

> full_join(T1,T2,by = "X1")

X1 X2 X3

1 101 P1 M

2 102 P2 N

3 103 P3 N
4 104 P4 <NA>

5 105 P5 <NA>

6 106 <NA> M

7 107 <NA> M

Semi Join

Will retain the records of the first table that have a match in the second table (on the key); no columns of the second table are added.

> semi_join(T1,T2,by = "X1")

X1 X2

1 101 P1

2 102 P2

3 103 P3

Anti - Join

Will do the converse of the Semi Join:

it retains the records of the first table which are not present in the second table.

> anti_join(T1,T2,by = "X1")

X1 X2

1 104 P4

2 105 P5
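
The row counts of the joins above can be checked with a self-contained sketch of T1 and T2:

```r
library(dplyr)

T1 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T2 <- data.frame(X1 = c(101, 102, 103, 106, 107),
                 X3 = c("M", "N", "N", "M", "M"))

nrow(left_join(T1, T2, by = "X1"))   # 5: every T1 row kept, NA where no match
nrow(inner_join(T1, T2, by = "X1"))  # 3: only the common keys
nrow(full_join(T1, T2, by = "X1"))   # 7: all keys from both tables
nrow(anti_join(T1, T2, by = "X1"))   # 2: T1 rows with no match in T2
```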

Operations On Sets

Operations on the sets which have the same Columns

Consider the following Tables

> X1 = c(101,102,103,104,105); X2 =c("P1","P2","P3","P4","P5")

> T3 = data.frame(X1,X2)

> T3

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

> X1 = c(101,102,108,109);X2 =c("P1","P2","P3","P6")


> T4 = data.frame(X1,X2)

> T4

X1 X2

1 101 P1

2 102 P2

3 108 P3

4 109 P6

Operations ….

> intersect(T3,T4)

X1 X2

1 101 P1

2 102 P2

> union(T3,T4)

X1 X2

1 101 P1

2 102 P2

3 103 P3

4 104 P4

5 105 P5

6 108 P3

7 109 P6

> setdiff(T3,T4)

X1 X2

1 103 P3

2 104 P4

3 105 P5
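
A self-contained check of the set operations (with dplyr loaded, its data-frame methods for intersect(), union(), and setdiff() are used):

```r
library(dplyr)

T3 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T4 <- data.frame(X1 = c(101, 102, 108, 109),
                 X2 = c("P1", "P2", "P3", "P6"))

nrow(intersect(T3, T4))  # 2: rows present in both tables
nrow(union(T3, T4))      # 7: all distinct rows of the two tables
nrow(setdiff(T3, T4))    # 3: rows of T3 that are not in T4
```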

bind_rows and bind_cols are for appending rows and columns, respectively.

Selecting those records with specific values of a variable; for example, in the file telco, the records
where churn is either Yes or No. The same is done here on the file ban with the variable default.

One can use the following


> newban = filter(ban,(default == "Yes") )

> newban$default

[1] "Yes" "Yes"

> newban = filter(ban,(default == "No") )

> newban$default

[1] "No" "No" "No"
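
To select records where the variable takes any of several values at once, %in% is handy; a sketch on a toy stand-in for ban (the real ban file is not bundled with R, so the data here is hypothetical):

```r
library(dplyr)

# a small stand-in for the ban file used in these notes (hypothetical data)
ban <- data.frame(id      = 1:5,
                  default = c("Yes", "No", "No", "Yes", "No"))

filter(ban, default == "Yes")             # only the "Yes" records
filter(ban, default %in% c("Yes", "No"))  # records with either value
```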
