Data Binning & Data Wrangling
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
class(iris)
iris is a data.frame. Printing a plain data frame is not the most convenient way to view a data set;
a tibble gives a more compact view.
Tabular Representation … Other ways of viewing the data
> as_tibble(iris)   # tbl_df(iris) is the older, now deprecated, spelling
# A tibble: 150 x 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
To view a data set in a spreadsheet-style (Excel-like) table, one can use
View(iris)
> glimpse(iris)
Rows: 150
Columns: 5
$ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4, 4.8, 4.8, 4.3, 5.8, 5.7, 5.4, 5.1,
5.7, 5.1, 5.4, 5.1, 4.6, 5.1, 4.8, 5.0, 5.0, 5.2, 5.2, 4.7, 4.8, 5.4, 5.2, 5.5, 4.9, 5.0, 5.5, 4.9, 4.4, 5...
$ Sepal.Width <dbl> 3.
Statistics / Category wise
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Finding the mean, minimum and maximum category-wise.
For example, in telco, finding the mean, minimum and maximum separately for churners and non-churners.
******************** Grouping *************************
# first grouping by churn
> bychurn = tl %>% group_by(churn) ….
When printed, the grouped data set looks the same as the original file, but internally its records are
partitioned by churn, so later summaries are computed per group.
The summary measures by churn
> bychurn%>%summarise(meanage = mean(age),minage = min(age),maxage=max(age))
`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 4
churn meanage minage maxage
<chr> <dbl> <int> <int>
1 No 43.6 19 77
2 Yes 36.5 18 69
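The grouped summary above can be tried without the telco file. The sketch below uses a small made-up data frame (toy is hypothetical; only the column names churn and age are taken from the notes) and assumes dplyr 1.0.0 or later, where summarise() accepts the .groups argument mentioned in the message above.

```r
library(dplyr)

# Hypothetical stand-in for the telco data: only the two columns used here.
toy <- data.frame(
  churn = c("No", "No", "Yes", "Yes", "Yes"),
  age   = c(40, 50, 30, 20, 40)
)

by_churn_summary <- toy %>%
  group_by(churn) %>%
  summarise(
    meanage = mean(age),
    minage  = min(age),
    maxage  = max(age),
    .groups = "drop"   # return an ungrouped result and silence the message
  )

print(by_churn_summary)
```

Setting .groups = "drop" only changes the grouping of the result, not the numbers.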
Grouping by both education level and churn, the measures are as follows
by_ed_churn = tl%>% group_by(ed,churn)
then finding the statistics
t=by_ed_churn %>% summarise(mean(age))
> data.frame(t)
ed churn mean.age.
1 College degree No 41.30986
2 College degree Yes 36.27174
3 Did not complete high school No 47.15116
4 Did not complete high school Yes 37.12500
5 High school degree No 43.92411
**** Summarising the Data *********
> summarise(tl, avg=mean(age))
avg
1 41.684
To get the mean of every variable (non-numeric variables give NA), one can use
> summarise_each(tl, funs(mean))
region tenure age marital address income ed employ retire gender reside tollfree equip callcard
1 NA 35.526 41.684 NA 11.551 77.535 NA 10.987 NA NA 2.331 NA NA NA
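summarise_each() and funs() are deprecated in current dplyr. A possible modern equivalent, restricted to numeric columns so that no NA columns appear, is sketched below with across(). The data frame toy is made up for illustration; only the column names region, tenure and age come from the output above.

```r
library(dplyr)

# Hypothetical miniature of the telco file.
toy <- data.frame(
  region = c("Zone 1", "Zone 2", "Zone 3"),
  tenure = c(10, 20, 30),
  age    = c(30, 40, 50)
)

# Mean of every numeric column; character columns are skipped.
numeric_means <- toy %>%
  summarise(across(where(is.numeric), mean))

print(numeric_means)
```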
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Slicing and Dicing of the data
Selecting the records satisfying conditions:
> freq = filter(tl,age>68 & income > 100)
The same can be achieved in base R with tl[tl$age > 68 & tl$income > 100, ]
> freq
region tenure age marital address income ed employ retire gender reside tollfree
1 Zone 3 37 76 Unmarried 38 117 College degree 21 No Female 1 No
2 Zone 2 71 70 Unmarried 30 153 Some college 34 Yes Female 1 Yes
3 Zone 1 47 69 Married 17 228 Some college 39 No Female 4 Yes
4 Zone 1 65 70 Married 9 115 High school degree 39
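A small check that filter() and base R subsetting select the same records. The data frame toy is made up; only the column names age and income and the condition come from the notes.

```r
library(dplyr)

# Hypothetical data with the two columns used in the condition.
toy <- data.frame(
  age    = c(76, 70, 45, 69, 72),
  income = c(117, 153, 300, 228, 50)
)

with_dplyr <- filter(toy, age > 68 & income > 100)
with_base  <- toy[toy$age > 68 & toy$income > 100, ]

# Same records either way; base R keeps the original row names,
# while filter() renumbers the rows.
print(with_dplyr)
print(with_base)
```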
Slicing of the data … Selecting a specific range of the records
> slice(tl,21:32)
Excluding records
Selecting all the records except records 30 to 40
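The notes do not show the command for this case. One way that should work is a negative index in slice(); the sketch below uses a made-up 50-row data frame (toy and its column id are hypothetical).

```r
library(dplyr)

# Hypothetical data frame with 50 numbered records.
toy <- data.frame(id = 1:50)

# Negative positions drop rows: everything except records 30 to 40.
kept <- slice(toy, -(30:40))

dim(kept)
```

The parentheses matter: -(30:40) means "drop 30 to 40", whereas -30:40 would be the sequence from -30 up to 40.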
Selecting top n records
> top_n(tl,2,ed)
If the data is grouped, then top_n() takes the top n rows from every group
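The per-group behaviour can be seen on iris. The sketch uses slice_max(), which supersedes top_n() in dplyr 1.0.0 and later; the choice of Sepal.Length as the ordering variable is only for illustration.

```r
library(dplyr)

# Top 2 rows by Sepal.Length within each Species.
top_two <- iris %>%
  group_by(Species) %>%
  slice_max(Sepal.Length, n = 2, with_ties = FALSE) %>%
  ungroup()

print(top_two)
```

with_ties = FALSE forces exactly n rows per group; the default keeps all tied rows, so a group may return more than n.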
************* Selecting the Columns of a data Set ***************
# 1 One can select the variables by using the variable names or positions
> select(tl , age, income ,4 , 7 )
Here tl is the telco file; the variables selected are age, income and the 4th and 7th variables.
# 2 Selecting the data set without some variables
> select(tl,-age)
> select(tl,-c(age,income,tenure))
Selecting those variables whose names start with "e"
> select(tl,starts_with("e"))
ed employ equip equipmon equipten ebill
1 College degree 5 No 0.00 0.00 No
Selecting the variables in positions 3 to 5 (all records are kept).
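The command for a range of positions is not shown in the notes. A sketch on iris, assuming the usual position-range syntax of select():

```r
library(dplyr)

# Columns 3 to 5 of iris, selected by position.
cols_3_to_5 <- select(iris, 3:5)

names(cols_3_to_5)
```

All 150 records are kept; only the set of columns changes.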
Selecting the unique / distinct elements
> ir = select(iris,3)
> unique(ir)
Petal.Length
1 1.4
3 1.3
4 1.5
6 1.7
12 1.6
14 1.1
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Distinct Records
Selecting those records which have no repetitions
> ag = c(34,45,34); incm = c(120,130,120); tble = data.frame(cbind(ag,incm))
> tble
ag incm
1 34 120
2 45 130
3 34 120
> distinct(tble)
ag incm
1 34 120
2 45 130
Selecting distinct records … of the file iris
> as_tibble(iris)   # optional: view iris as a tibble first
> distinct(iris)
duplicated() returns a logical vector that is TRUE wherever an element repeats an earlier one
> incm = c(120,130,120)
> duplicated(incm)
[1] FALSE FALSE TRUE
Finding the repeated elements
> incm[duplicated(incm)]
The items that are left after removing (deleting) the repeated elements
> incm[!duplicated(incm)]
Finding the unique records of iris, based on Sepal.Length
(the records whose Sepal.Length values are different)
> iris%>%distinct(Sepal.Length) # keeps only the Sepal.Length column; the other columns are dropped
Sepal.Length
1 5.1
> iris%>%distinct(Sepal.Length,.keep_all = T)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Selecting Random Numbers
Simulating the flipping of coins
A coin is tossed ten times; list one possible sequence of outcomes
> sample(c("H","T"),10,replace = TRUE)
The coin is biased: 80 % of the time it gives heads
> sample(c("H","T"), 10, prob = c(.8 ,.2) ,replace = TRUE)
chest <- c(rep("gold", 20),
rep("silver", 30),
rep("bronze", 50))
# Draw 10 coins from the chest
sample(x = chest,
size = 10)
# 5 samples from Normal dist with mean 0 and standard deviation 1
rnorm(n = 5, mean = 0, sd = 1)
# 5 samples from Uniform dist with bounds at 0 and 1
runif(n = 5, min = 0, max = 1)
## [1] 0.0019 0.8019 0.1661 0.3628 0.9268
# 10 samples from Uniform dist with bounds at -100 and +100
runif(n = 10, min = -100, max = 100)
## [1] -10.8 -37.7 2.2 -38.4 -34.6 46.2 -68.8 5.3 92.9 -14.4
Random sample selection
> smpl = sample_frac(tl,.10,replace = T)
> dim(smpl)
[1] 100 42
> smpl = sample_n(tl,10,replace = T)
> dim(smpl)
[1] 10 42
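Random samples change from run to run unless a seed is fixed. The sketch below fixes one with set.seed() and uses slice_sample(), which supersedes sample_n() and sample_frac() in dplyr 1.0.0 and later. The data frame toy and the seed value 42 are arbitrary choices for illustration.

```r
library(dplyr)

set.seed(42)   # arbitrary seed, only to make the draw repeatable

# Hypothetical data frame with 1000 records.
toy <- data.frame(id = 1:1000)

tenth <- slice_sample(toy, prop = 0.10, replace = TRUE)  # 10 % of the rows
ten   <- slice_sample(toy, n = 10, replace = TRUE)       # exactly 10 rows

dim(tenth)
dim(ten)
```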
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
************* Computing the new variables and appending them ***************
For example in telco computing the present age and the revised income
> mutate(tl, nage = age + 4 , revincome = income * 1.2)
Another function
dplyr::mutate_each(iris, funs(min_rank))
Applies a window function to each column.
Window functions used in these notes include min_rank() and cumsum().
dplyr::transmute(iris, sepal = Sepal.Length + Sepal.Width)
Computes one or more new columns and drops the original columns.
Computing a cumulative sum, etc.
> mutate(tl, cumsum(age))
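mutate_each() and funs() are deprecated in current dplyr. A possible present-day spelling of the same idea uses across() inside mutate(). The data frame toy and the column name cum_a below are made up for illustration.

```r
library(dplyr)

# Hypothetical two-column data frame.
toy <- data.frame(a = c(10, 30, 20), b = c(3, 1, 2))

# Window function applied to every column: each value is replaced by its rank.
ranked <- mutate(toy, across(everything(), min_rank))

# Window function applied to one column: a running total, stored under a name.
running <- mutate(toy, cum_a = cumsum(a))

print(ranked)
print(running)
```

Naming the new column (cum_a = ...) avoids the awkward default column name `cumsum(a)`.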
Sorting the records of a file based on the values of a variable.
For example, let us sort the records of mtcars based on mpg.
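The sorting command itself is not shown in the notes. In dplyr this is normally done with arrange(); a sketch on the built-in mtcars data:

```r
library(dplyr)

by_mpg_asc  <- arrange(mtcars, mpg)        # lowest mpg first
by_mpg_desc <- arrange(mtcars, desc(mpg))  # highest mpg first

head(by_mpg_asc, 3)
```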
*&*&*****&*&&*&*&*&**&*&*&*&*&*&*&*&*&*&*&*&*&*&****&*&*&*&*&*&
Table T1        Table T2
X1   X2         X1   X3
101  P1         101  M
102  P2         102  N
103  P3         103  N
104  P4         106  M
105  P5         107  M
Codes in X3:  M = Married ,  N = Not Married
> X1 = c(101,102,103,104,105); X2 = c("P1","P2","P3","P4","P5"); X3 = c("M","N","N","M","M")
> T1 = data.frame (X1,X2) ;
> T1
X1 X2
1 101 P1
2 102 P2
3 103 P3
4 104 P4
5 105 P5
> X1 = c(101,102,103,106,107);X3 = c("M","N","N","M","M")
> T2=data.frame(X1,X3);
> T2
X1 X3
1 101 M
2 102 N
3 103 N
4 106 M
5 107 M
Joins: used for combining two tables on a common key
Left Join
Keeps every row of the left table (T1) and adds the matching columns from the right table (T2);
rows of T1 with no match in T2 get NA.
> left_join(T1,T2,by = "X1")
X1 X2 X3
1 101 P1 M
2 102 P2 N
3 103 P3 N
4 104 P4 <NA>
5 105 P5 <NA>
Right Join
> right_join(T1,T2,by = "X1")
X1 X2 X3
1 101 P1 M
2 102 P2 N
3 103 P3 N
4 106 <NA> M
5 107 <NA> M
Inner Join
> # Common records in both the tables
> inner_join(T1,T2,by = "X1")
X1 X2 X3
1 101 P1 M
2 102 P2 N
3 103 P3 N
Full Join
> # all the records in both the tables
> full_join(T1,T2,by = "X1")
X1 X2 X3
1 101 P1 M
2 102 P2 N
3 103 P3 N
4 104 P4 <NA>
5 105 P5 <NA>
6 106 <NA> M
7 107 <NA> M
Semi Join
Retains the records of the first table whose key (X1) also appears in the second table; no columns of the second table are added
> semi_join(T1,T2,by = "X1")
X1 X2
1 101 P1
2 102 P2
3 103 P3
Anti - Join
Does the converse of Semi Join.
It retains the records which are in the first table and not present in the second table
> anti_join(T1,T2,by = "X1")
X1 X2
1 104 P4
2 105 P5
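Semi join and anti join together split the first table into two parts: records with a match and records without. A short check using the same T1 and T2 as above (the names matched and unmatched are introduced here only for illustration):

```r
library(dplyr)

T1 <- data.frame(X1 = c(101, 102, 103, 104, 105),
                 X2 = c("P1", "P2", "P3", "P4", "P5"))
T2 <- data.frame(X1 = c(101, 102, 103, 106, 107),
                 X3 = c("M", "N", "N", "M", "M"))

matched   <- semi_join(T1, T2, by = "X1")  # rows of T1 with a key in T2
unmatched <- anti_join(T1, T2, by = "X1")  # rows of T1 with no key in T2

# Every row of T1 falls in exactly one of the two results.
nrow(matched) + nrow(unmatched) == nrow(T1)
```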
Operations On Sets
Operations on tables (sets of records) which have the same columns
Consider the following Tables
> X1 = c(101,102,103,104,105); X2 =c("P1","P2","P3","P4","P5")
> T3 = data.frame(X1,X2)
> T3
X1 X2
1 101 P1
2 102 P2
3 103 P3
4 104 P4
5 105 P5
> X1 = c(101,102,108,109);X2 =c("P1","P2","P3","P6")
> T4 = data.frame(X1,X2)
> T4
X1 X2
1 101 P1
2 102 P2
3 108 P3
4 109 P6
Operations ….
> intersect(T3,T4)
X1 X2
1 101 P1
2 102 P2
> union(T3,T4)
X1 X2
1 101 P1
2 102 P2
3 103 P3
4 104 P4
5 105 P5
6 108 P3
7 109 P6
> setdiff(T3,T4)
X1 X2
1 103 P3
2 104 P4
3 105 P5
bind_rows() and bind_cols() are used for appending rows and columns respectively
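A minimal sketch of both functions. The tables here are cut-down versions of T3 and T4, and the data frame extra is made up for illustration.

```r
library(dplyr)

T3 <- data.frame(X1 = c(101, 102), X2 = c("P1", "P2"))
T4 <- data.frame(X1 = c(108, 109), X2 = c("P3", "P6"))

# Stack the records of T4 under those of T3 (columns matched by name).
stacked <- bind_rows(T3, T4)

# Hypothetical extra column with the same number of rows as T3.
extra <- data.frame(X3 = c("M", "N"))

# Put the columns of extra beside those of T3 (rows matched by position).
widened <- bind_cols(T3, extra)

dim(stacked)
dim(widened)
```

Unlike union(), bind_rows() does not remove repeated records, and unlike a join, bind_cols() does not match on a key, so the row order of the two tables must already agree.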
Selecting those records with specific values of a variable. For example, in the file telco we may wish to
select those records where churn is Yes, or those where churn is No.
The same pattern, applied to the data set ban and its variable default:
> newban = filter(ban,(default == "Yes") )
> newban$default
[1] "Yes" "Yes"
> newban = filter(ban,(default == "No") )
> newban$default
[1] "No" "No" "No"
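To keep records matching any one of several values in a single step, %in% can be used inside filter(). The sketch below uses a made-up data frame (ban_toy is hypothetical, as is the value "Unknown"); only the column name default comes from the notes.

```r
library(dplyr)

# Hypothetical data with a third, unwanted value.
ban_toy <- data.frame(
  default = c("Yes", "No", "Unknown", "Yes", "No"),
  stringsAsFactors = FALSE
)

yes_only  <- filter(ban_toy, default == "Yes")
yes_or_no <- filter(ban_toy, default %in% c("Yes", "No"))

yes_only$default
yes_or_no$default
```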