Datanot

The document outlines a SQL query to analyze customer spending and savings data, connecting multiple tables to derive insights about customer behavior. It includes R code for data processing, model training using decision trees, and prediction evaluation through resampling methods. Additionally, it discusses the implementation of KNN for gender prediction based on financial metrics, emphasizing the importance of proper data handling and model validation.

Uploaded by zeynep

SELECT T2.*, T4.GENDER, T4.MSTAT, T4.CITY, T8.BALANCE
FROM
(SELECT T1.CREDT_CARD_NUM,
SUM(T1.TRANS_AMOUNT) TOT_SPEND,
MIN(T1.TRANS_DATE) FIRST_DATE,
MAX(T1.TRANS_DATE) LAST_DATE
FROM CARD_TRANS T1
WHERE T1.TRANS_TYPE = '-'
GROUP BY T1.CREDT_CARD_NUM) T2,
M_CREDIT_CARDS T3,
M_CUSTOMERS T4,
(SELECT T6.*,T7.BALANCE
FROM
(SELECT T5.CUSTOMER_ID,
MAX(T5.TRANSACTION_DATE) LAST_DATE_SAVE
FROM SAVINGS_TRANS T5
GROUP BY T5.CUSTOMER_ID) T6,
SAVINGS_TRANS T7
WHERE T6.CUSTOMER_ID = T7.CUSTOMER_ID
AND T6.LAST_DATE_SAVE = T7.TRANSACTION_DATE) T8
WHERE T2.CREDT_CARD_NUM = T3.CREDIT_CARD_ID
AND T3.CUSTOMER_ID = T4.CUSTOMER_ID
AND T8.CUSTOMER_ID = T3.CUSTOMER_ID

We need to connect the CARD_TRANS (T1) table to M_CUSTOMERS (T4). Although we do not select any columns from M_CREDIT_CARDS (T3), we need to include it as the bridge between the two tables.

We have two subqueries here:


(SELECT T1.CREDT_CARD_NUM,
SUM(T1.TRANS_AMOUNT) TOT_SPEND,
MIN(T1.TRANS_DATE) FIRST_DATE,
MAX(T1.TRANS_DATE) LAST_DATE
FROM CARD_TRANS T1
WHERE T1.TRANS_TYPE = '-'
GROUP BY T1.CREDT_CARD_NUM) T2
This is the first one: it computes the total spending, the first transaction date, and the last transaction date, because we need them to derive the average daily spending.

(SELECT T5.CUSTOMER_ID,
MAX(T5.TRANSACTION_DATE) LAST_DATE_SAVE
FROM SAVINGS_TRANS T5
GROUP BY T5.CUSTOMER_ID) T6
This is the second one: we constructed a table holding the maximum transaction date per customer, so that we can pick out the row that carries the current balance for each customer.
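The same "latest row per customer" pattern can be sketched in base R with `aggregate()` and `merge()`. The data frame below is a made-up stand-in for SAVINGS_TRANS, not the real table:

```r
# toy stand-in for SAVINGS_TRANS (hypothetical values)
savings <- data.frame(
  CUSTOMER_ID      = c(1, 1, 2, 2),
  TRANSACTION_DATE = as.Date(c("2020-01-01", "2020-03-01",
                               "2020-02-01", "2020-04-01")),
  BALANCE          = c(100, 250, 300, 120)
)
# like T6: maximum transaction date per customer
last.dates <- aggregate(TRANSACTION_DATE ~ CUSTOMER_ID, data = savings, FUN = max)
names(last.dates)[2] <- "LAST_DATE_SAVE"
# like T8: join back on (customer, date) to pick the row with the current balance
t8 <- merge(last.dates, savings,
            by.x = c("CUSTOMER_ID", "LAST_DATE_SAVE"),
            by.y = c("CUSTOMER_ID", "TRANSACTION_DATE"))
t8$BALANCE  # 250 and 120: each customer's balance on their last date
```

This mirrors the SQL logic: an aggregation to find the last date, then a self-join to recover the full row for that date.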

WHERE T6.CUSTOMER_ID = T7.CUSTOMER_ID


AND T6.LAST_DATE_SAVE = T7.TRANSACTION_DATE) T8
WHERE T2.CREDT_CARD_NUM = T3.CREDIT_CARD_ID
AND T3.CUSTOMER_ID = T4.CUSTOMER_ID
AND T8.CUSTOMER_ID = T3.CUSTOMER_ID
This is the last part; these conditions join the tables on their keys and restrict the result to matching rows, preventing a Cartesian product.

R tutorial:
#THIS CODE GETS THE DATA INTO R:
library(odbc)
library(DBI)
con<- DBI::dbConnect(
odbc::odbc(),
Driver = "SQL Server",
Server = "10.1.10.6",
Database = "SaversBankDb",
UID= "student",
PWD= "Khas2020!",
Port= 1433
) #WE HAVE THE CONNECTION

datasetQ1 <- dbGetQuery(con,
"SELECT T3.CUSTOMER_ID,T2.*, T4.GENDER,
T4.MSTAT,T4.CITY,T8.BALANCE
FROM
(SELECT T1.CREDT_CARD_NUM,
SUM(T1.TRANS_AMOUNT) TOT_SPEND,
MIN(T1.TRANS_DATE) FIRST_DATE,
MAX(T1.TRANS_DATE) LAST_DATE
FROM CARD_TRANS T1
WHERE T1.TRANS_TYPE = '-'
GROUP BY T1.CREDT_CARD_NUM) T2,
M_CREDIT_CARDS T3,
M_CUSTOMERS T4,
(SELECT T6.*,T7.BALANCE
FROM
(SELECT T5.CUSTOMER_ID,
MAX(T5.TRANSACTION_DATE) LAST_DATE_SAVE
FROM SAVINGS_TRANS T5
GROUP BY T5.CUSTOMER_ID) T6,
SAVINGS_TRANS T7
WHERE T6.CUSTOMER_ID = T7.CUSTOMER_ID
AND T6.LAST_DATE_SAVE = T7.TRANSACTION_DATE) T8
WHERE T2.CREDT_CARD_NUM = T3.CREDIT_CARD_ID
AND T3.CUSTOMER_ID = T4.CUSTOMER_ID
AND T3.CREDIT_CARD_ID = T4.CREDIT_CARD_ID
AND T8.CUSTOMER_ID = T3.CUSTOMER_ID")

#always check the summary to make sure the data came in correctly

summary(datasetQ1)

#the dates are in character format, as the summary shows, so we need to convert them
datasetQ1$FIRST_DATE <- as.Date(datasetQ1$FIRST_DATE)


datasetQ1$LAST_DATE <- as.Date(datasetQ1$LAST_DATE)
datasetQ1$day_diff <- as.numeric(datasetQ1$LAST_DATE - datasetQ1$FIRST_DATE)
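As a quick check of how the conversion and subtraction behave, a toy example (the dates are made up):

```r
d1 <- as.Date("2020-01-01")  # character string converted to Date
d2 <- as.Date("2020-01-10")
d2 - d1                      # a "difftime" object: Time difference of 9 days
as.numeric(d2 - d1)          # plain number 9, ready for division and arithmetic
```

Wrapping the difference in `as.numeric()` is what makes `day_diff` usable as a denominator in the next step.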

#what we need is to calculate average spending


datasetQ1$AVG_AMOUNT <- datasetQ1$TOT_SPEND / datasetQ1$day_diff

#LAB
#we have average daily spending and we are going to create a class for consumer
datasetQ1$Consumer <- (datasetQ1$AVG_AMOUNT > 150)*1
head(datasetQ1)
#this gives class 1 whenever the average is larger than 150, and 0 otherwise. Now we have the
class variable Consumer, and we will use city, marital status, gender, and savings balance as predictors
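The `(condition)*1` trick turns a logical vector into 0/1 class labels; a toy sketch with made-up values:

```r
avg <- c(90, 210, 150, 400)   # hypothetical average daily spendings
avg > 150                     # logical: FALSE TRUE FALSE TRUE
(avg > 150) * 1               # numeric class labels: 0 1 0 1
```

Multiplying by 1 coerces TRUE/FALSE to 1/0, which is exactly the class encoding rpart needs.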

#as we see, there is a problem in the GENDER column: it stores 1 for male and F for female, so recode
1 as M to prevent problems:
datasetQ1$GENDER[datasetQ1$GENDER=="1"] = "M"

#apply a decision tree model; to be able to do that we need to load the rpart library
library(rpart)

#our dependent variable is Consumer because it is the class variable


m1 <- rpart(Consumer ~ CITY + GENDER + MSTAT + BALANCE , data= datasetQ1)
m1
#read the results:

#1) we have the root node, and

#2) we have a split on CITY (ADN, ANK, ANTEP, ...); 3297 observations with a value of zero. That zero
represents:
[n_i (number of observations from class i)] / [sum(n_i) (all observations in the subset)] =
0/3297 = 0 (a perfectly homogeneous subset)
#so if this is zero, it means there are no consumers in those cities.
#3) In Istanbul, on the other hand, there are 499 observations and the proportion of consumers is
0.9458918
#(this is an estimate of the probability; apparently the model splits the data into two parts using only
the city, so the other variables should be insignificant)

#Question: give a prediction for a new female customer who lives in Istanbul, is married, and has a 5000
balance in her account.
#Just by looking at the previous table you can say that, with probability 0.9458918, she is a consumer
and in class 1 (because she is in Istanbul):

newdata = data.frame(CITY= "IST", GENDER ="F", MSTAT ="M", BALANCE=5000)


predict(m1, newdata=newdata)
#this gives the same result, 0.9458918
#we built a new data frame for the new customer and then generated the prediction

We are also interested in the reliability of predictions: resampling methods to validate our models. How do we do
resampling?
One approach is to divide the dataset into two parts, a training set and a test set. The split is done by taking a
random sample from the dataset: the sampled part becomes the training set and the rest is the test set.
For validation we will use a 50/50 split for now: half for training and half for testing.
How can we take a random sample from our dataset? We can control the row indices of the table. To do
this we need a special function: sample()
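A minimal sketch of sample() for index-based splitting; set.seed() is added here only to make the draw reproducible, it is not part of the lab code:

```r
set.seed(42)                  # fix the RNG so the example is reproducible
idx <- sample(c(1:10), 4)     # 4 distinct indices drawn from 1..10
idx
length(idx)                   # 4
all(idx %in% 1:10)            # TRUE: every index comes from the pool
anyDuplicated(idx) == 0       # TRUE: sampling is without replacement by default
```

Without set.seed(), each run gives a different random subset, which is exactly the behaviour the resampling loop below relies on.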

#RESAMPLING CODE:
c(1:10)
sample(c(1:10), 4)
#we create a vector from 1 to 10 and sample 4 observations from it. If you execute the
same code again you will get different results, because it is a random sampling method.
dim(datasetQ1)
#the function dim() gives you the dimensions of the table: the first element is the number of rows, the
second is the number of columns:

#the row count is 3796, so we will create a vector of 1 to 3796 and take a random sample
#constructing a random vector of row indices for the training set:
#we create a vector from 1 to 3796, take half of it at random, and call it the training set:
index.train <- sample(c(1:dim(datasetQ1)[1]), dim(datasetQ1)[1]/2)

datasetQ1.train <- datasetQ1[index.train,]


datasetQ1.test <- datasetQ1[-index.train,] #the minus sign before index.train removes those
elements, e.g. c(1:4)[-2] removes the 2nd element, leaving 1, 3, 4
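The negative index guarantees that the two parts are disjoint and together cover the whole dataset; a toy check on a made-up 6-row data frame:

```r
df <- data.frame(x = 1:6)                    # hypothetical 6-row dataset
idx <- sample(c(1:nrow(df)), nrow(df)/2)     # half the row indices at random
df.train <- df[idx, , drop = FALSE]
df.test  <- df[-idx, , drop = FALSE]         # "-" drops the sampled rows
nrow(df.train) + nrow(df.test) == nrow(df)   # TRUE: nothing lost
length(intersect(df.train$x, df.test$x)) == 0  # TRUE: no overlap
```

Because each row index goes to exactly one side, no observation is ever used for both training and testing.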

#now we will use training set to train our model and test set for calculating the accuracy:
m2 <- rpart(Consumer ~ CITY + GENDER + MSTAT + BALANCE , data= datasetQ1.train)
m2 #see the results; it is slightly different from m1 but the structure is the same
predictions <- predict(m2, newdata= datasetQ1.test)

datasetQ1.test$Predict = predictions

#The last column is the predicted value; the Consumer column holds our observations. The prediction is
a probability between 0 and 1, but how can we turn this probability into a class prediction? By using a
threshold; let's use 0.8. We do not know whether 0.8 is a good choice; it depends on the costs of false
negatives and false positives.

tau=0.8
datasetQ1.test$Predict.Class <- (predictions>tau)*1
#*1 means if it is larger than tau we are going to call this class 1
#result:
#now we need to build a confusion matrix to do that we should have TP,TN, FP,FN:
#to calculate true positive we need observations that are predicted positive and real observations
that are equal to positive at the same time:
TP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==1))
TN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==0))
FP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==0))
FN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==1))
#we took the sum of them because these are logical vectors; summing counts the TRUE values, which
gives the number of TPs, etc.

confusion.mat <- matrix(c(TN,FP,FN,TP),2,2) #be careful: matrix() fills column by column ->
TN, FP form the first column and FN, TP the second
confusion.mat
#result:
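The four counts can also be cross-tabulated in one call with R's built-in table(); the toy actual/predicted vectors below are made up for illustration:

```r
actual <- c(1, 0, 1, 1, 0, 0)   # hypothetical true classes
pred   <- c(1, 0, 0, 1, 1, 0)   # hypothetical predicted classes
# rows = actual class, columns = predicted class
table(actual, pred)
#       pred
# actual 0 1
#      0 2 1
#      1 1 2
```

The diagonal holds TN and TP, the off-diagonal FP and FN; this is equivalent to the manual matrix() construction above, just with labels attached.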

#precision = TP / (TP + FP); this is the precision value for tau = 0.8


Precision08 = TP / (TP+FP)

#the precision value depends on the random sample we took, so how can we say this is a reliable,
accurate estimate of precision? If it is not, how can we make it reliable? --> Do the sampling
over and over again and then take the average of the precision values. This averages out the
randomness. USE a FOR loop.
#for each replication we calculate the precision value and keep it in a vector, so that later on
we can take the average. The same code inside a for loop:

#initializing precision vectors (NOTE: in class we also calculated tau=0.2 here):

Precision08.vect <- 0
Precision02.vect <- 0
for(r in 1:100)
{
index.train <- sample(c(1:dim(datasetQ1)[1]), dim(datasetQ1)[1]/2)
datasetQ1.train <- datasetQ1[index.train,]
datasetQ1.test <- datasetQ1[-index.train,]
m2 <- rpart(Consumer ~ CITY + GENDER + MSTAT + BALANCE , data= datasetQ1.train)
m2
predictions <- predict(m2, newdata= datasetQ1.test)
datasetQ1.test$Predict = predictions
tau=0.8
datasetQ1.test$Predict.Class <- (predictions>tau)*1
TP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==1))
TN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==0))
FP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==0))
FN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==1))
confusion.mat <- matrix(c(TN,FP,FN,TP),2,2)
Precision08.vect[r] = TP / (TP+FP)

index.train <- sample(c(1:dim(datasetQ1)[1]), dim(datasetQ1)[1]/2)


datasetQ1.train <- datasetQ1[index.train,]
datasetQ1.test <- datasetQ1[-index.train,]
m2 <- rpart(Consumer ~ CITY + GENDER + MSTAT + BALANCE , data= datasetQ1.train)
m2
predictions <- predict(m2, newdata= datasetQ1.test)
datasetQ1.test$Predict = predictions
tau=0.2
datasetQ1.test$Predict.Class <- (predictions>tau)*1
TP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==1))
TN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==0))
FP = sum((datasetQ1.test$Predict.Class==1)&(datasetQ1.test$Consumer==0))
FN = sum((datasetQ1.test$Predict.Class==0)&(datasetQ1.test$Consumer==1))
confusion.mat <- matrix(c(TN,FP,FN,TP),2,2)
Precision02.vect[r] = TP / (TP+FP)
}

#then we will take their mean value (this is more reliable than we calculated for just one value)
mean(Precision02.vect)
mean(Precision08.vect)

NOTE: 2nd question of this lab is cancelled

#knn cross-validation
#QUESTION3
datasetQ3<- dbGetQuery(con,
"SELECT T2.*,T3.GENDER, T4.MONTHLY_INCOME
FROM
(SELECT T1.CUSTOMER_ID,T1.SAVINGS_ACCOUNT,
AVG(T1.BALANCE) AVG_BALANCE,AVG(T1.INVESTMENT) AVG_TRANS
FROM SAVINGS_TRANS T1
GROUP BY T1.CUSTOMER_ID,T1.SAVINGS_ACCOUNT) T2,
M_CUSTOMERS T3,
M_CREDIT_CARDS T4
WHERE T2.CUSTOMER_ID = T3.CUSTOMER_ID
AND T2.SAVINGS_ACCOUNT = T3.SAVING_ACCNT
AND T3.CREDIT_CARD_ID = T4.CREDIT_CARD_ID
AND T3.CUSTOMER_ID = T4.CUSTOMER_ID")
head(datasetQ3)
summary(datasetQ3)
datasetQ3$GENDER[datasetQ3$GENDER=="1"] = "M"
datasetQ3$GENDER.num<- (datasetQ3$GENDER=="M")*1
datasetQ3$MONTHLY_INCOME<- as.numeric(datasetQ3$MONTHLY_INCOME)

#we reorganized our table to make it ready for use and it is now ready for knn
#knn algorithm works with class library
library(class)
?knn

#the knn algorithm checks the k closest observations and chooses the most frequent class as the prediction. R
asks us to provide the k value = the number of neighbours considered.
#suppose we do not have observations for a test set; what we can do is validation. Let's take 80 percent
of the data as the training set and 20 percent as the test set, and make this split at random:
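A minimal sketch of class::knn on made-up 2D points, just to show the calling convention (training matrix, test matrix, class labels, k). All variable names with a .toy suffix are hypothetical and not part of the lab code:

```r
library(class)  # ships with base R as a recommended package

# two well-separated clusters of hypothetical training points
train.toy <- data.frame(x = c(0, 0.2, 5, 5.2), y = c(0, 0.1, 5, 5.1))
cl.toy    <- c(0, 0, 1, 1)               # class labels of the training points
test.toy  <- data.frame(x = c(0.1, 5.1), y = c(0.1, 5.0))

pred.toy <- knn(train.toy, test.toy, cl.toy, k = 1)  # nearest neighbour decides
pred.toy  # factor with levels 0 1: the first test point gets 0, the second gets 1
```

Note that knn() returns a factor, which is why the conversion step discussed later in these notes is needed before comparing predictions numerically.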

index.train <- sample( c( 1:dim(datasetQ3)[1] ), round( dim( datasetQ3 )[1]*0.80 ) )


#this is the vector of all row indices. The 1st element of dim() gives us the number of rows,
which is why we took just the first element of it:
# --> c(1:dim(datasetQ3)[1])

datasetQ3.train <- datasetQ3[index.train,]


datasetQ3.test <- datasetQ3[-index.train,]

#now we can run knn, but first we need to think about which columns should be used as
independent variables. We use the ones with predictive power over our class variable, which is gender in
this case (e.g. customer id is just a random number, but avg balance, avg trans and
monthly income may have predictive power over gender)

head(datasetQ3)

#so we will keep only these 3 columns (avg balance, avg trans and monthly income) as the training set and
provide only these columns to the knn algorithm. THIS IS SOMETHING YOU SHOULD BE
CAREFUL ABOUT IN KNN
train <- datasetQ3.train[,c(3,4,6)] #this assigns just the 3rd, 4th and 6th columns of datasetQ3.train
to the new dataset "train"; same idea for the test set:
test <- datasetQ3.test[,c(3,4,6)]
class <- datasetQ3.train$GENDER.num
pred1 <- knn(train, test, class, k=1)
pred1
#result:

#so far we have seen data formats like numbers, characters, dates, etc. Now we have another
format: factor (levels: 0 1, as shown in the result). It is R's format for categorical variables, and it
is awkward to work with directly, so we will convert it into numbers. To do so, first we change the
factor into character, and then change that into numeric:
datasetQ3.test$pred1 <- as.numeric(as.character(pred1))
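The order of the two conversions matters; a toy factor shows the pitfall (as.numeric applied directly to a factor returns the internal level codes, not the labels):

```r
f <- factor(c("0", "1", "1", "0"))
as.numeric(f)                 # 1 2 2 1 -- internal level codes, NOT what we want
as.numeric(as.character(f))   # 0 1 1 0 -- the actual labels, now as numbers
```

Going through as.character() first recovers the printed labels, which is why the two-step conversion is the safe pattern whenever you see a factor.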

#whenever you see a factor variable you need to make this transformation
#next step is to calculate true positive and true negatives rate.

head(datasetQ3.test)
#result (now we have predictions for each gender for k=1):

#only the last one is a true positive,

#rows 1 to 4 --> 4 false positives
#the 5th one --> a true negative
#to calculate all, you will compare GENDER.num and pred1 columns and count them
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred1 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred1 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred1 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred1 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
confuse.mat
#result:

#with the confusion matrix we can calculate the precision (true positives over all predicted positives) for k=1:
precision1 = TP/ sum((datasetQ3.test$pred1==1))

#change k value to find the best result for precision:

pred3 <- knn(train, test, class, k=3)


datasetQ3.test$pred3 <- as.numeric(as.character(pred3))
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred3 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred3 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred3 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred3 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
precision3 = TP/ sum((datasetQ3.test$pred3==1))

pred5 <- knn(train, test, class, k=5)


datasetQ3.test$pred5 <- as.numeric(as.character(pred5))
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred5 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred5 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred5 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred5 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
precision5 = TP/ sum((datasetQ3.test$pred5==1))

#we calculated precision values for k= 1 , 3 and 5. compare them:


c(precision1,precision3,precision5)
#result (k=3 seems the best, but we should be careful: this result is based on a single random sample.
We need to replicate it a number of times and then determine the best value from the averages):

precision1.vect = precision3.vect = precision5.vect = 0
for( r in 1:100)
{
index.train <- sample( c( 1:dim(datasetQ3)[1] ), round( dim( datasetQ3 )[1]*0.80 ) )
datasetQ3.train <- datasetQ3[index.train,]
datasetQ3.test <- datasetQ3[-index.train,]
train <- datasetQ3.train[,c(3,4,6)]
test <- datasetQ3.test[,c(3,4,6)]
class <- datasetQ3.train$GENDER.num

pred1 <- knn(train, test, class, k=1)


datasetQ3.test$pred1 <- as.numeric(as.character(pred1))
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred1 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred1 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred1 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred1 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
precision1.vect[r] = TP/ sum((datasetQ3.test$pred1==1))

pred3 <- knn(train, test, class, k=3)


datasetQ3.test$pred3 <- as.numeric(as.character(pred3))
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred3 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred3 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred3 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred3 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
precision3.vect[r] = TP/ sum((datasetQ3.test$pred3==1))

pred5 <- knn(train, test, class, k=5)


datasetQ3.test$pred5 <- as.numeric(as.character(pred5))
TP = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred5 ==1))
TN = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred5 ==0))
FP = sum((datasetQ3.test$GENDER.num == 0)&(datasetQ3.test$pred5 ==1))
FN = sum((datasetQ3.test$GENDER.num == 1)&(datasetQ3.test$pred5 ==0))
confuse.mat <- matrix(c(TN,FP,FN,TP),2,2)
precision5.vect[r] = TP/ sum((datasetQ3.test$pred5==1))

}

#after putting this code inside a for loop we can take the mean values:
mean(precision1.vect)
mean(precision3.vect)
mean(precision5.vect)

#now the best value has changed. For one sample it was k=3, but over 100 replications the best k value
becomes k=1.
