Lab Record 21BCG10126
VIT Bhopal University
NAS1001 – Associative Data Analytics (LTP-4)
Slot: B11+B12+B13+B14
Class ID: BL2023241000207
FALL SEMESTER 2023-2024
Course Instructor: Dr. D Lakshmi
Name of the Student: Aniket Shrivastava
Register Number: 21BCG10126
List of Experiments
List of Challenging Experiments (Indicative) SLO: 1,2,5,9,12
1. Understanding of R System and installation and configuration of R Environment and R-Studio, Understanding R Packages, their installation and management
2. Understanding of nuts and bolts of R:
a. R program Structure
b. R Data Type, Command Syntax and Control Structures
c. File Operations in R
3. Excel and R integration with R connector.
4. Preparing Data in R
a. Data Cleaning
b. Data imputation
c. Data conversion
5. Outliers detection using R
6. Correlation and Regression Analysis in R
7. Clustering Algorithms implementation using R
8. Classification Algorithm implementation using R
Classification (Spam/Not spam)
9. Case study on Stock Market Analysis and applications. Stock data can be obtained from Yahoo! Finance, Google Finance. A team of students can apply statistical modeling on the stock data to uncover hidden patterns. R provides tools for moving averages, auto regression and time-series analysis, which form the crux of financial applications.
10. Detect credit card fraudulent transactions - The dataset can be obtained from Kaggle. The team will use a variety of machine learning algorithms that will be able to distinguish fraudulent transactions from non-fraudulent ones.
Experiment No: 1
Aim: Understanding of R System and installation and configuration of R Environment and R-Studio,
Understanding R Packages, their installation and management
Data Description: R is a programming language for statistical computing and graphics
supported by the R Core Team and the R Foundation for Statistical Computing.
Designed by: Ross Ihaka, Robert Gentleman
Installing R:
Download R:
1. Go to the R Project's official website: https://www.r-project.org/
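2. Click the "CRAN" link under "Download" and choose a mirror. On the CRAN page, use the
"Download and Install R" section to download the installer for your operating system.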
3. For Windows: Double-click the downloaded executable file and follow the installation
instructions.
4. For macOS: Double-click the downloaded package file and follow the installation
instructions.
5. For Linux: Follow the installation instructions specific to your Linux distribution.
Installing RStudio:
Download RStudio:
1. Go to the RStudio download page: https://www.rstudio.com/products/rstudio/download/
2. Under "RStudio Desktop," click the appropriate download link for your operating system
(Windows, macOS, or Linux).
Install RStudio:
1. For Windows: Double-click the downloaded installer and follow the installation
instructions.
2. For macOS: Double-click the downloaded disk image (.dmg) file, drag the RStudio icon
to the Applications folder, and then open RStudio from the Applications folder.
3. For Linux: Follow the installation instructions specific to your Linux distribution.
Installing R packages
Installing packages is a fundamental part of working with R. R packages contain pre-built functions, data sets, and
documentation that extend the capabilities of the R programming language. Here are the steps
to install R packages using the R console within RStudio:
Open RStudio:
Launch RStudio on your computer.
Open R Console:
Once RStudio is open, you'll see several panes. By default the R Console is on the left (lower
left once a script is open). This is where you can interact with R directly by typing commands.
Install a Package:
To install an R package, use the install.packages() function with the name of the package in
quotes. For example, to install the "ggplot2" package, type the following command in the
R Console and press Enter:
install.packages("ggplot2")
Load the Package:
After installing a package, you need to load it into your R session to use its functions. Use the
library() function for this purpose. For example, to load the "ggplot2" package, type:
library(ggplot2)
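The install-then-load steps above can be combined so that a package is downloaded only when it is missing. The following is a minimal sketch; the helper name ensure_package is hypothetical (it is not part of R), and the demonstration call uses "stats", which ships with R, so that nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is missing, then attach it.
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)  # downloads from the configured CRAN mirror
  }
  library(pkg, character.only = TRUE)
  invisible(TRUE)
}

# "stats" ships with R, so this call needs no download
ensure_package("stats")

# List everything currently attached to the session
print(search())
```

Calling ensure_package("ggplot2") would follow the same path as the two commands above, installing first if needed.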
Experiment No: 2
Aim: Understanding of nuts and bolts of R:
a. R program Structure
b. R Data Type, Command Syntax and Control Structures
c. File Operations in R
Data Description
a. R Program Structure: An R program consists of a series of commands
that are executed sequentially. These commands can be typed directly into
the R console or saved in a script file with a .R extension.
b. R Data Types, Command Syntax, and Control Structures: R
supports various data types, including numeric, character, logical, and factor:
Numeric: stores numeric values (integers or decimals).
Character: stores text data.
Logical: represents the binary values TRUE or FALSE.
Factor: represents categorical data with a fixed set of levels.
Control structures such as if/else, for, and while direct the flow of execution.
c. File Operations in R: R provides functions for reading and writing text,
CSV, and Excel files.
R Code
a. R Program Structure:
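# Load the packages the script needs (package_name is a placeholder)
library(package_name)
# Define a function
my_function <- function(arg1, arg2) {
  result <- arg1 + arg2  # placeholder computation
  return(result)
}
# Call the function and print the result (value1 and value2 are placeholders)
result <- my_function(value1, value2)
print(result)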
b. R Data Types, Command Syntax, and Control Structures:
x <- 5
name <- "John"
is_valid <- TRUE
sum_result <- 3 + 7
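The assignments above cover data types and basic syntax but not control structures. The sketch below, which uses only base R, checks the type of each value with class() and then shows if/else, for, and while; the variables grade, size, total, value, and steps are illustrative names not taken from the record.

```r
# Basic data types
x <- 5                                    # numeric
name <- "John"                            # character
is_valid <- TRUE                          # logical
grade <- factor(c("low", "high", "low"))  # factor with two levels

types <- c(class(x), class(name), class(is_valid), class(grade))
print(types)

# if / else: choose a label based on a condition
if (x > 3) {
  size <- "large"
} else {
  size <- "small"
}

# for loop: add up the integers from 1 to x
total <- 0
for (i in 1:x) {
  total <- total + i
}

# while loop: count how many halvings it takes for value to drop below 1
value <- 8
steps <- 0
while (value >= 1) {
  value <- value / 2
  steps <- steps + 1
}

print(size)
print(total)
print(steps)
```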
c. File Operations in R:
Reading files
# Reading text files
data <- read.table("data.txt", header = TRUE)
# Reading CSV files
data <- read.csv("data.csv")
# Reading Excel files (requires 'readxl' package)
library(readxl)
data <- read_excel("data.xlsx")
Writing files
# Writing data to text file
write.table(data, "output.txt", sep = "\t", row.names = FALSE)
# Writing data to CSV file
write.csv(data, "output.csv", row.names = FALSE)
# Writing data to Excel file (requires 'openxlsx' package)
library(openxlsx)
write.xlsx(data, "output.xlsx")
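A quick way to check the read and write functions is a round trip through temporary files. This is a sketch using base R only; the data frame people and its values are made up for illustration, and tempfile() is used so that no file in the working directory is touched.

```r
# Made-up data, used only to demonstrate the round trip
people <- data.frame(
  experience_years = c(1.5, 3.0, 5.5),
  salary = c(40000, 55000, 80000)
)

# CSV round trip
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)
restored_csv <- read.csv(csv_path)

# Tab-separated text round trip
txt_path <- tempfile(fileext = ".txt")
write.table(people, txt_path, sep = "\t", row.names = FALSE)
restored_txt <- read.table(txt_path, header = TRUE, sep = "\t")

# Inspect what came back
str(restored_csv)
str(restored_txt)

# Remove the temporary files
unlink(c(csv_path, txt_path))
```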
Experiment No: 3
Aim: Excel and R integration with R connector.
Data Description:
In this example, the CSV file has two columns:
experience_years: This column represents the number of years of experience each person
has.
salary: This column contains the corresponding salary for each person based on their
experience.
Sample rows and columns
R Code
> # read.csv() is part of base R, so no extra package needs to be installed
> Salary_Dataset <- read.csv(file.choose(), header = TRUE)
> Salary_Dataset
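After loading, it helps to confirm that the columns arrived with the expected names and types. The sketch below is non-interactive: instead of file.choose() it reads a small inline CSV whose values are invented for illustration, standing in for the file exported from Excel.

```r
# Stand-in for the exported spreadsheet; the values are invented for illustration
csv_text <- "experience_years,salary
1.0,30000
2.5,42000
4.0,55000"

Salary_Dataset <- read.csv(text = csv_text, header = TRUE)

# Structure: column names, types, and the first few values
str(Salary_Dataset)

# First rows and per-column summary statistics
print(head(Salary_Dataset))
print(summary(Salary_Dataset))
```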
Sample Input and Output
Experiment No: 4
Aim: Preparing Data in R
a. Data Cleaning
b. Data imputation
c. Data conversion
Data Description
In this example, the CSV file has two columns:
experience_years: This column represents the number of years of experience each person
has.
salary: This column contains the corresponding salary for each person based on their
experience.
Sample rows and columns
R Code
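# Load libraries
library(dplyr)
library(missForest)
# Read dataset
data <- read.csv("data.csv")
# Data Cleaning: drop duplicate rows and a column that is not needed
# (Irrelevant_Column is a placeholder name)
cleaned_data <- data %>%
  distinct() %>%
  select(-Irrelevant_Column)
# Data Conversion: missForest() expects numeric or factor columns, so convert
# categorical text before imputing (Categorical_Column is a placeholder name)
cleaned_data$Categorical_Column <- as.factor(cleaned_data$Categorical_Column)
# Check for missing values
missing_values <- sum(is.na(cleaned_data))
if (missing_values > 0) {
  # Data Imputation: missForest() returns a list; the imputed data frame is in $ximp
  imputed_data <- missForest(cleaned_data, verbose = TRUE)$ximp
} else {
  imputed_data <- cleaned_data
}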
# Display prepared dataset
print(imputed_data)
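The same three steps can be tried without any extra packages. The sketch below is a simplified alternative, not the method used above: it swaps missForest for plain median imputation, and the data frame raw contains made-up values chosen to include a duplicate row, missing values, and a numeric column stored as text.

```r
# Made-up data with a duplicate row, missing values, and salary stored as text
raw <- data.frame(
  experience_years = c(1, 2, 2, NA, 5),
  salary = c("30000", "42000", "42000", "51000", NA),
  stringsAsFactors = FALSE
)

# Data cleaning: drop exact duplicate rows
cleaned <- raw[!duplicated(raw), ]

# Data conversion: salary arrived as text, so convert it to numeric
cleaned$salary <- as.numeric(cleaned$salary)

# Data imputation: replace each missing value with the column median
for (col in names(cleaned)) {
  missing <- is.na(cleaned[[col]])
  cleaned[[col]][missing] <- median(cleaned[[col]], na.rm = TRUE)
}

print(cleaned)
```

Median imputation is crude compared with missForest, which predicts each missing value from the other columns, but it makes the effect of each step easy to verify by eye.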
Sample Input and Output
Experiment No: 5
Aim: Outliers detection using R
Data Description
In this example, the CSV file has two columns:
experience_years: This column represents the number of years of experience each person
has.
salary: This column contains the corresponding salary for each person based on their
experience.
Sample rows and columns
R Code
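A minimal sketch of one common approach, the 1.5 x IQR rule, assuming a data frame with the experience_years and salary columns described above. The values are made up for illustration, with one deliberately extreme salary; boxplot.stats() from base R is used as a cross-check.

```r
# Made-up data; the last salary is deliberately extreme
salary_data <- data.frame(
  experience_years = c(1, 2, 3, 4, 5, 6, 7, 8),
  salary = c(30000, 35000, 40000, 45000, 50000, 55000, 60000, 250000)
)

# Quartiles and interquartile range of the salary column
q <- unname(quantile(salary_data$salary, probs = c(0.25, 0.75)))
iqr <- q[2] - q[1]

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
is_outlier <- salary_data$salary < lower | salary_data$salary > upper

outliers <- salary_data[is_outlier, ]
print(outliers)

# Cross-check with the rule used by box plots
print(boxplot.stats(salary_data$salary)$out)
```

Calling boxplot(salary_data$salary) would draw the same outlier as a separate point above the upper whisker.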
Sample Input and Output
Experiment No: 6
Aim: Correlation and Regression Analysis in R
Data Description
In this example, the CSV file has two columns:
experience_years: This column represents the number of years of experience each person
has.
salary: This column contains the corresponding salary for each person based on their
experience.
Sample rows and columns
R Code
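A minimal sketch, assuming a data frame with the experience_years and salary columns described above; the values are made up for illustration. It computes the Pearson correlation with cor() and fits a simple linear regression with lm(), both from base R.

```r
# Made-up data in which salary rises roughly linearly with experience
salary_data <- data.frame(
  experience_years = c(1, 2, 3, 4, 5, 6),
  salary = c(32000, 37500, 41000, 47000, 52500, 56000)
)

# Correlation: strength of the linear relationship (between -1 and 1)
r <- cor(salary_data$experience_years, salary_data$salary)
print(r)

# Regression: salary = intercept + slope * experience_years
model <- lm(salary ~ experience_years, data = salary_data)
print(coef(model))
print(summary(model)$r.squared)

# Predict the salary for a new, hypothetical experience level
predicted <- predict(model, newdata = data.frame(experience_years = 7))
print(predicted)
```

For a regression with a single predictor, the R-squared reported by summary() equals the square of the correlation coefficient.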
Experiment No: 7
Aim: Clustering Algorithms implementation using R
Data Description
In this example, the CSV file has two columns:
experience_years: This column represents the number of years of experience each person
has.
salary: This column contains the corresponding salary for each person based on their
experience.
Sample rows and columns
R Code
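A minimal sketch of k-means clustering with kmeans() from base R, assuming a data frame with the experience_years and salary columns described above. The values are made up so that two groups are clearly separated, and the choice of two clusters is an assumption made for this illustration.

```r
# Made-up data with two clearly separated groups
salary_data <- data.frame(
  experience_years = c(1, 2, 3, 10, 11, 12),
  salary = c(30000, 32000, 35000, 90000, 95000, 99000)
)

# Scale the columns so that salary does not dominate the distance calculation
scaled <- scale(salary_data)

# Fit k-means with 2 clusters; nstart repeats the fit from several random starts
set.seed(123)
fit <- kmeans(scaled, centers = 2, nstart = 10)

# Attach the cluster label to each row
salary_data$cluster <- fit$cluster
print(salary_data)

# Cluster centers (in scaled units) and cluster sizes
print(fit$centers)
print(fit$size)
```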
Sample Input and Output
Experiment No: 8
Aim: Classification Algorithm implementation using R
Classification (Spam/Not spam)
R Code
# Load required libraries
library(tm) # Text mining
library(e1071) # For Naive Bayes classifier
library(caret) # For model evaluation
# Load the SpamAssassin dataset (replace with your actual file path)
spam_data <- read.csv("path/to/spamassassin_data.csv", stringsAsFactors = FALSE)
# Preprocess the text data
corpus <- Corpus(VectorSource(spam_data$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, stripWhitespace)
# Create a document-term matrix
dtm <- DocumentTermMatrix(corpus)
# Convert the document-term matrix to a data frame
spam_df <- as.data.frame(as.matrix(dtm))
colnames(spam_df) <- make.names(colnames(spam_df))
# Combine with labels (as a factor, which naiveBayes() and confusionMatrix() need)
spam_df$label <- as.factor(spam_data$label)
# Split data into training and testing sets
set.seed(123)
train_indices <- sample(1:nrow(spam_df), 0.7 * nrow(spam_df))
train_data <- spam_df[train_indices, ]
test_data <- spam_df[-train_indices, ]
# Train a Naive Bayes classifier
naive_bayes_model <- naiveBayes(label ~ ., data = train_data)
# Make predictions
predictions <- predict(naive_bayes_model, newdata = test_data, type = "class")
# Evaluate the model
conf_matrix <- confusionMatrix(predictions, test_data$label)
print(conf_matrix)
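The evaluation step can be reproduced by hand to see what the confusion matrix reports. This sketch uses base R only; the actual and predicted vectors are made-up labels rather than output from the model above, and the level names spam and not_spam are assumptions.

```r
# Made-up labels for eight messages
lv <- c("not_spam", "spam")
actual <- factor(
  c("spam", "spam", "spam", "not_spam", "not_spam", "not_spam", "not_spam", "spam"),
  levels = lv
)
predicted <- factor(
  c("spam", "not_spam", "spam", "not_spam", "not_spam", "spam", "not_spam", "spam"),
  levels = lv
)

# Confusion matrix: rows are predictions, columns are true labels
conf <- table(Predicted = predicted, Actual = actual)
print(conf)

# Counts with "spam" treated as the positive class
tp <- conf["spam", "spam"]      # spam correctly flagged
fp <- conf["spam", "not_spam"]  # legitimate mail flagged as spam
fn <- conf["not_spam", "spam"]  # spam that slipped through

accuracy <- sum(diag(conf)) / sum(conf)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)

print(c(accuracy = accuracy, precision = precision, recall = recall))
```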
Sample Input and Output
Experiment No: 9
Aim: Case study on Stock Market Analysis and applications. Stock data can be obtained from
Yahoo! Finance, Google Finance. A team of students can apply statistical modeling on the stock
data to uncover hidden patterns. R provides tools for moving averages, auto regression and
time-series analysis, which form the crux of financial applications.
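To make the moving-average idea concrete, here is a minimal base R sketch of a trailing simple moving average. The function name simple_moving_average and the closing prices are made up for illustration; stats::filter() with sides = 1 averages each value with the n - 1 values before it, so the first n - 1 results are NA.

```r
# Hypothetical helper: trailing simple moving average over a window of n values
simple_moving_average <- function(x, n) {
  as.numeric(stats::filter(x, rep(1 / n, n), sides = 1))
}

# Made-up closing prices
close_prices <- c(100, 102, 101, 105, 107, 106, 110)

# 3-day moving average; the first two entries are NA because the window is incomplete
ma_3 <- simple_moving_average(close_prices, n = 3)
print(ma_3)
```

A longer window smooths more but reacts more slowly to price changes, which is why short and long averages are often plotted together.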
Data Description
Stock data imported from Yahoo! Finance.
R Code
# Load required libraries
library(lubridate)  # ymd() for date parsing
library(TTR)        # SMA() for moving averages
library(forecast)   # auto.arima()
library(ggplot2)    # plotting
# Read the stock data CSV file (or load data from API)
stock_data <- read.csv("stock_data.csv")
# Convert date column to Date format
stock_data$Date <- ymd(stock_data$Date)
# Calculate 50-day and 200-day moving averages
stock_data$MA_50 <- SMA(stock_data$Close, n = 50)
stock_data$MA_200 <- SMA(stock_data$Close, n = 200)
# Convert data to time series format
# (frequency = 365 treats one year as the seasonal period; about 252 may suit
# trading-day data better, and decompose() needs at least two full periods)
stock_ts <- ts(stock_data$Close, frequency = 365)
# Fit an ARIMA model with automatic order selection
ar_model <- auto.arima(stock_ts)
# Decompose time series into trend, seasonal, and residual components
decomposed <- decompose(stock_ts)
# Plot decomposed components
plot(decomposed)
# Create a time series plot of stock prices and moving averages
ggplot(stock_data, aes(x = Date)) +
  geom_line(aes(y = Close, color = "Stock Price")) +
  geom_line(aes(y = MA_50, color = "50-day MA")) +
  geom_line(aes(y = MA_200, color = "200-day MA")) +
  labs(title = "Stock Price and Moving Averages", y = "Price") +
  scale_color_manual(values = c("Stock Price" = "blue", "50-day MA" = "red",
                                "200-day MA" = "green"))
Sample Input and Output
Experiment No: 10
Aim: Detect credit card fraudulent transactions - The dataset can be obtained from Kaggle. The
team will use a variety of machine learning algorithms that will be able to distinguish fraudulent
transactions from non-fraudulent ones.
Data Description
The dataset was obtained from Kaggle
R Code
# Load required libraries
library(randomForest)
# Load the Kaggle credit card fraud dataset
# (the file name creditcard.csv is an assumption; use the path of your download)
credit_data <- read.csv("creditcard.csv")
# Treat the target as a factor so randomForest() performs classification
credit_data$Class <- as.factor(credit_data$Class)
# Split data into training and testing sets (70% training, 30% testing)
set.seed(123)
train_indices <- sample(1:nrow(credit_data), 0.7 * nrow(credit_data))
train_data <- credit_data[train_indices, ]
test_data <- credit_data[-train_indices, ]
# Build Random Forest model
rf_model <- randomForest(Class ~ ., data = train_data, ntree = 100)
# Make predictions
predictions <- predict(rf_model, newdata = test_data)
# Calculate accuracy
accuracy <- sum(predictions == test_data$Class) / nrow(test_data)
print(paste("Accuracy score on Test Data:", accuracy))
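The 70/30 split can be checked in isolation before any model is trained. The sketch below uses base R and a made-up data frame named transactions with Amount and Class columns (hypothetical stand-ins for the Kaggle data); it confirms the sizes of the two sets, that no row appears in both, and how the classes are distributed in each.

```r
# Made-up transactions: 20 rows, 4 of them labelled as fraud (Class = 1)
transactions <- data.frame(
  Amount = seq(10, 200, by = 10),
  Class = factor(rep(c(0, 0, 0, 0, 1), times = 4))
)

# Draw 70% of the row indices at random for training
set.seed(123)
n <- nrow(transactions)
train_size <- round(0.7 * n)
train_indices <- sample(seq_len(n), train_size)
test_indices <- setdiff(seq_len(n), train_indices)

train_data <- transactions[train_indices, ]
test_data <- transactions[test_indices, ]

# Sizes of the two sets and the class balance in each
print(c(train = nrow(train_data), test = nrow(test_data)))
print(prop.table(table(train_data$Class)))
print(prop.table(table(test_data$Class)))
```

With rare fraud cases, a purely random split can leave very few positives in the test set, so inspecting the class proportions is a useful check.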
Sample Input and Output