Muthayammal College of Arts And Science
Rasipuram
Assignment No - 1
Name : K.Haritha
Roll no : 21UST004
Department : III- B.Sc., Statistics
Subject : R Programming For Data Analysis
Date :
R PROGRAMMING FOR DATA ANALYSIS
UNIT -1
1. BUILT-IN FUNCTIONS
R is a popular programming language for statistical computing and data analysis. Its
built-in functions support data manipulation, summary statistics, filtering, and random
number generation. They can be broadly grouped by the operations they perform. One common
grouping is:
Mathematical Functions
Statistical Probability Functions
String Functions
Other Statistical Functions
The numbered lists that follow use a finer grouping and give representative functions for each group.
1. Mathematical Functions
sqrt(): Square root
abs(): Absolute value
log(): Natural logarithm
exp(): Exponential function
2. Statistical Functions
mean(): Mean (average)
median(): Median
sd(): Standard deviation
var(): Variance
3. Data Manipulation Functions
subset(): Subset of data
merge(): Merge datasets
rbind(): Combine data frames by rows
cbind(): Combine data frames by columns
4. Data Inspection Functions
head(): Display the first part of a data frame
tail(): Display the last part of a data frame
summary(): Summary statistics
5. Data Visualization Functions
plot(): Create scatterplots, line plots, etc.
hist(): Create histograms
boxplot(): Create boxplots
6. String Manipulation Functions
paste(): Conca…
8. File Handling Functions
read.csv(): Read data from a CSV file
write.csv(): Write data to a CSV file
load(): Load saved objects
9. Statistical Modeling Functions
lm(): Linear regression
glm(): Generalized linear models
t.test(): Perform t-test
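As a small illustration of the mathematical and statistical functions listed above, the sketch below applies a few of them to a made-up numeric vector. The values in x are assumptions chosen only so the results are easy to check by hand.

```r
# Illustrative values; any numeric vector would do
x <- c(4, 9, 16, 25)

roots <- sqrt(x)         # square roots: 2 3 4 5
abs_val <- abs(-7.5)     # absolute value: 7.5
ln_e <- log(exp(1))      # natural logarithm of e: 1

avg <- mean(x)           # (4 + 9 + 16 + 25) / 4 = 13.5
mid <- median(x)         # average of the two middle values, 9 and 16 = 12.5
spread <- sd(x)          # sample standard deviation
variance <- var(x)       # sample variance, equal to sd(x)^2

print(summary(x))        # minimum, quartiles, mean and maximum in one call
```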
2. GRAPHICS IN R
R can produce a wide variety of charts and graphs, for example the bar plot, box plot,
mosaic plot, dot chart, coplot, histogram, pie chart and scatter plot.
Types of R Charts
Bar Plot or Bar Chart
Pie Diagram or Pie Chart
Histogram
Scatter Plot
Box Plot
Bar Plot or Bar Chart
A bar plot (or bar chart) in R represents the values in a data vector as the heights of
the bars, so the data vector passed to the function is shown along the y-axis of the graph.
If the result of table() is passed instead of a raw data vector, the bars show how often
each distinct value occurs, so the bar chart behaves much like a histogram.
Syntax: barplot(data, xlab, ylab)
where
data is the data vector to be represented on y-axis
xlab is the label given to x-axis
ylab is the label given to y-axis
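A minimal sketch of both uses of barplot() follows. The student names, marks and grades are made-up values used only for illustration.

```r
# Hypothetical marks for five students (named vector)
marks <- c(Asha = 72, Bala = 85, Chitra = 64, Deepa = 90, Ezhil = 78)

# Bar heights come from the values; bar labels come from the names
bar_mids <- barplot(marks, xlab = "Student", ylab = "Marks")

# table() counts how often each value occurs, so the bars show frequencies
grades <- c("A", "B", "A", "C", "B", "A")
grade_counts <- table(grades)
barplot(grade_counts, xlab = "Grade", ylab = "Frequency")
```

barplot() also returns the x positions of the bar midpoints (stored here in bar_mids), which can be useful when adding text above the bars.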
Pie Diagram or Pie Chart
A pie chart is a circular chart divided into segments in proportion to the data
provided. The whole pie represents 100% of the data, and each segment shows that value's
share of the total. It is another way to present statistical data graphically, and the
pie() function is used to draw it.
Syntax: pie(x, labels, col, main, radius)
where
x is data vector
labels shows names given to slices
col fills the color in the slices as given parameter
main shows title name of the pie chart
radius sets the radius of the pie chart. The pie is drawn inside a square box whose sides
run from -1 to +1, so values between 0 and 1 are typical (the default is 0.8)
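The pie() arguments described above can be tried with a short example. The expense figures, labels and colours below are made-up values for illustration.

```r
# Hypothetical monthly expenses
expenses <- c(450, 300, 150, 100)
slice_names <- c("Rent", "Food", "Travel", "Other")

# Share of each slice as a percentage of the whole pie
shares <- round(100 * expenses / sum(expenses))   # 45 30 15 10

pie(expenses,
    labels = paste0(slice_names, " (", shares, "%)"),
    col = c("steelblue", "orange", "darkgreen", "grey70"),
    main = "Monthly expenses",
    radius = 0.9)
```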
Histogram
A histogram is a graph whose bars represent the frequency of grouped data in a vector.
It looks like a bar chart, but the difference is that a histogram shows how many values
fall into each group (bin) rather than the data values themselves.
Syntax: hist(x, col, border, main, xlab, ylab)
where
x is data vector
col specifies the color of the bars to be filled
border specifies the color of border of bars
main specifies the title name of histogram
xlab specifies the x-axis label
ylab specifies the y-axis label
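The following sketch draws a histogram of simulated data. The heights are generated with rnorm() purely for illustration, and the mean, standard deviation and colours are arbitrary assumptions.

```r
set.seed(1)  # fixed seed so the simulated data are reproducible
heights <- rnorm(100, mean = 165, sd = 8)

h <- hist(heights,
          col = "lightblue",
          border = "black",
          main = "Histogram of heights",
          xlab = "Height (cm)",
          ylab = "Frequency")

# hist() also returns the bin boundaries and the count in each bin
print(h$breaks)
print(h$counts)
```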
Scatter Plot
A Scatter plot is another type of graphical representation used to plot the points to
show relationship between two data vectors. One of the data vectors is represented on x-axis
and another on y-axis.
Syntax: plot(x, y, type, xlab, ylab, main)
where
x is the data vector represented on x-axis
y is the data vector represented on y-axis
type specifies the type of plot to be drawn. For example, "l" for lines, "p" for points,
"s" for stair steps, etc.
xlab specifies the label for x-axis
ylab specifies the label for y-axis
main specifies the title name of the graph
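A short scatter plot example using the arguments above is given below. The hours and scores are invented values, and the call to cor() is an extra step added here only to put a number on the relationship the plot shows.

```r
# Hypothetical data: hours studied and test scores for six students
hours <- c(1, 2, 3, 4, 5, 6)
scores <- c(35, 42, 50, 61, 68, 80)

# type = "p" draws points; "l" would draw lines instead
plot(hours, scores, type = "p",
     xlab = "Hours studied", ylab = "Test score",
     main = "Score against hours studied")

# Correlation coefficient summarising the linear relationship
r <- cor(hours, scores)
print(r)
```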
Box Plot
A box plot shows how the data in a vector are distributed. It represents five values
in the graph: the minimum, the first quartile, the second quartile (median), the third
quartile and the maximum value of the data vector.
Syntax: boxplot(x, xlab, ylab, notch)
where
x specifies the data vector
xlab specifies the label for x-axis
ylab specifies the label for y-axis
notch, if TRUE, draws a notch on each side of the box
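The five values can be checked with a small example. The data below are made up and contain no extreme values, so the whiskers reach the minimum and maximum. fivenum() is used here only as a cross-check of the values that boxplot() reports.

```r
# Hypothetical sample with no extreme values
x <- c(12, 15, 17, 18, 20, 22, 23, 25, 27)

# notch = FALSE draws a plain box; set it to TRUE to add notches
b <- boxplot(x, xlab = "Sample", ylab = "Value", notch = FALSE)

# boxplot() returns the five values it draws, from bottom to top
print(b$stats)

# Tukey's five-number summary of the same data, for comparison
five <- fivenum(x)
print(five)
```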
3. DATA INPUTTING, DATA ACCESSING AND INDEXING
DATA INPUTTING METHOD
1. Entering Data Manually
You can create a data frame by entering data manually using the `data.frame()` function. For
example:
# Create a data frame with two columns
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)
2. Reading Data from a CSV File
If your data is in a CSV (Comma-Separated Values) file, you can use the `read.csv()` function
to import it into R. For example:
# Read data from a CSV file
my_data <- read.csv("data.csv")
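To try read.csv() without an existing file, the sketch below first writes a small data frame to a temporary CSV file and then reads it back. The temporary path stands in for "data.csv" and the data are the same illustrative names and ages used earlier.

```r
# Temporary file standing in for "data.csv"
csv_path <- tempfile(fileext = ".csv")

people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(people, csv_path, row.names = FALSE)

# Read the file back into a data frame and inspect its structure
my_data <- read.csv(csv_path)
str(my_data)
```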
3. Reading Data from Excel
To read data from an Excel file, you can use packages like `readxl` or `openxlsx`. For
example, with the `readxl` package:
# Install and load the readxl package
install.packages("readxl")
library(readxl)
# Read data from an Excel file
my_data <- read_excel("data.xlsx")
4. Reading Data from a URL
You can read data directly from a URL using functions like `read.csv()` or other specific
functions based on the data format.
5. Generating Synthetic Data
R provides various functions to generate synthetic data, such as `rnorm()` for generating
random numbers following a normal distribution.
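A minimal sketch of generating a synthetic data set is shown below. The column names, sample size and distribution parameters are assumptions made for illustration, and sample() is used in addition to rnorm() to create a grouping column.

```r
set.seed(42)  # fixed seed so the same random numbers are produced each run
n <- 50

synthetic <- data.frame(
  id = seq_len(n),
  height = rnorm(n, mean = 165, sd = 8),        # normally distributed values
  group = sample(c("A", "B"), n, replace = TRUE) # random group labels
)

head(synthetic)
```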
6. Using External Packages for Specific Data Sources
Depending on your data source, you may need to use specialized packages. For instance, the
`jsonlite` package can be used to read JSON data, and the `httr` package can help you retrieve data
from web APIs.
DATA ACCESSING
1. Indexing Data Frames
You can access specific rows and columns in a data frame using square brackets `[ ]`.
For example:
# Access the first row of a data frame
first_row <- my_data[1, ]
# Access the "Name" column
name_column <- my_data$Name
# Access multiple columns
selected_columns <- my_data[, c("Name", "Age")]
2. Filtering Data
You can filter rows based on specific conditions using logical indexing. For example:
# Filter rows where Age is greater than 25
older_than_25 <- my_data[my_data$Age > 25, ]
3. Subsetting Data
You can create subsets of your data based on criteria. For example:
# Create a subset of data for people named "Alice"
alice_data <- subset(my_data, Name == "Alice")
4. Sorting Data
You can sort your data by one or more columns using the `order()` function. For example:
# Sort data by Age in ascending order
sorted_data <- my_data[order(my_data$Age), ]
5. Aggregating Data
You can compute summary statistics on your data using functions like `sum()`, `mean()`,
`median()`, etc., or use the `aggregate()` function for more complex aggregations.
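For example, the sketch below computes overall totals and then a mean for each group with `aggregate()`. The staff data frame and its Dept column are made up for illustration.

```r
# Hypothetical data with a grouping column
staff <- data.frame(
  Dept = c("Stats", "Stats", "Maths", "Maths", "Maths"),
  Age = c(25, 30, 22, 28, 34)
)

total_age <- sum(staff$Age)    # 139
mean_age <- mean(staff$Age)    # 27.8

# Mean age within each department, using the formula interface
age_by_dept <- aggregate(Age ~ Dept, data = staff, FUN = mean)
print(age_by_dept)
```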
6. Merging Data
You can merge two or more data frames together based on common columns using functions
like `merge()` or `dplyr` package functions like `left_join()`, `inner_join()`, etc.
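A minimal sketch with base R `merge()` follows. The two data frames and the city names are made-up examples. `all.x = TRUE` keeps every row of the first data frame, which corresponds to a left join.

```r
people <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22)
)
cities <- data.frame(
  Name = c("Alice", "Bob", "Dinesh"),
  City = c("Salem", "Erode", "Namakkal")
)

# Keep only names present in both data frames (inner join)
both <- merge(people, cities, by = "Name")

# Keep every row of people; City is NA where no match exists (left join)
all_people <- merge(people, cities, by = "Name", all.x = TRUE)
print(all_people)
```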
7. Accessing Lists and Matrices
You can access elements in lists and matrices using indexing similar to data frames: double
square brackets `[[ ]]` extract a single element of a list, and single brackets `[row, column]`
select elements of a matrix.
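The difference between the bracket forms can be seen in a short example. The list and matrix below are made up for illustration.

```r
record <- list(name = "Alice", scores = c(80, 92, 75))

inner <- record[["scores"]]   # the numeric vector itself
outer <- record["scores"]     # a list of length 1 that contains the vector
record$name                   # $ is a shortcut for [["name"]]

# matrix() fills column by column, so the rows are 1 3 5 and 2 4 6
m <- matrix(1:6, nrow = 2)
m[2, 3]   # row 2, column 3: 6
m[1, ]    # whole first row: 1 3 5
m[, 2]    # whole second column: 3 4
```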
INDEXING
Indexing in R refers to the process of accessing or retrieving specific elements from
data structures like vectors, matrices, or data frames. In R, indexing typically involves using
numerical or logical indices to pinpoint particular elements within a data structure.
For example:
• Vector Indexing
In a vector, you can access elements by their position using square brackets. For
instance, my_vector[3] retrieves the third element of the vector.
• Matrix Indexing
For matrices, you can use row and column indices to access specific elements. For
example, my_matrix[2, 4] retrieves the element in the second row and fourth column.
• Data Frame Indexing
Data frames allow indexing by both row and column, such as my_data_frame[1, "Name"]
to access the value in the "Name" column of the first row.
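The three kinds of indexing can be tried together. The object names match those used above, and the values stored in them are assumptions made for illustration.

```r
# Vector indexing: by position and by logical condition
my_vector <- c(10, 20, 30, 40, 50)
third <- my_vector[3]                # 30
big <- my_vector[my_vector > 25]     # 30 40 50

# Matrix indexing: 3 rows and 4 columns, filled column by column
my_matrix <- matrix(1:12, nrow = 3)
elem <- my_matrix[2, 4]              # second row, fourth column: 11

# Data frame indexing: row number and column name
my_data_frame <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 22),
  stringsAsFactors = FALSE
)
first_name <- my_data_frame[1, "Name"]   # "Alice"
```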