Unit 3

The document discusses reading data into R from various sources, including locally stored files, web URLs, databases, and APIs. It covers using functions like read_csv(), read_tsv(), and read_excel() to import tabular data files in different formats. For databases, it describes connecting to SQLite and retrieving table names. The document also defines tidy data as having variables as columns, observations as rows, and each type of observational unit in its own table. This standard structure aids in data analysis and sharing results.

Uploaded by

liman69609

Reading in data locally and from the web

 Reading data is the gateway for any data analysis.

 Data can be read from a local device or from the web.
 In R, “reading” or “loading” is the process of converting data (stored as plain text, a
database, HTML, etc.) into an object (e.g., a data frame).
 There are many ways to store data, and just as many ways to read it.
 Different functions are available in R to import data from various file formats.
 When loading a data set into R, we need to tell R where the file lives. The file could live
on your computer (local) or somewhere on the internet (remote).
 The place where the file lives on your computer is called the “path.”
 There are two kinds of paths: relative paths and absolute paths.
 A relative path is where the file is with respect to where we currently are on the computer
(the current working directory).
 An absolute path is where the file is with respect to the root folder of the computer’s file system.
 As per the figure,
o We are working in a file named worksheet_02.ipynb.
o If we want to read the .csv file named happiness_report.csv into R, we could do this
using either a relative or an absolute path.

Reading happiness_report.csv using a relative path

library(tidyverse)  # loads readr, which provides read_csv()
happy_data <- read_csv("data/happiness_report.csv")

Reading happiness_report.csv using an absolute path

happy_data <- read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
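
The difference between the two kinds of path can be checked from within R. The following is a minimal sketch using only base R. The temporary folder and the one-row file are stand-ins made up for illustration; they are not the course's actual data.

```r
# Work in a throwaway folder so the example leaves the real project untouched.
# (This folder layout is hypothetical; it only mirrors "data/happiness_report.csv".)
old_wd <- setwd(tempdir())
dir.create("data", showWarnings = FALSE)
writeLines(c("country,score", "Finland,7.8"), "data/happiness_report.csv")

# A relative path is resolved against the current working directory.
rel_path <- file.path("data", "happiness_report.csv")

# normalizePath() expands it to the absolute path, starting at the root folder.
abs_path <- normalizePath(rel_path)

# Both paths point at the same file, so they yield the same contents.
same_file <- identical(readLines(rel_path), readLines(abs_path))

print(getwd())    # the folder that relative paths start from
print(abs_path)   # the full location of the file

setwd(old_wd)     # restore the original working directory
```

A relative path keeps working when the whole project folder is moved or shared, whereas an absolute path only works on a machine with exactly the same folder layout.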

 In case of remote files, a Uniform Resource Locator (URL) (web address) indicates the
location of a file/resource.
Reading tabular data from a plain text file into R
 read_csv() reads comma-separated values (.csv) files.
data <- read_csv("data/xyz.csv")

Here the data file is named “xyz.csv” and is stored in the “data” folder.

 read_tsv() reads tab-separated values (.tsv) files.

data <- read_tsv("data/xyz.tsv")
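
To see how the separator determines which function to use, the sketch below writes the same small table in both formats and reads each one back. It assumes the readr package is installed; the file contents and object names are hypothetical.

```r
library(readr)  # part of the tidyverse; provides read_csv(), read_tsv(), read_delim()

# Hypothetical two-row data set, written in both formats to temporary files.
csv_file <- tempfile(fileext = ".csv")
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c("name,score", "a,1", "b,2"), csv_file)
writeLines(c("name\tscore", "a\t1", "b\t2"), tsv_file)

# Same data, different separators, so each file needs the matching function.
from_csv <- read_csv(csv_file, show_col_types = FALSE)
from_tsv <- read_tsv(tsv_file, show_col_types = FALSE)

# read_delim() is the general form: the separator is passed explicitly.
from_delim <- read_delim(tsv_file, delim = "\t", show_col_types = FALSE)

print(from_csv)
```

All three calls return a data frame with the same two columns; only the way the columns are separated in the file differs.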

Reading tabular data directly from a URL

 The read_csv(), read_tsv(), and read_delim() functions can also read data directly from
a Uniform Resource Locator (URL) that points to a tabular data file.
url <- "https://xxx.com/data/xyz.csv"
data <- read_csv(url)

Reading tabular data from a Microsoft Excel file

library(readxl)  # read_excel() comes from the readxl package
data <- read_excel("data/xyz.xlsx")
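
An Excel workbook can hold several sheets, so it often helps to list them before reading. The sketch below assumes the readxl package is installed and uses the sample workbook that ships with it instead of a real data file.

```r
library(readxl)

# readxl_example() returns the path to a sample workbook bundled with readxl.
path <- readxl_example("datasets.xlsx")

# List the sheet names, then read one sheet by name.
sheets <- excel_sheets(path)
first_sheet <- read_excel(path, sheet = sheets[1])

print(sheets)
print(head(first_sheet, 3))
```

If the sheet argument is omitted, read_excel() reads the first sheet of the workbook.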

Reading data from a database

 A relational database is a common form of data storage for large data sets or for projects
with multiple users.
 There are many relational database management systems, such as SQLite, MySQL,
PostgreSQL, Oracle and many more.
 Reading data from a SQLite database
o An SQLite database is self-contained and usually stored and accessed locally.
o Data is usually stored in a file with a .db extension.
o To read data into R from a database, we first need to connect to the database.
o The dbConnect() function from the DBI (database interface) package opens the
connection.
library(DBI)
conn_lang_data <- dbConnect(RSQLite::SQLite(), "data/xyz.db")
o Relational databases may have many tables. In order to retrieve data from a
database, we need to know the name of the table in which the data is stored.
o We can get the names of all the tables in the database using
the dbListTables function:
tables <- dbListTables(conn_lang_data)
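
The full connect, list, and read sequence can be tried without any database file. The following sketch assumes the DBI and RSQLite packages are installed; the in-memory database, the table name "scores", and its contents are made up for illustration.

```r
library(DBI)

# ":memory:" creates a temporary in-memory SQLite database (no file on disk).
conn <- dbConnect(RSQLite::SQLite(), ":memory:")

# Add a small hypothetical table so there is something to list and read.
dbWriteTable(conn, "scores", data.frame(name = c("a", "b"), score = c(1, 2)))

tables <- dbListTables(conn)           # names of all tables in the database
scores <- dbReadTable(conn, "scores")  # read one table into a data frame

print(tables)
print(scores)

dbDisconnect(conn)  # close the connection when done
```

With a real database, the path to the .db file takes the place of ":memory:" and the dbWriteTable() step is not needed.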

Obtaining data from the web using an API

 Data stored as plain text, in spreadsheets, or in comma- or tab-separated files can be read from a
web URL using one of the read_* functions from the tidyverse.
 Many websites now offer an Application Programming Interface (API), which provides a
programmatic way to read a data set.
 This allows the website owner to control who has access to the data, what portion of the
data they have access to, and how much data they can access.
 We can also collect data programmatically from a web page itself, in the form of Hypertext
Markup Language (HTML) and Cascading Style Sheet (CSS) code, and process it to extract
useful information.
 HTML provides the basic structure of a site and CSS helps style the content.
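
Extracting information from HTML with CSS selectors can be sketched as follows. The notes do not name a package for this step; the example assumes rvest, and the page content and the class names "city" and "note" are invented for illustration.

```r
library(rvest)

# A tiny hypothetical page; a real page would be fetched with read_html(url).
page <- minimal_html('
  <h1>City list</h1>
  <ul>
    <li class="city">Alpha</li>
    <li class="city">Beta</li>
    <li class="note">Names are illustrative</li>
  </ul>
')

# The CSS selector "li.city" picks only the <li> elements with class "city".
city_nodes <- html_elements(page, "li.city")

# html_text2() extracts the text inside the selected elements.
cities <- html_text2(city_nodes)

print(cities)
```

The selector does the filtering: the list item with class "note" is skipped because it does not match "li.city".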
What is Tidy Data?
 In a data science project, tidying the data is a necessary step after importing it, so that
the data can be analyzed and the results communicated.

 Tidy datasets provide a standardized way to link the structure of a dataset (its physical
layout) with its semantics (its meaning).
o Structure is the form and shape of data. In statistics, most datasets are rectangular
data tables (data frames) made up of rows and columns.
o Semantics is the meaning of the dataset. A dataset is a collection of values,
either quantitative or qualitative. These values are organized in 2 ways:
by variable and by observation.
 Variables: all values that measure the same underlying attribute across units
 Observations: all values measured on the same unit across attributes
o The 3 rules of tidy data help simplify the concept and make it more intuitive.

 Each variable is a column

 Each observation is a row
 Each type of observational unit is a table

Messy Data
 Messy data is any kind of data that does not follow the above framework.
 To narrow it down, Hadley Wickham’s paper “Tidy Data” lists 5 common problems of messy data:
o Column headers are values, not variable names.
o Multiple variables are stored in one column.
o Variables are stored in both rows and columns.
o Multiple types of observational units are stored in the same table.
o A single observational unit is stored in multiple tables.
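
The first problem in the list, column headers that are values, can be fixed by reshaping the table. The sketch below assumes the tidyr package is installed; the table and its numbers are made up for illustration.

```r
library(tidyr)

# Messy: the column headers "2019" and "2020" are values of a "year" variable.
# (The numbers are made up for illustration.)
messy <- data.frame(
  country = c("A", "B"),
  `2019` = c(10, 20),
  `2020` = c(11, 21),
  check.names = FALSE
)

# pivot_longer() moves the headers into a "year" column and the cells into "value".
tidy <- pivot_longer(messy, cols = c("2019", "2020"),
                     names_to = "year", values_to = "value")

print(tidy)
```

After reshaping, each row is one observation (a country in a given year) and each column is one variable, which matches the three rules of tidy data.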

Why is Tidy Data important?

 If the data set follows a standardized framework, we spend less time on data cleaning and
wrangling and have more time to focus on answering the problem.
 It is good practice to keep the data in a format that makes the analysis reproducible and easy
for others to understand.
 A more technical reason is that tidy data complements the tools available in R. Because R
works with vectors of values (many R functions are vectorized), each column of a tidy data set
can be passed directly to these tools.
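
The link between tidy data and vectorized functions can be seen in a short base R sketch. The data frame and its values are hypothetical.

```r
# Tidy data frame: one row per observation, one column per variable.
# (Values are made up for illustration.)
heights <- data.frame(
  person = c("a", "b", "c"),
  height_cm = c(150, 160, 170)
)

# Vectorized arithmetic: the whole column is converted in one step, no loop.
heights$height_m <- heights$height_cm / 100

# Summary functions also take the column vector directly.
mean_height <- mean(heights$height_cm)

print(heights)
print(mean_height)
```

Because each variable sits in its own column, a single expression applies to every observation at once.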
