Commit 0a338d2

corrections for course 1 - 4
1 parent 2d2b3ee commit 0a338d2

12 files changed, +385 -381 lines changed

1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd

Lines changed: 12 additions & 11 deletions
@@ -2,15 +2,16 @@
 title: "Data Scientist’s Toolbox Course Notes"
 author: "Xing Su"
 output:
+  pdf_document:
+    toc: yes
+    toc_depth: 3
   html_document:
     highlight: pygments
     theme: spacelab
     toc: yes
-  pdf_document:
-    toc: yes
-    toc_depth: 3
 ---
 
+$\pagebreak$
 
 ## CLI (Command Line Interface)
 
@@ -34,21 +35,21 @@ output:
 * `move <file> <directory>` = move file to directory
 * `move <fileName> <newName>` = rename file
 * `echo` = print arguments you give/variables
-* `date` = print current date
+* `date` = print current date
 
 
 
 ## GitHub
 
-* **Workflow**
+* **Workflow**
 1. make edits in workspace
 2. update index/add files
-3. commit to local repo
+3. commit to local repo
 4. push to remote repository
 * `git add .` = add all new files to be tracked
 * `git add -u` = updates tracking for files that are renamed or deleted
 * `git add -A` = both of the above
-* ***Note**: `add` is performed before committing*
+* ***Note**: `add` is performed before committing *
 * `git commit -m "message"` = commit the changes you want to be saved to the local copy
 * `git checkout -b branchname` = create new branch
 * `git branch` = tells you what branch you are on
@@ -68,7 +69,7 @@ output:
 
 ## R Packages
 
-* Primary location for R packages --> CRAN
+* Primary location for R packages $\rightarrow$ CRAN
 * `available.packages()` = all packages available
 * `head(rownames(a),3)` = returns first three names of a
 * `install.packages("nameOfPackage")` = install single package
@@ -83,7 +84,7 @@ output:
 
 ## Types of Data Science Questions
 
-* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic***
+* in order of difficulty: ***Descriptive*** $\rightarrow$ ***Exploratory*** $\rightarrow$ ***Inferential*** $\rightarrow$ ***Predictive*** $\rightarrow$ ***Causal*** $\rightarrow$ ***Mechanistic***
 * **Descriptive analysis** = describe set of data, interpret what you see (census, Google Ngram)
 * **Exploratory analysis** = discovering connections (correlation does not = causation)
 * **Inferential analysis** = use data conclusions from smaller population for the broader group
@@ -101,7 +102,7 @@ output:
 * **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)
 
 ## Experimental Design
-* Formulate you question in advance
+* Formulate you question in advance
 * **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
 * ***[Inference]*** **Variability** = lower variability + clearer differences = decision
 * ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation)
@@ -115,7 +116,7 @@ output:
 * **Positive Predictive Value** = Pr(disease | positive test)
 * **Negative Predictive Value** = Pr(no disease | negative test)
 * **Accuracy** = Pr(correct outcome)
-* **Data dredging** = use data to fit hypothesis
+* **Data dredging** = use data to fit hypothesis
 * **Good experiments** = have replication, measure variability, generalize problem, transparent
 * Prediction is not inference, and be ware of data dredging
 

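The predictive-value definitions in the last hunk of this file can be checked numerically. A minimal R sketch, assuming a made-up 2x2 screening table (the counts and the names `tp`, `fn`, `fp`, `tn` are illustrative and do not come from the notes):

```r
# Hypothetical counts from a screening test (made up for illustration)
tp <- 90   # disease present, test positive
fn <- 10   # disease present, test negative
fp <- 50   # disease absent,  test positive
tn <- 850  # disease absent,  test negative

# Positive Predictive Value = Pr(disease | positive test)
ppv <- tp / (tp + fp)

# Negative Predictive Value = Pr(no disease | negative test)
npv <- tn / (tn + fn)

# Accuracy = Pr(correct outcome)
accuracy <- (tp + tn) / (tp + fn + fp + tn)

print(c(ppv = ppv, npv = npv, accuracy = accuracy))
```

With these counts the test is correct 94% of the time, yet only about 64% of positive results correspond to actual disease, which is why the notes list the predictive values separately from accuracy.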
1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.html

Lines changed: 3 additions & 3 deletions
@@ -109,7 +109,7 @@ <h2>GitHub</h2>
 <li><code>git add -u</code> = updates tracking for files that are renamed or deleted</li>
 <li><code>git add -A</code> = both of the above
 <ul>
-<li><em><strong>Note</strong>: <code>add</code> is performed before committing</em></li>
+<li><em><strong>Note</strong>: <code>add</code> is performed before committing </em></li>
 </ul></li>
 <li><code>git commit -m &quot;message&quot;</code> = commit the changes you want to be saved to the local copy</li>
 <li><code>git checkout -b branchname</code> = create new branch</li>
@@ -130,7 +130,7 @@ <h2>Markdown</h2>
 <div id="r-packages" class="section level2">
 <h2>R Packages</h2>
 <ul>
-<li>Primary location for R packages –&gt; CRAN</li>
+<li>Primary location for R packages <span class="math">\(\rightarrow\)</span> CRAN</li>
 <li><code>available.packages()</code> = all packages available</li>
 <li><code>head(rownames(a),3)</code> = returns first three names of a</li>
 <li><code>install.packages(&quot;nameOfPackage&quot;)</code> = install single package</li>
@@ -147,7 +147,7 @@ <h2>R Packages</h2>
 <div id="types-of-data-science-questions" class="section level2">
 <h2>Types of Data Science Questions</h2>
 <ul>
-<li>in order of difficulty: <strong><em>Descriptive</em></strong> –&gt; <strong><em>Exploratory</em></strong> –&gt; <strong><em>Inferential</em></strong> –&gt; <strong><em>Predictive</em></strong> –&gt; <strong><em>Causal</em></strong> –&gt; <strong><em>Mechanistic</em></strong></li>
+<li>in order of difficulty: <strong><em>Descriptive</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Exploratory</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Inferential</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Predictive</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Causal</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Mechanistic</em></strong></li>
 <li><strong>Descriptive analysis</strong> = describe set of data, interpret what you see (census, Google Ngram)</li>
 <li><strong>Exploratory analysis</strong> = discovering connections (correlation does not = causation)</li>
 <li><strong>Inferential analysis</strong> = use data conclusions from smaller population for the broader group</li>
2.67 KB
Binary file not shown.

2_RPROG/R Programming Course Notes.Rmd

Lines changed: 13 additions & 13 deletions
@@ -19,7 +19,7 @@ $\pagebreak$
 * 1988 rewritten in C (version 3 of language)
 * 1998 version 4 (what we use today)
 * **History of S**
-* Bell labs --> insightful --> Lucent --> Alcatel-Lucent
+* Bell labs $\rightarrow$ insightful $\rightarrow$ Lucent $\rightarrow$ Alcatel-Lucent
 * in 1998, S won the Association for computing machinery’s software system award
 * **History of R**
 * 1991 created in New Zealand by Ross Ihaka & RobertGentleman
@@ -105,7 +105,7 @@ $\pagebreak$
 
 ### Vectors and Lists
 * **atomic vector** = contains one data type, most basic object
-* `vector <- c(value1, value2, )` = creates a vector with specified values
+* `vector <- c(value1, value2, ...)` = creates a vector with specified values
 * `vector1*vector2` = element by element multiplication (rather than matrix multiplication)
 * if the vectors are of different lengths, shorter vector will be recycled until the longer runs out
 * computation on vectors/between vectors (`+`, `-`, `==`, `/`, etc.) are done element by element by default
@@ -122,7 +122,7 @@ $\pagebreak$
 * `as.character(list)` = converts list into a character vector
 * **implicit coercion**
 * matrix/vector can only contain one data type, so when attempting to create matrix/vector with different classes, forced coercion occurs to make every element to same class
-* *least common denominator* is the approach used (basically everything is converted to a class that all values can take, numbers --> characters) and *no errors generated*
+* *least common denominator* is the approach used (basically everything is converted to a class that all values can take, numbers $\rightarrow$ characters) and *no errors generated*
 * coercion occurs to make every element to same class (implicit)
 - `x <- c(NA, 2, "D")` will create a vector of character class
 * `list()` = special vector wit different classes of elements
@@ -131,15 +131,15 @@ $\pagebreak$
 * **logical vectors** = contain values `TRUE`, `FALSE`, and `NA`, values are generated as result of logical conditions comparing two objects/values
 * `paste(characterVector, collapse = " ")` = join together elements of the vector and separating with the `collapse` parameter
 * `paste(vec1, vec2, sep = " ")` = join together different vectors and separating with the `sep` parameter
-* ***Note**: vector recycling applies here too*
+* ***Note**: vector recycling applies here too *
 * `LETTERS`, `letters`= predefined vectors for all 26 upper and lower letters
 * `unique(values)` = returns vector with all duplicates removed
 
 ### Matrices and Data Frames
 * `matrix` can contain **only 1** type of data
 * `data.frame` can contain **multiple**
 * `matrix(values, nrow = n, ncol = m)` = creates a n by m matrix
-* constructed **COLUMN WISE** --> the elements are placed into the matrix from top to bottom for each column, and by column from left to right
+* constructed **COLUMN WISE** $\rightarrow$ the elements are placed into the matrix from top to bottom for each column, and by column from left to right
 * matrices can also be created by adding the dimension attribute to vector
 * `dim(m) <- c(2, 5)`
 * matrices can also be created by binding columns and rows
@@ -192,7 +192,7 @@ x
 * `array(data, dim, dimnames)`
 - `data` = data to be stored in array
 - `dim` = dimensions of the array
-+ `dim = c(2, 2, 5)` = 3 dimensional array --> creates 5 2x2 array
++ `dim = c(2, 2, 5)` = 3 dimensional array $\rightarrow$ creates 5 2x2 array
 - `dimnames` = add names to the dimensions
 + input must be a `list`
 + every element of the `list` must correspond in length to the dimensions of the array
@@ -252,7 +252,7 @@ $\pagebreak$
 
 
 ## Subsetting
-* R uses **one based index** --> starts counting at $1$
+* R uses **one based index** $\rightarrow$ starts counting at $1$
 * `x[0]` returns `numeric(0)`, not error
 * `x[3000]` returns `NA` (not out of bounds/error)
 * `[]` = always returns object of same class, can select more than one element of an object (ex. `[1:2]`)
@@ -421,7 +421,7 @@ mapply(rep, 1:4, 4:1)
 * `factorVar1, factorVar1` = factor variables to split the data by
 * ***Note**: order matters here in terms of how to break down the data *
 * `function` = what is applied to the subsets of data, can be sum/mean/median/etc
-* `na.rm = TRUE` --> removes NA values
+* `na.rm = TRUE` $\rightarrow$ removes NA values
 
 $\pagebreak$
 
@@ -435,10 +435,10 @@ $\pagebreak$
 * `sample(c(y, z), 100)` = select 100 random elements from combination of values y and z
 * `sample(10)` = select positive integer sample of size 10 without repeat
 * Each probability distribution functions usually have 4 functions associated with them:
-* `r***` function (for "random") --> random number generation (ex. `rnorm`)
-* `d***` function (for "density") --> calculate density (ex. `dunif`)
-* `p***` function (for "probability") --> cumulative distribution (ex. `ppois`)
-* `q***` function (for "quantile") --> quantile function (ex. `qbinom`)
+* `r***` function (for "random") $\rightarrow$ random number generation (ex. `rnorm`)
+* `d***` function (for "density") $\rightarrow$ calculate density (ex. `dunif`)
+* `p***` function (for "probability") $\rightarrow$ cumulative distribution (ex. `ppois`)
+* `q***` function (for "quantile") $\rightarrow$ quantile function (ex. `qbinom`)
 * If $\Phi$ is the cumulative distribution function for a standard Normal distribution, then `pnorm(q)` = $\Phi(q)$ and qnorm(p) = $\Phi^{-1}(q)$.
 * `set.seed()` = sets seed for randon number generator to ensure that the same data/analysis can be reproduced
 
@@ -550,7 +550,7 @@ $\pagebreak$
 
 ### Larger Tables
 * ***Note**: help page for read.table important*
-* need to know how much RAM is required --> calculating memory requirements
+* need to know how much RAM is required $\rightarrow$ calculating memory requirements
 * `numRow` x `numCol` x 8 bytes/numeric value = size required in bites
 * double the above results and convert into GB = amount of memory recommended
 * set `comment.char = ""` to save time if there are no comments in the file

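The vector and matrix behaviours described in the hunks of this file (recycling, implicit coercion, column-wise filling, out-of-range subsetting) can be confirmed in a few lines of base R. A small sketch with arbitrary example values; the variable names are illustrative only:

```r
# Recycling: the shorter vector is reused until the longer one runs out
x <- c(1, 2, 3, 4)
y <- c(10, 20)
prod_xy <- x * y                     # element by element: 10 40 30 80

# Recycling also applies to paste()
labels <- paste(1:3, c("X", "Y"), sep = "")   # "1X" "2Y" "3X"

# Implicit coercion: mixed inputs are converted to a class all values can take
mixed <- c(NA, 2, "D")
mixed_class <- class(mixed)          # "character", and no error is raised

# matrix() fills column-wise: top to bottom, then left to right
m <- matrix(1:6, nrow = 2, ncol = 3)
first_row <- m[1, ]                  # 1 3 5

# The same layout results from setting the dim attribute on a vector
v <- 1:10
dim(v) <- c(2, 5)

# One-based indexing: index 0 and out-of-range indices do not raise errors
empty   <- x[0]                      # numeric(0)
missing <- x[3000]                   # NA

print(prod_xy); print(labels); print(mixed_class); print(first_row)
```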
2_RPROG/R_Programming_Course_Notes.html

Lines changed: 15 additions & 15 deletions
@@ -151,7 +151,7 @@ <h2>Overview and History of R</h2>
 </ul></li>
 <li><strong>History of S</strong>
 <ul>
-<li>Bell labs –&gt; insightful –&gt; Lucent –&gt; Alcatel-Lucent</li>
+<li>Bell labs <span class="math">\(\rightarrow\)</span> insightful <span class="math">\(\rightarrow\)</span> Lucent <span class="math">\(\rightarrow\)</span> Alcatel-Lucent</li>
 <li>in 1998, S won the Association for computing machinery’s software system award</li>
 </ul></li>
 <li><strong>History of R</strong>
@@ -269,7 +269,7 @@ <h3>Vectors and Lists</h3>
 <ul>
 <li><strong>atomic vector</strong> = contains one data type, most basic object
 <ul>
-<li><code>vector &lt;- c(value1, value2, )</code> = creates a vector with specified values</li>
+<li><code>vector &lt;- c(value1, value2, ...)</code> = creates a vector with specified values</li>
 <li><code>vector1*vector2</code> = element by element multiplication (rather than matrix multiplication)
 <ul>
 <li>if the vectors are of different lengths, shorter vector will be recycled until the longer runs out</li>
@@ -297,7 +297,7 @@ <h3>Vectors and Lists</h3>
 <ul>
 <li>matrix/vector can only contain one data type, so when attempting to create matrix/vector with different classes, forced coercion occurs to make every element to same class
 <ul>
-<li><em>least common denominator</em> is the approach used (basically everything is converted to a class that all values can take, numbers –&gt; characters) and <em>no errors generated</em></li>
+<li><em>least common denominator</em> is the approach used (basically everything is converted to a class that all values can take, numbers <span class="math">\(\rightarrow\)</span> characters) and <em>no errors generated</em></li>
 <li>coercion occurs to make every element to same class (implicit)</li>
 <li><code>x &lt;- c(NA, 2, &quot;D&quot;)</code> will create a vector of character class</li>
 </ul></li>
@@ -311,7 +311,7 @@ <h3>Vectors and Lists</h3>
 <li><code>paste(characterVector, collapse = &quot; &quot;)</code> = join together elements of the vector and separating with the <code>collapse</code> parameter</li>
 <li><code>paste(vec1, vec2, sep = &quot; &quot;)</code> = join together different vectors and separating with the <code>sep</code> parameter
 <ul>
-<li><em><strong>Note</strong>: vector recycling applies here too</em></li>
+<li><em><strong>Note</strong>: vector recycling applies here too </em></li>
 <li><code>LETTERS</code>, <code>letters</code>= predefined vectors for all 26 upper and lower letters</li>
 </ul></li>
 <li><code>unique(values)</code> = returns vector with all duplicates removed</li>
@@ -324,7 +324,7 @@ <h3>Matrices and Data Frames</h3>
 <li><code>data.frame</code> can contain <strong>multiple</strong></li>
 <li><code>matrix(values, nrow = n, ncol = m)</code> = creates a n by m matrix
 <ul>
-<li>constructed <strong>COLUMN WISE</strong> –&gt; the elements are placed into the matrix from top to bottom for each column, and by column from left to right</li>
+<li>constructed <strong>COLUMN WISE</strong> <span class="math">\(\rightarrow\)</span> the elements are placed into the matrix from top to bottom for each column, and by column from left to right</li>
 <li>matrices can also be created by adding the dimension attribute to vector
 <ul>
 <li><code>dim(m) &lt;- c(2, 5)</code></li>
@@ -413,7 +413,7 @@ <h3>Arrays</h3>
 <li><code>data</code> = data to be stored in array</li>
 <li><code>dim</code> = dimensions of the array
 <ul>
-<li><code>dim = c(2, 2, 5)</code> = 3 dimensional array –&gt; creates 5 2x2 array</li>
+<li><code>dim = c(2, 2, 5)</code> = 3 dimensional array <span class="math">\(\rightarrow\)</span> creates 5 2x2 array</li>
 </ul></li>
 <li><code>dimnames</code> = add names to the dimensions
 <ul>
@@ -505,7 +505,7 @@ <h2>Sequence of Numbers</h2>
 <div id="subsetting" class="section level2">
 <h2>Subsetting</h2>
 <ul>
-<li>R uses <strong>one based index</strong> –&gt; starts counting at <span class="math">\(1\)</span>
+<li>R uses <strong>one based index</strong> <span class="math">\(\rightarrow\)</span> starts counting at <span class="math">\(1\)</span>
 <ul>
 <li><code>x[0]</code> returns <code>numeric(0)</code>, not error</li>
 <li><code>x[3000]</code> returns <code>NA</code> (not out of bounds/error)</li>
@@ -778,7 +778,7 @@ <h3><code>aggregate()</code></h3>
 <li><code>factorVar1, factorVar1</code> = factor variables to split the data by</li>
 <li><em><strong>Note</strong>: order matters here in terms of how to break down the data </em></li>
 <li><code>function</code> = what is applied to the subsets of data, can be sum/mean/median/etc</li>
-<li><code>na.rm = TRUE</code> –&gt; removes NA values</li>
+<li><code>na.rm = TRUE</code> <span class="math">\(\rightarrow\)</span> removes NA values</li>
 </ul></li>
 </ul>
 </div>
@@ -798,10 +798,10 @@ <h2>Simulation</h2>
 </ul></li>
 <li>Each probability distribution functions usually have 4 functions associated with them:
 <ul>
-<li><code>r***</code> function (for “random”) –&gt; random number generation (ex. <code>rnorm</code>)</li>
-<li><code>d***</code> function (for “density”) –&gt; calculate density (ex. <code>dunif</code>)</li>
-<li><code>p***</code> function (for “probability”) –&gt; cumulative distribution (ex. <code>ppois</code>)</li>
-<li><code>q***</code> function (for “quantile”) –&gt; quantile function (ex. <code>qbinom</code>)</li>
+<li><code>r***</code> function (for “random”) <span class="math">\(\rightarrow\)</span> random number generation (ex. <code>rnorm</code>)</li>
+<li><code>d***</code> function (for “density”) <span class="math">\(\rightarrow\)</span> calculate density (ex. <code>dunif</code>)</li>
+<li><code>p***</code> function (for “probability”) <span class="math">\(\rightarrow\)</span> cumulative distribution (ex. <code>ppois</code>)</li>
+<li><code>q***</code> function (for “quantile”) <span class="math">\(\rightarrow\)</span> quantile function (ex. <code>qbinom</code>)</li>
 </ul></li>
 <li>If <span class="math">\(\Phi\)</span> is the cumulative distribution function for a standard Normal distribution, then <code>pnorm(q)</code> = <span class="math">\(\Phi(q)\)</span> and qnorm(p) = <span class="math">\(\Phi^{-1}(q)\)</span>.</li>
 <li><code>set.seed()</code> = sets seed for randon number generator to ensure that the same data/analysis can be reproduced</li>
@@ -948,7 +948,7 @@ <h2>Reading Tabular Data</h2>
 <h3>Larger Tables</h3>
 <ul>
 <li><em><strong>Note</strong>: help page for read.table important</em></li>
-<li>need to know how much RAM is required –&gt; calculating memory requirements
+<li>need to know how much RAM is required <span class="math">\(\rightarrow\)</span> calculating memory requirements
 <ul>
 <li><code>numRow</code> x <code>numCol</code> x 8 bytes/numeric value = size required in bites</li>
 <li>double the above results and convert into GB = amount of memory recommended</li>
@@ -1298,7 +1298,7 @@ <h3>Optimization</h3>
 ## b &lt;- -0.5*sum((data-mu)^2) / (sigma^2)
 ## -(a + b)
 ## }
-## &lt;environment: 0x7fef6462d588&gt;</code></pre>
+## &lt;environment: 0x7ff878f72bb8&gt;</code></pre>
 <pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Estimating Prameters</span>
 <span class="kw">optim</span>(<span class="kw">c</span>(<span class="dt">mu =</span> <span class="dv">0</span>, <span class="dt">sigma =</span> <span class="dv">1</span>), nLL)$par</code></pre>
 <pre><code>## mu sigma
@@ -1365,7 +1365,7 @@ <h2>R Profiler</h2>
 }
 })</code></pre>
 <pre><code>## user system elapsed
-## 0.149 0.005 0.211</code></pre>
+## 0.155 0.004 0.191</code></pre>
 <ul>
 <li><code>system.time(expression)</code>
 <ul>
1.42 KB
Binary file not shown.

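The simulation and memory-sizing bullets touched by this commit can be checked with base R. The sketch below is illustrative only: the seed, the distribution parameters, and the table dimensions are arbitrary assumptions, and dividing by `2^30` is one possible reading of "convert into GB". It also shows that `qnorm` inverts `pnorm`, i.e. `qnorm(p)` is $\Phi^{-1}(p)$.

```r
# set.seed() makes random draws reproducible (42 is an arbitrary seed)
set.seed(42)
draws1 <- rnorm(5)
set.seed(42)
draws2 <- rnorm(5)
same_draws <- identical(draws1, draws2)      # TRUE

# One example from each of the four function families
dens <- dunif(0.5, min = 0, max = 1)         # density of Uniform(0, 1) at 0.5
cum  <- ppois(0, lambda = 1)                 # Pr(X <= 0) for Poisson(1) = exp(-1)
qnt  <- qbinom(0.5, size = 10, prob = 0.5)   # median of Binomial(10, 0.5)

# qnorm is the inverse of pnorm: the round trip recovers the input
p <- pnorm(1.96)
back <- qnorm(p)

# Memory estimate for a hypothetical all-numeric table
n_row <- 1500000
n_col <- 120
bytes <- n_row * n_col * 8                   # 8 bytes per numeric value
gb <- bytes / 2^30                           # roughly 1.34
recommended_gb <- 2 * gb                     # notes suggest doubling the raw size

print(c(dens = dens, cum = cum, qnt = qnt, back = back, gb = gb))
```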