ubali1
diff --git a/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd b/1_DATASCITOOLBOX/Data Scientists Toolbox Course Notes.Rmd
Lines changed: 12 additions & 11 deletions
diff --git a/1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.html b/1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.html
Lines changed: 3 additions & 3 deletions
diff --git a/1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.pdf b/1_DATASCITOOLBOX/Data_Scientists_Toolbox_Course_Notes.pdf
2.67 KB
diff --git a/2_RPROG/R Programming Course Notes.Rmd b/2_RPROG/R Programming Course Notes.Rmd
Lines changed: 13 additions & 13 deletions
diff --git a/2_RPROG/R_Programming_Course_Notes.html b/2_RPROG/R_Programming_Course_Notes.html
Lines changed: 15 additions & 15 deletions
diff --git a/2_RPROG/R_Programming_Course_Notes.pdf b/2_RPROG/R_Programming_Course_Notes.pdf
1.42 KB
@@ -2,15 +2,16 @@
 title: "Data Scientist’s Toolbox Course Notes"
 author: "Xing Su"
 output:
+  pdf_document:
+    toc: yes
+    toc_depth: 3
   html_document:
     highlight: pygments
     theme: spacelab
     toc: yes
-  pdf_document:
-    toc: yes
-    toc_depth: 3
 ---
 
+$\pagebreak$
 
 ## CLI (Command Line Interface)
 
@@ -34,21 +35,21 @@ output:
     * `move <file> <directory>` = move file to directory
     * `move <fileName> <newName>` = rename file
 * `echo` = print arguments you give/variables
-* `date` = print current date 
+* `date` = print current date
 
 
 
 ## GitHub
 
-* **Workflow** 
+* **Workflow**
     1. make edits in workspace
     2. update index/add files
-    3. commit to local repo 
+    3. commit to local repo
     4. push to remote repository
 * `git add .` = add all new files to be tracked
 * `git add -u` = updates tracking for files that are renamed or deleted
 * `git add -A` = both of the above
-    * ***Note**: `add` is performed before committing*
+    * ***Note**: `add` is performed before committing *
 * `git commit -m "message"` = commit the changes you want to be saved to the local copy
 * `git checkout -b branchname` = create new branch
 * `git branch` = tells you what branch you are on
@@ -68,7 +69,7 @@ output:
 
 ## R Packages
 
-* Primary location for R packages --> CRAN
+* Primary location for R packages $\rightarrow$ CRAN
 * `available.packages()` = all packages available
 * `head(rownames(a),3)` = returns first three names of a
 * `install.packages("nameOfPackage")` = install single package
@@ -83,7 +84,7 @@ output:
 
 ## Types of Data Science Questions
 
-* in order of difficulty: ***Descriptive*** --> ***Exploratory*** --> ***Inferential*** --> ***Predictive*** --> ***Causal*** --> ***Mechanistic***
+* in order of difficulty: ***Descriptive*** $\rightarrow$ ***Exploratory*** $\rightarrow$ ***Inferential*** $\rightarrow$ ***Predictive*** $\rightarrow$ ***Causal*** $\rightarrow$ ***Mechanistic***
 * **Descriptive analysis** = describe set of data, interpret what you see (census, Google Ngram)
 * **Exploratory analysis** = discovering connections (correlation does not = causation)
 * **Inferential analysis** = use data conclusions from smaller population for the broader group
@@ -101,7 +102,7 @@ output:
 * **Big data** = now possible to collect data cheap, but not necessarily all useful (need the right data)
 
 ## Experimental Design
-* Formulate your question in advance 
+* Formulate your question in advance
 * **Statistical inference** = select subset, run experiment, calculate descriptive statistics, use inferential statistics to determine if results can be applied broadly
 * ***[Inference]*** **Variability** = lower variability + clearer differences = decision
 * ***[Inference]*** **Confounding** = underlying variable might be causing the correlation (sometimes called Spurious correlation)
@@ -115,7 +116,7 @@ output:
     * **Positive Predictive Value** = Pr(disease | positive test)
     * **Negative Predictive Value** = Pr(no disease | negative test)
     * **Accuracy** = Pr(correct outcome)
-* **Data dredging** = use data to fit hypothesis 
+* **Data dredging** = use data to fit hypothesis
 * **Good experiments** = have replication, measure variability, generalize problem, transparent
 * Prediction is not inference, and beware of data dredging
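
The predictive values and accuracy defined in this hunk can be illustrated with a small worked example. This is only a sketch: the four counts below are made up for illustration and do not come from the notes.

```r
# Hypothetical screening results (counts invented for illustration)
tp <- 90    # disease present, test positive
fn <- 10    # disease present, test negative
fp <- 50    # disease absent,  test positive
tn <- 850   # disease absent,  test negative

ppv      <- tp / (tp + fp)                    # Pr(disease | positive test)
npv      <- tn / (tn + fn)                    # Pr(no disease | negative test)
accuracy <- (tp + tn) / (tp + fn + fp + tn)   # Pr(correct outcome)

round(c(ppv = ppv, npv = npv, accuracy = accuracy), 3)
```

With these counts the accuracy is high even though roughly a third of positive tests are false alarms, which is why the measures are reported separately.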
 
@@ -109,7 +109,7 @@ <h2>GitHub</h2>
 <li><code>git add -u</code> = updates tracking for files that are renamed or deleted</li>
 <li><code>git add -A</code> = both of the above
 <ul>
-<li><em><strong>Note</strong>: <code>add</code> is performed before committing</em></li>
+<li><em><strong>Note</strong>: <code>add</code> is performed before committing </em></li>
 </ul></li>
 <li><code>git commit -m &quot;message&quot;</code> = commit the changes you want to be saved to the local copy</li>
 <li><code>git checkout -b branchname</code> = create new branch</li>
@@ -130,7 +130,7 @@ <h2>Markdown</h2>
 <div id="r-packages" class="section level2">
 <h2>R Packages</h2>
 <ul>
-<li>Primary location for R packages –&gt; CRAN</li>
+<li>Primary location for R packages <span class="math">\(\rightarrow\)</span> CRAN</li>
 <li><code>available.packages()</code> = all packages available</li>
 <li><code>head(rownames(a),3)</code> = returns first three names of a</li>
 <li><code>install.packages(&quot;nameOfPackage&quot;)</code> = install single package</li>
@@ -147,7 +147,7 @@ <h2>R Packages</h2>
 <div id="types-of-data-science-questions" class="section level2">
 <h2>Types of Data Science Questions</h2>
 <ul>
-<li>in order of difficulty: <strong><em>Descriptive</em></strong> –&gt; <strong><em>Exploratory</em></strong> –&gt; <strong><em>Inferential</em></strong> –&gt; <strong><em>Predictive</em></strong> –&gt; <strong><em>Causal</em></strong> –&gt; <strong><em>Mechanistic</em></strong></li>
+<li>in order of difficulty: <strong><em>Descriptive</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Exploratory</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Inferential</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Predictive</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Causal</em></strong> <span class="math">\(\rightarrow\)</span> <strong><em>Mechanistic</em></strong></li>
 <li><strong>Descriptive analysis</strong> = describe set of data, interpret what you see (census, Google Ngram)</li>
 <li><strong>Exploratory analysis</strong> = discovering connections (correlation does not = causation)</li>
 <li><strong>Inferential analysis</strong> = use data conclusions from smaller population for the broader group</li>
 
@@ -19,7 +19,7 @@ $\pagebreak$
     * 1988 rewritten in C (version 3 of language)
     * 1998 version 4 (what we use today)
 * **History of S**
-    * Bell labs --> insightful --> Lucent --> Alcatel-Lucent
+    * Bell labs $\rightarrow$ insightful $\rightarrow$ Lucent $\rightarrow$ Alcatel-Lucent
     * in 1998, S won the Association for Computing Machinery’s Software System Award
 * **History of R**
     * 1991 created in New Zealand by Ross Ihaka & Robert Gentleman
@@ -105,7 +105,7 @@ $\pagebreak$
 
 ### Vectors and Lists
 * **atomic vector** = contains one data type, most basic object
-    * `vector <- c(value1, value2, …)` = creates a vector with specified values
+    * `vector <- c(value1, value2, ...)` = creates a vector with specified values
     * `vector1*vector2` = element by element multiplication (rather than matrix multiplication)
         * if the vectors are of different lengths, shorter vector will be recycled until the longer runs out
         * computation on vectors/between vectors (`+`, `-`, `==`, `/`, etc.) are done element by element by default
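
A minimal sketch of the recycling rule described in this hunk; the vector names `long` and `short` are illustrative only.

```r
long  <- c(1, 2, 3, 4)
short <- c(10, 100)

# The shorter vector is reused (10, 100, 10, 100) to match the longer one
long * short       # 10 200 30 400
long + short       # 11 102 13 104
long == c(1, 0)    # TRUE FALSE FALSE FALSE
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.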
@@ -122,7 +122,7 @@ $\pagebreak$
     * `as.character(list)` = converts list into a character vector
 * **implicit coercion**
     * matrix/vector can only contain one data type, so when attempting to create matrix/vector with different classes, forced coercion occurs to make every element to same class
-        * *least common denominator* is the approach used (basically everything is converted to a class that all values can take, numbers --> characters) and *no errors generated*
+        * *least common denominator* is the approach used (basically everything is converted to a class that all values can take, numbers $\rightarrow$ characters) and *no errors generated*
         * coercion occurs to make every element to same class (implicit)
         - `x <- c(NA, 2, "D")` will create a vector of character class
 * `list()` = special vector with different classes of elements
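
The coercion behavior described in this hunk can be checked directly. A small sketch, using the `c(NA, 2, "D")` example from the notes plus two illustrative objects (`y`, `z`) that are not in the original:

```r
x <- c(NA, 2, "D")         # mixed input: everything becomes character
class(x)                   # "character"

y <- c(1.5, TRUE)          # logical is coerced to numeric (TRUE becomes 1)
class(y)                   # "numeric"

z <- list(1, "a", TRUE)    # a list keeps each element's own class
sapply(z, class)           # "numeric" "character" "logical"
```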
@@ -131,15 +131,15 @@ $\pagebreak$
 * **logical vectors** = contain values `TRUE`, `FALSE`, and `NA`, values are generated as result of logical conditions comparing two objects/values
 * `paste(characterVector, collapse = " ")` = join together elements of the vector and separating with the `collapse` parameter
 * `paste(vec1, vec2, sep = " ")` = join together different vectors and separating with the `sep` parameter
-    * ***Note**: vector recycling applies here too*
+    * ***Note**: vector recycling applies here too *
     * `LETTERS`, `letters`= predefined vectors for all 26 upper and lower letters
 * `unique(values)` = returns vector with all duplicates removed
 
 ### Matrices and Data Frames
 * `matrix` can contain **only 1** type of data
 * `data.frame` can contain **multiple**
 * `matrix(values, nrow = n, ncol = m)` = creates a n by m matrix
-    * constructed **COLUMN WISE** --> the elements are placed into the matrix from top to bottom for each column, and by column from left to right
+    * constructed **COLUMN WISE** $\rightarrow$ the elements are placed into the matrix from top to bottom for each column, and by column from left to right
     * matrices can also be created by adding the dimension attribute to vector
         * `dim(m) <- c(2, 5)`
     * matrices can also be created by binding columns and rows
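
A short sketch of the three construction routes named in this hunk; the object names `m` and `v` are placeholders.

```r
# Column-wise fill: 1, 2 go down the first column, then 3, 4, then 5, 6
m <- matrix(1:6, nrow = 2, ncol = 3)
m
#      [,1] [,2] [,3]
# [1,]    1    3    5
# [2,]    2    4    6

# Adding a dimension attribute turns a plain vector into a matrix
v <- 1:10
dim(v) <- c(2, 5)

# Binding vectors as columns or as rows
cbind(1:3, 4:6)   # 3 x 2 matrix
rbind(1:3, 4:6)   # 2 x 3 matrix
```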
@@ -192,7 +192,7 @@ x
 * `array(data, dim, dimnames)`
     - `data` = data to be stored in array
     - `dim` = dimensions of the array
-        + `dim = c(2, 2, 5)` = 3 dimensional array --> creates 5 2x2 array
+        + `dim = c(2, 2, 5)` = 3 dimensional array $\rightarrow$ creates 5 2x2 array
     - `dimnames` = add names to the dimensions
         + input must be a `list`
         + every element of the `list` must correspond in length to the dimensions of the array
@@ -252,7 +252,7 @@ $\pagebreak$
 
 
 ## Subsetting
-* R uses **one based index** --> starts counting at $1$
+* R uses **one based index** $\rightarrow$ starts counting at $1$
     * `x[0]` returns `numeric(0)`, not error
     * `x[3000]` returns `NA` (not out of bounds/error)
 * `[]` = always returns object of same class, can select more than one element of an object (ex. `[1:2]`)
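
The indexing edge cases listed in this hunk, as a quick sketch on an illustrative numeric vector `x`:

```r
x <- c(10.5, 20.5, 30.5)

x[1]            # 10.5, the first element (indexing starts at 1)
x[0]            # numeric(0), an empty vector rather than an error
x[3000]         # NA, since there is no such element
x[1:2]          # 10.5 20.5
class(x[1:2])   # "numeric", same class as x
```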
@@ -421,7 +421,7 @@ mapply(rep, 1:4, 4:1)
     * `factorVar1, factorVar2` = factor variables to split the data by
     * ***Note**: order matters here in terms of how to break down the data *
     * `function` = what is applied to the subsets of data, can be sum/mean/median/etc
-    * `na.rm = TRUE` --> removes NA values
+    * `na.rm = TRUE` $\rightarrow$ removes NA values
 
 $\pagebreak$
 
@@ -435,10 +435,10 @@ $\pagebreak$
     * `sample(c(y, z), 100)` = select 100 random elements from combination of values y and z
     * `sample(10)` = select positive integer sample of size 10 without repeat
 * Each probability distribution usually has 4 functions associated with it:
-    * `r***` function (for "random") --> random number generation (ex. `rnorm`)
-    * `d***` function (for "density") --> calculate density (ex. `dunif`)
-    * `p***` function (for "probability") --> cumulative distribution (ex. `ppois`)
-    * `q***` function (for "quantile") --> quantile function (ex. `qbinom`)
+    * `r***` function (for "random") $\rightarrow$ random number generation (ex. `rnorm`)
+    * `d***` function (for "density") $\rightarrow$ calculate density (ex. `dunif`)
+    * `p***` function (for "probability") $\rightarrow$ cumulative distribution (ex. `ppois`)
+    * `q***` function (for "quantile") $\rightarrow$ quantile function (ex. `qbinom`)
 * If $\Phi$ is the cumulative distribution function for a standard Normal distribution, then `pnorm(q)` = $\Phi(q)$ and `qnorm(p)` = $\Phi^{-1}(p)$.
 * `set.seed()` = sets seed for random number generator to ensure that the same data/analysis can be reproduced
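
The four function families and the role of `set.seed()` can be sketched as follows. The seed value 42 and the distribution parameters are arbitrary choices for illustration.

```r
set.seed(42)            # fix the seed so the draws are reproducible
x <- rnorm(5)           # r***: five random draws from a standard Normal
set.seed(42)            # resetting the seed replays the same stream
y <- rnorm(5)
identical(x, y)         # TRUE

dunif(0.5, min = 0, max = 1)         # d***: density of Uniform(0, 1) at 0.5
ppois(2, lambda = 1)                 # p***: Pr(X <= 2) for Poisson(1)
qbinom(0.5, size = 10, prob = 0.5)   # q***: median of Binomial(10, 0.5)

# pnorm and qnorm are inverses of each other
p <- 0.975
q <- qnorm(p)               # about 1.96
all.equal(pnorm(q), p)      # TRUE
```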
 
@@ -550,7 +550,7 @@ $\pagebreak$
 
 ### Larger Tables
  * ***Note**: help page for read.table important*
- * need to know how much RAM is required --> calculating memory requirements
+ * need to know how much RAM is required $\rightarrow$ calculating memory requirements
     * `numRow` x `numCol` x 8 bytes/numeric value = size required in bytes
     * double the above results and convert into GB = amount of memory recommended
  * set `comment.char = ""` to save time if there are no comments in the file
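
The memory rule of thumb above works out as follows for a hypothetical all-numeric table. The row and column counts are assumptions chosen for illustration, and the conversion assumes 2^30 bytes per GB.

```r
num_row <- 1500000   # hypothetical number of rows
num_col <- 120       # hypothetical number of numeric columns

bytes          <- num_row * num_col * 8   # 8 bytes per numeric value
gb             <- bytes / 2^30            # convert bytes to GB
recommended_gb <- 2 * gb                  # double it, per the rule of thumb

round(gb, 2)               # 1.34
round(recommended_gb, 2)   # 2.68
```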
 
@@ -151,7 +151,7 @@ <h2>Overview and History of R</h2>
 </ul></li>
 <li><strong>History of S</strong>
 <ul>
-<li>Bell labs –&gt; insightful –&gt; Lucent –&gt; Alcatel-Lucent</li>
+<li>Bell labs <span class="math">\(\rightarrow\)</span> insightful <span class="math">\(\rightarrow\)</span> Lucent <span class="math">\(\rightarrow\)</span> Alcatel-Lucent</li>
 <li>in 1998, S won the Association for Computing Machinery’s Software System Award</li>
 </ul></li>
 <li><strong>History of R</strong>
@@ -269,7 +269,7 @@ <h3>Vectors and Lists</h3>
 <ul>
 <li><strong>atomic vector</strong> = contains one data type, most basic object
 <ul>
-<li><code>vector &lt;- c(value1, value2, …)</code> = creates a vector with specified values</li>
+<li><code>vector &lt;- c(value1, value2, ...)</code> = creates a vector with specified values</li>
 <li><code>vector1*vector2</code> = element by element multiplication (rather than matrix multiplication)
 <ul>
 <li>if the vectors are of different lengths, shorter vector will be recycled until the longer runs out</li>
@@ -297,7 +297,7 @@ <h3>Vectors and Lists</h3>
 <ul>
 <li>matrix/vector can only contain one data type, so when attempting to create matrix/vector with different classes, forced coercion occurs to make every element to same class
 <ul>
-<li><em>least common denominator</em> is the approach used (basically everything is converted to a class that all values can take, numbers –&gt; characters) and <em>no errors generated</em></li>
+<li><em>least common denominator</em> is the approach used (basically everything is converted to a class that all values can take, numbers <span class="math">\(\rightarrow\)</span> characters) and <em>no errors generated</em></li>
 <li>coercion occurs to make every element to same class (implicit)</li>
 <li><code>x &lt;- c(NA, 2, &quot;D&quot;)</code> will create a vector of character class</li>
 </ul></li>
@@ -311,7 +311,7 @@ <h3>Vectors and Lists</h3>
 <li><code>paste(characterVector, collapse = &quot; &quot;)</code> = join together elements of the vector and separating with the <code>collapse</code> parameter</li>
 <li><code>paste(vec1, vec2, sep = &quot; &quot;)</code> = join together different vectors and separating with the <code>sep</code> parameter
 <ul>
-<li><em><strong>Note</strong>: vector recycling applies here too</em></li>
+<li><em><strong>Note</strong>: vector recycling applies here too </em></li>
 <li><code>LETTERS</code>, <code>letters</code>= predefined vectors for all 26 upper and lower letters</li>
 </ul></li>
 <li><code>unique(values)</code> = returns vector with all duplicates removed</li>
@@ -324,7 +324,7 @@ <h3>Matrices and Data Frames</h3>
 <li><code>data.frame</code> can contain <strong>multiple</strong></li>
 <li><code>matrix(values, nrow = n, ncol = m)</code> = creates a n by m matrix
 <ul>
-<li>constructed <strong>COLUMN WISE</strong> –&gt; the elements are placed into the matrix from top to bottom for each column, and by column from left to right</li>
+<li>constructed <strong>COLUMN WISE</strong> <span class="math">\(\rightarrow\)</span> the elements are placed into the matrix from top to bottom for each column, and by column from left to right</li>
 <li>matrices can also be created by adding the dimension attribute to vector
 <ul>
 <li><code>dim(m) &lt;- c(2, 5)</code></li>
@@ -413,7 +413,7 @@ <h3>Arrays</h3>
 <li><code>data</code> = data to be stored in array</li>
 <li><code>dim</code> = dimensions of the array
 <ul>
-<li><code>dim = c(2, 2, 5)</code> = 3 dimensional array –&gt; creates 5 2x2 array</li>
+<li><code>dim = c(2, 2, 5)</code> = 3 dimensional array <span class="math">\(\rightarrow\)</span> creates 5 2x2 array</li>
 </ul></li>
 <li><code>dimnames</code> = add names to the dimensions
 <ul>
@@ -505,7 +505,7 @@ <h2>Sequence of Numbers</h2>
 <div id="subsetting" class="section level2">
 <h2>Subsetting</h2>
 <ul>
-<li>R uses <strong>one based index</strong> –&gt; starts counting at <span class="math">\(1\)</span>
+<li>R uses <strong>one based index</strong> <span class="math">\(\rightarrow\)</span> starts counting at <span class="math">\(1\)</span>
 <ul>
 <li><code>x[0]</code> returns <code>numeric(0)</code>, not error</li>
 <li><code>x[3000]</code> returns <code>NA</code> (not out of bounds/error)</li>
@@ -778,7 +778,7 @@ <h3><code>aggregate()</code></h3>
 <li><code>factorVar1, factorVar2</code> = factor variables to split the data by</li>
 <li><em><strong>Note</strong>: order matters here in terms of how to break down the data </em></li>
 <li><code>function</code> = what is applied to the subsets of data, can be sum/mean/median/etc</li>
-<li><code>na.rm = TRUE</code> –&gt; removes NA values</li>
+<li><code>na.rm = TRUE</code> <span class="math">\(\rightarrow\)</span> removes NA values</li>
 </ul></li>
 </ul>
 </div>
@@ -798,10 +798,10 @@ <h2>Simulation</h2>
 </ul></li>
 <li>Each probability distribution usually has 4 functions associated with it:
 <ul>
-<li><code>r***</code> function (for “random”) –&gt; random number generation (ex. <code>rnorm</code>)</li>
-<li><code>d***</code> function (for “density”) –&gt; calculate density (ex. <code>dunif</code>)</li>
-<li><code>p***</code> function (for “probability”) –&gt; cumulative distribution (ex. <code>ppois</code>)</li>
-<li><code>q***</code> function (for “quantile”) –&gt; quantile function (ex. <code>qbinom</code>)</li>
+<li><code>r***</code> function (for “random”) <span class="math">\(\rightarrow\)</span> random number generation (ex. <code>rnorm</code>)</li>
+<li><code>d***</code> function (for “density”) <span class="math">\(\rightarrow\)</span> calculate density (ex. <code>dunif</code>)</li>
+<li><code>p***</code> function (for “probability”) <span class="math">\(\rightarrow\)</span> cumulative distribution (ex. <code>ppois</code>)</li>
+<li><code>q***</code> function (for “quantile”) <span class="math">\(\rightarrow\)</span> quantile function (ex. <code>qbinom</code>)</li>
 </ul></li>
 <li>If <span class="math">\(\Phi\)</span> is the cumulative distribution function for a standard Normal distribution, then <code>pnorm(q)</code> = <span class="math">\(\Phi(q)\)</span> and <code>qnorm(p)</code> = <span class="math">\(\Phi^{-1}(p)\)</span>.</li>
 <li><code>set.seed()</code> = sets seed for random number generator to ensure that the same data/analysis can be reproduced</li>
@@ -948,7 +948,7 @@ <h2>Reading Tabular Data</h2>
 <h3>Larger Tables</h3>
 <ul>
 <li><em><strong>Note</strong>: help page for read.table important</em></li>
-<li>need to know how much RAM is required –&gt; calculating memory requirements
+<li>need to know how much RAM is required <span class="math">\(\rightarrow\)</span> calculating memory requirements
 <ul>
 <li><code>numRow</code> x <code>numCol</code> x 8 bytes/numeric value = size required in bytes</li>
 <li>double the above results and convert into GB = amount of memory recommended</li>
@@ -1298,7 +1298,7 @@ <h3>Optimization</h3>
 ##          b &lt;- -0.5*sum((data-mu)^2) / (sigma^2)
 ##          -(a + b)
 ##     }
-## &lt;environment: 0x7fef6462d588&gt;</code></pre>
+## &lt;environment: 0x7ff878f72bb8&gt;</code></pre>
 <pre class="sourceCode r"><code class="sourceCode r"><span class="co"># Estimating Parameters</span>
 <span class="kw">optim</span>(<span class="kw">c</span>(<span class="dt">mu =</span> <span class="dv">0</span>, <span class="dt">sigma =</span> <span class="dv">1</span>), nLL)$par</code></pre>
 <pre><code>##       mu    sigma 
@@ -1365,7 +1365,7 @@ <h2>R Profiler</h2>
     }
 })</code></pre>
 <pre><code>##    user  system elapsed 
-##   0.149   0.005   0.211</code></pre>
+##   0.155   0.004   0.191</code></pre>
 <ul>
 <li><code>system.time(expression)</code>
 <ul>