The final project for MIDS w203. In this lab, students will apply what they have learned about building linear models to produce a report that analyzes a specific research question.
The assignment prompt is in ./prompt/assignment.qmd and rendered in ./prompt/assignment.pdf.
We have created a folder structure to help you organize your work. This is based on cookiecutter data science.
├── LICENSE
├── README.md <- The top-level README describing the project aims
├── data
│ ├── external <- Data from third party sources.
│ ├── interim <- Intermediate data that has been transformed.
│ └── processed <- The final, canonical data sets for modeling.
├── prompt <- The assignment prompt.
├── peer_review <- A starting place for each person's peer evaluation.
├── notebooks <- .Rmd notebooks.
├── references <- Data dictionaries, manuals, and all other explanatory materials.
├── reports <- Generated analysis as HTML, PDF, LaTeX, etc.
└── src <- Source code for use in this project.
└── data <- Scripts to download or generate data
-
At the beginning of your assignment, you and your team are going to be working on several things, all at the same time. You're going to be exploring data, asking questions, and learning about what is going on. All of this work is nicely contained in the
./notebooks/section of folder../notebooks, when it is included on themainbranch of the repository should be more than just your first, very messy look at some question -- you should have brought it at least to the point that others can follow your logic and potentially contribute to your code. -
In the
./data/folder and its sub-folders go all the data that you're going to use or derive for use in this project. We have added any.csvfile in this folder structure to the.gitignorefile, which means that they will not be stored in version control. This is a best practice, because version control should really be about storing code and not data. -
We have built this around a "Project" in Rstudio. These projects allow you to isolate all of your code, but also much of your environment to that specific project. To manage the environment, we have used the
renvpackage, and to give you a starting point, we have installed many of the packages that you might be interested in using:tidyverse(which includesdplyrandggplot2),herefor locating your data (more on that in a moment),data.tableif you've got GB of data, but also packages likesandwichandlmtestwhich we have used earlier in the semeseter. In order for you to install these packages on the computer you're using, you will simply need to issuerenv::restore()once, and these will install. -
We recommend that you use the
herepackage link for all of your file paths. Here is why: You're working on a team of 3-4 people, and each of you have different locations where you're working on your computer. This is very frustrating to collaborate, and it leads to just enough friction that you won't end up actually collaborating.herestarts from the top-level of the project, but no deeper, and then creates conforming file paths. If you were looking to reference your raw data, in any file, you would do so by issuinghere("data", "raw", "my_data.csv")where"my_data.csv"is actually the name of the file that you're trying to load. -
We would recommend that you write a small service to yourselves -- perhaps storing this in
./src/that takes on each step of your data gathering and cleaning pipeline. It is really, really nice to be able to interactively work with data while you're modeling. However, as your project develops, we would strongly recommend that you move this work to clean and transform your data to as early as possible, and certainly before you move into yourReport.Rmdfile. Here's why: (a) As a team, you need to know what the canonical form of "this variable" is; you can't have two competing versions; (b) As a contributor, you want to know what is available to you, and what you've got to derive anew -- if there are no new variable derivations in the modeling code, then you know that any new thing you create has to get moved into a new place at some point; (c) as a contributor, the modeling gets wild every. single. time. You want to isolate the hard work that you're doing to read people's modeling code from the sometimes unclear work of the variable transforms.