Cleaning Data:
library(dplyr)
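Peek into data
glimpse(data)
str(data)
head(data, n)    # first n rows (default 6)
tail(data)       # last 6 rows
args(fun)        # arguments of a function, e.g. args(head); not useful on a data frame
summary(data)    # Gives min, quartiles, median, mean, max and the number of NA's

Change class of a column
class(data$column)                        # check the current class
data$column <- as.numeric(data$column)    # assign the result back, otherwise the column is unchanged
data$column <- as.factor(data$column)
...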
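A minimal sketch of the class change, assuming a small made-up data frame (the column names id and group are hypothetical). The as.*() functions return a new vector, so the result is assigned back to the column:

```r
# Hypothetical data frame for illustration only
data <- data.frame(id = c("1", "2", "3"),
                   group = c("a", "b", "a"),
                   stringsAsFactors = FALSE)

class(data$id)                        # "character"

# Convert and assign back; calling as.numeric() alone would not change data
data$id <- as.numeric(data$id)
data$group <- as.factor(data$group)

class(data$id)                        # "numeric"
levels(data$group)                    # "a" "b"
```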
======================================================================================
Reshape data with tidyr
library(tidyr)

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
The most important function in tidyr is gather(). It should be used when you have columns that are not variables and you want to collapse them into key-value pairs.
Arguments: data, name of the new key column, name of the new value column, columns to gather (prefix a column with - to exclude it)
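A small sketch of gather(), assuming a made-up wide table with one column per year (the names wide, country, year and value are hypothetical):

```r
library(tidyr)

# Hypothetical wide data: the year columns are values, not variables
wide <- data.frame(country = c("A", "B"),
                   `2010` = c(10, 20),
                   `2011` = c(11, 21),
                   check.names = FALSE)

# Collapse every column except country into year/value pairs
long <- gather(wide, year, value, -country)
long
```

The result has one row per country and year, with the old column names stored in year.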
spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
The opposite of gather() is spread(), which takes key-value pairs and spreads them across multiple columns. This is useful when values in a column should actually be column names (i.e. variables). It can also make data more compact and easier to read.
Arguments: data, column whose values become the new column names (key), column whose values fill those columns (value)
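A small sketch of spread(), assuming a made-up long table (the names long, country, year and value are hypothetical):

```r
library(tidyr)

# Hypothetical long data: one row per country and year
long <- data.frame(country = c("A", "B", "A", "B"),
                   year = c("2010", "2010", "2011", "2011"),
                   value = c(10, 20, 11, 21),
                   stringsAsFactors = FALSE)

# Each distinct value of year becomes its own column, filled from value
wide <- spread(long, year, value)
wide
```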
separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
The separate() function allows you to separate one column into multiple columns. Unless you tell it otherwise, it will attempt to separate on any character that is not a letter or number. You can also specify a specific separator using the sep argument.
Arguments: data, column to be separated, names of the new columns, separator ("-", "/", "_", etc.)
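A small sketch of separate(), assuming a made-up table with a date stored as text (the names dates, id and date are hypothetical):

```r
library(tidyr)

# Hypothetical data: year, month and day packed into one column
dates <- data.frame(id = 1:2,
                    date = c("2015-01-31", "2016-12-01"),
                    stringsAsFactors = FALSE)

# Split date on "-" into three new columns; the original column is dropped (remove = TRUE)
parts <- separate(dates, date, c("year", "month", "day"), sep = "-")
parts
```

The new columns are character vectors unless convert = TRUE is set.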
unite(data, col, ..., sep = "_", remove = TRUE)
The opposite of separate() is unite(), which takes multiple columns and pastes them together. By default, the contents of the columns will be separated by underscores in the new column, but this behavior can be altered via the sep argument.
Arguments: data, name of the new united column, columns to be united, separator ("-", "/", "_", etc.)
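A small sketch of unite(), assuming a made-up table with the date already split into parts (the names parts, year, month and day are hypothetical):

```r
library(tidyr)

# Hypothetical data: date parts in separate columns
parts <- data.frame(id = 1:2,
                    year = c("2015", "2016"),
                    month = c("01", "12"),
                    day = c("31", "01"),
                    stringsAsFactors = FALSE)

# Paste year, month and day into one date column, joined by "-"
joined <- unite(parts, date, year, month, day, sep = "-")
joined
```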
===============================================================================
String Manipulation with stringr
library(stringr)
str_trim(" this is string ")    # remove leading and trailing whitespace
[1] "this is string"
# Pad a string with any character on either side
str_pad("256459", width = 10, side = "left", pad = "0")
[1] "0000256459"
toupper("string")
[1] "STRING"
tolower("STRING")
[1] "string"
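The trimming and padding steps are often chained. A short sketch, assuming a made-up vector of messy ID strings (the name ids is hypothetical):

```r
library(stringr)

# Hypothetical messy IDs with stray whitespace and uneven lengths
ids <- c(" 42 ", "256459 ", " 7")

ids <- str_trim(ids)                                        # "42" "256459" "7"
ids <- str_pad(ids, width = 10, side = "left", pad = "0")   # left-pad with zeros to width 10
ids
# [1] "0000000042" "0000256459" "0000000007"

# toupper() and tolower() are base R and work on whole vectors too
upper <- toupper(c("a", "b"))   # "A" "B"
```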
str_detect(data$column, "string")
Detects whether "string" is present in each element of the column and returns a logical vector of TRUE and FALSE values.
str_replace(data$column, "string", "newstring")
Replaces the first instance of "string" in each element of the column with "newstring". Use str_replace_all() to replace every instance.
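A short sketch of detection and replacement, assuming a made-up character vector (the name x is hypothetical). It also shows the difference between str_replace() and str_replace_all():

```r
library(stringr)

x <- c("banana", "apple", "cherry")   # hypothetical input

found <- str_detect(x, "an")          # TRUE FALSE FALSE
first <- str_replace(x, "an", "AN")   # only the first match per element: "bANana"
every <- str_replace_all(x, "an", "AN")   # every match per element: "bANANa"

found
first
every
```

The pattern is treated as a regular expression by default; wrap it in fixed() to match it literally.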
===============================================================================
Missing data
is.na(data)
Returns a logical matrix of TRUE and FALSE values, TRUE where the corresponding cell is NA.
any(is.na(data))
Returns TRUE if there is any NA cell.
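A short sketch of these checks, assuming a made-up data frame with a few NA cells. The counting steps with sum() and colSums() are additions that build on the same is.na() matrix:

```r
# Hypothetical data frame with missing values
data <- data.frame(x = c(1, NA, 3),
                   y = c("a", "b", NA),
                   stringsAsFactors = FALSE)

na_cells <- is.na(data)        # logical matrix, same shape as data
has_na   <- any(na_cells)      # TRUE: at least one cell is NA
n_na     <- sum(na_cells)      # total number of NA cells (TRUE counts as 1)
na_by_col <- colSums(na_cells) # NA count per column

na_cells
has_na
n_na
na_by_col
```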