| These are some code chunks that I frequently come back to when processing data for the Arctic Data Center. | ||
|
|
||
| #Reading in raw data |
GitHub is finicky about spaces after # for the headers, so make sure to include them! RStudio will preview it just fine, but GitHub won't. (`#Reading` should be `# Reading`)
| #Reading in raw data | ||
| ##Single data file | ||
| ```{r eval=FALSE} | ||
| df <- read.table("path/to/data", |
Any reason you use read.table rather than read.csv or read_csv? I'm mostly curious, but it might also be worth adding a note that those other options exist.
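For reference, a minimal sketch of how the three readers compare. The file contents and column names are made up for illustration, and the readr call only runs if that package happens to be installed.

```r
# Write a small, made-up CSV to a temporary file so the example is self-contained.
tmp <- tempfile(fileext = ".csv")
writeLines(c("site,temp", "A,1.5", "B,", "C,3.0"), tmp)

# read.table() needs the header and separator spelled out.
df_table <- read.table(tmp, header = TRUE, sep = ",", na.strings = c("", "NA"))

# read.csv() is read.table() with header = TRUE and sep = "," as defaults.
df_csv <- read.csv(tmp, na.strings = c("", "NA"))

# readr::read_csv() is the tidyverse reader; it returns a tibble.
# Guarded so the example still runs without readr installed.
if (requireNamespace("readr", quietly = TRUE)) {
  df_readr <- readr::read_csv(tmp, na = c("", "NA"))
}
```

Any of these works for a comma-separated file; the main differences are the defaults and the class of the object you get back.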
| ##Single data file | ||
| ```{r eval=FALSE} | ||
| df <- read.table("path/to/data", | ||
| header=T, |
There should be spaces around the = sign. Doesn't affect the code at all, but it makes it more readable (especially once your code gets long/complicated). This is our go-to reference for style: http://style.tidyverse.org/
| ```{r eval=FALSE} | ||
| dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector | ||
| i=0 | ||
| for(i in 1:length(rawPaths)){ |
It's generally better practice to use seq_along(rawPaths) rather than 1:length(rawPaths) (which I also do all the time). If rawPaths is ever empty, 1:length(rawPaths) evaluates to c(1, 0) and the loop body still runs, while seq_along(rawPaths) gives an empty sequence and the loop is skipped. See the discussion here: https://stackoverflow.com/questions/24917228/proper-way-to-loop-over-the-length-of-a-dataframe-in-r
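A quick sketch of the empty-vector edge case, using a hypothetical empty rawPaths and made-up counter names (n_bad, n_good):

```r
rawPaths <- character(0)  # hypothetical: no files were found

# 1:length(rawPaths) is 1:0, which counts down from 1 to 0.
n_bad <- 0
for (i in 1:length(rawPaths)) {
  n_bad <- n_bad + 1
}

# seq_along(rawPaths) is integer(0), so the loop body never runs.
n_good <- 0
for (i in seq_along(rawPaths)) {
  n_good <- n_good + 1
}

n_bad   # 2
n_good  # 0
```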
|
|
||
| Read in data using a for loop. Remember to initialize all variables that you will be using outside of the for loop. | ||
| ```{r eval=FALSE} | ||
| dataList <- vector("list", length(rawPaths)) # makes an empty list with same length as file paths vector |
Great job initializing a list! I always have to stop myself from growing vectors.
| for(i in 1:length(rawPaths)){ | ||
| dataList[[i]] <- read.table(rawPaths[i], | ||
| na.strings = c("", "NA"), | ||
| header=T) |
It looks like the indentation is a little bit off here (though maybe it's GitHub, I'm not sure). A neat trick I learned from Bryce is to highlight your code and then use Cmd + I to fix the indentation!
| header=T) | ||
| } | ||
| ``` | ||
| Note: list() creates an empty list of length 0. However, vector("list", length(rawPaths)) allocates a designated number of slots within the list instead of the list being constantly updated every time the for loop interates. With a small number of iterations, the time it takes for the code to run is not noticeable. However, for a large number of iterations, not allocating space will cause the code to run very slowly. |
Perhaps this reference (or something similar) is worth including in here: https://paulvanderlaken.com/2017/10/13/functional-programming-and-why-you-should-not-grow-vectors-in-r/
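If it helps, here is a rough sketch of the three approaches side by side. The function names (grow, prealloc) and the size n are made up for illustration, and the timings will vary with R version and machine, so I wouldn't read too much into any single run.

```r
n <- 1e4  # arbitrary size, chosen only for illustration

# Grows the list by one element on every iteration.
grow <- function(n) {
  out <- list()
  for (i in seq_len(n)) {
    out[[length(out) + 1]] <- i
  }
  out
}

# Allocates all n slots up front, then fills them in.
prealloc <- function(n) {
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- i
  }
  out
}

# Functional alternative: lapply() handles the allocation for you.
via_lapply <- lapply(seq_len(n), function(i) i)

# All three produce the same list; only the cost of building it differs.
same_result <- identical(grow(n), prealloc(n)) && identical(prealloc(n), via_lapply)

system.time(grow(n))
system.time(prealloc(n))
```

With the file-reading loop, the lapply() version would look something like lapply(rawPaths, read.table, header = TRUE), which avoids the index variable entirely.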
|
|
||
| Iterate through all the rows in a data frame. | ||
| allRows is a vector containing "TRUE" and "FALSE". Each element corresponds to a row in dataFrame. | ||
| is.na(dataFrame[i,]) outputs "TRUE" if the row contains at least one blank cell, and "FALSE" otherwise. |
You can use backticks (`) to mark inline code within sentences in R Markdown (like we do in Slack).
|
|
||
| #Searching Through Strings - Dates | ||
|
|
||
| Use the grepl() function to search for a particular string. Since we often have to reformat dates in our data sets, searching for particular dates or times could be useful. |
Perhaps this would be a good place to introduce some helpful resources. I personally like this cheatsheet: https://www.rstudio.com/wp-content/uploads/2016/09/RegExCheatsheet.pdf
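As a small illustration of why a regex reference is handy here, a sketch comparing a plain substring search with an anchored pattern. The dates vector below is made up, and the names loose and strict are just for this example.

```r
# Made-up values covering a few formats that might show up in raw data.
dates <- c("3/14/16", "1/16/2015", "2016-03-14", "12/01/16")

# A plain "/16" matches anywhere in the string,
# so it also catches the 16th of a month in another year.
loose <- grepl("/16", dates)

# Anchored pattern: month, slash, day, slash, then "16" at the very end.
strict <- grepl("^[0-9]{1,2}/[0-9]{1,2}/16$", dates)

dates[loose]   # "3/14/16" "1/16/2015" "12/01/16"
dates[strict]  # "3/14/16" "12/01/16"
```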
|
|
||
| Run unique() to see what kind of formats there are. | ||
| ```{r} | ||
| unique(dates) |
I discovered the get_dupes function from the janitor package yesterday. It could be interesting to add, or at least link to: https://cran.r-project.org/web/packages/janitor/vignettes/introduction.html
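A minimal sketch of what that could look like, next to a base R equivalent. The data frame and its columns are made up, and the janitor call only runs if the package is installed.

```r
# Made-up data with one repeated site/date combination.
df <- data.frame(
  site = c("A", "B", "A", "C"),
  date = c("2016-03-14", "2016-03-15", "2016-03-14", "2016-03-16")
)

# Base R: flag every row that has an exact duplicate elsewhere in the data frame.
is_dupe <- duplicated(df) | duplicated(df, fromLast = TRUE)
dupes_base <- df[is_dupe, ]

# janitor::get_dupes() returns the duplicated rows with a dupe_count column.
# Guarded so the example still runs without janitor installed.
if (requireNamespace("janitor", quietly = TRUE)) {
  print(janitor::get_dupes(df, site, date))
}
```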
|
|
||
| ```{r} | ||
| indDates1 <- which(grepl("/16",dates)) | ||
| dates[indDates1] <- format(as.POSIXct(dates[indDates1], tz = "", format="%m/%d/%y"), format = "%Y-%m-%d") |
I like to use the lubridate package to work with dates. If you haven't tried it, I'd definitely recommend checking it out! https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html There are also some other date/time packages, but I'm not as familiar with them. tibbletime is another one that seems promising.
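For comparison, a rough sketch of the same reformatting done three ways. The dates vector is made up, the output names are just for this example, and the lubridate call only runs if the package is installed.

```r
# Made-up month/day/two-digit-year strings, like the ones matched by "/16".
dates <- c("3/14/16", "12/01/16")

# Same approach as the chunk under review: parse as POSIXct, then format.
posix_out <- format(as.POSIXct(dates, tz = "", format = "%m/%d/%y"),
                    format = "%Y-%m-%d")

# as.Date() avoids time zones entirely when only the date matters.
date_out <- format(as.Date(dates, format = "%m/%d/%y"), "%Y-%m-%d")

# lubridate::mdy() parses month-day-year strings without a format string.
# Guarded so the example still runs without lubridate installed.
if (requireNamespace("lubridate", quietly = TRUE)) {
  lub_out <- format(lubridate::mdy(dates), "%Y-%m-%d")
}

date_out  # "2016-03-14" "2016-12-01"
```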
This .Rmd file contains some code chunks that I often refer to when cleaning data.