Introduction to the tidyverse
Contributors: Michael T. Hallworth
What is the tidyverse?
If you have been using R over the last several years, you’ve probably noticed an ever-growing number of R enthusiasts extolling the virtues of the tidyverse, “an opinionated collection of R packages designed for data science.” The tidyverse is built around several core packages, including dplyr
, tidyr
, ggplot2
, and purrr
, plus many others that provide more specialized solutions for common data science problems. All packages in the tidyverse share a common philosophy about data structure, coding style, and workflow, which allows tidyverse functions to work together more-or-less seamlessly. Although not all R users agree with this philosophy or like using the tidyverse, the main advantages of the tidyverse include:
- Consistent data input/output structure: The tidyverse is built around data frames. This means that, for the most part, tidyverse functions take a data frame as input and produce a new data frame as output. Perhaps more importantly, tidyverse functions are built around ‘tidy’ data frames - i.e., data frames where each row is a single observation and each column is a variable. Although not the most compact way to store data, tidy data has many advantages for manipulation, modeling, and visualization.
- Intuitive function names: To the extent possible, tidyverse functions strive to have intuitive names so users can more easily understand what a function does. Examples include
filter
,arrange
,summarise
, andgroup_by
. - Consistent style: Tidyverse functions use a common style guide. This means that naming conventions (e.g., functions names are verbs, all lower case with words separated by
_
), argument order (e.g., data is always the first argument), spacing and indentations, etc. are consistent across all tidyverse packages. As Hadley Wickham put it, this consistency “is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread”. - Piping functions: Rather than nesting functions (e.g.
head(mean())
), tidyverse functions can be sequentially linked using the pipe operator%>%
. This greatly increases readability of code.