Content from Before we start
Last updated on 2024-03-12
What is R? What is RStudio?
The term “R” is used to refer to both the programming language and the software that interprets the scripts written using it.
RStudio is a popular way to write R scripts and interact with the R software. To function correctly, RStudio needs R and therefore both need to be installed on your computer.
Why learn R?
R does not involve lots of pointing and clicking, and that’s a good thing
In R, the results of your analysis rely on a series of written commands, and not on remembering a succession of pointing and clicking. If you want to redo your analysis because you collected more data, you can rerun the script and R will process the new dataset exactly the same way as before.
Working with scripts makes the steps you used in your analysis clear, and the code you write can be inspected by someone else who can give you feedback and spot mistakes.
R code is great for reproducibility
Reproducibility is when someone else, including your future self, can obtain the same results from the same dataset by running the same analysis.
R can also be integrated with other tools to generate manuscripts from your code. If you collect more data, or fix a mistake in your dataset, the figures and the statistical tests in your manuscript can be updated automatically.
R is widely used in academia and in industries such as pharma and biotech where analyses are sometimes expected to be reproducible, so knowing R will give you an edge.
R is interdisciplinary and extensible
With 10,000+ packages that can be installed to extend its capabilities, R provides a framework that allows you to combine statistical approaches from many scientific disciplines to best suit the analytical framework you need to analyze your data. For instance, R has packages for image analysis, GIS, time series, population genetics, and a lot more.
R works on data of all shapes and sizes
The skills you learn with R scale easily with the size of your dataset. Whether your dataset has hundreds or millions of lines, it won’t make much difference to you. R can connect to spreadsheets, databases, and many other data formats, on your computer or on the web.
R produces high-quality graphics
The plotting functionalities in R are endless, and allow you to adjust any aspect of your graph to visualize your data more effectively.
R has a large and welcoming community
Thousands of people use R daily. Many of them are willing to help you through mailing lists and websites such as Stack Overflow, RStudio community, and Slack channels such as the R for Data Science online community (https://www.rfordatasci.com/). In addition, there are numerous online and in-person meetups organised globally through organisations such as R Ladies Global (https://rladies.org/).
Finding your way around RStudio
Let’s start by learning about RStudio, which is an Integrated Development Environment (IDE) for working with R.
We will use the RStudio IDE to write code, navigate the files on our computer, inspect the variables we are going to create, and visualize the plots we will generate. RStudio can also be used for many other useful things (e.g., version control, developing packages, writing Shiny apps) that we will not cover during the workshop.
RStudio is divided into 4 “panes”:
- The Source pane for your scripts and documents (top-left, in the default layout)
- The Environment/History pane (top-right) which shows all the objects in your working space (Environment), and your command history (History)
- The Files/Plots/Packages/Help/Viewer pane (bottom-right)
- The R Console (bottom-left)
The placement of these panes and their content can be customized (see menu, Tools -> Global Options -> Pane Layout). For ease of use, settings such as background color, font color, font size, and zoom level can also be adjusted in this menu (Global Options -> Appearance).
One of the advantages of using RStudio is that all the information you need to write code is available in a single window. Additionally, with many shortcuts, autocompletion, and highlighting for the major file types you use while developing in R, RStudio will make typing easier and less error-prone.
Project-based organisation
It is good practice to keep a set of related data, analyses, and text self-contained in a single project folder, and when programming this is called the working directory. Any scripts inside the project folder can then use relative paths to point to files and folders within the working directory. Working this way allows you to move your project around on your computer and share it with others while still allowing the scripts to work. This is opposed to using absolute paths, which point to where a file is on a specific computer, but which will break scripts that are being run on another machine.
RStudio provides a helpful set of tools for working with projects through its “Projects” interface. It creates a working directory for you, remembers its location (allowing you to quickly navigate to it), and optionally preserves custom settings and reopens files so that you can resume work after a break. To create an “RStudio Project” for this tutorial:
- Start RStudio.
- Under the File menu, click on New Project. Choose New Directory, then New Project.
- Enter a name for this new folder (or “directory”), and choose a convenient location for it. This will be your working directory for the rest of the day (e.g., ~/Desktop/data-carpentry).
- Click on Create Project.
- Download the code handout, place it in your working directory and rename it (e.g., data-carpentry-script.R).
- (Recommended) Set Preferences to ‘Never’ save workspace in RStudio.
A workspace is your current working environment in R, which includes any user-defined objects. By default, all of these objects will be saved, and automatically loaded, when you reopen your project. This can be cumbersome and have unintended consequences, especially if you are working with larger datasets, and it can lead to hard-to-debug errors caused by objects in memory that you forgot you had. Therefore, it is often a good idea to turn this off. To do so, go to Tools -> Global Options and select the ‘Never’ option for ‘Save workspace to .RData on exit’.
Organizing your working directory
Using a consistent folder structure across your different research projects will help to keep things organized, and will help you to find things in the future. This can be especially helpful when you have multiple projects on the go at once. In general, we recommend creating directories (folders) for scripts, data, figures and documents.
- data_raw/ & data/: Use these folders to store raw data and intermediate datasets you may create for the need of a particular analysis. You should always keep a copy of your raw data accessible and do as much of your data cleanup and preprocessing programmatically (i.e., with scripts, rather than manually) as possible. Separating raw data from processed data is a good idea.
- documents/: This would be a place to keep outlines, drafts, and other text relating to manuscripts or notes.
- fig/: This is where you can save any plots or figures.
- scripts/: This would be the location to keep your R scripts for different analyses or plotting, and potentially a separate folder for your functions (more on that later).
- Additional (sub)directories depending on your project needs.
For this workshop, we will need a data_raw/ folder to store our raw data, and we will use data/ for when we learn how to export data as CSV files, and a fig/ folder for the figures that we will save.
- Under the Files tab on the right of the screen, click on New Folder and create a folder named data_raw within your newly created working directory (e.g., ~/data-carpentry/). (Alternatively, type dir.create("data_raw") at your R console.) Repeat these operations to create a data and a fig folder.
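The console alternative mentioned above can be sketched as follows. This is a minimal example that assumes your working directory is already the project folder; the `showWarnings = FALSE` argument is an optional choice made here so that rerunning the lines does not produce warnings.

```r
# Create the three folders used in this workshop inside the current
# working directory. showWarnings = FALSE keeps R quiet if a folder
# already exists (an optional choice for this sketch).
dir.create("data_raw", showWarnings = FALSE)
dir.create("data", showWarnings = FALSE)
dir.create("fig", showWarnings = FALSE)

# Check that all three folders now exist
dir.exists(c("data_raw", "data", "fig"))
```

Each element of the result corresponds to one folder name, so three `TRUE` values mean that all three folders are in place.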
We are going to keep the script in the root of our working directory because we are only going to use one file. Later, when you start to create more complex projects, it might make sense to organize scripts in sub-directories.
Your working directory should now contain the data/, data_raw/, and fig/ folders, along with your script.
The working directory
The working directory is an important concept to understand. It is the place where R looks for files and saves them. When you write code for your project, it should refer to files in relation to the root of your working directory and only need files within this structure.
RStudio assists you in this regard and automatically sets the working directory to the directory where you have placed your project. If you need to check it, you can use getwd(). If for some reason your working directory is not what it should be, you can change it in the RStudio interface by navigating in the file browser to where your working directory should be, clicking on the blue gear icon “More”, and selecting “Set As Working Directory”. Alternatively, you can use setwd("/path/to/working/directory") to reset your working directory. However, your scripts should not include this line because it will fail on someone else’s computer.
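As a small illustration of relative paths, the sketch below checks the working directory and builds a path relative to it. The file name `example.csv` is hypothetical and used only for illustration; it is not part of the workshop materials.

```r
# Print the current working directory. If you opened the RStudio project,
# this should be the project folder.
getwd()

# Build a path relative to the working directory.
# "example.csv" is a hypothetical file name used only for illustration.
rel_path <- file.path("data_raw", "example.csv")
rel_path

# file.exists() reports whether the relative path points to a real file
file.exists(rel_path)
```

Because `rel_path` contains no machine-specific folders, the same line works on any computer where the project folder has the same internal structure.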
Interacting with R
The basis of programming is that we write down instructions for the computer to follow, and then we tell the computer to follow those instructions. We write, or code, instructions in R because it is a common language that both the computer and we can understand. We call the instructions commands and we tell the computer to follow the instructions by executing (also called running) those commands.
There are two main ways of interacting with R: by using the console or by using script files (plain text files that contain your code). The console pane (in RStudio, the bottom left panel) is the place where commands written in the R language can be typed and executed immediately by the computer. It is also where the results will be shown for commands that have been executed. You can type commands directly into the console and press Enter to execute those commands, but they will be forgotten when you close the session.
Because we want our code and workflow to be reproducible, it is better to type the commands we want in the script editor, and save the script. This way, there is a complete record of what we did, and anyone (including our future selves!) can easily replicate the results on their computer.
RStudio allows you to execute commands directly from the script editor by using the Ctrl + Enter shortcut (on Macs, Cmd + Return will work, too). The command on the current line in the script (indicated by the cursor) or all of the commands in the currently selected text will be sent to the console and executed when you press Ctrl + Enter. You can find other keyboard shortcuts in this RStudio cheatsheet about the RStudio IDE.
At some point in your analysis you may want to check the content of a variable or the structure of an object, without necessarily keeping a record of it in your script. You can type these commands and execute them directly in the console. RStudio provides the Ctrl + 1 and Ctrl + 2 shortcuts, which allow you to jump between the script and the console panes.
If R is ready to accept commands, the R console shows a > prompt. If it receives a command (by typing, copy-pasting or sent from the script editor using Ctrl + Enter), R will try to execute it, and when ready, will show the results and come back with a new > prompt to wait for new commands.
If R is still waiting for you to enter more input because the command isn’t complete yet, the console will show a + prompt. It means that you haven’t finished entering a complete command. This usually happens because you have not ‘closed’ a parenthesis or quotation, i.e. you don’t have the same number of left-parentheses as right-parentheses, or the same number of opening and closing quotation marks. When this happens, and you thought you finished typing your command, click inside the console window and press Esc; this will cancel the incomplete command and return you to the > prompt.
Seeking help
Searching function documentation with ? and ??
If you need help with a specific function, let’s say mean(), you can type ?mean or press F1 while your cursor is on the function name. If you are looking for a function to do a particular task, but don’t know the function name, you can use the double question mark ??, for example ??kruskall. Both commands will open matching help files in RStudio’s help panel in the lower right corner. You can also use the help panel to search help directly, as seen in the screenshot.
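The question-mark shortcuts are shorthand for ordinary functions: `?mean` corresponds to `help("mean")`, and `??pattern` corresponds to `help.search("pattern")`. The sketch below uses the function form, plus `apropos()`, which is not mentioned above but is a base R helper that lists object names containing a given piece of text.

```r
# help("mean") is the function form of ?mean. Storing the result does not
# open the help page; printing it (or typing ?mean) does.
mean_help <- help("mean")
class(mean_help)

# apropos() lists the names of available objects that contain "mean".
# This is handy when you only remember part of a function name.
mean_like <- apropos("mean")
head(mean_like)
```

If `apropos()` returns a promising name, pass it to `help()` or `?` to read the full documentation.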
Automatic code completion
When you write code in RStudio, you can use its automatic code completion to remind yourself of a function’s name or arguments. Start typing the function name and pay attention to the suggestions that pop up. Use the up and down arrow to select a suggested code completion and Tab to apply it. You can also use code completion to complete function argument names, object names, and file names. It even works if you don’t get the spelling 100% correct.
Package vignettes and cheat sheets
In addition to the documentation for individual functions, many packages have vignettes – instructions for how to use the package to do certain tasks. Vignettes are great for learning by example. Vignettes are accessible via the package help and by using the function browseVignettes().
There is also a Help menu at the top of the RStudio window, that has cheat sheets for popular packages, RStudio keyboard shortcuts, and more.
Finding more functions and packages
RStudio’s help only searches the packages that you have installed on your machine, but there are many more available on CRAN and GitHub. To search across all available R packages, you can use the website rdocumentation.org. Often, a generic Google or internet search “R <task>” will send you to the appropriate package documentation or a forum where someone else has already asked your question. Many packages also have websites with additional help, tutorials, news and more (for example tidyverse.org).
Dealing with error messages
Don’t get discouraged if your code doesn’t run immediately! Error messages are common when programming, and fixing errors is part of any programmer’s daily work. Often, the problem is a small typo in a variable name or a missing parenthesis. Watch for the red x’s next to your code in RStudio. These may provide helpful hints about the source of the problem.
If you can’t fix an error yourself, start by googling it. Some error messages are too generic to diagnose a problem (e.g. “subscript out of bounds”). In that case it might help to include the name of the function or package you’re using in your query.
Asking for help
If your Google search is unsuccessful, you may want to ask other R users for help:
- Stack Overflow: Many questions have already been answered, but the challenge is to use the right words in your search to find them. If your question hasn’t been answered before and is well crafted, chances are you will get an answer in less than 5 min. Remember to follow their guidelines on how to ask a good question.
- The R 4 Data Science Slack Community: is a fantastic place to get answers to specific questions that you can’t find answers to with Google or Stack Exchange. Make sure you are at least familiar with the material in R 4 Data Science before asking a question here.
The key to receiving help from someone is for them to rapidly grasp your problem. Thus, you should be as precise as possible when describing your problem and help others to pinpoint where the issue might be. Try to:
- Use the correct words to describe your problem. Otherwise you might get an answer pointing to the misuse of your words rather than answering your question.
- Generalize what you are trying to do, so people outside your field can understand the question.
- Reduce what does not work to a simple reproducible example. For instance, instead of using your real data set, create a small generic one. For more information on how to write a reproducible example see this article from the reprex package. Learning how to use the reprex package is also very helpful for this.
- Include the output of sessionInfo() in your question. It provides information about your platform, the versions of R and the packages that you are using. As an example, here you can see the versions of R and all the packages that we are using to run the code in this lesson:
R
sessionInfo()
OUTPUT
#> R version 4.3.3 (2024-02-29)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RSQLite_2.3.1 lubridate_1.9.2 forcats_1.0.0 stringr_1.5.0
#> [5] dplyr_1.1.2 purrr_1.0.1 readr_2.1.4 tidyr_1.3.0
#> [9] tibble_3.2.1 ggplot2_3.4.2 tidyverse_2.0.0 knitr_1.42
#>
#> loaded via a namespace (and not attached):
#> [1] bit_4.0.5 gtable_0.3.3 compiler_4.3.3 renv_1.0.5
#> [5] highr_0.10 tidyselect_1.2.0 blob_1.2.4 scales_1.2.1
#> [9] fastmap_1.1.1 yaml_2.3.7 R6_2.5.1 generics_0.1.3
#> [13] munsell_0.5.0 DBI_1.1.3 pillar_1.9.0 tzdb_0.3.0
#> [17] rlang_1.1.1 utf8_1.2.3 cachem_1.0.8 stringi_1.7.12
#> [21] xfun_0.39 bit64_4.0.5 memoise_2.0.1 timechange_0.2.0
#> [25] cli_3.6.1 withr_2.5.0 magrittr_2.0.3 grid_4.3.3
#> [29] hms_1.1.3 lifecycle_1.0.3 vctrs_0.6.2 evaluate_0.20
#> [33] glue_1.6.2 fansi_1.0.4 colorspace_2.1-0 tools_4.3.3
#> [37] pkgconfig_2.0.3
- The rOpenSci community call “How to ask questions so they get answered” (rOpenSci site and video recording) includes a presentation of the reprex package and of its philosophy.
- blog.Revolutionanalytics.com and this blog post by Jon Skeet have comprehensive advice on how to ask programming questions.
How to learn more after the workshop?
The material we cover during this workshop will give you a taste of how you can use R to analyze data for your own research. However, to do advanced operations such as cleaning your dataset, using statistical methods, or creating advanced graphics you will need to learn more.
A useful next step is R 4 Data Science, a free online book which starts in a similar way to this workshop, but which contains more intermediate and some advanced concepts too. There are lots of exercises to work through to really cement your understanding, so we recommend progressing through this as far as you can.
The best way to become proficient and efficient at R, as with any other tool, is to use it to address your actual research questions. As a beginner, it can feel daunting to have to write a script from scratch. But given that many people make their code available online, it can be useful to see how others structure their code.
Content from Introduction to R
Last updated on 2024-03-12
Creating objects in R
You can get output from R simply by typing math in the console:
R
3 + 5
12 / 7
However, to do useful and interesting things, we need to assign values to objects. To create an object, we need to give it a name followed by the assignment operator <-, and the value we want to give it:
R
weight_kg <- 55
<- is the assignment operator. It assigns values on the right to objects on the left. So, after executing x <- 3, the value of x is 3. For historical reasons, you can also use = for assignments, but it is good practice to always use <- for assignments.
In RStudio, typing Alt + - (push Alt at the same time as the - key) will write <- in a single keystroke on a PC, while typing Option + - (push Option at the same time as the - key) does the same on a Mac.
Objects can be given almost any name such as x, current_temperature, or subject_id. Here are some further guidelines on naming objects:
- Keep names short and explicit.
- They cannot start with a number (2x is not valid, but x2 is).
- R is case sensitive, so for example, weight_kg is different from Weight_kg.
- There are some names that cannot be used because they are the names of fundamental functions in R (e.g., if, else, for, see here for a complete list). In general, even if it’s allowed, it’s best to not use other function names (e.g., c, T, mean, data, df, weights). If in doubt, check the help to see if the name is already in use.
- It’s best to avoid dots (.) within names. Many function names in R itself have them and dots also have a special meaning (methods) in R and other programming languages. To avoid confusion, don’t include dots in names.
- It is recommended to use nouns for object names and verbs for function names.
- Be consistent in the styling of your code, such as where you put spaces, how you name objects, etc. Styles can include “lower_snake”, “UPPER_SNAKE”, “lowerCamelCase”, “UpperCamelCase”, etc. Using a consistent coding style makes your code clearer to read for your future self and your collaborators. In R, the tidyverse style is quite popular. You can install the lintr package to automatically check for issues in the styling of your code.
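Two of the naming rules above can be checked directly in the console. This is a small sketch: `exists()` and `make.names()` are base R functions that are not otherwise covered in this lesson, and the result for `Weight_kg` assumes that you have not created an object with that name.

```r
# R is case sensitive: creating weight_kg does not create Weight_kg
weight_kg <- 55
exists("weight_kg")   # TRUE
exists("Weight_kg")   # FALSE, assuming no object with this name exists

# make.names() turns text into a syntactically valid name.
# If the result differs from the input, the input was not a valid name.
make.names("2x")      # "X2x", so 2x is not a valid name
make.names("x2")      # "x2", so x2 is a valid name
```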
Objects vs. variables
What are known as objects in R are known as variables in many other programming languages. Depending on the context, object and variable can have drastically different meanings. However, in this lesson, the two words are used synonymously. For more information see: https://cran.r-project.org/doc/manuals/r-release/R-lang.html#Objects
When assigning a value to an object, R does not print anything. You can force R to print the value by using parentheses or by typing the object name:
R
weight_kg <- 55 # doesn't print anything
(weight_kg <- 55) # but putting parentheses around the call prints the value of `weight_kg`
weight_kg # and so does typing the name of the object
Now that R has weight_kg in memory, we can do arithmetic with it. For instance, we may want to convert this weight into pounds (weight in pounds is 2.2 times the weight in kg):
R
2.2 * weight_kg
We can also change an object’s value by assigning it a new one:
R
weight_kg <- 57.5
2.2 * weight_kg
This means that assigning a value to one object does not change the values of other objects. For example, let’s store the animal’s weight in pounds in a new object, weight_lb:
R
weight_lb <- 2.2 * weight_kg
and then change weight_kg to 100.
R
weight_kg <- 100
What do you think is the current content of the object weight_lb? 126.5 or 220?
Saving your code
Up to now, your code has been in the console. This is useful for quick queries but not so helpful if you want to revisit your work for any reason. A script can be opened by pressing Ctrl + Shift + N. It is wise to save your script file immediately. To do this press Ctrl + S. This will open a dialogue box where you can decide where to save your script file, and what to name it. The .R file extension is added automatically and ensures your file will open with RStudio.
Don’t forget to save your work periodically by pressing Ctrl + S.
Comments
The comment character in R is #. Anything to the right of a # in a script will be ignored by R. It is useful to leave notes and explanations in your scripts. For convenience, RStudio provides a keyboard shortcut to comment or uncomment a paragraph: after selecting the lines you want to comment, press at the same time on your keyboard Ctrl + Shift + C. If you only want to comment out one line, you can put the cursor at any location of that line (i.e. no need to select the whole line), then press Ctrl + Shift + C.
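As a short illustration, the sketch below shows the three usual places a comment can appear: on its own line, at the end of a line of code, and in front of code that you want to switch off temporarily. The values reuse the weight example from earlier in this lesson.

```r
# A comment on its own line: convert a weight from kilograms to pounds
weight_kg <- 55
weight_lb <- 2.2 * weight_kg  # a comment at the end of a line of code

# The next line is "commented out", so R never runs it:
# weight_lb <- 0

weight_lb
```

Because the commented-out line is ignored, `weight_lb` still holds the converted weight (121) when it is printed.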
Functions and their arguments
Functions are “canned scripts” that automate more complicated sets of commands including operations, assignments, etc. Many functions are predefined, or can be made available by importing R packages (more on that later). A function usually takes one or more inputs called arguments. Functions often (but not always) return a value. A typical example would be the function sqrt(). The input (the argument) must be a number, and the return value (in fact, the output) is the square root of that number. Executing a function (‘running it’) is called calling the function. An example of a function call is:
R
weight_kg <- sqrt(10)
Here, the value of 10 is given to the sqrt() function, the sqrt() function calculates the square root, and returns the value which is then assigned to the object weight_kg. This function takes one argument; other functions might take several.
The return ‘value’ of a function need not be numerical (like that of sqrt()), and it also does not need to be a single item: it can be a set of things, or even a dataset. We’ll see that when we read data files into R.
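To make the idea of different kinds of return values concrete, here is a small sketch using two base R functions that are not otherwise part of this lesson: `range()`, which returns two numbers, and `toupper()`, which returns text. The object names `root` and `limits` are chosen only for this illustration.

```r
# sqrt() returns a single number
root <- sqrt(10)
root

# range() returns two numbers: the minimum and the maximum of its input
limits <- range(c(12, 7, 3, 25))
limits          # 3 25
length(limits)  # 2

# toupper() returns text rather than a number
toupper("mouse")  # "MOUSE"
```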
Arguments can be anything, not only numbers or filenames, but also other objects. Exactly what each argument means differs per function, and must be looked up in the documentation (see below). Some functions take arguments which may either be specified by the user, or, if left out, take on a default value: these are called options. Options are typically used to alter the way the function operates, such as whether it ignores ‘bad values’, or what symbol to use in a plot. However, if you want something specific, you can specify a value of your choice which will be used instead of the default.
Let’s try a function that can take multiple arguments: round().
R
round(3.14159)
OUTPUT
#> [1] 3
Here, we’ve called round() with just one argument, 3.14159, and it has returned the value 3. That’s because the default is to round to the nearest whole number. If we want more digits we can see how to do that by getting information about the round function. We can use args(round) to find what arguments it takes, or look at the help for this function using ?round.
R
args(round)
OUTPUT
#> function (x, digits = 0)
#> NULL
R
?round
We see that if we want a different number of digits, we can type digits = 2 or however many we want.
R
round(3.14159, digits = 2)
OUTPUT
#> [1] 3.14
If you provide the arguments in the exact same order as they are defined you don’t have to name them:
R
round(3.14159, 2)
OUTPUT
#> [1] 3.14
And if you do name the arguments, you can switch their order:
R
round(digits = 2, x = 3.14159)
OUTPUT
#> [1] 3.14
It’s good practice to put the non-optional arguments (like the number you’re rounding) first in your function call, and to then specify the names of all optional arguments. If you don’t, someone reading your code might have to look up the definition of a function with unfamiliar arguments to understand what you’re doing.
Vectors and data types
A vector is the most common and basic data type in R, and is pretty much the workhorse of R. A vector is composed of a series of values, which can be either numbers or characters. We can assign a series of values to a vector using the c() function. For example, we can create a vector of animal weights and assign it to a new object weight_g:
R
weight_g <- c(50, 60, 65, 82)
weight_g
A vector can also contain characters:
R
animals <- c("mouse", "rat", "dog")
animals
The quotes around “mouse”, “rat”, etc. are essential here. Without the quotes R will assume objects have been created called mouse, rat and dog. As these objects don’t exist in R’s memory, there will be an error message.
There are many functions that allow you to inspect the content of a vector. length() tells you how many elements are in a particular vector:
R
length(weight_g)
length(animals)
An important feature of a vector is that all of the elements are the same type of data. The function class() indicates what kind of object you are working with:
R
class(weight_g)
class(animals)
The function str() provides an overview of the structure of an object and its elements. It is a useful function when working with large and complex objects:
R
str(weight_g)
str(animals)
You can use the c() function to add other elements to your vector:
R
weight_g <- c(weight_g, 90) # add to the end of the vector
weight_g <- c(30, weight_g) # add to the beginning of the vector
weight_g
In the first line, we take the original vector weight_g, add the value 90 to the end of it, and save the result back into weight_g. Then we add the value 30 to the beginning, again saving the result back into weight_g.
We can do this over and over again to grow a vector, or assemble a dataset. As we program, this may be useful to add results that we are collecting or calculating.
An atomic vector is the simplest R data type and is a linear vector of a single type. Above, we saw 2 of the 6 main atomic vector types that R uses: "character" and "numeric" (or "double"). These are the basic building blocks that all R objects are built from. The other 4 atomic vector types are:
- "logical" for TRUE and FALSE (the boolean data type)
- "integer" for integer numbers (e.g., 2L, the L indicates to R that it’s an integer)
- "complex" to represent complex numbers with real and imaginary parts (e.g., 1 + 4i) and that’s all we’re going to say about them
- "raw" for bitstreams that we won’t discuss further
You can check the type of your vector using the typeof() function and inputting your vector as the argument.
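For example, calling typeof() on one small vector of each atomic type gives the six type names listed above. The values are arbitrary and chosen only for illustration; `as.raw()` is used here simply as one way to obtain a raw value.

```r
typeof(c(50, 60, 65))       # "double"
typeof(c("mouse", "rat"))   # "character"
typeof(c(TRUE, FALSE))      # "logical"
typeof(2L)                  # "integer"
typeof(1 + 4i)              # "complex"
typeof(as.raw(255))         # "raw"
```

Note that typeof() reports "double" for ordinary numbers, whereas class() reports "numeric" for the same vector.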
Vectors are one of the many data structures that R uses. Other important ones are lists (list), matrices (matrix), data frames (data.frame), factors (factor) and arrays (array).
R implicitly converts them to all be the same type
Vectors can be of only one data type. R tries to convert (coerce) the content of this vector to find a “common denominator” that doesn’t lose any information.
Only one. There is no memory of past data types, and the coercion happens the first time the vector is evaluated. Therefore, the TRUE in num_logical gets converted into a 1 before it gets converted into "1" in combined_logical.
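The definitions of num_logical and combined_logical are not shown in this section. As a minimal sketch, assume they were created as below (char_logical is likewise an assumed helper object); with these definitions, the two-step conversion described above can be observed directly.

```r
# Assumed definitions, for illustration only
num_logical <- c(1, 2, 3, TRUE)          # TRUE is coerced to the number 1
char_logical <- c("a", "b", "c", TRUE)   # TRUE is coerced to the text "TRUE"

class(num_logical)    # "numeric"
class(char_logical)   # "character"

# Combining the two vectors coerces everything to character.
# The TRUE from num_logical is already the number 1, so it becomes "1".
combined_logical <- c(num_logical, char_logical)
combined_logical      # "1" "2" "3" "1" "a" "b" "c" "TRUE"
```

The same original value, TRUE, ends up as "1" in one place and "TRUE" in another, because each vector was coerced on its own before the two were combined.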
Challenge (continued)
- You’ve probably noticed that objects of different types get converted into a single, shared type within a vector. In R, we call converting objects from one class into another class coercion. These conversions happen according to a hierarchy, whereby some types get preferentially coerced into other types. Can you draw a diagram that represents the hierarchy of how these data types are coerced?
logical → numeric → character ← logical
Subsetting vectors
If we want to extract one or several values from a vector, we must provide one or several indices in square brackets. For instance:
R
animals <- c("mouse", "rat", "dog", "cat")
animals[2]
OUTPUT
#> [1] "rat"
R
animals[c(3, 2)]
OUTPUT
#> [1] "dog" "rat"
We can also repeat the indices to create an object with more elements than the original one:
R
more_animals <- animals[c(1, 2, 3, 2, 1, 4)]
more_animals
OUTPUT
#> [1] "mouse" "rat" "dog" "rat" "mouse" "cat"
R indices start at 1. Programming languages like Fortran, MATLAB, Julia, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.
Conditional subsetting
Another common way of subsetting is by using a logical vector. TRUE will select the element with the same index, while FALSE will not:
R
weight_g <- c(21, 34, 39, 54, 55)
weight_g[c(TRUE, FALSE, FALSE, TRUE, TRUE)]
OUTPUT
#> [1] 21 54 55
Typically, these logical vectors are not typed by hand, but are the output of other functions or logical tests. For instance, if you wanted to select only the values above 50:
R
weight_g > 50 # will return logicals with TRUE for the indices that meet the condition
OUTPUT
#> [1] FALSE FALSE FALSE TRUE TRUE
R
## so we can use this to select only the values above 50
weight_g[weight_g > 50]
OUTPUT
#> [1] 54 55
You can combine multiple tests using & (both conditions are true, AND) or | (at least one of the conditions is true, OR):
R
weight_g[weight_g > 30 & weight_g < 50]
OUTPUT
#> [1] 34 39
R
weight_g[weight_g <= 30 | weight_g == 55]
OUTPUT
#> [1] 21 55
R
weight_g[weight_g >= 30 & weight_g == 21]
OUTPUT
#> numeric(0)
Here, > stands for “greater than”, < for “less than”, <= for “less than or equal to”, and == for “equal to”. The double equal sign == is a test for equality between the left and right hand sides, and should not be confused with the single = sign, which performs variable assignment (similar to <-).
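To make the difference between comparison and assignment concrete, here is a small sketch that reuses weight_g from above. The names is_55 and threshold are introduced only for illustration:

```r
weight_g <- c(21, 34, 39, 54, 55)

# == compares each element and returns a logical vector
is_55 <- weight_g == 55
is_55

# = (like <-) assigns a value and prints nothing
threshold = 50

# Logical vectors can be counted: TRUE is treated as 1, FALSE as 0
sum(weight_g > threshold)
```

Counting the TRUE values with sum() is often a handy check before subsetting.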
A common task is to search for certain strings in a vector. One could
use the “or” operator |
to test for equality to multiple
values, but this can quickly become tedious. The function
%in%
allows you to test if any of the elements of a search
vector are found:
R
animals <- c("mouse", "rat", "dog", "cat", "cat")
# return both rat and cat
animals[animals == "cat" | animals == "rat"]
OUTPUT
#> [1] "rat" "cat" "cat"
R
# return a logical vector that is TRUE for the elements within animals
# that are found in the character vector and FALSE for those that are not
animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")
OUTPUT
#> [1] FALSE TRUE TRUE TRUE TRUE
R
# use the logical vector created by %in% to return elements from animals
# that are found in the character vector
animals[animals %in% c("rat", "cat", "dog", "duck", "goat", "bird", "fish")]
OUTPUT
#> [1] "rat" "dog" "cat" "cat"
When using “>” or “<” on strings, R compares their alphabetical order. For example, "four" > "five" returns TRUE because “four” comes after “five” alphabetically, and therefore is “greater than” it.
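A quick way to try string comparison in the console, assuming a typical locale (the ordering of upper- and lower-case letters or accented characters can vary with locale settings):

```r
# Both words start with "f"; the second letters decide: "o" comes after "i"
"four" > "five"

# sort() relies on the same ordering
sort(c("four", "five", "six"))
```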
Missing data
As R was designed to analyze datasets, it includes the concept of
missing data (which is uncommon in other programming languages). Missing
data are represented in vectors as NA
.
When doing operations on numbers, most functions will return NA if the data you are working with include missing values. This feature makes it harder to overlook the cases where you are dealing with missing data. You can add the argument na.rm = TRUE to calculate the result as if the missing values were removed first (rm stands for ReMoved).
R
heights <- c(2, 4, 4, NA, 6)
mean(heights)
max(heights)
mean(heights, na.rm = TRUE)
max(heights, na.rm = TRUE)
If your data include missing values, you may want to become familiar
with the functions is.na()
, na.omit()
, and
complete.cases()
. See below for examples.
R
## Extract those elements which are not missing values.
heights[!is.na(heights)]
## Returns the object with incomplete cases removed.
## The returned object is an atomic vector of type "numeric" (or "double").
na.omit(heights)
## Extract those elements which are complete cases.
## The returned object is an atomic vector of type "numeric" (or "double").
heights[complete.cases(heights)]
Recall that you can use the typeof()
function to find
the type of your atomic vector.
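The following sketch, using the small heights vector from earlier in this section, shows one way to count missing values and to see how na.omit() differs slightly from bracket subsetting. The object names are chosen only for this illustration:

```r
heights <- c(2, 4, 4, NA, 6)

# is.na() returns a logical vector, so sum() counts the missing values
n_missing <- sum(is.na(heights))
n_missing

# na.omit() drops the NA but records its position in an attribute
heights_clean <- na.omit(heights)
typeof(heights_clean)
attr(heights_clean, "na.action")

# Subsetting with !is.na() gives a plain vector without that attribute
heights_plain <- heights[!is.na(heights)]
attributes(heights_plain)
```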
Challenge
- Using this vector of heights in inches, create a new vector,
heights_no_na
, with the NAs removed.
R
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
- Use the function median() to calculate the median of the heights vector.
- Use R to figure out how many people in the set are taller than 67 inches.
R
heights <- c(63, 69, 60, 65, NA, 68, 61, 70, 61, 59, 64, 69, 63, 63, NA, 72, 65, 64, 70, 63, 65)
# 1.
heights_no_na <- heights[!is.na(heights)]
# or
heights_no_na <- na.omit(heights)
# or
heights_no_na <- heights[complete.cases(heights)]
# 2.
median(heights, na.rm = TRUE)
# 3.
heights_above_67 <- heights_no_na[heights_no_na > 67]
length(heights_above_67)
Now that we have learned how to write scripts, and the basics of R’s data structures, we are ready to start working with the Portal dataset we have been using in the other lessons, and learn about data frames.
Content from Starting with data
Last updated on 2024-03-12
Loading the survey data
For the purposes of this lesson we will be looking at an ecological data set generated from investigating animal species diversity and weights found within plots at a study site. The dataset is stored as a comma-separated values (CSV) file. Each row holds information for a single animal, and the columns represent:
Column | Description |
---|---|
record_id | Unique id for the observation |
month | month of observation |
day | day of observation |
year | year of observation |
plot_id | ID of a particular experimental plot of land |
species_id | 2-letter code |
sex | sex of animal (“M”, “F”) |
hindfoot_length | length of the hindfoot in mm |
weight | weight of the animal in grams |
genus | genus of animal |
species | species of animal |
taxa | e.g. Rodent, Reptile, Bird, Rabbit |
plot_type | type of plot |
When working with your own data, you will have a copy on your machine already. But for the purposes of this workshop we will use R to download a copy of the data before reading it in.
Downloading the data
We created the folder that will store the downloaded data
(data_raw
) in the chapter “Before
we start”. If you skipped that part, it may be a good idea to have a
look now, to make sure your working directory is set up properly.
We are going to use the R function download.file()
to
download the CSV file that contains the survey data from Figshare, and
we will use read_csv()
to load the content of the CSV file
into R.
Inside the download.file
command, the first entry is a
character string with the source URL (“https://ndownloader.figshare.com/files/2292169”).
This source URL downloads a CSV file from figshare. The text after the
comma (“data_raw/portal_data_joined.csv”) is the destination of the file
on your local machine. You’ll need to have a folder on your machine
called “data_raw” where you’ll download the file. So this command
downloads a file from Figshare, names it “portal_data_joined.csv” and
adds it to a preexisting folder named “data_raw”.
R
download.file(url = "https://ndownloader.figshare.com/files/2292169",
destfile = "data_raw/portal_data_joined.csv")
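If you rerun a script often, you may prefer not to download the file every time. The sketch below wraps download.file() in a helper that creates the destination folder if needed and skips the download when the file is already present. The function name download_if_missing is hypothetical and introduced only for this example:

```r
# Hypothetical helper: download a file only if it is not already on disk
download_if_missing <- function(url, destfile) {
  # Make sure the destination folder exists (no warning if it already does)
  dir.create(dirname(destfile), showWarnings = FALSE, recursive = TRUE)

  if (!file.exists(destfile)) {
    download.file(url = url, destfile = destfile)
  }

  # Report (invisibly) whether the file is now available
  invisible(file.exists(destfile))
}
```

Calling it with the same url and destfile values used above would fetch the survey data on the first run and do nothing on later runs.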
Reading the data into R
The file has now been downloaded to the destination you specified,
but R has not yet loaded the data from the file into memory. To do this,
we can use the read_csv()
function from the
tidyverse
package.
Packages in R are sets of additional functions that let you do more
stuff. The functions we’ve been using so far, like round()
,
sqrt()
, or c()
come built into R. Packages
give you access to additional functions beyond base R. A similar
function to read_csv()
from the tidyverse package is
read.csv()
from base R. We don’t have time to cover their
differences but notice that the exact spelling determines which function
is used. Before you use a package for the first time you need to install
it on your machine, and then you should load it in every subsequent R
session when you need it, using the library()
function.
To install the tidyverse
package, we
can type install.packages("tidyverse")
straight into the
console. In fact, it’s better to write this in the console than in our
script for any package, as there’s no need to re-install packages every
time we run the script. Then, to load the package type:
R
## load the tidyverse package
library(tidyverse)
Now we can use the functions from the
tidyverse
package. Let’s use
read_csv()
to read the data into a data frame (we will
learn more about data frames later):
R
surveys <- read_csv("data_raw/portal_data_joined.csv")
OUTPUT
#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
When you execute read_csv
on a data file, it looks
through the first 1000 rows of each column and guesses its data type.
For example, in this dataset, read_csv()
reads
weight
as col_double
(a numeric data type),
and species
as col_character
. You have the
option to specify the data type for a column manually by using the
col_types
argument in read_csv
.
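As a sketch of what specifying column types might look like, the example below writes a tiny stand-in CSV to a temporary file (it is not the Portal data, and the column values are made up) and declares each column's type with cols():

```r
library(readr)

# Write a small illustrative CSV to a temporary file
toy_file <- tempfile(fileext = ".csv")
writeLines(
  c("species,weight,plot_id",
    "albigula,218,2",
    "merriami,40,3"),
  toy_file
)

# Declare the column types instead of letting read_csv() guess them
toy <- read_csv(
  toy_file,
  col_types = cols(
    species = col_character(),
    weight  = col_double(),
    plot_id = col_integer()
  )
)

# spec() reports the column specification that was used
spec(toy)
```

Declaring types up front also silences the column specification message, because nothing has to be guessed.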
Note
read_csv() assumes that fields are delimited by commas. However, in several countries, the comma is used as a decimal separator and the semicolon (;) is used as a field delimiter. If you want to read this type of file into R, you can use the read_csv2() function. It behaves like read_csv() but uses different parameters for the decimal and the field separators. There is also read_tsv() for tab-separated data files and read_delim() for less common formats. Check out the help for read_csv() by typing ?read_csv to learn more.
In addition to the above versions of the csv format, you should develop the habit of looking at and recording some parameters of your csv files. For instance, the character encoding, control characters used for line ending, date format (if the date is not split into three variables), and the presence of unexpected newlines are important characteristics of your data files. Knowing these parameters will ease the import of your data into R.
We can see the contents of the first few lines of the data by typing its name: surveys. By default, this will show you as many rows and columns of the data as fit on your screen. If you wanted the first 50 rows, you could type print(surveys, n = 50).
We can also extract the first few lines of this data using the
function head()
:
R
head(surveys)
OUTPUT
#> # A tibble: 6 × 13
#> record_id month day year plot_id species_id sex hindfoot_length weight
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 7 16 1977 2 NL M 32 NA
#> 2 72 8 19 1977 2 NL M 31 NA
#> 3 224 9 13 1977 2 NL <NA> NA NA
#> 4 266 10 16 1977 2 NL <NA> NA NA
#> 5 349 11 12 1977 2 NL <NA> NA NA
#> 6 363 11 12 1977 2 NL <NA> NA NA
#> # ℹ 4 more variables: genus <chr>, species <chr>, taxa <chr>, plot_type <chr>
Unlike the print()
function, head()
returns
the extracted data. You could use it to assign the first 100 rows of
surveys
to an object using
surveys_sample <- head(surveys, 100)
. This can be useful
if you want to try out complex computations on a subset of your data
before you apply them to the whole data set. There is a similar function
that lets you extract the last few lines of the data set. It is called
(you might have guessed it) tail()
.
To open the dataset in RStudio’s Data Viewer, use the
view()
function:
R
view(surveys)
What are data frames?
When we loaded the data into R, it got stored as an object of class
tibble
, which is a special kind of data frame (the
difference is not important for our purposes, but you can learn more
about tibbles here). Data
frames are the de facto data structure for most tabular data,
and what we use for statistics and plotting. Data frames can be created
by hand, but most commonly they are generated by functions like
read_csv()
; in other words, when importing spreadsheets
from your hard drive or the web.
A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length. Because columns are vectors, each column must contain a single type of data (e.g., characters, integers, factors). For example, here is a figure depicting a data frame comprising a numeric, a character, and a logical vector.
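A small hand-made data frame illustrates the same idea. The column names and values below are invented for this sketch. Each column is a vector of a single type, and all columns have the same length:

```r
# Three vectors of equal length, one type each
example_df <- data.frame(
  weight   = c(21.5, 34, 39.2),      # numeric
  species  = c("NL", "DM", "PF"),    # character
  is_adult = c(TRUE, FALSE, TRUE),   # logical
  stringsAsFactors = FALSE           # keep text as character in older R versions
)

# Check the class of every column at once
sapply(example_df, class)

# Every column has the same number of elements
sapply(example_df, length)
```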
We can see this also when inspecting the structure of a data
frame with the function str()
:
R
str(surveys)
Inspecting data frames
We already saw how the functions head()
and
str()
can be useful to check the content and the structure
of a data frame. Here is a non-exhaustive list of functions to get a
sense of the content/structure of the data. Let’s try them out!
- Size:
  - dim(surveys) - returns a vector with the number of rows in the first element, and the number of columns as the second element (the dimensions of the object)
  - nrow(surveys) - returns the number of rows
  - ncol(surveys) - returns the number of columns
- Content:
  - head(surveys) - shows the first 6 rows
  - tail(surveys) - shows the last 6 rows
- Names:
  - names(surveys) - returns the column names (synonym of colnames() for data.frame objects)
  - rownames(surveys) - returns the row names
- Summary:
  - str(surveys) - structure of the object and information about the class, length and content of each column
  - summary(surveys) - summary statistics for each column
Note: most of these functions are “generic”: they can be used on other types of objects besides data.frame.
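If you do not have the survey data loaded, you can still try these functions on a small stand-in. The data frame toy_df below is invented for this sketch. The last two calls show the “generic” behavior on plain vectors:

```r
toy_df <- data.frame(
  plot_id    = c(2, 3, 2),
  species_id = c("NL", "DM", "NL"),
  stringsAsFactors = FALSE
)

dim(toy_df)     # number of rows, then number of columns
nrow(toy_df)
ncol(toy_df)
names(toy_df)

# The same functions also accept other objects, such as vectors
summary(toy_df$plot_id)
str(toy_df$species_id)
```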
R
str(surveys)
OUTPUT
#> spc_tbl_ [34,786 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
#> $ record_id : num [1:34786] 1 72 224 266 349 363 435 506 588 661 ...
#> $ month : num [1:34786] 7 8 9 10 11 11 12 1 2 3 ...
#> $ day : num [1:34786] 16 19 13 16 12 12 10 8 18 11 ...
#> $ year : num [1:34786] 1977 1977 1977 1977 1977 ...
#> $ plot_id : num [1:34786] 2 2 2 2 2 2 2 2 2 2 ...
#> $ species_id : chr [1:34786] "NL" "NL" "NL" "NL" ...
#> $ sex : chr [1:34786] "M" "M" NA NA ...
#> $ hindfoot_length: num [1:34786] 32 31 NA NA NA NA NA NA NA NA ...
#> $ weight : num [1:34786] NA NA NA NA NA NA NA NA 218 NA ...
#> $ genus : chr [1:34786] "Neotoma" "Neotoma" "Neotoma" "Neotoma" ...
#> $ species : chr [1:34786] "albigula" "albigula" "albigula" "albigula" ...
#> $ taxa : chr [1:34786] "Rodent" "Rodent" "Rodent" "Rodent" ...
#> $ plot_type : chr [1:34786] "Control" "Control" "Control" "Control" ...
#> - attr(*, "spec")=
#> .. cols(
#> .. record_id = col_double(),
#> .. month = col_double(),
#> .. day = col_double(),
#> .. year = col_double(),
#> .. plot_id = col_double(),
#> .. species_id = col_character(),
#> .. sex = col_character(),
#> .. hindfoot_length = col_double(),
#> .. weight = col_double(),
#> .. genus = col_character(),
#> .. species = col_character(),
#> .. taxa = col_character(),
#> .. plot_type = col_character()
#> .. )
#> - attr(*, "problems")=<externalptr>
R
## * class: data frame
## * how many rows: 34786, how many columns: 13
Indexing and subsetting data frames
Our survey data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers. However, note that different ways of specifying these coordinates lead to results with different classes.
R
# We can extract specific values by specifying row and column indices
# in the format:
# data_frame[row_index, column_index]
# For instance, to extract the first row and column from surveys:
surveys[1, 1]
# First row, sixth column:
surveys[1, 6]
# We can also use shortcuts to select a number of rows or columns at once
# To select all columns, leave the column index blank
# For instance, to select all columns for the first row:
surveys[1, ]
# The same shortcut works for rows --
# To select the first column across all rows:
surveys[, 1]
# An even shorter way to select first column across all rows:
surveys[1] # No comma!
# To select multiple rows or columns, use vectors!
# To select the first three rows of the 5th and 6th column
surveys[c(1, 2, 3), c(5, 6)]
# We can use the : operator to create those vectors for us:
surveys[1:3, 5:6]
# This is equivalent to head_surveys <- head(surveys)
head_surveys <- surveys[1:6, ]
# As we've seen, when working with tibbles
# subsetting with single square brackets ("[]") always returns a data frame.
# If you want a vector, use double square brackets ("[[]]")
# For instance, to get the first column as a vector:
surveys[[1]]
# To get the first value in our data frame:
surveys[[1, 1]]
The : operator is a special function that creates numeric vectors of integers in increasing or decreasing order; try 1:10 and 10:1, for instance.
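The class of the result depends on how you subset, and also on whether the object is a tibble or a base R data frame. The sketch below uses a small base R data frame (toy_df is an invented name) to show where the two differ: with a base data frame, [, 1] drops to a vector unless you ask it not to.

```r
toy_df <- data.frame(
  id   = 1:3,
  code = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

class(toy_df[1])                   # single brackets, no comma: a data frame
class(toy_df[[1]])                 # double brackets: a vector
class(toy_df[, 1])                 # base data frames simplify this to a vector
class(toy_df[, 1, drop = FALSE])   # drop = FALSE keeps the data frame
```

With a tibble such as surveys, the third form returns a tibble instead, which is why the comments in the code above say single brackets always return a data frame.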
You can also exclude certain indices of a data frame using the
“-
” sign:
R
surveys[, -1] # The whole data frame, except the first column
surveys[-(7:nrow(surveys)), ] # Equivalent to head(surveys)
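The parentheses in -(7:nrow(surveys)) matter. A short sketch on an invented vector shows why: without them, the minus sign applies only to the first number, and the resulting mix of negative and positive indices is rejected.

```r
small <- c("a", "b", "c", "d", "e")

small[-1]        # everything except the first element
small[-(3:5)]    # everything except elements 3 to 5

# Without parentheses, -3:5 means (-3):5, i.e. -3, -2, ..., 4, 5
mixed_index <- -3:5
mixed_index

# Mixing negative and positive indices is an error, captured here with try()
result <- try(small[mixed_index], silent = TRUE)
inherits(result, "try-error")
```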
Data frames can be subset by calling indices (as shown previously), but also by calling their column names directly:
R
# As before, using single brackets returns a data frame:
surveys["species_id"]
surveys[, "species_id"]
# Double brackets returns a vector:
surveys[["species_id"]]
# We can also use the $ operator with column names instead of double brackets
# This returns a vector:
surveys$species_id
In RStudio, you can use the autocompletion feature to get the full and correct names of the columns.
Challenge
1. Create a data.frame (surveys_200) containing only the data in row 200 of the surveys dataset.
2. Notice how nrow() gave you the number of rows in a data.frame?
   - Use that number to pull out just that last row from the surveys dataset.
   - Compare that with what you see as the last row using tail() to make sure it’s meeting expectations.
   - Pull out that last row using nrow() instead of the row number.
   - Create a new data frame (surveys_last) from that last row.
3. Use nrow() to extract the row that is in the middle of the data frame. Store the content of this row in an object named surveys_middle.
4. Combine nrow() with the - notation above to reproduce the behavior of head(surveys), keeping just the first through 6th rows of the surveys dataset.
R
## 1.
surveys_200 <- surveys[200, ]
## 2.
# Saving `n_rows` to improve readability and reduce duplication
n_rows <- nrow(surveys)
surveys_last <- surveys[n_rows, ]
## 3.
surveys_middle <- surveys[n_rows / 2, ]
## 4.
surveys_head <- surveys[-(7:n_rows), ]
Factors
When we did str(surveys) we saw that several of the columns consist of numbers (shown as num). The columns genus, species, sex, plot_type, … however, are of the class character. Arguably, these columns contain categorical data, that is, they can only take on a limited number of values.
R has a special class for working with categorical data, called
factor
. Factors are very useful and actually contribute to
making R particularly well suited to working with data. So we are going
to spend a little time introducing them.
Once created, factors can only contain a pre-defined set of values, known as levels. Factors are stored as integers associated with labels and they can be ordered or unordered. While factors look (and often behave) like character vectors, they are actually treated as integer vectors by R. So you need to be very careful when treating them as strings.
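A short sketch with an invented factor makes the integer storage visible, and shows what happens when you try to add a value that is not one of the levels:

```r
size <- factor(c("low", "high", "medium", "low"))

levels(size)       # the pre-defined set of values, sorted alphabetically
typeof(size)       # the underlying storage type
as.integer(size)   # the integer codes that point to the levels

# Assigning a value that is not a level produces NA (and a warning)
size_copy <- size
suppressWarnings(size_copy[1] <- "huge")
size_copy
```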
When importing a data frame with read_csv()
, the columns
that contain text are not automatically coerced (=converted) into the
factor
data type, but once we have loaded the data we can
do the conversion using the factor()
function:
R
surveys$sex <- factor(surveys$sex)
We can see that the conversion has worked by using the
summary()
function again. This produces a table with the
counts for each factor level:
R
summary(surveys$sex)
By default, R always sorts levels in alphabetical order. For instance, if you have a factor with 2 levels:
R
sex <- factor(c("male", "female", "female", "male"))
R will assign 1
to the level "female"
and
2
to the level "male"
(because f
comes before m
, even though the first element in this
vector is "male"
). You can see this by using the function
levels()
and you can find the number of levels using
nlevels()
:
R
levels(sex)
nlevels(sex)
Sometimes the order of the levels does not matter; other times you might want to specify the order because it is meaningful (e.g., “low”, “medium”, “high”), it improves your visualization, or it is required by a particular type of analysis. Here, one way to reorder our levels in the sex vector would be:
R
sex # current order
OUTPUT
#> [1] male female female male
#> Levels: female male
R
sex <- factor(sex, levels = c("male", "female"))
sex # after re-ordering
OUTPUT
#> [1] male female female male
#> Levels: male female
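When the order of the levels is meaningful, as in the “low”, “medium”, “high” example, you can additionally mark the factor as ordered. The sketch below uses an invented dose vector and assumes that ordered comparisons are what you want:

```r
dose <- factor(
  c("low", "high", "medium", "low"),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)

dose

# Ordered factors support comparisons that follow the level order
dose < "high"

# Tables (and plots) follow the level order, not the alphabet
table(dose)
```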
In R’s memory, these factors are represented by integers (1, 2, 3),
but are more informative than integers because factors are self
describing: "female"
, "male"
is more
descriptive than 1
, 2
. Which one is “male”?
You wouldn’t be able to tell just from the integer data. Factors, on the
other hand, have this information built in. It is particularly helpful
when there are many levels (like the species names in our example
dataset).
R
surveys$taxa <- factor(surveys$taxa)
surveys$genus <- factor(surveys$genus)
summary(surveys)
nlevels(surveys$genus)
## * how many genera: There are 26 unique genera in the `genus` column.
## * how many rabbits: There are 75 rabbits in the `taxa` column.
Converting factors
If you need to convert a factor to a character vector, you use
as.character(x)
.
R
as.character(sex)
In some cases, you may have to convert factors where the levels
appear as numbers (such as concentration levels or years) to a numeric
vector. For instance, in one part of your analysis the years might need
to be encoded as factors (e.g., comparing average weights across years)
but in another part of your analysis they may need to be stored as
numeric values (e.g., doing math operations on the years). This
conversion from factor to numeric is a little trickier. The
as.numeric()
function returns the index values of the
factor, not its levels, so it will result in an entirely new (and
unwanted in this case) set of numbers. One method to avoid this is to
convert factors to characters, and then to numbers.
Another method is to use the levels()
function.
Compare:
R
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))
as.numeric(year_fct) # Wrong! And there is no warning...
as.numeric(as.character(year_fct)) # Works...
as.numeric(levels(year_fct))[year_fct] # The recommended way.
Notice that in the levels()
approach, three important
steps occur:
- We obtain all the factor levels using
levels(year_fct)
- We convert these levels to numeric values using
as.numeric(levels(year_fct))
- We then access these numeric values using the underlying integers of
the vector
year_fct
inside the square brackets
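The three steps can be traced one at a time. The sketch below stores each intermediate result under an illustrative name so you can inspect it:

```r
year_fct <- factor(c(1990, 1983, 1977, 1998, 1990))

# Step 1: the levels are character strings, sorted
year_levels <- levels(year_fct)
year_levels

# Step 2: convert the levels (not the factor) to numbers
year_levels_num <- as.numeric(year_levels)
year_levels_num

# The integer codes stored in the factor: positions within the levels
year_codes <- as.integer(year_fct)
year_codes

# Step 3: use the factor as an index to look up each original year
years <- year_levels_num[year_fct]
years
```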
Renaming factors
When your data is stored as a factor, you can use the
plot()
function to get a quick glance at the number of
observations represented by each factor level. Let’s look at the number
of males and females captured over the course of the experiment:
R
## bar plot of the number of females and males captured during the experiment:
plot(surveys$sex)
However, as we saw when we used summary(surveys$sex)
,
there are about 1700 individuals for which the sex information hasn’t
been recorded. To show them in the plot, we can turn the missing values
into a factor level with the addNA()
function. We will also
have to give the new factor level a label. We are going to work with a
copy of the sex
column, so we’re not modifying the working
copy of the data frame:
R
sex <- surveys$sex
levels(sex)
OUTPUT
#> [1] "F" "M"
R
sex <- addNA(sex)
levels(sex)
OUTPUT
#> [1] "F" "M" NA
R
head(sex)
OUTPUT
#> [1] M M <NA> <NA> <NA> <NA>
#> Levels: F M <NA>
R
levels(sex)[3] <- "undetermined"
levels(sex)
OUTPUT
#> [1] "F" "M" "undetermined"
R
head(sex)
OUTPUT
#> [1] M M undetermined undetermined undetermined
#> [6] undetermined
#> Levels: F M undetermined
Now we can plot the data again, using plot(sex)
.
R
levels(sex)[1:2] <- c("female", "male")
sex <- factor(sex, levels = c("undetermined", "female", "male"))
plot(sex)
Challenge
- We have seen how data frames are created when using read_csv(), but they can also be created by hand with the data.frame() function. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
R
animal_data <- data.frame(
animal = c(dog, cat, sea cucumber, sea urchin),
feel = c("furry", "squishy", "spiny"),
weight = c(45, 8 1.1, 0.8)
)
- Can you predict the class for each of the columns in the following
example? Check your guesses using
str(country_climate)
:
- Are they what you expected? Why? Why not?
- What would you need to change to ensure that each column had the accurate data type?
R
country_climate <- data.frame(
country = c("Canada", "Panama", "South Africa", "Australia"),
climate = c("cold", "hot", "temperate", "hot/temperate"),
temperature = c(10, 30, 18, "15"),
northern_hemisphere = c(TRUE, TRUE, FALSE, "FALSE"),
has_kangaroo = c(FALSE, FALSE, FALSE, 1)
)
The automatic conversion of data type is sometimes a blessing, sometimes an annoyance. Be aware that it exists, learn the rules, and double check that data you import in R are of the correct type within your data frame. If not, use it to your advantage to detect mistakes that might have been introduced during data entry (for instance, a letter in a column that should only contain numbers).
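One possible way to use coercion to find data entry mistakes is sketched below. The vector temperature_raw is invented and contains a deliberate typo (“1S” instead of “15”). Values that cannot be converted to numbers become NA, which points to the offending entries:

```r
# A column that should be numeric but was read as text because of one typo
temperature_raw <- c("10", "30", "18", "1S")

# as.numeric() turns unparseable entries into NA (warning suppressed here)
temperature <- suppressWarnings(as.numeric(temperature_raw))
temperature

# Show the original entries that failed to convert
temperature_raw[is.na(temperature)]
```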
Learn more in this RStudio tutorial
Formatting dates
A common issue that new (and experienced!) R users have is converting
date and time information into a variable that is suitable for analyses.
One way to store date information is to store each component of the date
in a separate column. Using str(), we can confirm that our data frame does indeed have a separate column for day, month, and year, and that each of these columns contains numeric values.
R
str(surveys)
We are going to use the ymd()
function from the package
lubridate
(which belongs to the
tidyverse
; learn more here).
lubridate gets installed as part of the tidyverse installation. When you load the tidyverse (library(tidyverse)), the core packages (the packages used in most data analyses) get loaded. In tidyverse releases before 2.0.0, lubridate is not part of the core tidyverse, so you have to load it explicitly with library(lubridate). From tidyverse 2.0.0 onward it is loaded together with the core packages, and calling library(lubridate) again does no harm.
Start by loading the required package:
R
library(lubridate)
The lubridate
package has many useful
functions for working with dates. These can help you extract dates from
different string representations, convert between timezones, calculate
time differences and more. You can find an overview of them in the lubridate
cheat sheet.
Here we will use the function ymd()
, which takes a
vector representing year, month, and day, and converts it to a
Date
vector. Date
is a class of data
recognized by R as being a date and can be manipulated as such. The
argument that the function requires is flexible, but, as a best
practice, is a character vector formatted as “YYYY-MM-DD”.
Let’s create a date object and inspect the structure:
R
my_date <- ymd("2015-01-01")
str(my_date)
Now let’s paste the year, month, and day separately - we get the same result:
R
# sep indicates the character to use to separate each component
my_date <- ymd(paste("2015", "1", "1", sep = "-"))
str(my_date)
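To see what “manipulated as such” means for the Date class, here is a small sketch. It uses base R’s as.Date() rather than ymd(), on the assumption that for “YYYY-MM-DD” strings both produce an object of class Date. The dates themselves are arbitrary:

```r
start_date <- as.Date("2015-01-01")
end_date   <- as.Date("2015-03-01")

class(start_date)

# Dates support arithmetic: differences are measured in days
end_date - start_date

# Adding a number moves the date forward by that many days
start_date + 30

# Individual components can be extracted as text
format(start_date, "%Y")
```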
Now we apply this function to the surveys dataset. Create a character
vector from the year
, month
, and
day
columns of surveys
using
paste()
:
R
paste(surveys$year, surveys$month, surveys$day, sep = "-")
This character vector can be used as the argument for
ymd()
:
R
ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
WARNING
#> Warning: 129 failed to parse.
There is a warning telling us that some dates could not be parsed
(understood) by the ymd()
function. For these dates, the
function has returned NA
, which means they are treated as
missing values. We will deal with this problem later, but first we add
the resulting Date
vector to the surveys
data
frame as a new column called date
:
R
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
WARNING
#> Warning: 129 failed to parse.
R
str(surveys) # notice the new column, with 'Date' as the class
Let’s make sure everything worked correctly. One way to inspect the
new column is to use summary()
:
R
summary(surveys$date)
OUTPUT
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> "1977-07-16" "1984-03-12" "1990-07-22" "1990-12-15" "1997-07-29" "2002-12-31"
#> NA's
#> "129"
Let’s investigate why some dates could not be parsed.
We can use the functions we saw previously to deal with missing data
to identify the rows in our data frame that are failing. If we combine
them with what we learned about subsetting data frames earlier, we can
extract the columns “year”, “month”, and “day” from the records that have
NA
in our new column date
. We will also use
head()
so we don’t clutter the output:
R
missing_dates <- surveys[is.na(surveys$date), c("year", "month", "day")]
head(missing_dates)
OUTPUT
#> # A tibble: 6 × 3
#> year month day
#> <dbl> <dbl> <dbl>
#> 1 2000 9 31
#> 2 2000 4 31
#> 3 2000 4 31
#> 4 2000 4 31
#> 5 2000 4 31
#> 6 2000 9 31
Why did these dates fail to parse? If you had to use these data for your analyses, how would you deal with this situation?
These dates failed to parse because the dates provided as input for the ymd() function do not actually exist. As the output above shows, September and April only have 30 days, not 31 days as specified in our dataset.
There are several ways you could deal with this situation:
- If you have access to the raw data (e.g., field sheets) or supporting information (e.g., field trip reports/logs), check them and ensure the electronic database matches the information in the original data source.
- If you are able to contact the person responsible for collecting the data, you could refer to them and ask for clarification.
- You could also check the rest of the dataset for clues about the correct value for the erroneous dates.
- If your project has guidelines on how to correct this sort of error, refer to them and apply any recommendations.
- If it is not possible to ascertain the correct value for these observations, you may want to leave them as missing data.
Regardless of the option you choose, it is important that you document the error and the corrections (if any) that you apply to your data.
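As one way to document such problems, you could flag the affected records rather than silently changing them. The sketch below works on an invented three-row data frame and uses base R’s as.Date() with an explicit format as a stand-in for ymd(). The column name date_is_valid is an assumption made for this example:

```r
toy_dates <- data.frame(
  year  = c(2000, 2000, 2000),
  month = c(9, 4, 5),
  day   = c(31, 31, 31)
)

# Build the date strings and try to parse them; impossible dates become NA
date_strings <- paste(toy_dates$year, toy_dates$month, toy_dates$day, sep = "-")
parsed <- as.Date(date_strings, format = "%Y-%m-%d")

# Keep a record of which rows had a valid date
toy_dates$date_is_valid <- !is.na(parsed)
toy_dates

# Summarize the problem rows by month
table(toy_dates$month[!toy_dates$date_is_valid])
```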
Content from Manipulating, analyzing and exporting data with tidyverse
Last updated on 2024-03-12
Data manipulation using dplyr and tidyr
Bracket subsetting is handy, but it can be cumbersome and difficult
to read, especially for complicated operations. Enter
dplyr
. dplyr
is a package for helping with tabular data manipulation. It pairs nicely
with tidyr
which enables you to swiftly
convert between different data formats for plotting and analysis.
The tidyverse
package is an
“umbrella-package” that installs tidyr
,
dplyr
, and several other useful packages
for data analysis, such as ggplot2
,
tibble
, etc.
The tidyverse
package tries to address
3 common issues that arise when doing data analysis in R:
- The results from a base R function sometimes depend on the type of data.
- R expressions are used in a non-standard way, which can be confusing for new learners.
- The existence of hidden arguments having default operations that new learners are not aware of.
You should already have installed and loaded the
tidyverse
package. If you haven’t already
done so, you can type install.packages("tidyverse")
straight into the console. Then, type library(tidyverse)
to
load the package.
What are dplyr and tidyr?
The package dplyr
provides helper tools
for the most common data manipulation tasks. It is built to work
directly with data frames, with many common tasks optimized by being
written in a compiled language (C++). An additional feature is the
ability to work directly with data stored in an external database. The
benefits of doing this are that the data can be managed natively in a
relational database, queries can be conducted on that database, and only
the results of the query are returned.
This addresses a common problem with R in that all operations are conducted in-memory and thus the amount of data you can work with is limited by available memory. The database connections essentially remove that limitation in that you can connect to a database of many hundreds of GB, conduct queries on it directly, and pull back into R only what you need for analysis.
The package tidyr
addresses the common
problem of wanting to reshape your data for plotting and usage by
different R functions. For example, sometimes we want data sets where we
have one row per measurement. Other times we want a data frame where
each measurement type has its own column, and rows are instead more
aggregated groups (e.g., a time period, an experimental unit like a plot
or a batch number). Moving back and forth between these formats is
non-trivial, and tidyr
gives you tools for
this and more sophisticated data manipulation.
To learn more about dplyr
and
tidyr
after the workshop, you may want to
check out this handy
data transformation with dplyr
cheatsheet and this one
about tidyr
.
As before, we’ll read in our data using the read_csv()
function from the tidyverse package
readr
.
R
surveys <- read_csv("data_raw/portal_data_joined.csv")
OUTPUT
#> Rows: 34786 Columns: 13
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (6): species_id, sex, genus, species, taxa, plot_type
#> dbl (7): record_id, month, day, year, plot_id, hindfoot_length, weight
#>
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
R
## inspect the data
str(surveys)
R
## preview the data
view(surveys)
Next, we’re going to learn some of the most common
dplyr
functions:
- select(): subset columns
- filter(): subset rows on conditions
- mutate(): create new columns by using information from other columns
- group_by() and summarize(): create summary statistics on grouped data
- arrange(): sort results
- count(): count discrete values
Selecting columns and filtering rows
To select columns of a data frame, use select()
. The
first argument to this function is the data frame
(surveys
), and the subsequent arguments are the columns to
keep.
R
select(surveys, plot_id, species_id, weight)
To select all columns except certain ones, put a “-” in front of the variable to exclude it.
R
select(surveys, -record_id, -species_id)
This will select all the variables in surveys
except
record_id
and species_id
.
To choose rows based on a specific criterion, use
filter()
:
R
filter(surveys, year == 1995)
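To see what these calls return without loading the full dataset, here is a minimal sketch on a small made-up table. The toy tibble and its values are invented for illustration; only the column names are borrowed from surveys.

```r
library(dplyr)

# Hypothetical three-row stand-in for surveys (values are made up)
toy <- tibble(
  record_id  = 1:3,
  year       = c(1994, 1995, 1995),
  species_id = c("DM", "PF", "DM"),
  weight     = c(40, 7, NA)
)

# Keep two columns by name
kept <- select(toy, species_id, weight)

# Drop one column with a minus sign; all other columns are kept
dropped <- select(toy, -record_id)

# Keep only the rows recorded in 1995
rows_1995 <- filter(toy, year == 1995)

names(kept)
names(dropped)
nrow(rows_1995)
```

Note that neither function modifies toy itself: each call returns a new data frame.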
Pipes
What if you want to select and filter at the same time? There are three ways to do this: use intermediate steps, nested functions, or pipes.
With intermediate steps, you create a temporary data frame and use that as input to the next function, like this:
R
surveys2 <- filter(surveys, weight < 5)
surveys_sml <- select(surveys2, species_id, sex, weight)
This is readable, but can clutter up your workspace with lots of objects that you have to name individually. With multiple steps, that can be hard to keep track of.
You can also nest functions (i.e. one function inside of another), like this:
R
surveys_sml <- select(filter(surveys, weight < 5), species_id, sex, weight)
This is handy, but can be difficult to read if too many functions are nested, as R evaluates the expression from the inside out (in this case, filtering, then selecting).
The last option, pipes, is a recent addition to R. Pipes let you take the output of one function and send it directly to the next, which is useful when you need to do many things to the same dataset. Pipes in R look like %>% and are made available via the magrittr package, installed automatically with dplyr. If you use RStudio, you can type the pipe with Ctrl + Shift + M if you have a PC or Cmd + Shift + M if you have a Mac.
R
surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
In the above code, we use the pipe to send the surveys
dataset first through filter()
to keep rows where
weight
is less than 5, then through select()
to keep only the species_id
, sex
, and
weight
columns. Since %>%
takes the object
on its left and passes it as the first argument to the function on its
right, we don’t need to explicitly include the data frame as an argument
to the filter()
and select()
functions any
more.
Some may find it helpful to read the pipe like the word “then.” For instance, in the example above, we took the data frame surveys, then we filtered for rows with weight < 5, then we selected columns species_id, sex, and weight. The dplyr functions by themselves are somewhat simple, but by combining them into linear workflows with the pipe we can accomplish more complex manipulations of data frames.
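As a quick check that the pipe only changes how the code is written, not what it computes, the sketch below runs the nested and the piped form on a made-up table (toy and its values are assumptions for illustration) and compares the results.

```r
library(dplyr)

# Hypothetical stand-in for surveys (values are made up)
toy <- tibble(
  species_id = c("DM", "PF", "PF"),
  sex        = c("F", "M", "F"),
  weight     = c(40, 4, 3)
)

# Nested form: read from the inside out
nested <- select(filter(toy, weight < 5), species_id, sex, weight)

# Piped form: read from top to bottom
piped <- toy %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

# Compare the two results
all.equal(nested, piped)
```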
If we want to create a new object with this smaller version of the data, we can assign it a new name:
R
surveys_sml <- surveys %>%
filter(weight < 5) %>%
select(species_id, sex, weight)
surveys_sml
Note that the final data frame is the leftmost part of this expression.
R
surveys %>%
filter(year < 1995) %>%
select(year, sex, weight)
Mutate
Frequently you’ll want to create new columns based on the values in
existing columns, for example to do unit conversions, or to find the
ratio of values in two columns. For this we’ll use
mutate()
.
To create a new column of weight in kg:
R
surveys %>%
mutate(weight_kg = weight / 1000)
You can also create a second new column based on the first new column
within the same call of mutate()
:
R
surveys %>%
mutate(weight_kg = weight / 1000,
weight_lb = weight_kg * 2.2)
If this runs off your screen and you just want to see the first few
rows, you can use a pipe to view the head()
of the data.
(Pipes work with non-dplyr
functions, too,
as long as the dplyr
or
magrittr
package is loaded).
R
surveys %>%
mutate(weight_kg = weight / 1000) %>%
head()
The first few rows of the output are full of NA
s, so if
we wanted to remove those we could insert a filter()
in the
chain:
R
surveys %>%
filter(!is.na(weight)) %>%
mutate(weight_kg = weight / 1000) %>%
head()
is.na()
is a function that determines whether something
is an NA
. The !
symbol negates the result, so
we’re asking for every row where weight is not an
NA
.
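The same test can be tried on a plain vector, which makes it easier to see what filter() receives. The values below are made up for illustration.

```r
# Made-up weights with two missing values
weight <- c(40, NA, 7, NA)

# TRUE where the value is missing
missing <- is.na(weight)

# TRUE where the value is present
present <- !is.na(weight)

# The same logical vector can subset a plain vector directly
weight[present]
```

filter() keeps the rows for which the condition is TRUE, so filter(!is.na(weight)) keeps exactly the rows that present marks.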
Challenge
Create a new data frame from the surveys
data that meets
the following criteria: contains only the species_id
column
and a new column called hindfoot_cm
containing the
hindfoot_length
values (currently in mm) converted to
centimeters. In this hindfoot_cm
column, there are no
NA
s and all values are less than 3.
Hint: think about how the commands should be ordered to produce this data frame!
R
surveys_hindfoot_cm <- surveys %>%
filter(!is.na(hindfoot_length)) %>%
mutate(hindfoot_cm = hindfoot_length / 10) %>%
filter(hindfoot_cm < 3) %>%
select(species_id, hindfoot_cm)
Split-apply-combine data analysis and the summarize()
function
Many data analysis tasks can be approached using the
split-apply-combine paradigm: split the data into groups, apply
some analysis to each group, and then combine the results. Key functions
of dplyr
for this workflow are
group_by()
and summarize()
.
The group_by()
and summarize()
functions
group_by()
is often used together with
summarize()
, which collapses each group into a single-row
summary of that group. group_by()
takes as arguments the
column names that contain the categorical variables for
which you want to calculate the summary statistics. So to compute the
mean weight
by sex:
R
surveys %>%
group_by(sex) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE))
You may also have noticed that the output from these calls doesn’t
run off the screen anymore. It’s one of the advantages of
tbl_df
over data frame.
You can also group by multiple columns:
R
surveys %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight, na.rm = TRUE)) %>%
tail()
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
Here, we used tail()
to look at the last six rows of our
summary. Before, we had used head()
to look at the first
six rows. We can see that the sex
column contains
NA
values because some animals had escaped before their sex
and body weights could be determined. The resulting
mean_weight
column does not contain NA
but
NaN
(which refers to “Not a Number”) because
mean()
was called on a vector of NA
values
while at the same time setting na.rm = TRUE
. To avoid this,
we can remove the missing values for weight before we attempt to
calculate the summary statistics on weight. Because the missing values
are removed first, we can omit na.rm = TRUE
when computing
the mean:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
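The NaN discussed above can be reproduced in isolation with base R. This is a small sketch with made-up values, independent of the surveys data.

```r
# A group whose weights are all missing (made-up values)
all_missing <- c(NA_real_, NA_real_)

# na.rm = TRUE drops every value, so the mean is taken over an empty vector
mean_removed <- mean(all_missing, na.rm = TRUE)

# Without na.rm, the missing values propagate instead
mean_kept <- mean(all_missing)

is.nan(mean_removed)
is.na(mean_kept)
```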
Here, again, the output from these calls doesn’t run off the screen
anymore. If you want to display more data, you can use the
print()
function at the end of your chain with the argument
n
specifying the number of rows to display:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight)) %>%
print(n = 15)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
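The message shown with these outputs refers to the .groups argument of summarize(). By default, summarizing data grouped by several columns removes only the last level of grouping, so the result is still grouped by the remaining columns. The sketch below, on a made-up table (toy is an assumption for illustration), compares the default with .groups = "drop".

```r
library(dplyr)

# Hypothetical stand-in for surveys (values are made up)
toy <- tibble(
  sex        = c("F", "F", "M", "M"),
  species_id = c("DM", "PF", "DM", "DM"),
  weight     = c(40, 7, 45, 50)
)

# Default behavior: the result stays grouped by the first variable
kept_groups <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight))

# .groups = "drop" returns an ungrouped result and silences the message
no_groups <- toy %>%
  group_by(sex, species_id) %>%
  summarize(mean_weight = mean(weight), .groups = "drop")

group_vars(kept_groups)
group_vars(no_groups)
```

Dropping the grouping explicitly can avoid surprises when later steps, such as mutate() or filter(), would otherwise operate within each remaining group.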
Once the data are grouped, you can also summarize multiple variables at the same time (and not necessarily on the same variable). For instance, we could add a column indicating the minimum weight for each species for each sex:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
It is sometimes useful to rearrange the result of a query to inspect
the values. For instance, we can sort on min_weight
to put
the lighter species first:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(min_weight)
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
To sort in descending order, we need to add the desc()
function. If we want to sort the results by decreasing order of mean
weight:
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(sex, species_id) %>%
summarize(mean_weight = mean(weight),
min_weight = min(weight)) %>%
arrange(desc(mean_weight))
OUTPUT
#> `summarise()` has grouped output by 'sex'. You can override using the `.groups`
#> argument.
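The effect of desc() is easiest to see on a table small enough to read at a glance. The species codes and weights below are made up for illustration.

```r
library(dplyr)

# Hypothetical summary table (values are made up)
toy <- tibble(
  species_id  = c("DM", "PF", "NL"),
  mean_weight = c(43, 8, 159)
)

# Ascending order is the default
ascending <- toy %>%
  arrange(mean_weight)

# desc() reverses the order for that column
descending <- toy %>%
  arrange(desc(mean_weight))

ascending$species_id
descending$species_id
```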
Counting
When working with data, we often want to know the number of
observations found for each factor or combination of factors. For this
task, dplyr
provides count()
.
For example, if we wanted to count the number of rows of data for each
sex, we would do:
R
surveys %>%
count(sex)
The count() function is shorthand for something we’ve already seen: grouping by a variable, and summarizing it by counting the number of observations in that group. In other words, surveys %>% count(sex) is equivalent to the following, except that count() names the new column n rather than count:
R
surveys %>%
group_by(sex) %>%
summarize(count = n())
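The equivalence can be checked on a small made-up table. In this sketch the summary column is named n, the name count() uses, so that the two results can be compared directly; toy is an assumption for illustration.

```r
library(dplyr)

# Hypothetical stand-in for surveys (values are made up)
toy <- tibble(sex = c("F", "M", "F", NA, "M", "F"))

# Shorthand
via_count <- toy %>%
  count(sex)

# Long form, with the new column named n to match count()
via_summarize <- toy %>%
  group_by(sex) %>%
  summarize(n = n())

all.equal(via_count, via_summarize)
```

Missing values of sex form their own group in both versions.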
For convenience, count()
provides the sort
argument:
R
surveys %>%
count(sex, sort = TRUE)
The previous example shows the use of count() to count the number of rows/observations for one factor (i.e., sex). If we wanted to count combinations of factors, such as sex and species, we would specify the first and the second factor as the arguments of count():
R
surveys %>%
count(sex, species)
With the above code, we can proceed with arrange()
to
sort the table according to a number of criteria so that we have a
better comparison. For instance, we might want to arrange the table
above in (i) an alphabetical order of the levels of the species and (ii)
in descending order of the count:
R
surveys %>%
count(sex, species) %>%
arrange(species, desc(n))
From the table above, we may learn that, for instance, there are 75 observations of the albigula species for which the sex is not specified (i.e. NA).
R
surveys %>%
count(plot_type)
R
surveys %>%
filter(!is.na(hindfoot_length)) %>%
group_by(species_id) %>%
summarize(
mean_hindfoot_length = mean(hindfoot_length),
min_hindfoot_length = min(hindfoot_length),
max_hindfoot_length = max(hindfoot_length),
n = n()
)
R
surveys %>%
filter(!is.na(weight)) %>%
group_by(year) %>%
filter(weight == max(weight)) %>%
select(year, genus, species, weight) %>%
arrange(year)
Reshaping with pivot_longer and pivot_wider
In the spreadsheet lesson, we discussed how to structure our data leading to the four rules defining a tidy dataset:
- Each variable has its own column
- Each observation has its own row
- Each value must have its own cell
- Each type of observational unit forms a table
Here we examine the fourth rule: Each type of observational unit forms a table.
In surveys
, the rows of surveys
contain the
values of variables associated with each record (the unit), values such
as the weight or sex of each animal associated with each record. What if
instead of comparing records, we wanted to compare the different mean
weight of each genus between plots? (Ignoring plot_type
for
simplicity).
We’d need to create a new table where each row (the unit) is
comprised of values of variables associated with each plot. In practical
terms this means the values in genus
would become the names
of column variables and the cells would contain the values of the mean
weight observed on each plot.
Having created a new table, it is therefore straightforward to explore the relationship between the weight of different genera within, and between, the plots. The key point here is that we are still following a tidy data structure, but we have reshaped the data according to the observations of interest: average genus weight per plot instead of recordings per date.
The opposite transformation would be to transform column names into values of a variable.
We can do both of these transformations with two tidyr functions, pivot_wider() and pivot_longer().
These may sound like dramatically different data layouts, but there are some tools that make transitions between these layouts more straightforward than you might think! The gif below shows how these two formats relate to each other, and gives you an idea of how we can use R to shift from one format to the other.
Pivoting from long to wide format
pivot_wider()
takes three principal arguments:
- the data
- the names_from column variable whose values will become new column names.
- the values_from column variable whose values will fill the new column variables.
Further arguments include values_fill
which, if set,
fills in missing values with the value provided.
Let’s use pivot_wider()
to transform surveys to find the
mean weight of each genus in each plot over the entire survey period. We
use filter()
, group_by()
and
summarize()
to filter our observations and variables of
interest, and create a new variable for the
mean_weight
.
R
surveys_gw <- surveys %>%
filter(!is.na(weight)) %>%
group_by(plot_id, genus) %>%
summarize(mean_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.
R
str(surveys_gw)
This yields surveys_gw
where the observations for each
plot are distributed across multiple rows, 196 observations of 3
variables. Using pivot_wider()
with the names from
genus
and with values from mean_weight
this
becomes 24 observations of 11 variables, one row for each plot.
R
surveys_wide <- surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight)
str(surveys_wide)
We could now plot comparisons between the weight of genera (one is called a genus, multiple are called genera) in different plots, although we may wish to fill in the missing values first.
R
surveys_gw %>%
pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0) %>%
head()
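The roles of names_from, values_from and values_fill are easier to follow on a table with only three rows. The sketch below uses made-up plots and genera (long and its values are assumptions for illustration).

```r
library(dplyr)
library(tidyr)

# Made-up long table: plot 2 has no record for genus "B"
long <- tibble(
  plot_id     = c(1, 1, 2),
  genus       = c("A", "B", "A"),
  mean_weight = c(10, 20, 30)
)

# Values of genus become column names; the missing combination becomes NA
wide <- long %>%
  pivot_wider(names_from = genus, values_from = mean_weight)

# values_fill replaces those missing combinations with the value provided
wide_filled <- long %>%
  pivot_wider(names_from = genus, values_from = mean_weight, values_fill = 0)

wide
wide_filled
```

Whether 0 is a sensible fill value depends on the variable: for a mean weight, a missing combination is arguably better left as NA than reported as zero.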
Pivoting from wide to long format
The opposing situation could occur if we had been provided with data
in the form of surveys_wide
, where the genus names are
column names, but we wish to treat them as values of a genus variable
instead.
In this situation we are reshaping the column names and turning them into a pair of new variables. One variable represents the column names as values, and the other variable contains the values previously associated with the column names.
pivot_longer()
takes four principal arguments:
- the data
- the names_to column variable we wish to create from column names.
- the values_to column variable we wish to create and fill with values.
- cols are the name of the columns we use to make this pivot (or to drop).
To recreate surveys_gw
from surveys_wide
we
would create a names variable called genus
and value
variable called mean_weight
.
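In pivoting longer, we also need to specify which columns to reshape. If the columns are directly adjacent, we don’t even need to list them all out: we can give a range with the : operator. Here we take a shorter route and use a minus sign to exclude plot_id, the one column that should not be reshaped.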
R
surveys_long <- surveys_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id)
str(surveys_long)
Note that now the NA genera are included in the long format data frame. Pivoting wider and then longer can be a useful way to balance out a dataset so that every replicate has the same composition.
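The balancing effect can be seen on a small made-up table in which one plot lacks one genus. This is a sketch with invented values, not the surveys data.

```r
library(dplyr)
library(tidyr)

# Made-up long table: plot 2 has no record for genus "B"
long <- tibble(
  plot_id     = c(1, 1, 2),
  genus       = c("A", "B", "A"),
  mean_weight = c(10, 20, 30)
)

# Widen, then lengthen again, reshaping every column except plot_id
balanced <- long %>%
  pivot_wider(names_from = genus, values_from = mean_weight) %>%
  pivot_longer(cols = -plot_id, names_to = "genus", values_to = "mean_weight")

nrow(long)
nrow(balanced)
balanced
```

After the round trip, every plot has one row per genus, with NA marking the combinations that were absent from the original table.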
The minus sign in the cols argument is a specification of which columns to exclude. In this example, all columns except plot_id are used for the names variable, so plot_id is omitted from being reshaped:
R
surveys_wide %>%
pivot_longer(names_to = "genus", values_to = "mean_weight", cols = -plot_id) %>%
head()
Challenge
- Reshape the surveys data frame with year as columns, plot_id as rows, and the number of genera per plot as the values. You will need to summarize before reshaping, and use the function n_distinct() to get the number of unique genera within a particular chunk of data. It’s a powerful function! See ?n_distinct for more.
R
surveys_wide_genera <- surveys %>%
group_by(plot_id, year) %>%
summarize(n_genera = n_distinct(genus)) %>%
pivot_wider(names_from = year, values_from = n_genera)
OUTPUT
#> `summarise()` has grouped output by 'plot_id'. You can override using the
#> `.groups` argument.
R
head(surveys_wide_genera)
R
surveys_wide_genera %>%
pivot_longer(names_to = "year", values_to = "n_genera", cols = -plot_id)
Challenge (continued)
- The surveys data set has two measurement columns: hindfoot_length and weight. This makes it difficult to do things like look at the relationship between mean values of each measurement per year in different plot types. Let’s walk through a common solution for this type of problem. First, use pivot_longer() to create a dataset where we have a names column called measurement and a value column that takes on the value of either hindfoot_length or weight. Hint: You’ll need to specify which columns will be part of the reshape.
R
surveys_long <- surveys %>%
pivot_longer(names_to = "measurement", values_to = "value", cols = c(hindfoot_length, weight))
- With this new data set, calculate the average of each measurement in each year for each different plot_type. Then pivot_wider() them into a data set with a column for hindfoot_length and weight. Hint: You only need to specify the names and values columns for pivot_wider().
R
surveys_long %>%
group_by(year, measurement, plot_type) %>%
summarize(mean_value = mean(value, na.rm=TRUE)) %>%
pivot_wider(names_from = measurement, values_from = mean_value)
OUTPUT
#> `summarise()` has grouped output by 'year', 'measurement'. You can override
#> using the `.groups` argument.
Exporting data
Now that you have learned how to use
dplyr
to extract information from or
summarize your raw data, you may want to export these new data sets to
share them with your collaborators or for archival.
Similar to the read_csv()
function used for reading CSV
files into R, there is a write_csv()
function that
generates CSV files from data frames.
Before using write_csv()
, we are going to create a new
folder, data
, in our working directory that will store this
generated dataset. We don’t want to write generated datasets in the same
directory as our raw data. It’s good practice to keep them separate. The
data_raw
folder should only contain the raw, unaltered
data, and should be left alone to make sure we don’t delete or modify
it. In contrast, our script will generate the contents of the
data
directory, so even if the files it contains are
deleted, we can always re-generate them.
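One way to create the folder from R itself, rather than through a file browser, is base R’s dir.create(). The sketch below is an illustration under assumptions: it works in a temporary directory so that it does not touch your project, and toy and toy.csv are made-up names. In the lesson, the folder would simply be "data" in the working directory.

```r
library(readr)

# Temporary location used only for this sketch
out_dir <- file.path(tempdir(), "data")

# Create the folder only if it does not exist yet
if (!dir.exists(out_dir)) {
  dir.create(out_dir)
}

# Made-up data frame to write out
toy <- data.frame(species_id = c("DM", "PF"), weight = c(40, 7))
out_file <- file.path(out_dir, "toy.csv")
write_csv(toy, file = out_file)

# Reading the file back confirms that it was written
toy_back <- read_csv(out_file, show_col_types = FALSE)
```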
In preparation for our next lesson on plotting, we are going to prepare a cleaned up version of the data set that doesn’t include any missing data.
Let’s start by removing observations of animals for which
weight
and hindfoot_length
are missing, or the
sex
has not been determined:
R
surveys_complete <- surveys %>%
filter(!is.na(weight), # remove missing weight
!is.na(hindfoot_length), # remove missing hindfoot_length
!is.na(sex)) # remove missing sex
Because we are interested in plotting how species abundances have changed through time, we are also going to remove observations for rare species (i.e., that have been observed fewer than 50 times). We will do this in two steps: first we are going to create a data set that counts how often each species has been observed, and filter out the rare species; then, we will extract only the observations for these more common species:
R
## Extract the most common species_id
species_counts <- surveys_complete %>%
count(species_id) %>%
filter(n >= 50)
## Only keep the most common species
surveys_complete <- surveys_complete %>%
filter(species_id %in% species_counts$species_id)
To make sure that everyone has the same data set, check that
surveys_complete
has 30463 rows and 13 columns by typing
dim(surveys_complete)
.
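The two-step pattern used above (count, then keep the rows whose species is in the counted table) can be traced on a table with six rows. The sketch uses made-up species codes and a threshold of 2 instead of 50 so that the effect is visible.

```r
library(dplyr)

# Hypothetical observations: "OT" is seen only once
toy <- tibble(species_id = c("DM", "DM", "DM", "PF", "PF", "OT"))

# Step 1: count each species and keep the common ones
common <- toy %>%
  count(species_id) %>%
  filter(n >= 2)   # threshold lowered from 50 for this small example

# Step 2: keep only the observations of those species
toy_common <- toy %>%
  filter(species_id %in% common$species_id)

dim(toy_common)
```

The %in% operator returns TRUE for each element on its left that appears anywhere in the vector on its right, which is what lets the second filter() refer to the table built in the first step.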
Now that our data set is ready, we can save it as a CSV file in our
data
folder.
R
write_csv(surveys_complete, file = "data/surveys_complete.csv")
Content from Data visualization with ggplot2
Last updated on 2024-03-12
We start by loading the required packages.
ggplot2
is included in the
tidyverse
package.
R
library(tidyverse)
If not still in the workspace, load the data we saved in the previous lesson.
R
surveys_complete <- read_csv("data/surveys_complete.csv")
Plotting with ggplot2
ggplot2
is a plotting package that
provides helpful commands to create complex plots from data in a data
frame. It provides a more programmatic interface for specifying what
variables to plot, how they are displayed, and general visual
properties. Therefore, we only need minimal changes if the underlying
data change or if we decide to change from a bar plot to a scatterplot.
This helps in creating publication quality plots with minimal amounts of
adjustments and tweaking.
ggplot2 refers to the name of the package itself. When using the package we use the function ggplot() to generate the plots, and so references to using the function will be referred to as ggplot() and the package as a whole as ggplot2.
ggplot2 plots work best with data in the ‘long’ format, i.e., a column for every variable, and a row for every observation. Well-structured data will save you lots of time when making figures with ggplot2.
ggplot graphics are built layer by layer by adding new elements. Adding layers in this fashion allows for extensive flexibility and customization of plots.
To build a ggplot, we will use the following basic template that can be used for different types of plots:
ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
- use the ggplot() function and bind the plot to a specific data frame using the data argument
R
ggplot(data = surveys_complete)
- define an aesthetic mapping (using the aes() function), by selecting the variables to be plotted and specifying how to present them in the graph, e.g., as x/y positions or characteristics such as size, shape, color, etc.
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length))
- add ‘geoms’ – graphical representations of the data in the plot (points, lines, bars). ggplot2 offers many different geoms; we will use some common ones today, including:
  - geom_point() for scatter plots, dot plots, etc.
  - geom_boxplot() for, well, boxplots!
  - geom_line() for trend lines, time series, etc.
To add a geom to the plot use the + operator. Because we have two continuous variables, let’s use geom_point() first:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
The +
in the ggplot2
package is particularly useful because it allows you to modify existing
ggplot
objects. This means you can easily set up plot
“templates” and conveniently explore different types of plots, so the
above plot can also be generated with code like this:
R
# Assign plot to a variable
surveys_plot <- ggplot(data = surveys_complete,
mapping = aes(x = weight, y = hindfoot_length))
# Draw the plot
surveys_plot +
geom_point()
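The template idea works because + returns a new plot object and leaves the original untouched. The sketch below illustrates this on a made-up data frame (toy and its values are assumptions); inspecting the layers element of the plot object is used here only as a way to show the difference.

```r
library(ggplot2)

# Made-up measurements standing in for surveys_complete
toy <- data.frame(
  weight          = c(10, 40, 120),
  hindfoot_length = c(15, 35, 30)
)

# Template: data and axes, but no geom yet
base_plot <- ggplot(data = toy,
                    mapping = aes(x = weight, y = hindfoot_length))

# Two different plots built from the same template
scatter   <- base_plot + geom_point()
with_line <- base_plot + geom_line()

# The template still has no layers; each derived plot has one
length(base_plot$layers)
length(scatter$layers)
length(with_line$layers)
```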
Notes
- Anything you put in the ggplot() function can be seen by any geom layers that you add (i.e., these are universal plot settings). This includes the x- and y-axis you set up in aes().
- You can also specify aesthetics for a given geom independently of the aesthetics defined globally in the ggplot() function.
- The + sign used to add layers must be placed at the end of each line containing a layer. If, instead, the + sign is placed at the start of the line containing the new layer, ggplot2 will not add the new layer and will return an error message.
- You may notice that we sometimes reference ‘ggplot2’ and sometimes ‘ggplot’. To clarify, ‘ggplot2’ is the name of the most recent version of the package. However, any time we call the function itself, it’s just called ‘ggplot’.
- The previous version of the ggplot2 package, called ggplot, which also contained the ggplot() function, is now unsupported and has been removed from CRAN in order to reduce accidental installations and further confusion.
R
# This is the correct syntax for adding layers
surveys_plot +
geom_point()
# This will not add the new layer and will return an error message
surveys_plot
+ geom_point()
Challenge (optional)
Scatter plots can be useful exploratory tools for small datasets. For
data sets with large numbers of observations, such as the
surveys_complete
data set, overplotting of points can be a
limitation of scatter plots. One strategy for handling such settings is
to use hexagonal binning of observations. The plot space is tessellated
into hexagons. Each hexagon is assigned a color based on the number of
observations that fall within its boundaries. To use hexagonal binning
with ggplot2
, first install the R package
hexbin
from CRAN:
R
install.packages("hexbin")
library(hexbin)
Then use the geom_hex()
function:
R
surveys_plot +
geom_hex()
- What are the relative strengths and weaknesses of a hexagonal bin plot compared to a scatter plot? Examine the above scatter plot and compare it with the hexagonal bin plot that you created.
Building your plots iteratively
Building plots with ggplot2
is
typically an iterative process. We start by defining the dataset we’ll
use, lay out the axes, and choose a geom:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point()
Then, we start modifying this plot to extract more information from
it. For instance, we can add transparency (alpha
) to avoid
overplotting:
R
ggplot(data = surveys_complete, aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1)
We can also add colors for all the points:
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, color = "blue")
Or to color each species in the plot differently, you could use a
vector as an input to the argument color.
ggplot2
will provide a different color
corresponding to different values in the vector. Here is an example
where we color with species_id
:
R
ggplot(data = surveys_complete, mapping = aes(x = weight, y = hindfoot_length)) +
geom_point(alpha = 0.1, aes(color = species_id))
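The difference between the last two plots comes down to where color is written: outside aes() it is a constant applied to every point, inside aes() it is a mapping from a column to colors. The sketch below puts the two side by side on a made-up data frame (toy and its values are assumptions for illustration).

```r
library(ggplot2)

# Made-up measurements standing in for surveys_complete
toy <- data.frame(
  weight          = c(10, 40, 120, 45),
  hindfoot_length = c(15, 35, 30, 36),
  species_id      = c("PF", "DM", "NL", "DM")
)

# Constant: outside aes(), every point is drawn in the same color
fixed <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(color = "blue")

# Mapping: inside aes(), the color depends on species_id and a legend is added
mapped <- ggplot(data = toy, mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = species_id))

fixed
mapped
```

A common slip is to write aes(color = "blue"): this maps the constant string "blue" as if it were a data value, so the points take a default palette color and a legend labelled "blue" appears.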
R
ggplot(data = surveys_complete,
mapping = aes(x = species_id, y = weight)) +
geom_point(aes(color = plot_type))
Boxplot
We can use boxplots to visualize the distribution of weight within each species:
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot()
By adding points to the boxplot, we can have a better idea of the
number of measurements and of their distribution. Because the boxplot
will show the outliers by default these points will be plotted twice –
by geom_boxplot
and geom_jitter
. To avoid this
we must specify that no outliers should be added to the boxplot by
specifying outlier.shape = NA
.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_boxplot(outlier.shape = NA) +
geom_jitter(alpha = 0.3, color = "tomato")
Notice how the boxplot layer is behind the jitter layer? What do you need to change in the code to put the boxplot in front of the points such that it’s not hidden?
Challenges
Boxplots are useful summaries, but hide the shape of the distribution. For example, if there is a bimodal distribution, it would not be observed with a boxplot. An alternative to the boxplot is the violin plot (sometimes known as a beanplot), where the shape (of the density of points) is drawn.
- Replace the box plot with a violin plot; see
geom_violin()
.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_violin()
Challenges (continued)
In many types of data, it is important to consider the scale of the observations. For example, it may be worth changing the scale of the axis to better distribute the observations in the space of the plot. Changing the scale of the axes is done similarly to adding/modifying other components (i.e., by incrementally adding commands). Try making these modifications:
- Represent weight on the log10 scale; see
scale_y_log10()
.
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = weight)) +
scale_y_log10() +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)
R
ggplot(data = surveys_complete, mapping = aes(x = species_id, y = hindfoot_length)) +
geom_jitter(alpha = 0.3, color = "tomato") +
geom_boxplot(outlier.shape = NA)
Plotting time series data
Let’s calculate the number of records per year for each genus. First we need to group the data and count records within each group:
R
yearly_counts <- surveys_complete %>%
count(year, genus)
Time series data can be visualized as a line plot with years on the x-axis and counts on the y-axis:
R
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line()
Unfortunately, this does not work because we plotted data for all the
genera together. We need to tell ggplot to draw a line for each genus by
modifying the aesthetic function to include
group = genus
:
R
ggplot(data = yearly_counts, aes(x = year, y = n, group = genus)) +
geom_line()
We will be able to distinguish genera in the plot if we add colors
(using color
also automatically groups the data):
R
ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line()
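To check how many separate lines each version draws, the sketch below builds the three plots on a made-up table and uses layer_data() from ggplot2 to look at the group assigned to each row. The toy_counts table and its values are assumptions for illustration.

```r
library(ggplot2)

# Made-up yearly counts for two genera
toy_counts <- data.frame(
  year  = c(2000, 2001, 2002, 2000, 2001, 2002),
  genus = c("A", "A", "A", "B", "B", "B"),
  n     = c(5, 8, 6, 20, 18, 25)
)

# No grouping: a single line runs through all six points
single <- ggplot(data = toy_counts, aes(x = year, y = n)) +
  geom_line()

# group = genus: one line per genus
grouped <- ggplot(data = toy_counts, aes(x = year, y = n, group = genus)) +
  geom_line()

# color = genus: one colored line per genus, grouping is implied
colored <- ggplot(data = toy_counts, aes(x = year, y = n, color = genus)) +
  geom_line()

# Number of distinct groups, and therefore lines, in each plot
n_single  <- length(unique(layer_data(single)$group))
n_grouped <- length(unique(layer_data(grouped)$group))
n_colored <- length(unique(layer_data(colored)$group))
```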
Integrating the pipe operator with ggplot2
In the previous lesson, we saw how to use the pipe operator
%>%
to use different functions in a sequence and create
a coherent workflow. We can also use the pipe operator to pass the
data
argument to the ggplot()
function. The
hard part is to remember that to build your ggplot, you need to use
+
and not %>%
.
R
yearly_counts %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
The pipe operator can also be used to link data manipulation with subsequent data visualization.
R
yearly_counts_graph <- surveys_complete %>%
count(year, genus) %>%
ggplot(mapping = aes(x = year, y = n, color = genus)) +
geom_line()
yearly_counts_graph
Faceting
ggplot
has a special technique called faceting
that allows the user to split one plot into multiple plots based on a
factor included in the dataset. We will use it to make a time series
plot for each genus:
R
ggplot(data = yearly_counts, aes(x = year, y = n)) +
geom_line() +
facet_wrap(facets = vars(genus))
Now we would like to split the line in each plot by the sex of each
individual measured. To do that we need to make counts in the data frame
grouped by year
, genus
, and
sex
:
R
yearly_sex_counts <- surveys_complete %>%
count(year, genus, sex)
We can now make the faceted plot by splitting further by sex using
color
(within a single plot):
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(facets = vars(genus))
We can also facet both by sex and genus:
R
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(sex), cols = vars(genus))
You can also organize the panels only by rows (or only by columns):
R
# One column, facet by rows
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(rows = vars(genus))
R
# One row, facet by column
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_grid(cols = vars(genus))
Note: ggplot2 before version 3.0.0 used formulas to specify how plots are
faceted. If you encounter facet_grid() or facet_wrap() code containing ~,
please read https://ggplot2.tidyverse.org/news/#tidy-evaluation.
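If you do come across the formula notation, the following sketch shows how it maps onto the vars() notation used in this lesson. The data frame toy_counts is a made-up stand-in for yearly_sex_counts; the formula versions are included only for comparison:

```r
library(ggplot2)

# Hypothetical toy data standing in for yearly_sex_counts
toy_counts <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 4),
  genus = rep(c("Dipodomys", "Neotoma"), each = 6),
  sex   = rep(rep(c("F", "M"), each = 3), times = 2),
  n     = c(10, 12, 9, 8, 11, 7, 3, 4, 6, 2, 5, 4)
)

base_plot <- ggplot(toy_counts, aes(x = year, y = n, color = sex)) +
  geom_line()

# Older formula notation: rows ~ columns
grid_formula <- base_plot + facet_grid(sex ~ genus)
# Equivalent vars() notation
grid_vars <- base_plot + facet_grid(rows = vars(sex), cols = vars(genus))

# Older one-sided formula for facet_wrap()
wrap_formula <- base_plot + facet_wrap(~ genus)
# Equivalent vars() notation
wrap_vars <- base_plot + facet_wrap(facets = vars(genus))

grid_vars
```

In the formula, the variable to the left of ~ defines the rows and the variable to the right defines the columns.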
ggplot2 themes
Plots with a white background are usually more readable when printed.
Every single component of a ggplot graph can be customized using the
generic theme() function, as we will see below. However, there are
pre-loaded themes available that change the overall appearance of the
graph without much effort.
For example, we can change our previous graph to have a simpler white
background using the theme_bw() function:
R
ggplot(data = yearly_sex_counts,
mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
theme_bw()
In addition to theme_bw(), which changes the plot background to white,
ggplot2 comes with several other themes which can be useful to quickly
change the look of your visualization. The complete list of themes is
available at https://ggplot2.tidyverse.org/reference/ggtheme.html.
theme_minimal() and theme_light() are popular, and theme_void() can be
useful as a starting point to create a new hand-crafted theme.
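One convenient way to compare themes is to store the plot without a theme and then add each theme in turn. The sketch below assumes a small made-up data frame (toy_counts) in place of yearly_sex_counts; the object names are only illustrative:

```r
library(ggplot2)

# Hypothetical toy data standing in for yearly_sex_counts
toy_counts <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 4),
  genus = rep(c("Dipodomys", "Neotoma"), each = 6),
  sex   = rep(rep(c("F", "M"), each = 3), times = 2),
  n     = c(10, 12, 9, 8, 11, 7, 3, 4, 6, 2, 5, 4)
)

# Build the plot once, without choosing a theme
base_plot <- ggplot(toy_counts, aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(vars(genus))

# Add a different built-in theme to the same plot object each time
plot_minimal <- base_plot + theme_minimal()
plot_light   <- base_plot + theme_light()
plot_void    <- base_plot + theme_void()

# Print one of them; print the others to compare
plot_minimal
```

Because base_plot itself is never modified, you can keep trying themes without rebuilding the plot.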
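The ggthemes package provides a wide variety of options.
As a further example, the code below computes the average weight of each species in each year and plots it in facets using theme_bw():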
R
yearly_weight <- surveys_complete %>%
group_by(year, species_id) %>%
summarize(avg_weight = mean(weight))
OUTPUT
#> `summarise()` has grouped output by 'year'. You can override using the
#> `.groups` argument.
R
ggplot(data = yearly_weight, mapping = aes(x = year, y = avg_weight)) +
geom_line() +
facet_wrap(vars(species_id)) +
theme_bw()
Customization
Take a look at the ggplot2 cheat sheet, and think of ways you could improve the plot.
Now, let’s change the names of the axes to something more informative than ‘year’ and ‘n’ and add a title to the figure:
R
ggplot(data = yearly_sex_counts, aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw()
The axes have more informative names, but their readability can be
improved by increasing the font size. This can be done with the generic
theme() function:
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(text = element_text(size = 16))
Note that it is also possible to change the fonts of your plots. If
you are on Windows, you may have to install the extrafont package, and
follow the instructions included in the README for this package.
After our manipulations, you may notice that the values on the x-axis
are still not properly readable. Let’s change the orientation of the
labels and adjust them vertically and horizontally so they don’t
overlap. You can use a 90 degree angle, or experiment to find the
appropriate angle for diagonally oriented labels. We can also modify the
facet label text (strip.text) to italicize the genus names:
R
ggplot(data = yearly_sex_counts, mapping = aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90, hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
strip.text = element_text(face = "italic"),
text = element_text(size = 16))
If you prefer the changes you made to the default theme, you can save them as an object and easily apply them to other plots you create:
R
grey_theme <- theme(axis.text.x = element_text(colour = "grey20", size = 12,
                                               angle = 90, hjust = 0.5,
                                               vjust = 0.5),
                    axis.text.y = element_text(colour = "grey20", size = 12),
                    text = element_text(size = 16))
ggplot(surveys_complete, aes(x = species_id, y = hindfoot_length)) +
geom_boxplot() +
grey_theme
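A saved theme can also bundle a pre-loaded theme together with your own adjustments, so that a single object carries the whole look. The following is a sketch under assumptions: toy_measurements is made-up data, and my_theme is a hypothetical name chosen for illustration:

```r
library(ggplot2)

# Hypothetical toy data with the columns used in the boxplots
toy_measurements <- data.frame(
  species_id      = rep(c("DM", "NL"), each = 4),
  hindfoot_length = c(35, 36, 37, 36, 32, 33, 31, 32),
  weight          = c(40, 44, 42, 45, 150, 160, 155, 170)
)

# Combine a complete theme with custom tweaks into one reusable object
my_theme <- theme_bw() +
  theme(axis.text.x = element_text(angle = 90, hjust = 0.5, vjust = 0.5),
        text = element_text(size = 16))

# Apply the same object to two different plots
hindfoot_plot <- ggplot(toy_measurements,
                        aes(x = species_id, y = hindfoot_length)) +
  geom_boxplot() +
  my_theme

weight_plot <- ggplot(toy_measurements, aes(x = species_id, y = weight)) +
  geom_boxplot() +
  my_theme

hindfoot_plot
```

Changing my_theme in one place and rerunning the script then updates every plot that uses it.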
Challenge
With all of this information in hand, please take another five
minutes to either improve one of the plots generated in this exercise or
create a beautiful graph of your own. Use the RStudio ggplot2 cheat sheet
for inspiration.
Here are some ideas:
- See if you can change the thickness of the lines.
- Can you find a way to change the name of the legend? What about its labels?
- Try using a different color palette (see https://r-graphics.org/chapter-colors).
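If you get stuck, here is one possible approach to the three ideas above, offered as a sketch rather than the only answer. It uses made-up data (toy_counts) and assumes ggplot2 3.4.0 or later for the linewidth argument (older versions use size instead):

```r
library(ggplot2)

# Hypothetical toy data standing in for yearly_sex_counts
toy_counts <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 4),
  genus = rep(c("Dipodomys", "Neotoma"), each = 6),
  sex   = rep(rep(c("F", "M"), each = 3), times = 2),
  n     = c(10, 12, 9, 8, 11, 7, 3, 4, 6, 2, 5, 4)
)

styled_plot <- ggplot(toy_counts, aes(x = year, y = n, color = sex)) +
  # Thicker lines (assumes ggplot2 >= 3.4.0; use size = 1.2 on older versions)
  geom_line(linewidth = 1.2) +
  # One scale call sets the legend title, its labels, and the palette.
  # Labels are given in the order of the levels: "F" then "M".
  scale_color_brewer(name = "Sex",
                     labels = c("Female", "Male"),
                     palette = "Dark2") +
  facet_wrap(vars(genus)) +
  theme_bw()

styled_plot
```

"Dark2" is one of the ColorBrewer palette names; other palettes can be swapped in by changing that single argument.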
Arranging plots
Faceting is a great tool for splitting one plot into multiple plots,
but sometimes you may want to produce a single figure that contains
multiple plots using different variables or even different data frames.
The patchwork package allows us to combine separate ggplots into a single
figure while keeping everything aligned properly. Like most R packages,
we can install patchwork from CRAN, the R package repository:
R
install.packages("patchwork")
After you have loaded the patchwork package you can use + to place plots
next to each other, / to arrange them vertically, and plot_layout() to
determine how much space each plot uses:
R
library(patchwork)
plot_weight <- ggplot(data = surveys_complete, aes(x = species_id, y = weight)) +
geom_boxplot() +
labs(x = "Species", y = expression(log[10](Weight))) +
scale_y_log10()
plot_count <- ggplot(data = yearly_counts, aes(x = year, y = n, color = genus)) +
geom_line() +
labs(x = "Year", y = "Abundance")
plot_weight / plot_count + plot_layout(heights = c(3, 2))
You can also use parentheses () to create more complex layouts. There are
many useful examples on the patchwork website.
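As an illustration of a layout built with parentheses, the sketch below places two plots side by side on a top row and a third plot underneath. The data (toy_counts) and the plot names are made up for this example:

```r
library(ggplot2)
library(patchwork)

# Hypothetical toy data standing in for yearly_counts
toy_counts <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 2),
  genus = rep(c("Dipodomys", "Neotoma"), each = 3),
  n     = c(10, 12, 9, 3, 4, 6)
)

plot_lines <- ggplot(toy_counts, aes(x = year, y = n, color = genus)) +
  geom_line()
plot_box <- ggplot(toy_counts, aes(x = genus, y = n)) +
  geom_boxplot()
plot_bars <- ggplot(toy_counts, aes(x = year, y = n, fill = genus)) +
  geom_col()

# The parentheses group the first two plots into one row;
# / then stacks that row on top of the third plot.
# plot_layout() makes the top row twice as tall as the bottom one.
combined <- (plot_lines + plot_box) / plot_bars +
  plot_layout(heights = c(2, 1))

combined
```

Without the parentheses, / would be evaluated before +, so plot_box would be stacked on plot_bars first and the layout would differ.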
Exporting plots
After creating your plot, you can save it to a file in your favorite
format. The Export tab in the Plot pane in RStudio will save your plots
at low resolution, which will not be accepted by many journals and will
not scale well for posters.
(As an aside, the ggplot2 extensions website provides a list of packages
that extend the capabilities of ggplot2, including additional themes.)
Instead of the Export tab, use the ggsave() function, which allows you to
easily change the dimensions and resolution of your plot by adjusting the
appropriate arguments (width, height and dpi):
R
my_plot <- ggplot(data = yearly_sex_counts,
aes(x = year, y = n, color = sex)) +
geom_line() +
facet_wrap(vars(genus)) +
labs(title = "Observed genera through time",
x = "Year of observation",
y = "Number of individuals") +
theme_bw() +
theme(axis.text.x = element_text(colour = "grey20", size = 12, angle = 90,
hjust = 0.5, vjust = 0.5),
axis.text.y = element_text(colour = "grey20", size = 12),
text = element_text(size = 16))
ggsave("name_of_file.png", my_plot, width = 15, height = 10)
## This also works for plots combined with patchwork
plot_combined <- plot_weight / plot_count + plot_layout(heights = c(3, 2))
ggsave("plot_combined.png", plot_combined, width = 10, dpi = 300)
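Note: The width and height arguments also determine the apparent font
size in the saved plot. Text is sized in points, so the same text looks
smaller relative to the plot as the dimensions grow.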
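The following sketch shows how width, height and dpi translate into image size. It saves a made-up plot to temporary files (tempfile() is used here only so the example does not write into your project folder), and it relies on ggsave() measuring width and height in inches by default:

```r
library(ggplot2)

# Hypothetical toy data standing in for yearly_counts
toy_counts <- data.frame(
  year  = rep(c(2000, 2001, 2002), times = 2),
  genus = rep(c("Dipodomys", "Neotoma"), each = 3),
  n     = c(10, 12, 9, 3, 4, 6)
)

toy_plot <- ggplot(toy_counts, aes(x = year, y = n, color = genus)) +
  geom_line() +
  theme_bw()

small_file <- tempfile(fileext = ".png")
large_file <- tempfile(fileext = ".png")

# 4 x 3 inches at 300 dpi gives an image of 1200 x 900 pixels
ggsave(small_file, toy_plot, width = 4, height = 3, dpi = 300)

# 12 x 9 inches at 300 dpi gives an image of 3600 x 2700 pixels;
# the text keeps its point size, so it looks smaller relative to the plot
ggsave(large_file, toy_plot, width = 12, height = 9, dpi = 300)

file.exists(c(small_file, large_file))
```

If the text in a saved figure looks too small, reducing width and height (or increasing the text size in theme()) is usually more effective than changing dpi alone.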