Using R for GIS

In this course, we will be using R for all our GIS needs. If you’ve never used R before, no worries! We will move at a slow and steady pace and will provide support along the way. If you’re an R pro, feel free to flex your R skills and (hopefully) build on them!

The objectives of the guide are as follows:

  • Install and set up R and RStudio
  • Understand R data types
  • Understand R data structures
  • Understand R functions
  • Introduction to tidyverse and its suite of data wrangling functions
  • Understand R Markdown

This lab guide closely follows and supplements the material presented in Chapters 2, 4, 5, 7, and 21 of the textbook R for Data Science (RDS).


What is R?

R is a free, open-source statistical programming language that is useful for data cleaning, analysis, and visualization. R is an interpreted language, not a compiled one: you type a command into R and it runs right away, with no separate compilation step. It is both command-line software and a programming environment, and it runs on Windows, Macintosh, UNIX, and Linux platforms. Because it is open source, users can freely distribute, study, change, and improve the software. It is basically a free, super big, and complex calculator. You will be using R to accomplish all data analysis tasks in this class.

You might be wondering “Why in the world do we need to know how to use a statistical software program?” Here are the main reasons:

  1. You will be learning about new concepts in lecture and the readings. Applying these concepts using real data is an important form of learning. A statistical software program is the most efficient (and in many cases the only) means of running data analyses, not just in the cloistered setting of a university classroom, but especially in the real world. Applied data analysis will be the way we bridge statistical theory to the “real world.” And R is the vehicle for accomplishing this.

  2. In order to do applied data analysis outside of the classroom, you need to know how to use a statistical program. There is no way around it. If you want to collect data on health, you need a program to store and analyze that data.


The next question you may have is “I love Excel or SAS or SPSS or Stata [or insert your favorite program]. Why can’t I use that and forget your stupid R?” Here are some reasons:

  1. R is free. Most programs are not.
  2. R is open source, which means the software is community supported. You get help not from some big corporation (e.g. Microsoft with Excel), but from people all around the world who are using R. And R has a lot of users, so if you pose a problem to the user community, someone will likely help you.
  3. R is powerful and extensible (meaning that procedures for analyzing data that don’t currently exist can be readily developed).
  4. R has the capability for mapping data, an asset not generally available in other statistical software.
  5. If it isn’t already, R is becoming the de-facto data analysis tool in many fields, including for many CA DPH positions.


R is different from Excel in that it is generally not a point-and-click program. You will be primarily writing code to clean and analyze data. What does writing or sourcing code mean? A basic example will help clarify. Let’s say you are given a dataset with cancer cases across CA, and the dataset has a variable named AGE representing age. To get the mean age of the people in your dataset, you would write code that looks something like this:

# Download Cancer Dataset
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/CA_Cancer_Data.rds", "ca_cancer.rds", mode = "wb")

# Read in Cancer Dataset
cancer <- readRDS("ca_cancer.rds")

# Get names of columns or variables
head(cancer)
##         time event AGE INS             geometry
## 1   1.275976     1  67 Mcr   -122.3492, 38.3025
## 14  3.509907     1  69 Mcr -121.98325, 37.82052
## 17 10.297702     0  75 Mng   -122.3092, 38.3314
## 36  7.012532     0  46 Mcr -122.20308, 38.09592
## 55  3.389200     0  70 Mcr -122.63560, 38.26257
## 92  6.110251     1  59 Unk -122.01982, 37.35523
# Get mean of AGE variable
mean(cancer$AGE)
## [1] 60.4692


The command tells the program to get the mean of the variable AGE. If you wanted the sum, you would write the command sum(cancer$AGE).

Now, where do you write this command? You write it in a script. A script is basically a text file. Think of writing code as something similar to writing an essay in a Word document. Instead of writing sentences to produce an essay, in a programming script you write lines of code to run a data analysis. We’ll go through scripting in more detail later in this lab, but the basic process of sourcing code to run a data analysis task is as follows.

  1. Write code. First, you open your script file, and write code or various commands (like mean(cancer$AGE)) that will execute data analysis tasks in this file.
  2. Send code to the software program to run (R in our case).
  3. Program produces results based on code. The program reads in your commands from the script and executes them, spitting out results in its console screen.
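
As a rough sketch of these three steps, the snippet below writes a two-line script to a temporary file and then sends it to R with source(). The file path, the object names ages and mean_age, and the values are made up for illustration; in practice you would write the script in RStudio's editor rather than with writeLines().

```r
# Step 1: write code into a script (a plain text file).
# 'script_path', 'ages', and 'mean_age' are hypothetical names for this sketch.
script_path <- tempfile(fileext = ".R")
writeLines(c(
  "ages <- c(67, 69, 75)",
  "mean_age <- mean(ages)"
), script_path)

# Step 2: send the script to R.
source(script_path)

# Step 3: R has run the commands, so the result is now available.
mean_age
```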


I am skipping over many details, but the above steps outline the general work flow. You might now be thinking that you’re perfectly happy pointing and clicking your mouse in Excel (or wherever) to do your data analysis tasks. So, why should you adopt the statistical programming approach to conducting a data analysis?

  1. Your script documents the decisions you made during the data analysis process. This is beneficial for many reasons.
    • It allows you to recreate your steps if you need to rerun or alter your analysis many weeks, months, or even years in the future.
    • It allows you to share your steps with other people. If someone asks you what decisions were made in the data analysis process, just hand them the script.
    • Related to the above points, a script promotes transparency (here is what I did) and reproducibility (you can do it too). When you write code, you are forced to explicitly state the steps you took to do your research. When you do research by clicking through drop-down menus, your steps are lost, or at least documenting them requires considerable extra effort.
  2. If you make a mistake in a data analysis step, you can go back, change a few lines of code, and poof, you’ve fixed your problem.

  3. It is more efficient. In particular, cleaning data can encompass a lot of tedious work that can be streamlined using statistical programming.

Hopefully, I’ve convinced you that statistical programming and R are worthwhile to learn. Now let’s talk about getting R on your computer!


Getting R

R can be downloaded from one of the “CRAN” (Comprehensive R Archive Network) sites. In the US, the main site is at http://cran.us.r-project.org/. Look in the “Download and Install R” area at the top. Click on the appropriate link based on your operating system.

If you already have R on your personal computer, make sure it is the most recent version (R version 4.5.2 ([Not] Part in a Rumble)).


Mac OS X

  1. On the “R for Mac OS” page, there are multiple packages that could be downloaded. Depending on the model of your Mac, pick the appropriate .pkg file. Read the notes listed for each operating system version; if you are using an older operating system, follow the instructions given there.

  2. After the package finishes downloading, locate the installer on your hard drive, double-click on the installer package, and after a few screens, select a destination for the installation of the R framework (the program) and the R.app GUI. Note that you will have to supply the Administrator’s password. Close the window when the installation is done.

  3. An application will appear in the Applications folder: R.app.

  4. Browse to the XQuartz download page. Click on the most recent version of XQuartz to download the application.

  5. Run the XQuartz installer. XQuartz is needed to create windows to display many types of R graphics. It was included in macOS until version 10.8 but now must be downloaded separately.


Windows

  1. On the “R for Windows” page, click on the “base” link, which should take you to the “R-4.5.2 for Windows” page.

  2. On this page, click “Download R-4.5.2 for Windows”, and save the .exe file to your hard disk when prompted. Saving to the desktop is fine.

  3. To begin the installation, double-click on the downloaded file. Don’t be alarmed if you get unknown publisher type warnings. Windows’ User Account Control will also worry about an unidentified program wanting access to your computer. Click on “Run”.

  4. Select the proposed options in each part of the install dialog. When the “Select Components” screen appears, just accept the standard choices.


What is RStudio?

If you click on the R program you just downloaded, you will find a very basic user interface. For example, below is what I get on a Mac:

The Basic R Console


We will not use R’s direct interface to run analyses in this class. Instead, we will use the program RStudio, which is much easier to interact with! RStudio gives you a true integrated development environment (IDE), where you can write code in a window, see results in other windows, see locations of files, see objects you’ve created, and so on. To clarify which is which: R is the name of the programming language itself and RStudio is an interface that makes writing code, running analyses, and visualizing data in R so much easier.


Getting RStudio

To download and install RStudio, follow the directions below

  1. Navigate to RStudio’s download site

  2. We’ve already downloaded R, so click on the appropriate link to Install RStudio based on your OS (Windows, Mac, Linux and many others). Do not download anything from the “All Installers and Tarballs” section.

  3. Click on the installer that you downloaded. Follow the installation directions, making sure to keep all defaults intact. After installation, RStudio should pop up in your Applications or Programs folder/menu.


The RStudio Interface

Open up RStudio. You should see the interface shown in the figure below which has three windows.

The RStudio Console


  • Console (bottom left) - The way R works is you write a line of code to execute some kind of task on a data object. The R Console allows you to run code interactively. The screen prompt > is an invitation from R to enter its world. This is where you type code in, press enter to execute the code, and see the results.
  • Environment, History, and Connections tabs (upper-right)
    • Environment - shows all the R objects that are currently open in your workspace. This is the place, for example, where you will see any data you’ve loaded into R. When you exit RStudio, R will clear all objects in this window. You can also click on the broom icon to clear out all the objects loaded and created in your current session.
    • History - shows a list of executed commands in the current session.
    • Connections - you can connect to a variety of data sources, and explore the objects and data inside the connection. I typically don’t use this window, but you can.
  • Files, Plots, Packages, Help and Viewer tabs (lower-right)
    • Files – shows all the files and folders in your current working directory
    • Plots – shows any charts, graphs, maps and plots you’ve executed
    • Packages – shows available R packages
    • Help – displays help documentation
    • Viewer – displays local web content

There is also a fourth window. But, we’ll get to this window a little later. The assignment guidelines also have more on this window!


Setting RStudio Defaults

While not required, I strongly suggest that you change preferences in RStudio to never save the workspace so you always open with a clean environment. See Ch. 8.1 of R4DS for some more background.

  1. In RStudio, open the Tools menu and then select Global Options.

  2. If not already highlighted, click on the General button from the left panel.

  3. Uncheck the following Restore boxes:

    • Restore most recently opened project at startup
    • Restore previously open source documents at startup
    • Restore .RData into workspace at startup
  4. Set Save Workspace to .RData on exit to “Never”.

  5. Click OK at the bottom to save the changes and close the preferences window. You may need to restart RStudio.

The reason for making these changes is that it is preferable for reproducibility to start each R session with a clean environment. You can restore a previous environment either by rerunning code or by manually loading a previously saved session.

The RStudio environment is modified when you execute code from files or from the console. If you always start fresh, you do not need to be concerned about things not working because of something you typed in the console but did not save in a file.

You only need to set these preferences once.


R Data Types

Let’s now explore what R can do. R is really just a big fancy calculator. For example, type in the following mathematical expression in the R console (left window):

1+1
## [1] 2

Note that spacing does not matter: 1+1 will generate the same answer as 1 + 1. Can you say hello to the world?

"hello world"
## [1] "hello world"

Looks great! Note that we need to put quotes around it. “hello world” is a character, and R recognizes characters only if there are quotes around them. This brings us to the topic of basic data types in R. There are four basic data types in R: character, logical, numeric, and factors (there are two others - complex and raw - but we won’t cover them because they are rarely used in practice).


Characters

Characters are used to represent words or letters in R. We saw this above with “hello world”. Character values are also known as strings. You might think that the value "1" is a number. Well, if you put quotes around it, it isn’t! Anything with quotes will be interpreted as a character. No ifs, ands or buts about it.


Logicals

A logical takes on two values: FALSE or TRUE. Logicals are usually constructed with comparison operators, which we’ll go through more carefully in Lab 2. Think of a logical as the answer to a question like “Is this value greater than (lower than/equal to) this other value?” The answer will be either TRUE or FALSE. TRUE and FALSE are logical values in R. For example, typing in the following

3 > 2
## [1] TRUE


This gives you a TRUE. What about the following?

"declan" == "catherine"
## [1] FALSE


Numeric

Numerics are separated into two types: integer and double. The distinction between integers and doubles is usually not important. R treats numerics as doubles by default because it is a less restrictive data type. You can do any mathematical operation on numeric values. We added one and one above. We can also multiply using the * operator.

2*3
## [1] 6


And divide

2/3
## [1] 0.6666667


And take logs

log(1)
## [1] 0


log(0)
## [1] -Inf

Hold up! What is -Inf? Well, you can’t take the logarithm of 0, so R is telling you that the result is not an ordinary finite number. The value -Inf (negative infinity) is a special value that you can get in R.
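
A few quick checks, sketched here using only base R functions, show how R treats these special values. Note that although -Inf is not an ordinary finite number, R still stores it as a double.

```r
# R still classifies -Inf as numeric (it is stored as a double)
is.numeric(log(0))

# is.finite() separates ordinary numbers from Inf, -Inf, and NaN
is.finite(log(0))
is.finite(log(1))

# Dividing a positive number by zero gives Inf; 0/0 gives NaN ("not a number")
1 / 0
0 / 0
```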


Factors

Think of a factor as a categorical variable. It is sort of like a character, but not really. It is actually a numeric code with character-valued levels. Think of a character as a true string and a factor as a set of categories represented as characters. We won’t use factors too much in this course, so maybe don’t worry about it for now!
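
Since factors can feel abstract, here is a minimal sketch of the "numeric code with character-valued levels" idea. The vector ins and its values are made up for illustration (loosely modeled on the INS column shown earlier); factor(), levels(), and as.integer() are base R functions.

```r
# A character vector of hypothetical insurance categories
ins <- c("Mcr", "Mng", "Mcr", "Unk")

# Convert it to a factor: R stores integer codes plus a set of levels
ins_factor <- factor(ins)

levels(ins_factor)      # the distinct categories
as.integer(ins_factor)  # the underlying integer code for each value
```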


R Data Structures

You just learned that R has four basic data types. Now, let’s go through how we can store data in R. That is, you type in the character “hello world” or the number 3, and you want to store these values. You do this by using R’s various data structures.

Vectors

A vector is the most common and basic R data structure and is pretty much the workhorse of the language. A vector is simply a sequence of values which can be of any data type but all of the same type. There are a number of ways to create a vector depending on the data type, but the most common is to insert the data you want to save in a vector into the command c(). For example, to represent the values 4, 16, and 9 in a vector, type in:

c(4, 16, 9)
## [1]  4 16  9


You can also have a vector of character values

c("catherine", "declan", "gwen")
## [1] "catherine" "declan"    "gwen"

The above code does not actually “save” the values 4, 16, and 9 or catherine, declan, gwen – it just presents it on the screen in a vector. If you want to use these values again without having to type out c(4, 16, 9), you can save it in a data object. At the heart of almost everything you will do (or are ever likely to do) in R is the concept that everything in R is an object. These objects can be almost anything, from a single number or character string (like a word) to highly complex structures like the output of a plot, a map, a summary of your statistical analysis or a set of R commands that perform a specific task.

You assign data to an object using the arrow sign <-. This will create an object in R’s memory that can be called back into the command window at any time. For example, you can save “hello world” to a vector called b by typing in

b <- "hello world"
b
## [1] "hello world"

You can pronounce the above as “b becomes ‘hello world’”.

The first line tells R to store b as ‘hello world.’ In the next line, we are telling R to print what b is.

Note that R is case sensitive. If you type in B instead of b, you will get an error.

Similarly, you can save the numbers 4, 16 and 9 into a vector called v1.

v1 <- c(4, 16, 9)
v1
## [1]  4 16  9


You should see the objects b and v1 pop up in the Environment tab on the top right window of your RStudio interface.


Environment Window


Note that the name v1 is nothing special here. You could have named the object x or sph215 or your pet’s name (mine was Ali Baba). You can’t, however, name objects using special characters (e.g. !, @, $) or only numbers (although you can combine numbers and letters, but a number cannot be at the beginning e.g. 2d2). For example, you’ll get an error if you save the vector c(4,16,9) to an object with the following names

123 <- c(4, 16, 9)
!!! <- c(4, 16, 9)
## Error: <text>:2:5: unexpected assignment
## 1: 123 <- c(4, 16, 9)
## 2: !!! <-
##        ^


Also note that to distinguish a character value from a variable name, it needs to be quoted. “v1” is a character value whereas v1 is a variable. One of the most common mistakes for beginners is to forget the quotes.

james
## Error in eval(expr, envir, enclos): object 'james' not found

The error occurs because R tries to print the value of object james, but there is no such variable. So remember that any time you get the error message object 'something' not found, the most likely reason is that you forgot to quote a character value. If not, it probably means that you have misspelled, or not yet created, the object that you are referring to. We’ve included the common pitfalls and R tips in this class resource.

Every vector has two key properties: type and length. The type property indicates the data type that the vector is holding. Use the command typeof() to determine the type.

typeof(b)
## [1] "character"


typeof(v1)
## [1] "double"

Note that a vector cannot hold values of different types. If different data types exist, R will coerce the values into the highest type based on its internal hierarchy: logical < integer < double < character. Type in test <- c("r", 6, TRUE) in your R console. What is the vector type of test?
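
To see the hierarchy at work, here is a short sketch that uses different mixed vectors from the one in the exercise. The object names mix1, mix2, and mix3 are arbitrary.

```r
# logical + double: TRUE is coerced to 1, so the vector is double
mix1 <- c(TRUE, 2.5)
typeof(mix1)

# integer + double: the integer (written with an L suffix) becomes a double
mix2 <- c(1L, 2.5)
typeof(mix2)

# logical + integer: TRUE is coerced to the integer 1
mix3 <- c(TRUE, 5L)
typeof(mix3)
```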


The command length() determines the number of data values that the vector is storing.

length(b)
## [1] 1
length(v1)
## [1] 3

You can also directly determine if a vector is of a specific data type by using the command is.X() where you replace X with the data type. For example, to find out if v1 is numeric, type in:

is.numeric(b)
## [1] FALSE
is.numeric(v1)
## [1] TRUE


There is also is.logical(), is.character(), and is.factor(). You can also coerce a vector of one data type to another. For example, save the value “1” and “2” (both in quotes) into a vector named x1.

x1 <- c("1", "2")
typeof(x1)
## [1] "character"


To convert x1 into a numeric, use the command as.numeric()

x2 <- as.numeric(x1)
typeof(x2)
## [1] "double"


There is also as.logical(), as.character(), and as.factor().
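
Coercion does not always succeed. As a small sketch (the object names x3 and x4 are arbitrary), converting text that does not look like a number produces the missing value NA. R would normally print a warning as well, which suppressWarnings() hides here.

```r
x3 <- c("1", "2", "three")

# "three" cannot be read as a number, so R returns NA for it
x4 <- suppressWarnings(as.numeric(x3))
x4

# is.na() flags which values failed to convert
is.na(x4)
```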

An important practice you should adopt early is to keep only necessary objects in your current R Environment. For example, we will not be using x2 any longer in this guide. To remove this object from R forever, use the command rm()

rm(x2)

The vector x2 should have disappeared from the Environment tab. Au revoir!


Also note that when you close down RStudio, the objects you created above will disappear for good. Unless you save them onto your hard drive (we’ll touch on saving data in later labs), all data objects you create in your current R session will go bye bye when you exit the program.


Data Frames

We learned that data values can be stored in data structures known as vectors. The next step is to learn how to store vectors into an even higher level data structure. The data frame can do this. Data frames store vectors of the same length. Create a vector called v2 storing the values 5, 12, and 25.

v2 <- c(5,12,25)


We can create a data frame using the command data.frame() storing the vectors v1 and v2 as columns.

data.frame(v1,v2)
##   v1 v2
## 1  4  5
## 2 16 12
## 3  9 25


Store this data frame in an object called df1

df1<-data.frame(v1, v2)

df1 should pop up in your Environment window. You’ll notice a small blue arrow icon next to df1. This tells you that df1 possesses or holds more than one object. Click on the icon and you’ll see the two vectors we saved into df1. Another nice thing you can do is directly click on df1 from the Environment window to bring up an Excel style worksheet on the top left of your RStudio interface. You can also type in:

View(df1)

to bring the worksheet up. You can’t edit this worksheet directly, but it allows you to see the values that a higher level R data object contains.


We can store different types of vectors in a data frame. For example, we can store one character vector and one numeric vector in a single data frame.

v3 <- c("catherine", "declan", "gwen")
df2 <- data.frame(v1, v3)
df2
##   v1        v3
## 1  4 catherine
## 2 16    declan
## 3  9      gwen


For higher level data structures like a data frame, use the function class() to figure out what kind of object you’re working with.

class(df2)
## [1] "data.frame"


length() is less helpful for a data frame: it returns the number of columns rather than the number of values. Instead, we describe a data frame by its dimensions - the number of rows and columns. You can find the number of rows and columns that a data frame has by using the command dim()

dim(df1)
## [1] 3 2

Here, the data frame df1 has 3 rows and 2 columns. Data frames also have column names, which are characters.
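
If you only need one of the two dimensions, base R also provides nrow() and ncol(). The sketch below rebuilds df1 so that it runs on its own.

```r
v1 <- c(4, 16, 9)
v2 <- c(5, 12, 25)
df1 <- data.frame(v1, v2)

nrow(df1)  # number of rows
ncol(df1)  # number of columns
```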


We can figure out the names of the columns using colnames().

colnames(df1)
## [1] "v1" "v2"

In this case, the data frame used the vector names for the column names.


We can extract columns from data frames by referring to their names using the $ sign.

df1$v1
## [1]  4 16  9


We can also extract data from data frames using brackets [ , ]

df1[,1]
## [1]  4 16  9

The value before the comma indicates the row, which you leave empty if you are not selecting by row, which we did above. The value after the comma indicates the column, which you leave empty if you are not selecting by column. The above line of code selected the first column.


Let’s now select the 2nd row.

df1[2,]
##   v1 v2
## 2 16 12


OK, so that wasn’t too hard. Now let’s try something a little trickier! What is the value in the 2nd row and 1st column?

df1[2,1]
## [1] 16

See – we can do hard things!
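
Brackets also accept column names and logical conditions, which is often more readable than counting positions. A brief sketch, again rebuilding df1 so the block is self-contained:

```r
df1 <- data.frame(v1 = c(4, 16, 9), v2 = c(5, 12, 25))

# Select a column by name instead of by position
df1[, "v2"]

# Select the rows where v1 is greater than 5 (keeping all columns)
df1[df1$v1 > 5, ]

# Select a single value by row number and column name
df1[2, "v1"]
```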


Functions

Let’s take a step back and talk about functions (also known as commands, or macros in SAS). An R function is a packaged recipe that converts one or more inputs (called arguments) into a single output. You execute all of your tasks in R using functions. We have already used several functions above, including typeof() and colnames(). Every function in R will have the following basic format:

functionName(arg1 = val1, arg2 = val2, ...)

In R, you type in the function’s name and set a number of options or parameters within parentheses that are separated by commas. Some options need to be set by the user - i.e. the function will spit out an error because a required option is blank - whereas others can be set but are not required because there is a default value established.
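
Here is a small sketch of required versus optional arguments, using the base R function round(). Its first argument (the number to round) must be supplied, while its digits argument has a default of 0.

```r
# The number is required; digits is optional and falls back to its default of 0
round(3.14159)

# Supplying digits overrides the default
round(3.14159, digits = 2)
```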

Let’s use the function seq() which makes regular sequences of numbers. You can find out what the options are for a function by calling up its help documentation by typing ? and the function name

? seq
## starting httpd help server ... done

The help documentation should pop up in the bottom right window of your RStudio interface. The documentation should also provide some examples of the function at the bottom of the page. Type the arguments from = 1, to = 10 inside the parentheses.

seq(from = 1, to = 10)
##  [1]  1  2  3  4  5  6  7  8  9 10


You should get the same result if you type in:

seq(1, 10)
##  [1]  1  2  3  4  5  6  7  8  9 10

The code above demonstrates something about how R resolves function arguments. When you use a function, you can always specify all the arguments in arg = value form. But if you do not, R attempts to resolve by position. So in the code above, it is assumed that we want a sequence from = 1 that goes to = 10 because we typed 1 before 10. Type in 10 before 1 and see what happens. Since we didn’t specify step size, the default value of by in the function definition is used, which ends up being 1 in this case.
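
Two more variations, as a sketch: naming the arguments lets you supply them in any order, and by changes the step size.

```r
# Named arguments can appear in any order
seq(to = 10, from = 1)

# by sets the step size (here, every other number)
seq(from = 1, to = 10, by = 2)
```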


Packages

Functions do not exist in a vacuum, but exist within R packages. Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. At the top left of a function’s help documentation, you’ll find in curly brackets the R package that the function is housed in. For example, type in your console ? seq. At the top left of the help documentation, you’ll find that seq() is in the package base. All the functions we have used so far are part of packages that have been pre-installed and pre-loaded into R.

In order to use functions in a new package, you first need to install the package using the install.packages() command. For example, we will be using commands from the package tidyverse in this lab.

options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("tidyverse")
## Installing package into 'C:/Users/skpar/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'tidyverse' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\skpar\AppData\Local\Temp\RtmpeU3eRu\downloaded_packages

You should see a bunch of gibberish roll through your console screen. Don’t worry, that’s just R downloading all of the other packages and applications that tidyverse relies on. These are known as dependencies. Unless you get a message in red that indicates there is an error (like we saw when we typed in james without quotes), you should be fine.

Next, you will need to load packages in your working environment (every time you start RStudio). We do this with the library() function. Notice there are no quotes around tidyverse this time (just to make things trickier for us!).

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.1     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.2
## ✔ purrr     1.2.0     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The Packages window at the lower-right of your RStudio shows you all the packages you currently have installed. If you don’t have a package listed in this window, you’ll need to use the install.packages() function to install it. If the package is checked, that means it is loaded into your current R session.

For example, here is a section of my Packages window:

[Figure: window1.png]

The only package loaded into my current session is methods, a package that is loaded every time you open an R session. Let’s say I use install.packages() to install the package matrixStats. The window now looks like:

[Figure: window2.png]

Let’s load matrixStats using library(), and then we will see a check mark appear next to matrixStats.

install.packages("matrixStats")
## Installing package into 'C:/Users/skpar/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'matrixStats' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\skpar\AppData\Local\Temp\RtmpeU3eRu\downloaded_packages
library(matrixStats)
## 
## Attaching package: 'matrixStats'
## The following object is masked from 'package:dplyr':
## 
##     count


Look at us!

To uninstall a package, use the function remove.packages().

Note that you only need to install packages once with install.packages(), but you need to load them each time you relaunch RStudio with library(). Repeat after me: Install once, library every time. If you need to reinstall R or update to a new version of R, you will need to reinstall all packages. And as noted earlier, R has several packages already preloaded into your working environment. These are known as base packages and a list of their functions can be found here.
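
One common way to follow "install once, library every time" in a script is to install a package only if it is missing. The helper install_if_missing() below is a hypothetical convenience function written for this sketch, not part of any package; requireNamespace() and install.packages() both ship with R. It is demonstrated on stats, which comes with R, so nothing is downloaded.

```r
# Hypothetical helper: install a package only when it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # Report (invisibly) whether the package is available afterwards
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# 'stats' comes with R, so this call skips the install step
install_if_missing("stats")

# Loading is still needed in every new session
library(stats)
```

For a package such as matrixStats, you would call install_if_missing("matrixStats") and then library(matrixStats).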


Tidyverse

In most labs, we will be using commands from the tidyverse package. Tidyverse is a collection of high-powered, consistent, and easy-to-use packages developed by a number of thoughtful and talented R developers. The consistency of the tidyverse, together with the goal of increasing productivity, means that the syntax of tidy functions is typically straightforward to learn. You can read more about tidyverse principles in Chapter 9, pages 147-151 in RDS.

Excited about entering the tidyverse? I bet you are, so here is a badge to show your excitement!

Your Tidyverse Badge


Tibbles

Although the tidyverse works with all data objects, its fundamental object type is the tibble. Tibbles are not only a super fun word to say, they are data frames that tweak some older behaviors to make life a little easier. There are two main differences in the usage of a data frame vs a tibble: printing and subsetting. Let’s be clear here – tibbles are just a special kind of data frame. They just make things “tidier.” Let’s bring in some data to illustrate the differences and similarities between data frames and tibbles. Install the package nycflights13

install.packages("nycflights13")
## Installing package into 'C:/Users/skpar/AppData/Local/R/win-library/4.5'
## (as 'lib' is unspecified)
## package 'nycflights13' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\skpar\AppData\Local\Temp\RtmpeU3eRu\downloaded_packages


Make sure you also load the package.

library(nycflights13)


If you look in the upper right hand Environment tab and click on Global Environment, you will see there is a dataset called flights included in this package. It includes information on all 336,776 flights that departed from New York City in 2013. Let’s save this file in the local R environment.

nyctibble <- flights
class(nyctibble)
## [1] "tbl_df"     "tbl"        "data.frame"


This dataset is a tibble. Let’s also save it as a regular data frame by using the as.data.frame() function.

nycdf <- as.data.frame(flights)
class(nycdf)
## [1] "data.frame"


The first difference between data frames and tibbles is how the dataset looks. Tibbles have a refined print method that shows only the first 10 rows, and only the columns that fit on the screen. In addition, each column reports its name and type.

nyctibble
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>


Tibbles are designed so that you don’t overwhelm your console when you print large data frames. Compare the print output above to what you get with a data frame.

nycdf


Um, that was a lot… The tibble is much cleaner. You can bring up an Excel-like worksheet of the tibble (or data frame) using the View() function.

View(nyctibble)


You can identify the names of the columns (and hence the variables in the dataset) by using the function names().

names(nyctibble)
##  [1] "year"           "month"          "day"            "dep_time"      
##  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
##  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
## [13] "origin"         "dest"           "air_time"       "distance"      
## [17] "hour"           "minute"         "time_hour"

Those may come in handy if we wanted to analyze the data!


Finally, let’s convert a regular data frame to a tibble using the as_tibble() function.

as_tibble(nycdf)
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ℹ 336,766 more rows
## # ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
## #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
## #   hour <dbl>, minute <dbl>, time_hour <dttm>

Not all functions work with tibbles, particularly those that are specific to spatial data. As such, we’ll be using a combination of tibbles and regular data frames throughout the class, with a preference towards tibbles where possible. Note that when you search on Google for how to do something in R, you will likely get non-tidy ways of doing things. Most of these suggestions are fine, but some are not and may screw you up down the road. My advice is to try to stick with tidy functions to do things in R.
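The printing difference is shown above, but the subsetting difference is easier to see on a tiny example. The following is a minimal sketch on a made-up data frame (the object names `small_df` and `small_tb` and their columns are hypothetical). It assumes the tibble package is installed, which it will be if you installed the tidyverse.

```r
library(tibble)

# A small made-up data frame and its tibble twin (hypothetical names and values)
small_df <- data.frame(value = c(4, 16, 9), name = c("a", "b", "c"))
small_tb <- as_tibble(small_df)

# Selecting one column with [ , ] turns a data frame into a plain vector...
class(small_df[, 1])
# ...while a tibble stays a tibble
class(small_tb[, 1])

# $ on a data frame quietly matches partial column names ("val" finds "value")
small_df$val
# $ on a tibble requires the full name: this returns NULL with a warning
small_tb$val
```

In short, tibbles are stricter: they never change type behind your back and they complain when a column does not exist.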

Anyway, you earned another badge. Yes!

Your Tibble Badge


---
title: 'Lab 1: Intro to R'
---

\

# Using R for GIS

In this course, we will be using R for all our GIS needs. If you've never used R before, no worries! We will move at a slow and steady pace and will provide support along the way. If you're an R pro, feel free to flex your R skills and (hopefully) build on them! 

The objectives of the guide are as follows

- Install and set up R and RStudio
- Understand R data types
- Understand R data structures
- Understand R functions
- Introduction to tidyverse and its suite of data wrangling functions
- Understand R Markdown
- This lab guide follows closely and supplements the material presented in Chapters 2, 4, 5, 7, and 21 in the textbook [R for Data Science (RDS)](https://r4ds.hadley.nz/).

\

# What is R?

R is a free, open source statistical programming language. It is useful for data cleaning, analysis, and visualization. R is an interpreted language, not a compiled one. This means that you type something into R and it does what you tell it. It is both a command line software and a programming environment. It is an extensible, open-source language and computing environment for Windows, Macintosh, UNIX, and Linux platforms, which allows for the user to freely distribute, study, change, and improve the software. It is basically a free, super big, and complex calculator. You will be using R to accomplish all data analysis tasks in this class. You might be wondering “Why in the world do we need to know how to use a statistical software program?” Here are the main reasons:

1. You will be learning about new concepts in lecture and the readings. Applying these concepts using real data is an important form of learning. A statistical software program is the most efficient (and in many cases the only) means of running data analyses, not just in the cloistered setting of a university classroom, but especially in the real world. Applied data analysis will be the way we bridge statistical theory to the “real world.” And R is the vehicle for accomplishing this.

2. In order to do applied data analysis outside of the classroom, you need to know how to use a statistical program. There is no way around it. If you want to collect data on health, you need a program to store and analyze that data. 

\

The next question you may have is “I love Excel or SAS or SPSS or Stata [or insert your favorite program]. Why can’t I use that and forget your stupid R?” Here are some reasons:

1. R is free. Most programs are not.
2. R is open source, which means the software is community supported. This allows you to get help not from some big corporation (e.g. Microsoft with Excel), but from people all around the world who are using R. And R has **a lot** of users, which means that if you have a problem, and you pose it to the user community, someone will help you.
3. R is powerful and extensible (meaning that procedures for analyzing data that don’t currently exist can be readily developed).
4. R has the capability for mapping data, an asset not generally available in other statistical software.
5. If it isn’t already, R is becoming the de-facto data analysis tool in many fields, including for many CA DPH positions.

\

R is different from Excel in that it is generally not a point-and-click program. You will be primarily writing code to clean and analyze data. What does *writing* or *sourcing* code mean? A basic example will help clarify. Let’s say you are given a dataset with cancer cases across CA. You have a variable in the dataset representing age. Let’s say this variable is named **AGE**. To get the mean age of the people in your dataset, you would write code that would look something like this:


```{r}
# Download Cancer Dataset
download.file("https://raw.githubusercontent.com/pjames-ucdavis/SPH215/refs/heads/main/CA_Cancer_Data.rds", "ca_cancer.rds", mode = "wb")

# Read in Cancer Dataset
cancer <- readRDS("ca_cancer.rds")

# Preview the first few rows (the header row shows the column/variable names)
head(cancer)

# Get mean of AGE variable
mean(cancer$AGE)
```

\

The command tells the program to get the mean of the variable **AGE**. If you wanted the sum, you write the command `sum(cancer$AGE)`.
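If you want to try the same two commands without downloading anything, here is a minimal sketch on a handful of made-up ages. The object name `ages` and its values are hypothetical and are not taken from the cancer dataset.

```r
# Made-up ages for five people (hypothetical values, not from the cancer data)
ages <- c(34, 51, 47, 62, 56)

# Mean of the ages: (34 + 51 + 47 + 62 + 56) / 5 = 50
mean(ages)

# Sum of the ages: 250
sum(ages)
```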

Now, where do you write this command? You write it in a script. A script is basically a text file. Think of writing code as something similar to writing an essay in a word document. Instead of sentences to produce an essay, in a programming script you are writing lines of code to run a data analysis. We’ll go through scripting in more detail later in this lab, but the basic process of sourcing code to run a data analysis task is as follows.

1. Write code. First, you open your script file, and write code or various commands (like `mean(cancer$AGE)`) that will execute data analysis tasks in this file.
2. Send code to the software program to run (R in our case).
3. Program produces results based on code. The program reads in your commands from the script and executes them, spitting out results in its console screen.

\

I am skipping over many details, but the above steps outline the general work flow. You might now be thinking that you’re perfectly happy pointing and clicking your mouse in Excel (or wherever) to do your data analysis tasks. So, why should you adopt the statistical programming approach to conducting a data analysis?

1. Your script documents the decisions you made during the data analysis process. This is beneficial for many reasons.
  - It allows you to recreate your steps if you need to rerun or alter your analysis many weeks, months, or even years in the future.
  - It allows you to share your steps with other people. If someone asks you what were the decisions made in the data analysis process, just hand them the script.
  - Related to the above points, a script promotes **transparency** (here is what I did) and **reproducibility** (you can do it too). When you write code, you are forced to explicitly state the steps you took to do your research. When you do research by clicking through drop-down menus, your steps are lost, or at least documenting them requires considerable extra effort.
  
2. If you make a mistake in a data analysis step, you can go back, change a few lines of code, and **poof**, you’ve fixed your problem.

3. It is more efficient. In particular, cleaning data can encompass a lot of tedious work that can be streamlined using statistical programming.

Hopefully, I’ve convinced you that statistical programming and R are worthwhile to learn. Now let's talk about getting R on your computer!

\

# Getting R

R can be downloaded from one of the “CRAN” (Comprehensive R Archive Network) sites. In the US, the main site is at http://cran.us.r-project.org/. Look in the “Download and Install R” area at the top. Click on the appropriate link based on your operating system.

**If you already have R on your computer, make sure you have the most updated version of R on your personal computer (R version 4.5.2 ([Not] Part in a Rumble)).**

\

## Mac OS X
1. On the “R for Mac OS” page, there are multiple packages that could be downloaded. Depending on the model of your Mac, pick the appropriate .pkg file. *Note the details for some operating systems. If you are using an older operating system, please follow instructions.*

2. After the package finishes downloading, locate the installer on your hard drive, double-click on the installer package, and after a few screens, select a destination for the installation of the R framework (the program) and the R.app GUI. Note that you will have to supply the Administrator’s password. Close the window when the installation is done.

3. An application will appear in the Applications folder: R.app.

4. Browse to the [XQuartz](https://www.xquartz.org/) download page. Click on the most recent version of XQuartz to download the application.
 
5. Run the XQuartz installer. XQuartz is needed to create windows to display many types of R graphics: this used to be included in MacOS until version 10.8 but now must be downloaded separately.

\

## Windows
1. On the “R for Windows” page, click on the “base” link, which should take you to the “R-4.5.2 for Windows” page

2. On this page, click “Download R-4.5.2 for Windows”, and save the .exe file to your hard disk when prompted. Saving to the desktop is fine.

3. To begin the installation, double-click on the downloaded file. Don’t be alarmed if you get unknown publisher type warnings. Windows’ User Account Control will also worry about an unidentified program wanting access to your computer. Click on “Run”.

4. Select the proposed options in each part of the install dialog. When the “Select Components” screen appears, just accept the standard choices

\

# What is R Studio?
If you click on the R program you just downloaded, you will find a very basic user interface. For example, below is what I get on a Mac:

![The Basic R Console](R_basic.png)

\

We will not use R’s direct interface to run analyses in this class. Instead, we will use the program **RStudio**, which is much easier to interact with! RStudio gives you a true integrated development environment (IDE), where you can write code in a window, see results in other windows, see locations of files, see objects you’ve created, and so on. To clarify which is which: R is the name of the programming language itself and RStudio is an interface that makes writing code, running analyses, and visualizing data in R so much easier.

\

# Getting R Studio

To download and install RStudio, follow the directions below

1. Navigate to [RStudio’s download site](https://posit.co/download/rstudio-desktop/)

2. We've already downloaded R, so click on the appropriate link to Install RStudio based on your OS (Windows, Mac, Linux and many others). Do not download anything from the “All Installers and Tarballs” section.

3. Click on the installer that you downloaded. Follow the installation directions, making sure to keep all defaults intact. After installation, RStudio should pop up in your Applications or Programs folder/menu.

\

## The RStudio Interface

Open up RStudio. You should see the interface shown in the figure below which has three windows.

![The RStudio Console](RStudio_image.png)

\

- **Console** (bottom left) - The way R works is you write a line of code to execute some kind of task on a data object. The R Console allows you to run code interactively. The screen prompt `>` is an invitation from R to enter its world. This is where you type code in, press enter to execute the code, and see the results.
- **Environment, History, and Connections tabs** (upper-right)
  - **Environment** - shows all the R objects that are currently open in your workspace. This is the place, for example, where you will see any data you’ve loaded into R. When you exit RStudio, R will clear all objects in this window. You can also click on ![broom](broom.png) to clear out all the objects loaded and created in your current session.
  - **History** - shows a list of executed commands in the current session.
  - **Connections** - you can connect to a variety of data sources, and explore the objects and data inside the connection. I typically don’t use this window, but you [can](https://support.rstudio.com/hc/en-us/articles/115010915687-Using-RStudio-Connections).
- **Files, Plots, Packages, Help and Viewer tabs** (lower-right)
  - **Files** – shows all the files and folders in your current working directory
  - **Plots** – shows any charts, graphs, maps and plots you’ve executed
  - **Packages** – shows available R packages
  - **Help** – displays help documentation
  - **Viewer** – displays local web content


There is also a fourth window. But, we’ll get to this window a little later. The [assignment guidelines](Assignments_2026.html) also have more on this window!

\

## Setting RStudio Defaults

While not required, I strongly suggest that you change preferences in RStudio to never save the workspace so you always open with a clean environment. See [Ch. 8.1](https://r4ds.had.co.nz/workflow-projects.html#what-is-real) of R4DS for some more background.

1. From the menu bar in RStudio, open the Tools menu and then select Global Options.

2. If not already highlighted, click on the General button from the left panel.

3. Uncheck the following Restore boxes
  - Restore most recently opened project at startup
  - Restore previously open source documents at startup
  - Restore .RData into workspace at startup

4. Set Save Workspace to .RData on exit to "Never".

5. Click OK at the bottom to save the changes and close the preferences window. You may need to restart RStudio.

The reason for making these changes is that it is preferable for reproducibility to start each R session with a clean environment. You can restore a previous environment either by rerunning code or by manually loading a previously saved session.

The R Studio environment is modified when you execute code from files or from the console. If you always start fresh, you do not need to be concerned about things not working because of something you typed in the console, but did not save in a file.

You only need to set these preferences once.

\

# R Data Types

Let’s now explore what R can do. R is really just a big fancy calculator. For example, type in the following mathematical expression in the R console (left window):
```{r r1}
1+1
```
Note that spacing does not matter: `1+1` will generate the same answer as `1      +       1`. Can you say hello to the world?
```{r r2}
"hello world"
```
Looks great! **Note, we need to put quotes around it.** “hello world” is a character and R recognizes characters only if there are quotes around it. This brings us to the topic of basic data types in R. There are four basic data types in R: character, logical, numeric, and factors (there are two others - complex and raw - but we won’t cover them because they are rarely used in practice).

\

## Characters
Characters are used to represent words or letters in R. We saw this above with “hello world”. Character values are also known as strings. You might think that the value `"1"` is a number. Well, if you put quotes around, it isn’t! Anything with quotes will be interpreted as a character. No ifs, ands or buts about it.

\

## Logicals

A logical takes on two values: `FALSE` or `TRUE`. Logicals are usually constructed with comparison operators, which we’ll go through more carefully in Lab 2. Think of a logical as the answer to a question like “Is this value greater than (lower than/equal to) this other value?” The answer will be either `TRUE` or `FALSE`. `TRUE` and `FALSE` are logical values in R. For example, type in the following:

```{r r3}
3 > 2
```

\

This gives you a `TRUE`. What about the following?
```{r r4}
"declan" == "catherine"
```

\

## Numeric
Numerics are separated into two types: integer and double. The distinction between integers and doubles is usually not important. R treats numerics as doubles by default because it is a less restrictive data type. You can do any mathematical operation on numeric values. We added one and one above. We can also multiply using the `*` operator.

```{r r5}
2*3
```

\

And divide
```{r r6}
2/3
```

\

And take logs
```{r}
log(1)
```

\

```{r}
log(0)
```

Hold up! What is `-Inf`? Well, the logarithm of 0 is not a finite number, so R returns `-Inf`, meaning negative infinity. `-Inf` (along with its positive counterpart `Inf`) is a special numeric value that R uses when a result is unbounded.
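A quick way to convince yourself of how R treats this value is to ask R directly. This is a minimal sketch using only base R functions. The object name `neg_inf` is just for illustration.

```r
# Store the result of log(0) so we can inspect it
neg_inf <- log(0)

# R still counts -Inf as a numeric value...
is.numeric(neg_inf)

# ...but not as a finite one
is.finite(neg_inf)

# Dividing by zero gives positive infinity
1 / 0

# Some operations have no sensible answer at all and return NaN ("not a number")
0 / 0
```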

\

## Factors

Think of a factor as a categorical variable. It is sort of like a character, but not really. It is actually a numeric code with character-valued levels. Think of a character as a true string and a factor as a set of categories represented as characters. We won’t use factors too much in this course, so maybe don't worry about it for now!
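If you are curious what "a numeric code with character-valued levels" looks like in practice, here is a minimal sketch. The vector `sizes` and its values are made up for illustration.

```r
# A made-up character vector of categories (hypothetical values)
sizes <- c("low", "high", "low", "medium")

# Convert it to a factor
size_factor <- factor(sizes)

# The levels are the distinct categories, sorted alphabetically by default
levels(size_factor)

# Behind the scenes each value is stored as an integer code pointing at a level
as.integer(size_factor)

# So the underlying type is integer, even though the class is factor
typeof(size_factor)
class(size_factor)
```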

\

# R Data Structures
You just learned that R has four basic data types. Now, let’s go through how we can store data in R. That is, you type in the character “hello world” or the number 3, and you want to store these values. You do this by using R’s various data structures.

## Vectors

A vector is the most common and basic R data structure and is pretty much the workhorse of the language. A vector is simply a sequence of values which can be of any data type but all of the same type. There are a number of ways to create a vector depending on the data type, but the most common is to insert the data you want to save in a vector into the command `c()`. For example, to represent the values 4, 16, and 9 in a vector type in

```{r}
c(4, 16, 9)
```

\

You can also have a vector of character values
```{r}
c("catherine", "declan", "gwen")
```

The above code does not actually “save” the values 4, 16, and 9 or catherine, declan, gwen -- it just presents it on the screen in a vector. If you want to use these values again without having to type out `c(4, 16, 9)`, you can save it in a data **object**. At the heart of almost everything you will do (or are ever likely to do) in R is the concept that everything in R is an object. These objects can be almost anything, from a single number or character string (like a word) to highly complex structures like the output of a plot, a map, a summary of your statistical analysis or a set of R commands that perform a specific task.

You assign data to an object using the arrow sign `<-`. This will create an object in R’s memory that can be called back into the command window at any time. For example, you can save “hello world” to a vector called *b* by typing in
```{r}
b <- "hello world"
b
```

You can pronounce the above as “b becomes ‘hello world’”.

The first line tells R to store b as 'hello world.' In the next line, we are telling R to print what b is.

Note that R is **case sensitive**: if you type in *B* instead of *b*, you will get an error.

Similarly, you can save the numbers 4, 16 and 9 into a vector called *v1*.

```{r}
v1 <- c(4, 16, 9)
v1
```

\

You should see the objects *b* and *v1* pop up in the Environment tab on the top right window of your RStudio interface.

\

![Environment Window](lab0fig.png)

\

Note that the name *v1* is nothing special here. You could have named the object *x* or *sph215* or your pet’s name (mine was Ali Baba). You can’t, however, name objects using special characters (e.g. !, @, $) or only numbers (although you can combine numbers and letters, but a number cannot be at the beginning e.g. *2d2*). For example, you’ll get an error if you save the vector *c(4,16,9)* to an object with the following names

```{r, results='asis', echo=FALSE}
cat("````markdown\n",
    "123 <- c(4, 16, 9)\n",
    "!!! <- c(4, 16, 9)\n",
    "````", sep = "")
```


```{r}
## Error: <text>:2:5: unexpected assignment
## 1: 123 <- c(4, 16, 9)
## 2: !!! <-
##        ^
```

\

Also note that to distinguish a character value from a variable name, it needs to be quoted. “v1” is a character value whereas `v1` is a variable. One of the most common mistakes for beginners is to forget the quotes.

```{r, results='asis', echo=FALSE}
cat("````markdown\n",
    "james\n",
    "````", sep = "")
```

```{r}
## Error in eval(expr, envir, enclos): object 'james' not found
```

The error occurs because R tries to print the value of object *james*, but there is no such variable. So remember that any time you get the error message `object 'something' not found`, the most likely reason is that you forgot to quote a character value. If not, it probably means that you have misspelled, or not yet created, the object that you are referring to. We’ve included the common pitfalls and R tips in this class [resource](R_help_2026.html).

Every vector has two key properties: *type* and *length*. The type property indicates the data type that the vector is holding. Use the command `typeof()` to determine the type.

```{r}
typeof(b)
```

\

```{r}
typeof(v1)
```

Note that a vector cannot hold values of different types. If different data types exist, R will coerce the values into the highest type based on its internal hierarchy: logical < integer < double < character. Type in `test <- c("r", 6, TRUE)` in your R console. What is the vector type of `test`?
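To see the hierarchy at work on a few other combinations, here is a small sketch using only `typeof()` and `c()`. The `L` suffix is how you ask R for an integer rather than a double.

```r
# A plain number is a double by default; the L suffix makes it an integer
typeof(1)
typeof(1L)

# logical + integer -> integer (TRUE becomes 1)
typeof(c(TRUE, 2L))
c(TRUE, 2L)

# integer + double -> double
typeof(c(2L, 3.5))

# double + character -> character (3.5 becomes "3.5")
typeof(c(3.5, "r"))
```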

\

The command `length()` determines the number of data values that the vector is storing.

```{r}
length(b)
```

```{r}
length(v1)
```

You can also directly determine if a vector is of a specific data type by using the command `is.X()` where you replace `X` with the data type. For example, to find out if *b* and *v1* are numeric, type in:

```{r}
is.numeric(b)
```

```{r}
is.numeric(v1)
```

\

There is also `is.logical()`, `is.character()`, and `is.factor()`. You can also coerce a vector of one data type to another. For example, save the values “1” and “2” (both in quotes) into a vector named *x1*.
```{r}
x1 <- c("1", "2")
typeof(x1)
```

\

To convert *x1* into a numeric, use the command `as.numeric()`
```{r}
x2 <- as.numeric(x1)
typeof(x2)
```

\

There is also `as.logical()`, `as.character()`, and `as.factor()`.
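Coercion only works when R can make sense of the value. The sketch below shows what happens when it cannot. The vectors `x3` and `x4` are made up for illustration.

```r
# A character vector where the last value does not look like a number
x3 <- c("1", "2", "three")

# R converts what it can and replaces the rest with NA (and prints a warning)
x4 <- as.numeric(x3)
x4

# is.na() tells you which values failed to convert
is.na(x4)

# The same idea applies to logicals: "yes" is not recognized, so it becomes NA
as.logical(c("TRUE", "false", "yes"))
```

An `NA` is R's way of marking a missing value, so a warning about NAs after a conversion is a good prompt to go back and check your data.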

An important practice you should adopt early is to keep only necessary objects in your current R Environment. For example, we will not be using *x2* any longer in this guide. To remove this object from R forever, use the command `rm()`
```{r}
rm(x2)
```

The vector *x2* should have disappeared from the Environment tab. Au revoir! 
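If you would rather check from the console than from the Environment tab, base R's `exists()` and `ls()` can help. Here is a minimal sketch using a throwaway object (the name `scratch` is hypothetical).

```r
# Create a throwaway object
scratch <- c(1, 2, 3)

# exists() takes the object's name in quotes and reports whether R can find it
exists("scratch")

# Remove the object, then check again
rm(scratch)
exists("scratch")

# ls() lists the names of everything currently in your environment
ls()
```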

\

Also note that when you close down R Studio, the objects you created above will disappear for good. Unless you save them onto your hard drive (we’ll touch on saving data in later labs), all data objects you create in your current R session will go bye bye when you exit the program.

\

## Data Frames
We learned that data values can be stored in data structures known as vectors. The next step is to learn how to store vectors into an even higher level data structure. The data frame can do this. Data frames store vectors of the same length. Create a vector called v2 storing the values 5, 12, and 25.
```{r}
v2 <- c(5,12,25)
```

\

We can create a data frame using the command `data.frame()` storing the vectors *v1* and *v2* as columns.
```{r}
data.frame(v1,v2)
```

\

Store this data frame in an object called df1
```{r}
df1<-data.frame(v1, v2)
```

df1 should pop up in your Environment window. You’ll notice a ![lab0fig2.png](lab0fig2.png) next to *df1*. This tells you that *df1* possesses or holds more than one object. Click on ![lab0fig2.png](lab0fig2.png) and you’ll see the two vectors we saved into *df1*. Another nice thing you can do is directly click on *df1* from the Environment window to bring up an Excel style worksheet on the top left of your RStudio interface. You can also type in:
```{r}
View(df1)
```
to bring the worksheet up. You can’t edit this worksheet directly, but it allows you to see the values that a higher level R data object contains.

\

We can store different types of vectors in a data frame. For example, we can store one character vector and one numeric vector in a single data frame.
```{r}
v3 <- c("catherine", "declan", "gwen")
df2 <- data.frame(v1, v3)
df2
```

\

For higher level data structures like a data frame, use the function `class()` to figure out what kind of object you’re working with.
```{r}
class(df2)
```

\

`length()` is not very helpful on a data frame: it returns the number of columns rather than the number of values. Instead, a data frame has *dimensions* - the number of rows and columns. You can find the number of rows and columns that a data frame has by using the command `dim()`
```{r}
dim(df1)
```
Here, the data frame *df1* has 3 rows and 2 columns. Data frames also have column names, which are characters.

\

We can figure out the names of the columns using `colnames()`.
```{r}
colnames(df1)
```

In this case, the data frame used the vector names for the column names.
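You are not stuck with the vector names. As a minimal sketch, the code below gives the columns explicit names when building the data frame and then checks its size one piece at a time. The column names `first` and `second` and the object `df_named` are made up for illustration.

```r
# Recreate the two vectors so this example runs on its own
v1 <- c(4, 16, 9)
v2 <- c(5, 12, 25)

# Name the columns explicitly with name = vector (hypothetical names)
df_named <- data.frame(first = v1, second = v2)
colnames(df_named)

# nrow() and ncol() return the two pieces of dim() separately
nrow(df_named)
ncol(df_named)

# length() on a data frame counts columns, not values
length(df_named)
```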

\

We can extract columns from data frames by referring to their names using the `$` sign.
```{r}
df1$v1
```

\

We can also extract data from data frames using brackets [ , ]
```{r}
df1[,1]
```

The value before the comma indicates the row, which you leave empty if you are not selecting by row, which we did above. The value after the comma indicates the column, which you leave empty if you are not selecting by column. The above line of code selected the first column. 

\

Let’s now select the 2nd row.
```{r}
df1[2,]
```

\

OK, so that wasn't too hard. Now let's try something a little trickier! What is the value in the 2nd row and 1st column?
```{r}
df1[2,1]
```

See -- we can do hard things!
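The brackets accept more than single numbers. The sketch below shows a few variations that follow the same `[row, column]` pattern: a column name in quotes, several rows at once, and a logical condition. It rebuilds *df1* so that it runs on its own.

```r
# Recreate df1 so this example runs on its own
v1 <- c(4, 16, 9)
v2 <- c(5, 12, 25)
df1 <- data.frame(v1, v2)

# Select a column by name instead of position
df1[, "v2"]

# Select rows 1 and 3 by passing a vector of row numbers
df1[c(1, 3), ]

# Select only the rows where v1 is greater than 5 (rows 2 and 3)
df1[df1$v1 > 5, ]

# Combine both: the v2 values for rows where v1 is greater than 5
df1[df1$v1 > 5, "v2"]
```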

\

# Functions
Let’s take a step back and talk about functions (also known as commands or macros (in SAS)). An R function is a packaged recipe that converts one or more inputs (called arguments) into a single output. You execute all of your tasks in R using functions. We have already used a couple of functions above including `typeof()` and `colnames()`. Every function in R will have the following basic format

`functionName(arg1 = val1, arg2 = val2, ...)`

In R, you type in the function’s name and set a number of options or parameters within parentheses that are separated by commas. Some options need to be set by the user - i.e. the function will spit out an error because a required option is blank - whereas others can be set but are not required because there is a default value established.
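One way to see the difference between a required option and one with a default is to write a tiny function of your own. This is only an illustrative sketch: the function `power()` and its arguments are made up here and are not part of R or any package.

```r
# A made-up function: x is required, exponent has a default value of 2
power <- function(x, exponent = 2) {
  x^exponent
}

# Only the required argument is supplied, so the default exponent is used
power(3)

# Override the default by setting the argument yourself
power(3, exponent = 3)

# Leaving out a required argument produces an error; tryCatch() captures
# the error message here so the rest of the code keeps running
result <- tryCatch(power(), error = function(e) conditionMessage(e))
result
```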

Let’s use the function `seq()` which makes regular sequences of numbers. You can find out what the options are for a function by calling up its help documentation by typing `?` and the function name

```{r}
? seq
```

The help documentation should pop up in the bottom right window of your RStudio interface. The documentation should also provide some examples of the function at the bottom of the page. Type the arguments from = 1, to = 10 inside the parentheses.

```{r}
seq(from = 1, to = 10)
```

\

You should get the same result if you type in:
```{r}
seq(1, 10)
```

The code above demonstrates something about how R resolves function arguments. When you use a function, you can always specify all the arguments in `arg = value` form. But if you do not, R attempts to resolve by position. So in the code above, it is assumed that we want a sequence `from = 1` that goes `to = 10` because we typed 1 before 10. Type in 10 before 1 and see what happens. Since we didn’t specify step size, the default value of `by` in the function definition is used, which ends up being 1 in this case.
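Here are a few more calls to `seq()` that show the same rules, as a small sketch. The arguments `by` and `length.out` are both listed in the help documentation for `seq()`.

```r
# Set the step size yourself instead of relying on the default
seq(1, 10, by = 2)

# Named arguments can be given in any order
seq(to = 10, from = 1)

# Ask for a fixed number of evenly spaced values instead of a step size
seq(1, 10, length.out = 4)
```

Naming your arguments takes a little more typing, but it makes code easier to read and protects you from mixing up the order.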

\

# Packages

Functions do not exist in a vacuum, but exist within [R packages](https://r-pkgs.org/intro.html). Packages are the fundamental units of reproducible R code. They include reusable R functions, the documentation that describes how to use them, and sample data. At the top left of a function’s help documentation, you’ll find in curly brackets the R package that the function is housed in. For example, type in your console `? seq`. At the top left of the help documentation, you’ll find that `seq()` is in the package **base**. All the functions we have used so far are part of packages that have been pre-installed and pre-loaded into R.

In order to use functions in a new package, you first need to install the package using the `install.packages()` command. For example, we will be using commands from the package **tidyverse** in this lab.

```{r}
options(repos = c(CRAN = "https://cloud.r-project.org"))
install.packages("tidyverse")
```
You should see a bunch of gibberish roll through your console screen. Don’t worry, that’s just R downloading all of the other packages and applications that **tidyverse** relies on. These are known as [dependencies](https://r-pkgs.org/dependencies-mindset-background.html). Unless you get a message in red that indicates there is an error (like we saw when we typed in “hello world” without quotes), you should be fine.

Next, you will need to load packages in your working environment (every time you start RStudio). We do this with the `library()` function. Notice there are no quotes around **tidyverse** this time (just to make things trickier for us!).

```{r}
library(tidyverse)
```
The Packages window at the lower right of your RStudio interface shows you all the packages you currently have installed. If a package is not listed in this window, you’ll need to use the `install.packages()` function to install it. If the package is checked, that means it is loaded into your current R session.

For example, here is a section of my Packages window:
![window1.png](window1.png)

\

The only package loaded into my current session is **methods**, a package that is loaded every time you open an R session. Let’s say I use `install.packages()` to install the package **matrixStats**. The window now looks like:
![window2.png](window2.png)

\

Let's load **matrixStats** using `library()`, and then we will see a check mark appear next to **matrixStats**.

```{r}
install.packages("matrixStats")
library(matrixStats)
```


\

Look at us!

To uninstall a package, use the function `remove.packages()`.

**Note that you only need to install packages once with `install.packages()`, but you need to load them each time you relaunch RStudio with `library()`.** Repeat after me: Install once, library every time. If you need to reinstall R or update to a new version of R, you will need to reinstall all packages. And as noted earlier, R has several packages already preloaded into your working environment. These are known as **base** packages and a list of their functions can be found [here](https://stat.ethz.ch/R-manual/R-devel/library/base/html/00Index.html). 
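
The "install once, library every time" rule can be folded into a small helper. This is a minimal sketch, not a standard function: `load_or_install()` is a hypothetical name introduced here, built on base R's `requireNamespace()`, `install.packages()`, and `library()`.

```{r}
# Hypothetical helper: install a package only if it is missing, then load it
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # runs only when the package is missing; needs internet
  }
  # character.only = TRUE lets library() accept a name stored in a variable
  library(pkg, character.only = TRUE)
}

# "stats" ships with R, so nothing is downloaded in this example
load_or_install("stats")
```

Calling `load_or_install("matrixStats")` would download the package the first time and simply load it on every later run.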

\

# Tidyverse

In most labs, we will be using commands from the **tidyverse** package. [Tidyverse](https://www.tidyverse.org/) is a collection of high-powered, consistent, and easy-to-use packages developed by a number of thoughtful and talented R developers. The consistency of the **tidyverse**, together with the goal of increasing productivity, means that the syntax of tidy functions is typically straightforward to learn. You can read more about **tidyverse** principles in Chapter 9, pages 147-151 in RDS.

Excited about entering the tidyverse? I bet you are, so here is a badge to show your excitement!

![Your Tidyverse Badge](tidyverse.png)

\

## Tibbles

Although the **tidyverse** works with all data objects, its fundamental object type is the tibble. Tibbles are not only a super fun word to say, they are data frames that tweak some older behaviors to make life a little easier. There are two main differences in the usage of a data frame vs a tibble: printing and subsetting. Let’s be clear here -- tibbles are just a special kind of data frame. They just make things “tidier.” Let’s bring in some data to illustrate the differences and similarities between data frames and tibbles. Install the package **nycflights13**.

```{r}
install.packages("nycflights13")
```

\

Make sure you also load the package.

```{r}
library(nycflights13)
```

\

If you look in the upper-right *Environment* tab and click on *Global Environment*, you will see there is a dataset called **flights** included in this package. It includes information on all 336,776 flights that departed from New York City in 2013. Let’s save this dataset as an object in the local R environment.

```{r}
nyctibble <- flights
class(nyctibble)
```

\

This dataset is a tibble. Let’s also save it as a regular data frame by using the `as.data.frame()` function.

```{r}
nycdf <- as.data.frame(flights)
class(nycdf)
```

\

The first difference between data frames and tibbles is how the dataset looks. Tibbles have a refined print method that shows only the first 10 rows, and only the columns that fit on the screen. In addition, each column reports its name and type.

```{r}
nyctibble
```

\

Tibbles are designed so that you don’t overwhelm your console when you print large data frames. Compare the print output above to what you get with a data frame.

```{r eval=FALSE}
nycdf
```

\

Um, that was a lot... The tibble is much cleaner. You can bring up an Excel-like worksheet view of the tibble (or data frame) using the `View()` function.
```{r}
View(nyctibble)
```
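
If ten rows are not enough, a tibble's print method can be adjusted rather than falling back to a full data frame dump. The sketch below assumes the **tibble** package (installed along with **tidyverse**) and uses the built-in `mtcars` data instead of `flights` so that it runs on its own; `cars_tbl` is just an illustrative name.

```{r}
library(tibble)

cars_tbl <- as_tibble(mtcars)

# Show 15 rows instead of the default 10
print(cars_tbl, n = 15)

# Show every column, even if they do not fit the console width
print(cars_tbl, width = Inf)

# One line per column: name, type, and the first few values
glimpse(cars_tbl)
```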

\

You can identify the names of the columns (and hence the variables in the dataset) by using the function `names()`.
```{r}
names(nyctibble)
```
Those may come in handy if we wanted to analyze the data!

\

Finally, let's convert a regular data frame to a tibble using the `as_tibble()` function.

```{r}
as_tibble(nycdf)
```
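
The second difference mentioned at the start of this section, subsetting, can be sketched with a small made-up data frame. The column names and values below are invented for illustration, and the example assumes the **tibble** package (installed along with **tidyverse**).

```{r}
library(tibble)

# Small made-up data set, stored both ways
df <- data.frame(flight = c(101, 202, 303), dep_delay = c(2, 4, -1))
tb <- as_tibble(df)

# Selecting one column with [ ]: the data frame collapses to a plain vector...
class(df[, "dep_delay"])   # "numeric"
# ...while the tibble stays a tibble
class(tb[, "dep_delay"])   # "tbl_df" "tbl" "data.frame"

# $ on a data frame quietly completes partial names ("dep" matches dep_delay)
df$dep
# $ on a tibble requires the full name: this returns NULL with a warning
tb$dep
```

In short, a tibble is stricter and more predictable: `[ ]` always gives back a tibble, and `$` never guesses which column you meant.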

Not all functions work with tibbles, particularly those that are specific to spatial data. As such, we’ll be using a combination of tibbles and regular data frames throughout the class, with a preference towards tibbles where possible. Note that when you search on Google for how to do something in R, you will likely get **non-tidy ways** of doing things. Most of these suggestions are fine, but some are not and may screw you up down the road. My advice is to try to stick with tidy functions to do things in R.

Anyway, you earned another badge. Yes!

![Your Tibble Badge](tibble.png)

\


