Presented by the Environmental Leadership and Training Initiative
Instructor: David Woodbury, Masters of Forest Science Candidate, 2019, Yale School of Forestry and Environmental Studies

An Introduction to R: A tool for data analysis of ecological restoration monitoring

The Goals of This Course
• Provide the tools and knowledge to get past the initial hurdle of learning the basics of R and begin to tackle challenges in R independently.
• Widen the range of data analyses you are able to perform.
• Show the power of R to inspire continued study of the R programming language.
• Create a network of people learning R to tackle challenges together and encourage each other to continue studies in R after the course is complete.

Course Schedule
Day 1
• Downloading R and RStudio and installing them on each student’s computer.
• Lesson 1 - The basics.
• Lab 1 - An introduction to swirl() (an R package, run from within RStudio, that teaches beginners how to use R).
• Lesson 2 - Importing and manipulating data.
Day 2
• Lesson 3 - Descriptive stats.
• Lesson 4 - Statistical tests and linear regression.
• Lesson 5 - Community ecology vegetation analysis.
Day 3
• Small group projects conducted by 4-5 students.
• Short presentations of results from individual projects.

Course Format
The course will be taught in “Lessons”. Each “Lesson” will include:
1. A brief presentation introducing a set of R operations.
2. Demonstrations of how to perform the operations in R, with individual or group work to reinforce learning.

Final Group Project: This is a chance for you to demonstrate the skills that you have learned in the course. Using a volunteer dataset, you will carry out data analyses relevant to that data. In the presentation you will tell us what analyses you did and why, describe the operations you performed in R, show us your results (including graphs), and speak about their interpretation.
Download and Install Instructions: Windows

To install R:
1. Open an internet browser and go to www.r-project.org.
2. Click the “download R” link in the middle of the first paragraph under “Getting Started.”
3. Select a CRAN location (a mirror site) and click the corresponding link.
4. Click on the “Download R for Windows” link at the top of the page.
5. Click on the “install R for the first time” link at the top of the page.
6. Click “Download R for Windows” and save the executable file somewhere on your computer.
7. Run the .exe file and follow the installation instructions.

Now that R is installed, you need to download and install RStudio.

To install RStudio:
1. Go to www.rstudio.com and click on the “Download RStudio” button.
2. Click on “Download RStudio Desktop.”
3. Click on the version recommended for your system, or the latest Windows version, and save the executable file.
4. Run the .exe file and follow the installation instructions.

If a window pops up to ask if you want to install “command line developer tools”, there is no need.

Download and Install Instructions: Mac

To install R:
1. Open an internet browser and go to www.r-project.org.
2. Click the “download R” link in the middle of the first paragraph under “Getting Started.”
3. Select a CRAN location (a mirror site) and click the corresponding link.
4. Click on the “Download R for (Mac) OS X” link at the top of the page.
5. Click on the file containing the latest version of R under “Files.”
6. Save the .pkg file, double-click it to open, and follow the installation instructions.

Now that R is installed, you need to download and install RStudio.

To install RStudio:
1. Go to www.rstudio.com and click on the “Download RStudio” button.
2. Click on “Download RStudio Desktop.”
3. Click on the version recommended for your system, or the latest Mac version, and save the .dmg file on your computer.
4. Double-click it to open, and then drag and drop RStudio into your Applications folder.
If a window pops up to ask if you want to install “command line developer tools”, there is no need.

Lesson 1: The Basics

Outline of Lesson 1
1. A brief history of R.
2. What is R?
3. Why is it so popular?
4. Types of data used in R.
5. Functions and packages.

1.) A Brief History of R
• Two statistics professors, Robert Gentleman and Ross Ihaka, from the University of Auckland came up with the idea in 1991.
• They both wanted technology better suited for their statistics students, who needed to analyze data and produce graphical models of the information. At that point, most comparable software had been designed by computer scientists and proved hard to use.
• They officially launched the open source program in 1996.
• It quickly gained popularity because it was free, could be customized to fit the needs of individuals, and was, in fact, easier to use than many other statistical software programs.

2.) So what is R exactly?
• Simply put, R is a computer programming language used for data analysis.
So what is a computer programming language?
• A computer programming language is any of various languages for expressing a set of detailed instructions for a digital computer.
So again, what is R exactly?
• R is a computer programming language made for statistics; it gives a computer detailed sets of instructions on how to perform statistical analyses and produce graphics.
• Here is a link to a New York Times article about R and why it is so popular: https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=1&pagewanted=all

3.) Why is R so popular with ecologists?
• Ecologists began customizing R to fit their specific data analysis needs almost as soon as R was released.
• At present, there are many packages (toolboxes that contain functions for specific and complex analyses) available specifically for ecology.
• The vegan and labdsv packages are both widely used for community vegetation ecology analyses.
• The adehabitat package has many functions useful in the analysis of wildlife data.
4.) What kinds of data can R use?
• R can store data in a wide variety of structures including vectors, matrices, data frames, and lists. All of these can be made of numbers, characters, or logical values.

Matrices and Vectors
Matrix
• A rectangular array of one type of data (usually numbers or characters) arranged in rows and columns.
Vectors
• A list of values of one type of data (for example, numeric or character).
• A column of DBH measurements from an Excel datasheet is a numeric vector (i.e., a list of numbers).
• A column of species names is a character vector (i.e., a list of characters).
• A vector has no fixed orientation in R; it can be used as either a column (vertical) or a row (horizontal).

Data Frames and Lists
Data Frame
• A data frame is more general than a matrix, in that different columns can have different modes (numeric, character, factor, etc.).
Lists
• An ordered collection of objects (components). A list allows you to gather a variety of (possibly unrelated) objects under one name.

5.) Functions and Packages
Functions
• Data manipulation or calculation shortcuts.
• Easiest to explain with an example: mean(x) returns the average of the values in x.
Packages
• Can be thought of as toolboxes that contain functions for doing data analyses in R.
• Often they are created by individuals to perform specific types of data analyses.
• The package which we will use most in this class is the “vegan” package, or the Vegetation Analysis package. It contains functions for community ecology analyses, such as functions for calculating rarefaction curves and species diversity indices.

Lab 1: The Basic Building Blocks of R (Using the “swirl” package)

Lesson 2: Importing and Manipulating Data

Outline of Lesson 2
1. The data.
2. The working directory.
3. Importing data.
4. Subsetting data.
5. Using the “$” operator.
6. Using “[]” commands.
7. Using the “==” operator.
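Before the Lesson 2 material begins, the four data structures and the idea of a function from Lesson 1 can be sketched in a few lines of R. This is a minimal illustration only: every object name and value below (dbh, species, trees, site) is made up for the example and does not come from the course datasets.

```r
# All names and values here are illustrative assumptions, not course data.

# Vectors: one type of data each
dbh <- c(12.5, 30.1, 8.4)                  # numeric vector
species <- c("Inga", "Cecropia", "Inga")   # character vector

# Matrix: one type of data arranged in rows and columns
m <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns may hold different types
trees <- data.frame(species = species,
                    dbh = dbh,
                    planted = c(TRUE, FALSE, TRUE))

# List: gathers possibly unrelated objects under one name
site <- list(name = "Plot 1", counts = m, trees = trees)

# Function: a shortcut that takes input and returns a result
mean_dbh <- mean(dbh)

str(trees)       # shows the type of each column
print(mean_dbh)

# Packages are installed once, then loaded in each session:
# install.packages("vegan")
# library(vegan)
```

The install.packages() and library() lines are commented out so the sketch runs without an internet connection; uncomment them on your own computer to install and load a package.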
The Data
[Figure: Transect Design. Labels on the diagram include plot numbers 1 to 9, planting years 2010, 2012, and 2016, and dimensions of 2 m and 60 m.]

The Data: Understory Vegetation Data
Variables
• Planting Year
• Plot Number
• Subplot
• Species
• Life Form
• Count (number of individuals)
• Whether species was planted or recruited naturally

1.) Setting The Working Directory
What is the “working directory”? It is simply the folder on your computer where you have saved the data that you wish to bring into R.
What is “setting your working directory”? It is telling R which folder to look in to find the data file that you want to bring into R.
So how do I set the working directory? With this function:
setwd("Address of your desired working directory folder goes here")

2.) Importing Data Into R
• Without add-on packages, R cannot read .xlsx files.
• Excel files must first be saved as .csv files before they can be read into R.
• The function to import a .csv file into R is read.csv().
• Often when we import data we want to assign the data to a variable so we can view and manipulate it.
• Here is an example of what that command might look like:
data <- read.csv("David_Understory_Data.csv")

3.) Tidy Data
It is best to format a table to be used in R as follows:
• Each column is a variable.
• Each row is an observation.
This is known as having tidy data, and tidy data are much easier to manipulate in R.

4.) Subsetting Data
Usually we are only interested in a small portion of a dataset for a particular analysis.
Ex.) If we want to know the number of shrubs in Plot 1 from the data table on the right, we need three pieces of information.
1. The observations (rows) that are shrubs.
2. The variable (column) identifying which plot the observation belongs to.
3. The variable (column) that holds the abundance data, in this case the column titled “Count”.
How do we extract only the data needed? R has several ways of doing this, in a process called “subsetting.”

5.)
The “$” Operator
The data on the right is a data frame that we have pulled into R using the read.csv() function and assigned to a new variable (data object) called “data”, and we want to subset it.
data <- read.csv("David_Understory_Data.csv")
The $ operator will give you a single column from your dataset. The code:
species <- data$Species
will give us a new variable named “species” that contains just the column of species names, stored as a vector.

6.) Brackets “[]”
Brackets allow you to specify the location within a data frame of the data that you want. The format for using brackets is:
newobject <- data[row, column]
The variable (also known as a data object) that contains the data frame you are subsetting always comes before the brackets.
Ex.) We want the value of the cell in the data frame to the right that gives the abundance for Scleria sp. in Plot 1, Subplot Aa. To get that single-cell subset of our data we would use the command:
X <- data[3, 6]
which returns the value in row 3, column 6.

What if we want more than just one number? To get a whole column of data, leave the row term blank.
Ex.) column4 <- data[,4]
This will give you a new data object with only the data from column 4. Note: this is the same as doing column4 <- data$Species

What if we want just the data from Subplot Aa? We need a range of rows from Row 1 to Row 5. Here is the code to get the data we want:
subAa <- data[1:5,]
The “:” lets R know we want every row from 1 through 5. Also, by leaving the column term blank we get all columns in the dataset.

Now what if we want just Subplot Ab and just the columns for the year planted and species name? That code is:
Ab.year.species <- data[6:12, c(1, 4)]
We are telling R we want all the rows from 6 through 12 and only columns 1 and 4.

7.) The If and Only If Operator “==”
What if we want only the observations (rows) for shrubs?
For this data table we could do:
shrub <- data[c(1,12,13),]
But imagine now that we have a dataset with hundreds or even thousands of observations! It would be very tedious to go through every line of the table to find every shrub observation. Luckily there is another way to subset data: the “==” operator, which will help us choose only the rows that are observations for shrubs.
The “==” operator tests whether two values are equal and returns TRUE or FALSE for each row; here it can be read as “if and only if”. The new subset method for choosing just the shrubs would thus look like this:
shrub <- data[data[,5] == "Shrub",]
With the outer square brackets we are specifying that we want whole rows, because we leave the column term blank. In the row term we are saying we want a row “if and only if” the value in column 5 is “Shrub”.

Lesson 3: Descriptive Statistics

Data: Barro Colorado Island Tree Data
• The data we will use come from the CTFS 50 hectare plot on Barro Colorado Island in Panama.
• The variables are: tree number, forest type, species, DBH, height of measurement, basal area, and plot number.
• Each plot is 20 x 20 m, and every tree greater than 1 cm DBH within each plot was identified and recorded.
• There are two separate datasets that we will use. Each is a subset of the full 50 hectare plot and has 6 plots. One dataset is plots randomly selected from secondary forest and the other is plots randomly selected from primary forest within the 50 hectare plot.

Descriptive Stats Functions
Measures of Center
• Average - mean(BCI.P$dbh.mm)
• Median - median(BCI.P$dbh.mm)
• Mode - base R has no built-in function for the statistical mode. mode(BCI.P$dbh.mm) reports how the data are stored (for example, "numeric"), not the most common value. One option is names(which.max(table(BCI.P$dbh.mm))).
Measures of Spread
• Variance - var(BCI.P$dbh.mm)
• Standard Deviation - sd(BCI.P$dbh.mm)
• Range - range(BCI.P$dbh.mm) (returns the minimum and maximum)
Sum and Summary
• Sum - sum(BCI.P$dbh.mm)
• Summary Statistics - summary(BCI.P$dbh.mm)
• Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
50.0 62.0 89.0 131.2 138.0 1150.0

More Functions For Data Manipulation
The aggregate() function
• Finds descriptive statistics by a factor or group within a data set.
• Ex.) Find the average DBH and group the results by plot.
rbind() and cbind() functions
• Row bind and column bind are functions that can bind two separate data objects together if they have the same data structure.
• To put one data set on top of another use rbind() because you are binding rows together (the datasets must have the same number of columns, with the same type of data in each column).
• To put one data set beside another use cbind() because you are binding columns together (the datasets must have the same number of rows).

Lesson 4: Statistical Tests and Graphs

New Dataset
DANIELA.xlsx contains data collected by Daniela Cusack from three plantations. Each plantation was divided into areas with homogeneous overstory tree species of six types. There are factors which predicted the number of individual saplings in each of three height classes. The saplings were also classified in terms of the dispersal mechanism associated with that species. The three dispersal mechanisms were birds, mammals, or other (includes wind, water, bats, gravity).

[Figure: The Stat Universe, from Dr. Jonathan Reuning-Scherer]

Statistical Tests and Functions
In these examples x is a vector, y is a vector of the same data type, and A is a grouping factor.
Parametric tests for comparing two or more groups
• Paired t-test - t.test(x, y, paired = TRUE)
• Unpaired t-test - t.test(x, y, paired = FALSE)
• Pearson correlation - cor(x, y) (use cor.test(x, y) for a significance test)
• One-way analysis of variance - aov(y ~ A)
Non-parametric equivalents
• Wilcoxon signed-rank test - wilcox.test(x, y, paired = TRUE)
• Mann-Whitney U test (also called the Wilcoxon rank sum test) - wilcox.test(x, y, paired = FALSE)
• Spearman correlation - cor(x, y, method = "spearman")
• Kruskal-Wallis test - kruskal.test(y ~ A)

Meeting Test Assumptions
Assumptions for T-tests and ANOVA
1.) Independent observations.
2.) Normality: the dependent variable must follow a normal distribution in the population. This is only needed for samples smaller than about 25 units.
3.)
Homogeneity: Equal variance. We only need this assumption if our sample sizes are (sharply) unequal.
Functions Used:
hist() - Creates a histogram.
qqPlot() - Creates a normal quantile plot (this function is from the “car” package).

Regression
Linear regression, where x is an independent variable that predicts a response in a dependent variable y.
The R function is lm().
Ex.) Reg <- lm(y~x)
This can be read as “y is predicted by x”.
The summary() function can be used to look at the results of the regression. It gives:
• Coefficients
• R-squared
• P-value

Regression: Checking Model Assumptions
Assumptions for Linear Regression
1.) Linearity. A linear relationship between the dependent and independent variables.
2.) Normality. Residuals are normally distributed.
3.) No autocorrelation.
4.) Homoscedasticity.
Functions used:
plot(y~x) - Creates a scatterplot of y predicted by x.
plot(Reg) - Creates diagnostic plots, including Residuals vs Fitted and a normal quantile plot, when “Reg” is a linear model data object.

Adding Labels and Regression Lines to Graphs
Adding a regression line:
plot(x,y) or plot(y~x), where both x and y are continuous variables, makes a scatterplot.
abline(Reg), where “Reg” is a linear model data object, can be used in the code line following the plot() function to add a regression line to the plot. Note that Reg is written without quotation marks.
Adding labels:
Labels are added using extra arguments within the plotting function.
xlab = "x" - will label the x-axis “x”
ylab = "y" - will label the y-axis “y”
main = "y predicted by x" - will create a title for the graph that says, “y predicted by x”. The argument name is lowercase because R is case sensitive.
* All of the same arguments can be used within the hist() function.

People Whose Work Contributed to This Presentation
• I would like to give special thanks to both Dr. Jonathan Reuning-Scherer and Dr. Simon Queenborough from Yale University for advising me and contributing to this presentation.
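As a recap of Lesson 4, the regression workflow (fit with lm(), inspect with summary(), plot, add the line, check assumptions) can be sketched end to end. This is a minimal sketch on simulated data: the x and y values are generated with rnorm() purely for illustration and are not the course dataset.

```r
# Simulated data for illustration only (an assumption, not course data)
set.seed(1)
x <- 1:30                               # independent variable
y <- 2 + 0.5 * x + rnorm(30, sd = 1)    # dependent variable with random noise

# Fit the linear model: "y is predicted by x"
Reg <- lm(y ~ x)

# Coefficients, R-squared, and p-value
print(summary(Reg))

# Scatterplot with axis labels and a title, then the fitted line
plot(y ~ x, xlab = "x", ylab = "y", main = "y predicted by x")
abline(Reg)

# Check assumptions: Residuals vs Fitted and normal quantile plots
plot(Reg, which = 1:2)

# Histogram of the residuals as another look at normality
hist(residuals(Reg), xlab = "Residual", main = "Residuals of Reg")
```

When run as a script rather than interactively, R writes the plots to a PDF file instead of opening a plot window.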
List of Online Resources For Learning R

http://www.dataanalytics.org.uk/Publications/S4E2e%20Support/exercises/Preparing%20and%20managing%20community%20data.htm#spsite
• Statistics For Ecologists Using R and Excel

http://r-statistics.co
• Introductions to many different statistical operations in R

http://www.simonqueenborough.info/R
• Lectures, labs, and information introducing many statistical operations in R

https://www.statmethods.net/stats
• Step-by-step guides to doing basic stats in R

http://rpubs.com/SusanEJohnston/7953
• How to plot graphs using basic graphics in R

https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
• If you want to expand your graphing abilities, here is a workshop on how to use ggplot2
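As a closing recap of Lessons 2 and 3, the subsetting tools (“$”, “[]”, “==”) and a few descriptive statistics can be tried on a small made-up table. This is a sketch under stated assumptions: the data frame, its column names (Plot, Species, LifeForm, Count), and its values are invented for the example and are not the course's understory dataset. The “&” operator, which combines two conditions, is an addition not covered in the lessons.

```r
# Made-up data: names and values are illustrative assumptions
understory <- data.frame(
  Plot     = c(1, 1, 1, 2, 2, 2),
  Species  = c("sp_a", "sp_b", "sp_c", "sp_a", "sp_c", "sp_d"),
  LifeForm = c("Shrub", "Herb", "Shrub", "Tree", "Shrub", "Herb"),
  Count    = c(4, 10, 2, 1, 2, 7)
)

# "$" returns a single column as a vector
species <- understory$Species

# Brackets select by [row, column]; a blank term means "all"
first_three <- understory[1:3, ]
plot_and_count <- understory[, c(1, 4)]

# "==" builds a TRUE/FALSE vector that keeps only the matching rows
shrub <- understory[understory$LifeForm == "Shrub", ]

# Number of shrubs in Plot 1: both conditions must be TRUE
is_shrub_plot1 <- understory$LifeForm == "Shrub" & understory$Plot == 1
shrubs_plot1 <- sum(understory$Count[is_shrub_plot1])

# Descriptive statistics on the Count column
mean_count <- mean(understory$Count)
print(summary(understory$Count))

# Statistical mode: the most frequent value (the first one if tied)
count_mode <- as.numeric(names(which.max(table(understory$Count))))

# Mean count grouped by plot
by_plot <- aggregate(Count ~ Plot, data = understory, FUN = mean)
print(by_plot)
```

Selecting rows with a condition rather than with row numbers means the same code keeps working when rows are added to or reordered in the data file.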