Introduction to R - ELTI

Presented by the Environmental Leadership and Training Initiative
Instructor: David Woodbury
Master of Forest Science Candidate, 2019
Yale School of Forestry and Environmental Studies
An Introduction to R: A tool for data analysis of ecological restoration monitoring
The Goals of This Course
• Provide the tools and knowledge to get past the initial hurdle of learning the basics of R and begin
to tackle challenges in R independently.
• Widen the range of data analyses you are able to perform.
• Show the power of R to inspire continued study of the R programming language.
• Create a network of people learning R to tackle challenges together and encourage each other to
continue studies in R after the course is complete.
Course Schedule
Day 1
• Downloading R and RStudio and installing them on each student’s computer.
• Lesson 1 - The basics.
• Lab 1 - An introduction to swirl (an R package, run from the R console, that teaches beginners how to use
R interactively).
• Lesson 2 – Importing and manipulating data.
Day 2
• Lesson 3 – Descriptive stats.
• Lesson 4 – Statistical Tests and Linear regression.
• Lesson 5 - Community ecology vegetation analysis.
Day 3
• Small group projects conducted by 4-5 students.
• Short presentations of results from individual projects.
Course Format
The course will be taught in “Lessons”. Each “Lesson” will include:
1. A brief presentation introducing a set of R operations.
2. Demonstrations of how to perform the operations in R, with individual or group work to reaffirm learning.
Final Group Project: This is a chance for you to demonstrate the skills that you have learned in the course. Using
a volunteer dataset, you will do data analysis relevant for that data. In the presentation you will tell us what
analyses you did and why, describe operations you did in R, show us your results (including graphs), and speak
about their interpretation.
Download and Install Instructions
Windows
To install R:
Open an internet browser and go to www.r-project.org.
Click the “download R” link in the middle of the first paragraph under “Getting Started.”
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for Windows” link at the top of the page.
Click on the “install R for the first time” link at the top of the page.
Click “Download R for Windows” and save the executable file somewhere on your computer.
Run the .exe file and follow the installation instructions.
Now that R is installed, you need to download and install RStudio.
To install RStudio:
Go to www.rstudio.com and click on the “Download RStudio” button.
Click on “Download RStudio Desktop.”
Click on the version recommended for your system, or the latest Windows version, and save the executable file.
Run the .exe file and follow the installation instructions.
If a window pops up to ask if you want to install “command line developer tools”, there is no need.
Download and Install Instructions
Mac
To install R:
Open an internet browser and go to www.r-project.org.
Click the “download R” link in the middle of the first paragraph under “Getting Started.”
Select a CRAN location (a mirror site) and click the corresponding link.
Click on the “Download R for (Mac) OS X” link at the top of the page.
Click on the file containing the latest version of R under “Files.”
Save the .pkg file, double-click it to open, and follow the installation instructions.
Now that R is installed, you need to download and install RStudio.
To install RStudio:
Go to www.rstudio.com and click on the “Download RStudio” button.
Click on “Download RStudio Desktop.”
Click on the version recommended for your system, or the latest Mac version, and save the .dmg file on your
computer.
Double-click it to open, and then drag and drop it to your applications folder.
If a window pops up to ask if you want to install “command line developer tools”, there is no need.
Lesson 1: The Basics
Outline of Lesson 1
1. A brief history of R.
2. What is R?
3. Why is it so popular?
4. Types of Data used in R.
5. Functions and Packages.
1.) A Brief History of R
• Two statistics professors, Robert Gentleman and Ross Ihaka, from the University of Auckland
came up with the idea in 1991.
• They both wanted technology better suited for their statistics students, who needed to analyze
data and produce graphical models of the information. At that point, most comparable software
had been designed by computer scientists and proved hard to use.
• They officially launched the open source program in 1996.
• It quickly gained popularity because it was free, could be customized to fit the needs of
individuals, and was – in fact – easier to use than many other statistical software programs.
2.) So what is R exactly?
• Simply put, R is a computer programming language used for data analysis.
So what is a computer programming language?
• A computer programming language is any of various languages for expressing a set of detailed instructions for
a digital computer.
So again what is R exactly?
• R is a computer programming language made for statistics. It gives a computer detailed sets of instructions on
how to perform statistical analyses and produce graphics.
• Here is a link to a New York Times article about R and why it is so popular:
https://www.nytimes.com/2009/01/07/technology/business-computing/07program.html?_r=1&pagewanted=all
3.) Why is R so popular with ecologists?
• Ecologists began customizing R to fit their specific data analysis needs almost as soon as R was
released.
• At present, there are many packages (toolboxes that contain functions for specific and complex
analyses) available specifically for ecology.
• The vegan and labdsv packages are both widely used for community vegetation ecology analyses.
• The adehabitat package has many functions useful in the analysis of wildlife data.
4.) What kinds of data can R use?
• R can use a wide variety of data types including vectors, matrices, data frames, and lists. All of
these can be made of numbers, characters, or logical values.
Matrices and Vectors
Matrix
• A rectangular array of one type of data
(usually numbers or characters) arranged in
rows and columns.
Vectors
• A sequence of values of one type of data (numeric,
character, or logical).
• A column of DBH measurements from an Excel
datasheet is a numeric vector (i.e., a sequence of
numbers).
• A column of species names is a character vector
(i.e., a sequence of character strings).
• A vector has no orientation of its own in R; it can
serve as either a row or a column.
Data Frames and Lists
Data Frame
• A data frame is more general than a
matrix, in that different columns can
have different modes (numeric,
character, factor, etc.).
Lists
• An ordered collection of objects
(components). A list allows you to gather
a variety of (possibly unrelated) objects
under one name.
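The four structures described above can be sketched in a few lines of R. All values and object names here are made up for illustration; they are not from the course data.

```r
# Vector: values of a single type
dbh <- c(12.5, 30.1, 8.4)                      # numeric vector (made-up DBH values)
species <- c("Miconia", "Scleria", "Piper")    # character vector

# Matrix: a rectangular array of a single type
counts <- matrix(1:6, nrow = 2, ncol = 3)

# Data frame: columns can hold different types
trees <- data.frame(species = species, dbh = dbh)

# List: gathers (possibly unrelated) objects under one name
site <- list(name = "Plot 1", trees = trees, counts = counts)

class(dbh)      # type of the numeric vector
dim(counts)     # rows and columns of the matrix
str(trees)      # structure of the data frame
names(site)     # components of the list
```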
5.) Functions and Packages
Functions
• Data manipulation or calculation shortcuts.
• Easiest to explain with an example.
Packages
• A package can be thought of as a toolbox that contains functions for doing data analyses in R.
• Often they are created by individuals to perform specific types of data analyses.
• The package which we will use most in this class is the “vegan” package or the Vegetation
Analysis package. It contains functions for community ecology analyses like functions for
calculating rarefaction curves and species diversity indices.
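As a small illustration of both ideas, the sketch below calls two base R functions and then, only if the vegan package happens to be installed, one vegan function. The numbers are made up, and the install line is left commented out because it needs an internet connection.

```r
# A function takes inputs (arguments) and returns a result.
dbh <- c(12.5, 30.1, 8.4, 22.0)       # made-up DBH values
avg <- mean(dbh)                      # mean() is a base R function
round(avg, digits = 1)                # round() takes a second argument

# A package must be installed once, then loaded in each session.
# install.packages("vegan")           # one-time download (uncomment to run)
if (requireNamespace("vegan", quietly = TRUE)) {
  library(vegan)
  abundances <- c(10, 5, 1)           # made-up counts for three species
  diversity(abundances, index = "shannon")   # Shannon diversity index
}
```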
Lab 1: The Basic Building Blocks of R
(Using the “swirl” package)
Lesson 2: Importing and Manipulating Data
Outline of Lesson 2
1. The data.
2. The working directory.
3. Importing data.
4. Subsetting data.
5. Using the “$” operator.
6. Using “[]” commands.
7. Using the “==” operator.
The Data
[Figure: Transect Design. Labels recoverable from the diagram: plots numbered 1 to 9, the years 2010, 2012, and 2016, a 60 m length, and 2 m dimensions.]
The Data
Understory Vegetation Data
Variables
• Planting Year
• Plot Number
• Subplot
• Species
• Life Form
• Count (number of individuals)
• Whether species was planted or
recruited naturally
1.) Setting The Working Directory
What is the “working directory”?
It is simply the folder on your computer where you have saved the data that you wish to bring into R.
What is “setting your working directory”?
It is telling R what folder to look in to find the data file that you want to bring into R.
So how do I set the working directory?
With this function: setwd("Address of your desired working directory folder goes here").
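A minimal sketch of setting and checking the working directory. It uses R's temporary folder as a stand-in for your own data folder (an assumption made so the example runs on any computer); in practice you would pass your own folder path to setwd().

```r
old.dir <- getwd()            # remember where R is currently looking

project.dir <- tempdir()      # stand-in for your own data folder
setwd(project.dir)            # tell R to look in that folder
getwd()                       # confirm the working directory changed
list.files()                  # see which files R can find there

setwd(old.dir)                # go back to the original folder
```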
2.) Importing Data Into R
• Base R cannot read .xlsx files directly.
• The simplest approach is to save Excel files as .csv files before reading them into R (add-on packages such as readxl can read .xlsx files, but they are not covered here).
• The function to import a .csv file into R is read.csv().
• Often when we import data we want to assign the data to a variable so we can view and manipulate it.
• Here is an example of what that command might look like:
data <- read.csv("David_Understory_Data.csv")
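Because the course file is not included here, the sketch below first writes a tiny made-up .csv file to a temporary location and then imports it. The column names and values are hypothetical and only loosely modeled on the variables listed for the understory data.

```r
# Write a small, made-up .csv file so the example is self-contained
csv.path <- tempfile(fileext = ".csv")
writeLines(c(
  "Planting.Year,Plot,Subplot,Species,Life.Form,Count",
  "2010,1,Aa,Miconia sp.,Shrub,4",
  "2010,1,Aa,Piper sp.,Shrub,2",
  "2010,1,Aa,Scleria sp.,Herb,7"
), csv.path)

# Import the file and assign it to a variable
understory <- read.csv(csv.path)

head(understory)    # view the first rows
str(understory)     # check the type of each column
```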
3.) Tidy Data
It is best to format a table to be used in R as
follows:
• Each column is a variable.
• Each row is an observation.
This is known as having tidy data and having
tidy data makes manipulating data much easier
in R.
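To make the idea concrete, here is a sketch with made-up counts. The first table spreads one variable (life form) across two columns; the second stores the same information in tidy form, with one row per observation.

```r
# Not tidy: the life forms are spread across column names
wide <- data.frame(Plot = c(1, 2), Shrub = c(4, 0), Herb = c(7, 3))

# Tidy: each column is a variable, each row is an observation
tidy <- data.frame(
  Plot      = rep(wide$Plot, times = 2),
  Life.Form = rep(c("Shrub", "Herb"), each = 2),
  Count     = c(wide$Shrub, wide$Herb)
)
tidy
```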
4.) Subsetting Data
Usually we are only interested in a small portion of a dataset
for a particular analysis.
Ex.) If we want to know the number of shrubs in Plot 1 from
the data table on the right, we need three pieces of
information.
1. The observations (rows) that are shrubs.
2. The variable (column) identifying which plot the
observation belongs to.
3. The variable (column) that holds the abundance data, in
this case the column titled “Count”.
How do we extract only the data needed? Well, R has
several ways of doing this in a process called “subsetting.”
5.) The “$” Operator
The data on the right is a data.frame we have
pulled into R using the read.csv function and
assigned to a new variable (data object) called
“data”. And we want to subset it.
data <- read.csv("David_Understory_Data.csv")
The $ operator will give you a single column from
your dataset.
The code: species <- data$Species
will give us a new variable named “species” that
contains just a vector of all the species names.
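A short sketch of the $ operator on a small made-up data frame (built by hand here instead of being read from the course file):

```r
# Made-up stand-in for the understory data
data <- data.frame(
  Species = c("Miconia sp.", "Piper sp.", "Scleria sp."),
  Count   = c(4, 2, 7)
)

species <- data$Species   # pull out the Species column as a vector
species
length(species)           # one value per row of the data frame
```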
6.) Brackets “[]”
Brackets allow you to specify the location of the data
you want within a data frame.
The format for using brackets is:
newobject <- data[row, column]
The variable (also known as a data object) that contains the
data frame you are subsetting always comes before the
brackets.
Ex.) We want the value of the cell in the data frame to the
right that gives the abundance for Scleria sp. in Plot 1,
Subplot Aa.
To get that single cell subset of our data we would use the
command:
X <- data[3, 6]
(Column 6 is the “Count” column.)
Brackets “[]”
What if we want more than just one number?
To get a whole column of data, leave the
term that specifies the row blank.
Ex.) column4 <- data[,4]
This will give you a new data object
with only the data from column 4.
Note: this is the same as doing
column4 <- data$Species
Brackets “[]”
What if we want just the data from Subplot Aa?
We need a range of rows from Row 1 to Row 5.
Here is the code to get the data we want:
subAa <- data[1:5,]
The “:” lets R know we want everything between 1
and 5. Also, by leaving the term for specifying the
column blank we get all columns in the dataset
Brackets “[]”
Now what if we want just subplot Ab and just the
columns for the year planted and species name?
That code is:
Ab.year.species <- data[6:12, c(1, 4)]
We are telling R we want all the rows between 6 and
12 and that we want only the columns 1 and 4.
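The bracket forms shown on these slides can be tried together on a small made-up data frame. The rows and values below are hypothetical; only the column order follows the variables listed for the understory data.

```r
# Made-up stand-in with the same column order as the understory data
data <- data.frame(
  Planting.Year = c(2010, 2010, 2010, 2012),
  Plot          = c(1, 1, 1, 2),
  Subplot       = c("Aa", "Aa", "Aa", "Ab"),
  Species       = c("Miconia sp.", "Piper sp.", "Scleria sp.", "Piper sp."),
  Life.Form     = c("Shrub", "Shrub", "Herb", "Shrub"),
  Count         = c(4, 2, 7, 1)
)

one.cell     <- data[3, 6]            # row 3, column 6: a single value
column4      <- data[, 4]             # all rows, column 4
first.rows   <- data[1:3, ]           # rows 1 through 3, all columns
year.species <- data[2:4, c(1, 4)]    # rows 2 through 4, columns 1 and 4

identical(column4, data$Species)      # both give the Species column
```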
7.) The If and Only If Operator “==”
What if we want only the observations (rows) for
shrubs?
For this data table we could do:
shrub <- data[c(1,12,13),]
But imagine now that we have a dataset that has
hundreds or even thousands of observations! It
would be very tedious to go through every line of
the table to find every shrub observation.
Luckily there is another way to subset data. We
can use the “==” operator which will help us
choose only the rows that are observations for
shrubs.
The If and Only If Operator “==”
What if we want only the observations (rows) for
shrubs?
The “==” operator tests for equality. It returns TRUE for each
value that exactly matches, so it can be read as “if and only if”.
The new subset method for choosing just shrubs
would thus look like this:
shrub <- data[data[,5] == "Shrub",]
With the outer square brackets we are specifying
that we want whole rows because we leave the
column term blank. In the row term we are
saying we want a row “if and only if” the
value in column 5 is “Shrub”.
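A sketch of this kind of subsetting on made-up data. It refers to columns by name rather than by number, and the last step combines two conditions with the & ("and") operator, which is not covered on the slides, to answer the earlier question of how many shrubs are in Plot 1.

```r
# Made-up stand-in for the understory data
data <- data.frame(
  Plot      = c(1, 1, 1, 2),
  Life.Form = c("Shrub", "Herb", "Shrub", "Shrub"),
  Count     = c(4, 7, 2, 1)
)

is.shrub <- data$Life.Form == "Shrub"   # TRUE or FALSE for every row
is.shrub

shrub <- data[is.shrub, ]               # keep only the rows that are TRUE

# Shrubs in Plot 1 only, then the total number of individuals
shrub.plot1 <- data[data$Life.Form == "Shrub" & data$Plot == 1, ]
total.shrubs.plot1 <- sum(shrub.plot1$Count)
total.shrubs.plot1
```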
Lesson 3: Descriptive Statistics
Data: Barro Colorado Island Tree Data
• Data we will use comes from the CTFS 50 hectare
plot on Barro Colorado Island in Panama.
• The variables are:
• Tree number, forest type, species, DBH, height of
measurement, basal area, and plot number.
• Each plot is 20 X 20 m, and every tree greater
than 1 cm DBH within each plot was identified
and recorded.
• There are two separate datasets that we will use.
Each is a subset of the full 50 hectare plot. Each
dataset has 6 plots. One dataset is plots randomly
selected from secondary forest and one dataset is
plots randomly selected from primary forest within
the 50 hectare plot.
Descriptive Stats Functions
Measures of Center
• Average - mean(BCI.P$dbh.mm)
• Median - median(BCI.P$dbh.mm)
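• Mode - base R has no function for the statistical mode; mode() returns the storage type of the data (e.g. "numeric"). One option is names(which.max(table(BCI.P$dbh.mm)))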
Measures of Spread
• Variance – var(BCI.P$dbh.mm)
• Standard Deviation - sd(BCI.P$dbh.mm)
• Range – range(BCI.P$dbh.mm)
Sum and Summary
• Sum – sum(BCI.P$dbh.mm)
• Summary Statistics – summary(BCI.P$dbh.mm)
• Output:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   50.0    62.0    89.0   131.2   138.0  1150.0
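The functions above can be tried on a short made-up vector of DBH values (not the BCI data). The helper stat.mode() is a hypothetical name introduced here, since base R has no built-in function for the statistical mode.

```r
dbh <- c(50, 62, 62, 89, 138, 200)    # made-up DBH values in mm

# Measures of center
mean(dbh)
median(dbh)

# Statistical mode: the most frequent value (hypothetical helper)
stat.mode <- function(x) {
  counts <- table(x)                            # frequency of each value
  as.numeric(names(counts)[which.max(counts)])  # value with the highest count
}
stat.mode(dbh)

# Measures of spread
var(dbh)
sd(dbh)
range(dbh)      # returns the minimum and the maximum

# Sum and summary
sum(dbh)
summary(dbh)
```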
More Functions For Data Manipulation
The aggregate() function
• Finds descriptive statistics by a factor or group within
a data set
• Ex.) finds average dbh and groups the results by plot.
rbind() and cbind() functions
• Row bind and column bind are functions that can bind
two separate data objects together if they have
the same data structure.
• To put one data set on top of another use rbind()
because you are binding rows together (the datasets
must have the same number of columns, with the
same type of data in each column).
• To put two data sets side by side use cbind()
because you are binding columns together (the datasets
must have the same number of rows).
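A sketch of these three functions on two small made-up tables standing in for the primary and secondary forest datasets. The object and column names are assumptions for illustration only.

```r
# Two made-up datasets with the same columns
trees.P <- data.frame(plot = c(1, 1, 2), dbh.mm = c(50, 70, 100))
trees.S <- data.frame(plot = c(3, 3),    dbh.mm = c(60, 80))

# rbind(): stack one dataset on top of the other
all.trees <- rbind(trees.P, trees.S)

# cbind(): add a new column alongside the existing ones
all.trees <- cbind(all.trees,
                   forest = c("primary", "primary", "primary",
                              "secondary", "secondary"))

# aggregate(): average dbh, grouped by plot
plot.means <- aggregate(dbh.mm ~ plot, data = all.trees, FUN = mean)
plot.means
```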
Lesson 4: Statistical Tests and Graphs
New Dataset
DANIELA.xlsx contains data collected by
Daniela Cusack from three plantations. Each
plantation was divided into areas with
homogeneous overstory tree species of six
types. These are factors which predicted the
number of individual saplings in each of three
height classes. The saplings were also
classified in terms of the dispersal mechanism
associated with that species. The three
dispersal mechanisms were birds, mammals,
or other (includes wind, water, bats, gravity).
The Stat Universe
From Dr. Jonathan Reuning-Scherer
Statistical Tests and Functions
In these examples x is a vector, y is a vector of the same data
type, and A is a grouping factor.
Parametric tests for comparing two or more groups
• Paired t-test – t.test(x, y, paired = TRUE)
• Unpaired t-test – t.test(x, y, paired = FALSE)
• Pearson correlation – cor(x, y)
• One-way analysis of variance – aov(y ~ A)
Non-parametric equivalents
• Wilcoxon signed-rank test (paired) - wilcox.test(x, y, paired = TRUE)
• Mann-Whitney U test (Wilcoxon rank sum test, unpaired) – wilcox.test(x, y, paired = FALSE)
• Spearman correlation - cor(x, y, method = "spearman")
• Kruskal Wallis test - kruskal.test(y ~ A)
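The calls listed above can be run on a small made-up example. The two vectors below are hypothetical paired measurements, not the plantation data, and are only meant to show the syntax.

```r
# Made-up paired measurements on six units
before <- c(12, 15, 9, 20, 14, 17)
after  <- c(14.5, 18, 10, 25, 13.2, 23.5)

# The same values in "long" form, with a grouping factor
values <- c(before, after)
group  <- factor(rep(c("before", "after"), each = 6))

# Parametric tests
paired.t   <- t.test(before, after, paired = TRUE)
unpaired.t <- t.test(before, after, paired = FALSE)
pearson    <- cor(before, after)
anova.fit  <- aov(values ~ group)
summary(anova.fit)

# Non-parametric equivalents
signed.rank <- wilcox.test(before, after, paired = TRUE)
rank.sum    <- wilcox.test(before, after, paired = FALSE)
spearman    <- cor(before, after, method = "spearman")
kw          <- kruskal.test(values ~ group)

paired.t$p.value      # each test object stores its p-value
signed.rank$p.value
```

Note that cor() returns only the correlation coefficient; the test objects returned by the other functions also hold a test statistic and a p-value.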
Meeting Test Assumptions
Assumptions for T-tests and ANOVA
1.) Independent observations.
2.) Normality: the dependent variable must
follow a normal distribution in the population.
As a rule of thumb, this matters mainly for
samples smaller than about 25 observations.
3.) Homogeneity: equal variance across groups. We only
need this assumption if our sample sizes are
(sharply) unequal.
Functions Used:
hist() – Creates a histogram.
qqPlot() – Creates a normal quantile plot (this
function is from the “car” package).
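A sketch of checking normality by eye, using made-up DBH values. The base R functions qqnorm() and qqline() are shown as an alternative that needs no extra package; qqPlot() is only called if the car package is installed.

```r
dbh <- c(50, 55, 62, 70, 89, 95, 110, 138, 160, 210)   # made-up values in mm

# Histogram: look for a roughly bell-shaped distribution
h <- hist(dbh, main = "Histogram of DBH", xlab = "DBH (mm)")

# Normal quantile plot with base R: points near the line suggest normality
qqnorm(dbh)
qqline(dbh)

# The same kind of plot with car::qqPlot(), if car is available
if (requireNamespace("car", quietly = TRUE)) {
  car::qqPlot(dbh)
}
```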
Regression
Linear Regression where x is an
independent variable that predicts a
response in a dependent variable y.
R Function is lm()
Ex.) Reg <- lm(y~x)
This can be read as “y is predicted by x”
The summary() function can be used to look
at the results of the regression.
It gives:
• Coefficients
• R-squared
• P-value
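A minimal regression sketch on made-up data in which y increases by roughly 2 for every unit of x:

```r
x <- 1:6
y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.9)   # made-up response values

Reg <- lm(y ~ x)          # "y is predicted by x"

summary(Reg)              # coefficients, R-squared, and p-value
coef(Reg)                 # just the intercept and slope
summary(Reg)$r.squared    # just the R-squared value
```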
Regression
Checking Model Assumptions
Assumptions for Linear Regression
1.) Linearity. A linear relationship
between the dependent and independent
variables.
2.) Normality. Residuals are
normally distributed.
3.) No autocorrelation.
4.) Homoscedasticity.
Functions used:
plot(y~x) – Creates a scatterplot of y
predicted by x.
plot(Reg) – Creates diagnostic plots, including
Residuals vs Fitted and a normal quantile (Q-Q) plot,
when “Reg” is a linear model data object.
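A sketch of these checks on made-up data. The par(mfrow = c(2, 2)) line is an optional extra, not from the slides, that places the diagnostic plots on a single page.

```r
x <- 1:10
y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8, 19.1, 20.9)  # made-up data
Reg <- lm(y ~ x)

plot(y ~ x)                        # scatterplot: does the relationship look linear?

old.par <- par(mfrow = c(2, 2))    # 2 x 2 grid so all diagnostics fit on one page
plot(Reg)                          # Residuals vs Fitted, Normal Q-Q, and others
par(old.par)                       # restore the previous plot layout
```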
Adding Labels and Regression Lines to Graphs
Adding a regression line:
plot(x,y) or plot(y~x) where both x and y
are continuous variables makes a
scatterplot.
abline(Reg), where Reg is a linear
model data object, can be used in the
code line following the plot() function to
add a regression line to the plot.
Adding labels:
Labels are added using extra arguments
within the plotting function.
xlab = "x" - will label the x-axis "x"
ylab = "y" - will label the y-axis "y"
main = "y predicted by x" – will create a
title for the graph that says, "y predicted
by x" (note that the argument name is lowercase).
* All of the same arguments can be used
within the hist() function.
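Putting these pieces together on made-up data gives a labeled scatterplot with a regression line, followed by a histogram that uses the same labeling arguments:

```r
x <- 1:6
y <- c(3.1, 4.9, 7.2, 8.8, 11.1, 12.9)   # made-up data
Reg <- lm(y ~ x)

# Scatterplot with axis labels and a title
plot(y ~ x, xlab = "x", ylab = "y", main = "y predicted by x")
abline(Reg)                               # add the regression line

# The same labeling arguments work in hist()
hist(y, xlab = "y", main = "Histogram of y")
```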
People Whose Work Contributed to This
Presentation
• I would like to give a special thanks to both Dr. Jonathan Reuning-Scherer and Dr. Simon Queenborough from Yale University for
advising me and contributing to this presentation.
List of Online Resources For Learning R
http://www.dataanalytics.org.uk/Publications/S4E2e%20Support/exercises/Preparing%20and%20managing%20community%20data.htm#spsite
• Statistics For Ecologists Using R and Excel
http://r-statistics.co
• Introductions to many different statistical operations in R
http://www.simonqueenborough.info/R
• Lectures, labs, and information introducing many statistical operations in R
https://www.statmethods.net/stats
• Step-by-step guides to doing basic stats in R
http://rpubs.com/SusanEJohnston/7953
• How to plot graphs using basic graphics in R.
https://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html
• If you want to expand your graphing abilities here is a workshop for how to use ggplot2