Make sure you have the tidyverse packages installed. Load the readxl package (library(readxl)), and then use the read_excel() function. (See below on finding the path to your data.)
my_data <- read_excel("~/Desktop/SRIP/My_Data.xlsx")
For an Excel file with multiple sheets, specify the sheet argument:
my_data2 <- read_excel("~/Desktop/SRIP/My_Data.xlsx", sheet = "Sheet2")
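If you are not sure what the sheets in a workbook are called, readxl's excel_sheets() lists them. The sketch below assumes the example workbook that ships with readxl (found with readxl_example("datasets.xlsx")) so that it runs without your own file; the variable names are illustrative, and you would swap in the path to your own data.

```r
library(readxl)

# Example workbook bundled with readxl; replace with the path to your own file
example_path <- readxl_example("datasets.xlsx")

# List the sheet names so you know what to pass to `sheet`
sheet_names <- excel_sheets(example_path)
print(sheet_names)

# Read one sheet by name ...
mtcars_sheet <- read_excel(example_path, sheet = "mtcars")

# ... or by position (here, the first sheet in the workbook)
first_sheet <- read_excel(example_path, sheet = 1)
```

Reading by name is usually safer than reading by position, since the order of sheets can change when someone edits the workbook.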
library(readr) # provides read_csv(); also loaded by library(tidyverse)
my_data <- read_csv("~/Desktop/SRIP/My_Data.csv")
The first argument in the read_excel() or read_csv() function should be the path to your file. In the example above, my data is stored in the SRIP folder on my desktop at the absolute path "~/Desktop/SRIP/My_Data.csv". An absolute path is the exact location of the data file on your computer.
my_data <- read_csv("~/Desktop/SRIP/My_Data.csv")
Another option is a relative path, which locates your data file relative to R’s working directory (typically the folder where you’ve saved your R code). See this guide to navigating your file system. Also see this Mac-specific or Windows- and Linux-specific guide to file paths on your system.
Here are some examples of relative vs. absolute paths:
# data in the same folder, SRIP
my_data <- read_csv("./My_Data.csv") # relative
my_data <- read_csv("~/Desktop/SRIP/My_Data.csv") # absolute
# data in a folder, Data, within the SRIP folder
my_data <- read_csv("./Data/My_Data.csv") # relative
my_data <- read_csv("~/Desktop/SRIP/Data/My_Data.csv") # absolute
# data in SRIP's parent folder, Desktop
my_data <- read_csv("../My_Data.csv") # relative
my_data <- read_csv("~/Desktop/My_Data.csv") # absolute
# data in another folder, DesktopData within the parent folder Desktop
my_data <- read_csv("../DesktopData/My_Data.csv") # relative
my_data <- read_csv("~/Desktop/DesktopData/My_Data.csv") # absolute
In short, . is the same folder, .. is its parent folder, and you can string together as many .. and folder names as you need to get from one folder to another.
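To see how . and .. resolve on your own machine, you can ask R directly. The sketch below is only an illustration: it builds a throwaway Desktop/SRIP/Data layout inside R's temporary folder (a stand-in for the real folders in the examples), pretends the R code lives in SRIP, and checks which relative paths find the file. All folder and variable names here are assumptions made for the demo.

```r
# Build a throwaway folder layout that mirrors the examples: Desktop/SRIP/Data
desktop <- file.path(tempdir(), "Desktop")
dir.create(file.path(desktop, "SRIP", "Data"), recursive = TRUE, showWarnings = FALSE)

# Put a small CSV in Desktop, the parent folder of SRIP
write.csv(data.frame(x = 1:3), file.path(desktop, "My_Data.csv"), row.names = FALSE)

# Pretend our R code lives in SRIP by making it the working directory
old_wd <- setwd(file.path(desktop, "SRIP"))

getwd()                                          # the folder that "." refers to
in_same_folder <- file.exists("./My_Data.csv")   # FALSE: the file is not in SRIP
in_parent      <- file.exists("../My_Data.csv")  # TRUE: the file is one level up
parent_folder  <- normalizePath("..")            # the full path that ".." points to

# Restore the original working directory
setwd(old_wd)
```

file.exists() is a quick way to check a path before handing it to read_csv() or read_excel().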
Note: When you are collaborating on a project, it’s best practice to work within a project folder and navigate using relative paths. Absolute paths break on different users’ computers.
Let’s say I have measurements for each day in a month over several months. I want the average of those daily values for each month:
# see the first ten rows of this data
airquality %>%
  head(10)
I would group_by() my month column, and then use the summarize() function to find the average daily temperature within each month:
# calculate average temperature for each month
avg_temps_by_month <- airquality %>%
  group_by(Month) %>%
  summarize(avg_daily_temp = mean(Temp, na.rm = TRUE))
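The same pattern extends to several summaries at once. As a sketch (the output column names avg_ozone and n_days are made up for illustration), the code below adds the monthly mean of the Ozone column, which does contain missing values in airquality, and a count of days per month using dplyr's n():

```r
library(dplyr)

# One row per month, with several summary columns computed in a single call
monthly_summary <- airquality %>%
  group_by(Month) %>%
  summarize(
    avg_daily_temp = mean(Temp, na.rm = TRUE),   # average temperature
    avg_ozone      = mean(Ozone, na.rm = TRUE),  # Ozone has NAs, so na.rm matters here
    n_days         = n()                         # number of rows (days) in each month
  )

print(monthly_summary)
```

Without na.rm = TRUE, the avg_ozone value for any month that has a missing Ozone reading would itself be NA.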
See the Data Wrangling lesson (Day 3) for more!
NA stands for Not Available. We use NA for truly missing data.
NaN stands for Not a Number. It is usually the result of an undefined calculation, such as 0/0. (Dividing a nonzero number by 0 gives Inf or -Inf in R, not NaN.)
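A quick way to get a feel for these values is to produce them at the console. This is a minimal sketch with made-up variable names:

```r
undefined_ratio    <- 0 / 0   # NaN: the result is mathematically undefined
positive_over_zero <- 1 / 0   # Inf: not NaN, R treats this as infinity
missing_value      <- NA      # a placeholder for data we simply do not have

is.nan(undefined_ratio)          # TRUE
is.infinite(positive_over_zero)  # TRUE
is.nan(missing_value)            # FALSE: NA is missing, not "not a number"
```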
A blank cell, especially in a column of strings, is usually an empty string (""). R distinguishes between empty and missing, so an empty string isn’t missing data.
Prefer NA for missing data, although R also treats NaN as missing. Here is one way to replace empty strings with NA.
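One possible approach, assuming you are using dplyr, is na_if(), which swaps a chosen value for NA. The survey data frame and its columns below are invented for the example:

```r
library(dplyr)

# A small made-up data set with blank (empty string) entries in the city column
survey <- data.frame(
  id   = 1:4,
  city = c("Boston", "", "Chicago", ""),
  stringsAsFactors = FALSE
)

# Replace every empty string in city with NA
survey_clean <- survey %>%
  mutate(city = na_if(city, ""))

# Count how many values are now treated as missing
n_missing <- sum(is.na(survey_clean$city))
print(n_missing)
```

After this step, functions such as is.na() and arguments such as na.rm = TRUE treat those blanks as missing.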
na.rm = TRUE in my calculation?
When you specify na.rm = TRUE in a calculation, such as mean() or min(), both NA and NaN values get ignored! Similarly, the is.na() function returns TRUE for both NA and NaN. (There’s another function, is.nan(), that distinguishes between the two. It returns TRUE for NaN and FALSE for NA.)
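To see this behavior for yourself, here is a short sketch with a made-up vector that contains one NA and one NaN:

```r
readings <- c(10, NA, 20, NaN)

# Without na.rm, the missing values make the result missing too
mean_with_missing <- mean(readings)

# With na.rm = TRUE, both NA and NaN are dropped before averaging
mean_without_missing <- mean(readings, na.rm = TRUE)   # (10 + 20) / 2 = 15

flagged_by_is_na  <- is.na(readings)    # FALSE  TRUE FALSE  TRUE
flagged_by_is_nan <- is.nan(readings)   # FALSE FALSE FALSE  TRUE
```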
Well, that depends on what hypothesis you’re testing. Before you get to p-values, there’s a whole process of forming hypotheses and identifying which type of statistical test to perform. p-values are widely misunderstood and misused, so I recommend learning the statistics relevant to your problem before you jump into calculating p-values. See this explanation of Hypothesis Testing or the Wikipedia page to start. For a more in-depth explanation, see Chapter 11 of Learning Statistics with R on hypothesis testing, or ask a friend who has taken AP Statistics.