B Appendix: Review of Vectors and Data Frames in R

In this appendix, we’ll cover the basics of vectors and data frames in R, and in particular what they are, how to access the data within them, and how to use them within other functions.

B.1 Vectors

Let’s suppose we wanted a vector named x which contains the squares of integers 1..10. We could create that object as:

x <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

A vector is a basic data structure in R that contains a number of elements of the same type, e.g. all numbers or all characters. Mostly we’ll deal with vectors of numbers.

There are a number of ways to create a vector, the most common being using the c() or combine function, as shown above.
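As a sketch of two other common ways to build the same vector, the colon operator and seq() both generate a sequence of integers, which we can then square element by element. The names n, x_alt and x_seq are only illustrative.

```r
# The colon operator builds the integers 1 through 10
n <- 1:10

# Arithmetic on a vector is applied to every element
x_alt <- n^2

# seq() builds the same sequence with an explicit step size
x_seq <- seq(from = 1, to = 10, by = 1)^2

# Both hold the squares 1, 4, 9, ..., 100
print(x_alt)
print(x_seq)
```

Typing out the values with c() is fine for short vectors; generating them is less error prone as the vector grows.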

B.1.1 Accessing single elements from a Vector

We access the elements of the vector using square brackets, []. For example, if I wanted to access the third element of x, I would use x[3]. In this case, the number 3 is an index into the vector. If you type this in the console, it will return 9 as shown below. If you wanted the 8th element, you would use x[8].

x[3]
## [1] 9

B.1.2 Accessing multiple elements from a Vector

We’ve seen above how to access a single element of a vector. Sometimes we want to return more than one value. There are (at least) two ways to do this:

  • by specifying a test condition or
  • by using a set of indexes.

For example, if we wanted to return only those values greater than 50, we could use x[x>50]. Read this as “x where x is greater than 50”.

The expression inside the brackets, x>50, is our test condition. In fact, what R does is test each value of x to see if it’s greater than 50 or not, and then return a vector of TRUE or FALSE values depending on the test, as shown below. In our case, only the last three values are greater than 50 and so only the last three test results are TRUE.

x>50
##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE

We then use this vector of TRUE/FALSE values to select which elements of x we want to return, and R returns only those values where the test condition is TRUE.

x[x>50]
## [1]  64  81 100

We’ll use test conditions more below.

As an alternative approach, we can also access a set of elements in x by giving it a set of indices. For example to get the first three elements we could use x[c(1,2,3)].

x[c(1,2,3)]
## [1] 1 4 9

To get just the elements in the odd-numbered positions we could use x[c(1,3,5,7,9)].
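Typing every index by hand gets tedious, so here is a minimal sketch that generates the indices instead. The name odd_positions is illustrative, and seq() and the colon operator are used here as one convenient option among several.

```r
x <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

# Generate the positions 1, 3, 5, 7, 9 rather than typing them
odd_positions <- seq(from = 1, to = length(x), by = 2)

# Same result as x[c(1,3,5,7,9)]
x[odd_positions]
## [1]  1  9 25 49 81

# The colon operator gives a run of consecutive positions
x[1:3]
## [1] 1 4 9
```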

B.1.3 Using Vectors within other Functions

There are a lot of built-in functions in R that can act on vectors, meaning we can pass the vector as a parameter to another function. Some examples include:

  • sum(x) - returns the total of the given vector
  • length(x) - returns the length of the given vector
  • max(x) - returns the maximum value in the given vector
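Applied to our vector of squares, these functions behave as sketched below. The last line is an extra illustration showing that a selected subset is itself a vector and can be passed to a function in the same way.

```r
x <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

sum(x)      # total of all ten squares
## [1] 385
length(x)   # number of elements
## [1] 10
max(x)      # largest element
## [1] 100

# A test condition can be applied first: total of the values over 50
sum(x[x > 50])
## [1] 245
```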

B.1.4 Guided Practice

  1. create a vector that contains the first 15 prime numbers
  2. extract the 2nd element of vector x
  3. extract all elements of the vector x that are less than or equal to 25
  4. extract elements 2,4,6,8 and 10 of the vector x
  5. extract the last element of the vector x
  6. calculate the maximum value of the vector x

B.2 Data Frames

Above we saw that vectors are basically lists of the same type of element. They are one dimensional.

Another important data structure in R is the data frame. A data frame is a two dimensional object used to store data.

An example of this is the data from the 2019 Seattle Sounders regular season, shown below, stored in a data frame called sounders.

##      Date       Opponent Goals_For Goals_Against WDL Home_Away
## 1   2-Mar     Cincinnati         4             1   W         H
## 2   9-Mar       Colorado         2             0   W         H
## 3  16-Mar        Chicago         4             2   W         A
## 4  30-Mar      Vancouver         0             0   D         A
## 5   6-Apr      Salt Lake         1             0   W         H
## 6  13-Apr        Toronto         3             2   W         H
## 7  21-Apr        LA (FC)         1             4   L         A
## 8  24-Apr       San Jose         2             2   D         H
## 9  28-Apr        LA (FC)         1             1   D         H
## 10  4-May      Minnesota         1             1   D         A
## 11 11-May        Houston         1             0   W         H
## 12 15-May        Orlando         2             1   W         H
## 13 18-May   Philadelphia         0             0   D         A
## 14 26-May    Kansas City         2             3   L         A
## 15  1-Jun         Dallas         1             2   L         A
## 16  5-Jun       Montreal         1             2   L         A
## 17 29-Jun      Vancouver         1             0   W         H
## 18  3-Jul  New York City         0             3   L         A
## 19  6-Jul       Columbus         2             1   W         A
## 20 14-Jul        Atlanta         2             1   W         H
## 21 21-Jul       Portland         1             2   L         H
## 22 27-Jul        Houston         1             0   W         A
## 23  4-Aug    Kansas City         2             3   L         H
## 24 10-Aug    New England         3             3   D         H
## 25 14-Aug      Salt Lake         0             3   L         A
## 26 17-Aug    LA (Galaxy)         2             2   D         A
## 27 23-Aug       Portland         2             1   W         A
## 28  1-Sep    LA (Galaxy)         4             3   W         H
## 29  7-Sep       Colorado         0             2   L         A
## 30 15-Sep NY (Red Bulls)         4             2   W         H
## 31 18-Sep         Dallas         0             0   D         H
## 32 22-Sep           D.C.         0             2   L         A
## 33 29-Sep       San Jose         1             0   W         A
## 34  6-Oct      Minnesota         1             0   W         H

As shown, a data frame has both columns and rows. For the sounders data, each row contains the information about one game, and each column contains a single variable tracked across all of the games, in this case including the date, opponent, goals scored, result and location of the game. The dim() function reports the number of rows and columns:

dim(sounders)
## [1] 34  6

And we see that this data frame has 34 rows and 6 columns. Data frames will typically have named columns. To determine the names, type

names(sounders)
## [1] "Date"          "Opponent"      "Goals_For"     "Goals_Against"
## [5] "WDL"           "Home_Away"

and here we see there are 6 names, one for each column, in the order listed above.

B.2.1 Accessing single elements from a Data Frame

Similar to above, we use brackets to access single elements from a data frame, with one change. Since the data frame is two dimensional, we need to use two indices, the first one for the row and second one for the column as [row, column]. So, to access the “Goals_For” (the 3rd column) in the 5th game (the 5th row) we’d use:

sounders[5,3]
## [1] 1

Similarly, to access the “Opponent” in the 10th game we’d use

sounders[10,2]
## [1] "Minnesota"
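Counting columns can be error prone, so as a sketch, the column can also be given by name in place of its number. The data frame sounders_small below is a hypothetical stand-in holding only the first five games, so that the example runs on its own.

```r
# Hypothetical stand-in: the first five games, three columns only
sounders_small <- data.frame(
  Date      = c("2-Mar", "9-Mar", "16-Mar", "30-Mar", "6-Apr"),
  Opponent  = c("Cincinnati", "Colorado", "Chicago", "Vancouver", "Salt Lake"),
  Goals_For = c(4, 2, 4, 0, 1),
  stringsAsFactors = FALSE
)

# Row 5, column 3, by position
sounders_small[5, 3]
## [1] 1

# The same element, naming the column instead
sounders_small[5, "Goals_For"]
## [1] 1

sounders_small[4, "Opponent"]
## [1] "Vancouver"
```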

B.2.2 Accessing entire columns from a Data Frame

The $ operator allows us to select an entire column. So sounders$Date is the list (vector) of all of the dates of the games and sounders$WDL is a vector which indicates whether each game was a Win, Loss, or Draw.

Of importance here is that the result of this, i.e. an extracted column, is a standard R vector. Hence everything that we said above about vectors also applies to data from columns. We can:

  • sum them: sum(sounders$Goals_For)
  • find the maximum: max(sounders$Goals_Against)
  • access individual elements using brackets: sounders$Goals_For[3]
  • access a set of elements using brackets and combine: sounders$Goals_For[c(1,2,3,4,5)]
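The operations listed above can be sketched as follows. To keep the example self-contained, sounders_goals is a stand-in data frame holding only the two goal columns, typed in from the table shown earlier.

```r
# Stand-in for the two goal columns of the sounders data (34 games)
sounders_goals <- data.frame(
  Goals_For     = c(4, 2, 4, 0, 1, 3, 1, 2, 1, 1, 1, 2, 0, 2, 1, 1, 1,
                    0, 2, 2, 1, 1, 2, 3, 0, 2, 2, 4, 0, 4, 0, 0, 1, 1),
  Goals_Against = c(1, 0, 2, 0, 0, 2, 4, 2, 1, 1, 0, 1, 0, 3, 2, 2, 0,
                    3, 1, 1, 2, 0, 3, 3, 3, 2, 1, 3, 2, 2, 0, 2, 0, 0)
)

sum(sounders_goals$Goals_For)            # total goals scored
## [1] 52
max(sounders_goals$Goals_Against)        # most goals conceded in a game
## [1] 4
sounders_goals$Goals_For[3]              # goals scored in the 3rd game
## [1] 4
sounders_goals$Goals_For[c(1,2,3,4,5)]   # goals scored in the first 5 games
## [1] 4 2 4 0 1
```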

B.2.3 Accessing rows from a Data Frame

The above [r,c] notation has an important variation that allows us to access the value of every column at once. To get all of the data for a specific row (i.e. for a specific game), we leave the column index blank:

sounders[34,]
##     Date  Opponent Goals_For Goals_Against WDL Home_Away
## 34 6-Oct Minnesota         1             0   W         H

is the data for the 34th game. So, leaving the column value blank tells R to return all the columns in the specified row.
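The same idea extends to several rows, or to a block of rows and columns. The sketch below uses a hypothetical stand-in, sounders_small, holding only the first five games.

```r
# Hypothetical stand-in: the first five games of the season
sounders_small <- data.frame(
  Date          = c("2-Mar", "9-Mar", "16-Mar", "30-Mar", "6-Apr"),
  Opponent      = c("Cincinnati", "Colorado", "Chicago", "Vancouver", "Salt Lake"),
  Goals_For     = c(4, 2, 4, 0, 1),
  Goals_Against = c(1, 0, 2, 0, 0),
  WDL           = c("W", "W", "W", "D", "W"),
  Home_Away     = c("H", "H", "A", "A", "H"),
  stringsAsFactors = FALSE
)

# All columns for the 1st and 3rd games
sounders_small[c(1, 3), ]

# Games 2 through 4, but only the opponent and the result
sounders_small[2:4, c("Opponent", "WDL")]
```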

B.2.4 Guided Practice

  1. determine if the 7th game was home or away (6th column)
  2. determine the goals against (4th column) in the 23rd game
  3. create a vector of the goals against in each game
  4. create a vector of the dates of all the games
  5. first create a vector of all of the opponents and then use this to find the opponent in the 6th game
  6. access all of the data for the 32nd game
  7. is the result of a single row, e.g. sounders[14,] a vector?

B.2.5 Using data from one column to select data from another column

We often want to use data from one column to help choose or select data from another column. For example, what if we wanted to know how many wins the Sounders had that were away games? Or how many games they won when they scored at least 2 goals? In both of these examples, we need to use data from one column to select which data to retrieve from another. And to do this, we use the test condition idea introduced above.

To figure out “how many wins the Sounders had that were away games?”, we proceed by:

  • first extracting the $WDL vector, and
  • then selecting from that only the games that were away.

Our test condition here will be sounders$Home_Away == "A". Examining this in detail we see that

sounders$Home_Away == "A"
##  [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [13]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE
## [25]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE

returns a vector of TRUE and FALSE values indicating whether each of the games was an away game or not. We can use this to select only the relevant games from the sounders$WDL vector as:

sounders$WDL[sounders$Home_Away == "A"]
##  [1] "W" "D" "L" "D" "D" "L" "L" "L" "L" "W" "W" "L" "D" "W" "L" "L" "W"

And we see there are only 17 games that were away games, with results as above.

Lastly, to figure out how many of these are wins, we can use the table() function, recognizing again that the previous result is simply a vector:

table(sounders$WDL[sounders$Home_Away == "A"])
## 
## D L W 
## 4 8 5

And hence, there were 5 away games that the Sounders won.
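As an alternative sketch for counting a single category, we can apply a second test condition to the selected results and add up the TRUE values, since sum() counts each TRUE as 1. The vectors wdl and home_away below are stand-ins for sounders$WDL and sounders$Home_Away, typed in from the table shown earlier.

```r
# Stand-ins for sounders$WDL and sounders$Home_Away (34 games)
wdl <- c("W", "W", "W", "D", "W", "W", "L", "D", "D", "D", "W", "W",
         "D", "L", "L", "L", "W", "L", "W", "W", "L", "W", "L", "D",
         "L", "D", "W", "W", "L", "W", "D", "L", "W", "W")
home_away <- c("H", "H", "A", "A", "H", "H", "A", "H", "H", "A", "H", "H",
               "A", "A", "A", "A", "H", "A", "A", "H", "H", "A", "H", "H",
               "A", "A", "A", "H", "A", "H", "H", "A", "A", "H")

# Results of the away games only
away_results <- wdl[home_away == "A"]
length(away_results)
## [1] 17

# TRUE counts as 1, so the sum is the number of away wins
sum(away_results == "W")
## [1] 5
```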

As a second example, let’s ask “How many games did the Sounders win when they scored at least 2 goals?” Here we’ll use “when they scored at least 2 goals” as our test condition and select the win/draw/loss record from only those games. So our test condition will be:

sounders$Goals_For>=2
##  [1]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE  TRUE
## [13] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
## [25] FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE

which again returns a vector of TRUE and FALSE. To then select the win/draw/loss data, we’d call:

sounders$WDL[sounders$Goals_For>=2]
##  [1] "W" "W" "W" "W" "D" "W" "L" "W" "W" "L" "D" "D" "W" "W" "W"

but note this doesn’t exactly answer our question. We had asked how many games they won in this test condition. The table() function acts on a vector with different levels, such as win, loss and draw, and counts the results in each level or category.

table(sounders$WDL[sounders$Goals_For>=2])
## 
##  D  L  W 
##  3  2 10

Hence we see they won 10 games when they scored at least 2 goals.

Also note that the table() result behaves like a named vector, meaning we can access individual values using the [] notation. So, table(sounders$Home_Away)[1] returns 17 for the number of away games, because the categories are sorted alphabetically and "A" comes before "H".
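Because positions in a table depend on how the categories sort, a sketch of indexing by category name may be easier to read. The vector home_away is a stand-in for sounders$Home_Away, and tab is an illustrative name.

```r
# Stand-in for sounders$Home_Away (34 games)
home_away <- c("H", "H", "A", "A", "H", "H", "A", "H", "H", "A", "H", "H",
               "A", "A", "A", "A", "H", "A", "A", "H", "H", "A", "H", "H",
               "A", "A", "A", "H", "A", "H", "H", "A", "A", "H")

tab <- table(home_away)

names(tab)    # the categories, in sorted order
## [1] "A" "H"

tab[1]        # by position: the first category, with its label
tab["A"]      # by name: the same count, with its label
tab[["A"]]    # double brackets drop the label and return just the count
## [1] 17
```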

B.2.6 Guided Practice

  1. What is the test condition if we wanted only home games?
  2. What is the test condition if we wanted games where the Sounders didn’t score any goals?
  3. What is the test condition if we wanted only those games against Portland?
  4. How would you select the dates of the games against Vancouver?
  5. How would you select the goals against in games that the Sounders lost?
  6. How would you calculate the total goals they scored in home games?

B.2.7 How to create data frames

To create data frames in R, we typically either read data in from external files using the read.csv() function, or we create data frames directly using the data.frame() function. For the latter, suppose we had data on two different test scores for a set of students.

test1 <- c(92, 88, 100, 96, 80, 82, 85, 93, 81, 95, 84, 92, 93, 90, 92)
test2 <- c(92, 90, 99,  95, 78, 81, 83, 94, 75, 98, 94, 96, 94, 89, 72)

test_scores <- data.frame(test1=test1, test2=test2)
head(test_scores)
##   test1 test2
## 1    92    92
## 2    88    90
## 3   100    99
## 4    96    95
## 5    80    78
## 6    82    81

In the call to data.frame(), the name of each resulting column is on the left side of the equals sign and the vector that supplies its values is on the right side.
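Once created, the new data frame can be inspected with the same tools we used on the sounders data. A brief sketch:

```r
test1 <- c(92, 88, 100, 96, 80, 82, 85, 93, 81, 95, 84, 92, 93, 90, 92)
test2 <- c(92, 90, 99,  95, 78, 81, 83, 94, 75, 98, 94, 96, 94, 89, 72)

test_scores <- data.frame(test1 = test1, test2 = test2)

dim(test_scores)        # one row per student, one column per test
## [1] 15  2
names(test_scores)
## [1] "test1" "test2"

test_scores$test1[3]    # third student's score on the first test
## [1] 100
max(test_scores$test2)  # highest score on the second test
## [1] 99
```

Note that all of the vectors passed to data.frame() need to have the same length here, since each one becomes a column with one value per row.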

B.2.8 Reading External Data

To load data from an external file, use the read.csv() command. The path given to read.csv() must match the file’s location.

sounders <- read.csv("sounders.csv")

A common error when loading data is:

Error in file(file, "rt") : cannot open the connection

This typically means the file is not located where R is looking. Check your working directory using getwd() and make sure the file is located there.
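The check described above can be sketched as follows. So that the example runs anywhere, it first writes a small hypothetical file, sounders_demo.csv, to R’s temporary folder; the file name and the names csv_path, demo and demo_in are all illustrative.

```r
# Write a hypothetical two-game file to a temporary location
csv_path <- file.path(tempdir(), "sounders_demo.csv")
demo <- data.frame(Opponent  = c("Cincinnati", "Colorado"),
                   Goals_For = c(4, 2))
write.csv(demo, csv_path, row.names = FALSE)

getwd()                # the folder R searches when given a relative path
file.exists(csv_path)  # TRUE if the file is where we think it is

# Only attempt to read the file once we know it can be found
if (file.exists(csv_path)) {
  demo_in <- read.csv(csv_path)
  print(dim(demo_in))
}
```

If file.exists() returns FALSE for your own file, either move the file into the working directory or pass read.csv() the full path to it.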

B.3 Building compound test conditions

Above we saw how to build simple test conditions, namely where we compared against a single value (including using < or >). Occasionally, we want to select a range of values, or to use multiple columns as part of our test condition.

To do this, we need to consider the ideas of OR and AND. If we have two test conditions, A and B, that can each be either TRUE or FALSE, then if we want either of them to be true, we can have A OR B as our combined or compound test condition. Alternatively, if we need both of them to be true, we use A AND B as our combined test condition.
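To see how OR and AND combine, here is a small sketch covering all four combinations of two TRUE/FALSE values. The vectors A and B are made up for illustration; the | and & symbols are R’s element-by-element OR and AND operators.

```r
# Every combination of two TRUE/FALSE values
A <- c(TRUE, TRUE,  FALSE, FALSE)
B <- c(TRUE, FALSE, TRUE,  FALSE)

# OR: TRUE when at least one of the two is TRUE
A | B
## [1]  TRUE  TRUE  TRUE FALSE

# AND: TRUE only when both are TRUE
A & B
## [1]  TRUE FALSE FALSE FALSE
```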

For example, if we wanted to extract a list of goals scored in games that were either wins or draws, we could write

sounders$Goals_For[sounders$WDL=="W" | sounders$WDL=="D"]
##  [1] 4 2 4 0 1 3 2 1 1 1 2 0 1 2 2 1 3 2 2 4 4 0 1 1

and here the pipe | symbol indicates logical OR.

If we wanted to know the game results (WDL) for games where the Sounders scored at least 3 goals and the opponent scored no more than 1 goal, we’d use:

table(sounders$WDL[sounders$Goals_For >=3 & sounders$Goals_Against <=1])
## 
## W 
## 1

where the & symbol indicates logical AND, and we see there was only 1 game in which that occurred and the Sounders won.
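Compound conditions work on plain vectors too, which is how we select a range of values. As a sketch, using the vector of squares from the start of this appendix:

```r
x <- c(1, 4, 9, 16, 25, 36, 49, 64, 81, 100)

# AND: values strictly between 10 and 50
x[x > 10 & x < 50]
## [1] 16 25 36 49

# OR: values at either end, below 5 or above 80
x[x < 5 | x > 80]
## [1]   1   4  81 100
```

Each side of & or | needs to be a complete test condition, so we write x > 10 & x < 50 rather than something like 10 < x < 50.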

B.3.1 Guided Practice

  1. What is the compound test condition for games against either Vancouver or Portland?
  2. What is the compound test condition for home games that they lost?
  3. What is the compound test condition for games that they won and scored at least 2 goals?
  4. How would you extract the dates of home games that they won?
  5. How would you extract the total goals scored at home in games they won?
  6. In how many away games did the Sounders score zero goals and draw?

B.4 Summary

This appendix has described two ubiquitous data structures in R, namely vectors and data frames. It has covered how to create them, how to access the data within each, and how to use them within other functions. It has also discussed how we can create test conditions to extract specific data from both vectors and data frames.