Tristan Robinson

Tristan Robinson's Blog

A Crash Course in R Part 2

Following on from Part 1 which introduces the R basics (which can be found here), in Part 2 I’ll start to use lists, data frames, and more excitingly graphics.

 

Lists

Lists differ from vectors and matrices because they are able to store different data types together in the same data structure. A list can contain all kinds of R objects - vectors, matrices, data frames, factors and more. They can even store other lists. Unfortunately calculations are less straightforward because there's no predefined structure.

To store data within a list, we can use the list() function, trip <- c("London","Paris",220,3). As mentioned in my previous blog we can use str(list) to understand the structure, this will come in handy. Notice, the different data types.

Subsetting lists is not similar to subsetting vectors. If you try to subset a list using the same square brackets as when subsetting a vector, it will return a list element containing the first element, not just the first element "London" as a character vector. To extract just the vector element, we can use double square brackets [[ ]]. We can use similar syntax to before, to extract the first and third elements [c(1,3)] as a new list, but the same does not work with double brackets. The double brackets are reserved to select single elements from a list. If you have a list inside a list, then [[c(1,3)]] would work and would select the 3rd element of the 1st list! Subsetting by names, and logicals is exactly the same as vectors.

Another piece of syntax we can use is $ to select an element, but it only works on named lists. To select the destination of the trip, we can use trip$destination. We can also use this syntax to add elements to the list, trip$equipment <- c("bike","food","drink").

Interested in testing your knowledge, check out the DataCamp exercises here and here.

 

Data Frames

While using R, you'll often be working with data sets. These are typically comprised of observations and each observation has a number of variables against it, similar to a customer/product table you'd usually find in a database. This is where data frames come in, as the other data structures are not practical to store this type of data. In a data frame, the rows correspond to the observations, while the columns the variables. Similar to a matrix but we can store different data types (like a list!). Under the hood, a data frame is actually a list but with some special requirements such as vector length, and char vectors as factors. Creating a data frame is usually achieved by importing data from source rather than manually created but you can do this via the data.frame function as shown below:

 

# Definition of vectors 
planets <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune") 
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",  
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant") 
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883) 
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67) 
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE) 

# Encoded type as a factor
type_factor <- factor(type) 

# Constructed planets_df
planets_df <- data.frame(planets, type_factor, diameter, rotation, rings, stringsAsFactors = FALSE) 

# Displays the structure of planets_df 
str(planets_df) 

 

Similar to SQL, sometimes you will want just a small portion of the dataset to inspect. This can be done by using the head() and tail() functions. We can also use dim() to show the dimensions which returns the number of rows and number of columns, but str() is preferable as you get a lot more detail.

Due to the fact a data frame is an intersection between a matrix and a list, the subsetting syntax can be used from both, so the familiar [ ], [[ ]], and $ will work to extract either elements, vectors, or data frames depending upon the type of subsetting performed. We can also use the subset() function, for instance subset(planets_df, subset = has_rings == TRUE). The subset argument should be a logical expression that indicates which rows to keep.

We can also extend the data frame in a similar format to lists/matrices. To add a new variable/column we can use either people$height <- new_vector or people[["height"]] <- new_vector. We can also use cbind() for example people <- cbind(people, “height” = new_vector), it works just the same. Similarly rbind() can add new rows, but you'll need to add a data frame made of new vectors to the existing data frame, the names between the data frames will also need to match.

Lastly, you may also want to sort your data frame - you can do this via the order() function. The function returns a vector with the rank position of each element. For example in a vector of 21,23,22,25,24 the order function will return 1,3,2,5,4 to correspond the position in the rank. We can then use ages[ranks, ] to re-arrange the order of the data frame. To sort in descending order, we can use the argument decreasing = TRUE.

 

# Created a desired ordering for planets_df
positions <- order(planets_df$diameter, decreasing = TRUE)

# Created a new, ordered data frame
largest_first_df <- planets_df[positions, ]

# Printed new data frame
largest_first_df

 

Interested in testing your knowledge, check out the DataCamp exercises here and here.

 

Graphics

One of the main reasons to use R is its graphics capabilities. The difference between R and a program such as Excel is that you can create plots with lines of R code which you can replicate each time. The default graphics functionality of R can do many things, but packages have also been developed to extend this – popular packages include ggplot2, ggvis, and lattice.

The first function to look at is plot() which is a very generic function to plot. For instance, take a data frame of MPs (members of parliament), which contains the name of the MP, their constituency area, their party, their constituency county, number of years as MP, etc. We can plot the parties in a bar chart by using plot(parliamentMembers$party). R realises that their party is a factor and you want to do a count across it in a bar chart format (by default). If you pick a different variable such as the continuous variable number of years as MP, the figures are shown on an indexed plot. Pick 2 continuous variables, such as number of years as MP vs. yearly expenses - plot(parliamentMembers$numOfYearsMP, parliamentMembers$yearlyExpenses) and the result is a scatter plot (each variable holds an axis).

Are these variables related? To make relationships clearer, we can use the log() function- plot(log(parliamentMembers$numOfYearsMP), log(parliamentMembers$yearlyExpenses)). For 2 categorical variables, R handles it differently again, and creates a stacked bar chart giving the user an idea of the proportion of the 2 variables. Notice how the type of variable you use impacts the type of chart displayed.  You will find that the first element of plot forms the x axis and the second element the y axis. For 3 variables, it gets better still – here I used plot(parliamentMembers[c("numOfYearsMP","votesReceived","yearlyExpenses")]). As you can see below, R plots the variables against one another in 9 different plots!

 

clip_image001 

 

The second function to look at is hist(), which gives us a way to view the distribution of data. We can use this in a similar format to plot by specifying the syntax hist(parliamentMembers$numOfYearsMP). By default, R uses an algorithm to work out the number of bins to split the data into based on the data set. To create a more detailed representation, we can increase the bins by using the breaks argument - hist(parliamentMembers$numOfYearsMP, breaks = 10).

There are of course, many other functions we can use to create graphics, the most popular being barplot(), boxplot(), pie(), and pairs().

 

Customising the Layout

To make our plots more informative,we need to add a number of arguments to the plot function. For instance the following R script creates the plot below:

 

plot(
parliamentMembers$startingYearMP, 
parliamentMembers$votesReceived, 
xlab = "Number of votes", 
ylab = "Year", 
main = "Starting Year of MP vs Votes Received", 
col = "orange",
col.main = "black",
cex.axis = 0.8
pch = 19
)

# xlab = horizontal axis label
# ylab = vertical axis label
# main = plot title
# col = colour of line
# col.main = colour of the title
# cex.axis = ratio of font size on axis tick marks
# pch = plot symbol (35 different types!!)

clip_image001[7]

 

To inspect and control these graphical parameters, we can use the par() function. To get a complete list of options we can use ?par to bring up the R documentation. By default, the parameters are set per plot, but to specify session-wide parameters, just use par(col = “blue”).

 

Multiple Plots

The next natural step for plots is to incorporate multiple plots – either side by side or on top of one another.

To build a grid of plots and compare correlations we can use the mfrow parameter, like this par(mfrow = c(2,2)) by using par() and passing in a 2x2 vector which will build us 4 subplots on a 2 by 2 grid. Now, when you start creating plots, they are not replaced each time but are added to the grid one by one. Plots are added in a row-wise fashion - to use column-wise, we can use mfcol. To reset the graphical parameters, so that R plots once again to a single figure per layout, we pass in a 1x1 grid.

Apart from these 2 parameters, we can also use the layout() function that allows us to specify more complex arrangements. This takes in a matrix, where you specify the location of the figures. For instance:

 

grid <- matrix(c(1,1,2,3), nrow = 2, ncol = 2, byrow = TRUE)

grid
    [1][2]
[1]  1  1
[2]  2  3

layout(grid)
plot(dataFrame$country, dataFrame$sales)
plot(dataFrame$time, dataFrame$sales)
plot(dataFrame$businessFunction, dataFrame$sales)

# Plot 1 stretches across the entire width of the figure
# Plot 2 and 3 sit underneath this and are half the size

 

To reset the layout we can use layout(1) or use the mfcol / mfrow parameters again. One clever trick to save time is to store the default parameters in a variable such as old_par <- par() and then reset once done using par(old_par).

Apart from showing the graphics next to one another, we can also stack them on top of one another in layers. There are functions such as abline(), text(), lines(), and segments() to add depth to a graphic. Using lm() we can create an object which contains the coefficients of the line representing the linear fit using 2 variables, for instance movies_lm <- lm(movies$rating ~ movies$sales). We can then add it to an existing plot such as plot(movies$rating, movies$sales) by using the abline() (adds the line) and coef() (extracts the coefficients) functions, for instance abline(coef(movies_lm)). To add some informative text we can use xco <- 7e5 yco <- 7 and text(xco, yco, label = “More votes? Higher rating!”) to generate the following:

 

clip_image001[1]

 

Interested in testing your knowledge, check out the DataCamp exercises here, here and here.

 

Summary

This pretty much concludes the crash course in R. We’ve looked at the basics, vectors, matrices, factors, lists, data frames and graphics – and in each case looked at arithmetic and subsetting functions.

Next I’ll be looking at some programming techniques using R, reading/writing to data sources and some use cases – to demonstrate the power of R using some slightly more interesting data sets such as the crime statistics in San Francisco.

 

 

 

A Crash Course in R Part 1

What is R?  Also known as the language for statistical computing, it was developed in the 1990s, and provides the ability to use a wide variety of statistical techniques and visualization capabilities across a set of data. 

Pros for the language include the fact its open source, it has great graphical capabilities, runs via a CLI (provides the ability to script and automate), has a huge community behind it, and is gaining a wider adoption in business!

R can be used by a number of tools; the most common are R Tools for Visual Studio, RStudio standalone, or more recently R Scripts in Power BI.

 

Basics

Let’s start with the absolute basics. If you print 1+2 to the command line, the console will return 3. Print text to the command line, the console will return the same body of text.

Everyone loves variables. To be able to store values in variables, we can use the syntax apples <- 4 and pears <- 2 to the command line to store 4 in the apples, and 2 in the pears variable. There’s no print out here because its a variable. We can then do total_fruit <- apples + pears to create a new variable using existing variables.

As you create variables, you create a workspace, which you can reference using ls(). This details all the variables created within the R session. You can then use rm(variable) to clean up the workspace to maintain resource.

Now, not everyone loves commenting but you can comment your code via #. Here’s a simple example script to calculate the volume of a circle:

# Create the variables r and R 
r <- 2 
R <- 6 

# Calculate the volume of a circle
vol_circle <- 2*pi^2*r^2*R 

# Remove all intermediary variables that you've used with rm() 
rm(r,R) 

# List the elements in your workspace 
ls() 

[1] "vol_circle"

 

Data Types

As with any language, there are a number of data types supported:

  • TRUE/FALSE are "logical"
  • "This is text" is "character"
  • 2 or 2.5 are "numerics”
  • You can add L to numeric such as 2L to call this number an integer (the outputs the same, but the class is different). Here we have what's known as a type hierarchy.
  • Other types include double, complex, and raws.

We can use the function class() to determine the data type of variable. We can also use the dot function as to coerce or transform between the data types, such as as.character(4) = "4" and as.numeric("4.5") = 4.5.  To evaluate the type use is.numeric(2) = TRUE or is.numeric(“2”) = TRUE.  NA is returned when trying to convert “hello” to numeric.

Interested in testing your knowledge, check out the DataCamp exercises here and here.

 

Vectors

A vector is a sequence of data elements of the same data type which can called using the c() function. Vectors can be of all the types seen previously.  The three properties of a vector are type, length, and attributes.

For example, a vector to provide us UK Government Parties (who doesn’t love politics) and assigned to the variable parties can be built parties <- c("Labour","Conservative","Libdems","SNP"). A check can done to see if the variable is of a vector type similar to before using is.vector(parties).

But what if the vector contains data that has slightly more meaning behind it, for instance seat_count <- c(262,318,12,35). You can attach labels to the vector elements by using the names() function: names(seat_count) <- parties.  You could also do this using one line - c(Labour = 262, Conservative = 318, Libdem = 12, SNP = 35).

The variables created previously are actually stored in a vector of length 1. R does not provide a data structure to hold a single number/character string.  If you do build a vector of different data types, R performs coercion to make sure they are all of the dame data type. For instance c(1,5,"A","B","C","D") becomes "1","5","A","B","C","D".  If you need to differentiate data types, you can use a list instead!

 

Vector Arithmetic

Computations can be performed between vectors and are done so in an element-wise fashion. This means you can do earnings <- c(10,20,30) * 3 to generate [30] [60] [90].

You can also do vector minus vector so earnings – expenses (again done element wise). Multiplication and division using this method does not result in a matrix!

Other functions include sum(bank) to sum all elements of the vector, sum(bank > 0) to return a count of elements in the vector (given bank contains numerics) or sum(bank == x) to return the count of element x in a vector.

 

Vector Subsetting

As the name suggests, this is basically selecting parts of your vector to end up as a new vector (a subset of your original vector).

If you want to select the first element from our seat_count vector, we write seat_count[1] and this will return Labour 262. Both the value and name are kept from the original vector. We can also use the name, so seat_count[“Labour”] will return the same result.

If you want to select multiple elements, you can use the syntax seat_count[c(1,4)] by passing in a vector. To select elements in a sequence you can use 2:5 instead of 2,3,4,5. You can also subset via an inverse, by using the syntax seat_count[-1] which returns all the seats, apart from the element in [1].

One last method to create subsetting is by logical vectoring, so by specifying seat_count[c(TRUE,FALSE,FALSE,FALSE)] we can return the equivalent of [1]. R is also able to “recycle” this type of vectoring so if your logical vector is length 2, it will loop over itself to fit the vector you are subsetting.

Remember we can also use the arithmetic from the previous section, to select our vector contents for subsetting, examples include main_parties <- seat_count[seat_count > 50].

Interested in testing your knowledge, check out the DataCamp exercises here, here, and here.

 

Matrices

While a vector represents a sequence of data elements (1 dimensional), a matrix is a similar collection of data elements but arranged as rows/columns (2 dimensional) – a very natural extension to a vector.

To build a matrix you’ll need to use the following format; matrix(1:6, nrow = 2) which creates a 2-by-3 matrix for values 1-6. You can also specify columns rather than rows by using ncol = 3. R infers the other dimension by using the length of the input vector. To fill the vector by row instead of by columns, you can use the argument byrow = TRUE.

Another way to create a matrix is by using the functions rbind() and cbind(). These essentially take the 2 vectors you pass the function and stick them together. You can also use these functions to bind together a new vector with an existing matrix. For example my_matrix <- rbind(matrixA, vectorA)

To name the matrix, you can use rownames() and colnames(). For example rownames(my_matrix) <- c(“row1”,”row2”) and colnames(my_matrix) <- c(“col1”,”col2”).

You can also create a matrix using a one-liner, by using dimnames() and specifying a list of vector names.

my_matrix <- matrix(1:6, byrow = TRUE, nrow = 2,
dimnames = list(c("row1", "row2"), c("col1","col2","col3")))

Similar to vectors, matrices are also able to recycle themselves, only store a single atomic data type and perform coercion automatically.

Continuing on from vectors, matrices can also be subsetted. If you’re after a single element, you’ll need to specify both row and column elements of interest using the syntax m[1,3] for row 1 column 3. If you’re looking to subset an entire row or column, you can use the syntax m[3,] (notice the missing column value) to select the entirety of row 3. Columns can be selected using the inverse via m[,3]. You can also select multiple elements using a simple methodology to vectors. This can be achieved by using the syntax m[2, c(2,3)] to select the 2nd and 3rd column values of row 2. Subsetting the names works just the same as by index, you can even use a combination of both! The same is true of subsetting by a logical vector – just use c(FALSE,TRUE,TRUE) for the last 2 rows of a 3 row matrix. You can see some examples below.

# Create a matrix using 2 vectors
my_mega_matrix <- cbind(vectorA, vectorB)

# Subset the matrix to get columns 2 and 3
my_subsetted_mega_matrix <- my_mega_matrix[,c(FALSE,TRUE,TRUE,FALSE]

# Subset the matrix using names for columns 1 to 4
my_alt_subsetted_mega_matrix <- my_mega_matrix[,c("A","B","C","D")]

# Calculate totals for the columns 2 and 3
total_mega_matrix <- colSums(my_subsetted_mega_matrix)

As seen above there are another 2 functions we can use on matrices namely colSums() and rowSums() to do column and row arithmetic. This is addition to other standard arithmetic. All computations are performed element wise. So we can do my total_mega_matrix * 1.3 to convert the totals in GBP to USD (as an example). Performing calculations using 2 matrices is just the same (matrixA – MatrixB). Be careful here though, if they contain the same number of elements, everything will be done element wise, else recycling will occur.

Notice the similarity between vectors and matrices – they’re both data structures that store elements of the same type. Both perform coercion, and recycling. Arithmetic is also similar as everything is performed element wise.

Interested in testing your knowledge, check out the DataCamp exercises here, here, and here.

 

Factors

Unlike numeric variables, categorical variables can only take a limited number of different values. The specific data structure for this is what is known as a factor. A good example of this is blood, types can only be of type A, B, AB, or O – we then define a vector of peoples blood types blood <- c(“B”,”AB”,”O”,”O”,”A”,”B”,”B”). To convert this vector to a factor we can use factor(blood). R scans the vector to check for categories, and stores the distinct list as levels (sorted alphabetically). Values in the vector and then replaced with numeric values corresponding to the associated level. You can think of factors as integer vectors, where each integer refers to a category or level. To inspect the structure, we can use str(factor).

Similar to the names() function, you can also specify the levels() function and pass a vector to name the levels differently to those categories picked up in the scan, for instance levels(my_factor) <- c(“BT_A”,”BT_B”,”BT_O”,”BT_AB”). However its much safer to pass in both the levels and the labels because of the way in which it sets the levels alphabetically which means you have to be careful your names correspond correctly to the levels.

In statistics, there is also a difference between nominal categorical variables and ordinal categorical variables – nominal variables have no implied order, i.e. blood type A is not greater or less than B. There are examples however where ordering does exist, for example with t-shirt sizes, and you can use R to impose this on the factor. To do this, you can set the ordered function inside the vector to TRUE, and then specify the levels in the correct order. You can now evaluate the levels in the factor. An example can be seen below.

# Definition of temperature_vector
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# Encoded temperature_vector as a factor
temperature_factor <- factor(temperature_vector, 
                             ordered = TRUE,
                             levels = c("Low","Medium","High")
                             )

# Print out
temperature_factor

Interested in testing your knowledge, check out the DataCamp exercises here.

 

This is only just the start of understanding R – in the next blog I’ll look at lists, data frames and most importantly graphics! We can then start to looking at some more complicated examples and use cases.