Learning the Fundamentals of Data Analysis & Data Visualization in R.

An Extended Tutorial For Beginners To Data Science And Data Visualization in R.

This article takes a closer look at the programming language R. We will not make it complicated, so it should be understandable for everyone. Programming in R is getting more popular every year, and it helps to look at (simple) code examples of R. Grab a cup of coffee and read on!


What Is R?

In 1993 the programming language R was designed by Ross Ihaka and Robert Gentleman at the University of Auckland in New Zealand. The first names of both developers start with an R. That is where the programming language’s name comes from.

R is a newer implementation of the programming language S, which existed before R was launched and was designed for statistical computing. The name “R” is also a play on the name of its predecessor, “S.”

The “R core team” maintains the programming language. Naturally, this team includes founders Ross and Robert. Also on the team is John Chambers, the founder of the programming language S.


What Can We Do With R?

What can you actually do when you start programming in R? From the above text, you have already understood that R is mainly used for statistical calculations and data analysis. You can interpret and visualize data with it.

Let’s go deeper into the possibilities with R. Below are some possibilities that you have not yet seen.

  • Create Machine Learning Algorithms
  • Develop and host web apps
  • Data mining
  • Develop Deep Learning models

Getting Started.

First of all, you need an environment before you can program anything at all, so let’s download the R programming language and RStudio (an IDE).

https://rstudio.com/ — Download Link

When you boot up, you will see the following screen:

Image By Author: Bryan Dijkhuizen

Understanding Your IDE.

If you are new to programming in an IDE, the program might look overwhelming at first, so let’s go through its most important parts.

The Console


In the top left corner of your screen, you will see a console in which you can type commands to execute R code, for example:

> plot(iris)

This will plot a graph of the famous dataset: ‘iris’:


The plot will appear in the bottom right corner. You can also export this image, for example as a PNG file.

The Files

As you switch from ‘Plots’ to ‘Files,’ you’ll see all of the files in your working directory, which is currently only the Tutorial.Rproj file, which holds the settings of your project.


The Environment

Another important part is the environment.


This environment tab holds all of your variables, so let’s create a simple variable in our console by typing: a <- 1 .


That way, a variable gets added to the environment.


Learning About Variables.

Now that you know how your IDE works, we can get started with programming, and the first thing you’ll need to learn about is variables.

What is a variable?

In mathematics, a variable is a designation for any element of a collection. It is said that the variable traverses the set or that the variable takes values in that set. A variable is usually represented by a letter, but sometimes by more than one letter in the alphabet; letters from other alphabets are also used.

In programming, a variable is a named storage location that holds a value.

How do I create a variable in R?

To create a variable in R, we use the <- operator. The = sign is also allowed, but it is generally considered bad practice, and you want to get this right from the beginning.

> a <- 1 # this stores the number 1 (as a double) in a variable called a

To print the variable, call it:

> a
[1] 1
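
As a small sketch of the assignment styles mentioned above, the snippet below uses only base R. The variable names are made up for illustration.

```r
a <- 1        # conventional assignment with the arrow operator
b = 2         # also valid at the top level, but less idiomatic
3 -> d        # right-hand assignment exists too, though it is rarely used

a             # typing a name prints its value when run interactively
print(b)      # print() does the same explicitly, which also works inside scripts
d_defined <- exists("d")   # TRUE once d has been assigned
print(d_defined)
```

All three forms create a variable that shows up in the Environment tab.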

Data Types

There are many data types. The most important ones are:

  • Numerics (integers and doubles).
  • Characters (strings).
  • Logicals (Booleans).
  • ‘Collections’ (There are many different kinds of collections)

We’ll create one variable for each of the basic types:

> newInteger <- 1L # Integer with value: 1 (without the L suffix, R stores 1 as a double)
> newCharacter <- "a" # Character with value: "a"
> newBoolean <- TRUE # Boolean with value: TRUE
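
To check which type R actually assigned, typeof() can be used. The following is a minimal sketch; the variable names are illustrative and not part of the examples above.

```r
new_integer   <- 1L      # the L suffix asks for an integer
new_double    <- 1       # a plain number is stored as a double
new_character <- "a"     # text of any length has type character
new_logical   <- TRUE    # logical values are TRUE and FALSE

# Collect the type of each variable in one character vector
types <- c(
  typeof(new_integer),
  typeof(new_double),
  typeof(new_character),
  typeof(new_logical)
)
print(types)
```

Note that R has no separate type for a single character versus a longer string: both are of type character.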

We can even do maths with these variables. For example, let’s add two variables:

> a <- 4
> b <- 6
> c <- a + b

This adds a and b and stores the result in c:

> c
[1] 10

Collections: Vectors, Matrices, Lists and Arrays.

Now we arrive at the most important part of Data Science, the actual data, and how this data is stored through collections. We call them: Data Structures.

Vectors.

An example of a data structure (a way to store your data) is a vector. It is the most basic data structure in R.

A vector is a one-dimensional collection. Let’s create one:

> v1 <- c(1, 2, 3, 4, 5)

The c() concatenates the numbers into one object (and puts it into the vector).

Let’s see the result:

> v1
[1] 1 2 3 4 5

To be sure this is a vector, R has built-in is.*() functions, such as is.vector(), which you can use to check a data type:

> is.vector(v1)
[1] TRUE

As you can see, we are dealing with a real vector over here.

You can fill a vector with all sorts of data types, as long as all elements are of the same type: numerics, characters, or logicals (Booleans). If you mix types, R converts all elements to one common type.
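
A short sketch of what you can do with a vector once it exists: pick elements by position and calculate with all elements at once. It reuses v1 from above; the other names are illustrative.

```r
v1 <- c(1, 2, 3, 4, 5)

first   <- v1[1]      # indexing starts at 1, not 0
middle  <- v1[2:4]    # a range of positions
doubled <- v1 * 2     # arithmetic is applied to every element
total   <- sum(v1)    # sum of all elements
n       <- length(v1) # number of elements

print(doubled)
```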

Matrices.

For two-dimensional collections, we use matrices. In linear algebra, a subfield of mathematics, a matrix is a rectangular array of numbers.

In R, it is no more than a vector with two dimensions: rows and columns.

Let’s create one with characters:

> m1 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2)

The nrow argument tells matrix() to create two rows. When I print it, you can see the data is arranged in a 2D structure, filled column by column.

> m1
     [,1] [,2] [,3]
[1,] "a"  "c"  "e"
[2,] "b"  "d"  "f"
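
The following sketch shows how elements of the matrix can be looked up by row and column, and how the fill order can be changed. m2 is an illustrative name; byrow is a standard argument of matrix().

```r
m1 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2)

size       <- dim(m1)   # number of rows, then number of columns
one_cell   <- m1[1, 2]  # row 1, column 2
second_row <- m1[2, ]   # leave the column empty to get a whole row
third_col  <- m1[, 3]   # leave the row empty to get a whole column

# byrow = TRUE fills the matrix row by row instead of column by column
m2 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2, byrow = TRUE)
print(m2)
```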

Arrays.

When you want to structure data of a single type in more than two dimensions, arrays are perfect.

I’m not going to spend much time on arrays since we will take a closer look at data frames, but let’s create a simple array.

> x <- array(1:16, dim = c(4, 4, 2))

When we take a look at this array, we see that two tables were created, each having four rows and four columns. Because 1:16 only holds 16 values, R reuses them to fill the second table:

> x
, , 1

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16

, , 2

     [,1] [,2] [,3] [,4]
[1,]    1    5    9   13
[2,]    2    6   10   14
[3,]    3    7   11   15
[4,]    4    8   12   16
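
As a sketch of how a three-dimensional array is indexed, the snippet below picks single cells with three positions: row, column, and table. The second array y is an illustrative addition that holds 32 distinct values, so both tables differ.

```r
x <- array(1:16, dim = c(4, 4, 2))

shape    <- dim(x)     # rows, columns, tables
cell_t1  <- x[2, 3, 1] # row 2, column 3 of the first table
cell_t2  <- x[2, 3, 2] # same position in the second table (same value, since 1:16 was reused)

# With 32 values, nothing has to be reused
y <- array(1:32, dim = c(4, 4, 2))
cell_y <- y[2, 3, 2]

print(c(cell_t1, cell_t2, cell_y))
```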

Data Frames.

A data frame is similar to a matrix. It is also a collection of data in tabular form. But unlike a matrix or an array, its columns can be of different types. You can display numerical data together with text (characters) in a data frame.

In that sense, a data frame most resembles a spreadsheet. The text can then be found, for example, at the head of the columns, as the names of the variables whose values are in the columns.

Let’s start by showing you an example and create three individual vectors:

> vNums <- c(1, 2, 3)
> vChars <- c("a", "b", "c")
> vBools <- c(TRUE, FALSE, TRUE)

The next step is to combine these. A first attempt could be cbind():

> df1 <- cbind(vNums, vChars, vBools)

But this has one issue: cbind() creates a matrix, which can only hold one data type, so it converts everything to characters:

> df1
     vNums vChars vBools
[1,] "1"   "a"    "TRUE"
[2,] "2"   "b"    "FALSE"
[3,] "3"   "c"    "TRUE"

That’s something you’d rather avoid. Wrapping the result in as.data.frame() would not help, because the values have already been converted. Instead of cbind(), we use data.frame(), which keeps the type of each vector:

> df2 <- data.frame(vNums, vChars, vBools)

When we print our new data frame:

> df2
  vNums vChars vBools
1     1      a   TRUE
2     2      b  FALSE
3     3      c   TRUE

In practice, you will not often create a data frame within R in the above way to store your data in tabular form.

You are much more likely to do that with a program, for example R’s own data editor, edit().
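
A small sketch of how the column types of a data frame can be checked and how columns and rows are accessed. It uses data.frame() because that function keeps each vector’s own type; stringsAsFactors = FALSE is passed explicitly so the text column stays character on older R versions as well.

```r
vNums  <- c(1, 2, 3)
vChars <- c("a", "b", "c")
vBools <- c(TRUE, FALSE, TRUE)

df <- data.frame(vNums, vChars, vBools, stringsAsFactors = FALSE)

col_types <- sapply(df, class)   # one class per column
avg       <- mean(df$vNums)      # $ selects a column by name
true_rows <- df[df$vBools, ]     # keep only the rows where vBools is TRUE

print(col_types)
print(true_rows)
```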

Lists.

Now, let’s take a look at lists. As with a data frame, the elements of a list can have names. This is very useful if, for example, you want to look up data later by name.

Lists are handy objects in R. A list is an ordered collection of elements, such as vectors, that can each have a different type and length and can each be given a name.

These names are called keys. The elements behind these names are called values. This means you are talking about key-value pairs. Because of this structure, you can call up the value behind a key with the $ operator, just like with a data frame.

Let’s create a list:

> vNums2 <- c(1, 2, 3)
> vChars2 <- c("a", "b", "c")
> vBools2 <- c(TRUE, FALSE, TRUE)

We take three new vectors and combine those into a list:

> list1 <- list(vNums2, vChars2, vBools2)

This will result in:

> list1
[[1]]
[1] 1 2 3
[[2]]
[1] "a" "b" "c"
[[3]]
[1] TRUE FALSE TRUE

We can even put a list in a list:

> list2 <- list(vNums2, vChars2, vBools2, list1)

Which gives the following result:

> list2
[[1]]
[1] 1 2 3
[[2]]
[1] "a" "b" "c"
[[3]]
[1] TRUE FALSE TRUE
[[4]]
[[4]][[1]]
[1] 1 2 3
[[4]][[2]]
[1] "a" "b" "c"
[[4]][[3]]
[1] TRUE FALSE TRUE
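
The lists above have no names, so their elements can only be reached by position. As a sketch of the key-value idea, here is a named list; the person and its properties are made up for illustration.

```r
person <- list(name = "Ann", age = 30, likes = c("tea", "R"))

keys        <- names(person)    # the keys of the list
the_name    <- person$name      # look up a value with $
the_age     <- person[["age"]]  # or with double brackets and the key as text
second_like <- person$likes[2]  # the value itself can be a vector

person$city <- "Utrecht"        # assigning to a new key adds an element
n_elements  <- length(person)

print(keys)
```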

Data Type Conversion.

It is useful that R automatically defines the class. However, sometimes you are not satisfied with the data type that R automatically assigns. That is why you can easily change the data type in R.

Changing a data type is called coercion. R also coerces automatically when you combine different types in one vector:

> (coerce1 <- c(1, "a", TRUE))
[1] "1"    "a"    "TRUE"

R automatically coerces all elements to characters, simply because that is the most general data type: any value can be written as text. Another example: when we save a whole number, it is stored as a double by default. Let’s say we don’t want that and really want to save it as an integer. We can do that with the following lines of code:

> (coerce2 <- 5)
[1] 5
> typeof(coerce2)
[1] "double"
> (coerce3 <- as.integer(5))
[1] 5
> typeof(coerce3)
[1] "integer"

And as you can see, it is stored as an integer. You can also do this with characters you want to store as numerics:

> (coerce4 <- c("1", "2", "3"))
[1] "1" "2" "3"
> (coerce5 <- as.numeric(c("1", "2", "3")))
[1] 1 2 3

Now, they’re saved as numerics.
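
Coercion does not always succeed. The sketch below shows what happens when a value cannot be converted, along with a few other as.*() functions from base R. The input values are made up for illustration.

```r
mixed <- c("1", "2", "three")

# "three" cannot be read as a number, so it becomes NA (R also issues a warning,
# which is suppressed here to keep the output clean)
nums <- suppressWarnings(as.numeric(mixed))
missing <- is.na(nums)

as_text  <- as.character(42)   # number to text
as_flag  <- as.logical("TRUE") # text to logical
as_whole <- as.integer(3.9)    # the decimal part is dropped, not rounded

print(nums)
```

Checking for NA after a conversion is a simple way to spot values that did not convert.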

Convert Matrices to Data Frames.

Converting a matrix into a data frame is a more advanced technique, but it is very handy when you’re analyzing or editing data, because a data frame has a lot more options than a matrix.

We use the matrix from before:

> m1 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2)

And we will turn it into a data frame:

> (coerce6 <- as.data.frame(m1))
V1 V2 V3
1 a c e
2 b d f

Let’s check:

> is.data.frame(coerce6)
[1] TRUE
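
One of those extra options is working with column names. As a sketch, the default names V1, V2, V3 can be replaced with more meaningful ones; the new names here are made up for illustration.

```r
m1 <- matrix(c("a", "b", "c", "d", "e", "f"), nrow = 2)

# stringsAsFactors = FALSE keeps the text columns as character on older R versions too
df <- as.data.frame(m1, stringsAsFactors = FALSE)

old_names <- names(df)                       # default names given by R
names(df) <- c("first", "second", "third")   # replace them
second    <- df$second                       # columns can now be looked up by name

print(df)
```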

Cleaning Your Environment.

After you have experimented with R and all of the variables and data types, you’ve got a lot going on in your console and your environment. Let’s clean all of that up with these simple commands:

> rm(list = ls())

This removes all variables from the environment.

We can do the same thing with our console:

> cat("\f")
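
If you only want to remove some variables rather than everything, rm() also accepts individual names. A minimal sketch with two throwaway variables:

```r
a <- 1
b <- 2

rm(a)   # remove only the variable a

# inherits = FALSE limits the check to the current environment
a_exists <- exists("a", inherits = FALSE)
b_exists <- exists("b", inherits = FALSE)

print(c(a = a_exists, b = b_exists))
```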

Data Visualization

Data visualization is a method for converting abstract data into concrete information and possibly knowledge.

The exponential increase in data means that we can measure more and more and gain insight from it. You do this with the help of data visualization. For many people, however, data is abstract. Creating data visualizations lets us convert data into tangible and readable information. But why do we want this, what exactly does it bring us, and how can we implement it?

As mentioned above, there is an enormous increase in available internal and external data. With the help of data, we can convey a message, and we do that with the help of data visualization.


Why is data visualization important?

Effective communication is essential when conveying a story or advice. It can easily happen that when conveying a story, the recipient misunderstands your message. Consider, for example, transferring a story via WhatsApp. We write a message, but it can easily be misunderstood.

How is it then that a face-to-face message is less often misunderstood? This has to do with expression, body language, and sounds. Consider, for example, a situation where you convey a sarcastic message. This is a lot easier face-to-face than via WhatsApp.

In short: in mutual communication, expression and sounds support the message you want to convey. So what created this visual and auditory way of thinking? Our early ancestors (around prehistoric times) were already visual and auditory. This originated as a survival mechanism, a way of estimating dangerous situations.

So it still applies to us that we are visually and audibly oriented. That is why it is good to communicate with the help of expression and sounds. We can convey a theory or advice in both ways. It works best when you combine both. This can be done by visualizing the data and then giving a face-to-face explanation of the visualization: storytelling.


Getting Started.

Before getting into data visualization with R, you need to make sure you have a fair understanding of how this programming language works.

In this tutorial, I will be using RStudio. So let’s get started.


Colors in R.

The best way to look at data is probably a graph, and one of the simplest graphs is a barplot. We can very easily create one in R by first creating a tiny dataset:

> x <- c(24, 13, 7, 5, 3, 2)

This vector has 6 values; let’s create the barplot:

> barplot(x)

This will result in the following image:


But this isn’t very appealing. With R, you can implement all sorts and kinds of colors. To take a look at which colors you can use, enter the following line of code:

> colors()

This will print a long list of color names. We can use one of these to color our graph:

> barplot(x, col = "powderblue")

You can also use any RGB or HEX color codes to color your graphs:

> barplot(x, col = "#4F0B96")
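
As a sketch of how RGB values and HEX codes relate, the base R functions rgb() and col2rgb() convert between the two. The purple used above serves as the example.

```r
# Build a HEX code from red, green, and blue values on a 0-255 scale
purple <- rgb(79, 11, 150, maxColorValue = 255)

# Convert a HEX code (or a color name) back to its RGB values
channels <- col2rgb(purple)

# Check whether a name is one of the built-in color names
is_named <- "powderblue" %in% colors()

print(purple)
print(channels)
```

The resulting string can be passed to the col argument in the same way as a color name.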

Bar charts in R.

One of the simplest kinds of data visualization is a bar chart. Because it is so clear, such a chart can help you with a large share of your data problems. Let’s take a look at how we can create a bar chart.

A bar graph is a graphical representation of the frequency distribution of data from a discrete probability distribution. This diagram shows bars of small width with height equal to the frequencies erected above the possible values.

Loading Packages.

To load the required packages, add the following line to your script. It assumes the pacman package is already installed, for example with install.packages("pacman"):

> pacman::p_load(pacman, tidyverse)

We will be using the ‘diamonds’ dataset, which contains the prices and other attributes of almost 54,000 diamonds. When you print the dataset, you’ll see the first rows and the column names:

> diamonds
# A tibble: 53,940 x 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1 0.23  Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2 0.21  Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3 0.23  Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4 0.290 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5 0.31  Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6 0.24  Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7 0.24  Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8 0.26  Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9 0.22  Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10 0.23  Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ... with 53,930 more rows

The most simple form of a bar chart is created as follows:

> plot(diamonds$cut)

We are plotting the ‘cut’ column, which stands for the quality of the cut of the diamond. If you want extra information about a dataset, put a ? in front of its name (for example, ?diamonds) to open its help page.


Plot using Pipes

We can do the same thing using the tidyverse pipes:

diamonds %>%
  select(cut) %>%
  plot()

To use the barplot() we need to format our data using table() :

diamonds %>%
  select(cut) %>%
  table() %>%
  barplot()

This way, you will be able to add extra steps such as sorting:

diamonds %>%
  select(cut) %>%
  table() %>%
  sort(decreasing = TRUE) %>%
  barplot()

As I said above, using colors can be useful to draw attention. Let’s make our graph purple and add a title:

diamonds %>%
  select(cut) %>%
  table() %>%
  sort(decreasing = TRUE) %>% # Sort table
  barplot(
    main = "Cut of Diamonds",
    col = "#4F0B96"
  )
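
To see the numbers behind such a bar chart, the counting and sorting steps can also be run on their own in base R. The sketch below uses a tiny made-up factor instead of the diamonds data, so it runs without any packages.

```r
# Hypothetical sample of cut qualities (not taken from the diamonds dataset)
cuts <- factor(c("Ideal", "Good", "Ideal", "Premium", "Ideal", "Good"))

counts <- sort(table(cuts), decreasing = TRUE)  # frequency per category, largest first
labels <- names(counts)                          # the categories in sorted order
shares <- round(prop.table(counts), 2)           # the same counts as proportions

print(counts)
print(shares)
```

The sorted table is exactly what barplot() receives in the pipeline above, which is why the bars appear from tallest to shortest.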

Histograms in R.

A histogram or column diagram is the graphical representation of the frequency distribution of data grouped into classes, derived from a continuous probability distribution. This diagram shows columns with an area the size of the frequencies erected above the classes.

We’re using the diamonds dataset again, but now we’re taking a look at the price:

> hist(diamonds$price)

You can see many more ‘cheap’ diamonds than ‘expensive’ diamonds by looking at such a graph. Now we’ll add some options to the graph, just as with the bar charts:

hist(diamonds$price,
  breaks = 7, # Suggest number of breaks
  main = "Histogram of Price of Diamonds",
  ylab = "Frequency",
  xlab = "Price of Diamonds",
  border = TRUE,
  col = "#4F0B96"
)

That looks much better.
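
The breaks argument is only a suggestion, so it can be useful to check which classes R actually chose. A sketch using plot = FALSE, which returns the histogram data without drawing anything; the prices are made up for illustration.

```r
# Hypothetical prices, skewed towards cheap values like the diamonds data
prices <- c(350, 420, 500, 610, 700, 950, 1200, 1800, 2500, 4000, 7500, 15000)

h <- hist(prices, breaks = 7, plot = FALSE)

class_limits <- h$breaks   # the boundaries of the classes R settled on
class_counts <- h$counts   # how many values fall into each class

print(class_limits)
print(class_counts)
```

Every value lands in exactly one class, so the counts always add up to the number of observations.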


BoxPlots in R.

In descriptive statistics, a box plot (also called a box-and-whisker plot or box chart) is a graphical representation of the five-number summary.

This five-number summary consists of the minimum, the first quartile, the median, the third quartile, and the maximum of the observed data.
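
The five-number summary can be calculated directly with the base R function fivenum(). The sketch below uses a small made-up vector and also shows boxplot.stats(), which reports the values a box plot would draw, including points it treats as outliers.

```r
values <- c(2, 4, 4, 5, 7, 9, 10, 12, 30)   # hypothetical data with one extreme value

five <- fivenum(values)        # minimum, lower hinge, median, upper hinge, maximum
bp   <- boxplot.stats(values)  # what a box plot of these values would show

print(five)      # 2 4 7 10 30
print(bp$stats)  # whisker ends, box edges, and median: 2 4 7 10 12
print(bp$out)    # points drawn separately as outliers: 30
```

The upper whisker stops at 12 rather than 30, because by default a box plot draws points that lie more than 1.5 times the box height beyond the box as separate outliers.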

Let’s take a look at how to do a boxplot in R:

> boxplot(diamonds$price)

By default, the chart is drawn vertically. A boxplot is often shown horizontally, so we’ll take a look at that next. We’ll be using the pipes again:

diamonds %>%
  select(price) %>%
  boxplot(
    horizontal = TRUE,
    notch = TRUE,
    main = "Boxplot of Price of Diamonds",
    xlab = "Price of Diamonds",
    col = "#4F0B96"
  )

We take the diamonds dataset, select the price, and send it to the boxplot() function. We add some arguments (horizontal = TRUE, notch = TRUE, main = "title", xlab = "label on the x-axis", col = "color").

To understand boxplots better, take a look at this article by Michael Galarnyk.


Line charts in R.

A line chart or line graph is a chart that usually shows the development of a variable over time.

To do this in R, we will use a different dataset. This time, we use uspop, which holds the population of the United States as recorded every ten years:

> pacman::p_load(pacman, tidyverse)
> library(datasets)
> uspop
Time Series:
Start = 1790
End = 1970
Frequency = 0.1
 [1]   3.93   5.31   7.24   9.64  12.90  17.10  23.20  31.40  39.80  50.20  62.90  76.00  92.00 105.70 122.80 131.70
[17] 151.30 179.30 203.20

Let’s plot this dataset:

> plot(uspop)

You can see that the population has been increasing over the years, and it is actually increasing very smoothly. If you look closely at the ‘bump’ in the graph, you might notice it sits in the decade leading up to World War 2.

Let’s clear our graph up with a title and axis labels:

uspop %>%
  plot(
    main = "US Population 1790–1970",
    xlab = "Year",
    ylab = "Population (in millions)"
  )

Now, I’m going to add a few new things. I will mark the period 1930–1940, the decade of the Great Depression.

uspop %>%
  plot(
    main = "US Population 1790–1970",
    sub = "(Source: datasets::uspop)",
    xlab = "Year",
    ylab = "Population (in millions)"
  )
abline(v = 1930, col = "lightgray")
text(1930, 10, "1930", col = "red3")
abline(v = 1940, col = "lightgray")
text(1940, 2, "1940", col = "red3")

So we have now accentuated the period in which the population didn’t grow as smoothly as it did in the decades before. There are a lot more datasets to experiment with, which is pretty cool to do.
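
The slowdown can also be expressed in numbers. As a sketch, diff() gives the increase from one census to the next; the labels are built by hand from the start year, end year, and ten-year spacing shown in the output above.

```r
pop   <- as.numeric(uspop)              # population in millions, one value per decade
years <- seq(1790, 1970, by = 10)       # census years matching the 19 values

growth <- diff(pop)                     # increase from each census to the next
names(growth) <- paste(years[-length(years)], years[-1], sep = "-")

# Compare the marked decade with the decades on either side
around_1930s <- growth[c("1920-1930", "1930-1940", "1940-1950")]
print(round(around_1930s, 1))
```

Between 1930 and 1940 the population grew by about 8.9 million, roughly half of the growth in the decade before (17.1 million) or after (19.6 million).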


Bottom Line.

I hope you had a nice experience learning R from this article. The examples in this article are only a tiny part of all the possibilities in R. It is nice to experiment further with R by finding a dataset yourself, for example, from the government’s data portal, and constantly challenging yourself to extract valuable information from the dataset.

I can already tell you that this involves a lot of Google work and that you will often visit the StackOverflow website for answers to your questions.

This is the learning process; everything you do not know and look up is something you have learned. Thanks to your knowledge of the structure of R, you will now also be able to ask the right questions. You now know better what you need and can therefore be specific with your searches in Google.

In this way, your knowledge of R will grow. In my next article, I will cover more challenging topics suited to people with sufficient knowledge of R, a group you now also belong to.

bryan@dijkhuizenmedia.com
