Presentation 3: ggplot2
Importing libraries and data
library(readxl)
library(writexl)
library(tidyverse)
Load data
The iris dataset is a widely-used dataset in data science, containing 150 observations of iris flowers with features like sepal length, sepal width, petal length, and petal width. It includes three species: Setosa, Versicolor, and Virginica, making it ideal for classification tasks and data visualization.
data('iris')
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Data wrangling
Making the data more fun to plot.
# Define 5 colors and their respective ratios (150 values in total)
colors <- c(rep("Red", 40),
            rep("Blue", 20),
            rep("Yellow", 30),
            rep("Green", 20),
            rep("Purple", 40))
set.seed(123) # For reproducibility
# Mix the colors randomly by sampling with replacement
# (the final counts will therefore deviate slightly from the ratios above)
colors <- sample(colors, replace = TRUE)
# Add the 'Flower.Color' column to the iris dataset
iris$Flower.Color <- colors
Have a look at the dataset
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Flower.Color
1 5.1 3.5 1.4 0.2 setosa Red
2 4.9 3.0 1.4 0.2 setosa Blue
3 4.7 3.2 1.3 0.2 setosa Purple
4 4.6 3.1 1.5 0.2 setosa Blue
5 5.0 3.6 1.4 0.2 setosa Red
6 5.4 3.9 1.7 0.4 setosa Purple
colnames(iris)
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
[6] "Flower.Color"
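Note that `sample()` with `replace = TRUE` draws with replacement, so the per-color counts in `Flower.Color` need not match the 40/20/30/20/40 ratios defined above, whereas a plain `sample(x)` only permutes the values. A minimal sketch of the difference, using a hypothetical vector `base_colors` that mirrors the one above:

```r
# Hypothetical copy of the color vector defined above
base_colors <- c(rep("Red", 40), rep("Blue", 20), rep("Yellow", 30),
                 rep("Green", 20), rep("Purple", 40))

set.seed(123)  # for reproducibility

# Permutation: every value is used exactly once, so the counts are preserved
shuffled <- sample(base_colors)

# Sampling with replacement: values can repeat or be skipped, so counts vary
resampled <- sample(base_colors, replace = TRUE)

table(shuffled)
table(resampled)
```

Either approach works for this tutorial; the choice only matters if the exact group sizes are important.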
ggplot2: The basic concepts
This is the starting point of a ggplot. The dataframe and the columns we wish to plot are defined. We have not specified what type of plot we want, hence an empty plot is produced.
ggplot(iris, # dataframe
aes(x = Sepal.Length, # x-value
y = Petal.Length)) # y-value
# missing type of plot
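One way to see why the plot is empty is to look at the layers stored in the plot object. The sketch below assumes that the `layers` element of a ggplot object can be read with `$`, which holds for the ggplot2 versions I am aware of; the names `p_empty` and `p_points` are made up for illustration.

```r
library(ggplot2)

# Same starting point as above: data and aesthetics, but no geom yet
p_empty <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length))
length(p_empty$layers)   # number of layers before a geom is added

# Adding a geom appends one layer to the plot object
p_points <- p_empty + geom_point()
length(p_points$layers)  # number of layers after adding geom_point()
```

Each `+ geom_*()` call adds one layer, which is why several geoms can be stacked on the same base plot.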
Scatter plot with geom_point
A scatter plot is made with the geom_point function and gives an overview of the relationship between two numeric variables. Here we see the relationship between sepal length and petal length.
ggplot(iris, # dataframe
aes(x = Sepal.Length, # x-value
y = Petal.Length)) + # y-value
geom_point() # type of plot
Change the color of all points by setting color outside aes().
ggplot(iris,
aes(x = Sepal.Length,
y = Petal.Length)) +
geom_point(color = 'hotpink')
Scatter plot with geom_point with color stratification
To change colors based on a feature, you need to set color inside aes(). Here we see the relationship between sepal length and petal length, colored by species.
ggplot(iris,
aes(x = Sepal.Length,
y = Petal.Length,
color = Species)) +
geom_point()
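A common pitfall follows from the two examples above: a constant placed inside aes() is treated as data rather than as a color. The sketch below contrasts the two cases; `p_set` and `p_map` are hypothetical names, and the inspection of `aes_params` and `mapping` relies on ggplot2 internals that may differ between versions.

```r
library(ggplot2)

# Setting: the constant is passed outside aes(), so every point is hot pink
p_set <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "hotpink")

# Mapping: the string inside aes() becomes a one-level variable, so the points
# get the default palette color and a legend entry labelled "hotpink"
p_map <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = "hotpink"))

p_set$layers[[1]]$aes_params     # fixed parameters of the layer
names(p_map$layers[[1]]$mapping) # aesthetics mapped in the layer
```

As a rule of thumb: column names go inside aes(), fixed values go outside.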
Boxplot with geom_boxplot
Boxplots are great for getting an overview of continuous variables and for spotting outliers. The variable can be shown on either axis (x or y).
ggplot(iris,
aes(y = Sepal.Length)) +
geom_boxplot()
Split up by categorical variable like Species:
ggplot(iris,
aes(y = Sepal.Length,
x = Species)) +
geom_boxplot()
… or by fill color.
ggplot(iris,
aes(y = Sepal.Length,
fill = Species)) +
geom_boxplot()
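The numbers behind the boxplots above can be approximated by hand. The sketch below computes quartiles per species and counts points beyond 1.5 times the interquartile range, which is the conventional outlier rule; geom_boxplot may compute its hinges slightly differently, so treat this as an illustration rather than an exact reproduction. The name `box_stats` is made up.

```r
library(dplyr)

box_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    q1     = quantile(Sepal.Length, 0.25),  # lower edge of the box
    median = median(Sepal.Length),          # line inside the box
    q3     = quantile(Sepal.Length, 0.75),  # upper edge of the box
    iqr    = IQR(Sepal.Length),             # height of the box
    # Points further than 1.5 * IQR from the box are usually drawn as outliers
    n_outliers = sum(Sepal.Length < q1 - 1.5 * iqr |
                     Sepal.Length > q3 + 1.5 * iqr)
  )

box_stats
```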
Violin plot with geom_violin
A violin plot shows the distribution of a continuous variable across different categories, combining the features of a box plot and a density plot. Also, the labels can be edited.
ggplot(iris,
aes(y = Sepal.Length,
x = Species)) +
geom_violin() +
labs(y = 'Sepal Length',
x = 'Flower Species',
title = 'Violin plot of sepal length stratified by flower species')
Histogram with geom_histogram
A histogram shows the distribution of a continuous variable.
ggplot(iris,
aes(x = Sepal.Length)) +
geom_histogram()
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
You will sometimes get a message suggesting that you pick another binwidth. Do what it says and you will often get a nicer plot (though sometimes nothing changes).
ggplot(iris,
aes(x = Sepal.Length)) +
geom_histogram(binwidth = 0.5)
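Besides `binwidth`, geom_histogram also accepts `bins` (the number of bins) and `boundary` (where a bin edge should fall). The sketch below is one possible choice for this variable, not a recommended default; the values 0.25 and 4 and the name `p_hist` are assumptions made for illustration.

```r
library(ggplot2)

# Narrower bins with edges aligned to multiples of 0.25, starting from 4
p_hist <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(binwidth = 0.25, boundary = 4, color = "white")

p_hist

# layer_data() returns the computed bins; the counts should add up to the
# number of rows in the data
bin_table <- layer_data(p_hist)
sum(bin_table$count)
```

Checking the computed bins this way is a quick sanity test that no observations were dropped by the binning.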
Bar chart with geom_bar
A bar chart is made with the geom_bar function and gives an overview of the distribution of a single categorical variable, e.g. Flower.Color in this instance. The function treats the x-axis as categorical and calculates the bar heights from the number of occurrences in each category. Here we see the number of flowers of each Flower.Color. Notice that the values of Flower.Color are sorted alphabetically.
# Save plot in p
p <- ggplot(iris,
            aes(x = Flower.Color)) +
geom_bar()
# Show p
p
# Show p with new labels (p itself is not changed)
p + labs(x = 'Awesome Flower Colors', y = 'Awesome Count')
# Show p again
p
# Save p with new labels in a new variable, p2
p2 <- p + labs(x = 'Awesome Flower Colors', y = 'Awesome Count')
# Show p2
p2
Color by species. The bars are stacked by default.
ggplot(iris,
aes(x = Flower.Color,
fill = Species)) +
geom_bar()
Add position = "dodge" for bars to be placed next to each other.
ggplot(iris,
aes(x = Flower.Color,
fill = Species)) +
geom_bar(position = "dodge")
Add position = "fill" to normalize the bars so that their heights represent proportions rather than counts.
Additionally, themes can be added as a layer to any ggplot if you prefer a theme other than the default grey background.
ggplot(iris,
aes(x = Flower.Color,
fill = Species)) +
geom_bar(position = "fill") +
theme_bw()
# theme_classic()
# theme_minimal()
# theme_dark()
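The proportions drawn by position = "fill" can also be computed directly, which is useful if you want to report the numbers. The sketch below uses a hypothetical stand-in data frame `iris_col` with a randomly drawn `Flower.Color` column (equal probabilities, so the counts differ from the ones in this presentation); `props` is also a made-up name.

```r
library(dplyr)

set.seed(123)  # for reproducibility

# Hypothetical stand-in for the iris data with a Flower.Color column
iris_col <- iris %>%
  mutate(Flower.Color = sample(c("Red", "Blue", "Yellow", "Green", "Purple"),
                               size = n(), replace = TRUE))

# Count each color/species combination, then divide by the total per color
props <- iris_col %>%
  count(Flower.Color, Species) %>%
  group_by(Flower.Color) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

props
```

Within each color the `prop` values sum to 1, which corresponds to the full bar height in the normalized chart.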
Use facet_wrap if you want a plot to be split up according to a categorical variable.
ggplot(iris,
aes(x = Flower.Color)) +
geom_bar() +
facet_wrap(vars(Species))
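facet_wrap has a few arguments that control the panel layout, such as `ncol` and `scales`. The sketch below applies them to a scatter plot rather than the bar chart above so that it runs on the unmodified iris data; `p_facet` is a hypothetical name and the chosen argument values are only examples.

```r
library(ggplot2)

# One panel per species, stacked in a single column,
# with the y-axis allowed to differ between panels
p_facet <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point() +
  facet_wrap(vars(Species), ncol = 1, scales = "free_y")

p_facet
```

Free scales make it easier to see patterns within each panel, at the cost of making panels harder to compare with each other.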
Ordering columns
We can order the columns by count, here from highest to lowest. This is actually not that easy in R.
First, we see that the class of the Flower.Color is character. Characters are always sorted alphabetically like we saw above.
class(iris$Flower.Color)
[1] "character"
Extract the number of flowers for each Flower.Color.
dl_Flower.Color <- iris %>%
  group_by(Flower.Color) %>%
summarise(n = n()) %>%
arrange(desc(n))
dl_Flower.Color
# A tibble: 5 × 2
Flower.Color n
<chr> <int>
1 Red 44
2 Yellow 33
3 Purple 30
4 Green 22
5 Blue 21
dl_Flower.Color$Flower.Color
[1] "Red" "Yellow" "Purple" "Green" "Blue"
Change the class of the Flower.Color feature to factor and set the levels according to the number of flowers with each color.
iris$Flower.Color <- factor(iris$Flower.Color,
                            levels = dl_Flower.Color$Flower.Color)
Check class now
class(iris$Flower.Color)
[1] "factor"
Now we make the same plot as before and see that the bars are ordered from the largest to the smallest Flower.Color group. The plot is saved in the variable p.
p <- ggplot(iris, # dataframe
            aes(x = Flower.Color)) + # x-value
geom_bar() # type of plot
p
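As an alternative to counting and setting the levels by hand, the forcats package (part of the tidyverse) has `fct_infreq()`, which orders factor levels by frequency, and `fct_rev()`, which reverses them. The sketch below uses a small made-up vector `x` rather than the iris data:

```r
library(forcats)

# Hypothetical toy vector: Red occurs 3 times, Green 2 times, Blue once
x <- c("Blue", "Red", "Red", "Green", "Red", "Green")

# Levels ordered from most to least frequent
f_desc <- fct_infreq(x)
levels(f_desc)  # "Red" "Green" "Blue"

# Reverse the order to go from least to most frequent
f_asc <- fct_rev(f_desc)
levels(f_asc)   # "Blue" "Green" "Red"
```

Applied to this presentation, something like `fct_infreq(iris$Flower.Color)` should give the same ordering as the manual approach, apart from how ties are broken.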
We can also flip the chart. We update the plot, p, by reassignment.
p <- p + coord_flip()
p
Since we are working with colors, we can change the colors of the bars to match the groups.
R color chart here
# Define color palette
color_palette <- c("Red" = "red3",
                   "Blue" = "cornflowerblue",
                   "Yellow" = "lightgoldenrod1",
                   "Green" = "darkolivegreen2",
                   "Purple" = "darkorchid3")
p <- p +
  aes(fill = Flower.Color) + # add the fill aesthetic
  scale_fill_manual(values = color_palette) # set the fill color according to the color palette
print(p)
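With a named palette, scale_fill_manual matches the names of the vector to the values of the fill variable, so a misspelled name tends to leave a group without its intended color. A small sanity check can catch this before plotting. The sketch below is self-contained and uses hypothetical copies `color_levels` and `palette_check` of the levels and palette above:

```r
# Hypothetical copies of the factor levels and the palette used above
color_levels  <- c("Red", "Yellow", "Purple", "Green", "Blue")
palette_check <- c("Red"    = "red3",
                   "Blue"   = "cornflowerblue",
                   "Yellow" = "lightgoldenrod1",
                   "Green"  = "darkolivegreen2",
                   "Purple" = "darkorchid3")

# Levels without a palette entry (ideally none)
missing_levels <- setdiff(color_levels, names(palette_check))
missing_levels

# Palette values that are not built-in R color names (ideally none)
unknown_colors <- setdiff(palette_check, grDevices::colors())
unknown_colors
```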
Bar chart with geom_col
Another way to make a bar chart is by using geom_col. Unlike geom_bar, which only requires an x-value and automatically counts occurrences, geom_col requires both x- and y-values. This makes geom_col ideal for cases where you already have pre-calculated values that you want to use as the bar heights.
The mean of the sepal length within each color is calculated using the summarize function.
mean_sepal_length_pr_color <- iris %>%
  group_by(Flower.Color) %>%
  summarize(mean_Sepal.Length = mean(Sepal.Length))
head(mean_sepal_length_pr_color)
# A tibble: 5 × 2
Flower.Color mean_Sepal.Length
<fct> <dbl>
1 Red 5.88
2 Yellow 5.89
3 Purple 5.91
4 Green 5.65
5 Blue 5.79
ggplot(mean_sepal_length_pr_color, # dataframe
aes(x = Flower.Color, # x-value
y = mean_Sepal.Length)) + # y-value
geom_col() # type of plot
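If the pre-calculated table is not needed elsewhere, the summary can also be computed inside the plot with stat_summary. The sketch below groups by Species instead of Flower.Color so that it runs on the unmodified iris data; `p_mean` and `plotted` are hypothetical names.

```r
library(ggplot2)

# Let ggplot2 compute the mean per group and draw it as columns
p_mean <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  stat_summary(fun = mean, geom = "col")

p_mean

# The plotted heights should match the group means computed by hand
plotted  <- sort(layer_data(p_mean)$y)
expected <- sort(as.numeric(tapply(iris$Sepal.Length, iris$Species, mean)))
all.equal(plotted, expected)
```

The summarize plus geom_col route shown above keeps the numbers available as a table, while stat_summary is shorter when only the plot is needed.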