Plotly¶
Python Open Source Graphing Library¶
Plotly's Python graphing library makes interactive, publication-quality graphs. It can produce line plots, scatter plots, area charts, bar charts, error bars, box plots, histograms, heatmaps, subplots, multiple-axes plots, polar charts, and bubble charts.
Plotly has an easy-to-use interface called Plotly Express, which makes plotting with Plotly very easy. Plotly Express works nicely with Pandas dataframes as input: we just need to specify which columns should be plotted.
Import modules¶
import pandas as pd
import plotly.express as px
Introduction¶
Let's start exploring the Plotly library. The general syntax for any Plotly Express chart is px.chart_type(data, parameters), so let's try a simple line chart: px.line(data, parameters).
There are three ways to pass data to the plot. We will check them all, but I think the third one makes the most sense.
- using lists of values
- using pandas.Series
- using pandas.DataFrame and referencing the column names
1. Using lists of values.
We can create two lists of values for the x and y axes and use them as parameters for the line chart plot:
year = list(range(1996,2020,4))
medals = [1,4,5,9,1,2]
print(year)
print(medals)
[1996, 2000, 2004, 2008, 2012, 2016]
[1, 4, 5, 9, 1, 2]
px.line(x = year, y = medals)
2. Using pandas.Series
This is very much like using lists
year_series = pd.Series(year)
medals_series = pd.Series(medals)
print(year_series)
print(medals_series)
0    1996
1    2000
2    2004
3    2008
4    2012
5    2016
dtype: int64
0    1
1    4
2    5
3    9
4    1
5    2
dtype: int64
px.line(x = year_series, y = medals_series)
3. Using pandas.DataFrame
This is usually the best option because we can plot directly from our DataFrame of interest. We pass the dataframe to the px.chart_type() function using the argument data_frame. Then we only need to specify the names of the columns we want to use as the x and y axes!
# We create our dataframe
df = pd.DataFrame({"Year" : year, "Medals" : medals})
df.head()
| Year | Medals |
---|---|---|
0 | 1996 | 1 |
1 | 2000 | 4 |
2 | 2004 | 5 |
3 | 2008 | 9 |
4 | 2012 | 1 |
px.line(data_frame = df, x = "Year" , y = "Medals")
Note: If our dataframe is in wide format, we may need to reshape it to long format. This means that each variable of interest needs to be its own column! Have a look at the melt method in Pandas. For example, let's make a wide dataframe:
df = pd.DataFrame({'Year': {0: '2004', 1: '2008', 2: '2012', 3: '2016'},
'Canada': {0: 4, 1: 3, 2: 5, 3: 3},
'USA': {0: 5, 1: 9, 2: 1, 3: 2}})
df
| Year | Canada | USA |
---|---|---|---|
0 | 2004 | 4 | 5 |
1 | 2008 | 3 | 9 |
2 | 2012 | 5 | 1 |
3 | 2016 | 3 | 2 |
In this case, we would like to have a single column that holds the country names (Canada and USA) and another column that holds the medal counts. We use the .melt() method to do this.
long_df = pd.melt(df, id_vars=['Year'], value_vars=['Canada', 'USA'])
long_df
| Year | variable | value |
---|---|---|---|
0 | 2004 | Canada | 4 |
1 | 2008 | Canada | 3 |
2 | 2012 | Canada | 5 |
3 | 2016 | Canada | 3 |
4 | 2004 | USA | 5 |
5 | 2008 | USA | 9 |
6 | 2012 | USA | 1 |
7 | 2016 | USA | 2 |
We may also want to update the column names of our long_df to something more meaningful. Do you remember how to do that from yesterday?
#update column names to 'country' and 'medals'
long_df.rename(columns={'variable': 'country', 'value': 'medals'}, inplace=True)
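As an alternative to renaming afterwards, pd.melt also accepts the var_name and value_name arguments, so the long dataframe can get meaningful column names in one step. A minimal sketch that rebuilds the wide dataframe from above so it runs on its own (the names wide_df and long_df_named are only illustrative):

```python
import pandas as pd

# Same wide-format data as above: one column per country
wide_df = pd.DataFrame({"Year": ["2004", "2008", "2012", "2016"],
                        "Canada": [4, 3, 5, 3],
                        "USA": [5, 9, 1, 2]})

# var_name / value_name name the two new columns directly,
# so no separate rename step is needed
long_df_named = pd.melt(wide_df,
                        id_vars=["Year"],
                        value_vars=["Canada", "USA"],
                        var_name="country",
                        value_name="medals")
print(long_df_named.columns.tolist())
```

Both routes should give the same long dataframe; which one reads better is mostly a matter of taste.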
Now we can use the long format dataframe to plot
px.line(data_frame = long_df, x = "Year" , y = "medals", color = "country")
Save as variable and show¶
We can save our plots as variables. Then, if you would like to show your plot again, you can call the method .show() on it:
fig = px.line(data_frame = long_df, x = "Year" , y = "medals", color = "country")
fig.show()
A look behind the scenes: Plotly object structure¶
Behind the scenes, each graph is a dictionary-like object. When you store the graph in a variable, commonly called fig, you can display this dictionary using fig.to_dict(). Use fig["data"] or fig.data to see the data elements, and fig["layout"] or fig.layout to review the design of the plot.
We can use .to_dict().keys()
to see all keys inside the fig object:
fig.to_dict().keys()
dict_keys(['data', 'layout'])
There are two items inside fig.data
because we have two lines, one for Canada and one for USA.
fig.data
(Scatter({ 'hovertemplate': 'country=Canada<br>Year=%{x}<br>medals=%{y}<extra></extra>', 'legendgroup': 'Canada', 'line': {'color': '#636efa', 'dash': 'solid'}, 'marker': {'symbol': 'circle'}, 'mode': 'lines', 'name': 'Canada', 'orientation': 'v', 'showlegend': True, 'x': array(['2004', '2008', '2012', '2016'], dtype=object), 'xaxis': 'x', 'y': array([4, 3, 5, 3]), 'yaxis': 'y' }), Scatter({ 'hovertemplate': 'country=USA<br>Year=%{x}<br>medals=%{y}<extra></extra>', 'legendgroup': 'USA', 'line': {'color': '#EF553B', 'dash': 'solid'}, 'marker': {'symbol': 'circle'}, 'mode': 'lines', 'name': 'USA', 'orientation': 'v', 'showlegend': True, 'x': array(['2004', '2008', '2012', '2016'], dtype=object), 'xaxis': 'x', 'y': array([5, 9, 1, 2]), 'yaxis': 'y' }))
fig.layout
Layout({ 'legend': {'title': {'text': 'country'}, 'tracegroupgap': 0}, 'margin': {'t': 60}, 'template': '...', 'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'title': {'text': 'Year'}}, 'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'medals'}} })
As you can see, there are many attributes inside this dictionary. This means that a plot can be modified even after it is created. For example, we can use a layout template to modify the design of a plot, or change the plot and axis titles:
fig.update_layout(template="plotly_dark", title = "Example", yaxis_title='Medals Earned')
This update is not only displayed, it has also changed the plot object. See how the layout
part is different now:
fig.layout
Layout({ 'legend': {'title': {'text': 'country'}, 'tracegroupgap': 0}, 'margin': {'t': 60}, 'template': '...', 'title': {'text': 'Example'}, 'xaxis': {'anchor': 'y', 'domain': [0.0, 1.0], 'title': {'text': 'Year'}}, 'yaxis': {'anchor': 'x', 'domain': [0.0, 1.0], 'title': {'text': 'Medals Earned'}} })
We can also modify the attributes of the data using the update_traces
method. For example, we change all lines to be dashed:
fig.update_traces(line={"dash":"dash"})
We will see more ways of modifying the plots as we go through the different types of plots we can make!
More fun with line graphs¶
There are even more things we can do with line graphs!
Let's use a dataframe with more rows and columns. We will make use of the gapminder dataset, which is already integrated in plotly. We can load it by calling px.data.gapminder(). Let's see what kind of dataset this is:
gapminder_data = px.data.gapminder()
gapminder_data
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
0 | Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.445314 | AFG | 4 |
1 | Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.853030 | AFG | 4 |
2 | Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.100710 | AFG | 4 |
3 | Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.197138 | AFG | 4 |
4 | Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.981106 | AFG | 4 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
1699 | Zimbabwe | Africa | 1987 | 62.351 | 9216418 | 706.157306 | ZWE | 716 |
1700 | Zimbabwe | Africa | 1992 | 60.377 | 10704340 | 693.420786 | ZWE | 716 |
1701 | Zimbabwe | Africa | 1997 | 46.809 | 11404948 | 792.449960 | ZWE | 716 |
1702 | Zimbabwe | Africa | 2002 | 39.989 | 11926563 | 672.038623 | ZWE | 716 |
1703 | Zimbabwe | Africa | 2007 | 43.487 | 12311143 | 469.709298 | ZWE | 716 |
1704 rows × 8 columns
For now we would like to only use countries from Oceania. Can you help me to subset the dataframe?
#Let's subset the data
df = gapminder_data.loc[gapminder_data['continent'] == 'Oceania']
df.sample(5)
Color argument¶
As shown above, we can change the color of the lines based on a dataframe column by using the argument color. In this example, we plot the life expectancy column against the year column, and the lines are colored by the content of the country column. This also gives us separate lines for the separate countries.
# We can separate the data from the different countries by color using the argument `color`
# Separating the px.line call into several lines like this is purely aesthetic. It does not influence the flow of the execution.
fig = px.line(df,
x="year",
y="lifeExp",
color='country')
fig.show()
If we instead want to change the color of all the lines, we need to use the method update_traces():
fig.update_traces(line={"color":"red"})
fig.show()
Color_discrete_map argument¶
We can also decide the color palette to use with color_discrete_map. In this case, we need to specify what color should be used for each level of the variable we color by, here country:
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"})
fig.show()
Title argument¶
We have already seen how to change a plot's title with update_layout, but we can also pass a title directly when we make the plot, using the title argument.
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
title="Life expectancy in Oceania")
fig.show()
Text argument¶
We can further display the value of each 'dot' in the line (from the x and y values) by using the text
argument.
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
title="Life expectancy per year",
text="lifeExp") #The text argument allows us to plot the actual number on the datapoint
fig.show()
Notice how the text argument positioned the text right on top of the data points? We can modify this behaviour by updating our figure with the update_traces() method, which will modify all traces inside fig.data.
fig.update_traces(textposition="top center")
fig.show()
Line_dash argument¶
By using the line_dash
argument, we can change the dash pattern of the lines based on a variable.
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
text="lifeExp",
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
line_dash = "country",
title="Life expectancy per year")
fig.update_traces(textposition="top center")
fig.show()
If you instead want all lines to be dashed, you need to use the update_traces() method as shown above. Common choices are dash, dot or the default solid; Plotly also accepts longdash, dashdot and longdashdot.
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
text="lifeExp",
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
title="Life expectancy per year")
fig.update_traces(textposition="top center", line = {"dash" : "dot"}) #now all lines are dotted, it does not depend on the country column anymore.
fig.show()
Line_dash_map argument¶
Similar to color_discrete_map
there is also line_dash_map
to specify the line type at creation.
Note that for this to work you need to specify the line_dash
argument (what column the dashing should depend on), otherwise a dash_map makes no sense.
fig = px.line(df,
x="year",
y="lifeExp",
color="country",
text="lifeExp",
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
line_dash = "country",
line_dash_map = {"Australia":"solid", "New Zealand": "dot"},
title="Life expectancy per year")
fig.show()
Exercise 1: Line graphs¶
Now you!
- Create a line graph of life expectancy per year for the continents 'Oceania' and 'Africa'.
#create the dataframe and verify that it has the data you want
#now make the plot
- Color by the country and change the line type by the continent
- Change the template of the plot. Check out templates here
Quiz¶
What would you do if instead of a line chart you wanted to show the data in a scatter plot?
Scatter plots¶
Scatter plots are coordinate plots that use x and y coordinates to show the relationship between two variables. However, the values of the variables do not necessarily need to be linked or ordered like in a line plot.
Plotting a scatter plot is very much like plotting a line plot, but we use the px.scatter()
function. Many of the arguments shown previously for the line plots work here as well, for example, the color argument:
df = gapminder_data.loc[gapminder_data["continent"] == 'Europe']
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
color="country")
fig.show()
Symbol argument¶
If you want to further differentiate the countries from each other, you can use the symbol argument to get different types of symbols, not just dots/circles.
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
color="country",
symbol='country')
fig.show()
Size argument¶
We can also vary the size of the dots to create bubble plots:
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
color="country",
size='pop') # Using population as the size for the plot
fig.show()
Trendline argument¶
We can easily add trendlines to our scatter plot using the argument trendline. Setting trendline="ols" fits an Ordinary Least Squares trendline (linear regression); note that Plotly Express relies on the statsmodels package for this.
We quickly see that the relationship between GDP and life expectancy is not linear for Europe in general.
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
trendline = "ols") # fitting a trendline with ordinary least squares
fig.show()
If you have separated the countries using the color
argument, you will get a trendline per country.
This will look quite ugly since there are many countries. Some of them actually look like the relationship could be linear.
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
color="country",
trendline = "ols")
fig.show()
If you want to color by a variable but still have a global trend, use the argument trendline_scope="overall". We will also change to a non-linear fit called LOWESS (Locally Weighted Scatterplot Smoothing). This type of fit is also sometimes called LOESS, if you are familiar with that term.
fig = px.scatter(df,
x="lifeExp",
y="gdpPercap",
color="country",
trendline = "lowess",
trendline_scope="overall")
fig.show()
Exercise 2: Scatter plots and trendlines¶
- Using the data from 'Africa', create a scatter plot using GDP and population. Try to make the countries as distinguishable as possible.
- Make two separate plots that model the correlation between GDP and population for each country, once using an OLS fit and once a LOWESS fit. Which fit do you think looks more convincing?
#ols
#lowess
Bar Charts¶
With px.bar(), each row of the DataFrame is represented as a rectangular mark. Bar plots are very useful to show quantitative information across qualitative features such as years, countries or other categorical data.
px.bar() shares a lot of arguments with line and scatter plots.
df = gapminder_data.loc[gapminder_data["continent"] == 'Oceania']
fig = px.bar(df, x='year', y='pop', color='country')
fig.show()
Orientation argument¶
If we would rather see horizontal bars instead of vertical ones, we can set the argument orientation to "h". Note that we need to swap the x and y arguments now!
fig = px.bar(df, x='pop', y='year', color='country', orientation="h")
fig.show()
Text on bar charts¶
You can add text to bars using the text_auto or text argument. text_auto=True will automatically use the same variable as the y argument, while text lets you use any variable.
Let's try this with a different built-in dataset of plotly, Olympic medals:
df = px.data.medals_long()
df
| nation | medal | count |
---|---|---|---|
0 | South Korea | gold | 24 |
1 | China | gold | 10 |
2 | Canada | gold | 9 |
3 | South Korea | silver | 13 |
4 | China | silver | 15 |
5 | Canada | silver | 12 |
6 | South Korea | bronze | 11 |
7 | China | bronze | 8 |
8 | Canada | bronze | 12 |
We would like to see the different types of medals (gold, silver, bronze) per country.
Luckily for us the data is already aggregated so we can directly use the count
column for the height of the bar (the y-axis).
fig = px.bar(df, x="medal", y="count", color="nation", text="nation")
fig.show()
By default, Plotly will scale and rotate text labels to maximize the number of visible labels, which can result in a variety of text angles, sizes and positions in the same figure. The textfont, textposition and textangle trace attributes can be used to control these.
In addition, you can pass a format string to the text_auto argument to format the text shown in the plot.
We will plot populations of European countries and label the bars with auto text, first with the default behaviour.
Let's use everything we learned yesterday and make the mother of all selections: rows for Europe, for the year 2007, and only for countries with a population greater than 2 million (nobody cares about Liechtenstein!).
df = gapminder_data.loc[(gapminder_data["continent"] == 'Europe') & (gapminder_data['year'] == 2007) & (gapminder_data['pop']>2.e6)]
df.head()
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
83 | Austria | Europe | 2007 | 79.829 | 8199783 | 36126.492700 | AUT | 40 |
119 | Belgium | Europe | 2007 | 79.441 | 10392226 | 33692.605080 | BEL | 56 |
155 | Bosnia and Herzegovina | Europe | 2007 | 74.852 | 4552198 | 7446.298803 | BIH | 70 |
191 | Bulgaria | Europe | 2007 | 73.005 | 7322858 | 10680.792820 | BGR | 100 |
fig = px.bar(df, y='pop', x='country', text_auto='.2s', # '.2s' formats the labels with two significant digits and an SI suffix (e.g. 8.2M)
title="Default: various text sizes, positions and angles")
fig.show()
Again we can use update_traces()
to control the angle of the text (set to 0) and the position (outside the bar) and the font size.
fig = px.bar(df, y='pop', x='country', text_auto='.2s',
title="Controlled text sizes, positions and angles")
fig.update_traces(textfont_size=12, textangle=0, textposition="outside")
fig.update_layout(yaxis_range=[0,10**8]) # We increase the range of the plot so the text fits
fig.show()
Sorting bar charts¶
We can influence the order in which the bars are shown by setting categoryorder, like so. 'total ascending' means to sort by the total y-value in ascending order.
fig.update_xaxes(categoryorder='total ascending')
fig.show()
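We could also impose alphabetical sorting. (The default is to keep the order in which the categories appear in the data, which for this dataframe happens to be alphabetical!)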
fig.update_xaxes(categoryorder='category ascending')
fig.show()
Lastly, you could impose your own custom order by using a category array:
Can someone see what this is sorted by (roughly)?
fig.update_xaxes(categoryorder='array', categoryarray= ['Portugal','Spain','Ireland','France', 'United Kingdom', 'Belgium', 'Netherlands', 'Switzerland', 'Italy',
'Germany','Denmark','Norway','Austria','Sweden', 'Czech Republic', 'Slovenia', 'Croatia', 'Poland',
'Slovak Republic', 'Hungary','Bosnia and Herzegovina', 'Albania', 'Serbia','Greece', 'Romania',
'Bulgaria','Finland', 'Turkey'])
fig.show()
Stacked vs Grouped Bars¶
We will not always have data that is already aggregated.
For example, let's take the tips dataset which reports on tips given by customers and some information about the customer. It looks like this:
df = px.data.tips()
df.sample(10)
| total_bill | tip | sex | smoker | day | time | size |
---|---|---|---|---|---|---|---|
212 | 48.33 | 9.00 | Male | No | Sat | Dinner | 4 |
5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
216 | 28.15 | 3.00 | Male | Yes | Sat | Dinner | 5 |
16 | 10.33 | 1.67 | Female | No | Sun | Dinner | 3 |
31 | 18.35 | 2.50 | Male | No | Sat | Dinner | 4 |
45 | 18.29 | 3.00 | Male | No | Sun | Dinner | 2 |
196 | 10.34 | 2.00 | Male | Yes | Thur | Lunch | 2 |
136 | 10.33 | 2.00 | Female | No | Thur | Lunch | 2 |
75 | 10.51 | 1.25 | Male | No | Sat | Dinner | 2 |
239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
We would like to make a bar plot that shows the total bill for men and for women, so sex goes on the x-axis. We would also like to have the bars colored by time (Lunch or Dinner).
We could do the following:
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()
You see that you get a lot of small bars stacked on top of each other, creating white lines in the plot. If we want to get rid of these, we'll need to make a dataframe that has the sum of the total bill split up by sex and time and then plot that.
Luckily we learned groupby yesterday! Can you help me to create the dataframe we need?
Now we can plot it easily:
fig2 = px.bar(bills, x="sex", y="total_bill", color='time')
fig2.show()
Now you see how pandas and plotly interact and complement each other.
What if we wanted the two chunks for lunch and dinner to be next to each other instead of on top of each other?
We can set the barmode argument to group:
fig3 = px.bar(bills, x="sex", y="total_bill",
color='time', barmode='group')
fig3.show()
Histograms¶
In statistics, a histogram is a representation of the distribution of numerical data, where the data are binned and the count for each bin is represented. More generally, in Plotly a histogram is an aggregated bar chart, with several possible aggregation functions (e.g. sum, average, count...) which can be used to visualize data on categorical and date axes as well as linear axes.
Compared to px.bar(), px.histogram() can work with only the x argument, which can be a continuous or categorical variable:
fig = px.histogram(df, x="total_bill", title = "Continuous variable")
fig.show()
fig = px.histogram(df, x="day", title="Categorical variable")
fig.show()
px.histogram() also shares the color, text_auto and barmode arguments:
fig = px.histogram(df,
x="total_bill",
color="sex",
text_auto=True)
fig.show()
Bins argument¶
By default, the number of bins is chosen so that this number is comparable to the typical number of samples in a bin. This number can be customized, as well as the range of values, with the nbins
argument:
fig = px.histogram(df, x="total_bill", nbins=20)
fig.show()
Histnorm argument¶
The default mode is to represent the count of samples in each bin. With the histnorm argument, it is also possible to represent the percentage or fraction of samples in each bin (histnorm='percent' or 'probability'), a density histogram (the sum of all bar areas equals the total number of sample points, histnorm='density'), or a probability density histogram (the sum of all bar areas equals 1, histnorm='probability density').
fig = px.histogram(df, x="total_bill", histnorm='probability density')
fig.show()
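To make the different normalizations concrete, here is a small worked example with NumPy. It follows the definitions given above on a made-up sample with hand-picked bins; it is a sketch of the arithmetic, not Plotly's internal implementation.

```python
import numpy as np

# Made-up sample and hand-picked bin edges, only for illustration
values = np.array([1.0, 2.0, 2.5, 3.0, 7.0, 8.0, 9.5, 12.0])
counts, edges = np.histogram(values, bins=[0, 5, 10, 15])
widths = np.diff(edges)          # width of each bin
n = counts.sum()                 # total number of samples

percent = 100 * counts / n                    # bar heights sum to 100
probability = counts / n                      # bar heights sum to 1
density = counts / widths                     # bar areas sum to n
probability_density = counts / (n * widths)   # bar areas sum to 1

print(counts, percent, probability, density, probability_density)
```

Note that for the two density variants it is the bar areas (height times bin width) that add up, not the bar heights.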
The y-axis of histograms: Histfunc¶
Usually, we do not pass a y value when we plot a histogram because y should be the count, i.e. how many total bills were between 0 and 10 dollars, how many between 10 and 20 dollars, and so on.
This behavior can be changed by passing a y value together with a histfunc. This tells plotly to do something other than count the number of occurrences.
In the below example we will use the average of the tip column as the y-axis instead:
fig = px.histogram(df, x="total_bill", y="tip", histfunc='avg')
fig.show()
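Conceptually, histfunc='avg' bins the x values and then averages the y values that fall into each bin. The pandas sketch below reproduces that idea on a small made-up dataframe (the data and the 10-dollar bin edges are assumptions for illustration; Plotly chooses its own bins automatically):

```python
import pandas as pd

# Made-up bills and tips, only for illustration
toy = pd.DataFrame({"total_bill": [8.0, 9.5, 12.0, 14.0, 18.0, 25.0],
                    "tip":        [1.0, 2.0,  2.0,  3.0,  3.0,  5.0]})

# Bin total_bill into 10-dollar intervals: [0, 10), [10, 20), [20, 30)
bins = pd.cut(toy["total_bill"], bins=[0, 10, 20, 30], right=False)

# Average of the tip column inside each total_bill bin
avg_tip = toy.groupby(bins, observed=True)["tip"].mean()
print(avg_tip)
```

Each value in avg_tip corresponds to the height of one bar in a histogram with histfunc='avg'.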
Because the default histfunc is sum whenever a y value is passed, we can actually use this to get around our earlier problem with the striped bar plots without calculating the values beforehand! How handy!
Switch px.bar for px.histogram and pass a y-value:
fig = px.histogram(df, x="sex", y="total_bill",
color='time')
fig.show()
Compare to what we got before:
fig = px.bar(df, x="sex", y="total_bill", color='time')
fig.show()
Exercise 3: Bar charts and histograms¶
- Use the gapminder data for Oceania and show the GDP (the gdpPercap column) for each year in a bar plot.
- Now separate the bars into countries and put them next to each other instead of stacked on top of each other.
- Have a look at the dataframe created below. What does it contain?
df = gapminder_data.loc[(gapminder_data["continent"] == 'Europe') & (gapminder_data['year'].isin([1987,2007]))]
df.head()
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
19 | Albania | Europe | 1987 | 72.000 | 3075321 | 3738.932735 | ALB | 8 |
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
79 | Austria | Europe | 1987 | 74.940 | 7578903 | 23687.826070 | AUT | 40 |
83 | Austria | Europe | 2007 | 79.829 | 8199783 | 36126.492700 | AUT | 40 |
115 | Belgium | Europe | 1987 | 75.350 | 9870200 | 22525.563080 | BEL | 56 |
Plot a histogram of this dataframe that shows the life expectancy for countries in Europe, colored by the year. Display the count inside the bar.
How many countries had a life expectancy of less than 75 in 1987? How many had a life expectancy of less than 75 in 2007?
Bonus question: Using the tips dataset, create a chart that displays the average total bill depending on the day of the week. You will need to use a histfunc for this.
fig = px.violin(df, y="lifeExp", x="year")
fig.show()
Box plots and violin plots¶
Box plots and violin plots are another nice way of showing data distributions. px.box() and px.violin() share almost all their arguments and can be used interchangeably.
df = px.data.tips()
fig = px.box(df, y="tip", x="smoker", color="sex")
fig.show()
fig = px.violin(df, y="tip", x="smoker", color="sex")
fig.show()
Points argument¶
You can show the underlying data inside the plots by setting the argument points="all", show only outliers with points="outliers", or not show any points with points=False.
fig = px.violin(df, y="total_bill", x="smoker", color="sex", points = "all")
fig.show()
fig = px.box(df, y="total_bill", x="smoker", color="sex", points = False)
fig.show()
Boxplot inside violin¶
You can show a boxplot inside a violin plot using box=True
fig = px.violin(df, y="tip", x="smoker", color="sex", box=True)
fig.show()
Notched boxplot¶
You can add notches to your boxplot using notched=True
fig = px.box(df, y="total_bill", x="smoker", color="sex", points="all", notched=True)
fig.show()
Show mean¶
We can show the mean in our boxplots by updating our traces with boxmean=True, and in our violin plots with meanline_visible=True.
fig = px.box(df, y="total_bill", x="smoker", color="sex", points="all", notched=True)
fig.update_traces(boxmean=True)
fig.show()
fig = px.violin(df, y="total_bill", x="smoker", color="sex", points="all", box=True)
fig.update_traces(meanline_visible=True)
fig.show()
Exercise 4: Boxplots and violin plots¶
Again, using the following dataframe:
df = gapminder_data.loc[(gapminder_data["continent"] == 'Europe') & (gapminder_data['year'].isin([1987,2007]))]
df.head()
| country | continent | year | lifeExp | pop | gdpPercap | iso_alpha | iso_num |
---|---|---|---|---|---|---|---|---|
19 | Albania | Europe | 1987 | 72.000 | 3075321 | 3738.932735 | ALB | 8 |
23 | Albania | Europe | 2007 | 76.423 | 3600523 | 5937.029526 | ALB | 8 |
79 | Austria | Europe | 1987 | 74.940 | 7578903 | 23687.826070 | AUT | 40 |
83 | Austria | Europe | 2007 | 79.829 | 8199783 | 36126.492700 | AUT | 40 |
115 | Belgium | Europe | 1987 | 75.350 | 9870200 | 22525.563080 | BEL | 56 |
- Make a boxplot of life expectancy versus the year.
- Now do the same as a violin plot.
Which one do you prefer as a visualization and why?
Heatmaps¶
The px.imshow() function can be used to display heatmaps (as well as full-color images, as its name suggests). It accepts array-like objects such as lists of lists, as well as pandas.DataFrame objects. Heatmaps are particularly useful to display correlations between the variables of the data.
We can use corr to see how much the variables in the tips data set are correlated with each other. Correlation can only be calculated on numerical columns, so we pass numeric_only=True to skip the text columns (recent pandas versions raise an error otherwise).
df = px.data.tips()
df.corr(numeric_only=True)
| total_bill | tip | size |
---|---|---|---|
total_bill | 1.000000 | 0.675734 | 0.598315 |
tip | 0.675734 | 1.000000 | 0.489299 |
size | 0.598315 | 0.489299 | 1.000000 |
px.imshow(df.corr(numeric_only=True), text_auto=True)
We can modify the color scale using the argument color_continuous_scale:
px.imshow(df.corr(numeric_only=True), text_auto=True, color_continuous_scale='RdBu_r')
We can also explicitly set the range of the color scale using the range_color argument.
px.imshow(df.corr(numeric_only=True), text_auto='.2f',
color_continuous_scale='RdBu_r', range_color=[-1,1])
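Since correlation is only defined for numerical columns, another way to be explicit about which columns take part is to select them before calling corr. A minimal sketch on a small made-up dataframe (the data are invented for illustration):

```python
import pandas as pd

# Made-up data with three numerical columns and one text column
toy = pd.DataFrame({"total_bill": [10.0, 20.0, 30.0, 40.0],
                    "tip": [1.0, 3.0, 4.0, 6.0],
                    "size": [1, 2, 2, 4],
                    "day": ["Sat", "Sun", "Sat", "Thur"]})

# Keep only the numerical columns, then compute pairwise correlations
numeric = toy.select_dtypes(include="number")
corr = numeric.corr()
print(corr.round(2))
```

The resulting square dataframe can be passed straight to px.imshow(), just like the correlation matrix above.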
Exercise 5: Heatmaps¶
Extract info for the continent Europe from the gapminder dataset and calculate the correlation between columns. Plot the result in a heatmap. What do you observe? Are the correlations as you expected?
Change the color scheme to something you find pleasing and add the correlation values to the squares.
Now, do the same for Africa. What do you observe? Are you surprised?
Advanced plotting¶
Facet_row and facet_col arguments¶
Another cool thing we can do in many types of plots is to split the chart into rows or columns depending on a variable. For example, we can divide the information of life expectancy into different plots using the variable "country"
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
facet_col ="country",
text="lifeExp",
title="Life expectancy per year")
fig.update_traces(textposition="top center")
fig.show()
Plot marginals¶
In scatter and histogram plots, you can add extra plots on the margins of your plot (called plot marginals), for instance "histogram", "rug", "box", or "violin" plots. In scatter plots these can be added easily with the arguments marginal_x and marginal_y (px.histogram uses a single marginal argument instead).
df = px.data.iris()
df.head()
| sepal_length | sepal_width | petal_length | petal_width | species | species_id |
---|---|---|---|---|---|---|
0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa | 1 |
1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa | 1 |
2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa | 1 |
3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa | 1 |
4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa | 1 |
fig = px.scatter(df,
x="sepal_length",
y="sepal_width",
color="species",
marginal_x="box",
marginal_y="violin",
size='petal_width',
hover_name="species")
fig.show()
Exercise 6: Marginals and facets¶
- Can you get a scatter plot with a histogram instead of a rug distribution plot?
- Divide the previous plot using the species variable
Error argument¶
In scatter, line and bar plots we can show error bar information, such as confidence intervals or measurement errors, using the error arguments. You can choose between displaying the error along the y or x axis (error_y and error_x, respectively).
Note: You will need another variable that contains such information! Below, we create an error variable for showcasing.
df = px.data.gapminder().query("continent=='Oceania'").copy() # .copy() avoids a possible SettingWithCopyWarning when adding a column
df['e'] = df["lifeExp"]/100 # We create an error variable just to showcase
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
error_y='e',
title="Life expectancy per year")
fig.update_traces(textposition="top left", line = {"dash" : "dot"})
fig.show()
Modifying Tooltips¶
Tooltips are the square popups that appear when you hover the mouse over a data point in the plot. We can modify the behaviour of these:
- hover_name: highlights the value of this column at the top of the tooltip
- hover_data: lets you add or remove tooltip entries by setting them to True/False
- labels: lets you rename the column names inside the tooltip
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df, x="year", y="lifeExp", color='country')
fig.show()
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
hover_name="country",
hover_data = {"country" : False}, # we remove country from the tooltip
labels={"year": "Year"}, # change year for Year
title="Life expectancy per year")
fig.show()
Range Slider and Selector in Python¶
You can use sliders to navigate the range of your axis. This can for instance be very useful when visualizing time-series data. (https://plotly.com/python/reference/layout/xaxis/#layout-xaxis-rangeslider)
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
facet_col ="country",
hover_name="country",
text="lifeExp",
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
line_dash = "country",
title="Life expectancy per year")
fig.update_traces(textposition="top center")
fig.update_xaxes(rangeslider_visible=True)
fig.show()
Exercise 7: Range sliders¶
- Using the gapminder data for Africa, create a scatter plot with a range selector.
- Modify the tool tip so that when you hover over it will provide information about life expectancy, population, GDP and country code.
Changing axis ticks¶
If we do not like the ticks on our axis, we can change them using the method update_xaxes() or update_yaxes(). We specify which texts we would like to show (ticktext) in place of which actual values (tickvals).
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.line(df,
x="year",
y="lifeExp",
color='country',
facet_col ="country",
hover_name="country",
text="lifeExp",
color_discrete_map = {"Australia":"Black", "New Zealand": "Red"},
line_dash = "country",
title="Life expectancy per year")
fig.update_xaxes(
ticktext=["50s", "60s", "70s", "80s", "90s", "00s"],
tickvals=[1950, 1960, 1970, 1980, 1990, 2000], # numeric, to match the numeric year axis
)
fig.show()
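For longer axes, typing the tick labels by hand gets tedious. The two lists can also be generated; here is a minimal sketch in plain Python that reproduces the decade labels used above:

```python
# Tick positions: the first year of each decade from 1950 to 2000
tickvals = list(range(1950, 2010, 10))

# Tick labels: the last two digits of the year followed by "s"
ticktext = [f"{year % 100:02d}s" for year in tickvals]
print(ticktext)
```

The lists can then be passed on with fig.update_xaxes(tickvals=tickvals, ticktext=ticktext).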
Animating your plot¶
Several Plotly Express functions support the creation of animated figures through the animation_frame
and animation_group
arguments (https://plotly.com/python/animations/).
In order to make the animation look nicer, we will use the orientation argument to make the plot horizontal. In addition, the variable gdpPercap has too many decimals. We can change the look of the text value by again using the update_traces() method, this time with a texttemplate that displays only 2 decimals.
df = px.data.gapminder().query("continent=='Oceania'")
fig = px.bar(df,
y="country",
x="gdpPercap",
color="country",
orientation="h",
animation_frame="year",
animation_group="country",
title="Evolution of GDP",
text="gdpPercap", range_x=[5000, 40000])
fig.update_traces(texttemplate='%{text:.2f}')
fig.show()
Exercise 8: Animations¶
Recreate the above animation for data from Africa, but show the development of life expectancy over time and the GDP as text inside the bars. Remember to separate the countries.