Seaborn is an API built on top of Matplotlib. It provides sensible defaults for plot style and color, defines simple high-level functions for common statistical plot types, and integrates with the functionality offered by Pandas DataFrames.
Although Matplotlib has proven to be a highly effective and well-liked visualization tool, even ardent users will concede that it frequently falls short of expectations. There are some legitimate criticisms of Matplotlib that frequently surface.
Seaborn can be a solution to these issues.
The API for Matplotlib is relatively low level. Complex statistical visualization is possible, but it frequently requires a lot of boilerplate code.
Matplotlib was created more than a decade before Pandas, so it was not designed for use with Pandas DataFrames. As a result, you must extract each Series from a DataFrame and frequently concatenate them into the right format before visualizing the data. A plotting library that can intelligently use DataFrame labels in a plot would be preferable.
To be fair, the Matplotlib team has been addressing this: the plt.style tools covered in Customizing Matplotlib: Configurations and Style Sheets were added, Pandas data is handled more naturally, and the 2.0 release of the library introduced a new default stylesheet that improved on the earlier one. But Seaborn still stands as a very helpful add-on for all the reasons mentioned above.
Abstraction of the API across visualizations
There is no single, ideal method for visualizing data. Different plots work best for different questions. Seaborn allows switching between various visual representations simply by using a consistent dataset-oriented API.
The relplot() function is so named because it is designed to visualize many different statistical relationships. Scatter plots are frequently helpful, but lines are preferable for relationships where one variable represents a measure of time. The relplot() function has a handy kind argument that makes switching to this alternative representation simple:
import seaborn as sns

dots = sns.load_dataset("dots")
sns.relplot(
    data=dots, kind="line",
    x="time", y="firing_rate", col="align",
    hue="choice", size="coherence", style="choice",
    facet_kws=dict(sharex=False),
)
The size and style parameters are used in both the scatter and line plots, but they affect the two visualizations differently: they change the marker area and symbol in the scatter plot, and the line width and dashing in the line plot. We didn’t have to think about those details, which allowed us to concentrate on the plot’s overall structure and the information we wanted to convey.
Error bars and statistical estimation
We are frequently interested in the average value of one variable as a function of another. Many seaborn routines will execute the statistical estimation required to answer these questions automatically:
var_fmri = sns.load_dataset("fmri")
sns.relplot(
    data=var_fmri, kind="line",
    x="timepoint", y="signal", col="region",
    hue="event", style="event",
)
When statistical values are estimated, seaborn uses bootstrapping to compute confidence intervals and draws error bars that represent the uncertainty of the estimate. Statistical estimation in seaborn goes beyond descriptive statistics. For example, using lmplot(), you can enhance a scatterplot by including a linear regression model and its uncertainty:
tips = sns.load_dataset("tips")
sns.lmplot(data=tips, x="total_bill", y="tip", col="time", hue="smoker")
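To make the bootstrapping idea concrete, here is a minimal sketch of a percentile bootstrap confidence interval for a mean, written with NumPy only. The function name bootstrap_ci, its parameters, and the sample values are hypothetical and chosen for illustration; this is not seaborn’s internal implementation.

```python
import numpy as np

def bootstrap_ci(values, n_boot=1000, ci=95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    # The interval spans the central `ci` percent of the resampled means.
    half = (100 - ci) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return low, high

# Hypothetical sample, used only for illustration.
sample = np.array([2.1, 2.5, 1.9, 3.2, 2.8, 2.4, 2.2, 3.0])
low, high = bootstrap_ci(sample)
print(f"mean={sample.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The band drawn around a line or bar plays the same role as the (low, high) pair here: a narrower interval means the estimate is less sensitive to which observations happened to be sampled.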
Complex graphics can be created using classes and functions
These tools combine axis-level plotting routines with objects that handle the figure’s layout, linking the datasets’ structure to a grid of axes. Both parts are part of the public API and can be used directly to generate complicated graphs with less code.
penguins = sns.load_dataset("penguins")
var_g = sns.PairGrid(penguins, hue="species", corner=True)
var_g.map_lower(sns.kdeplot, hue=None, levels=5, color=".3")
var_g.map_lower(sns.scatterplot, marker="+")
var_g.map_diag(sns.histplot, element="step", linewidth=0, kde=True)
var_g.add_legend(frameon=True)
var_g.legend.set_bbox_to_anchor((.61, .6))
Composite views onto multivariate datasets
Some seaborn functions combine several kinds of plots to give a concise summary of a dataset. The jointplot() function focuses on a single relationship. It shows the joint distribution of two variables as well as the marginal distribution of each variable:
pen_var = sns.load_dataset("penguins")
sns.jointplot(data=pen_var, x="flipper_length_mm", y="bill_length_mm", hue="species")
Categorical data visualizations with specialized plots
In seaborn, several specialized plot types are geared toward showing categorical data. They are accessible via catplot(). These plots offer varying degrees of granularity. At the finest level, you might want to see every observation by drawing a “swarm” plot: a scatter plot that adjusts the positions of the points along the categorical axis so that they don’t overlap:
sns.catplot(data=tips, kind="swarm", x="day", y="total_bill", hue="smoker")
You may also use kernel density estimation to describe the underlying distribution from which the points are sampled:
sns.catplot(data=tips, kind="violin", x="day", y="total_bill", hue="smoker", split=True)
Alternatively, you could merely display the mean value and its confidence interval within each nested category:
sns.catplot(data=tips, kind="bar", x="day", y="total_bill", hue="smoker")
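The bar heights in that last plot are per-category means. As a rough sketch of the aggregation involved, the same numbers can be computed with a Pandas groupby. The toy_tips frame below is hypothetical and only mimics the column names of the tips dataset.

```python
import pandas as pd

# Hypothetical records standing in for the tips dataset.
toy_tips = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri"],
    "smoker": ["Yes", "No", "No", "Yes", "Yes", "No"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 18.0, 40.0],
})

# One mean per (day, smoker) cell: the height of each bar.
cell_means = toy_tips.groupby(["day", "smoker"])["total_bill"].mean()
print(cell_means)
```

Seaborn performs this kind of aggregation internally and adds the confidence interval on top, so the plotting call needs no explicit groupby.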
Exploring Seaborn Plots
The basic concept behind Seaborn is that it gives users access to high-level commands for producing a wide range of plot kinds that are beneficial for statistical data analysis and even some model fitting.
Let’s examine a couple of the datasets and plot types that Seaborn offers. It should be noted that all of the following could be accomplished using specific Matplotlib commands (in fact, this is what Seaborn does), but the Seaborn API is much more user-friendly.
Densities, KDE, and histograms
In statistical data visualization, histograms and joint distributions of variables are frequently all you need to plot. We’ve seen that using Matplotlib makes this rather simple:
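import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

var_data = np.random.multivariate_normal([0, 0], [[5, 2], [2, 2]], size=2400)
var_data = pd.DataFrame(var_data, columns=['x', 'y'])

for i in 'xy':
    # density=True normalizes each histogram so that its area is 1
    plt.hist(var_data[i], density=True, alpha=0.7)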
We can obtain a smooth approximation of the distribution instead of a histogram by using a kernel density estimation, as Seaborn does with sns.kdeplot:
for i in 'xy':
    sns.kdeplot(var_data[i], fill=True)
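For intuition about what a kernel density estimate is, here is a minimal NumPy sketch that averages Gaussian bumps centered on each sample. The helper gaussian_kde_1d and the fixed bandwidth of 0.3 are assumptions made for illustration; seaborn chooses a bandwidth for you, and this sketch does not reproduce its internals.

```python
import numpy as np

def gaussian_kde_1d(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on `grid`."""
    samples = np.asarray(samples, dtype=float)
    # Standardised distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
samples = rng.normal(loc=0.0, scale=1.0, size=500)
grid = np.linspace(-6, 6, 601)
density = gaussian_kde_1d(samples, grid, bandwidth=0.3)

# A density should integrate to roughly one over a wide enough grid.
area = np.sum(density) * (grid[1] - grid[0])
print(round(area, 3))
```

A smaller bandwidth gives a bumpier curve that follows individual samples, while a larger one gives a smoother curve that may blur real structure.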
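A histogram and a KDE can be combined with histplot by passing kde=True (older Seaborn releases used distplot for this, which is now deprecated):
sns.histplot(var_data['x'], kde=True, stat='density')
sns.histplot(var_data['y'], kde=True, stat='density');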
We obtain a two-dimensional representation of the data if we pass both columns to kdeplot as x and y:
sns.kdeplot(data=var_data, x='x', y='y');
Using sns.jointplot, we can view the joint distribution and the marginal distributions together. For this plot, we’ll set the style to a white background:
with sns.axes_style('white'):
    sns.jointplot(data=var_data, x="x", y="y", kind='kde');
There are additional arguments that can be passed to jointplot; for instance, we can use a hexagonally binned histogram instead:
with sns.axes_style('white'):
    sns.jointplot(data=var_data, x="x", y="y", kind='hex')
Plots in Pairs
Pair plots result from generalizing joint plots to datasets with more dimensions. They are useful for exploring correlations in multidimensional data because they let you plot all pairs of values against one another. We’ll demonstrate this with the well-known Iris dataset, which contains measurements of the petals and sepals of three different iris species:
var_iris = sns.load_dataset("iris")
var_iris.head()
Visualizing the multidimensional relationships between the samples is simple by calling sns.pairplot.
sns.pairplot(var_iris, hue='species', height=2.5);
Histograms with facets
Histograms of subsets can sometimes be the most effective way to view data. Seaborn’s FacetGrid makes this straightforward. We’ll look at some data showing how much restaurant staff receive in tips, based on various indicator data:
var_tips = sns.load_dataset('tips')
var_tips.head()

var_tips['tip_pct'] = 100 * var_tips['tip'] / var_tips['total_bill']

var_grid = sns.FacetGrid(var_tips, row="sex", col="time", margin_titles=True)
var_grid.map(plt.hist, "tip_pct", bins=np.linspace(0, 40, 15));
Factor plots
Categorical plots, called factor plots in older Seaborn releases and drawn with sns.catplot, are helpful for this kind of visualization as well. They let you see how one parameter is distributed within bins defined by any other parameter:
with sns.axes_style(style='ticks'):
    var_g = sns.catplot(data=var_tips, x="day", y="total_bill", hue="sex", kind="box")
    var_g.set_axis_labels("Day", "Total Bill");
Joint distributions
We can use sns.jointplot to display the joint distribution between various datasets and the corresponding marginal distributions in a manner similar to the pairplot we saw earlier:
with sns.axes_style('white'):
    sns.jointplot(data=var_tips, x="total_bill", y="tip", kind='hex')
Even automatic kernel density estimation and regression can be performed using the joint plot:
sns.jointplot(data=var_tips, x="total_bill", y="tip", kind='reg');
Bar graphs
Time series can be plotted with sns.catplot. The example that follows uses the Planets data that we first saw in Aggregation and Grouping:
var_planets = sns.load_dataset('planets')
var_planets.head()

with sns.axes_style('white'):
    var_g = sns.catplot(data=var_planets, x="year", aspect=2,
                        kind="count", color='steelblue')
    var_g.set_xticklabels(step=5)
We can learn more by looking at the method used to discover each of these planets:
with sns.axes_style('white'):
    g = sns.catplot(data=var_planets, x="year", aspect=4.0, kind='count',
                    hue='method', order=range(2001, 2015))
    g.set_ylabels('Number of Planets Discovered')
Opinionated defaults and customizable options
Seaborn generates comprehensive visuals with a single function call, adding proper axis labels and legends that explain the semantic mappings in the plot when possible.
In many circumstances, seaborn will also select default values for specific parameters based on data characteristics. For example, the color mappings we’ve seen thus far employed different hues (blue, orange, and occasionally green) to represent varying levels of the categorical variables assigned to the colors. Some functions will use a continuous gradient when mapping a numeric variable:
sns.relplot( data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="body_mass_g" )
When you are ready to publish or share your work, you will probably want to polish the figure beyond what the defaults achieve.
Seaborn offers several levels of customization. It provides several built-in themes that apply to all figures, and its functions have standardized parameters that can change the semantic mappings for each plot. Extra keyword arguments are passed down to the underlying matplotlib artists, giving even more control. Once you’ve built a plot, you can modify its properties both through the seaborn API and by dropping down to the matplotlib layer for fine-grained tweaking:
sns.set_theme(style="ticks", font_scale=1.25)
var_g = sns.relplot(
    data=penguins,
    x="bill_length_mm", y="bill_depth_mm", hue="body_mass_g",
    palette="crest", marker="x", s=100,
)
var_g.set_axis_labels("Bill length (mm)", "Bill depth (mm)", labelpad=10)
var_g.legend.set_title("Body mass (g)")
var_g.figure.set_size_inches(6.5, 4.5)
var_g.ax.margins(.15)
var_g.despine(trim=True)
Example: Exploring Marathon Finishing Times
In this part, we use Seaborn to visualize and understand the finishing results from a marathon.
The source of the data is: https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
Fortunately, the data has already been aggregated and the identifying details removed. We start by downloading the data from the link above using curl.
# !curl -O https://raw.githubusercontent.com/jakevdp/marathon-data/master/marathon-data.csv
Subsequently, we will load the data into Pandas.
var_data = pd.read_csv('marathon-data.csv')
var_data.head()
Pandas loaded the time columns as Python strings (dtype object). We can confirm this quickly by looking at the DataFrame’s dtypes.
var_data.dtypes
Let’s fix this by offering a time converter:
def convert_time(s):
    # Turn an 'HH:MM:SS' string into a Timedelta
    h, m, s = map(int, s.split(':'))
    return pd.Timedelta(hours=h, minutes=m, seconds=s)

var_data = pd.read_csv('marathon-data.csv',
                       converters={'split': convert_time, 'final': convert_time})
var_data.head()
var_data.dtypes
That appears a lot better. Next, let’s add columns that provide the times in seconds for our Seaborn plotting tools:
var_data['split_sec'] = pd.to_timedelta(var_data['split']).dt.total_seconds()
var_data['final_sec'] = pd.to_timedelta(var_data['final']).dt.total_seconds()
var_data.head()
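The string-to-seconds conversion can be checked on a tiny frame without downloading anything. The sketch below assumes the HH:MM:SS layout used in the CSV; the toy frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical finishing times in the same HH:MM:SS layout as the CSV.
toy = pd.DataFrame({"split": ["01:05:38", "01:06:26"],
                    "final": ["02:08:51", "02:09:28"]})

for col in ["split", "final"]:
    # Parse the strings into timedeltas, then express them in seconds.
    toy[col] = pd.to_timedelta(toy[col])
    toy[col + "_sec"] = toy[col].dt.total_seconds()

print(toy.dtypes)
print(toy[["split_sec", "final_sec"]])
```

Working in plain seconds keeps the columns numeric, which is what the plotting functions expect on their axes.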
We may plot a jointplot over the data to get a sense of how the data appears:
with sns.axes_style('white'):
    var_g = sns.jointplot(data=var_data, x="split_sec", y="final_sec", kind='hex')
    # Dotted reference line: final time equal to twice the split time
    var_g.ax_joint.plot(np.linspace(4000, 16000),
                        np.linspace(8000, 32000), ':k')
The dotted line shows where a person’s time would fall if they ran the marathon at a perfectly steady pace. The fact that the distribution lies above this line suggests that most runners slow down over the course of the marathon. If you’ve ever run competitively, you’ll know that those who do the opposite and run faster during the second half are said to have “negative-split” the race.
Let’s add a new column to the data called the split fraction, which measures the degree to which each runner negative-splits or positive-splits the race:
var_data['split_frac'] = 1 - 2 * var_data['split_sec'] / var_data['final_sec']
var_data.head()
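A worked example on three made-up runners shows how the sign of the split fraction should be read. The runners frame and its numbers are hypothetical.

```python
import pandas as pd

# Hypothetical runners: half-way time and finishing time, in seconds.
runners = pd.DataFrame({
    "split_sec": [5000.0, 4500.0, 5200.0],
    "final_sec": [10000.0, 10000.0, 10000.0],
})

# Same formula as above: 0 means even halves, > 0 means the second half
# was slower (positive split), < 0 means it was faster (negative split).
runners["split_frac"] = 1 - 2 * runners["split_sec"] / runners["final_sec"]
n_negative = int((runners["split_frac"] < 0).sum())
print(runners)
print("negative splits:", n_negative)
```

Only the third runner, whose first half took more than half of the total time, ends up with a negative value.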
Where this split fraction is less than zero, the person negative-split the race by that fraction. Let’s plot the distribution of this split fraction and count the negative splits:
sns.histplot(var_data['split_frac'], kde=False)
plt.axvline(0, color="k", linestyle="--")

sum(var_data.split_frac < 0)
Only 250 of the almost 40,000 participants negative-split their marathon. Let’s investigate whether this split fraction is related to the other variables. We’ll do this by plotting all of these relationships with a PairGrid:
var_g = sns.PairGrid(var_data, vars=['age', 'split_sec', 'final_sec', 'split_frac'],
                     hue='gender', palette='RdBu_r')
var_g.map(plt.scatter, alpha=0.8)
var_g.add_legend();
It appears that the split fraction does not correlate strongly with age but does correlate with the final time: faster runners tend to have closer to even splits. This example also shows that Seaborn does not solve all of Matplotlib’s plot style issues, notably the overlapping x-axis labels.
However, because the output is an ordinary Matplotlib plot, the methods under Customizing Ticks can be used to adjust such things if needed. The difference between men and women here is interesting. Let’s look at the distribution of split fractions for these two groups:
sns.kdeplot(var_data.split_frac[var_data.gender == 'M'], label='men', fill=True)
sns.kdeplot(var_data.split_frac[var_data.gender == 'W'], label='women', fill=True)
plt.xlabel('split_frac');
What is notable here is that many more men than women run close to an even split. It almost looks like a bimodal distribution among the men and women. By examining the distributions as a function of age, we can try to work out what is happening. A violin plot is a nice way to compare distributions:
sns.violinplot(data=var_data, x="gender", y="split_frac",
               palette=["lightblue", "lightpink"]);
This is another way to compare the distributions of men and women. Let’s look a little deeper and compare these violin plots as a function of age. We’ll begin by adding a new column to the DataFrame that gives the decade of age each person is in:
var_data['age_dec'] = var_data.age.map(lambda age: 10 * (age // 10))
var_data.head()

men = (var_data.gender == 'M')
women = (var_data.gender == 'W')

with sns.axes_style(style=None):
    sns.violinplot(data=var_data, x="age_dec", y="split_frac", hue="gender",
                   split=True, inner="quartile",
                   palette=["lightblue", "lightpink"]);
Looking at this, we can see where the distributions of men and women diverge: the split distributions of men in their 20s to 50s show a pronounced over-density toward lower splits compared with women of the same age (or of any age, for that matter).
Surprisingly, the women in their 80s seem to outperform everyone in terms of split time. This is probably because we are estimating the distribution from small numbers, as there are only a handful of runners in that range:
(var_data.age > 80).sum()
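The decade binning behind age_dec, and the small-sample caveat just mentioned, can both be checked on a toy series. The ages below are hypothetical.

```python
import pandas as pd

# Hypothetical ages, for illustration only.
ages = pd.Series([23, 29, 35, 41, 48, 52, 67, 86], name="age")

# Integer-divide by 10 and scale back up: 23 -> 20, 86 -> 80.
age_dec = 10 * (ages // 10)

# Few runners in a bin means its estimated distribution is unreliable.
counts = age_dec.value_counts().sort_index()
print(counts)
print("older than 80:", int((ages > 80).sum()))
```

Printing the bin counts next to a violin plot is a quick way to judge which violins rest on enough data to be taken seriously.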
Back to the men with negative splits: who are these runners? Does this split fraction correlate with finishing quickly? We can easily plot this. We’ll use lmplot, which automatically fits a linear regression to the data:
var_g = sns.lmplot(data=var_data, x='final_sec', y='split_frac', col='gender',
                   markers=".", scatter_kws=dict(color='c'))
var_g.map(plt.axhline, y=0.1, color="k", ls=":");
It appears that the people with fast splits are the elite runners who finish in less than 15,000 seconds, or around 4 hours. People slower than that are far less likely to have a fast second split.
Example: Visualizing a Pandas DataFrame using Seaborn
The first step involves the importation of the libraries that are necessary for working on the dataset.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import random
The second step is setting up the data. We create an empty DataFrame, fill its x and y columns with random integers drawn by random.sample, and print the first and last rows.
df = pd.DataFrame()
df['x'] = random.sample(range(1, 50), 25)
df['y'] = random.sample(range(1, 100), 25)
print(); print(df.head())
print(); print(df.tail())
The third step is drawing the plots. We will use seaborn for several kinds of plots. Passing fit_reg=False to lmplot gives a plain scatterplot:
sns.lmplot(data=df, x='x', y='y', fit_reg=False)
Next, we will plot a regression line fitting the data as follows:
sns.lmplot(data=df, x='x', y='y', fit_reg=True)
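The regression line is an ordinary least-squares straight-line fit. As a minimal sketch of that idea, the slope and intercept can be computed directly with NumPy. The points below are hypothetical, and np.polyfit is used here as a stand-in rather than as the routine seaborn calls internally.

```python
import numpy as np

# Hypothetical points lying close to y = 2x + 1.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Degree-1 polynomial fit = ordinary least-squares straight line.
slope, intercept = np.polyfit(x, y, deg=1)
predicted = slope * x + intercept
print(f"slope={slope:.2f}, intercept={intercept:.2f}")
```

With the randomly sampled x and y columns used in this example there is no underlying relationship, so the fitted slope will usually be close to zero and will change from run to run.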
Next, we draw density plots for the data: a one-dimensional KDE, a two-dimensional KDE, and a histogram with a KDE overlaid:
sns.kdeplot(df.y)
plt.show()
sns.kdeplot(x=df.y, y=df.x)
plt.show()
sns.histplot(df.x, kde=True)
plt.show()
Below is the code snippet for plotting a histogram based on the data.
plt.hist(df.x, alpha=.3)
sns.rugplot(df.x)
plt.show()
Subsequently, we can use the data to plot a Boxplot as follows:
sns.boxplot(data=df[['y', 'x']])
plt.show()
We can also use the data to draw a violin plot:
sns.violinplot(data=df[['y', 'x']])
plt.show()
In this section, we are using the data to plot a heatmap.
sns.heatmap([df.y, df.x], annot=False, fmt="d")
plt.show()
We will now plot a clustermap for the data:
sns.clustermap(df)
plt.show()
What is the connection to Matplotlib?
Because Seaborn is built on matplotlib, you can use it in all of the environments that matplotlib supports, including exploratory analysis in notebooks, real-time interaction in GUI applications, and archival output in a number of raster and vector formats.
The ideas and API of matplotlib must be understood to fully customize your visualizations, even though you can still be productive with only seaborn functions.
Knowing when to drop down to the matplotlib layer to accomplish a specific modification is part of the learning curve for new Seaborn users. Users moving from matplotlib, however, will discover that much of their knowledge is transferable.
With the help of Matplotlib’s extensive and potent API, you can alter almost every attribute of the figure as you see fit. You can quickly explore your data and produce graphics that can be customized into a finished output of publishing quality by combining the high-level interface of seaborn and the extensive customizability of matplotlib.
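As a small sketch of this hand-off, an axes-level seaborn function can draw onto a matplotlib Axes that you then adjust with ordinary matplotlib calls. The DataFrame, labels, and the choice of the Agg backend are assumptions made so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 4, 5, 4, 6, 7],
    "group": ["a", "a", "a", "b", "b", "b"],
})

fig, ax = plt.subplots(figsize=(5, 3))
# Axes-level seaborn functions draw onto, and return, a matplotlib Axes.
ax = sns.scatterplot(data=df, x="x", y="y", hue="group", ax=ax)

# From here on, ordinary matplotlib calls apply.
ax.set_title("Seaborn plot, matplotlib tweaks")
ax.set_xlabel("x value")
ax.axhline(5, color="k", linestyle=":")
```

Figure-level functions such as relplot return a grid object instead; its individual Axes can be reached through attributes like ax or axes, as in the customization example earlier.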
Conclusion
Seaborn is a Python package for creating statistical graphics. It is built on Matplotlib and integrates closely with Pandas data structures.
You may examine and comprehend your data better with Seaborn. Its charting functions work with dataframes and arrays that include entire datasets, and they internally carry out the semantic mapping and statistical aggregation required to make valuable graphs. Thanks to its dataset-oriented, declarative API, you can concentrate on what the various components of your plots represent rather than the specifics of how to draw them.