Pandas DataFrame Histogram

A histogram is a chart commonly used to show how numerical data are distributed. When investigating a dataset, you’ll often want to quickly understand how specific numerical variables are distributed. A histogram serves exactly this purpose.

A histogram divides the values of a numerical variable into “bins” and counts the observations that fall into each bin. By representing these binned counts visually as columns, we can quickly and easily understand the distribution of values within a variable.

There are several ways to create a histogram plot in pandas. We’ll explore the first option, the .hist() method, before moving on.

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre># df is any pandas DataFrame; each call below draws a histogram
df.hist(column='your_data_column')
df.plot(kind='hist')
df.plot.hist()</pre>
</div>

This method is frequently used when you need to display a lot of data. Pandas will display one histogram for each numeric column you pass to .hist().

Pseudo code: Create a histogram depicting the distribution of data points for each column in the DataFrame.
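As a minimal sketch of the three calls on a real object, the snippet below builds a small made-up DataFrame (the name demo_df and its columns a and b are assumptions for illustration, not part of the original example) and shows how the results differ. The Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Hypothetical demo data: two numeric columns of random values
rng = np.random.default_rng(0)
demo_df = pd.DataFrame({"a": rng.normal(size=200), "b": rng.uniform(size=200)})

# DataFrame.hist() draws one subplot per numeric column
# and returns a 2-D array of matplotlib Axes
axes_grid = demo_df.hist()

# DataFrame.plot(kind='hist') and DataFrame.plot.hist() are equivalent:
# each overlays all numeric columns on a single Axes
single_ax_1 = demo_df.plot(kind="hist", alpha=0.5)
single_ax_2 = demo_df.plot.hist(alpha=0.5)

print(axes_grid.size)  # number of subplots created by demo_df.hist()
```

In short, .hist() gives each column its own subplot, while the two .plot variants share one set of axes, which is why alpha is lowered to keep the overlapping bars readable.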

Histogram of pandas

The built-in .hist() method will meet most of your requirements. The real magic happens once you customize its parameters, especially the bins parameter.

Bins are the categories into which your histogram is divided. Behind the scenes, pandas organizes your data into bins, or buckets, counts how many values fall into each bucket, and plots the result. Another way to think about bins is as the number of bars you want in your histogram: many or few?
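To make the counting step concrete, here is a small sketch that bins a handful of made-up values with NumPy's histogram function. It is only an illustration of the binning idea; the values list is invented, and pandas leaves the actual drawing to matplotlib.

```python
import numpy as np

# Seven made-up observations
values = [1, 2, 2, 3, 5, 8, 9]

# Split the range 1..9 into three equal-width bins and count the values in each
counts, edges = np.histogram(values, bins=3)

print(edges)   # four bin edges, evenly spaced from 1 to 9
print(counts)  # [4 1 2]
```

The first bin holds 1, 2, 2 and 3, the second holds only 5, and the third holds 8 and 9, so the histogram would show three bars of height 4, 1 and 2.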

Parameters for a histogram

Before we discuss the histogram-specific parameters, remember that pandas charts inherit other parameters from the main pandas plot function. These additional settings control the basic chart layout rather than histogram-specific properties. To get the full flexibility of the chart, we advise reviewing them as well.

  • Column: The specific column or columns you want to draw a histogram for. If you omit it, pandas automatically creates a chart for every numeric series in your dataset.
  • By: With this parameter, your data is divided into groups, and a chart is created for each group. See the example below, where we split on a different column.
  • Bins: The number of bars you want in your chart, or equivalently the number of buckets used to organize your data, given as either a scalar or a list. If you pass a list rather than a scalar, pandas creates bins whose edges are the values in your list.
  • Formatting options: You can further customize the appearance of your chart with a number of extra formatting options. We recommend looking them up on the official pandas hist documentation page.
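The two forms of the bins parameter described in the list above can be sketched as follows. The scores DataFrame and its values are hypothetical, chosen only so the resulting bar counts are easy to check.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import pandas as pd

# Hypothetical data: eight made-up scores
scores = pd.DataFrame({"score": [3, 7, 12, 18, 25, 31, 44, 47]})

# Scalar bins: four equal-width bars spanning the data range.
# hist() returns a 2-D array of Axes, so [0, 0] picks the single subplot.
ax_scalar = scores.hist(column="score", bins=4)[0, 0]

# List bins: the values are used as bin edges, giving the
# three bins 0-10, 10-20 and 20-50
ax_edges = scores.hist(column="score", bins=[0, 10, 20, 50])[0, 0]

# Each bar is a rectangle patch on the Axes
print(len(ax_scalar.patches), len(ax_edges.patches))
```

A list of edges is useful when equal-width bins would hide the structure you care about, for example when most values are small and a few are large.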

Here’s an example to cement these concepts.

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre># Import pandas as pd
import pandas as pd
# Import NumPy as np
import numpy as np
# Import matplotlib's pyplot module as plt
import matplotlib.pyplot as plt

# Fix the random seed so the synthetic data is reproducible
np.random.seed(seed=62)
data_points = 1000

# Build a DataFrame with one categorical column and two beta-distributed numeric columns
data_frame = pd.DataFrame(data=list(zip(np.random.choice(["DBM", "OS"], size=data_points),
                                np.random.beta(35, 16, size=data_points),
                                np.random.beta(50, 10, size=data_points))),
            columns=['Subjects', 'Rat 1', 'Rat 2'])

# Preview the first five rows
data_frame.head()
</pre>
</div>
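The example above builds the data but stops short of plotting it. A minimal sketch of the next step might look like this; it rebuilds the same synthetic frame (using a dict for brevity) so that it runs on its own, and the choice of bins=20 is an arbitrary assumption.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

# Rebuild the synthetic frame from the example above
np.random.seed(seed=62)
data_points = 1000
data_frame = pd.DataFrame({
    "Subjects": np.random.choice(["DBM", "OS"], size=data_points),
    "Rat 1": np.random.beta(35, 16, size=data_points),
    "Rat 2": np.random.beta(50, 10, size=data_points),
})

# One histogram per numeric column ('Rat 1' and 'Rat 2');
# the non-numeric 'Subjects' column is skipped
all_axes = data_frame.hist(bins=20)

# One histogram of 'Rat 1' for each group in 'Subjects'
grouped_axes = data_frame.hist(column="Rat 1", by="Subjects", bins=20)

print(np.ravel(all_axes).size, np.ravel(grouped_axes).size)
```

The first call illustrates the default behaviour of plotting every numeric series, and the second shows the column and by parameters working together.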

Our second approach is a recipe that walks you through the steps of using Python to create a histogram. You will use the pandas hist() method, which is a wrapper for the matplotlib.pyplot API.

In our example, you’ll visualize the distribution of session lengths for a website. This recipe consists of the following steps:

  • Data Wrangling
  • Exploration & Preparation of Data
  • Visualization of data

In this sample Mode report, you can see examples of all the stages listed below. Let’s get going.

Wrangling of Data

To get the data you’ll need for this analysis, use SQL. This example uses the sessions dataset from Mode’s Public Data Warehouse. Using the schema browser in the editor, make sure that your data source is set to the Mode Public Warehouse, then run the following query to wrangle your data:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>select * from modeanalytics.sessions</pre>
</div>

Once your SQL query has finished running, rename it to Sessions so that you can identify it in the Python notebook. You can do this by clicking the three dots next to “Query 1” in the editor toolbar and selecting “Rename.”

Exploration & Preparation of Data

Once you’ve wrangled your data, head over to the Python notebook to start preparing it for visualization. Begin by importing the Python modules that you will use for the remainder of this recipe:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre># Import NumPy as np
import numpy as np
# Import pandas as pd
import pandas as pd
# Import matplotlib's pyplot module as plt
import matplotlib.pyplot as plt
# Import StrMethodFormatter to format axis tick labels
from matplotlib.ticker import StrMethodFormatter</pre>
</div>

Mode automatically pipes the output of your SQL queries into a pandas dataframe assigned to the variable datasets. Use the following Python code to access the results of your SQL query as a dataframe and assign them to a new variable:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_frame = datasets['Sessions']</pre>
</div>

You can use the dataframe’s shape attribute to get the dimensions of your dataset:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_frame.shape</pre>
</div>

Accessing a dataframe’s shape attribute returns a tuple containing the dimensions (rows x columns) of the dataframe. The sessions dataset in our example has 65,499 rows (sessions) and five columns. By accessing the dtypes attribute, you can inspect the data types of the variables in your dataset:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_frame.dtypes</pre>
</div>

Accessing a dataframe’s dtypes attribute returns the data type of each of its variables. As you can see in our case, pandas correctly inferred the data types of some variables but left others as the object data type. You can manually cast these variables to more suitable data types:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre># Cast columns to more suitable data types
data_frame['created_at'] = data_frame['created_at'].astype('datetime64[ns]')
data_frame['user_type'] = data_frame['user_type'].astype('category')

# Show new data types
data_frame.dtypes</pre>
</div>
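The datasets variable exists only inside Mode, so the steps above cannot be run elsewhere as written. As a rough sketch for following along outside Mode, the snippet below builds a tiny stand-in frame named sessions. The three column names come from this recipe, but every value (including the user types "new" and "returning") is invented for illustration. It also uses pd.to_datetime for the date column, which is a common alternative to the astype call shown above.

```python
import pandas as pd

# Hypothetical stand-in for datasets['Sessions']; all values are invented
sessions = pd.DataFrame({
    "created_at": ["2024-01-01 09:00:00", "2024-01-01 09:05:00", "2024-01-02 14:30:00"],
    "user_type": ["new", "returning", "new"],
    "session_duration_seconds": [35.0, 410.5, 122.0],
})

# shape is a (rows, columns) tuple
print(sessions.shape)  # (3, 3)

# Before casting, the text columns have a generic string/object dtype
print(sessions.dtypes)

# Cast to more suitable types
sessions["created_at"] = pd.to_datetime(sessions["created_at"])
sessions["user_type"] = sessions["user_type"].astype("category")

# After casting: a datetime column and a categorical column
print(sessions.dtypes)
```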

You now have your dataset ready, and we can begin visualizing the data.

Visualization of data

Pandas hist()

We will produce a histogram using the pandas hist() method. Calling hist() on a pandas dataframe draws a histogram for every numeric series in it.

Column: the particular column or columns that you wish to make a histogram of.

By: the variable used to divide your data into groups; this produces a separate histogram for each group.

Bins: the number of buckets your data is divided into, which is also the number of bars in your histogram. If not given, pandas uses 10 bins.

Because you are specifically interested in the distribution of the session_duration_seconds variable, pass that column name to the hist() method’s column argument. This restricts the visualization to the variable of interest:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_frame.hist(column='session_duration_seconds')</pre>
</div>

By giving the hist() method other parameters and using the styling capabilities of matplotlib, you may further alter the appearance of your histogram:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_var = data_frame.hist(column='session_duration_seconds', bins=25, grid=False, figsize=(12,8), color='#86bf91', zorder=2, rwidth=0.9)

# hist() returns a 2-D array of Axes; take the first row
data_var = data_var[0]
for i in data_var:

    # Despine
    i.spines['right'].set_visible(False)
    i.spines['top'].set_visible(False)
    i.spines['left'].set_visible(False)

    # Switch off tick marks but keep the tick labels (use booleans rather than the strings "on"/"off")
    i.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

    # Draw horizontal axis lines
    vals_var = i.get_yticks()
    for v in vals_var:
        i.axhline(y=v, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

    # Remove title
    i.set_title("")

    # Set x-axis label
    i.set_xlabel("Session Duration (Seconds)", labelpad=20, weight='bold', size=12)

    # Set y-axis label
    i.set_ylabel("Sessions", labelpad=20, weight='bold', size=12)

    # Format y-axis tick labels with thousands separators
    i.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))</pre>
</div>

You can also make distinct subplots for different groups of data by supplying a column to the by argument of the pandas hist() method. For instance, you can make a separate histogram for each user type by passing the user_type column to the by parameter:

<div class="wp-block-codemirror-blocks-code-block code-block">
<pre>data_var = data_frame.hist(column='session_duration_seconds', by='user_type', bins=25, grid=False, figsize=(8,10), layout=(3,1), sharex=True, color='#86bf91', zorder=2, rwidth=0.9)

for var_i,var_x in enumerate(data_var):

    # Despine
    var_x.spines['right'].set_visible(False)
    var_x.spines['top'].set_visible(False)
    var_x.spines['left'].set_visible(False)

    # Switch off tick marks but keep the tick labels (use booleans rather than the strings "on"/"off")
    var_x.tick_params(axis="both", which="both", bottom=False, top=False, labelbottom=True, left=False, right=False, labelleft=True)

    # Draw horizontal axis lines
    vals = var_x.get_yticks()
    for v in vals:
        var_x.axhline(y=v, linestyle='dashed', alpha=0.4, color='#eeeeee', zorder=1)

    # Set x-axis label
    var_x.set_xlabel("Session Duration (Seconds)", labelpad=20, weight='bold', size=12)

    # Set y-axis label on the middle subplot only
    if var_i == 1:
        var_x.set_ylabel("Sessions", labelpad=50, weight='bold', size=12)

    # Format y-axis tick labels with thousands separators
    var_x.yaxis.set_major_formatter(StrMethodFormatter('{x:,g}'))

    # Keep x-axis tick labels horizontal
    var_x.tick_params(axis='x', rotation=0)</pre>
</div>

Conclusion

Histograms are essential for understanding the distribution of values within a data series. The pandas hist() method offers a simple way to create one directly from your data. A histogram needs only one dimension of data: it displays how many values fall into each value or range of values in your series.

Called on your DataFrame, DataFrame.hist() produces a histogram that displays the distribution of values within your series. You can get started with the default values, but many customization options are available.
