Because these disciplines constitute the foundation of all machine learning algorithms, math and statistics are vital for data science. Moreover, every element of our lives is influenced by mathematics. From shapes, patterns, and colors to counting petals in flowers, everything around us is based on mathematics and statistics.
The ability to graphically describe, summarize, and portray data is essential in dealing with data. Python Statistics Libraries are comprehensive, accessible, and widely used tools for working with data.
Statistics Library in Python
Data collection, analysis, interpretation, and presentation are all part of statistics, which is a mathematical science. Thus, data scientists and analysts can look for relevant data trends and changes using statistics, which are used to solve challenging real-world problems. To put it another way, statistics may be utilized to gain essential insights from data by doing mathematical computations.
The statistics module in Python includes utilities for calculating numeric mathematical statistics.
The essential concepts of statistics are mean, median, and mode. They’re simple to calculate in Python, both with and without the use of additional libraries.
These are the three most commonly used measures of central tendency. Central tendency tells us what a dataset’s “normal” or “average” values are. This tutorial is a good fit if you’re just getting started with data science.
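By the end of this tutorial, you will be able to compute the mean, median, and mode in Python, both with your own functions and with the statistics module.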
But, first, understand the meaning of the terms mean, median, and mode.
In this tutorial, we will create our own mean, median, and mode functions in Python and then use the statistics module to compute these metrics quickly.
A Dataset’s Mean
A dataset’s mean, or average, is computed by summing all the values and dividing by the number of items in the collection.
For example, the mean of the dataset [4,5,6] is:
(4+5+6) / 3 = 5
One of the most frequent ways to describe statistical findings is to report the average of a dataset. People often use words like “typically” or “often” to point to the data’s center.
In this section, you will learn how to calculate the mean of a dataset, which is one specific measure of a dataset’s average. We’ll use the mean to answer the question of where the data’s center lies.
Calculating the Average
The mean is the most common way of expressing the average of a dataset.
A two-step process is used to calculate the mean of a dataset:
Step 1: Add all of the observations in your dataset.
Step 2: Divide the total by the number of observations in your dataset.
$$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$$
The mean is calculated using the equation above. Observations x1, x2, … xn come from a dataset of n observations.
Example
Consider the following scenario: we want to determine the average of a dataset with four observations:
data = [4, 6, 2, 8]
Step One: Calculate the total
4 + 6 + 2 + 8 = 20
Step Two: Divide the total by the number of observations
The total is 20, and the number of observations is 4.
20 / 4 = 5
The average of the dataset is 5.
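The two steps above can be sketched in plain Python. This is a minimal illustration using the same four observations; the variable names total and count are chosen here for readability and do not come from any library.

```python
# The four observations from the worked example above
data = [4, 6, 2, 8]

# Step 1: add all of the observations
total = sum(data)

# Step 2: divide the total by the number of observations
count = len(data)
average = total / count

print(total, count, average)
```

Note that the / operator always produces a float in Python 3, so the average is reported as 5.0 rather than 5.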
Average with NumPy
While you’ve demonstrated that you can calculate the average on your own, it becomes time-consuming as your dataset grows in size – imagine adding all of the numbers in a dataset with 10,000 observations.
NumPy’s np.average() and np.mean() functions handle the addition and division for you. In the example below, np.average() calculates the average of a dataset with ten values.
import numpy as np

array_list = np.array([27, 19, 33, 13, 15, 31, 41, 5, 7, 39])
average_vals = np.average(array_list)
print(average_vals)
The code above computes the average of the example array and saves the result in the variable average_vals. The array’s average is 23.0.
Using Python to Calculate the Mean
The mean, also known as the arithmetic average, is one of the most widely known measures of central tendency.
Remember that the central tendency of a set of data is a typical value for that data.
Because a dataset is simply a collection of data, a dataset in Python can be any of the built-in data structures listed below:
- Lists, tuples, and sets, which are collections of objects.
- Strings, which are sequences of characters.
- Dictionaries, which are collections of key-value pairs.
Although Python offers other data structures such as queues and stacks, we’ll only use the built-in ones.
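As a rough illustration of the point above, the same averaging logic works on several built-in containers. The helper name simple_mean and the sample values are made up for this sketch, and for the dictionary we assume the numbers of interest are stored as its values.

```python
def simple_mean(values):
    """Average any iterable of numbers by copying it into a list first."""
    values = list(values)
    return sum(values) / len(values)

ages_list = [4, 5, 6]
ages_tuple = (4, 5, 6)
ages_set = {4, 5, 6}                  # note: a set would drop duplicate values
ages_dict = {"a": 4, "b": 5, "c": 6}

print(simple_mean(ages_list))
print(simple_mean(ages_tuple))
print(simple_mean(ages_set))
print(simple_mean(ages_dict.values()))  # average the values, not the keys
```

Because a set discards duplicates, it is only a safe container for a dataset when repeated observations cannot occur.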
We can find the mean by summing all the values in a dataset and dividing the result by the number of values. Consider the following example:
[4, 5, 6, 7, 8, 9]
Because the list’s length is 6 and its sum is 39, the mean is 39 divided by 6, which is 6.5. The computation looks like this:
(4 + 5 + 6 + 7 + 8 + 9) / 6 = 6.5
User-Defined Mean Function
Let’s start by calculating the average (mean) age of the drivers in a racing team. The team will be known as “Pythonic Racing.”
pythonic_racing_ages = [22, 25, 37, 29, 35, 33, 27, 27]

def custom_mean(data_vals):
    return sum(data_vals) / len(data_vals)

print(custom_mean(pythonic_racing_ages))
Here is a breakdown of this code:
- pythonic_racing_ages is a list of the drivers’ ages.
- A custom_mean() function is defined, which returns the sum of the given dataset divided by its length.
- The sum() function returns the total of the values in an iterable, in this case a list. If you pass the dataset as an argument, it returns 235.
- The len() function returns the length of an iterable. If you pass the dataset to it, it returns 8.
- Finally, we call custom_mean() on the racing team’s ages and print the result.
The output, 29.375, is the average age of the drivers. It’s worth noting that this number doesn’t appear in the dataset, yet it still summarizes the typical age of the drivers.
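One caveat with a function like custom_mean() is that an empty dataset makes len() return 0, so the division fails with a ZeroDivisionError. A possible guard is sketched below. The name safe_mean and the choice to raise ValueError are assumptions of this example, not part of the original function.

```python
def safe_mean(data_vals):
    """Return the arithmetic mean, rejecting empty input explicitly."""
    data_vals = list(data_vals)
    if not data_vals:
        # Hypothetical design choice: fail early with a clear message
        raise ValueError("cannot compute the mean of an empty dataset")
    return sum(data_vals) / len(data_vals)

pythonic_racing_ages = [22, 25, 37, 29, 35, 33, 27, 27]
print(safe_mean(pythonic_racing_ages))

try:
    safe_mean([])
except ValueError as error:
    print("Error:", error)
```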
Using the Python Statistics Module’s mean() function
Calculating measures of central tendency is a common task for most developers. For that reason, Python’s statistics module includes functions for calculating them, along with other essential statistics.
You do not need to install any external packages with pip, because the module is part of the Python standard library.
This is how you use the module:
from statistics import mean

racing_ages = [22, 25, 37, 29, 35, 33, 27, 27]
print(mean(racing_ages))
The code above imports the mean() function from the statistics module and passes the dataset to it as an argument. The result is the same as that of the custom function we defined in the previous section:
29.375
Example 1: using statistics module
import statistics as stat

list_vals = [2, 5, 6, 9]
mean_val = stat.mean(list_vals)
print("The resultant mean of the list is: ", mean_val)
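The statistics module also reports invalid input for you. As a small sketch, mean() raises a StatisticsError when the data is empty, which you can catch like any other exception. The variable names here are illustrative.

```python
from statistics import StatisticsError, mean

list_vals = [2, 5, 6, 9]
mean_val = mean(list_vals)
print("Mean of the list:", mean_val)

try:
    mean([])
except StatisticsError as error:
    # Raised because there is no data to average
    print("Could not compute the mean:", error)
```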
Let’s move on to the median measurement now that you’ve grasped the concept of the mean.
A dataset’s median
The median of a dataset is the value that falls in the middle once the dataset is sorted from smallest to greatest. In a dataset with an even number of values, the median is the average of the two middle values.
Assume we have the following ten numbers in our dataset:
34, 26, 40, 20, 22, 38, 48, 12, 14, 46
The result of ordering this dataset from the smallest to the largest is as follows.
12, 14, 20, 22, 26, 34, 38, 40, 46, 48
The two middle values of this dataset are 26 and 34, because they are the fifth and sixth observations in the sorted dataset. In other words, there are four observations to the left of 26 and four to the right of 34. The median is their average: (26 + 34) / 2 = 30.
If we added a new value (say, 28) near the middle of the dataset, it would look like this:
12, 14, 20, 22, 26, 28, 34, 38, 40, 46, 48
Because there are 5 values smaller than it and 5 values greater than it, the new median equals 28.
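The worked example above can be checked with a few lines of Python. This sketch uses statistics.median() from the standard library; the variable names are chosen for this illustration only.

```python
from statistics import median

values = [34, 26, 40, 20, 22, 38, 48, 12, 14, 46]

# Ten values: the median is the average of the 5th and 6th sorted values
ordered = sorted(values)
print(ordered[4], ordered[5])
even_median = median(values)
print(even_median)

# Eleven values after adding 28: the median is the single middle value
odd_median = median(values + [28])
print(odd_median)
```

The function sorts the data internally, so the input list does not need to be ordered beforehand.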
How to Find the Median in Python
The median of a sorted dataset is the value in the middle. Like the mean, it is used to provide a “typical” value for a given population.
In programming terms, the median is the value that divides a sequence into two parts: the lower half and the upper half.
We must first sort the dataset to calculate the median. We could use a sorting algorithm or the built-in function sorted() to accomplish this. The next step is to figure out whether the length of the dataset is odd or even, because the calculation differs in each case:
- Odd: The median is the dataset’s middle value.
- Even: The median is a result of dividing the sum of the two middle numbers by two.
Let’s continue with our racing team dataset and calculate the median height in cm of the drivers:
[181, 187, 196, 196, 198, 203, 207, 211, 215]
Since the dataset has an odd length, we select the middle value:
median = 198
As you can see, we can use the middle value as the median because the dataset length is odd. What if, on the other hand, one driver retired?
We’d have to determine the median using the dataset’s two middle values.
[181, 187, 196, 198, 203, 207, 211, 215]
We select the two middle values, add them, and divide the sum by 2:
median = (198 + 203) / 2
median = 200.5
User-Defined Median Function
Let’s put the above idea into practice with a Python function. While doing so, remember the three steps we must follow to find the dataset’s median:
Step 1: Sort the dataset. This can be accomplished using the sorted() function.
Step 2: Determine whether the length of the dataset is odd or even. We can do this by calculating the dataset’s length and applying the modulo operator (%).
Step 3: Calculate the median for each case:
- Odd: Return the value in the middle.
- Even: Return the average of the two middle values.
As a result, the following function would be created:
racing_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
racing_heights_after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]

def median(dataset):
    data = sorted(dataset)
    index = len(data) // 2

    # If the dataset is odd
    if len(dataset) % 2 != 0:
        return data[index]

    # If the dataset is even
    return (data[index - 1] + data[index]) / 2
Next, we print the result for each dataset.
print(median(racing_heights))
print(median(racing_heights_after_retirement))
At the start of the function, we create a data variable that points to the sorted dataset. Even though the lists above are sorted, we want to create a reusable function, so we’ll have to sort the dataset each time the function is called.
Using the integer division operator, the index variable stores the position of the dataset’s middle value, or of its upper-middle value when the length is even. For example, if we passed the racing_heights list, the index would be 4.
Because sequence indexes in Python start at zero, integer division of the length by two gives us the middle index of a list.
Then we check whether the result of the modulo operation is different from zero to see if the dataset’s length is odd. If the condition is true, as it is for the racing_heights list, we return the middle element:
racing_heights[4]
If the dataset has an even length, we return the sum of the two middle values divided by two. It’s worth noting that data[index - 1] gives us the dataset’s lower-middle value, whereas data[index] gives us the upper-middle value.
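To make the index arithmetic concrete, the sketch below reports the middle index (or the two middle indexes) and the elements they select for both racing lists. The helper middle_positions is a hypothetical name introduced only for this illustration.

```python
def middle_positions(dataset):
    """Return the sorted data and the index (or two indexes) used for the median."""
    data = sorted(dataset)
    index = len(data) // 2
    if len(data) % 2 != 0:
        return data, (index,)           # odd length: one middle index
    return data, (index - 1, index)     # even length: lower and upper middle

racing_heights = [181, 187, 196, 196, 198, 203, 207, 211, 215]
racing_heights_after_retirement = [181, 187, 196, 198, 203, 207, 211, 215]

for heights in (racing_heights, racing_heights_after_retirement):
    data, positions = middle_positions(heights)
    print(positions, [data[i] for i in positions])
```

For the nine-element list this selects index 4 (the value 198), and for the eight-element list it selects indexes 3 and 4 (the values 198 and 203).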
Using the Python Statistics Module’s median() function
Because we’re using an existing function from the statistics module, this method is significantly easier.
Following the DRY principle (don’t repeat yourself, which here also means not duplicating other people’s code), we should use what has already been defined for us.
The following code determines the median of two lists of racing heights:
from statistics import median as md

racing_heights = [211, 217, 226, 226, 228, 233, 237, 241, 245]
racing_height_after_retirement = [211, 217, 226, 228, 233, 237, 241, 245]

print(md(racing_heights))
print(md(racing_height_after_retirement))
Example 1: using median statistics module
import statistics as st

list_vals = [5, 8, 9, 12]
median_val = st.median(list_vals)
print("The resultant Median of the list is: ", median_val)
Median_low
When the number of data points is odd, the middle value is returned. When it is even, the smaller of the two middle values is returned.
import statistics as stat

list_vals = [4, 6, 8, 10]
low_median = stat.median_low(list_vals)
print("The Median Low of the list is: ", low_median)
Median_high
When the number of data points is odd, the middle value is returned. When it is even, the greater of the two middle values is returned.
import statistics as stat

list_vals = [4, 6, 8, 10]
high_median = stat.median_high(list_vals)
print("The resultant Median High of the list is: ", high_median)
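To see how the three variants differ, the sketch below calls median(), median_low(), and median_high() on the same even-length list. The variable names are illustrative.

```python
import statistics as stat

list_vals = [4, 6, 8, 10]

regular = stat.median(list_vals)      # average of the two middle values
low = stat.median_low(list_vals)      # smaller of the two middle values
high = stat.median_high(list_vals)    # larger of the two middle values

print(regular, low, high)
```

Unlike median(), the low and high variants always return a value that actually occurs in the data, which can be useful when the data is discrete.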
Median_grouped
The median_grouped() function returns the median of grouped continuous data, calculated as the 50th percentile using interpolation. A StatisticsError is raised if the data is empty.
The data are rounded in the following example. Each number reflects the midpoint of data classes, e.g., 1 represents the midpoint of class 0.5-1.5, 2 represents the midpoint of 1.5-2.5, 3 represents the midpoint of 2.5-3.5, and so on.
import statistics as stat

list_vals = [3, 5, 5, 6, 7, 7, 7, 7, 7, 8]
grouped_median = stat.median_grouped(list_vals)
print("The Median Group of the list is: ", grouped_median)
Example 1.6: grouped median
The median_grouped() function returns the grouped median of numeric data. The class interval is set by the optional interval argument, which defaults to 1. Changing the class interval changes the interpolation:
import statistics as st

var_list = [4, 6, 6, 8, 10]
median_group_one = st.median_grouped(var_list, interval=1)
median_group_two = st.median_grouped(var_list, interval=2)
print("The Median Group One of the list is: ", median_group_one)
print("The Median Grouped Two of the list is: ", median_group_two)
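The interpolation can also be sketched by hand. The function below follows the textbook grouped-median formula L + interval * (n/2 - cf) / f, where L is the lower limit of the median class, cf is the number of values below that class, and f is the number of values inside it. The name grouped_median_sketch is hypothetical, and this is an illustration of the idea rather than the module’s actual source code.

```python
import statistics as st

def grouped_median_sketch(data, interval=1):
    """Interpolate the median, treating each value as a class midpoint."""
    data = sorted(data)
    n = len(data)
    if n == 0:
        raise ValueError("no data")
    x = data[n // 2]                          # midpoint of the median class
    lower_limit = x - interval / 2            # L: lower class boundary
    below = sum(1 for v in data if v < x)     # cf: values below the class
    in_class = data.count(x)                  # f: values inside the class
    return lower_limit + interval * (n / 2 - below) / in_class

var_list = [4, 6, 6, 8, 10]
for width in (1, 2):
    print(width,
          grouped_median_sketch(var_list, width),
          st.median_grouped(var_list, interval=width))
```

Printing both results side by side makes it easy to compare the hand-written interpolation with the library function for each interval.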
A Dataset’s Mode
The mode is the most frequent value in a given dataset. We can think of it as a school’s “popular” group, which may set a benchmark for all pupils.
A tech store’s daily sales could be an illustration of mode. The most popular product on a given day would be the mode of that dataset.
['laptop', 'desktop', 'smartphone', 'featurephone', 'featurephone', 'headphones']
As you can see, the mode of the above dataset is “featurephone” because it is the most common value in the list.
A great thing about the mode is that it doesn’t require a numeric dataset. So we can, for example, work with strings.
Let’s have a look at the sales from another day:
['mouse', 'camera', 'headphones', 'usb', 'headphones', 'mouse']
Because both have a frequency of two, the dataset above contains two modes: “mouse” and “headphones.” This means that the dataset is multimodal.
What if, in a dataset like the one below, we can’t find the mode?
['usb', 'camera', 'smartphone', 'laptop', 'television']
This is known as a uniform distribution: every value occurs equally often, so no single value stands out as the mode.
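It can help to see how Python treats this case. As a small sketch (multimode() requires Python 3.8 or later), when every value occurs exactly once, multimode() returns all of them, since each one ties for the highest frequency. The variable names are illustrative.

```python
from statistics import multimode

sales = ['usb', 'camera', 'smartphone', 'laptop', 'television']

modes = multimode(sales)
print(modes)

# Every value ties for the top frequency, so no single mode stands out
print(len(modes) == len(sales))
```

A result as long as the dataset itself is therefore a practical signal that the data has no meaningful mode.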
Let’s compute mode in Python now that you’ve grasped the concept of mode.
Creating a User-Defined Mode
The frequency of a value can be thought of as a key-value pair or a Python dictionary.
This time we’ll use a basketball example and work with two datasets: the points scored per game and the sneaker sponsors of some players.
To find the mode, we must first create a frequency dictionary with all of the values in the dataset, then calculate the maximum frequency and return all elements with that frequency.
Let’s examine what this looks like in code:
game_points = [6, 18, 26, 45, 33, 13, 13, 15]
name_of_sponsor = ['nike', 'deloitte', 'nike', 'adidas', 'adidas', 'rebook', 'puma', 'deloitte']

def mode(dataset):
    freq = {}
    for val in dataset:
        freq[val] = freq.get(val, 0) + 1

    frequent_val = max(freq.values())
    mode_vals = [key for key, value in freq.items() if value == frequent_val]
    return mode_vals
Using the two lists as arguments, we can check the result:
print(mode(game_points))
print(mode(name_of_sponsor))
As observed, the first print statement returned only one mode, whereas the second returned several modes.
To further explain the code above, consider the following:
- A freq dictionary is declared.
- We generate a histogram by iterating over the dataset. A histogram is a statistical term for a set of counters (or frequencies).
- If the key is found in the dictionary, its value is increased by one. If it isn’t found, a key-value pair with a value of one is created.
- The frequent_val variable stores the frequency dictionary’s largest value (not key).
- The mode_vals variable contains all of the keys in the frequency dictionary that have the highest frequency.
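The same frequency-dictionary idea can be written more compactly with collections.Counter from the standard library. This is an alternative sketch, not the function defined above, and the name mode_with_counter is made up for the example.

```python
from collections import Counter

def mode_with_counter(dataset):
    """Return every value that shares the highest frequency."""
    freq = Counter(dataset)            # maps each value to its number of occurrences
    highest = max(freq.values())       # the largest frequency in the dataset
    return [value for value, count in freq.items() if count == highest]

game_points = [6, 18, 26, 45, 33, 13, 13, 15]
name_of_sponsor = ['nike', 'deloitte', 'nike', 'adidas',
                   'adidas', 'rebook', 'puma', 'deloitte']

print(mode_with_counter(game_points))
print(mode_with_counter(name_of_sponsor))
```

Counter does the counting loop for us, so only the "find the maximum and filter" part of the logic remains. Like the original function, this sketch does not handle an empty dataset.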
Example 1: user-defined mode
def mode_one(first):
    y = {}
    for a in first:
        if a not in y:
            y[a] = 1
        else:
            y[a] += 1
    return [g for g, i in y.items() if i == max(y.values())]

print(mode_one([43, 45, 67, 54, 56, 76, 89, 89, 12, 13, 14, 15, 16, 17, 24, 45, 45, 6, 7]))
Using the Python Statistics Module’s mode() and multimode() functions
The statistics module offers us a simple way to do basic statistics operations once again.
There are two functions that we can use:
- mode() and
- multimode()
from statistics import mode, multimode

game_points = [6, 18, 26, 45, 33, 13, 13, 15]
name_of_sponsor = ['nike', 'adidas', 'nike', 'jordan', 'jordan', 'rebook', 'under-armour', 'adidas']
The code above defines the datasets we’ve been working with and imports both functions.
Here’s where the minor difference comes in: multimode() returns a list of all the most frequent values in the dataset, whereas mode() returns only the first mode it encounters.
In other words, the custom function we created earlier behaves like multimode().
print(mode(game_points))
print(mode(name_of_sponsor))
Note that in Python 3.8 and higher, mode() returns the first mode encountered. In older versions, it raises a StatisticsError when the dataset has more than one mode.
The multimode() function can be used in the following way:
print(multimode(game_points))
print(multimode(name_of_sponsor))
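The difference between the two functions is easiest to see on a multimodal dataset. The sketch below assumes Python 3.8 or later and reuses the sponsor list from above; the names first_mode and all_modes are chosen for this illustration.

```python
from statistics import mode, multimode

name_of_sponsor = ['nike', 'adidas', 'nike', 'jordan',
                   'jordan', 'rebook', 'under-armour', 'adidas']

first_mode = mode(name_of_sponsor)       # first value that reaches the top count
all_modes = multimode(name_of_sponsor)   # every value with the top count

print(first_mode)
print(all_modes)
```

Here 'nike', 'adidas', and 'jordan' each appear twice, so mode() reports only 'nike', the first of them in the list, while multimode() reports all three in the order they were first encountered.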
Example 1: Using statistics mode
import statistics as stat

list_vals = [5, 8, 6, 5, 11, 6, 12, 7, 5, 8, 9]
mode_val = stat.mode(list_vals)
print("The resultant mode of the list is: ", mode_val)
Example 2: Using statistics mode
import statistics as stat

list_vals = ["green", "orange", "orange", "green", "red", "green", "green"]
mode_val = stat.mode(list_vals)
print("The resultant mode of the list is: ", mode_val)
Summary
Congratulations! If you’ve read this far, you’ve learned how to compute the mean, median, and mode, which are the three primary measures of central tendency.
Although you can create custom functions to find mean, median, and mode, the statistics module is suggested because it is part of the standard library and requires no installation.
When we understand the central tendency of a sample of data, we usually look at the mean (or average), the median, and the mode first.
Overall, we learned how to use Python to compute the mean, median, and mode. We went over how to develop our own functions to compute them step by step and then how to leverage Python’s statistics module to obtain these metrics quickly.