Handling missing data using SimpleImputer of Scikit-learn

Data preparation is one of the tasks you must complete before training your machine learning model. At the core of the data preprocessing activity is data cleansing, which usually entails eliminating rows with empty values or replacing them with imputed values.

In statistics, imputation is the process of replacing missing data with substitute values. To “impute” a value is to assign it by inference from the values it relates to, for example the mean of the observed values in the same column.

This article explores the utilization of the SimpleImputer class in sklearn to replace missing values in Pandas dataframes quickly and easily.

SimpleImputer is a scikit-learn class for handling missing data in datasets used for predictive modeling. It replaces missing values (NaN by default) with a value computed from each column or with a constant you specify.

You create it by calling the SimpleImputer() constructor, which takes the following arguments, among others:

  • missing_values: the placeholder that marks a missing value and that the imputer must replace. The default is NaN (np.nan).
  • strategy: the method used to compute the replacement value. The accepted values include 'mean' (default), 'median', 'most_frequent', and 'constant'.
  • fill_value: the constant that replaces the missing values when the 'constant' strategy is used.
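
As a minimal sketch of how missing_values and strategy interact, the snippet below uses a hypothetical toy array in which -1, rather than NaN, marks the missing entries. The array and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data: -1 marks a missing entry instead of NaN.
X = np.array([[1.0, -1.0],
              [3.0, 4.0],
              [-1.0, 8.0]])

# missing_values tells the imputer which placeholder to look for;
# strategy='mean' replaces each placeholder with its column's mean,
# computed from the non-missing entries only.
imputer = SimpleImputer(missing_values=-1, strategy='mean')
X_filled = imputer.fit_transform(X)

print(X_filled)
```

The first column's missing entry becomes 2.0 (the mean of 1 and 3), and the second column's becomes 6.0 (the mean of 4 and 8).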

Replacing missing values

NaN is used to represent all missing values in the dataframe. You can usually either remove them or replace them with inferred values. To fill the NaN in the A column with the mean, for example, you could do something like this:

df_values['A'] = df_values['A'].fillna(df_values['A'].mean())
df_values

The mean of column A is now used to fill the empty values in column A.

This is straightforward, but the fill strategy you need may differ from column to column. Instead of using the column’s mean to fill in missing values, you might wish to use the value that occurs most frequently. In column E, for example, the most frequently occurring value is “Good.”

You can use the following line to replace the missing values in column E with the most frequently occurring value:

df_values['E'] = df_values['E'].fillna(df_values['E'].value_counts().index[0])
df_values
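
The article's dataset is not shown here, so the following is a self-contained sketch of both fills on a small, made-up dataframe. The name df_demo and its contents are assumptions for illustration. It uses mode()[0] as an alternative way to obtain the most frequent value.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df_values: one numeric and one text column.
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0, 5.0],
    'E': ['Good', 'Bad', np.nan, 'Good'],
})

# Fill the numeric column with its mean (NaN is skipped when computing it).
df_demo['A'] = df_demo['A'].fillna(df_demo['A'].mean())

# Fill the text column with its most frequent value.
df_demo['E'] = df_demo['E'].fillna(df_demo['E'].mode()[0])

print(df_demo)
```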

Using the SimpleImputer Class from sklearn

Instead of the fillna() function, you can use the SimpleImputer class from sklearn. The SimpleImputer class is found in the sklearn.impute module. The best way to learn how to use it is to look at an example.

The following Python code demonstrates how to use SimpleImputer.

import numpy as np

# first, import the SimpleImputer class
from sklearn.impute import SimpleImputer

# create the imputer: treat np.nan as missing and
# replace it with the mean of the column
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')

initial_data = [[22, np.nan, 44],
                [20, 42, np.nan],
                [np.nan, 21, 30]]

print("The initial data : \n", initial_data)

# fit the imputer: this computes the mean of each column
imputer = imputer.fit(initial_data)

# impute the missing values in the given data
data = imputer.transform(initial_data)

# print the imputed result, not the original input
print("Resultant imputed data : \n", data)

Remember that the mean or median is calculated along each column of the matrix. In the example above, the missing entry in the first column becomes 21, the mean of 22 and 20. To work with a dataframe, first create an instance of the SimpleImputer class, specifying the strategy (mean) and the missing values you wish to find (np.nan):

imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

Once you’ve created the instance, use the fit() function to fit the imputer on the column(s) you wish to work on:

imputer = imputer.fit(df_values[['A']])

Now you can use the transform() function to fill in the missing values using the strategy you specified in the SimpleImputer class’s initializer. Keep in mind that both the fit() and transform() functions require a 2D array. If you pass in a 1D array or a Pandas Series instead, you’ll get an error. The transform() function also returns a 2D array. In this case, we assign the result back to column A:

df_values['A'] = imputer.transform(df_values[['A']])
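
The 2D requirement can be checked with a short sketch. The dataframe df_demo is a made-up example, and the exact wording of the error message may differ between scikit-learn versions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical one-column dataframe used only for this illustration.
df_demo = pd.DataFrame({'A': [1.0, np.nan, 3.0]})
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)

print(df_demo['A'].shape)    # single brackets give a 1D Series
print(df_demo[['A']].shape)  # double brackets give a 2D dataframe

# Passing the 1D Series is rejected with a ValueError.
try:
    imputer.fit(df_demo['A'])
    raised = False
except ValueError as err:
    raised = True
    print("fit() rejected the 1D input:", err)

# Passing the 2D selection works; the result is a 2D NumPy array.
filled = imputer.fit_transform(df_demo[['A']])
print(filled.shape)

# ravel() flattens the single-column result before assigning it back.
df_demo['A'] = filled.ravel()
print(df_demo)
```

Flattening with ravel() is one option for a single column. Assigning to df_demo[['A']] with double brackets, as the later examples do for several columns, is another.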

Replacing values in multiple columns

To replace the missing values in several columns of your dataframe, you only need to pass in a dataframe containing those columns:

import pandas as pd

df_values = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='mean', missing_values=np.nan)
imputer = imputer.fit(df_values[['A','B']])
df_values[['A','B']] = imputer.transform(df_values[['A','B']])
df_values

The preceding example fills the missing values in columns A and B using the “mean” strategy.

Using the median as a replacement

Instead of replacing the missing values with the mean of each column, you can use the median:

imputer = SimpleImputer(strategy='median', missing_values=np.nan)
imputer = imputer.fit(df_values[['A','B']])
df_values[['A','B']] = imputer.transform(df_values[['A','B']])
df_values

Substituting the most common value

Use the “most_frequent” strategy to replace missing values with the most frequently occurring value:

imputer = SimpleImputer(strategy='most_frequent',
                        missing_values=np.nan)
imputer = imputer.fit(df_values[['C']])
df_values[['C']] = imputer.transform(df_values[['C']])
df_values

This strategy works well for categorical columns (although it also works for numerical columns).
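
As a small sketch of the “most_frequent” strategy on a categorical column, the snippet below uses a made-up column C (the data and the name df_demo are assumptions). After fitting, the statistics_ attribute holds the value the imputer chose for each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text column with two missing entries,
# stored as object dtype so that NaN marks the gaps.
df_demo = pd.DataFrame(
    {'C': ['Good', 'Bad', np.nan, 'Good', np.nan]},
    dtype=object,
)

imputer = SimpleImputer(strategy='most_frequent', missing_values=np.nan)
df_demo[['C']] = imputer.fit_transform(df_demo[['C']])

# The fill value learned during fit(), one entry per column.
print(imputer.statistics_)
print(df_demo)
```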

Using a fixed value as a replacement

Another option is to use a fixed (constant) value to replace missing values. To do so, set the strategy argument to “constant” and use the fill_value argument to specify the replacement value:

imputer = SimpleImputer(strategy='constant',
                        missing_values=np.nan, fill_value=0)
imputer = imputer.fit(df_values[['A','B']])
df_values[['A','B']] = imputer.transform(df_values[['A','B']])
df_values

In the code excerpt above, all missing values in columns A and B are replaced with 0.

Applying the SimpleImputer to the entire dataframe

To apply the same strategy to the entire dataframe, call the fit() and transform() functions on the dataframe itself. You can then write the result back into the dataframe using the iloc[] indexer:

df_values = pd.read_csv('NaNDataset.csv')
imputer = SimpleImputer(strategy='most_frequent',
                        missing_values=np.nan)
imputer = imputer.fit(df_values)
df_values.iloc[:,:] = imputer.transform(df_values)
df_values

Another option is to use the result of the transform() function to construct a new dataframe:

df_values = pd.DataFrame(imputer.transform(df_values.loc[:,:]),
                         columns=df_values.columns)
df_values

The preceding example applies the “most_frequent” strategy to the entire dataframe. Note that the mean and median strategies would raise an error here because column C isn’t numerical.
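
The sketch below reproduces that error on a made-up mixed-type dataframe and then shows one possible workaround: imputing numeric and non-numeric columns separately. The dataframe, the variable names, and the split by dtype are assumptions for illustration, not part of the original dataset.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataframe with one numeric and one text column.
df_demo = pd.DataFrame({
    'A': [1.0, np.nan, 3.0],
    'C': pd.Series(['Good', np.nan, 'Good'], dtype=object),
})

# The mean strategy cannot handle the text column and raises ValueError.
try:
    SimpleImputer(strategy='mean').fit(df_demo)
    mean_failed = False
except ValueError as err:
    mean_failed = True
    print("mean strategy failed:", err)

# Workaround sketch: pick a strategy per group of columns.
numeric_cols = df_demo.select_dtypes(include='number').columns
other_cols = df_demo.columns.difference(numeric_cols)

df_demo[numeric_cols] = SimpleImputer(strategy='mean').fit_transform(
    df_demo[numeric_cols])
df_demo[other_cols] = SimpleImputer(strategy='most_frequent').fit_transform(
    df_demo[other_cols])

print(df_demo)
```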

Conclusion

Many real-world datasets have missing values, commonly encoded as blanks, NaNs, or other placeholders. However, such datasets are incompatible with scikit-learn estimators, which assume that all values in an array are numerical and meaningful.

A basic strategy for using incomplete datasets is to discard entire rows or columns that contain missing values. However, this comes at the cost of losing data that may be valuable even though it is incomplete. A better strategy is often to impute the missing values, i.e., to infer them from the known part of the data.

In this article, we’ve shown you how to use sklearn’s SimpleImputer class to replace missing values in your dataframe. While you can also replace missing values manually with the fillna() function, the SimpleImputer class handles the task in a consistent way. If you’re working with sklearn, it’s also easy to combine SimpleImputer with Pipeline objects.
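
As a minimal sketch of that combination, the snippet below chains a SimpleImputer with a LinearRegression estimator on made-up data. The choice of estimator, the step names, and the data are assumptions for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Hypothetical training data with one missing feature value.
X = np.array([[1.0], [2.0], [np.nan], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# The imputer runs first, so the regressor never sees NaN.
model = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('regress', LinearRegression()),
])
model.fit(X, y)

# At prediction time, NaN is filled with the mean learned during fit().
pred = model.predict(np.array([[np.nan], [3.0]]))
print(pred)
```

Because the imputer is part of the pipeline, the fill value is learned from the training data only and reused unchanged when predicting on new data.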
