Dropping duplicate rows in pandas

Do you ever accidentally end up with repeated rows in your data? pandas can remove them for you: DataFrame.drop_duplicates() eliminates duplicate rows from your DataFrame, either across all columns or across a chosen subset of columns.

It is especially helpful when you want to guarantee that your data has a unique key, or unique rows in general. Duplicate rows in a DataFrame are removed with the pandas.DataFrame.drop_duplicates() method, and you can drop duplicates over all columns or over one or more specific columns. In this post we show several techniques for removing duplicate rows from a pandas DataFrame, with examples using DataFrame.drop_duplicates() as well as DataFrame.apply() with a lambda function.

The syntax is as follows:

DataFrame.drop_duplicates()

This method effectively combines DataFrame.duplicated() and DataFrame.drop() into one call. We use it most often when a column represents an entity’s unique id; in that case we call .drop_duplicates() with the subset argument set to that unique column.

In pseudo-code: check whether any rows (or a subset of their columns) contain duplicate data; if so, drop them as needed.
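That pseudo-code can be sketched concretely (the small DataFrame here is made up for illustration): duplicated() flags the repeat rows, and drop_duplicates() removes them, keeping exactly the rows the mask does not flag.

```python
import pandas as pd

# made-up sample data with one repeated row
df = pd.DataFrame({"id": [1, 2, 2, 3], "value": ["a", "b", "b", "c"]})

# step 1: check whether any rows contain duplicate data
if df.duplicated().any():
    # step 2: drop them; this keeps exactly the rows that
    # duplicated() does not flag, i.e. df[~df.duplicated()]
    deduped = df.drop_duplicates()
    print(deduped)  # three unique rows remain
```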

Dropping duplicate rows in Pandas Python

The two choices you need to make when using .drop_duplicates() are: 1) which subset of your data should pandas check for duplicates, and 2) which of the duplicates do you want to keep: the first, the last, or none?

Simple Cases of Drop Duplicate Rows

Here are a few simple examples of removing duplicate rows from a pandas DataFrame if you’re in a rush.

# keep the first duplicate row
second_df = df.drop_duplicates()

# To keep the first duplicate row, use DataFrame.drop_duplicates().
second_df = df.drop_duplicates(keep='first')

# maintain the final duplicate row
second_df = df.drop_duplicates(keep='last')

# Eliminate all redundant rows.
second_df = df.drop_duplicates(keep=False)

# Remove duplicate rows based on specific columns.
second_df = df.drop_duplicates(subset=["language", "cost"], keep=False)

# Duplicate rows should be removed.
df.drop_duplicates(inplace=True)

# Using DataFrame.apply() and a lambda function
second_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['language', 'cost'], keep='first')

Examples & Syntax of drop_duplicates()

The method’s syntax is shown below; drop_duplicates() removes duplicate rows from a pandas DataFrame.

# Syntax of drop_duplicates
DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)
  • subset – a column label or list of labels to consider when identifying duplicates. Defaults to None, which compares all columns; pass specific columns to look for duplicates only in those.
  • keep – determines which duplicates to keep. The only valid values are 'first' (the default), 'last', and False.
  • 'first' – drops all duplicate rows except the first occurrence.
  • 'last' – drops all duplicate rows except the last occurrence.
  • False – drops every duplicated row.
  • inplace – a Boolean; when True, duplicate rows are removed from the existing DataFrame instead of returning a new one. Defaults to False.
  • ignore_index – a Boolean; when True, the resulting rows are relabeled 0, 1, …, n-1. Defaults to False.
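To see how these parameters interact, here is a minimal sketch with made-up data:

```python
import pandas as pd

# made-up data: the first two rows are identical
df = pd.DataFrame({"x": [1, 1, 2], "y": ["a", "a", "b"]})

print(df.drop_duplicates().index.tolist())                   # [0, 2] -- keep='first' is the default
print(df.drop_duplicates(keep='last').index.tolist())        # [1, 2] -- last occurrence kept
print(len(df.drop_duplicates(keep=False)))                   # 1 -- every duplicated row dropped
print(df.drop_duplicates(ignore_index=True).index.tolist())  # [0, 1] -- index relabeled
```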

Let’s now create a DataFrame with a few duplicate rows. The columns in our DataFrame are named language, cost, time_span, and field_work.

import pandas as pd

assignment = {
    'language':["Python","JavaScript","PHP","Java","PHP","Python","PHP"],
    'cost' :[40000,45000,42000,50000,42000,40000,42000],
    'time_span':['180days','310days','200days','180days','200days','180days','200days'],
    'field_work':[1000,2300,1200,2000,1200,1000,1200]
}
df = pd.DataFrame(assignment)
print(df)

Pandas Remove Repeated Rows

Here we work with this DataFrame. First, drop rows with identical values across all columns by calling drop_duplicates() without any arguments; it uses the defaults keep='first' and subset=None. After deleting the duplicate entries from our DataFrame, the example below returns four rows.

# retain the first row of duplicates
second_df = df.drop_duplicates()
print(second_df)

# Using DataFrame.drop_duplicates() to keep first duplicate row
second_df = df.drop_duplicates(keep='first')
print(second_df)

Remove redundant rows while keeping the final row

Pass the keep parameter with the value 'last' if you want to drop all duplicate rows except their last occurrence, for example df.drop_duplicates(keep='last').

# maintain the last duplicate row
second_df = df.drop_duplicates(keep='last')
print(second_df)

From the Pandas DataFrame, eliminate all duplicate rows

To eliminate every duplicated row entirely, pass keep=False to the drop_duplicates() function, for example df.drop_duplicates(keep=False):

# Eliminate all redundant rows.
second_df = df.drop_duplicates(keep=False)
print(second_df)

Remove Duplicate Rows Based on Particular Columns

If you want to remove duplicate rows based on multiple columns, pass all of those column names in the subset list. As before, keep=False drops every duplicated row.

# Remove duplicate rows based on particular columns.
second_df = df.drop_duplicates(subset=["language", "cost"], keep=False)
print(second_df)

Remove duplicate rows in place

With inplace=True, drop_duplicates() modifies the existing DataFrame and returns None instead of a new DataFrame.

# drop duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

Eliminate Duplicate Rows Using the Lambda Function and DataFrame.apply()

Duplicate rows can also be removed case-insensitively. Convert the DataFrame’s values to lowercase strings with DataFrame.apply() and a lambda function, then drop duplicates on the columns you care about.

# Using DataFrame.apply() and a lambda function
second_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['language', 'cost'], keep='first')
print(second_df)
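Note that this approach returns a DataFrame whose values have all been converted to lowercase strings. If you need case-insensitive matching while preserving the original values, one alternative sketch (with made-up data) is to build the duplicate mask on a lowercased copy and apply it to the original DataFrame:

```python
import pandas as pd

# made-up data: "Python" and "python" differ only in case
df = pd.DataFrame({
    "language": ["Python", "python", "Java"],
    "cost": [40000, 40000, 50000],
})

# build the mask on a lowercased view, then filter the original rows
mask = df.apply(lambda x: x.astype(str).str.lower()).duplicated(subset=["language", "cost"])
print(df[~mask])  # keeps "Python" and "Java" with their original casing
```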

Detailed Illustration of Dropping Duplicate Rows in DataFrame

import pandas as pd

assignment = {
    'language':["Python","JavaScript","PHP","Java","PHP","Python","PHP"],
    'cost' :[40000,45000,42000,50000,42000,40000,42000],
    'time_span':['180days','310days','200days','180days','200days','180days','200days'],
    'field_work':[1000,2300,1200,2000,1200,1000,1200]
}
df = pd.DataFrame(assignment)
print(df)

# keep the first duplicate row
second_df = df.drop_duplicates()
print(second_df)

# Using DataFrame.drop_duplicates() to keep first duplicate row
second_df = df.drop_duplicates(keep='first')
print(second_df)

# keep the last duplicate row
second_df = df.drop_duplicates(keep='last')
print(second_df)

# Remove all duplicate rows
second_df = df.drop_duplicates(keep=False)
print(second_df)

# Remove duplicate rows based on particular columns
second_df = df.drop_duplicates(subset=["language", "cost"], keep=False)
print(second_df)

# dropping of the duplicate rows in place
df.drop_duplicates(inplace=True)
print(df)

# using DataFrame.apply() and a lambda function
second_df = df.apply(lambda x: x.astype(str).str.lower()).drop_duplicates(subset=['language', 'cost'], keep='first')
print(second_df)

Example: deleting every row that has a duplicate

In this example, every row that has a duplicate is dropped (keep=False). Because the CSV file contains no fully duplicated row, a copy of an existing row (row 337) is appended to the DataFrame first.

import pandas as pd

# load the data (assumes an employees.csv file is available)
data = pd.read_csv("employees.csv")

# current length before adding a row
length_one = len(data)

# manually insert a duplicate of row 337
data.loc[1001] = [data["First Name"][337],
                  data["Gender"][337],
                  data["SDate"][337],
                  data["Last Login Time"][337],
                  data["NetPay"][337],
                  data["Bonus %"][337],
                  data["Project Management"][337],
                  data["Team"][337]]

# resulting length after adding the row
length_two = len(data)

# sort by First Name
data.sort_values("First Name", inplace=True)

# dropping duplicate values
data.drop_duplicates(keep=False, inplace=True)

# length post duplicate removal
length_three = len(data)

# printing all data frame lengths
print(length_one, length_two, length_three)
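The snippet above depends on an employees.csv file. The same before/after length check can be sketched with a small in-memory DataFrame (all names below are made up):

```python
import pandas as pd

# made-up stand-in for the CSV data
data = pd.DataFrame({
    "First Name": ["Douglas", "Thomas", "Maria"],
    "Team": ["Marketing", "Finance", "Legal"],
})

length_one = len(data)                        # length before adding a duplicate

# append a copy of row 1 manually, as the CSV example does with row 337
data.loc[len(data)] = data.loc[1].tolist()
length_two = len(data)                        # length after adding the duplicate

# drop every copy of each duplicated row
data.drop_duplicates(keep=False, inplace=True)
length_three = len(data)

print(length_one, length_two, length_three)   # 3 4 2
```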

Example: Remove rows with the same First Name

The following example returns a new data frame after removing rows with the same First Name.

# import the pandas package
import pandas as pd

# converting a CSV file to a data frame
data = pd.read_csv("employees.csv")

# sort by First Name
data.sort_values("First Name", inplace=True)

# remove ALL rows whose First Name appears more than once
data.drop_duplicates(subset="First Name", keep=False, inplace=True)

# finally, show the data
print(data)

In the example below, the repeated ClubA and ClubB rows are removed because their values are exact duplicates; only the first occurrence of each is kept.

import pandas as pd

vals = {
	"A": ["ClubA", "ClubB", "ClubB", "ClubC", "ClubA"],
	"B": [63, 53, 53, 43, 63],
	"C": [True, False, False, False, True]
}

df = pd.DataFrame(vals)

print(df.drop_duplicates())

Example: dropping redundant rows with the default settings

When we use pandas’ default drop_duplicates() behavior, it identifies all duplicate rows and keeps only the first occurrence of each. As the example below shows, row 2 (index=1) is discarded because it is the second instance of a duplicate row.

electronic_df = pd.DataFrame({
    'brand': ['Lenovo', 'Lenovo', 'Toshiba', 'Toshiba', 'Toshiba'],
    'device': ['Laptop', 'Laptop', 'Backpack', 'Earphones', 'Backpack'],
    'rating': [4.1, 4.1, 6.6, 9.7, 8.1]
})
print(electronic_df)
print(electronic_df.drop_duplicates())

Example: removing rows from a subset of duplicate columns

When we specify a subset, drop_duplicates() only compares the listed column (or columns) across rows; a row is treated as a duplicate if its subset values match another row’s. Those duplicates are then removed.

Here, we pass a subset so that only the “brand” column is considered. Every row with a repeated brand is dropped except the first occurrence (since keep defaults to “first”).

electronic_df = pd.DataFrame({
    'brand': ['Lenovo', 'Lenovo', 'Toshiba', 'Toshiba', 'Toshiba'],
    'device': ['Laptop', 'Laptop', 'Backpack', 'Earphones', 'Backpack'],
    'rating': [4.1, 4.1, 6.6, 9.7, 8.1]
})

print(electronic_df.drop_duplicates(subset='brand'))

You can use several columns as the subset by passing a list.

print(electronic_df.drop_duplicates(subset=['brand', 'device']))

Example: keeping the last duplicate instead of the first

By default, .drop_duplicates() retains the first duplicate it finds. You may specify keep='last' to keep the last one instead. Here the same command runs with keep='last'; note how row 1 (index=0) disappears and its later copy is kept.

electronic_df = pd.DataFrame({
    'brand': ['Lenovo', 'Lenovo', 'Toshiba', 'Toshiba', 'Toshiba'],
    'device': ['Laptop', 'Laptop', 'Backpack', 'Earphones', 'Backpack'],
    'rating': [4.1, 4.1, 6.6, 9.7, 8.1]
})

print(electronic_df.drop_duplicates(keep='last'))
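To compare the three keep options side by side on the same electronic_df, here is a quick sketch:

```python
import pandas as pd

electronic_df = pd.DataFrame({
    'brand': ['Lenovo', 'Lenovo', 'Toshiba', 'Toshiba', 'Toshiba'],
    'device': ['Laptop', 'Laptop', 'Backpack', 'Earphones', 'Backpack'],
    'rating': [4.1, 4.1, 6.6, 9.7, 8.1]
})

# keep='first' and keep='last' both return four rows, but retain
# different index labels for the duplicated Lenovo Laptop pair
print(electronic_df.drop_duplicates(keep='first').index.tolist())  # [0, 2, 3, 4]
print(electronic_df.drop_duplicates(keep='last').index.tolist())   # [1, 2, 3, 4]

# keep=False drops both copies of the duplicated row entirely
print(len(electronic_df.drop_duplicates(keep=False)))              # 3
```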

Example: eliminating duplicate rows from a DataFrame

Duplicate rows are eliminated with the drop_duplicates() function. Use the subset argument if only some of the columns should be considered when looking for duplicates.

import pandas as pd

vals = {
  "name": ["Green", "Bright", "Bright", "James"],
  "score": [80, 70, 70, 30],
  "qualified": [True, False, False, False]
}

rs_df = pd.DataFrame(vals)

final_df = rs_df.drop_duplicates()
print(final_df)

Conclusion

Finding duplicate data and handling them appropriately are frequent tasks associated with data preprocessing and analysis.

Duplicates can be eliminated from a pandas DataFrame in Python using the drop_duplicates() method. This post has shown how to delete duplicate rows with pandas, covering DataFrame.drop_duplicates() as well as DataFrame.apply() with a lambda function, with examples.

This article covered the two methods for locating and eliminating duplicate rows, duplicated() and drop_duplicates(), and how to adjust their behavior to your needs. Knowing them is worthwhile, since they come up constantly when preparing and analyzing data.
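As a final sketch tying the two methods together (with made-up data): duplicated() tells you how many repeated rows there are, and drop_duplicates() removes them.

```python
import pandas as pd

# made-up data with one repeated row
df = pd.DataFrame({"name": ["Green", "Bright", "Bright"], "score": [80, 70, 70]})

print(df.duplicated().sum())      # 1 -- number of repeats beyond the first occurrence
print(len(df.drop_duplicates()))  # 2 -- unique rows that remain
```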
