Pandas is an essential data analysis library in Python. One of the most common data manipulation tasks is efficiently filtering DataFrame rows based on conditions.
This comprehensive guide covers the full range of options for removing rows from Pandas DataFrames programmatically.
You'll gain practical skills to wrangle and refine messy real-world datasets with ease. Let's dive in!
Overview of Key Methods
Pandas provides a flexible API for slicing, dicing, and transforming dataset rows with surgical precision. Here's a quick overview of the main methods before we explore them in depth:
- .drop() – removes rows by index label, or by the index of rows matching a condition
- .query() – filters DataFrames using intuitive SQL-like expressions
- .loc / .iloc – label-based and position-based indexing and slicing
- .dropna() – removes rows with missing/null values
- .duplicated() – flags duplicate rows
- .drop_duplicates() – removes duplicate rows
Now let's dive into the specifics of how and when to apply each technique.
Setting up a Sample Dataset
We'll use a real-world Kaggle dataset on popular movies through the ages for demonstration. Our DataFrame df has details on roughly 5,000 films.
import pandas as pd
df = pd.read_csv('movies.csv')
print(df.head())
print(f"Shape: {df.shape}")
|    | title                             | ... | budget_adj | worlwide_gross_adj |
|---:|:----------------------------------|:---:|:----------:|:-------------------|
|  0 | Avatar 3D                         | ... | 3.787e+08  | 3.119e+09          |
|  1 | Pirates of the...                 | ... | 4.39e+07   | 1.065e+09          |
|  2 | Star Wars Ep. VII: The Force A... | ... | 3.576e+08  | 2.068e+09          |
|  3 | Titanic 3D                        | ... | 3.545e+08  | 2.187e+09          |
|  4 | Frozen                            | ... | 1.5e+08    | 1.276e+09          |
Shape: (4993, 28)
That's a decently large dataset covering various metrics on popular movies. Now let's discuss how to filter it!
Removing Rows Using .drop()
The most common method for removing rows is the .drop() method. It supports a few patterns to match your use case:
.drop(labels, axis=0): drop rows by a single index label or a list of labels
.drop(index=labels): the equivalent keyword form for dropping rows
.drop(df[condition].index): drop rows matching a boolean condition by passing the matching index labels
Let's look at examples of each:
Drop by index label
sample = df.head(10)
smaller = sample.drop(labels=9, axis=0) # drops 1 row
print(smaller.shape)
This drops the row with label 9, keeping rows 0-8.
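.drop() also accepts a list of labels to remove several rows at once – a quick sketch on the same 10-row sample:
several_removed = sample.drop(labels=[2, 5, 7], axis=0) # drop three rows by label
print(several_removed.shape) # 7 rows remain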
Drop by index slice
small_range = df[1000:2000]
without_1500s = small_range.drop(labels=range(1500, 1600)) # drop a range of labels
Here we first slice the dataset to rows 1000-1999, then drop labels 1500-1599.
Drop by boolean index
high_budget = df[df['budget_adj'] > 1e8]
no_us = high_budget.drop(high_budget[high_budget['country'] == 'USA'].index)
This first filters for high-budget rows, then removes USA movies by passing the index of the matching rows.
These examples demonstrate the flexibility of .drop() for removing sets of rows.
Now let's try some more filters combining .drop() with boolean conditions:
recent_blockbusters = df.drop(df[df['worlwide_gross_adj'] < 5e8].index) # keep high grossing
non_sequels = recent_blockbusters.drop(recent_blockbusters[recent_blockbusters['sequel'] == 1].index) # drop sequels
family_friendly = non_sequels.drop(non_sequels[non_sequels['content_rating'] > 'PG'].index) # stricter ratings
print(family_friendly.shape)
We pipeline multiple filters, drilling down to high-grossing, non-sequel family movies. Much cleaner than nested logic!
Key Idea: chain .drop() filters to drill down to interesting subsets.
Performance Warning 🚨
One catch when filtering by row subsets like this: it gets slower on larger data!
Each intermediate filter reallocates memory to create a subset copy. It's better to compose the logic into a single filter where possible:
from time import time

start = time()
family_friendly = df.drop(
    df[
        (df['worlwide_gross_adj'] < 5e8) |
        (df['sequel'] == 1) |
        (df['content_rating'] > 'PG')
    ].index
)
end = time()
print(end - start)
Here a single conditional drop is much quicker than doing it in steps!
When setting up pipelines:
- Profile to catch any unwanted copies
- Combine filters where possible, or skip .drop() entirely in favor of boolean indexing, as sketched below
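As a sketch of that last point, the most idiomatic way to express a combined filter is usually plain boolean indexing, which selects the rows to keep in a single vectorized pass rather than dropping the rest:
keep_mask = ~(
    (df['worlwide_gross_adj'] < 5e8) |
    (df['sequel'] == 1) |
    (df['content_rating'] > 'PG')
)
family_friendly = df[keep_mask] # one pass, no intermediate subset copies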
Now let's look at another handy filtering method – .query().
Filtering Rows Using .query()
Pandas .query() allows filtering DataFrames using an intuitive SQL-like syntax:
best_drama = df.query("genre == 'Drama' & avg_vote > 9")
We filter for highly rated dramas in one shot!
Here are more examples:
recent = df.query("year > 2010") # simple numerical filter
blockbusters = df.query("(budget_adj > 2e8) & (worlwide_gross_adj > 1e9)") # AND condition
new_scifi = df.query("year >= 2015 and genre == 'Science Fiction'") # the 'and' keyword also works
The syntax is super readable for ad hoc interactive filtering.
We can also use comparison operators like ==, >, >=, etc., and all column values are accessible by name.
Pro Tip – you can build .query() strings from Python variables for reusable logic:
high_rating = 8.5
popular_filter = f"ratings > {high_rating} & votes > 10000"
critically_acclaimed = df.query(popular_filter)
Here we create the filter string with an f-string, then reuse it. Clean abstraction!
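Alternatively, .query() can reference Python variables directly with the @ prefix, skipping string formatting entirely:
high_rating = 8.5
critically_acclaimed = df.query("ratings > @high_rating & votes > 10000") # @ pulls in the local variable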
One downside of .query() is speed – it can be slower than boolean indexing or .loc. But the usage is so pleasant that the tradeoff is often worth it!
Positional and Labelled Indexing & Slicing Using .loc and .iloc
For more advanced programmatic manipulation, Pandas provides the .loc and .iloc indexers:
- .loc – filters by row & column labels
- .iloc – filters by row & column positions
Both support standard Python slice syntax for selecting sets of rows:
df.loc[start:stop:step, cols]
df.iloc[start:stop:step, cols]
Let's demo usage:
first_100 = df.loc[:99] # first 100 rows by label (.loc slices include the stop label)
rows_2013_to_2016 = df.loc[2013:2016] # slice a range of index labels
every_10th = df.iloc[::10] # select every 10th row by position
columns = ['title', 'director', 'avg_vote']
top_250 = df.loc[:249, columns] # row slice + column subset
We can slice, dice and subset the dataset in flexible ways.
Slicing by labels vs positions enables different modes of programmatic access.
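The label/position distinction matters most once the index is no longer a plain 0..n range. A minimal sketch using the movie titles as the index:
titled = df.set_index('title') # row labels are now titles, not integers
frozen = titled.loc['Frozen'] # .loc looks rows up by label
fifth_row = titled.iloc[4] # .iloc looks rows up by position (per the head() above, also Frozen)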
Here's an example building a subset in a loop:
yeartags = ['2010s', '2000s', '1990s'] # dataset has precomputed decades
decade_dfs = []
for tag in yeartags:
    decade_df = df.loc[df['decade'] == tag]
    decade_dfs.append(decade_df)
print(len(decade_dfs)) # 3 subsets
We iteratively build decadal subsets using the precomputed decade flag.
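The same split can also be written as a dictionary comprehension keyed by decade, which makes later lookups cleaner:
decade_dfs = {tag: df.loc[df['decade'] == tag] for tag in yeartags}
print(decade_dfs['2010s'].shape) # access any decade's subset by its tag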
The benefits of .loc and .iloc are:
- Speed and versatility for data access
- Enable iterative, programmatic manipulation
- Avoid expensive copies compared to chained .drop() calls

So utilize them where possible, while .query() and .drop() are simpler for ad hoc use.
Now let's discuss dealing with duplicate and missing data.
Removing Duplicate Rows
It's common to have duplicate rows in raw datasets. To remove them:
Identify duplicate rows
The .duplicated() method flags duplicate rows:
duplicates = df[df.duplicated()]
print(f"Total Duplicates: {duplicates.shape[0]} ")
This subsets all duplicate rows.
Remove duplicates
Calling .drop_duplicates() removes them:
df = df.drop_duplicates()
print(f"Duplicates Removed, new shape: {df.shape}")
By default the first encountered row is kept and later copies are dropped.
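The keep parameter controls which copy survives – for example:
df_last = df.drop_duplicates(keep='last') # keep the last occurrence instead of the first
df_none = df.drop_duplicates(keep=False) # drop every row that has any duplicate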
To specify which columns define uniqueness:
df = df.drop_duplicates(subset=['movie_title', 'year'])
print(f"Unique Titles per Year: {df.shape}")
Now you know how to eliminate messy duplicate entries from datasets using Pandas!
Removing Rows With Missing Values
Another common data cleaning task is dealing with missing values, encoded as NaN or None in Pandas.
Identify missing values
Before removing them, identify occurrences using .isna() / .notna():
missing = df[df.budget_adj.isna()]
print(f"% With Missing Budget: {len(missing) / len(df) * 100:.2f}")
This prints the % of movies missing budget data.
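For a quick per-column picture before deciding what to drop, chain .isna() with .sum():
print(df.isna().sum().sort_values(ascending=False)) # missing-value count for each column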
Remove missing values
We can drop such rows with .dropna():
df = df.dropna()
print(f"Shape after dropping na: {df.shape}")
By default it drops rows containing any NaN values.
To specify a threshold:
df = df.dropna(thresh=10) # keep only rows with at least 10 non-na values
Use this to filter out rows missing important data.
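When only certain columns are critical, pass them via the subset argument so unrelated gaps don't cost you rows – a sketch using our financial columns:
df = df.dropna(subset=['budget_adj', 'worlwide_gross_adj']) # drop only rows missing the financials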
There are also options to fill the missing values instead of dropping – a topic for another guide!
The key workflow is .isna() -> .dropna() to remove unwanted NaN rows from DataFrames.
Interactive Filtering Options
For interactive analysis, Pandas integrates with exploration tools like pandas-profiling that enable rich visual data exploration.
It auto-generates an interactive report profiling the dataset, with widgets to dynamically filter, sort, and visualize data subsets.
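Generating a report takes just a couple of lines (a minimal sketch; the package has since been renamed ydata-profiling, but the classic import shown here illustrates the idea):
from pandas_profiling import ProfileReport

profile = ProfileReport(df, title="Movies Dataset Report")
profile.to_file("movies_report.html") # writes a self-contained interactive HTML report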
Definitely check pandas-profiling out for experimenting with DataFrame filters during analysis!
For production pipelines, the scripted filters we discussed are preferred, but GUI reports enable quicker insights while iterating during analysis.
Wrap Up: Best Practices for Row Filtering
We've covered a wide gamut of methods to filter DataFrame rows based on whatever conditions you require:
- .drop() – great for simple, ad hoc row removal
- .query() – enables intuitive SQL-like filtering syntax
- .loc / .iloc – blend performance and programmability
- .dropna() – handles missing values
- .drop_duplicates() – removes duplicates
- GUIs like pandas-profiling – useful for interactive analysis
The key is understanding the strengths of each approach and where it fits into your pipeline.
Here are my recommended best practices:
- Simplify pipelines by combining filters using De Morgan's laws
- Profile and optimize – catch unwanted copies, exploit vectorization
- Prefer programmatic methods like .loc and .query() for production
- Explore interactively during analysis with a GUI frontend like pandas-profiling
I hope you enjoyed this comprehensive guide to slicing, dicing, and wrangling Pandas DataFrames effectively.
Pandas' row-manipulation power lies in blending these APIs for surgical precision without compromising performance.
With this article's tools in your toolkit, you can tap into that power for any data task!