pandas is a powerful data analysis library with a rich API that offers multiple ways to perform any given data manipulation task. Some of these approaches are better than others, and pandas users often learn suboptimal coding practices that become their default workflows. This post highlights four common pandas anti-patterns and outlines a complementary set of techniques that you should use instead.*
For illustrative examples of good and bad pandas patterns, I'm using this Netflix dataset from Kaggle, which characterises almost 6,000 Netflix shows and movies with respect to 15 features spanning various data types.
import pandas as pd

# https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies
df = pd.read_csv("titles.csv")
print(df.sample(5, random_state=0))
print(df.shape)
# returns:
#            id                  title   type                                        description  release_year age_certification  runtime                                     genres production_countries  seasons     imdb_id  imdb_score  imdb_votes  tmdb_popularity  tmdb_score
# 1519  ts38761        Another Miss Oh   SHOW  Two women working in the same industry with th...          2016             TV-MA       69  ['drama', 'fantasy', 'romance', 'comedy']               ['KR']      1.0   tt5679572         7.9      1769.0           22.672         8.2
# 4942  ts225657               Halston   SHOW  American fashion designer Halston skyrockets t...          2021             TV-MA       47                                  ['drama']               ['US']      1.0  tt10920514         7.5     14040.0           21.349         7.3
# 895   tm34646             Sisterakas  MOVIE  A man takes revenge on his sister by hiring he...          2012               NaN      110                        ['drama', 'comedy']               ['PH']      NaN   tt2590214         5.2       286.0            2.552         4.9
# 5426  ts301609  Love Is Blind: Brazil   SHOW  The dating experiment comes to Brazil as local...          2021             TV-MA       56                     ['romance', 'reality']               ['BR']      1.0  tt15018224         6.1       425.0            5.109         6.4
# 2033  ts56038         Dave Chappelle   SHOW  Comedy icon Dave Chappelle makes his triumphan...          2017               NaN       60                ['comedy', 'documentation']               ['US']      1.0   tt6963504         8.7      2753.0            2.962         7.6
# (5806, 15)
The four pandas anti-patterns we'll cover are:
- Mutating instead of chaining
- Using for loops with pandas DataFrames
- Overusing .apply in place of vectorised operations
- Using incorrect data types
Anti-Pattern #1: Mutating instead of chaining
Most pandas practitioners first learn data processing with pandas by sequentially mutating DataFrames as a series of distinct, line-by-line operations. There are a few reasons why excessive mutation of pandas DataFrames can cause problems:
- It wastes memory by creating global variables (this is especially true if you create a differently named DataFrame at each step)
- The code is cumbersome and difficult to read
- It's liable to bugs - particularly in notebooks where the ordering of data manipulation steps may not be strongly enforced
- It often produces the irritating and notoriously confusing SettingWithCopyWarning

And no, passing the inplace=True parameter to the operation does not help.
The example below contrasts a workflow based on mutation with an equivalent implementation that leverages chaining. Here we're performing various data transformation and column creation operations to explore the hypothesis that TV shows with more seasons have higher audience ratings than those with fewer seasons.
df = pd.read_csv("titles.csv")

# Mutation - DON'T DO THIS
df_bad = df.query("runtime > 30 & type == 'SHOW'")
df_bad["score"] = df_bad[["imdb_score", "tmdb_score"]].sum(axis=1)
df_bad = df_bad[["seasons", "score"]]
df_bad = df_bad.groupby("seasons").agg(["count", "mean"])
df_bad = df_bad.droplevel(axis=1, level=0)
df_bad = df_bad.query("count > 10")

# Chaining - DO THIS
df_good = (df
    .query("runtime > 30 & type == 'SHOW'")
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
    [["seasons", "score"]]
    .groupby("seasons")
    .agg(["count", "mean"])
    .droplevel(axis=1, level=0)
    .query("count > 10")
)
# returns:
#          count       mean
# seasons
# 1.0        835  13.064671
# 2.0        189  14.109524
# 3.0         83  14.618072
# 4.0         41  14.887805
# 5.0         38  15.242105
# 6.0         16  15.962500
Chaining transforms a DataFrame according to a multi-step procedure all at once. This guarantees the full and proper application of each pandas method, thus mitigating the risk of bugs. The code is more readable, with each line cleanly representing a distinct operation (note: many Python code formatters will destroy this structure - wrap your pandas code blocks with '# fmt: off' and '# fmt: on' to prevent this). Chaining will feel natural for R users familiar with the magrittr pipe operator %>%.
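For instance, with the Black formatter, a chained block can be protected like this (a minimal sketch with toy data; `# fmt: off`/`# fmt: on` are Black's directives for disabling formatting over a region):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# fmt: off
result = (df
    .assign(c=lambda df_: df_["a"] + df_["b"])
    .query("c > 5")
)
# fmt: on
print(result)
```

Without the directives, Black would collapse the chain onto fewer lines and lose the one-operation-per-line layout.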
Occasionally, you'll need to perform complex data manipulation processes that can't be cleanly implemented using off-the-shelf chaining methods. This is where pandas' .pipe method comes in: it can be used to abstract away complex DataFrame transformations into separately defined functions.
In the example below, we convert a string column containing lists of country codes into three separate columns for the first three individual country codes.
df = pd.read_csv("titles.csv")

def split_prod_countries(df_):
    # split `production_countries` column (containing lists of country
    # strings) into three individual columns of single country strings
    dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
    dfc = dfc.iloc[:, :3]
    dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
    return df_.drop("production_countries", axis=1).join(dfc)

df_pipe = df.pipe(split_prod_countries)

print(df["production_countries"].sample(5, random_state=14))
# returns:
# 3052    ['CA', 'JP', 'US']
# 1962                ['US']
# 2229                ['GB']
# 2151          ['KH', 'US']
# 3623                ['ES']

print(df_pipe.sample(5, random_state=14).iloc[:, -3:])
# returns:
#      prod_country1 prod_country2 prod_country3
# 3052            CA            JP            US
# 1962            US          None          None
# 2229            GB          None          None
# 2151            KH            US          None
# 3623            ES          None          None
The .pipe method allows us to incorporate this complex conversion into our chaining workflow without complicating the pipeline code.
FAQ for method chaining and using .pipe
How can I debug chained pandas methods?
As a starting point, try simply commenting out selected lines (i.e. methods) from the chained code. For more sophisticated debugging techniques, Matt Harrison has written about this comprehensively in Chapter 35 of Effective Pandas.
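One lightweight technique (a sketch of my own, not from the book) is to temporarily insert a .pipe step that reports the intermediate shape before passing the DataFrame through unchanged:

```python
import pandas as pd

df = pd.DataFrame({"type": ["SHOW", "MOVIE", "SHOW"], "runtime": [45, 110, 25]})

result = (df
    .query("type == 'SHOW'")
    # temporary debugging step: print returns None, so `or df_` passes
    # the DataFrame through to the next method unchanged
    .pipe(lambda df_: print(f"after query: {df_.shape}") or df_)
    .query("runtime > 30")
)
```

Delete the .pipe line once you've located the problem step.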
What if I want to view a DataFrame midway through the chain?
You can insert this piped function that uses IPython.display.display to show the DataFrame at an intermediary stage without interrupting the rest of the chain:

.pipe(lambda df_: display(df_) or df_)
What if I want to store a DataFrame from midway through the chain as its own variable?
You can insert this piped function to store an intermediate DataFrame without interrupting the rest of the chain.
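One way to do this (a sketch I'm supplying here, since the original snippet isn't shown) is to append a copy of the intermediate frame to an external container from inside the pipe:

```python
import pandas as pd

df = pd.DataFrame({"type": ["SHOW", "MOVIE", "SHOW"], "runtime": [45, 110, 25]})

stored = []  # container for the intermediate DataFrame

result = (df
    .query("type == 'SHOW'")
    # list.append returns None, so `or df_` passes the frame through unchanged
    .pipe(lambda df_: stored.append(df_.copy()) or df_)
    .query("runtime > 30")
)

intermediate = stored[0]  # the DataFrame as it was midway through the chain
```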
Anti-Pattern #2: Using for loops with pandas DataFrames
The use of for loops in pandas is a code smell that should always be eliminated. This includes pandas' built-in generator methods DataFrame.iterrows() and DataFrame.itertuples(). There are two reasons to avoid looping in pandas:
- It's overly verbose, cumbersome, and incompatible with chaining. Alternative approaches can achieve the same result more succinctly, whilst being chainable.
- Looping through rows individually is slow. Vectorised operations are more performant - particularly when operating on numeric type columns.
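To make the performance point concrete, here's a rough benchmark (my own sketch with toy data, not from the original post) comparing a row-by-row loop with the vectorised equivalent on a numeric column:

```python
import time

import pandas as pd

df = pd.DataFrame({"x": range(50_000)})

# row-by-row loop over the DataFrame
start = time.perf_counter()
total = 0
for _, row in df.iterrows():
    total += row["x"]
loop_time = time.perf_counter() - start

# vectorised equivalent
start = time.perf_counter()
total_vec = df["x"].sum()
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f}s, vectorised: {vec_time:.4f}s")
```

On my understanding, the vectorised version is typically several orders of magnitude faster at this size; the exact ratio depends on hardware.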
Suppose we wanted to use our newly created prod_country1 column to create another column corresponding to whether prod_country1 is in the top 3/10/20 countries by count of occurrences. The code sample below shows a suboptimal for loop approach, along with a cleaner .apply implementation.
df = pd.read_csv("titles.csv").pipe(split_prod_countries)

# obtain country ranks
vcs = df["prod_country1"].value_counts()
top3 = vcs.index[:3]
top10 = vcs.index[:10]
top20 = vcs.index[:20]

# Looping - DON'T DO THIS
vals = []
for ind, row in df.iterrows():
    country = row["prod_country1"]
    if country in top3:
        vals.append("top3")
    elif country in top10:
        vals.append("top10")
    elif country in top20:
        vals.append("top20")
    else:
        vals.append("other")
df["prod_country_rank"] = vals

# df[col].apply() - DO THIS
def get_prod_country_rank(country):
    if country in top3:
        return "top3"
    elif country in top10:
        return "top10"
    elif country in top20:
        return "top20"
    else:
        return "other"

df["prod_country_rank"] = df["prod_country1"].apply(get_prod_country_rank)

print(df.sample(5, random_state=14).iloc[:, -4:])
# returns:
#      prod_country1 prod_country2 prod_country3 prod_country_rank
# 3052            CA            JP            US             top10
# 1962            US          None          None              top3
# 2229            GB          None          None              top3
# 2151            KH            US          None             other
# 3623            ES          None          None             top10
The .apply pattern is a flexible approach that produces code that's easy to understand and has the benefit of being chainable. Instead of looping, .apply can achieve the same result more elegantly.

However, .apply is not a vectorised operation - under the hood, it still loops through the rows of the DataFrame. As such, unless you're manipulating object type (typically strings) columns, there are more performant approaches that should be used instead. This leads us to anti-pattern #3...
Anti-Pattern #3: Overusing .apply in place of vectorised operations
Once pandas practitioners learn about .apply, they often end up applying it everywhere. This isn't always a problem, as the .apply approach produces coherent code and performs adequately with modestly-sized datasets.

With large datasets and compute intensive calculations, however, .apply can be orders of magnitude slower than more efficient, vectorised workflows. This article nicely demonstrates the superiority of optimised approaches over .apply.
The np.select + .isin combination
import numpy as np

df = pd.read_csv("titles.csv").pipe(split_prod_countries)

def get_prod_country_rank(df_):
    vcs = df_["prod_country1"].value_counts()
    return np.select(
        condlist=(
            df_["prod_country1"].isin(vcs.index[:3]),
            df_["prod_country1"].isin(vcs.index[:10]),
            df_["prod_country1"].isin(vcs.index[:20]),
        ),
        choicelist=("top3", "top10", "top20"),
        default="other",
    )

df = df.assign(prod_country_rank=lambda df_: get_prod_country_rank(df_))
The performance improvement isn't as dramatic as that achieved in the article above, as we're operating on object type columns, and much of the computation time is associated with the value counts calculation common to both approaches. Nonetheless, this is still ~2x faster than the .apply implementation. If the value counts calculation is instead placed inside the function used with .apply, the .apply version becomes 500x slower than the piped approach.
In addition to the performance advantage, including the value counts calculation inside the piped .isin function enables this transformation to be performed as a chained operation with no global dependencies. Note that the np.select conditions list behaves equivalently to an if elif else block, in that it selects based on the first condition encountered that returns True.
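To illustrate that first-match behaviour (a toy example of my own):

```python
import numpy as np

x = np.array([1, 5, 15])

# both conditions are True for 15, but the first matching condition wins
result = np.select(
    condlist=(x > 10, x > 3),
    choicelist=("big", "medium"),
    default="small",
)
print(result)  # ['small' 'medium' 'big']
```

As with an if elif else block, ordering the conditions from most to least specific is what makes the tiered top3/top10/top20 logic work.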
An alternative option for cases where there are only two choices is
np.where. For instance, suppose we knew there was a bug with IMDB's scores, and we needed to subtract one from all reported scores for shows/movies released after 2016. We could perform that as follows.
df = df.assign(
    adjusted_score=lambda df_: np.where(
        df_["release_year"] > 2016,
        df_["imdb_score"] - 1,
        df_["imdb_score"],
    )
)
For anything other than object type columns, .apply will be less performant than other approaches. Instead, consider using np.select, np.where, and .isin to perform efficient, vectorised transformations.
pandas' own .where method behaves slightly differently to np.where, but can also be useful if you want to retain the values in a column subject to a certain condition but replace everything else (e.g. keep values if they're within the top 5 value counts and set everything else to 'other').
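A quick sketch of that last idea, using toy data of my own (top 2 rather than top 5, for brevity):

```python
import pandas as pd

s = pd.Series(["US", "GB", "US", "FR", "IN", "US", "GB", "KH", "US", "GB"])

# keep a value only where the condition holds; replace everything else
top2 = s.value_counts().index[:2]
s_clean = s.where(s.isin(top2), other="other")
print(s_clean.tolist())
```

Note the inverted logic relative to np.where: Series.where keeps values where the condition is True and substitutes `other` where it is False.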
Anti-Pattern #4: Using incorrect data types
Optimising the data types for each column in a pandas DataFrame will improve performance and memory usage. When working with large datasets, significant gains can be made by shrinking the default int64 data type to smaller equivalents, such as int8, for columns where this doesn't result in data loss.
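A sketch of this kind of downcasting, using pd.to_numeric's downcast parameter on toy data (the column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"seasons": [1, 2, 3, 10, 19] * 20_000})
df["seasons"] = df["seasons"].astype("int64")

# downcast to the smallest integer type that can hold the values losslessly
small = pd.to_numeric(df["seasons"], downcast="integer")
print(small.dtype)  # int8
print(df["seasons"].memory_usage(deep=True), small.memory_usage(deep=True))
```

Since every value fits in int8, the column shrinks to roughly an eighth of its original size with no data loss.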
However, the most egregious data type mismatch worth eliminating from your pandas code is using strings instead of categoricals. Converting a low cardinality column of categorical data from its default object type to a category type often achieves memory usage improvements of 100x and computation speed ups of 10x. The code sample below demonstrates how this conversion can be performed within a chained workflow.
df = pd.read_csv("titles.csv")
df = df.assign(
    age_certification=lambda df_: df_["age_certification"].astype("category")
)
Convert your object type columns to category, where appropriate.
This presentation exemplifies some of the benefits of using optimised data types.
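You can also measure the saving yourself. Here's a toy demonstration (the exact ratio depends on the data's cardinality and string lengths):

```python
import pandas as pd

# low cardinality string column, like age_certification
s = pd.Series(["TV-MA", "PG-13", "R", "TV-14"] * 25_000)

mem_object = s.memory_usage(deep=True)
mem_category = s.astype("category").memory_usage(deep=True)
print(f"object: {mem_object:,} bytes, category: {mem_category:,} bytes")
```

The category version stores each distinct string once plus a small integer code per row, which is where the large memory reduction comes from.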
In this article, I've shown four pandas anti-patterns, and alternative approaches you should adopt instead. The code sample below illustrates how these best practices can be combined into a coherent workflow. This particular example shows how we can calculate the mean adjusted score of the shows, depending on the prevalence rank of the first production country.
import pandas as pd
import numpy as np

df = pd.read_csv("titles.csv")

def split_prod_countries(df_):
    dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
    dfc = dfc.iloc[:, :3]
    dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
    return df_.drop("production_countries", axis=1).join(dfc)

def get_prod_country_rank(df_):
    vcs = df_["prod_country1"].value_counts()
    return np.select(
        condlist=(
            df_["prod_country1"].isin(vcs.index[:3]),
            df_["prod_country1"].isin(vcs.index[:10]),
            df_["prod_country1"].isin(vcs.index[:20]),
        ),
        choicelist=("top3", "top10", "top20"),
        default="other",
    )

def get_adjusted_score(df_):
    return np.where(
        df_["release_year"] > 2016,
        df_["imdb_score"] - 1,
        df_["imdb_score"],
    )

(df
    .query("runtime > 30 & type == 'SHOW'")
    .pipe(split_prod_countries)
    .assign(
        imdb_score=lambda df_: get_adjusted_score(df_),
        score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1),
        prod_country_rank=lambda df_: get_prod_country_rank(df_),
        rank=lambda df_: df_["prod_country_rank"].astype("category"),
    )
    [["rank", "score"]]
    .groupby("rank")
    .agg(["count", "mean"])
    .droplevel(axis=1, level=0)
    .sort_values("mean", ascending=False)
)
# returns:
#        count       mean
# rank
# top10     37  15.232432
# other   1104  12.824819
# top3      78  12.624359
# top20     20  10.775000
Adopting these practices allows for the complex data transformations and processing to all be conducted in a single chained statement. The code is performant, readable, and simple to maintain and extend. If you're not already coding pandas in this way, I recommend giving it a try!
* This post draws inspiration from various sources - particularly Matt Harrison's excellent Effective Pandas book. Other references are linked throughout the article.