pandas is a powerful data analysis library with a rich API that offers multiple ways to perform any given data manipulation task. Some of these approaches are better than others, and pandas users often learn suboptimal coding practices that become their default workflows. This post highlights four common pandas anti-patterns and outlines a complementary set of techniques that you should use instead.*
For illustrative examples of good and bad pandas patterns, I'm using this Netflix dataset from Kaggle, which characterises almost 6,000 Netflix shows and movies with respect to 15 features spanning various data types.
import pandas as pd

# https://www.kaggle.com/datasets/victorsoeiro/netflix-tv-shows-and-movies
df = pd.read_csv("titles.csv")
print(df.sample(5, random_state=0))
print(df.shape)
# returns:
#            id                  title   type                                        description  release_year age_certification  runtime                                     genres production_countries  seasons     imdb_id  imdb_score  imdb_votes  tmdb_popularity  tmdb_score
# 1519  ts38761        Another Miss Oh   SHOW  Two women working in the same industry with th...          2016             TV-MA       69  ['drama', 'fantasy', 'romance', 'comedy']               ['KR']      1.0   tt5679572         7.9      1769.0           22.672         8.2
# 4942  ts225657               Halston   SHOW  American fashion designer Halston skyrockets t...          2021             TV-MA       47                                  ['drama']               ['US']      1.0  tt10920514         7.5     14040.0           21.349         7.3
# 895   tm34646             Sisterakas  MOVIE  A man takes revenge on his sister by hiring he...          2012               NaN      110                        ['drama', 'comedy']               ['PH']      NaN   tt2590214         5.2       286.0            2.552         4.9
# 5426  ts301609  Love Is Blind: Brazil   SHOW  The dating experiment comes to Brazil as local...          2021             TV-MA       56                     ['romance', 'reality']               ['BR']      1.0  tt15018224         6.1       425.0            5.109         6.4
# 2033  ts56038         Dave Chappelle   SHOW  Comedy icon Dave Chappelle makes his triumphan...          2017               NaN       60                ['comedy', 'documentation']               ['US']      1.0   tt6963504         8.7      2753.0            2.962         7.6
# (5806, 15)
The four pandas anti-patterns we'll cover are:
- Mutating instead of chaining
- Using for loops with pandas DataFrames
- Overusing .apply in place of vectorised operations
- Using incorrect data types
Anti-Pattern #1: Mutating instead of chaining
Most pandas practitioners first learn data processing with pandas by sequentially mutating DataFrames as a series of distinct, line-by-line operations. There are a few reasons why excessive mutation of pandas DataFrames can cause problems:
- It wastes memory by creating global variables (this is especially true if you create a differently named DataFrame at each step)
- The code is cumbersome and difficult to read
- It's liable to bugs - particularly in notebooks where the ordering of data manipulation steps may not be strongly enforced
- It often produces the irritating and notoriously confusing SettingWithCopyWarning

And no, passing the inplace=True parameter to the operation does not help.
The example below contrasts a workflow based on mutation with an equivalent implementation that leverages chaining. Here we're performing various data transformation and column creation operations to explore the hypothesis that TV shows with more seasons have higher audience ratings than those with fewer seasons.
df = pd.read_csv("titles.csv")

# Mutation - DON'T DO THIS
df_bad = df.query("runtime > 30 & type == 'SHOW'")
df_bad["score"] = df_bad[["imdb_score", "tmdb_score"]].sum(axis=1)
df_bad = df_bad[["seasons", "score"]]
df_bad = df_bad.groupby("seasons").agg(["count", "mean"])
df_bad = df_bad.droplevel(axis=1, level=0)
df_bad = df_bad.query("count > 10")

# Chaining - DO THIS
df_good = (df
    .query("runtime > 30 & type == 'SHOW'")
    .assign(score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1))
    [["seasons", "score"]]
    .groupby("seasons")
    .agg(["count", "mean"])
    .droplevel(axis=1, level=0)
    .query("count > 10")
)
# returns:
#          count       mean
# seasons
# 1.0        835  13.064671
# 2.0        189  14.109524
# 3.0         83  14.618072
# 4.0         41  14.887805
# 5.0         38  15.242105
# 6.0         16  15.962500
Chaining transforms a DataFrame according to a multi-step procedure all at once. This guarantees the full and proper application of each pandas method, thus mitigating the risk of bugs. The code is more readable, with each line cleanly representing a distinct operation (note: many Python code formatters will destroy this structure - wrap your pandas code blocks with '# fmt: off' and '# fmt: on' to prevent this). Chaining will feel natural for R users familiar with the magrittr pipe operator %>%.
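For instance, with the Black formatter, a chained block can be protected like this (a minimal sketch with toy data; `# fmt: off`/`# fmt: on` are Black's directives for disabling formatting over a region):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# fmt: off
result = (df
    .assign(c=lambda df_: df_["a"] + df_["b"])
    .query("c > 5")
)
# fmt: on
print(result)
```

Without the directives, Black would collapse the chain onto fewer lines and lose the one-operation-per-line layout.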
Occasionally, you'll need to perform complex data manipulation processes that can't be cleanly implemented using off-the-shelf chaining methods. This is where pandas' .pipe method comes in: it can be used to abstract away complex DataFrame transformations into separately defined functions.
In the example below, we convert a string column containing lists of country codes into three separate columns for the first three individual country codes.
df = pd.read_csv("titles.csv")

def split_prod_countries(df_):
    # split `production_countries` column (containing lists of country
    # strings) into three individual columns of single country strings
    dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
    dfc = dfc.iloc[:, :3]
    dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
    return df_.drop("production_countries", axis=1).join(dfc)

df_pipe = df.pipe(split_prod_countries)

print(df["production_countries"].sample(5, random_state=14))
# returns:
# 3052    ['CA', 'JP', 'US']
# 1962                ['US']
# 2229                ['GB']
# 2151          ['KH', 'US']
# 3623                ['ES']

print(df_pipe.sample(5, random_state=14).iloc[:, -3:])
# returns:
#      prod_country1 prod_country2 prod_country3
# 3052            CA            JP            US
# 1962            US          None          None
# 2229            GB          None          None
# 2151            KH            US          None
# 3623            ES          None          None
The .pipe method allows us to incorporate this complex conversion into our chaining workflow without complicating the pipeline code.
FAQ for method chaining and using .pipe
How can I debug chained pandas methods?
As a starting point, try simply commenting out selected lines (i.e. methods) from the chained code. For more sophisticated debugging techniques, Matt Harrison has written about this comprehensively in Chapter 35 of Effective Pandas.
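One lightweight technique (a sketch of my own, not from the book) is to temporarily insert a .pipe step that reports the intermediate shape before passing the DataFrame through unchanged:

```python
import pandas as pd

df = pd.DataFrame({"type": ["SHOW", "MOVIE", "SHOW"], "runtime": [45, 110, 25]})

result = (df
    .query("type == 'SHOW'")
    # temporary debugging step: print returns None, so `or df_` passes
    # the DataFrame through to the next method unchanged
    .pipe(lambda df_: print(f"after query: {df_.shape}") or df_)
    .query("runtime > 30")
)
```

Delete the .pipe line once you've located the problem step.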
What if I want to view a DataFrame midway through the chain?
You can insert this piped function that uses IPython.display.display to show the DataFrame at an intermediary stage without interrupting the rest of the chain:

.pipe(lambda df_: display(df_) or df_)
What if I want to store a DataFrame from midway through the chain as its own variable?
You can insert this piped function to store an intermediate DataFrame without interrupting the rest of the chain.
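One way to do this (a sketch I'm supplying here, since the original snippet isn't shown) is to append a copy of the intermediate frame to an external container from inside the pipe:

```python
import pandas as pd

df = pd.DataFrame({"type": ["SHOW", "MOVIE", "SHOW"], "runtime": [45, 110, 25]})

stored = []  # container for the intermediate DataFrame

result = (df
    .query("type == 'SHOW'")
    # list.append returns None, so `or df_` passes the frame through unchanged
    .pipe(lambda df_: stored.append(df_.copy()) or df_)
    .query("runtime > 30")
)

intermediate = stored[0]  # the DataFrame as it was midway through the chain
```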
Anti-Pattern #2: Using for loops with pandas DataFrames
The use of for loops in pandas is a code smell that should always be eliminated. This includes pandas' built-in generator methods DataFrame.iterrows() and DataFrame.itertuples(). There are two reasons to avoid looping in pandas:
- It's overly verbose, cumbersome, and incompatible with chaining. Alternative approaches can achieve the same result more succinctly, whilst being chainable.
- Looping through rows individually is slow. Vectorised operations are more performant - particularly when operating on numeric type columns.
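To make the performance point concrete, here's a rough benchmark (my own sketch with toy data, not from the original post) comparing a row-by-row loop with the vectorised equivalent on a numeric column:

```python
import time

import pandas as pd

df = pd.DataFrame({"x": range(50_000)})

# row-by-row loop over the DataFrame
start = time.perf_counter()
total = 0
for _, row in df.iterrows():
    total += row["x"]
loop_time = time.perf_counter() - start

# vectorised equivalent
start = time.perf_counter()
total_vec = df["x"].sum()
vec_time = time.perf_counter() - start

print(f"loop: {loop_time:.2f}s, vectorised: {vec_time:.4f}s")
```

On my understanding, the vectorised version is typically several orders of magnitude faster at this size; the exact ratio depends on hardware.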
Suppose we wanted to use our newly created prod_country1 column to create another column corresponding to whether prod_country1 is in the top 3/10/20 countries by count of occurrences. The code sample below shows a suboptimal for loop approach, along with a cleaner .apply implementation.
df = pd.read_csv("titles.csv").pipe(split_prod_countries)

# obtain country ranks
vcs = df["prod_country1"].value_counts()
top3 = vcs.index[:3]
top10 = vcs.index[:10]
top20 = vcs.index[:20]

# Looping - DON'T DO THIS
vals = []
for ind, row in df.iterrows():
    country = row["prod_country1"]
    if country in top3:
        vals.append("top3")
    elif country in top10:
        vals.append("top10")
    elif country in top20:
        vals.append("top20")
    else:
        vals.append("other")
df["prod_country_rank"] = vals

# df[col].apply() - DO THIS
def get_prod_country_rank(country):
    if country in top3:
        return "top3"
    elif country in top10:
        return "top10"
    elif country in top20:
        return "top20"
    else:
        return "other"

df["prod_country_rank"] = df["prod_country1"].apply(get_prod_country_rank)

print(df.sample(5, random_state=14).iloc[:, -4:])
# returns:
#      prod_country1 prod_country2 prod_country3 prod_country_rank
# 3052            CA            JP            US             top10
# 1962            US          None          None              top3
# 2229            GB          None          None              top3
# 2151            KH            US          None             other
# 3623            ES          None          None             top10
The .apply pattern is a flexible approach that produces code that's easy to understand and has the benefit of being chainable. Instead of looping, .apply can achieve the same result more elegantly.

However, .apply is not a vectorised operation - under the hood, it still loops through the rows of the DataFrame. As such, unless you're manipulating object type (typically strings) columns, there are more performant approaches that should be used instead. This leads us to anti-pattern #3...
Anti-Pattern #3: Overusing .apply in place of vectorised operations
Once pandas practitioners learn about .apply, they often end up applying it everywhere. This isn't always a problem, as the .apply approach produces coherent code and performs adequately with modestly-sized datasets.

With large datasets and compute intensive calculations, however, .apply can be orders of magnitude slower than more efficient, vectorised workflows. This article nicely demonstrates the superiority of optimised approaches over .apply.
The np.select + .isin combination
import numpy as np

df = pd.read_csv("titles.csv").pipe(split_prod_countries)

def get_prod_country_rank(df_):
    vcs = df_["prod_country1"].value_counts()
    return np.select(
        condlist=(
            df_["prod_country1"].isin(vcs.index[:3]),
            df_["prod_country1"].isin(vcs.index[:10]),
            df_["prod_country1"].isin(vcs.index[:20]),
        ),
        choicelist=("top3", "top10", "top20"),
        default="other",
    )

df = df.assign(prod_country_rank=lambda df_: get_prod_country_rank(df_))
The performance improvement isn't as dramatic as that achieved in the article above, as we're operating on object type columns, and much of the computation time is associated with the value counts calculation common to both approaches. Nonetheless, this is still ~2x faster than the .apply implementation. If the value counts calculation is instead placed inside the function used with .apply, the .apply version becomes 500x slower than the piped approach.
In addition to the performance advantage, including the value counts calculation inside the piped .isin function enables this transformation to be performed as a chained operation with no global dependencies. Note that the np.select conditions list behaves equivalently to an if elif else block, in that it selects based on the first condition encountered that returns True.
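To illustrate that first-match behaviour (a toy example of my own):

```python
import numpy as np

x = np.array([1, 5, 15])

# both conditions are True for 15, but the first matching condition wins
result = np.select(
    condlist=(x > 10, x > 3),
    choicelist=("big", "medium"),
    default="small",
)
print(result)  # ['small' 'medium' 'big']
```

As with an if elif else block, ordering the conditions from most to least specific is what makes the tiered top3/top10/top20 logic work.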
An alternative option for cases where there are only two choices is
np.where. For instance, suppose we knew there was a bug with IMDB's scores, and we needed to subtract one from all reported scores for shows/movies released after 2016. We could perform that as follows.
df = df.assign(
    adjusted_score=lambda df_: np.where(
        df_["release_year"] > 2016,
        df_["imdb_score"] - 1,
        df_["imdb_score"],
    )
)
For anything other than object type columns, .apply will be less performant than other approaches. Instead, consider using np.select, np.where, and .isin to perform efficient, vectorised transformations.
pandas' own .where method behaves slightly differently to np.where, but can also be useful if you want to retain the values in a column subject to a certain condition but replace everything else (e.g. keep values if they're within the top 5 value counts and set everything else to 'other').
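A quick sketch of that last idea, using toy data of my own (top 2 rather than top 5, for brevity):

```python
import pandas as pd

s = pd.Series(["US", "GB", "US", "FR", "IN", "US", "GB", "KH", "US", "GB"])

# keep a value only where the condition holds; replace everything else
top2 = s.value_counts().index[:2]
s_clean = s.where(s.isin(top2), other="other")
print(s_clean.tolist())
```

Note the inverted logic relative to np.where: Series.where keeps values where the condition is True and substitutes `other` where it is False.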
Anti-Pattern #4: Using incorrect data types
Optimising the data types for each column in a pandas DataFrame will improve performance and memory usage. When working with large datasets, significant gains can be made by shrinking the default int64 data type to smaller equivalents, such as int8, for columns where this doesn't result in data loss.
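A sketch of this kind of downcasting, using pd.to_numeric's downcast parameter on toy data (the column name is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({"seasons": [1, 2, 3, 10, 19] * 20_000})
df["seasons"] = df["seasons"].astype("int64")

# downcast to the smallest integer type that can hold the values losslessly
small = pd.to_numeric(df["seasons"], downcast="integer")
print(small.dtype)  # int8
print(df["seasons"].memory_usage(deep=True), small.memory_usage(deep=True))
```

Since every value fits in int8, the column shrinks to roughly an eighth of its original size with no data loss.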
However, the most egregious data type mismatch worth eliminating from your pandas code is using strings instead of categoricals. Converting a low cardinality column of categorical data from its default object type to a category type often achieves memory usage improvements of 100x and computation speed ups of 10x. The code sample below demonstrates how this conversion can be performed within a chained workflow.
df = pd.read_csv("titles.csv")
df = df.assign(
    age_certification=lambda df_: df_["age_certification"].astype("category")
)
Convert your object type columns to category, where appropriate.
This presentation exemplifies some of the benefits of using optimised data types.
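You can also measure the saving yourself. Here's a toy demonstration (the exact ratio depends on the data's cardinality and string lengths):

```python
import pandas as pd

# low cardinality string column, like age_certification
s = pd.Series(["TV-MA", "PG-13", "R", "TV-14"] * 25_000)

mem_object = s.memory_usage(deep=True)
mem_category = s.astype("category").memory_usage(deep=True)
print(f"object: {mem_object:,} bytes, category: {mem_category:,} bytes")
```

The category version stores each distinct string once plus a small integer code per row, which is where the large memory reduction comes from.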
In this article, I've shown four pandas anti-patterns, and alternative approaches you should adopt instead. The code sample below illustrates how these best practices can be combined into a coherent workflow. This particular example shows how we can calculate the mean adjusted score of the shows, depending on the prevalence rank of the first production country.
import pandas as pd
import numpy as np

df = pd.read_csv("titles.csv")

def split_prod_countries(df_):
    dfc = pd.DataFrame(df_["production_countries"].apply(eval).to_list())
    dfc = dfc.iloc[:, :3]
    dfc.columns = ["prod_country1", "prod_country2", "prod_country3"]
    return df_.drop("production_countries", axis=1).join(dfc)

def get_prod_country_rank(df_):
    vcs = df_["prod_country1"].value_counts()
    return np.select(
        condlist=(
            df_["prod_country1"].isin(vcs.index[:3]),
            df_["prod_country1"].isin(vcs.index[:10]),
            df_["prod_country1"].isin(vcs.index[:20]),
        ),
        choicelist=("top3", "top10", "top20"),
        default="other",
    )

def get_adjusted_score(df_):
    return np.where(
        df_["release_year"] > 2016,
        df_["imdb_score"] - 1,
        df_["imdb_score"],
    )

(df
    .query("runtime > 30 & type == 'SHOW'")
    .pipe(split_prod_countries)
    .assign(
        imdb_score=lambda df_: get_adjusted_score(df_),
        score=lambda df_: df_[["imdb_score", "tmdb_score"]].sum(axis=1),
        prod_country_rank=lambda df_: get_prod_country_rank(df_),
        rank=lambda df_: df_["prod_country_rank"].astype("category"),
    )
    [["rank", "score"]]
    .groupby("rank")
    .agg(["count", "mean"])
    .droplevel(axis=1, level=0)
    .sort_values("mean", ascending=False)
)
# returns:
#        count       mean
# rank
# top10     37  15.232432
# other   1104  12.824819
# top3      78  12.624359
# top20     20  10.775000
Adopting these practices allows for the complex data transformations and processing to all be conducted in a single chained statement. The code is performant, readable, and simple to maintain and extend. If you're not already coding pandas in this way, I recommend giving it a try!
* This post draws inspiration from various sources - particularly Matt Harrison's excellent Effective Pandas book. Other references are linked throughout the article.