23  Transforming Data with Polars

23.1 Learning Objectives

By the end of this chapter, students will be able to:

  • Filter rows and select columns in a Polars DataFrame using filter and select.
  • Rename columns and cast data types to match a target database schema.
  • Handle missing values by dropping null rows or filling them with a replacement value.
  • Join two DataFrames on a common key using Polars join operations.
  • Compute derived columns using with_columns and Polars expressions.
  • Produce a clean DataFrame that is ready to load into a relational database table.

Raw data is rarely ready to load. Column names may be inconsistent, values may be missing or malformed, types may need adjusting, and the shape of the data may not match the schema you are loading into. The transformation step handles all of this: filtering, reshaping, computing derived values, and joining sources together into a clean, coherent dataset.

This chapter walks through the core Polars operations you will use most often in a data pipeline. By the end, you will have two clean DataFrames ready to load into PostgreSQL in the next chapter.

23.2 Loading the Data

We will work with the same two files from the previous chapter:

import polars as pl

indicators = pl.read_parquet("data/indicators.parquet")
countries = pl.read_csv("data/countries.csv")

print(f"indicators: {indicators.shape}")
print(f"countries:  {countries.shape}")
indicators: (4991, 10)
countries:  (218, 9)

23.3 Exploring Your Data

Before transforming anything, spend a moment understanding what you have. Polars provides two useful starting points: .schema for types and .describe() for numeric summary statistics.

print(indicators.schema)
Schema({'country_code': String, 'short_name': String, 'region': String, 'income_group': String, 'year': Int64, 'population': Int64, 'gdp_usd': Float64, 'gdp_per_capita_usd': Float64, 'pct_forest_area': Float64, 'land_area_km2': Float64})
print(indicators.describe())
shape: (9, 11)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬───────────┬───────────┬──────────┐
│ statistic ┆ country_c ┆ short_nam ┆ region    ┆ … ┆ gdp_usd   ┆ gdp_per_c ┆ pct_fores ┆ land_are │
│ ---       ┆ ode       ┆ e         ┆ ---       ┆   ┆ ---       ┆ apita_usd ┆ t_area    ┆ a_km2    │
│ str       ┆ ---       ┆ ---       ┆ str       ┆   ┆ f64       ┆ ---       ┆ ---       ┆ ---      │
│           ┆ str       ┆ str       ┆           ┆   ┆           ┆ f64       ┆ f64       ┆ f64      │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪═══════════╪═══════════╪══════════╡
│ count     ┆ 4991      ┆ 4991      ┆ 4991      ┆ … ┆ 4828.0    ┆ 4832.0    ┆ 4849.0    ┆ 4893.0   │
│ null_coun ┆ 0         ┆ 0         ┆ 0         ┆ … ┆ 163.0     ┆ 159.0     ┆ 142.0     ┆ 98.0     │
│ t         ┆           ┆           ┆           ┆   ┆           ┆           ┆           ┆          │
│ mean      ┆ null      ┆ null      ┆ null      ┆ … ┆ 3.1543e11 ┆ 15743.476 ┆ 32.378893 ┆ 604223.5 │
│           ┆           ┆           ┆           ┆   ┆           ┆ 786       ┆           ┆ 76129    │
│ std       ┆ null      ┆ null      ┆ null      ┆ … ┆ 1.4392e12 ┆ 24432.495 ┆ 24.549867 ┆ 1.7579e6 │
│           ┆           ┆           ┆           ┆   ┆           ┆ 97        ┆           ┆          │
│ min       ┆ ABW       ┆ Afghanist ┆ East Asia ┆ … ┆ 1.3965e7  ┆ 109.59    ┆ 0.0       ┆ 2.0      │
│           ┆           ┆ an        ┆ & Pacific ┆   ┆           ┆           ┆           ┆          │
│ 25%       ┆ null      ┆ null      ┆ null      ┆ … ┆ 4.2047e9  ┆ 1508.67   ┆ 10.9386   ┆ 10120.0  │
│ 50%       ┆ null      ┆ null      ┆ null      ┆ … ┆ 1.7638e10 ┆ 5270.43   ┆ 30.8998   ┆ 94280.0  │
│ 75%       ┆ null      ┆ null      ┆ null      ┆ … ┆ 1.1531e11 ┆ 20571.48  ┆ 51.2193   ┆ 446300.0 │
│ max       ┆ ZWE       ┆ Zimbabwe  ┆ Sub-Sahar ┆ … ┆ 2.6007e13 ┆ 226052.0  ┆ 95.5772   ┆ 1.638139 │
│           ┆           ┆           ┆ an Africa ┆   ┆           ┆           ┆           ┆ e7       │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴───────────┴───────────┴──────────┘

The describe() output reports count, mean, standard deviation, min, and max for each numeric column, plus null counts. A few things stand out: gdp_usd and gdp_per_capita_usd each have roughly 160 nulls, pct_forest_area has around 140, and land_area_km2 has about 100. These are data quality issues we will need to handle.

For a targeted view of null counts, compute them directly:

null_counts = indicators.select([
    pl.col(name).is_null().sum().alias(name)
    for name in indicators.columns
])
print(null_counts)
shape: (1, 10)
┌────────────┬────────────┬────────┬────────────┬───┬─────────┬────────────┬───────────┬───────────┐
│ country_co ┆ short_name ┆ region ┆ income_gro ┆ … ┆ gdp_usd ┆ gdp_per_ca ┆ pct_fores ┆ land_area │
│ de         ┆ ---        ┆ ---    ┆ up         ┆   ┆ ---     ┆ pita_usd   ┆ t_area    ┆ _km2      │
│ ---        ┆ u32        ┆ u32    ┆ ---        ┆   ┆ u32     ┆ ---        ┆ ---       ┆ ---       │
│ u32        ┆            ┆        ┆ u32        ┆   ┆         ┆ u32        ┆ u32       ┆ u32       │
╞════════════╪════════════╪════════╪════════════╪═══╪═════════╪════════════╪═══════════╪═══════════╡
│ 0          ┆ 0          ┆ 0      ┆ 0          ┆ … ┆ 163     ┆ 159        ┆ 142       ┆ 98        │
└────────────┴────────────┴────────┴────────────┴───┴─────────┴────────────┴───────────┴───────────┘

23.4 Filtering Rows

The filter() method accepts a Polars expression and keeps only the rows where that expression evaluates to True. A single condition is straightforward:

indicators_2022 = indicators.filter(pl.col("year") == 2022)
print(indicators_2022.shape)
(217, 10)

For compound conditions, combine expressions with & (and) and | (or). Each sub-expression must be wrapped in parentheses:

valid_2022 = indicators.filter(
    (pl.col("year") == 2022) &
    (pl.col("gdp_per_capita_usd").is_not_null()) &
    (pl.col("region") != "Aggregates")
)
print(valid_2022.shape)
print(valid_2022.select(["short_name", "region", "gdp_per_capita_usd"]).head(8))
(207, 10)
shape: (8, 3)
┌──────────────────────┬────────────────────────────┬────────────────────┐
│ short_name           ┆ region                     ┆ gdp_per_capita_usd │
│ ---                  ┆ ---                        ┆ ---                │
│ str                  ┆ str                        ┆ f64                │
╞══════════════════════╪════════════════════════════╪════════════════════╡
│ Aruba                ┆ Latin America & Caribbean  ┆ 30559.53           │
│ Afghanistan          ┆ South Asia                 ┆ 357.26             │
│ Angola               ┆ Sub-Saharan Africa         ┆ 2929.69            │
│ Albania              ┆ Europe & Central Asia      ┆ 6846.43            │
│ Andorra              ┆ Europe & Central Asia      ┆ 42414.06           │
│ United Arab Emirates ┆ Middle East & North Africa ┆ 49899.07           │
│ Argentina            ┆ Latin America & Caribbean  ┆ 13935.68           │
│ Armenia              ┆ Europe & Central Asia      ┆ 6571.97            │
└──────────────────────┴────────────────────────────┴────────────────────┘

The "Aggregates" region contains World Bank summary rows (totals for regions, income groups, and the world) rather than individual countries. Filtering them out isolates the 217 actual countries.

23.5 Selecting and Renaming Columns

Use select() to keep only the columns you need, and rename() to standardize column names. Both return a new DataFrame; they do not modify the original:

countries_slim = (
    countries
    .select(["country_code", "short_name", "region", "capital",
             "longitude", "latitude", "income_group", "lending_type"])
    .rename({"country_code": "code", "short_name": "name"})
)
print(countries_slim.head(5))
shape: (5, 8)
┌──────┬─────────────┬─────────────┬─────────────┬───────────┬──────────┬─────────────┬────────────┐
│ code ┆ name        ┆ region      ┆ capital     ┆ longitude ┆ latitude ┆ income_grou ┆ lending_ty │
│ ---  ┆ ---         ┆ ---         ┆ ---         ┆ ---       ┆ ---      ┆ p           ┆ pe         │
│ str  ┆ str         ┆ str         ┆ str         ┆ f64       ┆ f64      ┆ ---         ┆ ---        │
│      ┆             ┆             ┆             ┆           ┆          ┆ str         ┆ str        │
╞══════╪═════════════╪═════════════╪═════════════╪═══════════╪══════════╪═════════════╪════════════╡
│ AFG  ┆ Afghanistan ┆ South Asia  ┆ Kabul       ┆ 69.1761   ┆ 34.5228  ┆ Low income  ┆ IDA        │
│ ALB  ┆ Albania     ┆ Europe &    ┆ Tirane      ┆ 19.8172   ┆ 41.3317  ┆ Upper       ┆ IBRD       │
│      ┆             ┆ Central     ┆             ┆           ┆          ┆ middle      ┆            │
│      ┆             ┆ Asia        ┆             ┆           ┆          ┆ income      ┆            │
│ DZA  ┆ Algeria     ┆ Middle East ┆ Algiers     ┆ 3.05097   ┆ 36.7397  ┆ Lower       ┆ IBRD       │
│      ┆             ┆ & North     ┆             ┆           ┆          ┆ middle      ┆            │
│      ┆             ┆ Africa      ┆             ┆           ┆          ┆ income      ┆            │
│ ASM  ┆ American    ┆ East Asia & ┆ Pago Pago   ┆ -170.691  ┆ -14.2846 ┆ Upper       ┆ Not        │
│      ┆ Samoa       ┆ Pacific     ┆             ┆           ┆          ┆ middle      ┆ classified │
│      ┆             ┆             ┆             ┆           ┆          ┆ income      ┆            │
│ AND  ┆ Andorra     ┆ Europe &    ┆ Andorra la  ┆ 1.5218    ┆ 42.5075  ┆ High income ┆ Not        │
│      ┆             ┆ Central     ┆ Vella       ┆           ┆          ┆             ┆ classified │
│      ┆             ┆ Asia        ┆             ┆           ┆          ┆             ┆            │
└──────┴─────────────┴─────────────┴─────────────┴───────────┴──────────┴─────────────┴────────────┘

You can also rename inside an expression using .alias(). This is common when computing a new column and giving it a meaningful name in the same step.

23.6 Computing New Columns

with_columns() adds or replaces columns without dropping any existing ones. Pass it a list of expressions, each ending with .alias("new_name"):

indicators_enhanced = indicators.with_columns([
    (pl.col("gdp_usd") / 1e9).alias("gdp_billions"),
    (pl.col("population") / pl.col("land_area_km2")).alias("pop_density_per_km2")
])
print(
    indicators_enhanced
    .filter((pl.col("year") == 2022) & (pl.col("region") != "Aggregates"))
    .select(["short_name", "gdp_billions", "pop_density_per_km2"])
    .sort("pop_density_per_km2", descending=True, nulls_last=True)
    .head(8)
)
shape: (8, 3)
┌───────────────────────────┬──────────────┬─────────────────────┐
│ short_name                ┆ gdp_billions ┆ pop_density_per_km2 │
│ ---                       ┆ ---          ┆ ---                 │
│ str                       ┆ f64          ┆ f64                 │
╞═══════════════════════════╪══════════════╪═════════════════════╡
│ Singapore                 ┆ 498.474541   ┆ 7851.005571         │
│ Bahrain                   ┆ 46.680399    ┆ 1929.991139         │
│ Maldives                  ┆ 6.177118     ┆ 1747.02             │
│ Malta                     ┆ 19.20998     ┆ 1660.971875         │
│ Bangladesh                ┆ 460.131689   ┆ 1301.2591           │
│ Sint Maarten (Dutch part) ┆ 1.537089     ┆ 1239.382353         │
│ Bermuda                   ┆ 7.82798      ┆ 1199.055556         │
│ Barbados                  ┆ 6.257304     ┆ 656.553488          │
└───────────────────────────┴──────────────┴─────────────────────┘

Expressions compose naturally: you can divide one column by another, apply math functions from pl.Expr, compare against literals, and chain multiple operations. Polars evaluates all expressions in a with_columns() call in a single pass over the data.

23.7 Handling Missing Values

Nulls in your data require an explicit decision: drop the rows, fill with a constant, or impute from surrounding context. There is no single right answer; it depends on why the values are missing and how you intend to use the data.

Drop rows with nulls in specific columns:

indicators_complete_gdp = indicators.drop_nulls(subset=["gdp_per_capita_usd", "gdp_usd"])
print(f"Before: {indicators.shape[0]} rows")
print(f"After:  {indicators_complete_gdp.shape[0]} rows")
Before: 4991 rows
After:  4828 rows

Fill nulls with a constant or group statistic:

indicators_filled = indicators.with_columns(
    pl.col("pct_forest_area")
    .fill_null(
        pl.col("pct_forest_area").mean().over("region")
    )
    .alias("pct_forest_area_filled")
)

null_before = indicators["pct_forest_area"].is_null().sum()
null_after = indicators_filled["pct_forest_area_filled"].is_null().sum()
print(f"Nulls before fill: {null_before}")
print(f"Nulls after fill:  {null_after}")
Nulls before fill: 142
Nulls after fill:  0

The .over("region") call makes mean() a window function: it computes the mean within each region group and fills each null with the corresponding regional mean. Rows whose entire region has no forest area data remain null, which is why the count does not reach zero.

For this dataset, we will keep nulls as-is and let PostgreSQL store them as NULL. The fill example above is a technique to know, not necessarily one to apply blindly.

23.8 Grouping and Aggregating

group_by() followed by .agg() computes summary statistics for each unique combination of grouping columns. You can compute multiple aggregations in a single call:

gdp_by_group = (
    indicators
    .filter(
        (pl.col("year") == 2022) &
        (pl.col("region") != "Aggregates") &
        (pl.col("gdp_per_capita_usd").is_not_null())
    )
    .group_by("income_group")
    .agg([
        pl.col("gdp_per_capita_usd").mean().alias("mean_gdp_per_capita"),
        pl.col("gdp_per_capita_usd").median().alias("median_gdp_per_capita"),
        pl.col("short_name").count().alias("country_count")
    ])
    .sort("mean_gdp_per_capita", descending=True, nulls_last=True)
)
print(gdp_by_group)
shape: (4, 4)
┌─────────────────────┬─────────────────────┬───────────────────────┬───────────────┐
│ income_group        ┆ mean_gdp_per_capita ┆ median_gdp_per_capita ┆ country_count │
│ ---                 ┆ ---                 ┆ ---                   ┆ ---           │
│ str                 ┆ f64                 ┆ f64                   ┆ u32           │
╞═════════════════════╪═════════════════════╪═══════════════════════╪═══════════════╡
│ High income         ┆ 48055.756133        ┆ 36142.21              ┆ 75            │
│ Upper middle income ┆ 8897.194717         ┆ 7770.59               ┆ 53            │
│ Lower middle income ┆ 2821.378519         ┆ 2599.755              ┆ 54            │
│ Low income          ┆ 784.8912            ┆ 813.97                ┆ 25            │
└─────────────────────┴─────────────────────┴───────────────────────┴───────────────┘

The mean and median diverge sharply in the high-income group because a handful of very high GDP-per-capita countries (Luxembourg, Singapore, Switzerland) pull the mean well above the median.

You can also group by multiple columns:

regional_summary = (
    indicators
    .filter(
        (pl.col("year") == 2022) &
        (pl.col("region") != "Aggregates") &
        (pl.col("population").is_not_null())
    )
    .group_by(["region", "income_group"])
    .agg([
        pl.col("population").sum().alias("total_population"),
        pl.col("short_name").count().alias("countries")
    ])
    .sort(["region", "income_group"])
)
print(regional_summary)
shape: (23, 4)
┌───────────────────────┬─────────────────────┬──────────────────┬───────────┐
│ region                ┆ income_group        ┆ total_population ┆ countries │
│ ---                   ┆ ---                 ┆ ---              ┆ ---       │
│ str                   ┆ str                 ┆ i64              ┆ u32       │
╞═══════════════════════╪═════════════════════╪══════════════════╪═══════════╡
│ East Asia & Pacific   ┆ High income         ┆ 222835509        ┆ 13        │
│ East Asia & Pacific   ┆ Low income          ┆ 26328845         ┆ 1         │
│ East Asia & Pacific   ┆ Lower middle income ┆ 587551208        ┆ 14        │
│ East Asia & Pacific   ┆ Upper middle income ┆ 1519746456       ┆ 9         │
│ Europe & Central Asia ┆ High income         ┆ 523632130        ┆ 38        │
│ …                     ┆ …                   ┆ …                ┆ …         │
│ South Asia            ┆ Upper middle income ┆ 524106           ┆ 1         │
│ Sub-Saharan Africa    ┆ High income         ┆ 119878           ┆ 1         │
│ Sub-Saharan Africa    ┆ Low income          ┆ 623481122        ┆ 24        │
│ Sub-Saharan Africa    ┆ Lower middle income ┆ 532402794        ┆ 17        │
│ Sub-Saharan Africa    ┆ Upper middle income ┆ 73204779         ┆ 6         │
└───────────────────────┴─────────────────────┴──────────────────┴───────────┘

23.9 Joining DataFrames

The join() method combines two DataFrames on a shared key column. The how parameter controls the join type, mirroring the SQL join types you already know:

Polars how SQL equivalent
"inner" INNER JOIN
"left" LEFT JOIN
"right" RIGHT JOIN
"full" FULL OUTER JOIN
"cross" CROSS JOIN
"anti" WHERE NOT IN (subquery)

The countries.csv file has a lending_type column that does not exist in indicators.parquet. A left join adds it:

lending = countries.select(["country_code", "lending_type"])

indicators_with_lending = indicators.join(
    lending,
    on="country_code",
    how="left"
)

print(indicators_with_lending.select(["short_name", "year", "lending_type"]).head(5))
print(f"\nNull lending_type rows: {indicators_with_lending['lending_type'].is_null().sum()}")
shape: (5, 3)
┌────────────┬──────┬────────────────┐
│ short_name ┆ year ┆ lending_type   │
│ ---        ┆ ---  ┆ ---            │
│ str        ┆ i64  ┆ str            │
╞════════════╪══════╪════════════════╡
│ Aruba      ┆ 2000 ┆ Not classified │
│ Aruba      ┆ 2001 ┆ Not classified │
│ Aruba      ┆ 2002 ┆ Not classified │
│ Aruba      ┆ 2003 ┆ Not classified │
│ Aruba      ┆ 2004 ┆ Not classified │
└────────────┴──────┴────────────────┘

Null lending_type rows: 0

The null count tells you how many indicator rows could not be matched to a country in the countries table. For this dataset, it should be low: the indicators file was derived from the same underlying database.

Anti-joins are useful for auditing mismatches between two sources. This finds countries in the indicators file that have no matching entry in countries.csv:

unmatched = indicators.join(
    countries.select(["country_code"]),
    on="country_code",
    how="anti"
)
print(unmatched.select("country_code").unique())
shape: (0, 1)
┌──────────────┐
│ country_code │
│ ---          │
│ str          │
╞══════════════╡
└──────────────┘

23.10 Sorting

sort() accepts one or more column names and an optional descending flag. For nullable numeric columns, pass nulls_last=True to push null rows to the bottom:

top10_gdp = (
    indicators
    .filter(
        (pl.col("year") == 2022) &
        (pl.col("region") != "Aggregates")
    )
    .select(["short_name", "region", "income_group", "gdp_per_capita_usd"])
    .sort("gdp_per_capita_usd", descending=True, nulls_last=True)
    .head(10)
)
print(top10_gdp)
shape: (10, 4)
┌────────────────┬────────────────────────────┬──────────────┬────────────────────┐
│ short_name     ┆ region                     ┆ income_group ┆ gdp_per_capita_usd │
│ ---            ┆ ---                        ┆ ---          ┆ ---                │
│ str            ┆ str                        ┆ str          ┆ f64                │
╞════════════════╪════════════════════════════╪══════════════╪════════════════════╡
│ Monaco         ┆ Europe & Central Asia      ┆ High income  ┆ 226052.0           │
│ Liechtenstein  ┆ Europe & Central Asia      ┆ High income  ┆ 186400.23          │
│ Luxembourg     ┆ Europe & Central Asia      ┆ High income  ┆ 125006.02          │
│ Bermuda        ┆ North America              ┆ High income  ┆ 120897.31          │
│ Norway         ┆ Europe & Central Asia      ┆ High income  ┆ 108798.45          │
│ Ireland        ┆ Europe & Central Asia      ┆ High income  ┆ 106194.76          │
│ Switzerland    ┆ Europe & Central Asia      ┆ High income  ┆ 93245.8            │
│ Cayman Islands ┆ Latin America & Caribbean  ┆ High income  ┆ 92202.15           │
│ Qatar          ┆ Middle East & North Africa ┆ High income  ┆ 88701.46           │
│ Singapore      ┆ East Asia & Pacific        ┆ High income  ┆ 88428.7            │
└────────────────┴────────────────────────────┴──────────────┴────────────────────┘

Multi-column sort uses a list:

print(
    indicators
    .filter(pl.col("year") == 2022)
    .sort(["region", "gdp_per_capita_usd"], descending=[False, True], nulls_last=True)
    .select(["short_name", "region", "gdp_per_capita_usd"])
    .head(8)
)
shape: (8, 3)
┌──────────────────────┬─────────────────────┬────────────────────┐
│ short_name           ┆ region              ┆ gdp_per_capita_usd │
│ ---                  ┆ ---                 ┆ ---                │
│ str                  ┆ str                 ┆ f64                │
╞══════════════════════╪═════════════════════╪════════════════════╡
│ Singapore            ┆ East Asia & Pacific ┆ 88428.7            │
│ Australia            ┆ East Asia & Pacific ┆ 64997.01           │
│ Hong Kong SAR, China ┆ East Asia & Pacific ┆ 48826.05           │
│ New Zealand          ┆ East Asia & Pacific ┆ 48216.51           │
│ Guam                 ┆ East Asia & Pacific ┆ 41833.15           │
│ Macao SAR, China     ┆ East Asia & Pacific ┆ 36909.96           │
│ Brunei Darussalam    ┆ East Asia & Pacific ┆ 36632.93           │
│ Japan                ┆ East Asia & Pacific ┆ 34017.27           │
└──────────────────────┴─────────────────────┴────────────────────┘

23.11 Reshaping: Long to Wide

Analytical data is often stored in long format (one row per country-year combination), but reporting sometimes requires wide format (one row per country, years as columns). Polars pivot() handles this:

gdp_wide = (
    indicators
    .filter(
        pl.col("year").is_in([2000, 2010, 2020, 2022]) &
        (pl.col("region") != "Aggregates") &
        pl.col("gdp_per_capita_usd").is_not_null()
    )
    .pivot(
        values="gdp_per_capita_usd",
        index=["country_code", "short_name"],
        on="year"
    )
    .rename({"2000": "gdp_2000", "2010": "gdp_2010",
             "2020": "gdp_2020", "2022": "gdp_2022"})
    .sort("gdp_2022", descending=True, nulls_last=True)
    .head(10)
)
print(gdp_wide)
shape: (10, 6)
┌──────────────┬────────────────┬──────────┬───────────┬───────────┬───────────┐
│ country_code ┆ short_name     ┆ gdp_2000 ┆ gdp_2010  ┆ gdp_2020  ┆ gdp_2022  │
│ ---          ┆ ---            ┆ ---      ┆ ---       ┆ ---       ┆ ---       │
│ str          ┆ str            ┆ f64      ┆ f64       ┆ f64       ┆ f64       │
╞══════════════╪════════════════╪══════════╪═══════════╪═══════════╪═══════════╡
│ MCO          ┆ Monaco         ┆ 81789.02 ┆ 161853.92 ┆ 176891.89 ┆ 226052.0  │
│ LIE          ┆ Liechtenstein  ┆ 76087.91 ┆ 141089.81 ┆ 164671.09 ┆ 186400.23 │
│ LUX          ┆ Luxembourg     ┆ 48659.6  ┆ 110885.99 ┆ 116905.37 ┆ 125006.02 │
│ BMU          ┆ Bermuda        ┆ 56542.96 ┆ 104111.82 ┆ 106973.18 ┆ 120897.31 │
│ NOR          ┆ Norway         ┆ 38178.24 ┆ 88163.21  ┆ 68340.02  ┆ 108798.45 │
│ IRL          ┆ Ireland        ┆ 26334.57 ┆ 48679.4   ┆ 87567.11  ┆ 106194.76 │
│ CHE          ┆ Switzerland    ┆ 38865.02 ┆ 76531.37  ┆ 85897.78  ┆ 93245.8   │
│ CYM          ┆ Cayman Islands ┆ null     ┆ 76837.67  ┆ 82338.8   ┆ 92202.15  │
│ QAT          ┆ Qatar          ┆ 27535.18 ┆ 77387.33  ┆ 51683.51  ┆ 88701.46  │
│ SGP          ┆ Singapore      ┆ 23852.84 ┆ 47236.68  ┆ 61466.8   ┆ 88428.7   │
└──────────────┴────────────────┴──────────┴───────────┴───────────┴───────────┘

The reverse operation, wide to long, uses unpivot() (called melt() in some other libraries). We will not need it here, but it takes the same arguments in reverse: on lists the columns to collapse, index lists the columns to keep fixed, and variable_name/value_name name the resulting key and value columns.

23.12 Preparing the Clean DataFrames

Now we apply everything above to produce two clean DataFrames suitable for loading into PostgreSQL.

Countries: standard column selection, null rows retained for database NULL storage:

countries_clean = (
    countries
    .select([
        "country_code", "iso2_code", "short_name", "region",
        "capital", "longitude", "latitude", "income_group", "lending_type"
    ])
    .sort("country_code")
)
print(f"countries_clean: {countries_clean.shape}")
print(countries_clean.head(5))
countries_clean: (218, 9)
shape: (5, 9)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬──────────┬───────────┬───────────┐
│ country_c ┆ iso2_code ┆ short_nam ┆ region    ┆ … ┆ longitude ┆ latitude ┆ income_gr ┆ lending_t │
│ ode       ┆ ---       ┆ e         ┆ ---       ┆   ┆ ---       ┆ ---      ┆ oup       ┆ ype       │
│ ---       ┆ str       ┆ ---       ┆ str       ┆   ┆ f64       ┆ f64      ┆ ---       ┆ ---       │
│ str       ┆           ┆ str       ┆           ┆   ┆           ┆          ┆ str       ┆ str       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪══════════╪═══════════╪═══════════╡
│ ABW       ┆ AW        ┆ Aruba     ┆ Latin     ┆ … ┆ -70.0167  ┆ 12.5167  ┆ High      ┆ Not class │
│           ┆           ┆           ┆ America & ┆   ┆           ┆          ┆ income    ┆ ified     │
│           ┆           ┆           ┆ Caribbean ┆   ┆           ┆          ┆           ┆           │
│ AFG       ┆ AF        ┆ Afghanist ┆ South     ┆ … ┆ 69.1761   ┆ 34.5228  ┆ Low       ┆ IDA       │
│           ┆           ┆ an        ┆ Asia      ┆   ┆           ┆          ┆ income    ┆           │
│ AGO       ┆ AO        ┆ Angola    ┆ Sub-Sahar ┆ … ┆ 13.242    ┆ -8.81155 ┆ Lower     ┆ IBRD      │
│           ┆           ┆           ┆ an Africa ┆   ┆           ┆          ┆ middle    ┆           │
│           ┆           ┆           ┆           ┆   ┆           ┆          ┆ income    ┆           │
│ ALB       ┆ AL        ┆ Albania   ┆ Europe &  ┆ … ┆ 19.8172   ┆ 41.3317  ┆ Upper     ┆ IBRD      │
│           ┆           ┆           ┆ Central   ┆   ┆           ┆          ┆ middle    ┆           │
│           ┆           ┆           ┆ Asia      ┆   ┆           ┆          ┆ income    ┆           │
│ AND       ┆ AD        ┆ Andorra   ┆ Europe &  ┆ … ┆ 1.5218    ┆ 42.5075  ┆ High      ┆ Not class │
│           ┆           ┆           ┆ Central   ┆   ┆           ┆          ┆ income    ┆ ified     │
│           ┆           ┆           ┆ Asia      ┆   ┆           ┆          ┆           ┆           │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴──────────┴───────────┴───────────┘

Indicators: filter out the Aggregate summary rows, keep everything else including rows with null economic values:

indicators_clean = (
    indicators
    .filter(pl.col("region") != "Aggregates")
    .select([
        "country_code", "year", "population", "gdp_usd",
        "gdp_per_capita_usd", "pct_forest_area", "land_area_km2"
    ])
    .sort(["country_code", "year"])
)
print(f"indicators_clean: {indicators_clean.shape}")
print(indicators_clean.head(5))
indicators_clean: (4991, 7)
shape: (5, 7)
┌──────────────┬──────┬────────────┬──────────┬──────────────────┬─────────────────┬───────────────┐
│ country_code ┆ year ┆ population ┆ gdp_usd  ┆ gdp_per_capita_u ┆ pct_forest_area ┆ land_area_km2 │
│ ---          ┆ ---  ┆ ---        ┆ ---      ┆ sd               ┆ ---             ┆ ---           │
│ str          ┆ i64  ┆ i64        ┆ f64      ┆ ---              ┆ f64             ┆ f64           │
│              ┆      ┆            ┆          ┆ f64              ┆                 ┆               │
╞══════════════╪══════╪════════════╪══════════╪══════════════════╪═════════════════╪═══════════════╡
│ ABW          ┆ 2000 ┆ 90588      ┆ 1.8735e9 ┆ 20681.02         ┆ 2.3333          ┆ 180.0         │
│ ABW          ┆ 2001 ┆ 91439      ┆ 1.8965e9 ┆ 20740.13         ┆ 2.3333          ┆ 180.0         │
│ ABW          ┆ 2002 ┆ 92074      ┆ 1.9618e9 ┆ 21307.25         ┆ 2.3333          ┆ 180.0         │
│ ABW          ┆ 2003 ┆ 93128      ┆ 2.0441e9 ┆ 21949.49         ┆ 2.3333          ┆ 180.0         │
│ ABW          ┆ 2004 ┆ 95138      ┆ 2.2548e9 ┆ 23700.63         ┆ 2.3333          ┆ 180.0         │
└──────────────┴──────┴────────────┴──────────┴──────────────────┴─────────────────┴───────────────┘

We drop short_name, region, and income_group from the indicators DataFrame because those fields belong to the countries table. Storing them in both tables would violate the normalization principles from the database design section of this book. When we need those fields for a query, we will join the two tables.

Save both to Parquet so Chapter 33 can load them without re-running the transformation:

countries_clean.write_parquet("data/countries_clean.parquet")
indicators_clean.write_parquet("data/indicators_clean.parquet")
print("Saved.")
Saved.

23.13 Summary

Operation Method Notes
Filter rows filter(expr) Compound: (cond1) & (cond2)
Select columns select([...]) Also used for column reordering
Rename columns rename({"old": "new"}) Or use .alias() inside expressions
Add/replace columns with_columns([...]) Does not drop existing columns
Handle nulls drop_nulls(), fill_null(), .is_null() Choose based on why data is missing
Aggregate group_by([...]).agg([...]) Multiple aggregations per call
Join join(other, on=..., how=...) how mirrors SQL join types
Sort sort([...], descending=[...], nulls_last=True) Multi-column sort uses lists
Reshape long to wide pivot(values=..., index=..., on=...) Reverse: unpivot()

In the next chapter, you will load countries_clean.parquet and indicators_clean.parquet into PostgreSQL using SQLAlchemy, including proper upsert logic so the pipeline can be run repeatedly without creating duplicate rows.

23.14 Exercises

  1. Filter indicators to the year 2000. How many countries have a non-null pct_forest_area value? How does that number compare to 2022?

  2. Compute a new column called forest_area_km2 by multiplying pct_forest_area by land_area_km2 and dividing by 100. Which 10 countries had the largest total forest area in 2022?

  3. Using group_by, compute the total world population for each year in the dataset (excluding Aggregate rows). Which year had the highest total?

  4. Join indicators with countries to add the lending_type column. Then group by lending_type and compute the mean gdp_per_capita_usd for 2022. Exclude nulls from the mean calculation.

  5. Create a wide-format pivot table showing pct_forest_area in 2000 and 2022 for countries in Sub-Saharan Africa. Which countries show the largest decline in forest area percentage over that period?