22  Extracting Data from Files and APIs

22.1 Learning Objectives

By the end of this chapter, students will be able to:

  • Explain the role of the extraction stage in an ETL pipeline and describe common data sources.
  • Read CSV, JSON, and Parquet files into Polars DataFrames using the appropriate read_* function.
  • Fetch data from a REST API using the requests library and parse the JSON response.
  • Inspect a freshly loaded DataFrame to verify its shape, schema, and the presence of missing values.
  • Identify practical differences between CSV, JSON, and Parquet as storage formats for tabular data.

The first stage of any ETL pipeline is extraction: pulling raw data from wherever it lives into your working environment. In the real world, data arrives in many forms. A colleague sends a CSV. A vendor exposes a REST API. A data warehouse exports Parquet files to cloud storage. Each format has its own structure, its own quirks, and its own best practices.

In this chapter, you will learn how to read the three most common formats using Polars, a fast DataFrame library for Python. You will also learn how to fetch data from a REST API using the requests library. By the end, you will have the extraction toolkit needed to ingest data from nearly any source you encounter as a data professional.

The example files used throughout this chapter live in the data/ directory of the book’s repository:

  • data/countries.csv: country metadata for 218 countries
  • data/indicators_meta.json: indicator definitions wrapped in a paginated API response envelope
  • data/indicators.parquet: economic and demographic indicators for 217 countries from 2000 to 2022

22.2 Prerequisites

This chapter requires Polars and the requests library. Install them if you have not already:

pip install polars requests

22.3 Reading CSV Files

Comma-separated values (CSV) is the most common data exchange format. Despite its simplicity, CSV files can hide surprises: missing values represented as blank strings or sentinel values like "N/A", columns whose types are ambiguous, and inconsistent quoting. Polars handles most of this gracefully.

The pl.read_csv() function reads a CSV file into a Polars DataFrame with automatic type inference:

import polars as pl

countries = pl.read_csv("data/countries.csv")
print(countries.head(5))
shape: (5, 9)
┌───────────┬───────────┬───────────┬───────────┬───┬───────────┬──────────┬───────────┬───────────┐
│ country_c ┆ iso2_code ┆ short_nam ┆ region    ┆ … ┆ longitude ┆ latitude ┆ income_gr ┆ lending_t │
│ ode       ┆ ---       ┆ e         ┆ ---       ┆   ┆ ---       ┆ ---      ┆ oup       ┆ ype       │
│ ---       ┆ str       ┆ ---       ┆ str       ┆   ┆ f64       ┆ f64      ┆ ---       ┆ ---       │
│ str       ┆           ┆ str       ┆           ┆   ┆           ┆          ┆ str       ┆ str       │
╞═══════════╪═══════════╪═══════════╪═══════════╪═══╪═══════════╪══════════╪═══════════╪═══════════╡
│ AFG       ┆ AF        ┆ Afghanist ┆ South     ┆ … ┆ 69.1761   ┆ 34.5228  ┆ Low       ┆ IDA       │
│           ┆           ┆ an        ┆ Asia      ┆   ┆           ┆          ┆ income    ┆           │
│ ALB       ┆ AL        ┆ Albania   ┆ Europe &  ┆ … ┆ 19.8172   ┆ 41.3317  ┆ Upper     ┆ IBRD      │
│           ┆           ┆           ┆ Central   ┆   ┆           ┆          ┆ middle    ┆           │
│           ┆           ┆           ┆ Asia      ┆   ┆           ┆          ┆ income    ┆           │
│ DZA       ┆ DZ        ┆ Algeria   ┆ Middle    ┆ … ┆ 3.05097   ┆ 36.7397  ┆ Lower     ┆ IBRD      │
│           ┆           ┆           ┆ East &    ┆   ┆           ┆          ┆ middle    ┆           │
│           ┆           ┆           ┆ North     ┆   ┆           ┆          ┆ income    ┆           │
│           ┆           ┆           ┆ Africa    ┆   ┆           ┆          ┆           ┆           │
│ ASM       ┆ AS        ┆ American  ┆ East Asia ┆ … ┆ -170.691  ┆ -14.2846 ┆ Upper     ┆ Not class │
│           ┆           ┆ Samoa     ┆ & Pacific ┆   ┆           ┆          ┆ middle    ┆ ified     │
│           ┆           ┆           ┆           ┆   ┆           ┆          ┆ income    ┆           │
│ AND       ┆ AD        ┆ Andorra   ┆ Europe &  ┆ … ┆ 1.5218    ┆ 42.5075  ┆ High      ┆ Not class │
│           ┆           ┆           ┆ Central   ┆   ┆           ┆          ┆ income    ┆ ified     │
│           ┆           ┆           ┆ Asia      ┆   ┆           ┆          ┆           ┆           │
└───────────┴───────────┴───────────┴───────────┴───┴───────────┴──────────┴───────────┴───────────┘

Polars inferred the column types by sampling the file. Let’s inspect the schema it chose:

print(countries.schema)
Schema({'country_code': String, 'iso2_code': String, 'short_name': String, 'region': String, 'capital': String, 'longitude': Float64, 'latitude': Float64, 'income_group': String, 'lending_type': String})

String columns were correctly identified as String. The longitude and latitude columns were inferred as Float64 because they contain decimal values.

22.3.1 Handling Null Values

Seven countries in this dataset have no recorded longitude or latitude: territories and disputed regions such as Gibraltar, Taiwan, and the Channel Islands. Polars represents these missing values as null in the DataFrame:

missing_coords = countries.filter(pl.col("longitude").is_null())
print(missing_coords.select(["short_name", "region", "capital"]))
shape: (7, 3)
┌───────────────────────────┬────────────────────────────┬─────────────┐
│ short_name                ┆ region                     ┆ capital     │
│ ---                       ┆ ---                        ┆ ---         │
│ str                       ┆ str                        ┆ str         │
╞═══════════════════════════╪════════════════════════════╪═════════════╡
│ Channel Islands           ┆ Europe & Central Asia      ┆             │
│ Curacao                   ┆ Latin America & Caribbean  ┆ Willemstad  │
│ Gibraltar                 ┆ Europe & Central Asia      ┆             │
│ Sint Maarten (Dutch part) ┆ Latin America & Caribbean  ┆ Philipsburg │
│ St. Martin (French part)  ┆ Latin America & Caribbean  ┆ Marigot     │
│ Taiwan, China             ┆ East Asia & Pacific        ┆             │
│ West Bank and Gaza        ┆ Middle East & North Africa ┆             │
└───────────────────────────┴────────────────────────────┴─────────────┘

By default, pl.read_csv() treats blank fields as null. If your CSV uses a different sentinel to indicate missing data, pass it via the null_values parameter:

pl.read_csv("data/countries.csv", null_values=["N/A", "-", ""])

You can also pass a dictionary to specify different null sentinels per column, which is useful when different columns use different conventions in the same file.

22.3.2 Selecting Columns at Read Time

You rarely need every column in a large file. Polars lets you select specific columns during the read, which avoids loading unnecessary data into memory:

countries_slim = pl.read_csv(
    "data/countries.csv",
    columns=["country_code", "short_name", "region", "income_group"]
)
print(countries_slim.head(8))
shape: (8, 4)
┌──────────────┬─────────────────────┬────────────────────────────┬─────────────────────┐
│ country_code ┆ short_name          ┆ region                     ┆ income_group        │
│ ---          ┆ ---                 ┆ ---                        ┆ ---                 │
│ str          ┆ str                 ┆ str                        ┆ str                 │
╞══════════════╪═════════════════════╪════════════════════════════╪═════════════════════╡
│ AFG          ┆ Afghanistan         ┆ South Asia                 ┆ Low income          │
│ ALB          ┆ Albania             ┆ Europe & Central Asia      ┆ Upper middle income │
│ DZA          ┆ Algeria             ┆ Middle East & North Africa ┆ Lower middle income │
│ ASM          ┆ American Samoa      ┆ East Asia & Pacific        ┆ Upper middle income │
│ AND          ┆ Andorra             ┆ Europe & Central Asia      ┆ High income         │
│ AGO          ┆ Angola              ┆ Sub-Saharan Africa         ┆ Lower middle income │
│ ATG          ┆ Antigua and Barbuda ┆ Latin America & Caribbean  ┆ High income         │
│ ARG          ┆ Argentina           ┆ Latin America & Caribbean  ┆ Upper middle income │
└──────────────┴─────────────────────┴────────────────────────────┴─────────────────────┘

For quick inspection, .head(n) returns the first n rows and .tail(n) returns the last n.

22.3.3 Overriding Inferred Types

Automatic type inference is convenient but not infallible. Suppose you want to treat region and income_group as categorical variables rather than plain strings. Categoricals store repeated string values more efficiently, which matters for columns with low cardinality (few unique values relative to total rows). Use schema_overrides to specify types explicitly:

countries_typed = pl.read_csv(
    "data/countries.csv",
    schema_overrides={
        "region": pl.Categorical,
        "income_group": pl.Categorical,
        "lending_type": pl.Categorical
    }
)
print(countries_typed.schema)
Schema({'country_code': String, 'iso2_code': String, 'short_name': String, 'region': Categorical, 'capital': String, 'longitude': Float64, 'latitude': Float64, 'income_group': Categorical, 'lending_type': Categorical})

Notice that region, income_group, and lending_type now show as Categorical while all other columns remain as inferred.

22.4 Reading JSON Files

JSON (JavaScript Object Notation) is the standard format for web API responses. It is flexible enough to represent nested structures that CSV cannot: an array of objects, where each object may itself contain nested objects or arrays.

Polars can read a flat JSON array directly with pl.read_json(). But most real-world API responses wrap the data you actually care about inside a response envelope that includes pagination metadata, status codes, and other fields alongside the records. You cannot read those files with pl.read_json() alone.

The data/indicators_meta.json file mimics this pattern. Let’s look at its top-level structure first:

import json

with open("data/indicators_meta.json", encoding="utf-8") as f:
    response = json.load(f)

print(list(response.keys()))
print(f"Page {response['page']} of {response['pages']}")
print(f"Showing {len(response['indicators'])} of {response['total']} indicators")
['page', 'pages', 'per_page', 'total', 'source', 'indicators']
Page 1 of 1
Showing 9 of 9 indicators

The actual records live under the "indicators" key. Extract that list, then construct a Polars DataFrame from it:

indicators_meta = pl.DataFrame(response["indicators"])
print(indicators_meta.select(["id", "name"]))
shape: (9, 2)
┌───────────────────┬─────────────────────────────────┐
│ id                ┆ name                            │
│ ---               ┆ ---                             │
│ str               ┆ str                             │
╞═══════════════════╪═════════════════════════════════╡
│ AG.LND.AGRI.ZS    ┆ Agricultural land (% of land a… │
│ AG.LND.ARBL.ZS    ┆ Arable land (% of land area)    │
│ AG.LND.FRST.ZS    ┆ Forest area (% of land area)    │
│ AG.LND.TOTL.K2    ┆ Land area (sq. km)              │
│ AG.LND.TOTL.RU.K2 ┆ Rural land area (sq. km)        │
│ AG.LND.TOTL.UR.K2 ┆ Urban land area (sq. km)        │
│ NY.GDP.MKTP.CD    ┆ GDP (current US$)               │
│ NY.GDP.PCAP.CD    ┆ GDP per capita (current US$)    │
│ SP.POP.TOTL       ┆ Population, total               │
└───────────────────┴─────────────────────────────────┘

22.4.1 Handling Nested Fields

Notice the source column. Rather than a plain string, each row’s source value is itself an object with id and value fields. Polars stores these as the Struct type. Use .struct.field() to extract the nested values into their own columns:

indicators_meta = (
    pl.DataFrame(response["indicators"])
    .with_columns([
        pl.col("source").struct.field("id").alias("source_id"),
        pl.col("source").struct.field("value").alias("source_name")
    ])
    .drop("source")
)
print(indicators_meta.select(["id", "name", "source_id", "source_name"]))
shape: (9, 4)
┌───────────────────┬─────────────────────────────────┬───────────┬──────────────────────────────┐
│ id                ┆ name                            ┆ source_id ┆ source_name                  │
│ ---               ┆ ---                             ┆ ---       ┆ ---                          │
│ str               ┆ str                             ┆ str       ┆ str                          │
╞═══════════════════╪═════════════════════════════════╪═══════════╪══════════════════════════════╡
│ AG.LND.AGRI.ZS    ┆ Agricultural land (% of land a… ┆ 2         ┆ World Development Indicators │
│ AG.LND.ARBL.ZS    ┆ Arable land (% of land area)    ┆ 2         ┆ World Development Indicators │
│ AG.LND.FRST.ZS    ┆ Forest area (% of land area)    ┆ 2         ┆ World Development Indicators │
│ AG.LND.TOTL.K2    ┆ Land area (sq. km)              ┆ 2         ┆ World Development Indicators │
│ AG.LND.TOTL.RU.K2 ┆ Rural land area (sq. km)        ┆ 2         ┆ World Development Indicators │
│ AG.LND.TOTL.UR.K2 ┆ Urban land area (sq. km)        ┆ 2         ┆ World Development Indicators │
│ NY.GDP.MKTP.CD    ┆ GDP (current US$)               ┆ 2         ┆ World Development Indicators │
│ NY.GDP.PCAP.CD    ┆ GDP per capita (current US$)    ┆ 2         ┆ World Development Indicators │
│ SP.POP.TOTL       ┆ Population, total               ┆ 2         ┆ World Development Indicators │
└───────────────────┴─────────────────────────────────┴───────────┴──────────────────────────────┘

This pattern applies broadly: read the raw response with json.load(), extract the list of records by key, build a DataFrame, then use .struct.field() to flatten any nested fields you need.

22.5 Reading Parquet Files

Parquet is a columnar storage format designed for analytical workloads. Unlike CSV, which stores data row by row as plain text, Parquet stores each column independently in a binary, compressed format. This makes it far faster to read when you only need a subset of columns, and far smaller on disk.

Reading a Parquet file with Polars is simple:

indicators = pl.read_parquet("data/indicators.parquet")
print(indicators.shape)
print(indicators.schema)
(4991, 10)
Schema({'country_code': String, 'short_name': String, 'region': String, 'income_group': String, 'year': Int64, 'population': Int64, 'gdp_usd': Float64, 'gdp_per_capita_usd': Float64, 'pct_forest_area': Float64, 'land_area_km2': Float64})

Two things stand out compared to CSV:

  1. The schema is embedded in the file itself. There is no inference step, and there is no risk of type ambiguity. Whatever types were written by the creator are read back exactly.
  2. Null counts are visible but do not require any extra work. Columns like gdp_usd and pct_forest_area have missing values for some country-year combinations, and Polars handles them as null automatically.

22.5.1 Size Comparison

To appreciate the compression benefit, compare the Parquet file’s size to an equivalent CSV export:

import os

parquet_size = os.path.getsize("data/indicators.parquet")

csv_path = "data/indicators_temp.csv"
indicators.write_csv(csv_path)
csv_size = os.path.getsize(csv_path)
os.remove(csv_path)

print(f"Parquet: {parquet_size / 1024:>8.1f} KB")
print(f"CSV:     {csv_size / 1024:>8.1f} KB")
print(f"Ratio:   {csv_size / parquet_size:.1f}× smaller as Parquet")
Parquet:    128.1 KB
CSV:        501.9 KB
Ratio:   3.9× smaller as Parquet

The Parquet file is roughly 3 to 4 times smaller. At the scale of millions of rows, this difference translates into meaningfully lower storage costs and faster transfer times.

22.5.2 Reading a Subset of Columns

Because Parquet stores columns independently, you can read only the columns you need without touching the rest of the file. This is called column pruning, and it is one of Parquet’s most important performance characteristics:

gdp_subset = pl.read_parquet(
    "data/indicators.parquet",
    columns=["country_code", "short_name", "year", "gdp_per_capita_usd"]
)
print(gdp_subset.head(5))
shape: (5, 4)
┌──────────────┬────────────┬──────┬────────────────────┐
│ country_code ┆ short_name ┆ year ┆ gdp_per_capita_usd │
│ ---          ┆ ---        ┆ ---  ┆ ---                │
│ str          ┆ str        ┆ i64  ┆ f64                │
╞══════════════╪════════════╪══════╪════════════════════╡
│ ABW          ┆ Aruba      ┆ 2000 ┆ 20681.02           │
│ ABW          ┆ Aruba      ┆ 2001 ┆ 20740.13           │
│ ABW          ┆ Aruba      ┆ 2002 ┆ 21307.25           │
│ ABW          ┆ Aruba      ┆ 2003 ┆ 21949.49           │
│ ABW          ┆ Aruba      ┆ 2004 ┆ 23700.63           │
└──────────────┴────────────┴──────┴────────────────────┘

With a CSV file, you must read the entire file before you can select columns. With Parquet, the columns you omit are never read from disk at all.

22.6 Fetching Data from a REST API

Many data sources are not files at all: they are web APIs that respond to HTTP requests with JSON. The Python requests library makes it straightforward to query these endpoints.

The World Bank publishes a free, open API for the development indicators we have been working with throughout this book. The following example fetches the most recent five years of population data for four countries:

import requests

url = "https://api.worldbank.org/v2/country/US;GB;DE;JP/indicator/SP.POP.TOTL"
params = {
    "format": "json",
    "per_page": 20,
    "mrv": 5
}

resp = requests.get(url, params=params, timeout=10)
resp.raise_for_status()

envelope, records = resp.json()
print(f"Total records: {envelope['total']}")
Total records: 20

The World Bank API returns a two-element list: the envelope (containing pagination metadata) followed by the array of records. We unpack them in a single assignment.

Two details worth noting:

  • resp.raise_for_status() raises an exception immediately if the server returned an HTTP error code (4xx or 5xx). Call it every time so you catch problems before silently processing an error response.
  • The API returns "date" as a string and "value" as either a number or None. We handle both below.

Now build a DataFrame from the records:

pop_df = pl.DataFrame([
    {
        "country": r["country"]["value"],
        "year": int(r["date"]),
        "population": r["value"]
    }
    for r in records
    if r["value"] is not None
])

print(pop_df.sort(["country", "year"]))
shape: (20, 3)
┌───────────────┬──────┬────────────┐
│ country       ┆ year ┆ population │
│ ---           ┆ ---  ┆ ---        │
│ str           ┆ i64  ┆ i64        │
╞═══════════════╪══════╪════════════╡
│ Germany       ┆ 2020 ┆ 83160871   │
│ Germany       ┆ 2021 ┆ 83196078   │
│ Germany       ┆ 2022 ┆ 83177813   │
│ Germany       ┆ 2023 ┆ 83287273   │
│ Germany       ┆ 2024 ┆ 83516593   │
│ …             ┆ …    ┆ …          │
│ United States ┆ 2020 ┆ 331577720  │
│ United States ┆ 2021 ┆ 332099760  │
│ United States ┆ 2022 ┆ 334017321  │
│ United States ┆ 2023 ┆ 336806231  │
│ United States ┆ 2024 ┆ 340110988  │
└───────────────┴──────┴────────────┘

22.6.1 Pagination

Most APIs cap the number of records returned per request and include a page count in the envelope. When you need all pages, check the envelope and loop:

all_records = []
page = 1

while True:
    resp = requests.get(
        "https://api.worldbank.org/v2/country/all/indicator/SP.POP.TOTL",
        params={"format": "json", "per_page": 1000, "page": page, "mrv": 1},
        timeout=10
    )
    resp.raise_for_status()
    envelope, records = resp.json()
    all_records.extend(records)
    if page >= envelope["pages"]:
        break
    page += 1

pop_all = pl.DataFrame([
    {"country_code": r["countryiso3code"], "population": r["value"]}
    for r in all_records
    if r["value"] is not None
])

print(f"Fetched {len(all_records)} records across {envelope['pages']} page(s)")
print(f"Non-null populations: {len(pop_all)}")
Fetched 266 records across 1 page(s)
Non-null populations: 265

This pattern works for any API that reports a page count. Adapt the stopping condition to match whatever field your API uses: pages, next_page, has_more, a cursor token, and so on.

22.7 Summary

The table below summarizes the extraction approach for each source type:

Source Tool Key consideration
CSV file pl.read_csv() Check null sentinels; override types with schema_overrides
Flat JSON array pl.read_json() Works directly on arrays of objects
JSON envelope json.load() + pl.DataFrame() Extract the records key before building the DataFrame
Parquet file pl.read_parquet() Embedded schema; use columns= for column pruning
REST API requests.get() + pl.DataFrame() Call .raise_for_status(); handle pagination

In the next chapter, you will take the extracted data and apply transformations with Polars: filtering, reshaping, computing new columns, and summarizing.

22.8 Exercises

  1. Read data/countries.csv and display the count of countries in each region. Sort the result from the region with the most countries to the fewest.

  2. The longitude and latitude columns have null values for 7 countries. Filter the DataFrame to show only those rows, displaying short_name, region, and capital.

  3. Read data/indicators.parquet using only the country_code, short_name, year, and pct_forest_area columns. How many row-year combinations have a non-null pct_forest_area?

  4. Load data/indicators_meta.json and flatten the nested source field. Print the id, name, and source_name for all 9 indicators.

  5. Modify the pagination loop to fetch all countries’ forest area data (indicator code AG.LND.FRST.ZS) for the most recent year available. Build a Polars DataFrame from the results and report how many countries have a non-null forest area value.