library(DBI)
library(RPostgres)
con <- dbConnect(
Postgres(),
host = "localhost",
port = 5433,
dbname = "nycflights",
user = "postgres",
password = "postgres"
)11 Connecting R to a Database
11.1 Learning Objectives
By the end of this chapter, students will be able to:
- Explain the two-layer DBI design: a standard interface package and database-specific driver packages.
- Open and close a PostgreSQL connection using
dbConnectanddbDisconnect, including withon.exitfor safe cleanup. - Retrieve query results as R data frames using
dbGetQueryanddbListTables/dbListFieldsfor schema exploration. - Visualize database query results using
ggplot2within an R script or notebook. - Write parameterized queries using the
paramsargument to prevent SQL injection.
Throughout this book, SQL code blocks have been executing against a live PostgreSQL database without much explanation of how that connection actually works. The connection is established by R using the DBI package (R Special Interest Group on Databases (R-SIG-DB), Wickham, and Müller 2026), and every query result comes back to R as a data frame. This chapter makes that mechanism explicit, showing you how to connect to a database from R, retrieve query results, visualize them with ggplot2 (Wickham 2016), and write queries safely using parameterization.
These skills matter outside of Quarto documents. Most real data science work happens in R scripts and notebooks where you need to pull data from a database, analyze it in R, and produce outputs. Understanding how to drive that connection directly gives you full control over the process.
11.2 The DBI Ecosystem
R’s database connectivity is organized around a two-layer design. The DBI package defines a standard interface: a set of functions with consistent names and behavior regardless of which database you are connecting to. Underneath DBI, driver packages implement that interface for specific databases.
| Driver package | Database |
|---|---|
RPostgres |
PostgreSQL |
RSQLite |
SQLite |
RMariaDB |
MySQL / MariaDB |
bigrquery |
Google BigQuery |
odbc |
Any ODBC-compatible database |
ODBC (Open Database Connectivity) is a decades-old standard API for database access supported by most enterprise databases: SQL Server, Oracle, Teradata, Snowflake, Databricks, and many others. The odbc package lets DBI connect to any database that has an ODBC driver installed on the system, making it the practical fallback when a native DBI driver does not exist for your database.
Because DBI provides a uniform interface, the code for connecting to and querying a database looks almost identical regardless of which database you use. Swapping from SQLite to PostgreSQL means changing one line (the driver), not rewriting your analysis.
11.3 Establishing a Connection
dbConnect() creates a connection to the database. It takes a driver object as its first argument, followed by connection parameters specific to that driver.
Each argument tells DBI something specific about where and how to connect:
Postgres()— the driver object; tells DBI which database engine to use. Swapping this forSQLite()orbigrquery::BigQuery()would connect to a different database entirely, with the rest of the call staying mostly the same.host— the network address of the server."localhost"means the database is running on the same machine as R. On a remote server or cloud database, this would be a hostname or IP address.port— the TCP port the server listens on. PostgreSQL’s default is 5432; this book’s Docker container is mapped to 5433. If you connect to a remote database and are unsure of the port, your database administrator can tell you.dbname— the name of a specific database on that server. A single PostgreSQL server can host many independent databases."nycflights"selects the one containing the flight data used throughout this book.user/password— the credentials for the database account. In a production environment, these should come from environment variables or a secrets manager, not be written as plain text in a script.
The object con represents an open connection to the database. It is passed to every subsequent DBI function that needs to communicate with the database.
Before writing any queries, it is useful to explore what the database contains:
dbListTables(con)[1] "airlines" "airports" "flights" "planes" "weather"
dbListFields(con, "flights") [1] "year" "month" "day" "dep_time"
[5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
[9] "arr_delay" "carrier" "flight" "tailnum"
[13] "origin" "dest" "air_time" "distance"
[17] "hour" "minute" "time_hour"
dbListTables() returns the names of all tables in the connected database. dbListFields() returns the column names for a specific table.
11.4 Querying Data
dbGetQuery() sends a SQL query to the database and returns the result as a data frame. It is the most common function for analytical work: you write the SQL, and R gives you back a data frame you can work with immediately.
delays_by_carrier <- dbGetQuery(con, "
SELECT
f.carrier,
a.name,
COUNT(*) AS flights,
ROUND(AVG(f.dep_delay)::numeric, 1) AS mean_delay,
ROUND(STDDEV(f.dep_delay)::numeric, 1) AS stddev_delay
FROM flights AS f
JOIN airlines AS a ON f.carrier = a.carrier
WHERE f.dep_delay IS NOT NULL
GROUP BY f.carrier, a.name
ORDER BY mean_delay DESC
")
knitr::kable(delays_by_carrier)| carrier | name | flights | mean_delay | stddev_delay |
|---|---|---|---|---|
| F9 | Frontier Airlines Inc. | 682 | 20.2 | 58.4 |
| EV | ExpressJet Airlines Inc. | 8255 | 19.1 | 48.8 |
| YV | Mesa Airlines Inc. | 545 | 19.0 | 49.2 |
| FL | AirTran Airways Corporation | 3187 | 18.7 | 52.7 |
| WN | Southwest Airlines Co. | 6000 | 17.6 | 43.9 |
| B6 | JetBlue Airways | 5925 | 14.8 | 46.7 |
| UA | United Air Lines Inc. | 7837 | 12.1 | 42.9 |
| OO | SkyWest Airlines Inc. | 23 | 10.4 | 40.7 |
| DL | Delta Air Lines Inc. | 22857 | 9.6 | 41.4 |
| 9E | Endeavor Air Inc. | 2372 | 8.9 | 34.8 |
| MQ | Envoy Air | 16189 | 8.5 | 34.1 |
| AA | American Airlines Inc. | 15063 | 6.7 | 34.8 |
| US | US Airways Inc. | 12574 | 3.3 | 27.9 |
The result is a standard R data frame. Every tool you use for data manipulation in R, including dplyr, data.table, and base R, works on the result immediately.
class(delays_by_carrier)[1] "data.frame"
11.5 Visualizing Query Results
Because dbGetQuery() returns a data frame, you can pipe the result directly into ggplot2. The following sections build three plots using the data retrieved in the previous section and a few additional queries.
11.5.1 Distribution of Departure Delays
A histogram reveals the shape of the delay distribution discussed in the previous chapter:
library(ggplot2)
delays_raw <- dbGetQuery(con, "
SELECT dep_delay
FROM flights
WHERE dep_delay IS NOT NULL
AND dep_delay BETWEEN -30 AND 180
")
ggplot(delays_raw, aes(x = dep_delay)) +
geom_histogram(binwidth = 5, fill = "#0165fc", color = "white") +
labs(
title = "Distribution of Departure Delays",
subtitle = "LGA flights, 2013 — trimmed to −30 to +180 minutes",
x = "Departure Delay (minutes)",
y = "Number of Flights"
) +
theme_minimal()The right skew is unmistakable. The bulk of flights cluster just below zero, and a long tail of severe delays extends to the right — the same pattern the mean-vs-median comparison in the previous chapter quantified numerically.
11.5.2 Departure Delay vs. Arrival Delay
The strong correlation between departure and arrival delay is easier to see as a scatterplot than as a single number:
delay_pairs <- dbGetQuery(con, "
SELECT dep_delay, arr_delay
FROM flights
WHERE dep_delay IS NOT NULL
AND arr_delay IS NOT NULL
AND dep_delay BETWEEN -30 AND 120
AND arr_delay BETWEEN -60 AND 120
")
ggplot(delay_pairs, aes(x = dep_delay, y = arr_delay)) +
geom_point(alpha = 0.05, color = "#0165fc") +
geom_smooth(method = "lm", color = "firebrick", se = FALSE) +
labs(
title = "Departure Delay vs. Arrival Delay",
subtitle = "Each point is one flight; regression line overlaid",
x = "Departure Delay (minutes)",
y = "Arrival Delay (minutes)"
) +
theme_minimal()The regression line runs nearly through the origin with a slope close to 1: a one-minute departure delay corresponds to roughly one minute of arrival delay. The transparency (alpha = 0.05) reveals where the mass of flights lies despite over 100,000 overlapping points.
11.5.3 Mean Delay by Carrier
A horizontal bar chart makes carrier comparisons easy to read:
ggplot(delays_by_carrier, aes(x = mean_delay, y = reorder(name, mean_delay))) +
geom_col(fill = "#0165fc") +
labs(
title = "Mean Departure Delay by Carrier",
subtitle = "LGA flights, 2013",
x = "Mean Departure Delay (minutes)",
y = NULL
) +
theme_minimal()reorder(name, mean_delay) sorts the bars by delay value rather than alphabetically, making the ranking immediately visible.
11.6 Parameterized Queries
When a query needs to incorporate a value from R, the instinct is to paste it directly into the query string:
# Do not do this
carrier_code <- "AA"
dbGetQuery(con, paste0("SELECT * FROM flights WHERE carrier = '", carrier_code, "'"))This is unsafe. If carrier_code came from user input and contained something like '; DROP TABLE flights; --, the database would execute that injected SQL. The correct approach is a parameterized query, which passes values separately from the query text so the database treats them as data, never as SQL.
DBI supports parameterized queries through the params argument. Placeholders in the query string are written as $1, $2, and so on (for PostgreSQL):
carrier_code <- "AA"
aa_flights <- dbGetQuery(
con,
"SELECT flight, dep_delay, arr_delay, dest
FROM flights
WHERE carrier = $1
AND dep_delay IS NOT NULL
ORDER BY dep_delay DESC",
params = list(carrier_code)
)
head(aa_flights) flight dep_delay arr_delay dest
1 2019 803 802 STL
2 2437 660 648 MIA
3 731 613 598 DFW
4 753 533 499 DFW
5 731 504 493 DFW
6 745 502 495 DFW
The value of carrier_code is sent to the database separately. No matter what string is in carrier_code, it is treated as a literal value to compare against the carrier column, not as SQL to execute. Multiple parameters follow the same pattern: add more $n placeholders and extend the params list.
Always use parameterized queries whenever a query incorporates a value that originates from outside your own code, including user input, configuration files, or other data sources. The params argument is the right tool in DBI.
11.7 Closing the Connection
A database connection holds resources on both the R side and the database server side. Close it explicitly when you are done:
dbDisconnect(con)In a script that opens a connection at the top and closes it at the bottom, it is good practice to use on.exit() to ensure the connection closes even if an error occurs partway through:
con <- dbConnect(Postgres(), host = "localhost", port = 5433,
dbname = "nycflights", user = "postgres", password = "postgres")
on.exit(dbDisconnect(con), add = TRUE)
# ... rest of your analysis ...on.exit(..., add = TRUE) schedules dbDisconnect(con) to run when the surrounding function or script exits, regardless of whether it exits normally or with an error.
11.8 Exercises
11.8.1 Reflection
What is the relationship between the
DBIpackage and driver packages likeRPostgres? Why is this two-layer design useful?Explain what is wrong with building a query string using
paste0()to insert a user-supplied value. What does a parameterized query do differently, and why does it prevent SQL injection?
11.8.2 Coding
Connect to the
nycflightsdatabase and usedbListFields()to inspect the columns of theweathertable. Then write adbGetQuery()call that retrieves the average temperature (temp) and average wind speed (wind_speed) for each month. Store the result in a data frame and print it.Using
dbGetQuery(), retrieve the number of flights per destination (dest) from theflightstable. Then useggplot2to create a bar chart of the top 15 destinations by flight count.Write a parameterized query that accepts a minimum number of Oscar nominations as a parameter and returns matching actors from the
actorsdatabase. (You will need to open a second connection to theactorsdatabase.) Test it with at least two different threshold values.Retrieve daily average departure delay from
flights(aggregate bymonthandday). Plot the result as a line chart withggplot2, usingmonthanddayto construct a date on the x-axis. Describe the seasonal pattern you observe.


