Wim Onderbeke

Notes on columnar databases

Notes on analytical columnar databases

  ·   9 min read

Throughout my career, I’ve primarily worked with traditional relational databases like PostgreSQL and MySQL. In my current role, we use a hybrid approach: PostgreSQL serves as our transactional database, while BigQuery acts as our analytical data warehouse. We stream changes from Postgres using change data capture (CDC) on the write-ahead log (WAL). Recently, I’ve been writing increasingly complex queries to detect patterns and data inconsistencies across our systems. What amazes me is the sheer difference in performance: queries that would bring our Postgres replica to its knees complete in seconds on BigQuery, even when scanning terabytes of data. This got me curious: what makes BigQuery so fast? The answer lies in its columnar storage architecture, a fundamentally different approach to organizing data compared to the row-based storage I’m familiar with.

These notes are about how that layout actually produces the difference. It comes down to a handful of ideas that all follow from how the data is arranged on disk.

What columnar storage is

Row stores keep each record contiguous. SELECT name FROM users still reads whole rows off disk even though you wanted one field. That’s the right layout for point lookups, like fetching everything about user 42. It’s a poor fit for analytics, where you touch a couple of fields across millions of rows.

Columnar stores keep each column contiguous instead:

Row-based: each record stored together

  [1, Alice, 25]  [2, Bob, 30]  [3, Carol, 25]

Columnar: each column stored together

  ID    [1, 2, 3]
  Name  [Alice, Bob, Carol]
  Age   [25, 30, 25]

Column pruning, compression, predicate pushdown, and vectorized execution all come out of this one change, once a column’s values sit next to each other.

Reading only the columns you need

A query touches a few columns out of many. A columnar engine reads those and ignores the rest. A row store can’t: the unwanted columns are physically interleaved with the ones you want, so you pay to read them regardless.

On a wide table the effect is large: reading two columns out of a hundred touches roughly 2% of the data.

BigQuery makes the cost of this visible. On-demand queries are billed by the bytes read from the columns you reference, so SELECT * on a wide table is far more expensive than naming the three columns you actually need. Adding a column to a query adds its bytes and nothing else.

Compression

Contiguous values are similar values: same type, often few distinct values, often sorted. Compressors do well on that. A row store can’t match the ratios because each value sits among unrelated bytes from other columns. Columnar formats typically compress 5–10×, more on low-cardinality columns.

A few encodings recur:

  • Dictionary: store distinct values once, replace the column with integer indices. A million-row country column with 200 distinct values becomes 200 strings plus a million small integers. Good for low-cardinality strings.
  • Run-length (RLE): store a value and its repeat count. [1,1,1,2,2,3,3,3,3] becomes [(1,3),(2,2),(3,4)]. Good for sorted or repetitive columns.
  • Delta: store the first value, then differences. [1000,1001,1003,1006] becomes [1000,1,2,3]. Good for timestamps and rising IDs.
  • Bit packing: use the minimum bits the range needs. Values 0–15 take 4 bits, not 32.

BigQuery’s storage format, Capacitor, does exactly this. It chooses an encoding per column from the data itself, and it will even reorder rows within a block to give run-length encoding longer runs, since row order usually carries no meaning. The large ratios come from stacking encodings: dictionary-encode, bit-pack the indices, then compress the result.

Skipping data you never read

Because a column sits together, it’s cheap to keep a little metadata beside it, at minimum a min and a max. That’s enough to answer a question before reading any values: can this block contain a matching row? If you filter WHERE price > 100 and a block’s max is 80, you skip it without reading a single value.

This is predicate pushdown. Pruning and compression reduce the cost of the data you read. Statistics go further and let you avoid reading some of it at all. BigQuery does this at two levels. Partitioning splits a table by a date or integer-range column, so a filter on that column drops whole partitions before any block is read. Clustering goes finer: within a partition it stores rows in blocks sorted by the clustering columns and keeps the min and max of each column per block, so a filter on a clustering column reads only the blocks whose ranges overlap it. Either way you are billed only for the bytes actually scanned.

This only helps if the data is laid out so the ranges are tight. On randomly ordered data every block’s range covers everything, and you skip nothing. That’s why clustering a table on the columns you filter by matters so much: computing the statistics is cheap, but the ranges only rule blocks out when similar values are stored together.

Staying columnar

Row-based code reconstructs rows early, then filters and aggregates over objects. A columnar engine does the opposite: stay in columns as long as possible and materialize rows last.

For a filter, you evaluate the predicate against the one filter column, collect the positions that match, then gather just those positions from the other columns. This is late materialization. You avoid building rows you are about to throw away, and the data stays in columnar form for whatever runs next.

That form is what enables vectorized execution: process a column a block at a time instead of a row at a time. A row-at-a-time engine pays interpreter and function-call overhead on every value and jumps around memory reconstructing rows. Looping over a block of one column amortizes that overhead, keeps the data in cache, and lets the CPU apply one instruction to many values at once with SIMD, where a single instruction can add 8 or 16 numbers in one step. It works only because the column is already a contiguous array of one type.

Working on encoded data

A naive engine decodes each column before touching it. Better engines work on the encoded form directly: comparing dictionary indices instead of strings, summing RLE runs as value × count without expanding them, combining compressed bitmaps without decompressing. This was an explicit goal for Capacitor. Unlike the format it replaced, it lets BigQuery operate on compressed data instead of decompressing it first. Pushing work down into the storage format like this is much of what separates a simple implementation from a production engine.

Joins

Everything so far has been about reading one table: touch fewer columns, compress them, skip blocks, stay in arrays. Joins are where that runs out. A join brings together rows from two tables that share a key, and the columnar layout does nothing to put those rows near each other. The matching rows can sit anywhere.

The mechanism is a hash join. Build a hash table on the smaller side, keyed by the join key, then stream the larger side through it and probe for matches. On a single machine that is the whole story. In a distributed engine like BigQuery the hard part is getting the rows that share a key onto the same worker, and there are two ways to do it.

A broadcast join is used when one side is small. BigQuery copies that side into the memory of every worker, and since each worker already holds a slice of the large table, it joins its slice locally. The only data crossing the network is the small table.

A hash join, also called a shuffle join, is used when both sides are large. BigQuery repartitions both tables by a hash of the join key so that every row with a given key lands on the same worker, which then joins its partition locally. This shuffle moves both tables across the network, and it is usually the most expensive step in the query. It is also where the columnar layout helps least: it can shrink what gets moved, but it cannot remove the need to move data between workers.

BigQuery chooses between the two based on the sizes it estimates for each side, and the query execution details report which strategy ran, as either BROADCAST or HASH.

The earlier ideas still pull their weight. Only the join key and the columns the query uses get shuffled, not the whole row, and filters pushed down before the join shrink both sides first. But the shuffle is why a join across two large tables can cost far more than a scan over either one, and why denormalizing into wide tables, or into nested and repeated fields that fold the related rows into one, is such a common pattern in BigQuery.

The other half: parallelism

The columnar ideas so far make a single worker efficient: fewer columns read, more blocks skipped, tight vectorized loops. But that only explains one machine, not why a scan over terabytes finishes in seconds. The rest is parallelism. BigQuery spreads a query across thousands of workers at once, each reading its own shard of the data, so the per-worker efficiency is multiplied across the whole fleet.

That fan-out is possible because BigQuery keeps storage and compute separate. The data lives in a shared filesystem that any worker can read, so the engine can throw a thousand of them at one query and release them when it finishes. A single-server database like Postgres couples the two: a query runs on the cores of the one machine that holds the data, however many that happens to be. Postgres can parallelize within that machine, but it cannot hand the work to a fleet that does not have the data.

Where it falls down

The same layout that makes analytics fast makes single-row work slow. To fetch one row by its ID you read from every column’s storage and reassemble it, where a row store reads one contiguous record. Writes are worse. A row store appends a record in one place, while a columnar store has to touch every column, and the encodings that buy the compression assume a column is written in bulk and then left alone, not mutated one value at a time. You cannot cheaply append a single row to a Capacitor or Parquet file.

That is why this is an analytical store and not a transactional one, and why the setup at the start of this post keeps both. Postgres absorbs the single-row inserts and updates from the application; those changes stream into BigQuery, which answers the questions that scan everything. Each database does the job its layout is good at.

Takeaway

The speed comes from a stack of effects, all of them downstream of storing each column together:

  • read only the columns a query touches
  • compress better, because similar values sit next to each other
  • skip whole blocks using cheap per-block statistics
  • stay in arrays so the CPU can vectorize

Run that stack across a fleet of workers reading shared storage in parallel, and a row store on a single machine has no way to keep up on analytics.