
ClickHouse® AggregatingMergeTree: Rollups That Scale

Data · March 15, 2026 · Alvaro Garcia

The complete AggregatingMergeTree reference: combinator patterns, cascading rollups, SimpleAggregateFunction vs AggregateFunction, and production gotchas.

This is the fifth post in the ClickHouse Deep Dive series, where I explain ClickHouse from the ground up. If you are not familiar with how it works, check the previous articles. The data modelling post placed AggregatingMergeTree in the engine selection guide, and the materialized views post introduced it with the -State/-Merge pattern. Both posts only touched the surface. This one goes all the way down.

ClickHouse can scan billions of rows in seconds. But when you're running the same GROUP BY hundreds of times a minute, even ClickHouse needs help. AggregatingMergeTree is how you trade disk space for query speed: pre-compute the aggregations at write time, store the intermediate states, and read back the final result in milliseconds instead of seconds.

The documentation is scattered across a dozen sources, and none of them provides a comprehensive overview. This post connects the pieces: how the engine works, the three combinator patterns as a mental model, when to use SimpleAggregateFunction vs AggregateFunction, cascading multi-resolution rollups, and the production gotchas we learned the hard way.

How AggregatingMergeTree works

Same storage layer as every MergeTree variant. Same parts, same background merges, same sparse index. The data modelling post covered this: every MergeTree variant is the same engine with a different merge strategy. For AggregatingMergeTree, the merge strategy is: rows with the same ORDER BY key get their AggregateFunction columns combined.

That last sentence is the entire engine. Everything else follows from it.

[Diagram: three stages of a merge. Before: an unmerged part holds four rows, three sharing ORDER BY key A (endpoint=/api, sums 5, 8, 2) and one with key B (endpoint=/web, sum=3). During the merge, rows with the same ORDER BY key have their states combined: sum(5, 8, 2) for A, sum(3) for B. After: the merged part holds two rows, A with sum=15 and B with sum=3. Four rows collapsed to two.]

Rows with matching ORDER BY keys get their aggregate columns combined during background merges

The best way to understand the engine is to see an example, and API request analytics is an especially useful one. In this case, the raw table stores individual events:

SQL
CREATE TABLE access_log (
    endpoint String,
    timestamp DateTime,
    latency_ms Float64,
    user_id UInt64
) ENGINE = MergeTree()
ORDER BY (endpoint, timestamp);

The AggregatingMergeTree destination stores hourly rollups. Every non-key column is an AggregateFunction, storing binary intermediate states instead of final values. The ORDER BY defines which rows get merged together, so it matches the GROUP BY in the materialized view:

SQL
CREATE TABLE endpoint_stats_1h (
    endpoint LowCardinality(String),
    hour DateTime,
    request_count AggregateFunction(sum, UInt64),
    unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
ORDER BY (endpoint, hour);

The materialized view reads raw events, truncates timestamps, and produces aggregate states with the -State combinator:

SQL
CREATE MATERIALIZED VIEW endpoint_stats_1h_mv TO endpoint_stats_1h AS
SELECT
    endpoint,
    toStartOfHour(timestamp) AS hour,
    sumState(toUInt64(1)) AS request_count,
    uniqState(user_id) AS unique_users
FROM access_log
GROUP BY endpoint, hour;

The MV's GROUP BY produces (endpoint, hour), matching the destination's ORDER BY. That alignment is what makes the whole thing work: during background merges, every row that shares the same endpoint and hour gets its aggregate columns combined.

On the query side, the -Merge combinator turns stored states back into final values. The GROUP BY is required because background merges are asynchronous: unmerged partial rows may still exist at query time.

SQL
SELECT
    endpoint,
    hour,
    sumMerge(request_count) AS request_count,
    uniqMerge(unique_users) AS unique_users
FROM endpoint_stats_1h
WHERE endpoint = '/api/v1/data'
GROUP BY endpoint, hour
ORDER BY hour DESC;

-State on INSERT, -Merge on SELECT. There's a third combinator, -MergeState, that produces another intermediate state instead of a final value. That's what makes cascading rollups possible, and I'll cover it in that section.

[Diagram: the three combinators. INSERT side: sumState() turns raw values (numbers, strings, timestamps) into an opaque binary aggregate state. SELECT side: sumMerge() turns stored aggregate states into a final value you can read. Cascade side: sumMergeState() turns aggregate states from a level-N table into another aggregate state for the level-N+1 table.]

The three combinators as a progression: raw to aggregate, aggregate to final, aggregate to aggregate

ORDER BY: merge grouping and query performance

ORDER BY in an AggregatingMergeTree does double duty. It controls which rows get merged together, and it's the sparse index that ClickHouse uses to skip granules during queries. Two jobs, one key. That means two failure modes when you get it wrong.

Wrong granularity: Two versions of the same problem.

  1. Too granular: If the ORDER BY key produces mostly unique combinations, the engine has nothing to merge. I've seen tables where this happened: the table kept growing linearly with inserts, query times climbed with it, and the whole point of pre-aggregation was lost. Your ORDER BY key should produce enough duplicate combinations that the engine actually has rows to combine.
  2. Too coarse: The opposite problem. If you aggregate at too high a level, say just endpoint with no time dimension, you get one merged row per endpoint. Fast queries, tiny table, but you can never break down the results by hour or day. That granularity is gone. You aggregated it away, and there's no getting it back without reprocessing from raw data. You can only query at the resolution you stored, which is fine if you only need full-history totals.

Doesn't match your access pattern. The ORDER BY key is also the primary index. If your queries filter by hour first but your ORDER BY starts with endpoint, ClickHouse can't use the sparse index efficiently and scans more granules than it needs to. Same data, same merges, slower queries. You have three options depending on the failure mode:

  1. Reorder the ORDER BY columns to match your most common filter if that still works for your merge granularity.
  2. Add a skipping index for access patterns that don't align with your ORDER BY. These are lightweight filters (bloom filters, min-max) that let ClickHouse skip granules without changing the sort order.
  3. Use a separate PRIMARY KEY if your ORDER BY has more columns than your typical query filter uses, which would reduce sparse index effectiveness.
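Option 2 can be sketched as follows. This is a hedged example against the earlier endpoint_stats_1h table; the index name and GRANULARITY value are assumptions, not recommendations:

```sql
-- A min-max skipping index on hour, so time-range filters can skip
-- granules even though the ORDER BY starts with endpoint.
ALTER TABLE endpoint_stats_1h
    ADD INDEX hour_minmax hour TYPE minmax GRANULARITY 4;

-- Build the index for existing parts; new parts get it automatically.
ALTER TABLE endpoint_stats_1h MATERIALIZE INDEX hour_minmax;
```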

Most people don't know that PRIMARY KEY and ORDER BY can be different in ClickHouse. PRIMARY KEY can be a prefix of ORDER BY. Data gets sorted on disk by the full ORDER BY (which controls merge granularity), but only the PRIMARY KEY columns go into the sparse index (which controls query performance and RAM usage). The data modelling post covers this in detail. For AggregatingMergeTree, this means you can have a granular ORDER BY for fine-grained merging while keeping a shorter PRIMARY KEY that matches how you actually query the table.
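As a minimal sketch of option 3 (the table name is made up for illustration): data is sorted and merged by the full (endpoint, hour) key, but only endpoint goes into the sparse index, which keeps it small when queries filter by endpoint alone:

```sql
CREATE TABLE endpoint_stats_1h_pk (
    endpoint LowCardinality(String),
    hour DateTime,
    request_count AggregateFunction(sum, UInt64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(hour)
PRIMARY KEY endpoint          -- sparse index: endpoint only
ORDER BY (endpoint, hour);    -- sort and merge key: full granularity
```

PRIMARY KEY must be a prefix of ORDER BY, which is why this works at all.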

The tradeoff between granularity, access pattern and use case is the real design decision. A coarser key merges more aggressively (smaller table, faster scans), but you lose the ability to query at finer granularity. A granular aggregation means you can serve every request, but if the volume of data requested increases, for example, a user requesting full history on a time-series table, the query gets expensive fast. There's no universal answer. Start with the granularity your queries actually need, check system.parts to verify that merges are reducing row counts, and monitor query performance as usage grows.
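A quick way to run that system.parts check, assuming the endpoint_stats_1h table from earlier:

```sql
-- Rows physically stored across active parts.
SELECT
    sum(rows) AS stored_rows,
    count() AS active_parts
FROM system.parts
WHERE database = currentDatabase()
  AND table = 'endpoint_stats_1h'
  AND active;

-- Rows after full merging. If stored_rows is persistently much larger,
-- your ORDER BY key may be too granular for merges to help.
SELECT count() AS merged_rows
FROM (SELECT endpoint, hour FROM endpoint_stats_1h GROUP BY endpoint, hour);
```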

Non-key columns and the any() trap

An AggregatingMergeTree table has three kinds of columns: ORDER BY key columns (like endpoint and hour), aggregate columns (SimpleAggregateFunction or AggregateFunction), and everything else. The engine knows what to do with the first two. Key columns identify which rows belong together. Aggregate columns are combined with their function (sum, max, uniq, etc.).

The nuance here is that if you add a plain column, say a name String, that is neither part of the ORDER BY key nor an aggregate type, the engine has no rule for how to combine it. When two rows merge, there are two different name values, and ClickHouse has to pick one. It picks arbitrarily, with any() semantics.

This means the value you see depends on whether a merge has happened yet. Before the merge, you query the table and see both rows with their correct names. After the merge, one name is kept and the other is gone; you don't control which. Be aware that this doesn't produce an error. The query returns a result, the numbers look fine, and the name column silently holds an arbitrary survivor.

The fix is simple: every non-key column in an AggregatingMergeTree should be an aggregate type. If you need to keep a string value, wrap it in argMax to make the tie-break explicit.
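A hedged sketch of that fix, extending the earlier example with a hypothetical name column on access_log:

```sql
-- Keep the name from the row with the latest timestamp, instead of
-- letting merges pick one arbitrarily.
CREATE TABLE endpoint_named_1h (
    endpoint LowCardinality(String),
    hour DateTime,
    latest_name AggregateFunction(argMax, String, DateTime)
) ENGINE = AggregatingMergeTree()
ORDER BY (endpoint, hour);

-- In the MV's SELECT:  argMaxState(name, timestamp) AS latest_name
-- On read:             argMaxMerge(latest_name) AS latest_name
```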

SimpleAggregateFunction vs AggregateFunction

This is the decision framework the docs don't give you.

AggregateFunction stores a serialised binary state. The sum state is a running total. The uniq state is a HyperLogLog sketch. The quantile state is a t-digest. These blobs are opaque, you can't read them with a normal SELECT, and they need the -Merge combinator to produce a final value.

SimpleAggregateFunction stores the partial result directly as a regular value. No binary blob. No special combinator needed on SELECT. You just read the column normally.

The catch: SimpleAggregateFunction only works for functions where the partial result IS the final result. For example, max, min, sum and any qualify. These are the ones available with SummingMergeTree, and you can find the full list in the ClickHouse official docs.

uniq does not qualify, because you can't compute unique counts by taking the max of two unique counts. Same for quantile, which needs a t-digest to merge partial results. And argMin/argMax, which need to track both the value and its associated argument. You can't reduce that pair to a single partial result.

Endpoint metrics made this distinction concrete for me. Say you're rolling up an access log into per-minute stats:

  • request_count is SimpleAggregateFunction(sum, UInt64). The sum of sums is the sum. Simple.
  • max_latency is SimpleAggregateFunction(max, Float64). The max of maxes is the max.
  • unique_users is AggregateFunction(uniq, UInt64). You can't merge unique counts by summing or maxing them. The HyperLogLog sketch needs to combine with other sketches to produce a correct approximate count.
  • p99_latency is AggregateFunction(quantile(0.99), Float64). The p99 of two p99s is not the p99. The t-digest needs to merge with other digests.
SQL
CREATE TABLE endpoint_stats_1m (
    endpoint LowCardinality(String),
    minute DateTime,
    request_count SimpleAggregateFunction(sum, UInt64),
    max_latency SimpleAggregateFunction(max, Float64),
    min_latency SimpleAggregateFunction(min, Float64),
    unique_users AggregateFunction(uniq, UInt64),
    p99_latency AggregateFunction(quantile(0.99), Float64)
) ENGINE = AggregatingMergeTree()
PARTITION BY toYYYYMM(minute)
ORDER BY (endpoint, minute);

Querying mixes both patterns. SimpleAggregateFunction columns use their plain aggregate function (sum, max, min). AggregateFunction columns need the -Merge combinator (uniqMerge, quantileMerge):

SQL
SELECT
    endpoint,
    minute,
    sum(request_count) AS request_count,
    max(max_latency) AS max_latency,
    min(min_latency) AS min_latency,
    uniqMerge(unique_users) AS unique_users,
    quantileMerge(0.99)(p99_latency) AS p99_latency
FROM endpoint_stats_1m
WHERE endpoint = '/api/v1/data'
GROUP BY endpoint, minute
ORDER BY minute DESC;

Use SimpleAggregateFunction when you can. It's smaller on disk, faster to query, and doesn't need the -Merge combinator on SELECT. Fall back to AggregateFunction when the function requires tracking state across partial results.

Cascading rollups and multi-source aggregation

This is where -MergeState earns its place. If you need the same metrics at multiple time resolutions, each level should read from the previous one, not from raw data. The 1-hour MV reads from the 1-minute table. The 1-day MV reads from the 1-hour table.

[Diagram: cascading rollup architecture. A MergeTree source (access_log: endpoint, latency_ms, user_id, timestamp; 48M rows/day of raw events) feeds a -State MV into the 1-minute AggregatingMergeTree table (endpoint_stats_1m, 144K rows/day, a 333x reduction), which feeds a -MergeState MV into the 1-hour table (endpoint_stats_1h, 2,400 rows/day, a 20,000x total reduction). Each level stores request_count and max_latency as SimpleAggregateFunction, and unique_users and p99_latency as AggregateFunction.]

Cascading rollup architecture: raw data flows into 1m, then 1h, then 1d, each level reading from the previous

The key is in the MV's SELECT. For the hourly table reading from the 1-minute table above:

SQL
CREATE MATERIALIZED VIEW endpoint_stats_1h_mv TO endpoint_stats_1h AS
SELECT
    endpoint,
    toStartOfHour(minute) AS hour,
    sum(request_count) AS request_count,        -- Simple: sum of sums
    max(max_latency) AS max_latency,            -- Simple: max of maxes
    min(min_latency) AS min_latency,            -- Simple: min of mins
    uniqMergeState(unique_users) AS unique_users,        -- Full: state to state
    quantileMergeState(0.99)(p99_latency) AS p99_latency -- Full: state to state
FROM endpoint_stats_1m
GROUP BY endpoint, hour;

uniqMergeState takes the HyperLogLog sketch from the 1-minute table and produces another sketch for the 1-hour table. Not a final value. An intermediate state that can be merged again at the daily level. SimpleAggregateFunction columns use plain sum(), max(), min() because their partial results are already plain numbers.
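Following the same pattern, the daily level reads from the hourly table. A sketch (the daily table mirrors the hourly schema; names are assumptions):

```sql
CREATE TABLE endpoint_stats_1d (
    endpoint LowCardinality(String),
    day Date,
    request_count SimpleAggregateFunction(sum, UInt64),
    max_latency SimpleAggregateFunction(max, Float64),
    unique_users AggregateFunction(uniq, UInt64),
    p99_latency AggregateFunction(quantile(0.99), Float64)
) ENGINE = AggregatingMergeTree()
ORDER BY (endpoint, day);

CREATE MATERIALIZED VIEW endpoint_stats_1d_mv TO endpoint_stats_1d AS
SELECT
    endpoint,
    toDate(hour) AS day,
    sum(request_count) AS request_count,
    max(max_latency) AS max_latency,
    uniqMergeState(unique_users) AS unique_users,         -- sketch merged again
    quantileMergeState(0.99)(p99_latency) AS p99_latency  -- digest merged again
FROM endpoint_stats_1h
GROUP BY endpoint, day;
```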

You can't mix function types across levels. If the 1-minute table stores sumState, the hourly MV must use sumMergeState, not avgState. The binary states are function-specific, and the error message when you get it wrong is not helpful.

The other pattern worth knowing: multiple materialized views writing to the same target. One MV processes web access logs, another processes mobile events, both keyed by (endpoint, minute) and writing to the same endpoint_stats_1m table. During background merges, the engine combines all partial states for the same ORDER BY key regardless of which MV produced them. Barely documented, but it works. The engine doesn't care where the rows come from.
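A sketch of that fan-in pattern, under the assumption that web_access_log and mobile_events are separate source tables with compatible columns:

```sql
-- Two MVs, one destination. Merges combine states regardless of origin.
CREATE MATERIALIZED VIEW web_stats_1m_mv TO endpoint_stats_1m AS
SELECT
    endpoint,
    toStartOfMinute(timestamp) AS minute,
    count() AS request_count,
    max(latency_ms) AS max_latency,
    min(latency_ms) AS min_latency,
    uniqState(user_id) AS unique_users,
    quantileState(0.99)(latency_ms) AS p99_latency
FROM web_access_log
GROUP BY endpoint, minute;

CREATE MATERIALIZED VIEW mobile_stats_1m_mv TO endpoint_stats_1m AS
SELECT
    endpoint,
    toStartOfMinute(timestamp) AS minute,
    count() AS request_count,
    max(latency_ms) AS max_latency,
    min(latency_ms) AS min_latency,
    uniqState(user_id) AS unique_users,
    quantileState(0.99)(latency_ms) AS p99_latency
FROM mobile_events
GROUP BY endpoint, minute;
```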

Production gotchas

These are based on GitHub issues, community posts, and things we hit in production.

You can't ALTER AggregateFunction columns. Adding, dropping, or modifying an AggregateFunction column on a live table fails or hangs. Schema changes to these columns mean drop-recreate-backfill, same as MV schema changes. Plan your schema carefully upfront. I said this in the materialized views post, and I'll say it again: write and test the backfill query the same day you create the table.

Queries without GROUP BY return unmerged partial rows. AggregatingMergeTree merges are eventual, the same as every other MergeTree variant. Between merges, multiple partial rows exist for the same ORDER BY key. Without GROUP BY in your SELECT, you get all of them. This is the same principle as FINAL in ReplacingMergeTree, except here you solve it with GROUP BY and -Merge functions instead of FINAL.

Forgetting -Merge on SELECT returns binary garbage. A SELECT * on an AggregatingMergeTree table shows binary blobs in the AggregateFunction columns. No error, no warning. Just unreadable output. You need -Merge functions for every AggregateFunction column. After a few months working with ClickHouse, it becomes obvious, but if you are new, keep this in mind. SimpleAggregateFunction columns don't have this problem, which is another reason to prefer them when you can.

AggregateFunction state serialisation can change between ClickHouse versions. The binary format that stores intermediate states (the HyperLogLog sketch for uniq, the t-digest for quantile) is not guaranteed to be stable across major versions. If you upgrade ClickHouse and the serialisation format changes, existing states may not deserialise correctly. In practice, test your AggregatingMergeTree tables after upgrades, and keep your backfill queries ready.

Backfilling with raw values instead of -State values corrupts silently. If you INSERT plain numbers into an AggregateFunction column (skipping the -State combinator), ClickHouse won't always reject it at insert time. The data looks fine until the next background merge combines it with a real aggregate state and produces garbage. The materialized views post covers this in detail. Always backfill through the MV or use the -State functions explicitly.
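A hedged backfill sketch for the hourly table from earlier, with the -State functions matching the MV's SELECT exactly (the cutoff date is a placeholder):

```sql
-- Backfill historical data with explicit -State combinators, mirroring
-- endpoint_stats_1h_mv so stored states are merge-compatible.
INSERT INTO endpoint_stats_1h
SELECT
    endpoint,
    toStartOfHour(timestamp) AS hour,
    sumState(toUInt64(1)) AS request_count,
    uniqState(user_id) AS unique_users
FROM access_log
WHERE timestamp < '2026-03-01 00:00:00'  -- placeholder cutoff
GROUP BY endpoint, hour;
```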

When not to use AggregatingMergeTree

If your aggregation only needs sum and count, SummingMergeTree is simpler. No AggregateFunction columns, no -State/-Merge syntax, no binary blobs. Just numeric columns that get added together during merges. I said in the materialized views post to prefer -State/-Merge over SummingMergeTree, and I stand by that for anything beyond trivial use cases. But if you genuinely only need sums and counts, and you're confident you won't need complex aggregations that require intermediate states later, SummingMergeTree saves you real complexity.
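For contrast, a minimal SummingMergeTree sketch (table and columns are assumptions): plain numeric columns, no states, no combinators.

```sql
CREATE TABLE endpoint_counts_1h (
    endpoint LowCardinality(String),
    hour DateTime,
    request_count UInt64,
    error_count UInt64
) ENGINE = SummingMergeTree()
ORDER BY (endpoint, hour);

-- Still GROUP BY on read: merges are eventual here too.
SELECT endpoint, hour, sum(request_count), sum(error_count)
FROM endpoint_counts_1h
GROUP BY endpoint, hour;
```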

If your aggregation is a simple re-sort or single-level GROUP BY without uniq/quantile, consider a projection instead. Projections store an alternative sort order (and optionally a GROUP BY aggregation) inside the same table. The query optimiser picks them automatically. No separate target table, no MV, no -State/-Merge syntax. Schema changes are ALTER TABLE ADD PROJECTION / DROP PROJECTION instead of drop-recreate-backfill. We use projections on raw tables and MVs with AggregatingMergeTree for anything that needs uniq, quantile, or cascading. That split has worked well. Projections hit their limits faster than you'd expect, which the materialized views post goes into.
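A minimal projection sketch on the raw table from earlier (the projection name is an assumption):

```sql
-- An alternative sort order stored inside access_log, so user-centric
-- queries can skip granules without a separate table or MV.
ALTER TABLE access_log ADD PROJECTION by_user (
    SELECT * ORDER BY user_id, timestamp
);

-- Build it for existing parts; new inserts maintain it automatically.
ALTER TABLE access_log MATERIALIZE PROJECTION by_user;
```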

| Scenario | Engine | Why |
| --- | --- | --- |
| Sum and count only | SummingMergeTree | Simpler syntax, no binary states |
| Uniq, quantile, argMin/argMax | AggregatingMergeTree | These functions require aggregate state tracking |
| Multi-resolution rollups | AggregatingMergeTree | -MergeState enables cascading |
| Simple re-sort or count | Projection on MergeTree | Automatic routing, no separate table |
| Latest row per key | ReplacingMergeTree | Different problem entirely |

Closing

One storage layer, seven merge strategies. The mental model from the data modelling post still holds: pick the merge strategy that matches your data shape, and everything else (compression, sparse indexes, partitioning) works the same.

AggregatingMergeTree is the one that turns "this query takes 3 seconds" into "this query takes 50 milliseconds." It demands more upfront design than the other variants, but once the pipeline is set up, it runs itself.

Next in the series: a real OHLC candle pipeline built on these patterns, from raw trades to multi-resolution candles served to TradingView.
