ClickHouse® ReplacingMergeTree: Deduplication Done Right
This post is the 4th of the ClickHouse Deep Dive series. If you haven't read the previous ones, I recommend starting from the beginning. The data modelling post introduced ReplacingMergeTree as the engine for rows that get updated, CDC (Change Data Capture) streams, dimension tables and anything where you're doing upserts. One paragraph in a broader guide. This is the full deep dive.
You need to unlearn something if you come from Postgres or MySQL. INSERT ... ON CONFLICT in Postgres deduplicates immediately: query whenever you like, you always get one row. ReplacingMergeTree doesn't do that. It deduplicates whenever the background merger decides to run, based on available server resources. That could be seconds. It could be hours. You don't control it, and between the insert and the merge, the old and new rows both exist in storage, visible to every query.
That mismatch is responsible for almost every ReplacingMergeTree bug I've seen in production. This post covers the mechanics, configuration, query patterns, and migration traps. After reading it, you shouldn't need to read anything else about this engine.
How deduplication actually works
You insert a row. Then you insert a newer version of that same row. In Postgres, the old one is gone. In ClickHouse, both sit in storage. They'll both be returned in query results.
ReplacingMergeTree dedup lifecycle, both rows visible until the background merge resolves them
ClickHouse removes the older version during background merges. Merges are asynchronous and heuristic-based, the system decides when to run them based on part sizes, load, and internal scheduling. I've seen a table where merges took place within two seconds of an insert, and the same table on a busy day where duplicates stuck around for over an hour. There's no SLA on merge timing. You design around this, or you get burned by it.
SELECT without FINAL returns the duplicates. COUNT(*) is inflated. Aggregations double-count. The scary part is that your numbers are plausible, just wrong. Nobody gets paged for a 1.8x row count. You find out three weeks later when a customer asks why their dashboard numbers don't match their source system.
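A minimal sketch of the window where both versions are visible. The table and values are hypothetical; the exact count without FINAL depends on whether a background merge happens to have run already:

```sql
-- Hypothetical minimal table to show the visibility window.
CREATE TABLE dedup_demo (
    id UInt64,
    value String,
    _version UInt64
) ENGINE = ReplacingMergeTree(_version)
ORDER BY id;

INSERT INTO dedup_demo VALUES (1, 'old', 1);
INSERT INTO dedup_demo VALUES (1, 'new', 2);

-- Until a background merge runs, both versions sit in storage:
SELECT count() FROM dedup_demo;        -- usually 2
SELECT count() FROM dedup_demo FINAL;  -- always 1
```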
Dedup scope makes things trickier. It only happens within a single partition. Same logical row lands in two different partitions? Those copies will never merge. OPTIMIZE TABLE FINAL won't fix it. Nothing will. The same story in sharded setups: deduplication is shard-local only.
Now, the part that trips up almost everyone coming from relational databases.
ORDER BY defines which rows are considered duplicates. Not PRIMARY KEY. In Postgres, PRIMARY KEY enforces uniqueness. In ClickHouse, PRIMARY KEY controls the sparse index, which is a query performance mechanism. ORDER BY controls deduplication, which is a data correctness mechanism. They are related: PRIMARY KEY has to be a prefix of ORDER BY, but they don't have to be the same.
-- Dedup uses all four columns
-- Sparse index uses only the first two
ORDER BY (event_type, toDate(created_at), created_at, id)
PRIMARY KEY (event_type, toDate(created_at))
The column ordering here is intentional. Low-cardinality columns go first because ClickHouse uses ORDER BY for physical sort order on disk, and sorting by low-cardinality values first means the sparse index can skip large chunks of data during scans. The unique identifier (id in this example) appears last because it's only there to guarantee deduplication correctness, not to help query performance.
Every column in ORDER BY must be immutable for a given logical row. If event_type changes from A to B for a row with id=123, the old version has ORDER BY key (A, 2026-01-15, 10:00:00, 123) and the new version has (B, 2026-01-15, 10:00:00, 123). Those are different keys. The engine sees them as different rows, both copies will live forever, and no merge will resolve them. I'll expand on this in the migration traps section.
One more thing that belongs here because it catches people mid-design: materialized views fire on INSERT, before any merge runs. If you have an MV doing sumState() on a ReplacingMergeTree source, it will count duplicates. The source table eventually looks fine after the merge. The MV target is permanently wrong, it already aggregated both versions before dedup happened. You can work around this with ReplacingMergeTree on the target, or argMaxState, or an idempotent MV design. None of these workarounds is particularly clean. Think it through before writing the CREATE statement. The materialized views post goes deeper on this.
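One shape of the argMaxState workaround, sketched with hypothetical names against the events schema used later in this post: the MV target stores latest-version states instead of raw aggregates, so duplicates that arrive before dedup are harmless because readers re-resolve the version at query time.

```sql
-- Hypothetical MV target: AggregatingMergeTree storing argMax states keyed by id.
CREATE TABLE latest_payload (
    id UInt64,
    payload AggregateFunction(argMax, String, UInt64)
) ENGINE = AggregatingMergeTree
ORDER BY id;

CREATE MATERIALIZED VIEW latest_payload_mv TO latest_payload AS
SELECT id, argMaxState(payload, _version) AS payload
FROM events
GROUP BY id;

-- Readers merge the states; the highest _version wins regardless of insert order.
SELECT id, argMaxMerge(payload) AS payload
FROM latest_payload
GROUP BY id;
```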
Configuring the table: ver and is_deleted
Two parameters. Both are optional in the syntax. ver is mandatory in practice if you're running anything real. is_deleted depends on whether your use case needs to handle deletes from the source.
ver is the version column. During a merge, the row with the highest version wins. Without it, the last physically inserted row wins, and that depends on which thread or connector wrote last. On a single-threaded import, you can probably skip this if order is guaranteed. The moment you have two CDC connectors, or parallel batch inserts, or anything where insertion order isn't deterministic, you need an explicit version column.
What works for version values: an incrementing integer from the source system, a DateTime64 with microsecond precision, or, if you're doing CDC, the source database's LSN (log sequence number). I've used all three. The LSN approach is the most reliable for CDC because it tracks the source's actual write order, but it requires your connector to expose it.
is_deleted landed in ClickHouse 23.2 and solved a problem that used to require CollapsingMergeTree. To delete a row: insert a new version with the same ORDER BY key, a higher version, and is_deleted = 1. During merges, ClickHouse treats it as a deletion marker. Queries with FINAL automatically filter these out.
CREATE TABLE events (
_version UInt64,
_is_deleted UInt8,
id UInt64,
event_type LowCardinality(String),
created_at DateTime64(3),
payload String
) ENGINE = ReplacingMergeTree(_version, _is_deleted)
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, toDate(created_at), id)
SETTINGS min_age_to_force_merge_seconds = 86400;
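Given that schema, a delete is just another insert. Note that the ORDER BY columns (event_type, the date, id) must repeat the original row's values exactly, or the marker lands under a different key and deletes nothing. The values here are hypothetical:

```sql
-- Logically delete the row with id = 42: same key columns, higher version, marker set.
INSERT INTO events (_version, _is_deleted, id, event_type, created_at, payload)
VALUES (1001, 1, 42, 'purchase', '2026-01-15 10:00:00.000', '');
```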
Deleted rows don't disappear from disk on their own. The delete markers stay there, and so does the data. min_age_to_force_merge_seconds handles this: parts older than the threshold get force-merged, which cleans up delete markers as a side effect. A 24-hour window works well in practice; adapt it to your pipeline. Historical partitions stay clean without manual intervention, recent partitions keep their markers until they age out, and you handle correctness at query time in the meantime. The time buffer also absorbs out-of-order arrivals: if a late row shows up within the merge window, the delete marker is still there to handle it.
There used to be a clean_deleted_rows = 'Always' setting for more aggressive cleanup, but it was deprecated and removed.
If your table has a high delete rate (above ~10% of rows), this cleanup overhead starts to add up. CollapsingMergeTree handles high-volume deletes more naturally, but we will go deeper into it in a future post.
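To check whether the force-merge is keeping up, the standard system.parts table shows active part counts and ages per partition (table name from the example above):

```sql
-- Many old active parts in a partition means min_age_to_force_merge_seconds
-- is not doing its job, or merges are falling behind.
SELECT
    partition,
    count() AS active_parts,
    min(modification_time) AS oldest_part
FROM system.parts
WHERE table = 'events' AND active
GROUP BY partition
ORDER BY partition;
```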
Querying: FINAL and the alternatives
Every query against a ReplacingMergeTree table without FINAL can return duplicates. FINAL forces dedup at read time.
SELECT * FROM events FINAL WHERE event_type = 'purchase';
For years, the standard advice was to avoid FINAL. And the advice was right, it used to be terrible: single-threaded, reading primary key columns in addition to the requested ones, 12x slower in benchmarks.
That's not true anymore. Before 20.5, FINAL was single-threaded. Performance improved incrementally across versions, and in 22.6 it became fully multi-threaded. But the biggest single improvement was a setting:
SET do_not_merge_across_partitions_select_final = 1;
With that setting, FINAL processes partitions independently instead of merging across all of them. Altinity benchmarked the difference: 7x faster, bringing FINAL within ~30% of equivalent non-FINAL queries.
We run FINAL on every ReplacingMergeTree query now. 100B+ rows, every query, all the time. The overhead is measurable. Wrong results are not acceptable. That's the tradeoff, and I'm comfortable with it.
SELECT with and without FINAL, same data same table, FINAL forces deduplication at read time
There's a caveat: do_not_merge_across_partitions_select_final only works correctly if the same ORDER BY key never appears in multiple partitions. It should be enforced by a proper design, but it's worth calling out.
One more thing to know: FINAL disables PREWHERE optimization by default. PREWHERE is how ClickHouse filters rows before reading all columns, and it's normally applied automatically. With FINAL active, filtering on non-ORDER BY columns gets slower because ClickHouse reads more data than it otherwise would. You can't just add an explicit PREWHERE to work around this, PREWHERE filters rows before FINAL applies deduplication, which means it can filter out the replacement row and return a stale version instead (GitHub #23702). If this is a bottleneck, restructure the query with a subquery: deduplicate in the inner query with FINAL, filter in the outer query.
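A sketch of that restructuring against the example events table: deduplicate in the inner query, filter on non-key columns in the outer one, so the filter never sees a stale pre-dedup row:

```sql
SELECT *
FROM (
    SELECT * FROM events FINAL
    WHERE event_type = 'purchase'   -- ORDER BY column: safe to filter early
)
WHERE payload LIKE '%refund%';      -- non-key column: filtered only after dedup
```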
People still ask about alternatives, so here they are. The argMax pattern selects the column value from the row with the highest version:
SELECT
id,
argMax(event_type, _version) AS event_type,
argMax(payload, _version) AS payload
FROM events
GROUP BY id;
LIMIT 1 BY sorts by version descending and takes the first per key:
SELECT * FROM events
ORDER BY id, _version DESC
LIMIT 1 BY id;
Both work. But looking again at Altinity's benchmarks across 40M rows, 10M unique keys:
| Method | Memory | Time |
|---|---|---|
| argMax (GROUP BY) | 866 MiB | 9.7s |
| ORDER BY + LIMIT 1 BY | 8.39 GiB | 14.5s |
| Subquery (IN) | 924 MiB | 4.1s |
| FINAL | 834 MiB | 2.3s |
FINAL is fastest and uses the least memory on full-table scans. Point queries are a wash across all methods (~0.006s). On range queries (10K keys from 40M), FINAL still wins at 0.357s.
So why does argMax still come up? Because FINAL and argMax solve different shapes of the same problem.
FINAL deduplicates the rows and then hands you the result set. If all you need is "give me the latest version of these rows," FINAL is faster and simpler. But FINAL only deduplicates. It doesn't aggregate. If you want "the latest name for each user AND the count of their events grouped by month," you need a GROUP BY on top of the FINAL, which means ClickHouse deduplicates first and then aggregates, two passes over the data.
argMax does both in one pass. It deduplicates and aggregates in the same GROUP BY. That's its real advantage: when the query already needs a GROUP BY for business logic, argMax lets you fold dedup into it instead of layering it on top. The tradeoff is more verbose SQL, you write argMax(col, _version) for every column you want to deduplicate, and on pure dedup queries without aggregation, it's slower than FINAL.
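For example, against the user_profiles schema from the dimension-table section below, one GROUP BY can resolve the latest attributes and summarize across versions in a single pass:

```sql
-- argMax picks the latest version of each attribute; plain aggregates
-- summarize across all versions of the row, all in one GROUP BY.
SELECT
    id,
    argMax(name, version) AS latest_name,
    argMax(plan, version) AS latest_plan,
    max(updated_at)       AS last_change
FROM user_profiles
GROUP BY id;
```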
I used to recommend argMax as the default. The FINAL optimizations in recent ClickHouse versions changed that. Today: FINAL for straightforward dedup queries, argMax when you're already doing a GROUP BY and want to combine dedup with aggregation in one shot.
OPTIMIZE TABLE FINAL vs SELECT FINAL
One word in common. Completely different operations. I've seen people run OPTIMIZE TABLE FINAL on a cron job, thinking it's the same as using FINAL in queries. It's not.
| | OPTIMIZE TABLE FINAL | SELECT ... FINAL |
|---|---|---|
| What it is | DDL. Physically merges all parts on disk. | Query modifier. Deduplicates at read time. |
| Writes data? | Yes. Decompresses, merges, recompresses. | No. Read-only. |
| Cost | Massive. Blocks other merges. | Per-query overhead. |
| When to use | Maintenance windows. Bulk load cleanup. | Every query. |
OPTIMIZE TABLE FINAL is for maintenance. Run it after a big backfill or during off-peak hours when you want to compact parts. Do not run it on a schedule against production tables that are actively being written to. The merge is single-threaded, the I/O is enormous, and it blocks other background merges while it runs.
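If you do need a manual compaction, scope it to one partition instead of the whole table; with the monthly partitioning from the earlier example, that looks like:

```sql
-- Compact a single closed monthly partition after a backfill,
-- instead of rewriting the entire table.
OPTIMIZE TABLE events PARTITION 202601 FINAL;
```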
Production patterns
Four patterns that cover most of what we've needed. The code is copy-pasteable, and the settings are the ones we actually run.
CDC from Postgres
The bread and butter. Source database changes stream in through CDC (Debezium, PeerDB, a custom connector, whatever), and ClickHouse stores the latest state.
CREATE TABLE cdc_events (
_version UInt64,
_is_deleted UInt8,
id UInt64,
event_type LowCardinality(String),
created_at DateTime64(3),
payload String
) ENGINE = ReplacingMergeTree(_version, _is_deleted)
PARTITION BY toYYYYMM(created_at)
ORDER BY (event_type, toDate(created_at), id)
SETTINGS min_age_to_force_merge_seconds = 86400;
id at the end of ORDER BY for dedup. Low-cardinality columns first for the sparse index. Partition by month so the same id stays in the same partition, assuming created_at doesn't change for a given row. If it does, you need a different partition strategy. Remember to set do_not_merge_across_partitions_select_final = 1 at the session or query level for FINAL performance.
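A typical read against this table, with the setting applied at query level rather than session level (filter values are hypothetical):

```sql
SELECT id, payload
FROM cdc_events FINAL
WHERE event_type = 'purchase'
  AND created_at >= '2026-01-01'
SETTINGS do_not_merge_across_partitions_select_final = 1;
```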
Dimension tables
Small lookup tables. The latest version of each entity. No need to partition.
CREATE TABLE user_profiles (
version UInt64,
id UInt64,
name String,
email String,
plan LowCardinality(String),
updated_at DateTime64(3)
) ENGINE = ReplacingMergeTree(version)
ORDER BY id;
id is the dedup key, version picks the survivor. That's it.
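An upsert is just an insert with a higher version, and a point lookup with FINAL returns the survivor (values hypothetical):

```sql
-- Two versions of the same user; version 2 wins at merge or FINAL time.
INSERT INTO user_profiles VALUES (1, 42, 'Ada', 'ada@example.com', 'free', '2026-01-10 09:00:00.000');
INSERT INTO user_profiles VALUES (2, 42, 'Ada', 'ada@example.com', 'pro',  '2026-02-01 12:00:00.000');

SELECT plan FROM user_profiles FINAL WHERE id = 42;  -- 'pro'
```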
Row policy for delete filtering
If you're using is_deleted, two lines of SQL save you from adding WHERE clauses everywhere:
CREATE ROW POLICY hide_deleted ON cdc_events
FOR SELECT USING _is_deleted = 0 TO ALL;
Every query automatically filters deleted rows. Applied early in the query pipeline, before your WHERE clause even runs. We have this on almost every table that uses is_deleted.
Force-merging old partitions
Recent data stays unmerged for fast ingestion (FINAL handles correctness at read time). Historical data benefits from being fully merged, with fewer parts for FINAL to process.
ALTER TABLE cdc_events MODIFY SETTING
min_age_to_force_merge_seconds = 86400;
Parts older than 24 hours get force-merged. Simple, effective. Historical queries get faster, and recent data stays ingestion-friendly. You can adapt the timeline depending on your needs and the ingestion pipeline.
When to use something else
ReplacingMergeTree is the default for upsert and CDC. Two engines cover the gaps.
CollapsingMergeTree takes a different approach entirely. Instead of "insert new version, old one gets cleaned up eventually," it uses double-entry bookkeeping. You insert the original row with sign=1. To update: insert a cancellation row with sign=-1, then the new row with sign=1. Queries use sum(amount * sign) and get correct results without FINAL, because the +1 and -1 cancel out mathematically.
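A sketch of that bookkeeping with a hypothetical orders table:

```sql
-- sign = 1 is a state row; sign = -1 cancels a previously inserted state row.
CREATE TABLE orders (
    id UInt64,
    amount Decimal(18, 2),
    sign Int8
) ENGINE = CollapsingMergeTree(sign)
ORDER BY id;

INSERT INTO orders VALUES (1, 100.00, 1);   -- original
INSERT INTO orders VALUES (1, 100.00, -1);  -- cancellation: must match the original
INSERT INTO orders VALUES (1, 120.00, 1);   -- new version

-- Correct without FINAL: the +1/-1 pair cancels out of the sum.
SELECT sum(amount * sign) FROM orders;  -- 120
```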
| | ReplacingMergeTree | CollapsingMergeTree |
|---|---|---|
| Insert pattern | Insert the new version | Cancel old + insert new |
| Previous state needed? | No | Yes, must know old row to cancel it |
| Deletes | is_deleted flag + cleanup | Native, cancellation row |
| MVs work correctly? | Only uniq()-like | Counts and sums work |
| Application complexity | Low | High |
The tradeoff is clear. CollapsingMergeTree gives you correct aggregations without FINAL and proper MV support, but your application has to track previous row state to emit cancellations. If you can do that and your delete volume is high, it's the better engine.
AggregatingMergeTree solves a different problem. It stores pre-computed aggregate function states for a fundamentally different purpose: "give me pre-aggregated metrics" vs "give me the latest version of this row." Not interchangeable with ReplacingMergeTree. Use it for repeated GROUP BY aggregations where you want sub-millisecond reads.
The Postgres migration traps
This is the reference table I wish existed when I first moved a Postgres CDC pipeline to ClickHouse:
| Postgres/MySQL | ClickHouse ReplacingMergeTree |
|---|---|
| UPDATE modifies the row in place | INSERT a new row with a higher version |
| DELETE removes the row | INSERT a row with is_deleted=1, higher version |
| PRIMARY KEY enforces uniqueness | No uniqueness enforcement. Eventual dedup. |
| INSERT ON CONFLICT = immediate | INSERT = eventual (duplicates visible until merge) |
| Single source of truth | Multiple versions coexist until merge |
| Any column is mutable | ORDER BY columns are immutable |
Four traps specifically.
The first one catches everyone. You query without FINAL. COUNT() returns 2x what you expect. You don't notice because the data looks plausible, it's the right shape, right columns, right date ranges. Just inflated. I've seen this go undetected for weeks until someone cross-referenced with the raw data, and the numbers didn't match.
Mutable ORDER BY columns are the second. In Postgres, update any column you want. In ClickHouse, if an ORDER BY column value changes for a logical row, the old row and the new row now have different dedup keys. They're different rows as far as the engine is concerned. Two permanent copies, forever, no merge will resolve them. The fix is to only put immutable columns in ORDER BY, include the source PK at the end, and if you genuinely need to change a key column, prepare to suffer.
Third: cross-partition duplicates. If created_at is part of your PARTITION BY expression and it can change (maybe the source system corrects timestamps retroactively), the old and new versions end up in different partitions. Dedup is partition-scoped. Those copies are permanent.
The last one is subtle, but painful. ClickHouse columns aren't nullable by default. A column declared as String without Nullable() will silently convert NULL to an empty string. UInt64 converts NULL to 0. Date converts NULL to 1970-01-01. Your stream sends a NULL because the Postgres column is null, ClickHouse stores a zero, and you find out when a customer asks why their report shows transactions dated January 1970.
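The fix is explicit at schema time; a minimal sketch of the difference:

```sql
-- Without Nullable, a NULL from the source is coerced to the type's default
-- ('' / 0 / 1970-01-01) when input_format_null_as_default is on, which it is
-- by default for most input formats.
CREATE TABLE t_lossy (id UInt64, note String) ENGINE = MergeTree ORDER BY id;

-- With Nullable, the NULL survives, at a small storage and CPU cost.
CREATE TABLE t_safe (id UInt64, note Nullable(String)) ENGINE = MergeTree ORDER BY id;
```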
Conclusions
ReplacingMergeTree handles row-level dedup with almost no application complexity. Insert the new version, and the engine sorts it out. But it does punish you if you bring Postgres assumptions to the table. Eventual dedup, immutable ORDER BY columns, partition-scoped merges. These are the costs of columnar performance at scale, not bugs.
Use FINAL on every query. Design immutable ORDER BY keys. Think through your MV interactions. That's the short version.
Related posts
How We Built a Stateless Distributed Cache for ClickHouse
In a stateless distributed database, the network is the bottleneck — not CPU or memory. We built a distributed cache mesh where every ClickHouse node contributes its local NVMe to a shared cache layer.
Data Modelling in ClickHouse: Engines, Tables, and Materialisations
How to choose between MergeTree engines, design ORDER BY and PARTITION BY for performance, and use materialised views for pre-aggregation. A practical guide from running ClickHouse at petabyte scale.