
Offloading reporting from Aurora PostgreSQL to Redshift with Zero-ETL: what you actually need to know


If you’re running reporting queries against your Aurora PostgreSQL read replicas and feeling the pinch — whether that’s from contention with OLTP traffic, the cost of scaling replicas just to feed dashboards, or the fact that your “analytical” SQL was never really meant for a row store — Aurora PostgreSQL Zero-ETL integration with Redshift is probably on your shortlist.

The pitch is great: near real-time replication into a columnar warehouse with no pipelines to write, no Glue jobs to babysit, no DMS task tuning. But once you start planning a real migration, the questions get interesting fast. Two of the ones I keep getting asked:

  1. What happens to my NOT NULL columns? PostgreSQL enforces them strictly — does Redshift?
  2. What about indexes and performance? Half my Aurora schema is held together by indexes and foreign keys.

Let’s go through both.


1. NOT NULL: the question hiding three other questions

When people ask “how does Zero-ETL handle NOT NULL columns,” they’re usually conflating three different things. Splitting them apart is the key to a useful answer.

The Redshift target is read-only

Before anything else: the database that Zero-ETL creates in Redshift is read-only. You can’t INSERT, UPDATE, ALTER TABLE, or change constraints on the replicated tables. The integration owns them entirely. That fact reframes the whole conversation about constraints.

Question A: will NULLs ever arrive in a NOT NULL column?

No, unconditionally. Aurora enforces NOT NULL at write time, before the row hits the logical replication stream that Zero-ETL consumes. Aurora is the gate, and Redshift just receives whatever Aurora already validated.

You do not need to worry about data integrity here. If a column is NOT NULL on Aurora, no NULL value will ever land in the Redshift copy.

Question B: is the NOT NULL constraint declared on the Redshift table?

In practice, yes — the integration propagates NOT NULL onto replicated columns, and primary key columns are NOT NULL by definition (Zero-ETL requires a primary key to replicate at all).

But here’s the catch: AWS’s data type mapping table documents type translations meticulously and is silent on nullability preservation. So while you’ll observe NOT NULL on your replicated tables, AWS doesn’t formally guarantee this. Treat it as best-effort behavior, not a contract. Don’t build logic that relies on the constraint declaration being there.
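Since the docs don't guarantee nullability propagation, it's worth checking what actually landed rather than assuming. A quick sketch using Redshift's SVV_REDSHIFT_COLUMNS catalog view — the database and schema names here are placeholders for whatever your integration created:

```sql
-- Inspect the nullability Redshift actually recorded for replicated columns.
-- 'zeroetl_db' / 'public' are hypothetical names; substitute your own.
SELECT table_name,
       column_name,
       data_type,
       is_nullable
FROM svv_redshift_columns
WHERE database_name = 'zeroetl_db'
  AND schema_name   = 'public'
ORDER BY table_name, ordinal_position;
```

If a column you expected to be NOT NULL shows up as nullable here, nothing is broken data-wise (Aurora still gates writes) — but any downstream logic that reads the catalog should know.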

Question C: can you change NOT NULL on the replicated table?

No. Read-only target. ALTER TABLE ... ALTER COLUMN ... SET/DROP NOT NULL is not available to you on Zero-ETL tables. If the constraint as replicated is wrong, your only lever is the curated layer downstream.

Where NOT NULL actually becomes your problem

Since Aurora enforces the constraint and you can’t modify the replicated table anyway, NOT NULL becomes an engineering concern in the curated reporting layer you build on top — the CTAS tables, materialized views, or transformed marts that your dashboards actually query. That layer is where you re-declare the constraints your reporting logic depends on, and where a missing one can actually hurt you.
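Concretely, the curated layer is where you re-assert NOT NULL. A minimal sketch — schema, table, and column names are all hypothetical — and note that Redshift does enforce NOT NULL (unlike PK/FK, which are informational only):

```sql
-- Curated reporting table built from the read-only Zero-ETL landing zone.
-- NOT NULL here is enforced by Redshift at load time, so a bad transform
-- fails loudly instead of silently feeding NULLs to dashboards.
CREATE TABLE reporting.orders_curated (
    order_id    BIGINT    NOT NULL,
    customer_id BIGINT    NOT NULL,
    ordered_at  TIMESTAMP NOT NULL,
    total_cents BIGINT    NOT NULL
);

INSERT INTO reporting.orders_curated
SELECT order_id, customer_id, ordered_at, total_cents
FROM aurora_replica.public.orders;  -- cross-database read from the landing zone
```

The design point: push the constraints to where you control the DDL, and let a violated NOT NULL break the load job rather than the dashboard.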

Aurora PostgreSQL gotchas worth knowing

On the source side, remember that Zero-ETL skips tables without a primary key entirely, and the data type mapping isn't total: PostgreSQL types with no Redshift equivalent can keep a table from replicating. Audit your schema against AWS's mapping table before assuming full coverage.


2. Indexes and performance: rewire your mental model

This is the part where a lot of people get burned. Redshift is not Aurora-with-more-storage. It’s a columnar MPP store with a fundamentally different execution model. The instincts that serve you well in PostgreSQL — “this query is slow, let me add an index” — will lead you in completely the wrong direction.

What gets created during Zero-ETL?

No indexes from your PostgreSQL schema. Redshift doesn’t have B-tree, hash, GIN, or GiST indexes. Your PostgreSQL indexes are simply ignored.

Tables land with DISTSTYLE AUTO and SORTKEY AUTO. Redshift’s autonomic optimizer eventually chooses distribution and sort keys based on observed query patterns — but it’s reactive and conservative. For your hot reporting tables, you’ll usually do better setting these explicitly. Foreign keys do replicate, but as informational constraints only — Redshift doesn’t enforce them, though the optimizer uses them for things like join elimination.
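You can watch what AUTO has (or hasn't) chosen so far. A quick check via the SVV_TABLE_INFO system view, assuming the replicated tables live in a schema named aurora_replica:

```sql
-- Run while connected to the integration's destination database.
-- Shows the distribution style and first sort key column Redshift is
-- currently using for each replicated table, largest tables first.
SELECT "table",
       diststyle,
       sortkey1,
       tbl_rows
FROM svv_table_info
WHERE schema = 'aurora_replica'
ORDER BY tbl_rows DESC;
```

If your biggest reporting tables are still showing AUTO(EVEN) or no sort key weeks in, that's your cue to set keys explicitly in the curated layer.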

The Aurora → Redshift translation table

| Aurora PostgreSQL | Redshift equivalent |
|---|---|
| B-tree / hash / GIN indexes | SORTKEY (controls physical sort order) + zone maps (automatic min/max per 1 MB block) |
| (no concept) | DISTKEY — controls how rows shard across compute nodes; critical for join performance |
| Foreign keys (enforced) | Foreign keys (informational only — optimizer hint, not enforced) |
| Primary keys (enforced unique) | Primary keys (informational only) |

The two concepts that actually drive Redshift performance are DISTKEY and SORTKEY. DISTKEY decides which column rows are distributed on across compute nodes, so joins on that column stay node-local instead of shuffling over the network. SORTKEY decides the physical order of rows on disk, which is what lets zone maps skip entire blocks for range and equality filters.

Tables with no indexes in PostgreSQL

These replicate fine. Performance in Redshift has nothing to do with whether the source had indexes. It depends entirely on DISTKEY/SORTKEY choice and how columnar scans plus zone maps prune blocks for your queries.

What’s actually faster?

Rough mental model: wide scans, aggregations, and large joins over millions of rows get dramatically faster on a columnar MPP store, while highly selective point lookups — the "fetch one order by id" queries a B-tree answered in microseconds — get relatively slower, because Redshift scans zone-map-pruned blocks rather than seeking to a single row. If your reporting workload is the former (and it usually is), the trade is heavily in your favor.

The catch with Zero-ETL tables

You can’t ALTER a Zero-ETL replicated table to set DIST/SORT keys. The target is read-only. So how do you get the performance benefits?

The pattern is to treat replicated tables as a landing zone, not as your reporting surface. Build a curated layer on top:

  1. Land Zero-ETL replicated tables in a dedicated schema (e.g., aurora_replica).
  2. For high-volume reporting tables, build a curated layer (regular Redshift tables, or materialized views) with explicit DISTKEY on the join column and SORTKEY on the most common filter column — usually a timestamp.
  3. Point dashboards at the curated layer, not the raw replicated tables.
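The steps above can be sketched as a single materialized view over the landing zone. Everything here — schema, table, and column names — is illustrative:

```sql
-- Curated rollup over the replicated landing zone: explicit keys,
-- pre-aggregated, refreshable as the Zero-ETL copy advances.
CREATE MATERIALIZED VIEW reporting.daily_revenue
DISTKEY (customer_id)
SORTKEY (order_day)
AS
SELECT o.customer_id,
       DATE_TRUNC('day', o.ordered_at) AS order_day,
       SUM(o.total_cents)              AS revenue_cents,
       COUNT(*)                        AS order_count
FROM aurora_replica.public.orders o
GROUP BY 1, 2;

-- Later, on whatever cadence your dashboards need:
REFRESH MATERIALIZED VIEW reporting.daily_revenue;
```

Dashboards query reporting.daily_revenue; they never see the raw replicated table, so source schema changes stay behind the view.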

This gives you all the things Zero-ETL doesn’t: explicit DIST/SORT keys, NOT NULL enforcement on the columns you care about, the ability to denormalize and pre-aggregate, and a stable interface for your BI tools that’s decoupled from the source schema.


3. Retries, failures, and duplicates: what actually happens when things go wrong?

This is a great question to ask, especially since you’ve already noticed that Redshift treats PRIMARY KEY and UNIQUE as informational only. If Redshift won’t reject a duplicate insert, what stops a retried replication event from creating one?

The short answer: the integration is upsert-based, keyed on the primary key, not append-based — and that’s why Zero-ETL requires a primary key in the first place. Let’s unpack that.

Why the primary key requirement isn’t just bureaucracy

You may have noticed Zero-ETL refuses to replicate any table without a primary key. That’s not a stylistic choice — it’s the mechanism that makes the replication idempotent. From the AWS docs on handling tables without primary keys: “Primary keys are required for zero-ETL integrations because they tie each change log event to the specific row being modified.”

Each row in the Aurora WAL is identified by its primary key. When the integration applies a change event to Redshift, it’s effectively doing a keyed merge — INSERT ... ON CONFLICT-style logic, conceptually — not a blind INSERT. So if a network blip causes the same change event to be delivered twice, the second application is a no-op (or an idempotent overwrite to the same value), not a duplicate row.

This is a subtle but important architectural detail: Redshift doesn’t enforce uniqueness, but the replication apply layer does, by virtue of always merging on the primary key.
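Conceptually, each change event is applied with merge semantics on the primary key — something like the following (illustrative only; the real apply path is internal to the integration, and the table and column names are made up):

```sql
-- What "upsert-based, keyed on the primary key" means in effect:
-- delivering the same change event twice converges to the same row state.
MERGE INTO target.orders t
USING incoming_change c
   ON t.order_id = c.order_id        -- primary key from the source row
WHEN MATCHED THEN
    UPDATE SET status = c.status, total_cents = c.total_cents
WHEN NOT MATCHED THEN
    INSERT (order_id, status, total_cents)
    VALUES (c.order_id, c.status, c.total_cents);
```

Run this twice with the same incoming row and the second pass is a no-op overwrite — which is exactly why a replayed batch can't create duplicates.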

What “retry” actually looks like in Zero-ETL

There are three different failure granularities, and each behaves differently:

1. Transient stream-level retries (the common case). The replication stream is checkpointed against Aurora’s WAL position (LSN). If the integration fails to apply a batch, it retries from the last acknowledged LSN. Because each event is keyed on the PK, replaying events that already landed is safe. You won’t see duplicates; at worst you’ll see momentary lag while the retry catches up.

2. Table-level resync (ResyncRequired state). Certain operations on the source — adding a column in a specific position, adding a column with DEFAULT CURRENT_TIMESTAMP, a multi-operation ALTER TABLE — put a single table into ResyncRequired. The integration doesn’t append; it fully reloads that table from source. While the resync is happening, the table is unavailable for queries (unless you set QUERY_ALL_STATES=TRUE). When it finishes, you have a fresh, consistent copy. No duplicates, by construction — the old data was dropped first.

3. Integration-level failure (Failed state / Needs attention). This happens when the integration itself can’t proceed: tracked changes between source and target diverged, authorization broke, or the source was deleted. AWS’s prescribed fix for several of these is “delete the integration and create a new one”, which fully re-initializes the pipeline. Again, full reload, not append. No duplicates.

So where could duplicates actually sneak in?

In the steady-state Zero-ETL flow, they shouldn’t — assuming a real primary key. Where duplicates can actually appear is in the layer you control: an incremental load into a curated table that appends instead of merging on the key, or a CTAS refresh that unions old and new data. Redshift won’t reject those rows, because its PRIMARY KEY and UNIQUE constraints are informational only.
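Since nothing in Redshift will flag a uniqueness violation for you, a cheap assertion query in your curated-layer job is worth adding (table and key names are hypothetical):

```sql
-- Fail-fast check: any key appearing more than once in the curated table
-- indicates a broken incremental load, not a Zero-ETL bug.
SELECT order_id, COUNT(*) AS copies
FROM reporting.orders_curated
GROUP BY order_id
HAVING COUNT(*) > 1;
```

Wire this into the refresh job and abort (or alert) on any rows returned.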

One mental model that helps

Think of Zero-ETL as a state-replicating system, not a log-replaying system. It’s continuously trying to make Redshift’s copy of each row equal Aurora’s copy of that row, identified by primary key. Retries don’t accumulate — they re-converge. That’s a fundamentally different shape from a “stream of events appended to a sink” system, which is what would expose you to duplicates if uniqueness weren’t enforced.

That state-replicating shape is exactly why the absence of constraint enforcement in Redshift doesn’t bite you for the replicated tables. It only bites you in the layers you build on top — and those, you control.


Putting it together

A practical setup looks something like this:

Aurora PostgreSQL ──Zero-ETL──▶ Redshift `aurora_replica` schema   (read-only landing zone)
                                          │
                                          │  CTAS / materialized views with explicit
                                          │  DISTKEY, SORTKEY, NOT NULL constraints
                                          ▼
                                Redshift `reporting` schema         (curated layer)
                                          │
                                          ▼
                                      Dashboards

A few rules to internalize:

  1. Treat the Zero-ETL target as a read-only landing zone; never point dashboards at it directly.
  2. Set DISTKEY and SORTKEY explicitly in the curated layer instead of waiting for AUTO to converge.
  3. Trust Aurora for data integrity, not the replicated DDL: NULLs can’t arrive, but don’t build logic on constraint declarations being present.
  4. Expect full-table resyncs after certain source DDL, and design dashboards to tolerate momentary lag or unavailability.

Zero-ETL is a great primitive. But it’s a replication mechanism, not a reporting platform. The win comes from pairing it with a deliberately designed curated layer — which is the part you’d be writing anyway, just without the burden of also building and maintaining the pipeline that feeds it.

