PostgreSQL in the Cloud through the Eyes of an Oracle DBA: Why is your Aurora cluster lagging, and why can't you turn off RDS backups?
The Real Differences between Self-hosted, RDS, and Aurora. An honest breakdown of Managed Postgres architecture.
Feb 13, 2026
Introduction: A Paradigm Shift
The database market has changed irreversibly. If 10 years ago Oracle was the "Gold Standard" for the Enterprise, today PostgreSQL is the "Linux of Databases." But for an experienced Oracle DBA, the transition to Postgres (especially in the AWS cloud) is often a culture shock.
You are no longer the "Guardian of the Temple," tuning init.ora and managing physical data files. In the cloud, your role transforms into a DBRE (Database Reliability Engineer). The focus shifts from "how do I add a data file" to "how does architecture impact latency, cost, and RTO."
This article is a bridge between the Oracle world and the three faces of PostgreSQL:
- Self-hosted: (EC2/K8s) — Full control, full responsibility.
- Amazon RDS: Managed service, classic architecture.
- Amazon Aurora: Cloud-native database with distributed storage.
We will break down why Aurora is sometimes slower than RDS, why you can't simply "turn off backups," and how a single misunderstanding of MVCC can kill production.
1. The 2-Minute Mental Model
Self-hosted PostgreSQL
- Analogy: Classic Oracle on Bare Metal or VM.
- Ownership: You own the entire stack: OS, FS (XFS/Ext4), kernel settings (hugepages), storage (LVM/RAID), and networking.
- The Pain: Backup, HA, Failover, Patroni, PgBouncer — you assemble this Lego set yourself. You are responsible for ensuring `wal_keep_size` doesn't fill up the disk.
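A sketch of what that responsibility looks like in `postgresql.conf` (values are illustrative, not a recommendation; parameter names are PostgreSQL 13+, and the pgBackRest stanza name is hypothetical):

```ini
# postgresql.conf -- illustrative values only
wal_keep_size = '2GB'            # WAL retained for lagging streaming replicas
max_slot_wal_keep_size = '10GB'  # cap on WAL pinned by replication slots (PG13+)
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'  # ship WAL off-host
```

The `max_slot_wal_keep_size` cap is the self-hosted safety net against the "forgotten replication slot fills the disk" scenario covered later in this article.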
Amazon RDS for PostgreSQL
- Analogy: "Oracle DBAaaS."
- Essence: This is "Vanilla" PostgreSQL. The engine behavior is 99% identical to the upstream version.
- Limitations: No OS access (SSH). No real `superuser` (you get `rds_superuser`). Storage is a classic EBS volume attached to the instance.
- The Plus: AWS handles OS patching, minor version upgrades, and backups.
Amazon Aurora PostgreSQL-Compatible
- Analogy: Oracle RAC (but shared-storage, not shared-cache) + Exadata Storage Server (conceptually).
- Architecture: "Compute is separate from Storage."
- Storage: Data does not reside on the instance disk. It lives in a Distributed Cluster Volume.
- Data is written 6 ways across 3 Availability Zones (AZs).
- Storage grows automatically (in 10 GiB steps) up to 128 TiB.
- Important: In modern versions, Aurora storage supports dynamic resizing: the allocated volume shrinks and the bill goes down if you delete data.
- Failover: When an instance crashes, data doesn't need to be "restored" or "remounted." The new instance simply connects to the same shared storage.
Quick Comparison Table (Cheat Sheet #1)
| Area | Self‑hosted | RDS for PostgreSQL | Aurora PostgreSQL |
|---|---|---|---|
| OS access | Full | No | No |
| PostgreSQL superuser | Full | No (rds_superuser only) | No (managed, roles similar to RDS) |
| Extensions | Any (incl. custom C) | Only supported | Only supported (delegated extension management) |
| Storage | You choose (NVMe/EBS/RAID/…) | EBS | Distributed cluster volume (auto‑grow) |
| HA in region | You build | Multi‑AZ DB instance (1 standby) or Multi‑AZ DB cluster (2 readable standbys) | Writer + replicas + storage‑level durability |
| Read scaling | Replicas/sharding — you build | Read replicas (in/cross‑region), Multi‑AZ DB cluster reads from standbys | Up to 15 replicas + reader endpoint |
| Endpoints | Your VIP/LB/Proxy | DB endpoint, replica endpoints | Writer endpoint + reader endpoint + instance endpoints |
| Backups / PITR | You build and test | Automated backups 0‑35 days + manual snapshots | Automated backups 1‑35 days, cannot disable |
| Multi‑region DR | You build | Cross‑region read replicas | Aurora Global Database (async replication) |
| Upgrade/patching | You | AWS‑managed window | AWS‑managed window |
| DBA focus | “DBA + sysadmin + HA engineer” | “DB reliability engineer” | “DB reliability engineer + endpoint/failover routing discipline” |
2. Deep Dive: Internals That Kill an Oracle DBA's Career
Here we dive into things that work "automatically" or "differently" in Oracle.
2.1 MVCC and Bloat: Why DELETE is a Lie
In Oracle, you are used to Undo Segments. The old version of a row sits in Undo, the new one in the data block. COMMIT confirms the new version, and Undo is overwritten over time. Clean and tidy.
In PostgreSQL (The Postgres Reality):
- MVCC: Old and new row versions (tuples) live together in the same table heap.
- DELETE: This is not a physical deletion. It merely marks the row version as "dead" by setting `xmax`; the tuple stays in the heap.
- UPDATE: This is an `INSERT` of a new version + marking the old one as "dead."
The Main Trap (Bloat): Oracle DBA thinks: "I deleted 10 million rows and Committed. Space is freed." Postgres DBA knows: "I deleted 10 million rows. Now my table is 50% garbage (dead tuples), the physical file size hasn't decreased, and table scans are twice as slow."
The Role of VACUUM (Housekeeping): This is the garbage collector process. It scans the table, finds dead rows, and marks the space as "free for reuse" (FSM - Free Space Map).
- Gotcha: Standard `VACUUM` (and autovacuum) does not return space to the OS. It just makes "holes" in the file available for new inserts.
- To actually shrink the file, you need `VACUUM FULL` (exclusive table lock, downtime) or `pg_repack` (works by creating a new table and swapping, requires 2x space).
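To see the bloat story in the statistics, a quick check against `pg_stat_user_tables` (a standard catalog view, works on self-hosted, RDS, and Aurora alike):

```sql
-- Dead-tuple ratio per table: a high n_dead_tup with a stale last_autovacuum
-- is the classic "I deleted 10M rows and nothing shrank" signature.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A consistently high `dead_pct` on a hot table means autovacuum is not keeping up and per-table autovacuum settings (or `pg_repack`) deserve a look.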
2.2 Processes vs. Threads: Why PgBouncer is Mandatory
Oracle Way: Thread architecture or Dedicated Servers. Oracle has a powerful Shared Pool and easily handles 5,000 connections, even if they are idle.
Postgres Reality:
- Process-per-connection architecture.
- Every connection is a forked OS process (`fork()`).
- Every connection consumes RAM + kernel structures.
- Context Switching: If you have 5,000 active processes, the Linux scheduler goes crazy. CPU time goes to `sys`, not `user`.
The Solution: In the Postgres world (and RDS/Aurora), using PgBouncer (or RDS Proxy) is mandatory for high-load systems.
- Analogy: PgBouncer in `transaction pooling` mode does what Oracle Shared Server (MTS) does — multiplexes thousands of client connections onto a few hundred real DB connections.
- Mistake: Connecting an app (especially Serverless/Lambda) directly to Postgres without a pooler is a guaranteed way to kill the database during a traffic spike.
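A minimal `pgbouncer.ini` sketch of transaction pooling (host, database name, and pool sizes are illustrative; authentication settings are omitted for brevity):

```ini
[databases]
app = host=127.0.0.1 port=5432 dbname=app   ; hypothetical target database

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; multiplex, like Oracle Shared Server (MTS)
max_client_conn = 5000       ; client connections PgBouncer will accept
default_pool_size = 50       ; real Postgres connections per db/user pair
```

The point of the numbers: 5,000 app connections collapse into ~50 backend processes, which is exactly the ratio that keeps the Linux scheduler sane. In production you would also configure `auth_type` / `auth_file`.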
2.3 TOAST: The Hidden Enemy of SELECT *
Oracle has LOBs. Postgres has TOAST (The Oversized-Attribute Storage Technique).
- If a row is wider than a page (usually 8KB), Postgres automatically slices long fields (TEXT, JSONB) and moves them to a hidden TOAST table.
- The Problem: This is transparent. You run `SELECT * FROM users`. If the table has "fat" JSON fields, Postgres is forced to go through the TOAST index and read a bunch of extra blocks for every row.
- Advice: Avoid `SELECT *` like the plague. Select only the columns you need. In Oracle, this is "good manners"; in Postgres, it's a matter of I/O survival.
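To check how much of a table actually lives out-of-line in TOAST, a quick catalog query (the table name `users` is hypothetical, substitute your own):

```sql
-- Heap size vs. TOAST size for one table; a large toast_size means
-- SELECT * is paying for blocks you probably don't need.
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'users'
  AND c.reltoastrelid <> 0;
```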
3. Caching Layer: Redis/Memcached (The New Reality)
In the classic Oracle Enterprise world, we often relied on the Buffer Cache (SGA). It is huge, smart, and fast. In the world of Cloud Native apps (Mobile/Web), there is almost always a caching layer in front of the database.
Why isn't Postgres enough?
- Connections: Redis holds 100k+ connections easily. Postgres does not.
- Latency: Reading from Redis memory is <1ms. Reading from Postgres (even from Shared Buffers) involves overhead for SQL parsing, planning, and MVCC visibility checks.
- Cost: Scaling Redis is cheaper than scaling an Aurora Writer.
Architectural Patterns (What a DBRE Must Know):
- Read-Through / Look-Aside: App checks Redis first. If miss -> goes to DB and populates Redis.
- DBA Risk: Thundering Herd. If Redis crashes or you flush it (FlushAll), all traffic instantly hits Postgres. The DB dies in seconds.
- Verdict: Redis is critical infrastructure. You can't "just restart it."
- Write-Behind: App writes to a queue/cache, workers write to the DB.
- DBA Risk: Data loss on cache failure. Eventual consistency.
Tools in AWS:
- Amazon ElastiCache (Redis/Memcached) or Amazon MemoryDB (Redis with durability).
- For a DBRE, this is a "black box" that offloads `SELECT` load from the DB but adds complexity to data invalidation.
4. Backup & Replication: Answering the Skeptics (Deep Dive)
Cloud critics often say: "In RDS, you manage a black box, not a database. You can't even configure a proper backup; it's just a VM snapshot."
This is a fundamental misconception born from a misunderstanding of Managed Services mechanics. Let's break down how backups and replication work "under the hood," using terms like RMAN, Data Guard, and ASM.
4.1 Backups: The Myth of the "VM Snapshot"
Self-Hosted (pgBackRest / WAL-G)
- Gold Standard: You use `pgBackRest`. It performs a physical backup (`pg_basebackup` style) + archives WAL files as they are generated (`archive_command`).
- Plus: Full control. Differential and incremental backups possible.
- Minus: You are responsible for storage (S3/NFS), rotation, integrity checks, and most importantly, ensuring `archive_command` doesn't fail and fill up the local `pg_wal`.
RDS for PostgreSQL (Volume Snapshot + WAL)
- Reality: This is not just a "VM snapshot" (which freezes I/O). It is an orchestration of Storage Volume Snapshot (EBS) + WAL Archiving.
- Once a day (backup window), AWS takes an EBS Snapshot of the data volume. This is a block-level copy.
- Every 5 minutes (or more frequently), AWS copies Transaction Logs (WAL) to S3.
- PITR (Point-in-Time Recovery): When you ask to restore the DB to `14:32:00`, AWS takes the most recent nightly snapshot, hydrates it to a new volume, and "replays" WAL files from S3 over it up to the exact second.
- Verdict: This is a full-fledged physical backup with recovery to any second (RPO ~5 min, RTO depends on WAL volume). It is a managed implementation of `pg_basebackup` + continuous archiving.
Aurora (Continuous Backup)
- The Magic: There are no "data files" in the traditional sense. Aurora storage is a log-structured system.
- Mechanism: Aurora storage continuously accepts records (Redo Log Records). A backup is simply a "pointer" to the history of changes in the distributed storage.
- Restore: Restoration is almost instant (no data copying) or via cloning (Copy-on-Write). You don't "restore files"; you create a new volume pointing to the data state in the past.
4.2 Why Can't I Turn Off RDS Backups?
In Oracle RMAN, you decide: backup or not. In RDS, you can technically set backup retention to 0, but you lose features along with it; in Aurora, Automated Backups (1-35 days retention) cannot be disabled at all.
Reason: The WAL Chain (The Chain of Life). RDS/Aurora rely on the WAL chain (transaction log chain) for Point-in-Time Recovery (PITR) and for Read Replicas to function.
- The service constantly streams WAL files to S3.
- If you turn off backups (retention = 0), AWS deletes the accumulated WALs.
- Consequence: You lose the ability to do PITR. In Aurora, this can also break the ability to create new replicas or clone the cluster.
- In Aurora, storage architecture is inextricably linked to the concept of "continuous backup." The data in the storage volume is essentially the redo log stream. You can't say "don't backup" because that is part of the write protocol.
4.3 Replication & HA: The Technology Matrix (Where is Physics, Where is Logic?)
It's important not to get confused: in RDS, the terms "Multi-AZ" and "Read Replica" are two different technologies living under one roof.
- Self-Hosted: Physical Streaming Replication (The Classic)
- Doubt: "Is Self-Hosted replication logical?"
- Answer: No! The industry standard for HA and Read Replicas is Physical Replication.
- Mechanism: Master sends a stream of WAL records (bytes, block changes) to the Replica. The Replica applies them (`Redo`). This gives an exact binary copy of the cluster.
- Logical Replication: Native logical replication (and the `pglogical` extension) exists but is used for migrations (CDC), ETL, or partial table replication. Building HA on logical replication is "bad practice" (slow, no DDL replication).
- HA: Postgres itself cannot handle Failover. You install Patroni + etcd. Patroni manages physical replication and promotes the replica if the master dies.
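Physical streaming replication exposes its state through a standard catalog view. A quick health check, run on the primary (works on self-hosted and on an RDS primary with read replicas; PostgreSQL 10+ column names):

```sql
-- One row per connected physical replica; replay_lag shows how many
-- bytes of WAL the replica has yet to apply.
SELECT client_addr,
       application_name,
       state,
       sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```

This is the view Patroni itself consults when deciding which replica is healthy enough to promote.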
- RDS: Two Different Technologies in One Box
- A) RDS Multi-AZ (Instance) = High Availability (HA Only)
- Goal: Disaster Recovery.
- Technology: Synchronous Block Replication (EBS level, analogous to DRBD / ASM Mirroring).
- How it works: You `COMMIT`. Data is written to the Primary disk and synchronously to the Standby disk in another AZ.
- Standby: The Postgres process on the Standby is not running (or not open). It does not accept SQL. You cannot read from a Multi-AZ Standby.
- Plus: RPO=0 (guaranteed data safety).
- Minus: You pay for a second instance that just "warms the air" waiting for a crash.
- B) RDS Read Replica = Read Scaling
  - Goal: Offload `SELECT` traffic from the master.
  - Technology: PostgreSQL Native Streaming Replication (same as Self-hosted).
  - How it works: Primary streams WAL to the Replica (async).
  - Standby: Here Postgres is open in `Hot Standby` mode. You can connect and read.
  - Important: RDS Read Replica does not have automatic Failover (by default). If the Master dies, the Read Replica remains an "orphan" until you manually Promote it.
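Because an RDS read replica is plain hot standby, lag can also be measured from the replica's side, with standard functions rather than CloudWatch (a sketch; run it on the replica itself):

```sql
-- Wall-clock replay delay on a hot standby. NULL replay timestamp means
-- no WAL has been replayed since the connection/restart.
SELECT pg_is_in_recovery()                        AS is_replica,
       now() - pg_last_xact_replay_timestamp()   AS replay_delay;
```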
- Aurora: Shared Storage Replication
- Technology: Quorum Writes at the storage level.
- Essence: There is no "Block Replication" between instances and no "Streaming Replication" of data (only metadata).
- Writer writes to shared distributed storage.
- Readers (up to 15) are connected to the same storage.
- Uniqueness: In Aurora, any Reader is simultaneously a Target for HA Failover and a point for Read Scaling. It merges the two worlds.
Verdict for the Skeptic:
- RDS uses Volume Backup + WAL (like RMAN).
- RDS Multi-AZ is Synchronous Disk Mirroring (like ASM Mirroring).
- RDS Read Replica is essentially Active Data Guard (apply lag, open read-only).
Quick Comparison Table (Cheat Sheet #2)
| Feature | Self-hosted (Patroni) | RDS Multi-AZ (Standby) | RDS Read Replica | Aurora Replica |
|---|---|---|---|---|
| Technology | Physical WAL Streaming | Block Level Mirroring | Physical WAL Streaming | Shared Storage Quorum |
| Readable? | Yes (Hot Standby) | No (process closed) | Yes (Hot Standby) | Yes (Hot Standby) |
| Replication Type | Async (usually) or Sync | Synchronous | Asynchronous | Async (lag < 20ms) |
| Failover | Auto (Patroni) | Auto (AWS DNS) | Manual (usually) | Auto (AWS DNS) |
| Lag | Depends on network/load | 0 (Zero) | Depends on network/load | Minimal (ms) |
| Backups | Your scripts (pgBackRest) | Automated Snapshots + WAL (PITR) | N/A | Automated (Continuous Redo Stream) |
5. Why is Aurora Slow? (Honest Breakdown)
You migrated from RDS to Aurora expecting the "x5 performance" from the ads, but got degradation. Why?
- I/O Latency (Network Hop): In RDS (EBS), the disk is connected "close by." In Aurora, every `COMMIT` requires writing a quorum (4 out of 6 copies) over the network to 3 different Availability Zones.
  - Reality Check: For simple queries (single-row insert), latency can be higher than local NVMe. Aurora wins on concurrency (thousands of parallel transactions), not on single-thread speed.
- I/O Cost (Bill Shock): In classic Aurora Standard, you pay for every million I/O operations.
- Scenario: You have a "bad" query plan doing a Sequential Scan on a large table, or you forgot an index.
- Result: The query works, but the I/O counter spins wildly. At the end of the month, you get a $10,000 bill just for I/O.
- Solution: Switch to Aurora I/O Optimized (higher compute/storage cost, but $0 for I/O requests) — this saves the budget for write-heavy systems.
- DNS Caching during Failover: Aurora promises failover in 30 seconds. But your Java app "hangs" for 15 minutes.
- Reason: JVM (Java) caches DNS forever (TTL=infinite) by default. The driver keeps hammering the old IP of the dead writer.
- Cure: Configure the JVM DNS TTL (`networkaddress.cache.ttl`) or use the AWS JDBC Wrapper, which understands cluster topology.
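The TTL knob lives in the JVM's security properties file, not in a normal `-D` system property (the value below is illustrative):

```properties
# $JAVA_HOME/conf/security/java.security
# Cap positive DNS caching (seconds) so the JDBC driver re-resolves
# the cluster endpoint shortly after an Aurora failover flips the DNS record.
networkaddress.cache.ttl=10
```

The same property can be set at startup in application code via `java.security.Security.setProperty("networkaddress.cache.ttl", "10")`, before any name resolution happens.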
6. Survival Guide: How Not to Kill Your Career
Scenarios that lead to termination or gray hair (Resume Generating Events).
Scenario 1: Transaction ID Wraparound (The Apocalypse)
- Essence: XID in Postgres is a 32-bit number (4 billion). It is cyclical. If old transactions are not "frozen" by vacuum, the database hits the wrap limit.
- What happens: To prevent data loss (where old becomes new), Postgres stops. It refuses to accept any write commands. At all. Even from Superuser.
- Cure: Only `VACUUM FREEZE` in single-user mode (essentially production downtime for hours/days).
- Prevention: Monitor `age(datfrozenxid)`. Never disable autovacuum. Do not allow long-running transactions (they block freezing).
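The monitoring query itself is one line against `pg_database`; the hard stop sits near 2^31 (~2.1 billion), and autovacuum's anti-wraparound pass kicks in at `autovacuum_freeze_max_age` (default 200 million):

```sql
-- Distance to XID wraparound per database; alert long before
-- pct_towards_wraparound gets anywhere near 100.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 1) AS pct_towards_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```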
Scenario 2: Replication Slot Doom
- Essence: You set up logical replication (CDC) for analytics/DMS, created a replication slot, and then the consumer crashed/turned off.
- What happens: The slot tells Postgres: "Don't delete WAL files, I need them." Postgres obediently keeps them. Forever.
- Result: The disk (EBS or Aurora Volume) fills up with WAL files. The DB stops due to "Disk Full."
- Advice: Monitoring the size and age of replication slots is mandatory. Delete unused slots.
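The check fits in one query against `pg_replication_slots`, run on the primary (the slot name in the drop command is hypothetical):

```sql
-- Inactive slots pin WAL forever; retained_wal shows how much disk
-- each slot is currently holding hostage.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Frees the WAL, but permanently breaks that consumer's replication stream:
-- SELECT pg_drop_replication_slot('unused_slot_name');
```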
Scenario 3: "But we have Auto Scaling storage!"
- Essence: RDS has a "Storage Autoscaling" checkbox. You think you are safe.
- Reality: Autoscaling has a cooldown (time between increases — 6 hours or more) and a step limit. If you load data faster (bulk load) than autoscaling triggers, you will get "Disk Full" and a stopped DB before AWS manages to add a disk.
7. Practical Core Skills Checklist
If you want to be a DBRE, not just an "AWS Console Operator":
- Top SQL: Ability to find "who is loading the DB" via `pg_stat_statements` or Performance Insights.
- Locks: Build the blocked -> blocker chain and find the root cause (often `idle in transaction`).
- Autovacuum health: Understand thresholds, see bloat, prevent wraparound.
- Planner literacy: Read `EXPLAIN (ANALYZE, BUFFERS)` and understand why an Index Scan turned into a Seq Scan.
- Capacity trends: Data growth, WAL growth, temp spills (disk writes during sorts), connection growth.
- Drills: Planned failover + restore test. If you haven't done a restore, you don't have a backup.
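The first two checklist items reduce to plain catalog queries, sketched below assuming `pg_stat_statements` is enabled (`total_exec_time` is the PostgreSQL 13+ column name; older versions call it `total_time`):

```sql
-- Top SQL: heaviest statements by cumulative execution time.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 0) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Locks: blocked -> blocker chain. The blocker's state is very often
-- 'idle in transaction', i.e. an app holding a transaction open.
SELECT waiting.pid    AS blocked_pid,
       waiting.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.state AS blocking_state,
       blocking.query AS blocking_query
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(waiting.pid));
```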
Conclusion
Moving from Oracle to PostgreSQL in the AWS cloud is not a downgrade. It is a paradigm shift. You gain incredible flexibility and ecosystem integration, but you lose the "foolproofing" that Oracle provided.
- Choose RDS if you need predictability and a "Vanilla" experience.
- Choose Aurora if you need extreme availability, fast failover, and read scaling, and you are ready to understand the nuances of I/O billing.
- Don't forget Cache (Redis) and Pooler (PgBouncer) — modern Postgres in the web world doesn't live without them.
And remember: Autovacuum is your best friend, not an enemy to be disabled.