PostgreSQL in the Cloud through the Eyes of an Oracle DBA: Why is your Aurora cluster lagging, and why can't you turn off RDS backups?
The Real Differences between Self-hosted, RDS, and Aurora. An honest breakdown of Managed Postgres architecture.
Feb 13, 2026
Introduction: A Paradigm Shift
The database market has changed irreversibly. If 10 years ago Oracle was the "Gold Standard" for the Enterprise, today PostgreSQL is the "Linux of Databases." But for an experienced Oracle DBA, the transition to Postgres (especially in the AWS cloud) is often a culture shock.
You are no longer the "Guardian of the Temple," tuning init.ora and managing physical data files. In the cloud, your role transforms into a DBRE (Database Reliability Engineer). The focus shifts from "how do I add a data file" to "how does architecture impact latency, cost, and RTO."
This article is a bridge between the Oracle world and the three faces of PostgreSQL:
- Self-hosted: (EC2/K8s) — Full control, full responsibility.
- Amazon RDS: Managed service, classic architecture.
- Amazon Aurora: Cloud-native database with distributed storage.
We will break down why Aurora is sometimes slower than RDS, why you can't simply "turn off backups," and how a single misunderstanding of MVCC can kill production.
1. The 2-Minute Mental Model
Self-hosted PostgreSQL
- Analogy: Classic Oracle on Bare Metal or VM.
- Ownership: You own the entire stack: OS, FS (XFS/Ext4), kernel settings (hugepages), storage (LVM/RAID), and networking.
- The Pain: Backup, HA, Failover, Patroni, PgBouncer — you assemble this Lego set yourself. You are responsible for ensuring `wal_keep_size` doesn't fill up the disk.
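A sketch of what that responsibility looks like in `postgresql.conf` (values are illustrative, not a recommendation; parameter names are PostgreSQL 13+, and the pgBackRest stanza name is hypothetical):

```ini
# postgresql.conf -- illustrative values only
wal_keep_size = '2GB'            # WAL retained for lagging streaming replicas
max_slot_wal_keep_size = '10GB'  # cap on WAL pinned by replication slots (PG13+)
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'  # ship WAL off-host
```

The `max_slot_wal_keep_size` cap is the self-hosted safety net against the "forgotten replication slot fills the disk" scenario covered later in this article.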
Amazon RDS for PostgreSQL
- Analogy: "Oracle DBAaaS."
- Essence: This is "Vanilla" PostgreSQL. The engine behavior is 99% identical to the upstream version.
- Limitations: No OS access (SSH). No real `superuser` (you get `rds_superuser`). Storage is a classic EBS volume attached to the instance.
- The Plus: AWS handles OS patching, minor version upgrades, and backups.
Amazon Aurora PostgreSQL-Compatible
- Analogy: Oracle RAC (but shared-storage, not shared-cache) + Exadata Storage Server (conceptually).
- Architecture: "Compute is separate from Storage."
- Storage: Data does not reside on the instance disk. It lives in a Distributed Cluster Volume.
- Data is written 6 ways across 3 Availability Zones (AZs).
- Storage grows automatically (in 10 GiB steps) up to 128 TiB.
- Important: In modern versions, Aurora storage supports dynamic resizing: the allocated volume shrinks and the bill goes down if you delete data.
- Failover: When an instance crashes, data doesn't need to be "restored" or "remounted." The new instance simply connects to the same shared storage.
Quick Comparison Table (Cheat Sheet #1)
| Area | Self‑hosted | RDS for PostgreSQL | Aurora PostgreSQL |
|---|---|---|---|
| OS access | Full | No | No |
| PostgreSQL superuser | Full | No (rds_superuser only) | No (managed, roles similar to RDS) |
| Extensions | Any (incl. custom C) | Only supported | Only supported (delegated extension management) |
| Storage | You choose (NVMe/EBS/RAID/…) | EBS | Distributed cluster volume (auto‑grow) |
| HA in region | You build | Multi‑AZ DB instance (1 standby) or Multi‑AZ DB cluster (2 readable standbys) | Writer + replicas + storage‑level durability |
| Read scaling | Replicas/sharding — you build | Read replicas (in/cross‑region), Multi‑AZ DB cluster reads from standbys | Up to 15 replicas + reader endpoint |
| Endpoints | Your VIP/LB/Proxy | DB endpoint, replica endpoints | Writer endpoint + reader endpoint + instance endpoints |
| Backups / PITR | You build and test | Automated backups 0‑35 days + manual snapshots | Automated backups 1‑35 days, cannot disable |
| Multi‑region DR | You build | Cross‑region read replicas | Aurora Global Database (async replication) |
| Upgrade/patching | You | AWS‑managed window | AWS‑managed window |
| DBA focus | “DBA + sysadmin + HA engineer” | “DB reliability engineer” | “DB reliability engineer + endpoint/failover routing discipline” |
2. Deep Dive: Internals That Kill an Oracle DBA's Career
Here we dive into things that work "automatically" or "differently" in Oracle.
2.1 MVCC and Bloat: Why DELETE is a Lie
In Oracle, you are used to Undo Segments. The old version of a row sits in Undo, the new one in the data block. COMMIT confirms the new version, and Undo is overwritten over time. Clean and tidy.
In PostgreSQL (The Postgres Reality):
- MVCC: Old and new row versions (tuples) live together in the same table heap.
- DELETE: This is not a physical deletion. It merely marks the row version as "dead" by setting `xmax`; the tuple stays in the heap.
- UPDATE: This is an `INSERT` of a new version + marking the old one as "dead."
The Main Trap (Bloat): Oracle DBA thinks: "I deleted 10 million rows and Committed. Space is freed." Postgres DBA knows: "I deleted 10 million rows. Now my table is 50% garbage (dead tuples), the physical file size hasn't decreased, and table scans are twice as slow."
The Role of VACUUM (Housekeeping): This is the garbage collector process. It scans the table, finds dead rows, and marks the space as "free for reuse" (FSM - Free Space Map).
- Gotcha: Standard `VACUUM` (and autovacuum) does not return space to the OS. It just makes "holes" in the file available for new inserts.
- To actually shrink the file, you need `VACUUM FULL` (exclusive table lock, downtime) or `pg_repack` (works by creating a new table and swapping, requires 2x space).
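To see the bloat story in the statistics, a quick check against `pg_stat_user_tables` (a standard catalog view, works on self-hosted, RDS, and Aurora alike):

```sql
-- Dead-tuple ratio per table: a high n_dead_tup with a stale last_autovacuum
-- is the classic "I deleted 10M rows and nothing shrank" signature.
SELECT relname,
       n_live_tup,
       n_dead_tup,
       round(100.0 * n_dead_tup / nullif(n_live_tup + n_dead_tup, 0), 1) AS dead_pct,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 10;
```

A consistently high `dead_pct` on a hot table means autovacuum is not keeping up and per-table autovacuum settings (or `pg_repack`) deserve a look.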
2.2 Processes vs. Threads: Why PgBouncer is Mandatory
Oracle Way: Thread architecture or Dedicated Servers. Oracle has a powerful Shared Pool and easily handles 5,000 connections, even if they are idle.
Postgres Reality:
- Process-per-connection architecture.
- Every connection is a forked OS process (`fork()`).
- Every connection consumes RAM + kernel structures.
- Context Switching: If you have 5,000 active processes, the Linux scheduler goes crazy. CPU time goes to `sys`, not `user`.
The Solution: In the Postgres world (and RDS/Aurora), using PgBouncer (or RDS Proxy) is mandatory for high-load systems.
- Analogy: PgBouncer in `transaction pooling` mode does what Oracle Shared Server (MTS) does — multiplexes thousands of client connections onto a few hundred real DB connections.
- Mistake: Connecting an app (especially Serverless/Lambda) directly to Postgres without a pooler is a guaranteed way to kill the database during a traffic spike.
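A minimal `pgbouncer.ini` sketch of transaction pooling (host, database name, and pool sizes are illustrative; authentication settings are omitted for brevity):

```ini
[databases]
app = host=127.0.0.1 port=5432 dbname=app   ; hypothetical target database

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; multiplex, like Oracle Shared Server (MTS)
max_client_conn = 5000       ; client connections PgBouncer will accept
default_pool_size = 50       ; real Postgres connections per db/user pair
```

The point of the numbers: 5,000 app connections collapse into ~50 backend processes, which is exactly the ratio that keeps the Linux scheduler sane. In production you would also configure `auth_type` / `auth_file`.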
2.3 TOAST: The Hidden Enemy of SELECT *
Oracle has LOBs. Postgres has TOAST (The Oversized-Attribute Storage Technique).
- If a row is wider than a page (usually 8KB), Postgres automatically slices long fields (TEXT, JSONB) and moves them to a hidden TOAST table.
- The Problem: This is transparent. You run `SELECT * FROM users`. If the table has "fat" JSON fields, Postgres is forced to go through the TOAST index and read a bunch of extra blocks for every row.
- Advice: Avoid `SELECT *` like the plague. Select only the columns you need. In Oracle, this is "good manners"; in Postgres, it's a matter of I/O survival.
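To check how much of a table actually lives out-of-line in TOAST, a quick catalog query (the table name `users` is hypothetical, substitute your own):

```sql
-- Heap size vs. TOAST size for one table; a large toast_size means
-- SELECT * is paying for blocks you probably don't need.
SELECT c.relname,
       pg_size_pretty(pg_relation_size(c.oid))           AS heap_size,
       pg_size_pretty(pg_relation_size(c.reltoastrelid)) AS toast_size
FROM pg_class c
WHERE c.relname = 'users'
  AND c.reltoastrelid <> 0;
```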
3. Caching Layer: Redis/Memcached (The New Reality)
In the classic Oracle Enterprise world, we often relied on the Buffer Cache (SGA). It is huge, smart, and fast. In the world of Cloud Native apps (Mobile/Web), there is almost always a caching layer in front of the database.
Why isn't Postgres enough?
- Connections: Redis holds 100k+ connections easily. Postgres does not.
- Latency: Reading from Redis memory is <1ms. Reading from Postgres (even from Shared Buffers) involves overhead for SQL parsing, planning, and MVCC visibility checks.
- Cost: Scaling Redis is cheaper than scaling an Aurora Writer.
Architectural Patterns (What a DBRE Must Know):
- Read-Through / Look-Aside: App checks Redis first. If miss -> goes to DB and populates Redis.
- DBA Risk: Thundering Herd. If Redis crashes or you flush it (FlushAll), all traffic instantly hits Postgres. The DB dies in seconds.
- Verdict: Redis is critical infrastructure. You can't "just restart it."
- Write-Behind: App writes to a queue/cache, workers write to the DB.
- DBA Risk: Data loss on cache failure. Eventual consistency.
Tools in AWS:
- Amazon ElastiCache (Redis/Memcached) or Amazon MemoryDB (Redis with durability).
- For a DBRE, this is a "black box" that offloads `SELECT` load from the DB but adds complexity to data invalidation.
4. Backup & Replication: Answering the Skeptics (Deep Dive)
Cloud critics often say: "In RDS, you manage a black box, not a database. You can't even configure a proper backup; it's just a VM snapshot."
This is a fundamental misconception born from a misunderstanding of Managed Services mechanics. Let's break down how backups and replication work "under the hood," using terms like RMAN, Data Guard, and ASM.
4.1 Backups: The Myth of the "VM Snapshot"
Self-Hosted (pgBackRest / WAL-G)
- Gold Standard: You use `pgBackRest`. It performs a physical backup (`pg_basebackup` style) + archives WAL files as they are generated (`archive_command`).
- Plus: Full control. Differential and incremental backups possible.
- Minus: You are responsible for storage (S3/NFS), rotation, integrity checks, and most importantly, ensuring `archive_command` doesn't fail and fill up the local `pg_wal`.
RDS for PostgreSQL (Volume Snapshot + WAL)
- Reality: This is not just a "VM snapshot" (which freezes I/O). It is an orchestration of Storage Volume Snapshot (EBS) + WAL Archiving.
- Once a day (backup window), AWS takes an EBS Snapshot of the data volume. This is a block-level copy.
- Every 5 minutes (or more frequently), AWS copies Transaction Logs (WAL) to S3.
- PITR (Point-in-Time Recovery): When you ask to restore the DB to `14:32:00`, AWS takes the most recent nightly snapshot, hydrates it to a new volume, and "replays" WAL files from S3 over it up to the exact second.
- Verdict: This is a full-fledged physical backup with recovery to any second (RPO ~5 min, RTO depends on WAL volume). It is a managed implementation of `pg_basebackup` + continuous archiving.
Aurora (Continuous Backup)
- The Magic: There are no "data files" in the traditional sense. Aurora storage is a log-structured system.
- Mechanism: Aurora storage continuously accepts records (Redo Log Records). A backup is simply a "pointer" to the history of changes in the distributed storage.
- Restore: Restoration is almost instant (no data copying) or via cloning (Copy-on-Write). You don't "restore files"; you create a new volume pointing to the data state in the past.
4.2 Why Can't I Turn Off RDS Backups?
In Oracle RMAN, you decide: backup or not. In RDS, you can technically set backup retention to 0, but you lose features along with it; in Aurora, Automated Backups (1-35 days retention) cannot be disabled at all.
Reason: The WAL Chain (The Chain of Life). RDS/Aurora rely on the WAL chain (transaction log chain) for Point-in-Time Recovery (PITR) and for Read Replicas to function.
- The service constantly streams WAL files to S3.
- If you turn off backups (retention = 0), AWS deletes the accumulated WALs.
- Consequence: You lose the ability to do PITR. In Aurora, this can also break the ability to create new replicas or clone the cluster.
- In Aurora, storage architecture is inextricably linked to the concept of "continuous backup." The data in the storage volume is essentially the redo log stream. You can't say "don't backup" because that is part of the write protocol.
4.3 Replication & HA: The Technology Matrix (Where is Physics, Where is Logic?)
It's important not to get confused: in RDS, the terms "Multi-AZ" and "Read Replica" are two different technologies living under one roof.
- Self-Hosted: Physical Streaming Replication (The Classic)
- Doubt: "Is Self-Hosted replication logical?"
- Answer: No! The industry standard for HA and Read Replicas is Physical Replication.
- Mechanism: Master sends a stream of WAL records (bytes, block changes) to the Replica. The Replica applies them (`Redo`). This gives an exact binary copy of the cluster.
- Logical Replication: Native logical replication (and the `pglogical` extension) exists but is used for migrations (CDC), ETL, or partial table replication. Building HA on logical replication is "bad practice" (slow, no DDL replication).
- HA: Postgres itself cannot handle Failover. You install Patroni + etcd. Patroni manages physical replication and promotes the replica if the master dies.
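Physical streaming replication exposes its state through a standard catalog view. A quick health check, run on the primary (works on self-hosted and on an RDS primary with read replicas; PostgreSQL 10+ column names):

```sql
-- One row per connected physical replica; replay_lag shows how many
-- bytes of WAL the replica has yet to apply.
SELECT client_addr,
       application_name,
       state,
       sync_state,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS replay_lag
FROM pg_stat_replication;
```

This is the view Patroni itself consults when deciding which replica is healthy enough to promote.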
- RDS: Two Different Technologies in One Box
- A) RDS Multi-AZ (Instance) = High Availability (HA Only)
- Goal: Disaster Recovery.
- Technology: Synchronous Block Replication (EBS level, analogous to DRBD / ASM Mirroring).
- How it works: You `COMMIT`. Data is written to the Primary disk and synchronously to the Standby disk in another AZ.
- Standby: The Postgres process on the Standby is not running (or not open). It does not accept SQL. You cannot read from a Multi-AZ Standby.
- Plus: RPO=0 (guaranteed data safety).
- Minus: You pay for a second instance that just "warms the air" waiting for a crash.
- B) RDS Read Replica = Read Scaling
  - Goal: Offload `SELECT` traffic from the master.
  - Technology: PostgreSQL Native Streaming Replication (same as Self-hosted).
  - How it works: Primary streams WAL to the Replica (async).
  - Standby: Here Postgres is open in `Hot Standby` mode. You can connect and read.
  - Important: RDS Read Replica does not have automatic Failover (by default). If the Master dies, the Read Replica remains an "orphan" until you manually Promote it.
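Because an RDS read replica is plain hot standby, lag can also be measured from the replica's side, with standard functions rather than CloudWatch (a sketch; run it on the replica itself):

```sql
-- Wall-clock replay delay on a hot standby. NULL replay timestamp means
-- no WAL has been replayed since the connection/restart.
SELECT pg_is_in_recovery()                        AS is_replica,
       now() - pg_last_xact_replay_timestamp()   AS replay_delay;
```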
- Aurora: Shared Storage Replication
- Technology: Quorum Writes at the storage level.
- Essence: There is no "Block Replication" between instances and no "Streaming Replication" of data (only metadata).
- Writer writes to shared distributed storage.
- Readers (up to 15) are connected to the same storage.
- Uniqueness: In Aurora, any Reader is simultaneously a Target for HA Failover and a point for Read Scaling. It merges the two worlds.
Verdict for the Skeptic:
- RDS uses Volume Backup + WAL (like RMAN).
- RDS Multi-AZ is Synchronous Disk Mirroring (like ASM Mirroring).
- RDS Read Replica is essentially Active Data Guard (apply lag, open read-only).
Quick Comparison Table (Cheat Sheet #2)
| Feature | Self-hosted (Patroni) | RDS Multi-AZ (Standby) | RDS Read Replica | Aurora Replica |
|---|---|---|---|---|
| Technology | Physical WAL Streaming | Block Level Mirroring | Physical WAL Streaming | Shared Storage Quorum |
| Readable? | Yes (Hot Standby) | No (process closed) | Yes (Hot Standby) | Yes (Hot Standby) |
| Replication Type | Async (usually) or Sync | Synchronous | Asynchronous | Async (lag < 20ms) |
| Failover | Auto (Patroni) | Auto (AWS DNS) | Manual (usually) | Auto (AWS DNS) |
| Lag | Depends on network/load | 0 (Zero) | Depends on network/load | Minimal (ms) |
| Backups | Your scripts (pgBackRest) | Automated Snapshots + WAL (PITR) | N/A | Automated (Continuous Redo Stream) |
5. Why is Aurora Slow? (Honest Breakdown)
You migrated from RDS to Aurora expecting the "x5 performance" from the ads, but got degradation. Why?
- I/O Latency (Network Hop): In RDS (EBS), the disk is connected "close by." In Aurora, every `COMMIT` requires writing a quorum (4 out of 6 copies) over the network to 3 different Availability Zones.
  - Reality Check: For simple queries (single-row insert), latency can be higher than local NVMe. Aurora wins on concurrency (thousands of parallel transactions), not on single-thread speed.
- I/O Cost (Bill Shock): In classic Aurora Standard, you pay for every million I/O operations.
- Scenario: You have a "bad" query plan doing a Sequential Scan on a large table, or you forgot an index.
- Result: The query works, but the I/O counter spins wildly. At the end of the month, you get a $10,000 bill just for I/O.
- Solution: Switch to Aurora I/O Optimized (higher compute/storage cost, but $0 for I/O requests) — this saves the budget for write-heavy systems.
- DNS Caching during Failover: Aurora promises failover in 30 seconds. But your Java app "hangs" for 15 minutes.
- Reason: JVM (Java) caches DNS forever (TTL=infinite) by default. The driver keeps hammering the old IP of the dead writer.
- Cure: Configure the JVM DNS TTL (`networkaddress.cache.ttl`) or use the AWS JDBC Wrapper, which understands cluster topology.
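The TTL knob lives in the JVM's security properties file, not in a normal `-D` system property (the value below is illustrative):

```properties
# $JAVA_HOME/conf/security/java.security
# Cap positive DNS caching (seconds) so the JDBC driver re-resolves
# the cluster endpoint shortly after an Aurora failover flips the DNS record.
networkaddress.cache.ttl=10
```

The same property can be set at startup in application code via `java.security.Security.setProperty("networkaddress.cache.ttl", "10")`, before any name resolution happens.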
6. Survival Guide: How Not to Kill Your Career
Scenarios that lead to termination or gray hair (Resume Generating Events).
Scenario 1: Transaction ID Wraparound (The Apocalypse)
- Essence: XID in Postgres is a 32-bit number (4 billion). It is cyclical. If old transactions are not "frozen" by vacuum, the database hits the wrap limit.
- What happens: To prevent data loss (where old becomes new), Postgres stops. It refuses to accept any write commands. At all. Even from Superuser.
- Cure: Only `VACUUM FREEZE` in single-user mode (essentially production downtime for hours/days).
- Prevention: Monitor `age(datfrozenxid)`. Never disable autovacuum. Do not allow long-running transactions (they block freezing).
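The monitoring query itself is one line against `pg_database`; the hard stop sits near 2^31 (~2.1 billion), and autovacuum's anti-wraparound pass kicks in at `autovacuum_freeze_max_age` (default 200 million):

```sql
-- Distance to XID wraparound per database; alert long before
-- pct_towards_wraparound gets anywhere near 100.
SELECT datname,
       age(datfrozenxid) AS xid_age,
       round(100.0 * age(datfrozenxid) / 2147483647, 1) AS pct_towards_wraparound
FROM pg_database
ORDER BY xid_age DESC;
```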
Scenario 2: Replication Slot Doom
- Essence: You set up logical replication (CDC) for analytics/DMS, created a replication slot, and then the consumer crashed/turned off.
- What happens: The slot tells Postgres: "Don't delete WAL files, I need them." Postgres obediently keeps them. Forever.
- Result: The disk (EBS or Aurora Volume) fills up with WAL files. The DB stops due to "Disk Full."
- Advice: Monitoring the size and age of replication slots is mandatory. Delete unused slots.
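The check fits in one query against `pg_replication_slots`, run on the primary (the slot name in the drop command is hypothetical):

```sql
-- Inactive slots pin WAL forever; retained_wal shows how much disk
-- each slot is currently holding hostage.
SELECT slot_name,
       slot_type,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;

-- Frees the WAL, but permanently breaks that consumer's replication stream:
-- SELECT pg_drop_replication_slot('unused_slot_name');
```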
Scenario 3: "But we have Auto Scaling storage!"
- Essence: RDS has a "Storage Autoscaling" checkbox. You think you are safe.
- Reality: Autoscaling has a cooldown (time between increases — 6 hours or more) and a step limit. If you load data faster (bulk load) than autoscaling triggers, you will get "Disk Full" and a stopped DB before AWS manages to add a disk.
7. Practical Core Skills Checklist
If you want to be a DBRE, not just an "AWS Console Operator":
- Top SQL: Ability to find "who is loading the DB" via `pg_stat_statements` or Performance Insights.
- Locks: Build the blocked -> blocker chain and find the root cause (often `idle in transaction`).
- Autovacuum health: Understand thresholds, see bloat, prevent wraparound.
- Planner literacy: Read `EXPLAIN (ANALYZE, BUFFERS)` and understand why an Index Scan turned into a Seq Scan.
- Capacity trends: Data growth, WAL growth, temp spills (disk writes during sorts), connection growth.
- Drills: Planned failover + restore test. If you haven't done a restore, you don't have a backup.
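The first two checklist items reduce to plain catalog queries, sketched below assuming `pg_stat_statements` is enabled (`total_exec_time` is the PostgreSQL 13+ column name; older versions call it `total_time`):

```sql
-- Top SQL: heaviest statements by cumulative execution time.
SELECT queryid,
       calls,
       round(total_exec_time::numeric, 0) AS total_ms,
       query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 10;

-- Locks: blocked -> blocker chain. The blocker's state is very often
-- 'idle in transaction', i.e. an app holding a transaction open.
SELECT waiting.pid    AS blocked_pid,
       waiting.query  AS blocked_query,
       blocking.pid   AS blocking_pid,
       blocking.state AS blocking_state,
       blocking.query AS blocking_query
FROM pg_stat_activity AS waiting
JOIN pg_stat_activity AS blocking
  ON blocking.pid = ANY (pg_blocking_pids(waiting.pid));
```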
Conclusion
Moving from Oracle to PostgreSQL in the AWS cloud is not a downgrade. It is a paradigm shift. You gain incredible flexibility and ecosystem integration, but you lose the "foolproofing" that Oracle provided.
- Choose RDS if you need predictability and a "Vanilla" experience.
- Choose Aurora if you need extreme availability, fast failover, and read scaling, and you are ready to understand the nuances of I/O billing.
- Don't forget Cache (Redis) and Pooler (PgBouncer) — modern Postgres in the web world doesn't live without them.
And remember: Autovacuum is your best friend, not an enemy to be disabled.