Name Data ingestion outage caused by a database migration
Date May 09, 2025 - May 09, 2025

Summary


Our ingestion pipeline experienced complete unavailability for 23 minutes on May 09, 2025 (14:07 UTC – 14:30 UTC) while we were performing a scheduled database schema migration. During this window, 100% of ingest operations failed with a 503 status code, and reads that depended on fresh data were delayed. No data was lost: most customer pipelines buffered events until service recovered.

The outage was triggered by a race condition that surfaced only when traffic was simultaneously double‑written to two RDS clusters during the migration. An open transaction on the new cluster escalated to a global lock on the primary ingestion database, blocking all subsequent writes. The service recovered immediately once the blocking session was terminated and the migration was rolled back.
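For readers who want to see what "terminating the blocking session" looks like in practice, the sketch below shows one way to list and kill blocking sessions on a PostgreSQL-compatible cluster. It is illustrative only: the connection details are placeholders and this is not the exact tooling we used during the incident.

```python
# Illustrative only: find sessions that are blocking others and terminate them.
# Assumes a PostgreSQL-compatible cluster and the psycopg2 driver; the
# connection parameters are placeholders, not our real infrastructure.
import psycopg2

conn = psycopg2.connect(host="ingest-db.example.internal", dbname="ingest", user="ops")
conn.autocommit = True

with conn.cursor() as cur:
    # Every blocked session reports the PIDs that are blocking it.
    cur.execute("""
        SELECT DISTINCT unnest(pg_blocking_pids(pid)) AS blocker_pid
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    blockers = [row[0] for row in cur.fetchall()]

    # Inspect each blocker (state, transaction age, last query) before acting.
    for pid in blockers:
        cur.execute(
            "SELECT state, xact_start, query FROM pg_stat_activity WHERE pid = %s",
            (pid,),
        )
        print(pid, cur.fetchone())

    # Terminating the backend rolls back its open transaction, releasing the
    # row locks so the queued writers can proceed.
    for pid in blockers:
        cur.execute("SELECT pg_terminate_backend(%s)", (pid,))
```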

We’re sorry for the disruption to your data flows. Below we explain the impact, what happened and when, and the steps we’re taking to keep it from happening again.

Timeline

| Time | Event |
| --- | --- |
| May 09 14:00 UTC | Started performing migration actions. |
| May 09 14:07 UTC | — IMPACT BEGINS — Ingest error rate starts to increase. |
| May 09 14:12 UTC | Stopped migration actions and reverted the performed changes. |
| May 09 14:14 UTC | Ingestion briefly starts to recover. |
| May 09 14:16 UTC | After the brief recovery, the ingestion error rate degrades again; the fallback actions are not having the desired impact. |
| May 09 14:24 UTC | We identify a stale lock in our database holding back all ingestion despite the reversal. We terminate the blocking session, and shortly afterwards ingestion starts to recover. |
| May 09 14:30 UTC | — IMPACT ENDS — The error rate decreases and all ingestion functionality is restored. |

Analysis

For two weeks prior to the migration, ingestion workers double‑wrote metadata to two RDS clusters, as sketched below:

  1. Cluster A (current production)
  2. Cluster B (using new schema)
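Conceptually, the double‑write looked like the sketch below. The table names, column names, and driver are illustrative assumptions (a PostgreSQL-compatible cluster and psycopg2), not our actual ingestion code; the important property is that Cluster A stayed the source of truth while Cluster B received best‑effort shadow writes.

```python
# Hypothetical sketch of the double-write pattern; schema names are invented.
import logging

import psycopg2

log = logging.getLogger("ingest.dual_write")


def write_metadata(event, conn_a, conn_b):
    """Write one event's metadata to both clusters during the migration window."""
    # Primary write to Cluster A (current production): must succeed for the
    # ingest request to succeed.
    with conn_a:  # commits on success, rolls back on exception
        with conn_a.cursor() as cur:
            cur.execute(
                "INSERT INTO event_metadata (event_id, payload) VALUES (%s, %s)",
                (event["id"], event["payload"]),
            )

    # Shadow write to Cluster B (new schema): best effort only, so problems on
    # the new cluster cannot fail ingestion on their own.
    try:
        with conn_b:
            with conn_b.cursor() as cur:
                cur.execute(
                    "INSERT INTO event_metadata_v2 (event_id, payload) VALUES (%s, %s)",
                    (event["id"], event["payload"]),
                )
    except psycopg2.Error:
        log.exception("shadow write to Cluster B failed for event %s", event["id"])
```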

During the scheduled switch we flipped the primary write path to Cluster B. In the first minutes of the switchover, a rare interleaving of the two concurrent writes produced this sequence:

  1. Write to Cluster A failed to commit.
  2. Corresponding write to Cluster B left an open transaction holding row locks.
  3. Ingestion workers retried, steadily accumulating blocked sessions behind the open lock.

After the rollback at 14:12 UTC, the backlog of pending writes made the race dramatically more probable; each surge of retries quickly recreated the blocking lock until the session was killed at 14:24 UTC.
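To illustrate why the retries kept piling up behind the open transaction, the sketch below shows a hypothetical worker retry loop that uses PostgreSQL's per-transaction `lock_timeout` so that each attempt fails fast rather than joining the queue of blocked sessions. It assumes psycopg2 and PostgreSQL lock semantics and is not taken from our worker code.

```python
# Hypothetical retry loop: bound how long each attempt waits for row locks.
import random
import time

from psycopg2 import errors

MAX_ATTEMPTS = 5


def write_with_lock_timeout(conn, sql, params):
    """Retry a write, but never wait indefinitely behind another session's lock."""
    for attempt in range(1, MAX_ATTEMPTS + 1):
        try:
            with conn:  # one transaction per attempt
                with conn.cursor() as cur:
                    # Abort this attempt if the row locks are not acquired
                    # within 2 seconds, instead of stacking up as a blocked
                    # session behind a stuck transaction.
                    cur.execute("SET LOCAL lock_timeout = '2s'")
                    cur.execute(sql, params)
            return
        except errors.LockNotAvailable:
            if attempt == MAX_ATTEMPTS:
                raise
            # Back off with jitter so synchronized retries do not all hit
            # the contended rows at the same moment.
            time.sleep(min(2 ** attempt, 15) * random.uniform(0.5, 1.0))
```

With a bound like this, a surge of retries surfaces as fast, visible lock-timeout errors rather than a silently growing queue of blocked sessions.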

Why we didn’t catch it sooner