On August 10, 2017, between 2:10am PST and 7:14am PST, Shippo services had intermittent outages due to our database becoming unresponsive.
Shippo’s primary database server became unresponsive because the standby database was inaccessible. The standby database was inaccessible because a co-tenant performed very I/O-intensive operations that resulted in a volume failure. This affected the primary database’s performance because disk writes on the primary were delayed while waiting for replication to the standby database to complete.
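To illustrate why an unreachable standby slows the primary, here is a minimal toy model of a commit that waits for the standby to acknowledge replication. It is a sketch only, not our production code: the names `Standby` and `commit_latency_ms`, the acknowledgement timeout, and all the numbers are hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    # Hypothetical stand-in for the standby database.
    # ack_latency_ms is None when the standby is unreachable.
    ack_latency_ms: Optional[int]


def commit_latency_ms(local_write_ms: int,
                      standby: Optional[Standby],
                      ack_timeout_ms: int) -> int:
    """Return how long one commit takes on the primary in this toy model."""
    if standby is None:
        # Replication disabled: the commit only pays for the local write.
        return local_write_ms
    if standby.ack_latency_ms is None:
        # Standby unreachable: the commit stalls until the (assumed) timeout.
        return local_write_ms + ack_timeout_ms
    # Healthy standby: the commit waits for the acknowledgement.
    return local_write_ms + min(standby.ack_latency_ms, ack_timeout_ms)
```

In this model, a healthy standby adds only its acknowledgement latency to each write, an unreachable standby makes every write wait for the full timeout, and passing `None` for the standby (replication disabled) removes the wait entirely.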
We disabled our standby database so that disk writes were no longer delayed, which restored overall system performance.
We’ve added instrumentation and monitoring that temporarily disable data center failover if the standby (secondary) database fails.
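As a rough illustration of that kind of safeguard, the sketch below tracks standby health checks and switches failover off after several consecutive failures, then back on once the standby recovers. The class name `StandbyMonitor`, the failure threshold, and the re-enable-on-recovery behavior are assumptions made for this example and do not describe our actual tooling.

```python
class StandbyMonitor:
    """Hypothetical monitor: disables failover while the standby is failing."""

    def __init__(self, failure_threshold: int = 3):
        # Number of consecutive failed checks tolerated before acting (assumed value).
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.failover_enabled = True

    def record_check(self, standby_healthy: bool) -> bool:
        """Record one health-check result and return whether failover is enabled."""
        if standby_healthy:
            # Standby is back: reset the counter and re-enable failover.
            self.consecutive_failures = 0
            self.failover_enabled = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                # Standby has failed repeatedly: stop waiting on it.
                self.failover_enabled = False
        return self.failover_enabled
```

Requiring several consecutive failures before acting avoids disabling failover on a single transient blip, at the cost of a short delay before the primary stops waiting on a failed standby.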