Degraded System Performance and Errors

Incident Report for Shippo

Postmortem

Database Co-Tenancy Resource Contention

Summary

On August 10, 2017, between 2:10am and 7:14am Pacific Time, Shippo services experienced intermittent outages because our database became unresponsive.

Root Cause

Shippo’s primary database server became unresponsive because the standby database was inaccessible. The standby was inaccessible because a co-tenant performed very I/O-intensive operations that resulted in a volume failure. Because disk writes on the primary wait for replication to the standby to complete, the standby’s inaccessibility delayed those writes and degraded the primary’s performance.
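
To illustrate why an unreachable standby can stall the primary, here is a minimal toy model in Python. It is a sketch under stated assumptions, not a description of Shippo’s actual database setup: the names Standby and commit_latency_ms, the 30-second timeout, and the latency figures are all hypothetical and chosen only to show the effect.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Standby:
    """Hypothetical standby; ack_ms is None when it cannot acknowledge at all."""
    ack_ms: Optional[float]


def commit_latency_ms(local_write_ms: float,
                      standby: Optional[Standby],
                      timeout_ms: float = 30_000.0) -> float:
    """Return how long one commit takes on the primary in this toy model.

    With no standby attached, only the local write counts. With a standby
    attached, the commit also waits for its acknowledgement, up to timeout_ms.
    """
    if standby is None:
        return local_write_ms
    if standby.ack_ms is None:
        wait_ms = timeout_ms  # standby never answers: wait the full timeout
    else:
        wait_ms = min(standby.ack_ms, timeout_ms)
    return local_write_ms + wait_ms


# Illustrative numbers only (assumed, not measured).
healthy = commit_latency_ms(2.0, Standby(ack_ms=1.5))
stalled = commit_latency_ms(2.0, Standby(ack_ms=None))
detached = commit_latency_ms(2.0, None)
print(healthy, stalled, detached)  # 3.5 30002.0 2.0
```

In this model, detaching the standby removes the wait entirely, which mirrors the action described under Action Taken.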

Action Taken

We disabled our standby database to prevent disk writes from being delayed and restored overall system performance.

Remediation

We’ve added instrumentation and monitoring to temporarily disable data center failover if the standby database fails.
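
As a rough sketch of how such monitoring might decide when to disable failover, the Python example below trips after a number of consecutive failed health checks on the standby and re-enables failover once the standby recovers. Everything here is an assumption for illustration: the StandbyWatchdog name, the threshold of three checks, and the automatic re-enable behavior are hypothetical and do not describe Shippo’s actual implementation.

```python
from dataclasses import dataclass


@dataclass
class StandbyWatchdog:
    """Hypothetical monitor that trips after `threshold` consecutive failures."""
    threshold: int = 3
    consecutive_failures: int = 0
    failover_enabled: bool = True

    def record(self, standby_healthy: bool) -> bool:
        """Record one health-check result; return whether failover stays enabled."""
        if standby_healthy:
            # Assumed behavior: a healthy check resets the count and re-enables.
            self.consecutive_failures = 0
            self.failover_enabled = True
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.failover_enabled = False
        return self.failover_enabled


watchdog = StandbyWatchdog(threshold=3)
states = [watchdog.record(ok) for ok in (True, False, False, False, True)]
print(states)  # [True, True, True, False, True]
```

Requiring several consecutive failures before tripping avoids disabling failover on a single transient check failure; the right threshold depends on how often checks run.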

Posted Aug 16, 2017 - 23:58 UTC

Resolved

We identified an issue that resulted in degraded performance and elevated response times between 2:10am-2:39am and 3:30am-4:13am Pacific Time. The issue has been addressed and all systems are performing normally again.

Posted Aug 10, 2017 - 11:49 UTC