Cloud Mail Incident - 25/01/2022
Incident Report for SYNAQ
Postmortem

Summary and Impact to Customers

On Tuesday 25th January 2022 from 11:30 to 16:38, SYNAQ Cloud Mail experienced an authentication incident affecting a subset of users.

As a result, certain users were unable to authenticate and access their mailboxes.

Root Cause and Solution

The root cause of this event was that Virtual Machine (VM) backup snapshots on two Cloud Mail mailbox stores were not removed as part of a maintenance task. As part of our routine nightly backup processes, or when server maintenance is being performed, backup snapshots are created for each VM to ensure a quick rollback and minimal impact to clients should there be an issue during maintenance or otherwise. Once a maintenance task has been completed, the snapshots are removed manually. If snapshots are not removed, the performance of the VM slowly degrades over time.

Human error resulted in snapshots not being removed from two of the stores after maintenance was performed on Thursday of the previous week. Whilst a missed snapshot removal is not normally a problem in isolation, a further error occurred in the monitoring scripts that check for this issue: they did not detect that these VMs had old, out-of-date snapshots, and therefore no alerts were raised. As a result, these snapshots started to impact the performance of the stores on Monday. To resolve this degradation, the snapshots were removed on Tuesday morning. Five of the seven snapshots were removed successfully; however, two did not complete their consolidation step. This degraded the VMs' performance even further, forcing them into an unresponsive state.

To resolve this issue, a manual consolidation had to be run on both VMs. Due to the business-day load and the degraded performance of these VMs, this took a couple of hours to complete.

Remediation Actions

• Process improvements will be made to the existing maintenance process to ensure that the snapshot removal task is triple-checked for completion.

• Snapshot monitoring checks will be repaired and updated to alert on old snapshots.

• Snapshot removal will only take place outside of business hours, even if VM performance may be degraded in the meantime.
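
The monitoring and scheduling checks described in the actions above could look something like the following. This is a minimal sketch, not SYNAQ's actual monitoring script: the `Snapshot` type, the function names, the 24-hour age threshold, and the 08:00 to 17:00 weekday business hours are all hypothetical values chosen for illustration, and the snapshot list would in practice come from the virtualisation platform's own inventory.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta


@dataclass(frozen=True)
class Snapshot:
    vm_name: str          # hypothetical VM identifier
    created_at: datetime  # when the snapshot was taken


def find_stale_snapshots(snapshots, now, max_age=timedelta(hours=24)):
    """Return snapshots older than max_age (assumed threshold), oldest first."""
    stale = [s for s in snapshots if now - s.created_at > max_age]
    return sorted(stale, key=lambda s: s.created_at)


def is_outside_business_hours(moment, start=time(8, 0), end=time(17, 0)):
    """True on weekends, or on weekdays before start or at/after end.

    The 08:00-17:00 window is an assumption, not a documented policy.
    """
    if moment.weekday() >= 5:  # Saturday is 5, Sunday is 6
        return True
    return not (start <= moment.time() < end)


# Illustrative data only: one snapshot left over from earlier maintenance,
# one taken by a recent nightly backup.
now = datetime(2022, 1, 25, 9, 0)
snapshots = [
    Snapshot("store-a", datetime(2022, 1, 20, 22, 0)),
    Snapshot("store-b", datetime(2022, 1, 25, 1, 0)),
]

for snap in find_stale_snapshots(snapshots, now):
    print(f"ALERT: {snap.vm_name} has a snapshot from {snap.created_at:%Y-%m-%d %H:%M}")

if not is_outside_business_hours(now):
    print("Snapshot removal should be deferred until after business hours.")
```

Keeping the age check separate from the business-hours check means an alert can still be raised immediately, while the removal itself is scheduled for a quieter period.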

Posted Jan 26, 2022 - 14:49 CAT

Resolved
Dear Clients,

The SYNAQ Cloud Mail incident has been resolved and all services are running optimally.
Posted Jan 25, 2022 - 16:38 CAT
Update
Dear Clients,

Our engineers are still working on the resolution of the SYNAQ Cloud Mail Incident. This is being treated as a matter of urgency. Half of affected users have already recovered.

We will send our next update in 60 minutes.
Posted Jan 25, 2022 - 14:43 CAT
Update
Dear Clients,

Our engineers are still working on the resolution of the SYNAQ Cloud Mail Incident. This is being treated as a matter of urgency. Half of affected users have already recovered.

We will send our next update in 60 minutes.
Posted Jan 25, 2022 - 13:42 CAT
Identified
Dear Clients,

Our engineers have identified the SYNAQ Cloud Mail incident and are working on a resolution.
We will send our next update in 60 minutes.
Posted Jan 25, 2022 - 12:46 CAT
Investigating
Dear Clients,

SYNAQ Cloud Mail is currently experiencing an incident where a subset of users are unable to log into their mailboxes. Engineers are investigating this as a matter of urgency.

We will send our next update in 60 minutes.
Posted Jan 25, 2022 - 11:59 CAT
This incident affected: SYNAQ Cloud Mail.