Summary and Impact to Customers
On Tuesday 25th January 2022 from 11:30 to 16:38, SYNAQ Cloud Mail experienced an authentication incident affecting a subset of users.
As a result, the affected users were unable to authenticate and access their mailboxes.
Root Cause and Solution
The root cause of this event was that Virtual Machine (VM) backup snapshots on two Cloud Mail mailbox stores were not removed as part of a maintenance task. As part of our routine nightly backup processes, or when server maintenance is being performed, backup snapshots are created for each VM to ensure quick rollback and minimal impact to clients should there be an issue during maintenance or otherwise. Once a maintenance task has been completed, the snapshots are removed manually. If snapshots are not removed, the performance of the VM slowly degrades over time.
Human error resulted in snapshots not being removed from two of the stores after maintenance was performed on Thursday of the previous week. Whilst a missed snapshot removal is not normally a problem in isolation, a further error occurred in the monitoring scripts that look for this issue: they did not detect that these VMs had old, out-of-date snapshots, and therefore no alerts were raised. As a result, these snapshots started to impact the performance of the stores on Monday. To resolve this degradation, the snapshots were removed on Tuesday morning. Five of the seven snapshots were removed successfully; however, two did not complete their consolidation step, which degraded the VMs' performance further and forced them into an unresponsive state.
To resolve this issue, a manual consolidation had to be run on both VMs. Due to the business-day load and the degraded performance of these VMs, this took a couple of hours to complete.
• Process improvements will be made to the existing maintenance process to ensure that the snapshot removal task is triple checked for completion.
• Snapshot monitoring checks will be repaired and updated to alert on old snapshots.
• Snapshot removal will only take place outside of business hours, despite possible degradation of VM performance in the interim.
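As an illustration of the kind of check the second item describes, the following is a minimal sketch of an old-snapshot alert. It is not the actual SYNAQ monitoring script: the `Snapshot` record, the `find_stale_snapshots` and `format_alert` helpers, the VM names and the 24-hour threshold are all hypothetical, and the snapshot inventory is assumed to have been gathered already from whatever virtualisation platform is in use.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical inventory record for one VM snapshot."""
    vm_name: str
    snapshot_name: str
    created_at: datetime  # timezone-aware creation time


def find_stale_snapshots(snapshots, max_age, now=None):
    """Return the snapshots older than max_age, oldest first."""
    now = now or datetime.now(timezone.utc)
    stale = [s for s in snapshots if now - s.created_at > max_age]
    return sorted(stale, key=lambda s: s.created_at)


def format_alert(snapshot, now):
    """Build a one-line alert message for a stale snapshot."""
    age = now - snapshot.created_at
    return (
        f"ALERT: {snapshot.vm_name} has snapshot "
        f"'{snapshot.snapshot_name}' aged {age.days}d"
    )


if __name__ == "__main__":
    # Assumed threshold: anything older than 24 hours raises an alert.
    threshold = timedelta(hours=24)
    now = datetime.now(timezone.utc)

    # Illustrative inventory; a real check would query the hypervisor.
    inventory = [
        Snapshot("store-a", "pre-maintenance", now - timedelta(days=4)),
        Snapshot("store-b", "nightly", now - timedelta(hours=6)),
    ]

    for snap in find_stale_snapshots(inventory, threshold, now=now):
        print(format_alert(snap, now))
```

Keeping the age comparison separate from the inventory collection means the alerting logic can be tested with fixed timestamps, which is one way to catch a check that silently stops detecting old snapshots.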