Summary and Impact to Customers
On Wednesday 8th May 2021 from 06:00 to 12:28, SYNAQ Cloud Mail experienced a mail availability incident.
As a result, users were unable to authenticate and could not access the platform.
Root Cause and Solution
The root cause of this event was a scheduled change carried out on the morning of the 8th of May from 00:00 to 06:00. The change was intended to improve the overall switch redundancy in the storage network. The change was unsuccessful and had to be rolled back. Once the change was rolled back, not all hosts could see all the relevant storage paths. This prevented many of the VMs from starting up correctly, so the environment could not come up completely.
To resolve this issue, the port channels between the two core switches had to be brought back up. Once this was done, communication was restored from all servers to all the relevant storage. All VMs were then successfully restored, and users were able to authenticate and access their mail once again.
Upon further investigation, it was determined that the change to stack the core switches to increase switch redundancy took down the existing port channels between them. Once these were brought back up, the switch configuration was returned to its original state and all Cloud Mail functionality was restored.
• A full audit of the change has been conducted and the reasons for its failure have been identified.
• Changes have been made to improve our rollback plan process to ensure crucial steps are not missed going forward.
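As an illustration of the kind of check a rollback plan could include, the sketch below compares the storage paths each host is expected to see against the paths it actually reports after a rollback. It is a minimal example only and does not describe SYNAQ's actual tooling: the function names, host names and path labels are all hypothetical, and how the observed paths would be collected from the hosts is left out.

```python
from typing import Dict, Set


def find_missing_paths(
    expected: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per host, the expected storage paths that the host cannot see."""
    missing: Dict[str, Set[str]] = {}
    for host, paths in expected.items():
        # A host absent from the observed data is treated as seeing no paths.
        gap = paths - observed.get(host, set())
        if gap:
            missing[host] = gap
    return missing


def rollback_is_complete(
    expected: Dict[str, Set[str]],
    observed: Dict[str, Set[str]],
) -> bool:
    """A rollback only counts as complete when no host is missing a path."""
    return not find_missing_paths(expected, observed)


# Hypothetical inventory: every host should see a path via both core switches.
expected_paths = {
    "host-01": {"switch-a", "switch-b"},
    "host-02": {"switch-a", "switch-b"},
}

# Hypothetical state after a rollback in which one path did not return.
observed_paths = {
    "host-01": {"switch-a", "switch-b"},
    "host-02": {"switch-a"},
}

missing_after_rollback = find_missing_paths(expected_paths, observed_paths)
```

Running a check of this kind as an explicit step of the rollback plan, before VMs are started, would flag a host that has lost a storage path instead of leaving it to surface as a failed VM start-up.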