On Monday the 5th of October 2015, SYNAQ Securemail experienced a service performance degradation event from 08h00 until 14h00, followed by a full outage event, which started at 14h00 and ended at 17h00 on the same day.
As a result, clients of our Securemail email filtering service experienced delays in sending (via the SMTP service) and receiving mail, as well as a full loss of service during which no email could be sent or received through our platform.
The service was fully stable and considered restored as of 04h30 on the morning of Tuesday the 6th of October, when all email queues were cleared.
The root cause of these events was a networking configuration change resulting in a connectivity fault in the data centers of SYNAQ’s upstream hosting provider.
The critical path of the outage began on Sunday the 4th of October 2015 at 15h10 when the upstream provider initiated a change for their inbound Border Gateway Protocol (BGP) prefix list filters on their Provider Edge Routers, as part of a new customer install.
This led to a route leak into the network from a neighboring BGP peer, which resulted in overall network instability for a short time. This in itself did not have an immediate impact on SYNAQ Securemail because network traffic load was very low on a Sunday afternoon. According to the upstream provider, this issue was resolved on Sunday the 4th of October.
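For readers unfamiliar with prefix list filters, the following is a minimal sketch of how an inbound filter decides which announced routes to accept, and how an overly broad entry can let unintended routes through. It is illustrative only: the function name, the prefixes (taken from the reserved documentation ranges) and the filter entries are assumptions, and it does not represent the upstream provider's actual configuration or the exact change that was made.

```python
from ipaddress import ip_network

def is_permitted(route, prefix_list):
    """Return True if `route` matches any permit entry in `prefix_list`.

    Each entry is (prefix, min_len, max_len): the route must fall inside
    `prefix` and its mask length must lie between min_len and max_len.
    Anything that matches no entry is denied (implicit deny).
    """
    net = ip_network(route)
    for prefix, min_len, max_len in prefix_list:
        allowed = ip_network(prefix)
        if net.version != allowed.version:
            continue  # never compare IPv4 against IPv6
        if net.subnet_of(allowed) and min_len <= net.prefixlen <= max_len:
            return True
    return False

# Hypothetical filters: a strict one for a single customer prefix,
# and an overly broad one that accepts any IPv4 route.
strict_filter = [("203.0.113.0/24", 24, 24)]
broad_filter = [("0.0.0.0/0", 0, 32)]

customer_route = "203.0.113.0/24"   # the route the filter is meant to accept
unrelated_route = "198.51.100.0/24"  # a route that should not be accepted

print(is_permitted(customer_route, strict_filter))
print(is_permitted(unrelated_route, strict_filter))
print(is_permitted(unrelated_route, broad_filter))
```

With the strict filter only the intended prefix is accepted; with the broad filter the unrelated route is accepted as well, which is the general shape of a route leak.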
On Monday morning (October 5th) at approximately 08h30, SYNAQ engineers started receiving alerts that a large backlog of email was beginning to queue. Following these alerts, engineers also started receiving alerts that SMTP connections from upstream servers attempting to relay email were being refused.
These "connection refused" errors occurred because Securemail’s concurrent connection threshold protection, which guards against Distributed Denial of Service (DDoS) attacks, had been triggered.
At 08h48 SYNAQ announced that it was experiencing a DDoS attack and began work to mitigate this perceived attack.
At 09h00 SYNAQ engineers determined that network connectivity to and from the Securemail environment was also degraded and experiencing between 65% and 80% packet loss.
SYNAQ then notified our upstream network provider of this connectivity degradation at 09h30.
Shortly thereafter, SYNAQ engineers realised that the DDoS protection was being triggered not by an attack but by very high TCP/IP retry rates: upstream hosts were repeatedly failing to establish connections to our cluster because of the high packet loss.
During this time the Securemail service was degraded, and the impact to customers was slow inbound and outbound mail delivery.
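The way legitimate traffic can trip a concurrent connection threshold under packet loss can be sketched with Little's law (connections in flight = arrival rate × average session duration). The figures below are hypothetical and chosen only to illustrate the effect; they are not measurements from the Securemail platform, and the threshold value is an assumption.

```python
def concurrent_connections(arrival_rate_per_s, mean_session_s):
    """Little's law: average connections in flight = rate * duration."""
    return arrival_rate_per_s * mean_session_s

def is_refusing(concurrent, threshold):
    """A simple threshold check: refuse new connections above the limit."""
    return concurrent > threshold

THRESHOLD = 1000  # hypothetical concurrent connection limit

# Healthy network: sessions complete quickly.
normal = concurrent_connections(arrival_rate_per_s=200, mean_session_s=2.0)

# Heavy packet loss: senders retry (higher arrival rate) and each
# session stalls on retransmissions (much longer duration).
degraded = concurrent_connections(arrival_rate_per_s=400, mean_session_s=30.0)

print(normal, is_refusing(normal, THRESHOLD))
print(degraded, is_refusing(degraded, THRESHOLD))
```

In this sketch the same set of legitimate senders stays well under the limit on a healthy network but far exceeds it once sessions stall and retries accumulate, which looks the same to a threshold check as an attack.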
At 11h28 our upstream provider escalated the Securemail network packet loss issue to their senior networking engineers who confirmed that the data center core-switching infrastructure was indeed experiencing high rates of packet loss.
By 14h00 the upstream provider had not been able to stabilize the core switching environment, which had by then become very unstable due to unnoticed effects of the previous day’s BGP route leak. The provider decided to move SYNAQ’s Securemail environment to new switches rather than affect all the other hosting clients in the data center.
This move was carried out without SYNAQ’s consent and lasted until 17h00. It resulted in a complete outage of the Securemail service: no email could be sent or received by our clients, and SYNAQ engineers had no access to the environment.
After the switch cutover was complete and network access was restored, the SYNAQ Securemail cluster had to process, scan and deliver a backlog of more than 3.5 million emails from upstream senders. It completed this task by 04h30 on the morning of the 6th of October.
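As a rough indication of the processing rate this implies, the calculation below divides the backlog by the recovery window. It assumes that processing began at 17h00 when access was restored, that the backlog was exactly 3.5 million messages (the stated figure is a lower bound), and that no new mail arrived in the meantime, so the result is an approximate minimum rather than a measured throughput.

```python
from datetime import datetime

# Times taken from the report; the backlog figure is a lower bound.
restored = datetime(2015, 10, 5, 17, 0)
cleared = datetime(2015, 10, 6, 4, 30)
backlog = 3_500_000

window_s = (cleared - restored).total_seconds()
rate_per_s = backlog / window_s
rate_per_hour = rate_per_s * 3600

print(f"window: {window_s / 3600:.1f} hours")
print(f"average rate: {rate_per_s:.1f} emails/second")
print(f"average rate: {rate_per_hour:,.0f} emails/hour")
```

Under these assumptions the cluster sustained on the order of 85 emails per second for eleven and a half hours.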
In addition to meeting with our upstream provider to prevent similar changes from affecting the SYNAQ Securemail network and hosting environment, SYNAQ is also reviewing our service level and operational level agreements with our upstream network provider to ensure the following:
SYNAQ wishes to unreservedly apologise for this outage and the inconvenience it may have caused to our customers and partners.
We wish to assure you that we are doing everything that we can to work with our providers to improve the resilience and redundancy of our networking environment in the weeks and months ahead, so that events of this nature can never happen again.