SYNAQ Securemail - Upstream Network Outage (Previously DDOS Attack)

Incident Report for SYNAQ

Postmortem

Summary and Impact to Customers

On Monday the 5th of October 2015, SYNAQ Securemail experienced a service performance degradation event from 08h00 until 14h00, followed by a full outage event, which started at 14h00 and ended at 17h00 on the same day.

As a result, clients of our Securemail email filtering service experienced delays in sending (via the SMTP service) and receiving mail, as well as a full loss of service during which no email could be sent or received through our platform.

The service was fully stable and considered restored as of 04h30 on the morning of Tuesday the 6th of October, when all email queues were cleared.

Root cause and Solution

The root cause of these events was a networking configuration change resulting in a connectivity fault in the data centers of SYNAQ’s upstream hosting provider.

The critical path of the outage began on Sunday the 4th of October 2015 at 15h10, when the upstream provider initiated a change to the inbound Border Gateway Protocol (BGP) prefix list filters on their Provider Edge Routers as part of a new customer install.

This led to a route leak into the network from a neighboring BGP peer, which resulted in overall network instability for a short time. This in and of itself did not have an immediate impact on SYNAQ Securemail, because network traffic load was very low on a Sunday afternoon. According to the upstream provider, this issue was resolved on Sunday the 4th of October.
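To illustrate what an inbound prefix list filter does, the following is a minimal sketch in Python, not the provider's actual router configuration. The allow-list entries, the maximum prefix lengths and the function name accept_route are all hypothetical, and the prefixes are documentation ranges chosen for the example. The idea is that a route announced by a peer is accepted only if it falls inside a prefix that peer is expected to announce; if such a filter is loosened or misapplied, unexpected routes can leak in.

```python
import ipaddress

# Hypothetical allow-list for one BGP peer: each entry is a permitted
# prefix plus the longest prefix length we are willing to accept.
ALLOWED = [
    (ipaddress.ip_network("198.51.100.0/24"), 24),
    (ipaddress.ip_network("203.0.113.0/24"), 24),
]


def accept_route(prefix: str) -> bool:
    """Return True if an announced prefix passes the inbound filter."""
    net = ipaddress.ip_network(prefix)
    for allowed, max_len in ALLOWED:
        # subnet_of() raises TypeError across IP versions, so check first.
        if net.version != allowed.version:
            continue
        if net.subnet_of(allowed) and net.prefixlen <= max_len:
            return True
    return False


if __name__ == "__main__":
    for candidate in ["198.51.100.0/24", "0.0.0.0/0", "192.0.2.0/24"]:
        verdict = "accept" if accept_route(candidate) else "reject"
        print(candidate, verdict)
```

With this allow-list, only the first candidate is accepted; a default route or an unrelated prefix from the same peer would be rejected.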

The next morning, Monday the 5th of October, at approximately 08h30, SYNAQ engineers started receiving alerts that a large backlog of email was beginning to queue. Following these alerts, engineers also started receiving alerts that SMTP connections from upstream servers attempting to relay email were being refused.

These "connection refused" errors occurred because Securemail’s concurrent connection threshold protection, which is designed to guard against Distributed Denial of Service (DDOS) attacks, had been triggered.
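A concurrent connection threshold of this kind can be sketched as follows. This is an illustrative model only, not Securemail's implementation; the class name ConnectionLimiter, its methods and the threshold value are assumptions made for the example.

```python
class ConnectionLimiter:
    """Refuse new connections once a fixed number are already open."""

    def __init__(self, max_concurrent: int) -> None:
        self.max_concurrent = max_concurrent  # hypothetical threshold
        self.active = 0    # connections currently open
        self.refused = 0   # connections turned away so far

    def try_open(self) -> bool:
        """Admit a connection if under the threshold, else refuse it."""
        if self.active >= self.max_concurrent:
            self.refused += 1
            return False
        self.active += 1
        return True

    def close(self) -> None:
        """Release a slot when a connection finishes."""
        if self.active > 0:
            self.active -= 1


if __name__ == "__main__":
    limiter = ConnectionLimiter(max_concurrent=2)
    results = [limiter.try_open() for _ in range(3)]
    print(results, "refused:", limiter.refused)
```

A limiter like this cannot tell an attack apart from legitimate senders whose connections stay open for a long time or are retried repeatedly; it only sees the count.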

At 08h48 SYNAQ announced that they were experiencing a DDOS attack and began work to mitigate this perceived attack.

At 09h00 SYNAQ engineers determined that network connectivity to and from the Securemail environment was also degraded, with between 65% and 80% packet loss.

SYNAQ then notified our upstream network provider of this connectivity degradation at 09h30.

Shortly thereafter SYNAQ engineers realised that the DDOS protection was being triggered not by an attack, but by the very high TCP/IP retry rates from upstream hosts that were failing to establish connections to our cluster because of the high packet loss.
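A rough sense of why packet loss at this level inflates retry traffic can be had from a simple model. The sketch below assumes each packet is lost independently with the same probability, which is a simplification: real TCP behaviour (retransmission timers, back-off, connection timeouts) is more involved, and the function name expected_transmissions is hypothetical.

```python
def expected_transmissions(loss_rate: float) -> float:
    """Average sends needed to deliver one packet, assuming each send
    is lost independently with probability loss_rate (geometric model)."""
    if not 0.0 <= loss_rate < 1.0:
        raise ValueError("loss_rate must be in the range [0, 1)")
    return 1.0 / (1.0 - loss_rate)


if __name__ == "__main__":
    # The 65% and 80% figures are the loss rates reported in this incident.
    for loss in (0.65, 0.80):
        sends = expected_transmissions(loss)
        print(f"{loss:.0%} loss: about {sends:.1f} sends per delivered packet")
```

Under this model, 65% loss means roughly 2.9 sends per delivered packet and 80% loss means roughly 5, before accounting for senders that give up and open a fresh connection.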

During this time the Securemail service was degraded, and the impact to customers was slow inbound and outbound mail delivery.

At 11h28 our upstream provider escalated the Securemail network packet loss issue to their senior networking engineers, who confirmed that the data center core-switching infrastructure was indeed experiencing high rates of packet loss.

By 14h00 the upstream vendors had not been able to stabilize the core switching environment, which by then had become very unstable due to unnoticed effects of the previous day’s BGP route leak. They decided to move SYNAQ’s Securemail environment to new switches rather than affect all the other hosting clients in the data center.

The move was carried out without SYNAQ’s consent and lasted until 17h00. It resulted in a complete outage of the Securemail service, with no email received or sent for our clients and no access to the environment for SYNAQ engineers.

After the switch cutover was complete and network access was restored, the SYNAQ Securemail cluster had to process, scan and deliver a backlog of over 3.5 million emails from upstream senders. It completed this task by 04h30 on the morning of the 6th of October.
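As a back-of-the-envelope check on these figures, the sketch below computes the average net drain rate implied by the 17h00 to 04h30 recovery window, and compares it with the processing rate of approximately eleven thousand emails per minute quoted in the 17:41 status update. The function names are hypothetical, and the backlog is taken as exactly 3.5 million even though the report says "over"; new mail also kept arriving during recovery, so the result is only a rough lower bound.

```python
from datetime import datetime

BACKLOG_EMAILS = 3_500_000       # "over 3.5 million", treated as a floor
QUOTED_RATE_PER_MIN = 11_000     # rate quoted in the 17:41 status update


def recovery_minutes(start: datetime, end: datetime) -> float:
    """Length of the recovery window in minutes."""
    return (end - start).total_seconds() / 60


def average_drain_rate(backlog: int, minutes: float) -> float:
    """Average emails cleared per minute over the whole window."""
    return backlog / minutes


if __name__ == "__main__":
    window = recovery_minutes(datetime(2015, 10, 5, 17, 0),
                              datetime(2015, 10, 6, 4, 30))
    print("window (min):", window)
    print("average net rate:", round(average_drain_rate(BACKLOG_EMAILS, window)))
    print("minutes at quoted rate:", round(BACKLOG_EMAILS / QUOTED_RATE_PER_MIN))
```

The window is 690 minutes, which implies an average net rate of roughly 5,000 emails per minute. At the quoted eleven thousand per minute the backlog alone would have cleared in a little over five hours, so the longer actual recovery is consistent with new mail continuing to arrive while the queues drained.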

Future Risk Mitigation Actions

In addition to meeting with our upstream provider to prevent similar changes from affecting the SYNAQ Securemail network and hosting environment, SYNAQ is also reviewing our service level and operational level agreements with our upstream network provider to ensure the following:

Escalations of issues on our hosting environment bypass standard channels and reach the senior network engineering level immediately and are responded to in a more timely manner.
No changes on the provider’s network that could affect SYNAQ’s environment can be applied without notifying SYNAQ and gaining its approval.

Conclusion

SYNAQ wishes to unreservedly apologise for this outage and the inconvenience it may have caused to our customers and partners.

We wish to assure you that we are doing everything that we can to work with our providers to improve the resilience and redundancy of our networking environment in the weeks and months ahead, so that events of this nature can never happen again.

Posted Oct 13, 2015 - 15:15 CAT

Resolved

All mail backlogs are caught up. A full report will be made available as soon as SYNAQ receives a Root Cause report from our upstream network provider.

Posted Oct 06, 2015 - 06:07 CAT

Update

Mail queue backlogs continue to be processed and are clearing steadily.

Posted Oct 05, 2015 - 19:18 CAT

Update

Mail backlog is processing steadily. Processing rate is approximately eleven thousand emails per minute and climbing.

Posted Oct 05, 2015 - 17:41 CAT

Update

Upstream Vendors have restored connectivity. Mail backlog has started processing.

Posted Oct 05, 2015 - 16:56 CAT

Update

Unfortunately, upstream network is still not restored after being re-cabled. Engineers are checking configurations of new switching infrastructure to identify cause.

Posted Oct 05, 2015 - 16:27 CAT

Update

Upstream vendor update: ETA for Securemail network restoration is now set for 16:05.

Posted Oct 05, 2015 - 15:55 CAT

Update

Upstream vendor's network engineers are cutting over Securemail to new switching infrastructure to resolve the issue. ETA for this cutover is 90 minutes.

Posted Oct 05, 2015 - 14:15 CAT

Update

Our Upstream Network Provider has still not resolved the issue affecting network connectivity to the Securemail cluster. We have further escalated this issue to attempt to ensure as speedy a resolution as possible.

Posted Oct 05, 2015 - 14:05 CAT

Update

Upstream Vendor Network Engineers are onsite attending to core networking infrastructure affecting Securemail. ETA to full resolution unknown at this stage.

Posted Oct 05, 2015 - 12:33 CAT

Update

Upstream vendor is still working on Network stability resolution and SYNAQ Engineers are starting to see improvements to mail delivery flows.

Posted Oct 05, 2015 - 11:57 CAT

Update

Update: DDOS likely caused by degraded Network Connectivity and high email retry rate, triggering Securemail DDOS protection.
We have our upstream vendor investigating.

Posted Oct 05, 2015 - 10:59 CAT

Monitoring

SYNAQ Engineers have begun restoring normal functioning after the DDOS attack and systems are starting to stabilise.
We are monitoring and will update clients within the next 30 minutes.

Posted Oct 05, 2015 - 09:34 CAT

Identified

SYNAQ Engineers have made progress in mitigating the DDOS attack, and are making the necessary system changes to bring services back to normal operation. More information will be forthcoming in the next 30 minutes.

Posted Oct 05, 2015 - 09:19 CAT

Investigating

SYNAQ Securemail is currently experiencing a DDOS attack.
This means that clients may experience errors in sending email, and delays in receiving email while our engineers work to combat the attack.

We have no resolution ETA at this point, but will provide updates regularly.
Thank you.

SYNAQ Securemail Team.

Posted Oct 05, 2015 - 08:48 CAT

This incident affected: SYNAQ Cloud Mail and SYNAQ Securemail.