Cloud Mail Incident - 22/06/2021
Incident Report for SYNAQ
Postmortem

Summary and Impact to Customers

On Tuesday 22nd June 2021 at 14:48, SYNAQ Cloud Mail began experiencing an intermittent, degraded mail authentication incident, which was fully resolved on Monday 28th June 2021 at 14:55.

The impact of the event was that certain users received authentication pop-up messages when trying to log in via HTTPS, POP3/S, IMAP/S, or SMTP/S, and experienced slow access to webmail.

Root Cause and Solution

On the 22nd of June 2021 at 14:48 SYNAQ Cloud Mail began to experience incoming mail delays. This delay occurred at the Zimbra MTA (Mail Transfer Agent) layer. Once the layer at fault was identified, we attempted a series of fixes to resolve the incident.

At 15:30 the processing threads on the MTA servers were increased from 100 to 150, allowing each MTA to handle more messages concurrently and work through the mail building up in the queue. However, this did not have the desired effect.
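The backlog itself is visible as files accumulating in the MTA's queue directories. As an illustration only, the sketch below counts queued messages on a Postfix-based Zimbra MTA, assuming the default Postfix spool layout under /var/spool/postfix (actual paths may differ per installation):

    #!/usr/bin/env python3
    """Hypothetical sketch: count queued messages on a Postfix-based MTA.

    Assumes the default Postfix spool layout under /var/spool/postfix;
    Zimbra installs may place the spool elsewhere.
    """
    from pathlib import Path

    SPOOL = Path("/var/spool/postfix")
    QUEUES = ("incoming", "active", "deferred")

    def queue_depths(spool: Path = SPOOL) -> dict:
        """Return the number of queue files in each Postfix queue directory."""
        depths = {}
        for name in QUEUES:
            qdir = spool / name
            # Queue files are regular files nested in hashed subdirectories.
            depths[name] = sum(1 for p in qdir.rglob("*") if p.is_file()) if qdir.exists() else 0
        return depths

    if __name__ == "__main__":
        for queue, count in queue_depths().items():
            print(f"{queue}: {count} message(s)")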

At 16:30 a new Exim server was added to the MTA cluster (a project that had been planned for the next couple of weeks, as detailed in the root cause section below), and this server was able to process mail with no delays. We then stopped routing new mail to the existing MTAs, one at a time, until each had cleared its queue. Normal mail flow was restored by 16:51.

On the 23rd of June 2021 at 09:09, mail delays recurred, coupled with a select group of users receiving authentication failure "pop-up" messages when trying to log in to their mailboxes.

At 09:52 debugging was performed on the anti-virus functions on the MTA servers, as this appeared to be where the delay was occurring. Configuration changes were then made to timeout settings and processing times to try to resolve the issue. Unfortunately, this did not have the desired effect.
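The report does not name the anti-virus engine; Zimbra deployments commonly pair Amavis with ClamAV, whose clamd daemon listens on TCP port 3310 by default. A minimal sketch of the kind of latency check used when debugging such a delay (host and port here are assumptions):

    #!/usr/bin/env python3
    """Hypothetical sketch: measure clamd response latency over its TCP socket.

    The incident report does not name the anti-virus engine, so treat the
    host/port as placeholders for whatever scanner the MTAs call.
    """
    import socket
    import time

    def clamd_ping(host: str = "127.0.0.1", port: int = 3310, timeout: float = 5.0) -> float:
        """Send a newline-delimited PING command and return the round-trip time."""
        start = time.monotonic()
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"nPING\n")  # 'n' prefix = newline-delimited clamd command
            reply = sock.recv(64)
        elapsed = time.monotonic() - start
        if not reply.startswith(b"PONG"):
            raise RuntimeError(f"unexpected clamd reply: {reply!r}")
        return elapsed

    if __name__ == "__main__":
        print(f"clamd PING round-trip: {clamd_ping():.3f}s")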

At 10:15 the replacement of the remaining Cloud Mail MTAs with new Exim servers commenced. This replacement was decided upon because the one Exim server already in production was processing mail without delay. The replacement was fully completed by 20:30.

By 17:14, mail flow and authentication had recovered.

On the 24th of June 2021 at 09:41, a select group of users began receiving authentication pop-up messages when trying to log in and experienced slow access to mail. As mail flow was no longer affected following the change to the new Exim servers, at 10:17 we moved our focus from the MTAs to the LDAP servers.

A data dump of the master LDAP database was performed and reloaded on all the replicas to rule out memory page fragmentation (a performance-inhibiting side effect) across these LDAP servers.
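The exact reload procedure is not detailed in the report; with OpenLDAP, a dump-and-reload of this kind is typically done with the slapcat and slapadd tools. A rough sketch of the standard sequence, where the service name, dump path, and database handling are assumptions:

    #!/usr/bin/env python3
    """Hypothetical sketch: dump-and-reload an OpenLDAP replica to defragment it.

    Shows the standard slapcat/slapadd sequence, not SYNAQ's actual procedure.
    The service name ('slapd') and dump path are assumptions.
    """
    import subprocess

    DUMP_FILE = "/tmp/ldap-dump.ldif"  # assumed scratch location

    def run(*cmd: str) -> None:
        """Run a command, raising if it fails."""
        subprocess.run(cmd, check=True)

    def reload_replica() -> None:
        run("systemctl", "stop", "slapd")   # take the replica out of service
        run("slapcat", "-l", DUMP_FILE)     # export the database to LDIF
        # ...the old database files would be moved aside here before reloading...
        run("slapadd", "-l", DUMP_FILE)     # rebuild a fresh, compact database
        run("systemctl", "start", "slapd")

    if __name__ == "__main__":
        reload_replica()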

At 13:00 file system errors were discovered on the master LDAP server. All Cloud Mail servers were adjusted to point to the secondary master, and a file system check and repair was run on the primary LDAP master. While the server was down, its memory and CPU resources were increased.

By 13:10, mail authentication and access speeds had recovered.

On the 25th of June 2021 at 09:22, a select group of users were receiving authentication pop-up messages when trying to log in and experiencing slow access to mail.

At 10:35, TCP connection and timeout settings were adjusted on all the LDAP servers.
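The specific settings changed are not listed in the report. As an illustration of the kind of tuning involved, the sketch below enables aggressive TCP keepalives and a read timeout on a client socket to an LDAP server, so that dead connections are detected quickly rather than hanging (the socket options are Linux-specific, and the host/port are placeholders):

    #!/usr/bin/env python3
    """Hypothetical sketch: the kind of TCP keepalive/timeout tuning involved.

    Not SYNAQ's actual settings; an illustration of detecting dead LDAP
    connections quickly. Constants are Linux-specific.
    """
    import socket

    def tuned_ldap_socket(host: str, port: int = 389) -> socket.socket:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 30)   # idle secs before probing
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 10)  # secs between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # failed probes before drop
        sock.settimeout(10.0)  # fail connects/reads instead of hanging indefinitely
        sock.connect((host, port))
        return sock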

At 12:35, connection tracking was disabled on the load balancer. This was done so that if a particular LDAP replica developed a problem, connections would move seamlessly to another replica. Mail authentication and access speeds then recovered.
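The same failover behaviour can be pictured from the client side. The sketch below uses the third-party ldap3 library (hostnames and credentials are placeholders, not SYNAQ's configuration) to build a server pool that temporarily skips a replica that stops responding:

    #!/usr/bin/env python3
    """Hypothetical sketch: client-side failover across LDAP replicas.

    Illustrates the behaviour the load balancer change was meant to achieve:
    if one replica stops responding, operations move to another.
    """
    from ldap3 import Connection, Server, ServerPool, ROUND_ROBIN

    pool = ServerPool(
        [Server("ldap-replica1.example.net"), Server("ldap-replica2.example.net")],
        ROUND_ROBIN,
        active=True,  # probe servers for availability before use
        exhaust=60,   # skip an unresponsive replica for 60 seconds
    )

    conn = Connection(pool, user="cn=reader,dc=example,dc=net",
                      password="secret", auto_bind=True)
    # If the replica serving this search dies, the pool retries on the next one.
    conn.search("dc=example,dc=net", "(mail=user@example.net)", attributes=["uid"])
    print(conn.entries)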

On the 26th and 27th of June we experienced no further recurrences of the issues.

On the 28th of June at 09:30, a select group of users were again receiving authentication pop-up messages when trying to log in and experiencing slow access to mail.

At 10:00, two new LDAP replicas were built to be added to the cluster.

At 11:03, the global address list (GAL) feature was turned off for classes of service with large domains that did not need it, to reduce traffic to the LDAP servers.
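In Zimbra, the GAL feature is governed per class of service by the zimbraFeatureGalEnabled attribute, which can be changed with the zmprov CLI. A hypothetical sketch of applying this across a list of CoS names (the names are placeholders):

    #!/usr/bin/env python3
    """Hypothetical sketch: disable the GAL feature for selected classes of service.

    Uses Zimbra's zmprov CLI; the CoS names below are placeholders, not
    SYNAQ's actual classes of service.
    """
    import subprocess

    LARGE_DOMAIN_COS = ["cos-large-domain-a", "cos-large-domain-b"]  # placeholders

    for cos in LARGE_DOMAIN_COS:
        subprocess.run(
            ["zmprov", "modifyCos", cos, "zimbraFeatureGalEnabled", "FALSE"],
            check=True,
        )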

At 13:05, we deleted 30 data sources (external account configurations) that were stored in LDAP but were showing errors during LDAP replication.

At 14:30, the two new LDAP replicas were brought into service, and each component of Cloud Mail (stores, MTAs, and proxies) was pointed to its own dedicated set of LDAP replicas.

By 14:55, mail authentication and access speeds had recovered.

The root cause of this event was a project we initiated last year to replace the standard Zimbra MTAs with custom-built Exim MTAs, with the aim of significantly improving the security and delivery of clients' mail. The initial project phase (last year) replaced the outbound servers; the inbound servers were scheduled for July. A test inbound server was added, and this triggered the issues described above. In addition, replacing all of the remaining MTAs with the new inbound servers in an attempt to resolve the issue only exacerbated the problem.

The problem this introduced was that all Zimbra-native servers establish persistent connections to the LDAP servers, whereas the new MTAs, introduced to reduce load and traffic to the LDAP servers, establish short-lived connections. The load balancer treated both connection styles the same way, which would overload a single LDAP server and then affect the rest in a cascading manner as the load was redistributed.
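The mismatch between the two connection styles can be sketched with the third-party ldap3 library (hosts, credentials, and queries are placeholders, not SYNAQ's configuration): Zimbra-native servers bind once and reuse the connection indefinitely, while the new MTAs bind, query, and unbind for each lookup:

    #!/usr/bin/env python3
    """Hypothetical sketch: persistent vs. short-lived LDAP connection styles."""
    from ldap3 import Connection, Server

    server = Server("ldap-lb.example.net")  # placeholder load balancer address

    # Persistent style (Zimbra-native servers): bind once, reuse for many lookups.
    persistent = Connection(server, user="cn=zimbra,dc=example,dc=net",
                            password="secret", auto_bind=True)
    for uid in ("alice", "bob"):
        persistent.search("dc=example,dc=net", f"(uid={uid})", attributes=["mail"])
    # The connection stays open: the load balancer sees one long-lived flow.

    # Short-lived style (new Exim MTAs): bind, query, and unbind per lookup.
    for uid in ("alice", "bob"):
        with Connection(server, user="cn=exim,dc=example,dc=net",
                        password="secret", auto_bind=True) as conn:
            conn.search("dc=example,dc=net", f"(uid={uid})", attributes=["mail"])
    # Every lookup is a fresh TCP flow; balancing these per-connection can
    # pile them onto one replica and overload it.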

To resolve this issue, two different load balancer IP addresses were configured, each with its own separate set of LDAP servers behind it: one to manage persistent connections and the other to manage short-lived connections. The relevant servers were then pointed to the load balancer IP that matches how they connect to LDAP.

Remediation Actions

• Two additional LDAP replicas have been built and added to the LDAP cluster.

• Two different load balancer IP addresses have been configured, each with its own separate set of LDAP servers behind it: one to manage persistent connections and one to manage short-lived connections. The relevant servers have been pointed to the load balancer IP that matches how they connect to LDAP.

• A third load balancer IP will be added to improve LDAP redundancy. This will allow store servers to attempt a new connection rather than remaining connected to an LDAP server that is no longer responding.

Posted Jul 21, 2021 - 14:14 CAT

Resolved
Dear Clients,

The SYNAQ Cloud Mail incident has been resolved and the service has returned to optimal functionality.
Posted Jun 22, 2021 - 16:51 CAT
Update
Dear Clients,

Our engineers are still investigating the SYNAQ Cloud Mail incident. Please be assured that we have this under control and are treating this as a top priority.

We will send our next update in 60 minutes.
Posted Jun 22, 2021 - 15:56 CAT
Investigating
Dear Clients,

SYNAQ Cloud Mail is currently experiencing an incident where mail delivery is delayed. Engineers are investigating this as a matter of urgency.

We will send our next update in 60 minutes.
Posted Jun 22, 2021 - 14:48 CAT
This incident affected: SYNAQ Cloud Mail.