SYNAQ Archiving Incident 12-01-2017

Incident Report for SYNAQ

Postmortem

Summary and Impact to Customers

On Thursday, 12 January 2017, from 10:05 to 17:25 CAT, SYNAQ Archive experienced a platform outage.

As a result, a subset of Clients was unable to access their archived data.

Root Cause and Solution

The root cause of this event was an unresponsive infrastructure switch responsible for managing multipath routing, which caused the storage of the affected Archive cluster to become unreadable.

The unresponsive switch caused the Archive operating system to place the storage LUNs in an unreadable state. As a result, a subset of Clients on the affected Archive cluster was unable to access their Archive data.

To resolve the issue, we remotely restarted the affected switch to restore multipath routing management. In addition, we initiated a file system integrity scan on all affected storage before restoring and activating the Archive service.

Remediation Actions

A project is underway to review and enhance the existing Archive environment, covering its infrastructure components and system methodology, in order to improve efficiency and prevent similar incidents from recurring.

Posted Jan 19, 2017 - 13:12 CAT

Resolved

All affected archives have been fully restored.

Posted Jan 12, 2017 - 17:25 CAT

Monitoring

The issue has been resolved and the affected archives are slowly recovering.

Posted Jan 12, 2017 - 14:26 CAT

Update

The team is still working on the issue. We apologize for any inconvenience caused.

Posted Jan 12, 2017 - 12:46 CAT

Identified

The team has identified the cause of the issue and is currently working on a resolution.

Posted Jan 12, 2017 - 11:17 CAT

Investigating

SYNAQ Archiving currently has a partial outage in which a subset of clients is unable to access their archive data.

Posted Jan 12, 2017 - 10:05 CAT

This incident affected: SYNAQ Archive.