The Workday team, like any SaaS provider or IT organization, does everything it can to prevent unplanned system failures. We also have detailed plans in place for getting customers back online as quickly as possible in the event of an unplanned outage.
Yesterday, the network-attached storage (NAS) device that stores operating system files for our production servers detected a corrupted node within a backup RAID array. Rather than simply logging the error, which is what it was supposed to do, the NAS took itself offline. It is ironic that the redundant backup to a system with built-in redundancy caused the failure.
This type of error should not have caused the array to go offline, but it did. The most important result is that our failover plans worked as expected. Within hours, all customers were live in our secondary datacenter with all their data intact.
We’ve tested our failover plans many times, but this is the first time we have had to use them for real. We’ve learned quite a bit in the process – some of it technical, some of it about communicating with customers. We will use that knowledge to further refine our datacenter practices, our hardware choices, and our failover plans so that we can do even better in the future.
While any unplanned outage is unacceptable, a successful and timely recovery – one that ensures the integrity of our customers’ data – is something we’re very proud of. Moreover, I want to thank our customers for their understanding during the outage. We’ve had numerous communications with customers as we brought systems back online, and the overwhelming sense of support and teamwork has been an incredibly positive experience.
And now, back to work.