You wait ages for an outage and then two come along at once!
Our clients have been unable to access our app on just two occasions in living memory, both of them in the last 7 days. This is the story.
The first one, which affected access to Donorfy for a couple of hours, has still not been explained. Microsoft claim nothing changed at their end, and certainly nothing changed in our app between it working and it not working. Only the cloud equivalent of switching it off and back on again made the problem go away.
The second, longer one, which hit in the early hours of this morning, was an issue at the Microsoft Azure Northern Europe data centre.
Outages happen - no one means them to, but they're a fact of life. Like all mishaps, self-inflicted or not, it's how you respond to them that matters, right? Once it became clear that whatever had befallen the Northern Europe Azure data centre was not going to be fixed in a hurry, we decided to take action ourselves. Here's what we did.
Our superstar dev team* dropped everything and set about copying the whole kit and caboodle from the Northern Europe data centre to the UK South data centre. Not moving physical servers as such, but kind of copying and pasting the software and setup. No mean feat for a complex app like Donorfy and its ancillary parts.
This meant updating the DNS records so the URLs pointed to the new location. Once that change had rippled its way around the world wide web, Donorfy was up and running from its new home in the UK South data centre - in under 5 hours from start to finish.
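For the curious, the decision behind that DNS swap can be sketched in a few lines of Python. This is a simplified illustration, not Donorfy's actual infrastructure code, and the hostnames and region names are hypothetical:

```python
# Sketch of a region failover decision: prefer the primary region,
# fall back to the secondary when the primary is unhealthy.
# Hostnames and region names below are made up for illustration.
ENDPOINTS = {
    "north-europe": "app-ne.example.com",
    "uk-south": "app-uks.example.com",
}

def active_endpoint(primary="north-europe", fallback="uk-south", healthy=None):
    """Return the hostname DNS should point at, preferring the primary region."""
    healthy = healthy or set()
    region = primary if primary in healthy else fallback
    return ENDPOINTS[region]

# Normal operation: traffic goes to the primary region.
print(active_endpoint(healthy={"north-europe"}))  # app-ne.example.com

# Primary region down: DNS is repointed at the fallback region.
print(active_endpoint(healthy=set()))  # app-uks.example.com
```

The hard part in practice isn't this decision - it's having a warm copy of the app ready in the fallback region, and waiting for the DNS change to propagate.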
It’s not an experience we want to repeat every Thursday but these things happen, as they say. They also say - whoever they are - that you learn from adversity, so what are our learning points and the outcomes from this?
Our clients depend on the service! Of course they do, but there’s nothing like an outage and the inevitable flood of tickets, emails, texts and phone calls to remind you of it.
Communication is everything - I think we got better at it the second time around.
We have a more resilient setup as a result of this - Donorfy is now deployed in 2 data centres and we know how to swap between them should anything like this happen again. This is a huge win for our clients and us.
Whatever B****t ends up meaning from a data protection point of view, we are now close to being ready to deploy Donorfy 100% in the UK for UK clients and 100% in the EU for EU clients.
Was there anything we could have done better or differently? Honestly, we don't think there was. The circumstances were beyond our control in both cases. That said, the extra resilience we now have does reduce the risk of Donorfy downtime in the future.
Having a 100% remote, connected team (which we do) was not a barrier to solving the problem - it probably helped. There was none of the panic that can infect an office when a crisis hits - just the right people quietly working remotely, but together, to resolve the issue for our clients.
And lastly, having the right team members is essential. I’m proud to say we do: trustworthy, skilled and committed to getting the service back online for our clients.
Thanks, of course, to you, our clients, for your patience and good humour. We know it was frustrating. For some it turned out to be a good time to get round to those other tasks on the to-do list. We're choosing to turn this into a positive for all of the reasons explained above. After all, every cloud (outage) should have a silver lining.