Hello everyone. Just a brief update on the outage issues we’ve experienced in the last 48 hours for one of our Australian environments. We’ll get you a more detailed version once this is all behind us.
We had a large outage on Monday that kept Cliniko down for much of the day. We traced this to the primary database struggling to keep up with the load, and we made some quick changes to our infrastructure to alleviate the problem. The main change was adding multiple “replica” databases to share the load (usually we operate with just one replica).
Yesterday, we experienced a few short (under 10 minutes) outages, starting from 5pm AEST. These were caused by issues with the new replica infrastructure that was brought in to solve the original problem. It appears we were quite “unlucky” here: the most likely cause is hardware failure, which is quite rare. Usually we’d have resilience built into the infrastructure, so that a faulty machine is automatically removed and replaced with minimal impact. However, because we set things up quickly to solve the original database load issue, we haven’t yet been able to add that resilience. We’ve been working on the automated resilience over the last 12 hours, and we’re trying to get it live. It’s tricky to do during the busy time of the day, so we’re being cautious in rolling it out.
We’ve also found that these new replicas require approximately 13 hours of warm-up time before all their storage is ready. So as we add new ones, there is a significant delay before we can use them.
Lastly, for the original performance issues on Monday, we’re also working on some optimisations for the primary database that would allow it to handle higher load. In combination with moving much of the work to the replicas, this should give us plenty of breathing room until some bigger planned infrastructure changes come into place.
Right now, Cliniko is in a much better place than it was over the last two days, but there is still a risk of small outages today. We expect we’re safe from a significant one like we saw on Monday, but we can’t be sure we won’t see another one of under 10 minutes until this resilience is live. Of course we’re doing everything we can to avoid it, and we have all hands on deck, as we’ve had for the last 72 hours.
Joel and the team will post updates as often as we’re able to, while balancing that against getting this infrastructure work complete. We’ll also try to answer any questions you may have in the comments.