Ubiquiti / UniFi Protect Outage (Aug 2020)

I wanted to share a couple of thoughts on the recent UniFi Protect outage of Aug 25th 2020, as it raises a number of questions that I think Ubiquiti need to answer.

Firstly, though, outages happen. It’s not a question of if, but when, and how severe. What’s important is that companies (and individual teams) expect them, plan for how to handle them, and provide quick mitigations (including proactively designing for failure modes). Having the right data and the correct processes is critical to handling outages.

The incident

The incident was described as “Elevated API Errors” affecting “AmpliFi Cloud & UniFi Protect (AmpliFi Cloud Production API)”. In practice, this meant customers couldn’t access their UniFi Protect cameras either locally or remotely.

There were two outages that, according to the status page summary, lasted a combined 8 hours 14 minutes (more on this figure later). The first was classed as a partial outage; then, 11 minutes after it was “resolved”, a major outage was triggered.

Partial outage

The partial outage lasted 6 hours, 27 minutes, but the status website wasn’t updated until 4 hours in. Meanwhile, customers like me couldn’t access their devices remotely.

Major outage

The major outage was posted 11 minutes after the partial outage was marked as resolved. The status updates were much faster this time, but the timeframe on the status website doesn’t add up: the timeline shows updates spanning 12 hours and 9 minutes. That would make the total outage 14 hours and 39 minutes, and their rolling 30-day availability 97.96% (if you calculate it solely from the status timeline).
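The arithmetic above is easy to check. A minimal sketch, using the durations quoted from the status page (the 30-day window is the standard assumption for a rolling availability figure):

```python
# Sanity-check the outage availability arithmetic. Durations come from the
# status page figures quoted above; the 30-day window is an assumption.

def availability(outage_minutes: float, window_days: int = 30) -> float:
    """Rolling availability (%) over a window of whole days."""
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    return 100.0 * (1 - outage_minutes / window_minutes)

# Official summary figure: 8 hours 14 minutes of combined outage.
summary = availability(8 * 60 + 14)

# Status-timeline figure: 14 hours 39 minutes of total downtime.
timeline = availability(14 * 60 + 39)

print(f"Summary:  {summary:.2f}%")   # noticeably higher than the timeline figure
print(f"Timeline: {timeline:.2f}%")
```

The timeline-based figure comes out at roughly 97.97%, in line (modulo rounding) with the 97.96% above, while the official 8 h 14 m summary would imply about 98.86% — which is exactly why the discrepancy between the two sets of numbers matters.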

Plenty of customers were vocal about the issue; I personally cut support tickets after a couple of hours (as soon as I could work around a separate issue with their ticketing system).

The questions

Let’s be honest, problems like this happen. But there are a number of issues here that Ubiquiti need to address publicly if they want their customers to have confidence in their cloud systems (a lot of their target customer base are sysadmins and tech professionals, after all).

  1. Why did this outage last so long? If it was caused by a change (a bad deployment, for example), why wasn’t it rolled back quickly? What actions did folks have to take that took so long to get out of the door?
  2. What happened between the two incidents? Why did resolving a partial outage cause an even worse outage? Who is in charge of resolving these incidents and making the decisions about the course of action and assessing the customer impact to help teams make the right decisions?
  3. Why was the blast radius so wide? This was either an automated deployment (see #1 about rollbacks) or a manual change. Either way, why did this have global impact? Are changes not deployed into different regional stacks to isolate faults? Was it deployed too quickly, so the damage was done by the time the issue was apparent? Or, do they just have a single global system, that represents a clear single point of failure?
  4. How can customers still retain access to their local camera systems when the Ubiquiti cloud is down? Customers like the fact that footage is local, but still accessible remotely. So why do both local and remote access require cloud connectivity?
  5. Why do the timeline and availability data not align with their own summary information? How is availability being measured?

I raised my request for a root cause analysis / post-mortem in my support ticket, and I was told:

Regarding your query, we don’t have any official updates as of now. When we do so it’ll available on community.ui.com

I understand your concern, but if you check the historical uptime we haven’t had any major outages before this.

I think Ubiquiti owe it to their customers to provide more analysis of this outage. It’s not about historic performance; it’s about having confidence in their engineering systems and processes, so you can be sure you can rely on them.