Unable to Login
Incident Report for Streem
Postmortem

Incident Summary

On May 4, starting at 3:52pm PT, Streem's Web Application became unavailable to new browser sessions (including new logins and page refreshes). Instead of loading the application, users saw an error screen. The error screen appeared because a call to our global API endpoint (which determines which region a company code belongs to) was failing with a 403 error. This was resolved at 5:20pm PT for non-Canada customers, and at 6:03pm PT for all customers.

Root Cause

During a routine deployment to production, Terraform (which configures all our server infrastructure and settings) quit unexpectedly mid-run. This deployment included changes that required Terraform to destroy and recreate the VPC Peering and Security Group configuration that allows our Global Region to talk to our Canada Region (the two are deployed in fully separate AWS accounts). The Terraform process exited after tearing down the resources, but before it could recreate them. As a result, the Global Region could no longer communicate with the Canada Region, which is a requirement for the endpoint that determines which Region a Company should use. The first thing our applications do when loading the page is look up which API Region to use, and because this endpoint was now failing, users saw an error screen.

Resolution

An interim solution was used to get our US Region working again, followed by a larger fix for CA.

The interim solution was to deploy new Global Services with an environment variable change that allowed them to operate without knowing about the Canada Region. Once the Global Services started back up, they could satisfy the request to look up Region by Company. US was available after this change.
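To illustrate the idea behind the interim change, here is a minimal sketch of a region lookup whose list of regions comes from an environment variable. This is not Streem's actual code: the variable name `GLOBAL_ENABLED_REGIONS`, the function names, and the assumption that the lookup queries every enabled region are all hypothetical, chosen only to show why dropping an unreachable region from the configuration lets the lookup succeed again.

```python
import os
from typing import Callable, Dict, List, Mapping, Optional

# Hypothetical variable name; the report does not name the real one.
ENV_VAR = "GLOBAL_ENABLED_REGIONS"
DEFAULT_REGIONS = ["us", "ca"]


def enabled_regions(environ: Optional[Mapping[str, str]] = None) -> List[str]:
    """Read a comma-separated region list from the environment, or use the default."""
    environ = os.environ if environ is None else environ
    raw = environ.get(ENV_VAR, "")
    regions = [r.strip().lower() for r in raw.split(",") if r.strip()]
    return regions or list(DEFAULT_REGIONS)


def lookup_region(
    company_code: str,
    clients: Dict[str, Callable[[str], bool]],
    regions: List[str],
) -> Optional[str]:
    """Ask every enabled region whether it owns the company code.

    Assumption: every enabled region is queried, so one unreachable
    region makes the whole lookup fail with that region's error.
    """
    matches = [region for region in regions if clients[region](company_code)]
    return matches[0] if matches else None
```

In this sketch, setting `GLOBAL_ENABLED_REGIONS=us` means the Canada client is never called, so a broken network path to Canada no longer affects US lookups.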

The larger solution was to get Terraform back into a working state. To do that, we had to manually recreate the VPC Peering and Security Group configuration that allows Global to communicate with CA. Once we did that, we were able to complete a full Terraform run, and the incident was resolved.

Prevention

We have completed several deep dives into the cause of this incident and the ways we could have avoided it. The following are the actions we are taking to mitigate future issues of this type:

  • Remove Global's dependency on all Regions being up in order to satisfy calls from any Region.
  • Protect VPC Peering and Security Group configuration so they are never accidentally altered.
  • Provide additional visibility into specific kinds of configuration changes that have a higher potential to cause issues.
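
As one possible way to surface high-risk configuration changes before they are applied, the sketch below scans the JSON form of a Terraform plan and lists protected resources that would be deleted or replaced. It is an illustration under stated assumptions, not a description of Streem's tooling: the set of protected resource types and the function names are hypothetical. It relies only on the `resource_changes` list in Terraform's JSON plan output, where each entry carries an `address`, a `type`, and `change.actions`.

```python
import json
from typing import Any, Dict, List

# Resource types treated as high risk here; this list is an assumption.
PROTECTED_TYPES = {
    "aws_vpc_peering_connection",
    "aws_vpc_peering_connection_accepter",
    "aws_security_group",
    "aws_security_group_rule",
}


def risky_changes(plan: Dict[str, Any]) -> List[str]:
    """Return addresses of protected resources the plan would delete or replace.

    A replacement appears in the plan as both "delete" and "create"
    actions, so checking for "delete" covers both cases.
    """
    flagged = []
    for rc in plan.get("resource_changes", []):
        actions = rc.get("change", {}).get("actions", [])
        if rc.get("type") in PROTECTED_TYPES and "delete" in actions:
            flagged.append(rc["address"])
    return flagged


def risky_changes_from_json(text: str) -> List[str]:
    """Parse plan JSON text and return the flagged resource addresses."""
    return risky_changes(json.loads(text))
```

The input could be produced with `terraform plan -out=tfplan` followed by `terraform show -json tfplan`, and a deployment pipeline could require manual approval whenever the returned list is not empty. Terraform's own `prevent_destroy` lifecycle setting is a complementary safeguard that makes a plan fail outright if it would destroy the marked resource.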

We sincerely apologize for the inconvenience that this incident may have caused, and are working hard to detect and prevent any future issues. Thank you for being a valued Streem customer.

Posted May 15, 2023 - 18:25 UTC

Resolved
This incident has been resolved.
Posted May 05, 2023 - 06:24 UTC
Monitoring
A fix has been implemented and we are monitoring the results.
Posted May 05, 2023 - 01:08 UTC
Update
US region is currently functional. CA region is still inaccessible.
Posted May 05, 2023 - 00:39 UTC
Identified
The issue has been identified and a fix is being implemented.
Posted May 04, 2023 - 23:39 UTC
Investigating
The Streem web platform is currently showing an error page for anyone who tries to log in. We are currently investigating this issue.
Posted May 04, 2023 - 23:03 UTC
This incident affected: REST API, Admin Portal, and Live Remote Calls.