On May 4, starting at 3:52pm PT, Streem's Web Application became unavailable to new browser sessions (including new logins and page refreshes). Instead of loading the application, the user would see an Error Screen. The error screen appeared because a call to our global API endpoint (which determines which region a company code belongs to) was failing with a 403 error. This was resolved at 5:20pm PT for non-Canada customers, and 6:03pm PT for all customers.
During a routine deployment to production, terraform (which configures all our server infrastructure and settings) quit unexpectedly mid-run. This deployment included changes that required terraform to destroy and recreate the VPC Peering and Security Group configuration that allows our Global Region to talk to our Canada Region (which are deployed in fully separate AWS accounts). The terraform process exited after tearing down the resources, but before it could recreate them. This meant that the Global Region was no longer able to communicate with the Canada Region, which is a requirement for the endpoint that determines which Region a Company should use. The first thing our applications do when loading the page is look up which API Region to use, and because this endpoint was now failing, users saw the error screen instead of the application.
An interim solution was used to get our US Region working again, followed by a larger fix for CA.
The interim solution was to deploy new Global Services with an environment variable change that allowed them to operate without knowing about the Canada Region. Once the Global Services started back up, they could satisfy the request to look up a Region by Company. The US Region was available again after this change.
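As an illustration only, the interim change can be pictured as a region lookup that skips any region named in an environment variable. This is a minimal sketch and not Streem's actual code: the `DISABLED_REGIONS` variable, the function names, and the sample company code are all hypothetical.

```python
import os
from typing import Callable, List, Mapping, Optional

# A lookup takes a company code and reports whether that region owns it.
RegionLookup = Callable[[str], bool]


def enabled_regions(all_regions: List[str],
                    env: Mapping[str, str] = os.environ) -> List[str]:
    """Return all_regions minus those listed in DISABLED_REGIONS.

    DISABLED_REGIONS is a hypothetical, comma-separated variable name.
    """
    raw = env.get("DISABLED_REGIONS", "")
    disabled = {name.strip().upper() for name in raw.split(",") if name.strip()}
    return [r for r in all_regions if r.upper() not in disabled]


def find_region(company_code: str,
                lookups: Mapping[str, RegionLookup],
                env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Ask each enabled region whether it owns company_code."""
    for region in enabled_regions(list(lookups), env):
        if lookups[region](company_code):
            return region
    return None


def ca_unreachable(company_code: str) -> bool:
    # Simulates the broken network path to the Canada Region.
    raise ConnectionError("Canada Region is unreachable")


# Hypothetical lookups: CA is queried first and is currently unreachable.
lookups = {"CA": ca_unreachable, "US": lambda code: code == "acme"}

# With CA disabled, the lookup succeeds without contacting the Canada Region.
region = find_region("acme", lookups, env={"DISABLED_REGIONS": "CA"})
```

In this sketch, leaving `DISABLED_REGIONS` unset would make the same call raise `ConnectionError`, which mirrors why the lookup endpoint failed for every customer while the Canada Region was unreachable. The trade-off is that companies belonging to a disabled region cannot be resolved until that region is re-enabled.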
The larger solution was to get terraform back into a working state. To do that, we had to manually recreate the VPC Peering and Security Group configuration that would allow Global to communicate with CA. Once we did that, we were able to complete a full terraform run, and the incident was resolved.
We have completed several deep dives into the cause of this incident and the ways we could have avoided it. The following are the actions we are taking to mitigate future issues of this type:
We sincerely apologize for the inconvenience that this incident may have caused, and are working hard to detect and prevent any future issues. Thank you for being a valued Streem customer.