On 30 May 2023 a minor release was deployed to all EU environments (clusters eu2 and eu3). Such a release should not cause any downtime, as our servers are deployed in a rollout mode, i.e. a new server is deployed before a server with the previous version is terminated.
We got alerted at 18:41 CEST that 3 environments were not reachable anymore (internal alarming system).
We quickly identified that the servers in question had been evicted due to lack of resources on our (Kubernetes) cluster and therefore scaled the cluster manually to add resources.
The 3 environments were up and running again at 18:46 CEST.
Impact: 3 customers were affected, as their environments were not reachable anymore during these 5 minutes. This would have affected test takers trying to open the system check of exam setup as well as administrators, proctors and reviewers. Note: Test takers in the middle of an exam were not affected, no video recordings were lost or interrupted.
Lead-up: Deployment of release 4.5.4.
Resolution: We scaled the cluster manually to create extra resources to host the servers.
Prevention measures: A deprecation warning was causing our deployment script to update more environments than normal, leading to the resource problem. Immediate remedy: Remove the deprecation. Longer term remedy: We will update the script to take warnings into account.
Posted Jun 05, 2023 - 11:52 UTC
Resolved
This incident has been resolved.
Posted May 30, 2023 - 17:11 UTC
Identified
Between 16:41 and 16:48 UTC, some environments may have been temporarily unavailable, responding with a 503 or 502 error page. This is now resolved.
Posted May 30, 2023 - 17:10 UTC
This incident affected: European Cluster EU2 (EU2 - User Sign-in, EU2 - Exam Administration) and Contacting Support Team.