Timeouts and Instability

Incident Report for CabMD

Postmortem

Service Disruption Report: April 15, 2025

Incident Summary

On April 15, 2025, our application experienced a service disruption lasting approximately 3.5 hours (11:30 AM - 3:00 PM EST). Some users encountered slow response times and occasional error messages when attempting to access the system. The issue was resolved at approximately 3:00 PM EST.

What Happened

Our monitoring systems detected unusually high resource usage on one of our application servers, which began shortly after noon on April 15. This resource constraint led to slowdowns in processing user requests and managing user sessions, resulting in the errors some users experienced.

Impact

  • Some users experienced slower-than-normal response times
  • Intermittent error messages when accessing certain features
  • Occasional session timeouts requiring re-login
  • No data loss occurred during this incident

Root Cause

The issue was traced to a code update deployed on April 7, 2025. Under certain conditions, this update caused the application to use computing resources inefficiently when retrieving service information. As user activity increased throughout the day on April 15, the system eventually reached a tipping point where these inefficiencies began impacting performance.
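The "tipping point" described above can be illustrated with a short back-of-the-envelope calculation. This is only a sketch under stated assumptions: the pool size, request rates, and per-request times below are hypothetical figures chosen for illustration, not measurements from this incident. It uses Little's law (requests in flight = arrival rate x time per request) to show how a modest increase in per-request cost can stay harmless at low traffic and then exceed a fixed capacity as traffic grows.

```python
# Illustrative only: every number here is a hypothetical assumption,
# not a measurement taken from the CabMD incident.

def in_flight_requests(arrival_rate_per_s, ms_per_request):
    """Little's law: average requests being processed at the same time."""
    return arrival_rate_per_s * ms_per_request / 1000


def is_saturated(arrival_rate_per_s, ms_per_request, pool_size):
    """True when average demand exceeds the number of available workers."""
    return in_flight_requests(arrival_rate_per_s, ms_per_request) > pool_size


if __name__ == "__main__":
    POOL_SIZE = 50  # hypothetical number of worker threads
    for rate in (100, 200, 400):      # hypothetical requests per second
        for cost_ms in (100, 200):    # hypothetical time per request
            demand = in_flight_requests(rate, cost_ms)
            state = "SATURATED" if is_saturated(rate, cost_ms, POOL_SIZE) else "ok"
            print(f"{rate} req/s at {cost_ms} ms -> {demand:.0f} in flight ({state})")
```

In this sketch, doubling the per-request time has no visible effect until the request rate is high enough for demand to exceed the pool, which is consistent with a change deployed on April 7 only causing visible problems under the heavier load of April 15.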

Resolution

Our technical team identified the issue and implemented an immediate fix by restarting the affected application service, which restored normal functionality. Our development team has identified the specific code pattern causing the inefficiency and is implementing a permanent solution.

Corrective Actions

Immediate Actions Taken

  1. Restarted the affected application service to restore functionality
  2. Identified the specific code pattern causing the issue
  3. Implemented enhanced monitoring to detect similar issues earlier

Planned Improvements

  1. Deploying an optimized code update to permanently resolve the issue
  2. Enhancing our performance testing processes prior to deployment
  3. Implementing additional automated alerts to catch similar issues before they impact users
  4. Reviewing similar code patterns across our application to prevent related issues

Commitment to Service Quality

We understand the importance of system reliability and apologize for any inconvenience this disruption may have caused. We are committed to the continuous improvement of our systems and processes to provide you with the most reliable service possible.

If you have any questions or need further assistance, please contact our support team.

Posted Apr 15, 2025 - 14:47 EDT

Resolved

On April 15, 2025, our application experienced Redis timeout errors, resulting in service degradation for approximately 3 hours. The root cause was thread pool exhaustion in the application server, triggered by a code change deployed on April 7, 2025. Post-mortem to follow.
Posted Apr 15, 2025 - 14:45 EDT
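
For readers interested in how thread pool exhaustion can surface as cache timeouts, the following is a minimal sketch, not the actual application code. The function names (slow_lookup, cache_get, run_demo), the pool size, and the timings are all hypothetical. It shows a normally fast call timing out purely because every worker in a bounded pool is occupied by slow work, even though the call itself would return immediately.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_lookup(seconds):
    """Hypothetical stand-in for an inefficient, blocking code path."""
    time.sleep(seconds)
    return "slow done"


def cache_get():
    """Hypothetical stand-in for a cache read that is normally instant."""
    return "cached value"


def run_demo(pool_size=2, slow_tasks=2, slow_seconds=0.5, timeout=0.1):
    """Return (timed_out, final_value) for a fast call queued behind slow ones."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        # Occupy workers with slow tasks (all of them, by default).
        for _ in range(slow_tasks):
            pool.submit(slow_lookup, slow_seconds)

        # The fast call has to wait in the queue for a free worker.
        fast = pool.submit(cache_get)
        try:
            fast.result(timeout=timeout)
            timed_out = False
        except FutureTimeout:
            # The caller gives up, although the cache itself is healthy.
            timed_out = True

        # Once a worker frees up, the queued call completes normally.
        final_value = fast.result()
    return timed_out, final_value


if __name__ == "__main__":
    print(run_demo())               # pool fully occupied by slow work
    print(run_demo(slow_tasks=0))   # idle pool
```

Under these assumptions the caching service is never the bottleneck, which may explain why an observer would first suspect the connection to the cache before the application server itself.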

Update

The investigation continues, but we believe we've isolated the issue to the connection to the caching service. We've deployed a new instance of the caching server to try to alleviate that issue.
Posted Apr 15, 2025 - 13:36 EDT

Monitoring

A fix has been implemented and we are monitoring the results.
Posted Apr 15, 2025 - 13:12 EDT

Investigating

We are currently investigating this issue.
Posted Apr 15, 2025 - 12:33 EDT
This incident affected: Website Frontend (http://www.cab.md), Website Backend (https://my.cab.md), and API (api.cab.md).