Description:
Ensure high availability, scalability, and reliability of critical services and applications through monitoring, and incident response. Perform regular production readiness reviews (PRRs) and capacity planning to prepare systems for scaling and reliability. Lead incident response efforts, conduct root cause analysis (RCA), and drove post-mortem action items to improve system resilience. Collaborate with development and product teams to define and implement non-functional requirements (NFRs), inc
Sep 15, 2025;
from:
dice.com