Lead and manage incident response and disaster recovery efforts across high-availability SaaS environments.
Design and execute robust disaster recovery strategies to ensure alignment with Recovery Point Objectives (RPO) and Recovery Time Objectives (RTO).
Drive compliance with organizational and industry standards by embedding best practices for disaster recovery, resilience, and fault tolerance, leveraging Chaos Engineering where appropriate.
Define and evolve the incident response framework to enable rapid, coordinated resolution of operational disruptions.
Proactively identify and mitigate potential points of failure through automation and predictive tooling to enhance system stability.
Analyze incident patterns and root causes to drive continuous improvement in reliability engineering practices and response processes.
Develop, maintain, and scale engineering tools for real-time incident detection, diagnostics, and automated remediation.
Collaborate with cross-functional teams to build frameworks for incident simulation, root cause analysis, and reproducibility at scale.
Own and lead disaster recovery (DR) drills and chaos testing exercises, documenting findings and delivering actionable recommendations for resilience enhancement.
Partner closely with development, operations, and security teams to ensure cohesive incident management and comprehensive post-incident reviews.
What you’ll need (basic qualifications)
Minimum of 12 years of professional experience, including at least 2 years in a managerial capacity within a team focused on Site Reliability Engineering (SRE).
Demonstrate hands-on leadership in SRE for high-availability SaaS environments with a strong focus on reliability and operational excellence.
Possess a strong background in cloud-based software development and have led teams addressing scalability, performance, and reliability challenges.
Demonstrate excellent leadership and project management skills, with a track record of mentoring engineers and driving cross-functional collaboration.
Show a proactive approach to problem-solving, capable of anticipating and mitigating potential issues before they impact customers.
Bring experience in agile methodologies, lead teams with empathy, and commit to delivering high-quality, reliable software solutions.