Implement best practices for system reliability and disaster recovery, including proactive identification of potential failure points and the development of automated mitigations.
Design and execute comprehensive DR testing strategies to identify bottlenecks and failure points that affect RPO and RTO across our cloud products.
Drive DR compliance initiatives and implement best practices and technologies that improve system resilience, using the chaos testing framework to ensure high availability and fault tolerance.
Work closely with engineering and product teams to integrate operational readiness into the development lifecycle, enhancing product stability and user satisfaction.
Build and refine tools and frameworks for automated testing, environment simulation, and incident reproduction, reducing manual effort and increasing test coverage.
Conduct mock drills and drive chaos tests in collaboration with partner teams, analyzing test results, documenting findings, and making actionable recommendations for systemic improvements.
Share your knowledge and expertise with team members, fostering a culture of learning and continuous improvement.
What you’ll need (basic qualifications)
4+ years of experience in software development, reliability engineering, systems engineering, or non-functional testing roles, with a focus on disaster recovery or backup and recovery of cloud-based systems.
Commitment to pursuing a career in the reliability engineering field.
Proficiency in Go (Golang) or a scripting language.
Hands-on experience with version control tools and platforms such as Git and GitLab.
Understanding of microservices architecture.
Good understanding of CI/CD processes and of maintaining high-quality pipelines.
Exposure to cloud technologies (AWS, Azure, or GCP) and container orchestration technologies such as Nomad or Kubernetes.
Effective communication and collaboration skills, capable of working with cross-functional teams and articulating technical concepts to diverse audiences.