Implement best practices for system reliability, including proactive identification of potential failure points and the development of automated mitigations.
Design and execute comprehensive load testing strategies to identify performance bottlenecks and scalability limits across our cloud products.
Implement best practices and technologies to improve system resilience, ensuring high availability and fault tolerance.
Work closely with engineering and product teams to integrate operational readiness into the development lifecycle, enhancing product stability and user satisfaction.
Build and refine tools and frameworks for automated testing, environment simulation, and incident reproduction, reducing manual effort and increasing test coverage.
Conduct in-depth analysis of testing results, documenting findings and making actionable recommendations for system enhancements.
Drive systemic improvements to our products by introducing chaos testing and partnering with product development teams.
Share your knowledge and expertise with team members, fostering a culture of learning and continuous improvement.
Develop and implement disaster recovery and backup strategies to ensure data integrity and system resilience.
Ideal Candidate
5+ years of experience in SRE, systems engineering, or non-functional testing roles with a focus on operational readiness, performance testing, or system scalability.
Experience in driving systemic improvements through chaos engineering practices.
Programming skills in a high-level programming or scripting language.
Proven track record of leading successful load testing and performance optimization initiatives in cloud and on-prem environments.
Experience in creating and managing test environments for automated testing.
Strong grasp of CI/CD fundamentals and of maintaining high-quality pipelines.
Experience with version control systems (e.g., Git) and agile project management methodologies.
Understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
Strong technical foundation in cloud technologies (AWS, Azure, or GCP) and container technologies such as Nomad or Kubernetes.
Strong experience with performance testing tools such as k6, Artillery, Vegeta, or Locust.
Effective communication and collaboration skills, capable of working with cross-functional teams and articulating technical concepts to diverse audiences.
Familiarity with HashiCorp products and tools is a plus.
Exposure to the disaster recovery domain is a plus.