Sr. SRE II - Incident Excellence (10+)
HashiCorp | 80 days ago | Bengaluru

In this role, you can expect to:

  • Own and drive incident management capabilities and culture.
  • Contribute to the incident command on-call rotation.
  • Build technical skills and relationships within a team of engineers and SREs.
  • Lead and refine our incident response strategy, ensuring rapid and effective response to operational disruptions.
  • Analyze incident trends and root causes to drive continuous improvements in system reliability and response processes.
  • Develop and maintain tools for incident detection, analysis, and resolution, automating responses where possible to minimize human intervention.
  • Create comprehensive incident response documentation and conduct training sessions to prepare all relevant teams for effective incident handling.
  • Work closely with development, operations, and security teams to coordinate incident response efforts and post-incident analyses.

 

You may be a good fit for our team if:

  • You have at least 10 years of experience in site reliability engineering, systems administration, or software engineering, with a significant focus on incident response and operational reliability.
  • You have 2+ years of experience managing, coordinating, and ensuring resolution of major incidents.
  • You have professional experience with incident management in cloud environments.
  • You enjoy working on a variety of scopes spanning software engineering, cloud infrastructure, and SRE.
  • You have a proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms (AWS, GCP, Azure).
  • You understand fundamental network technologies such as DNS, load balancing, SSL/TLS, TCP/IP, and HTTP.
  • You have a strong understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
  • You have experience with incident management tools and practices, including post-mortem analysis and root cause investigation.
  • You have a passion for consistently responding to and leading complex incidents in a 24x7x365 environment using a global follow-the-sun model.
  • You have a customer-centric attitude with a focus on providing best-in-class incident response for customers and stakeholders.
  • You are familiar with HashiCorp’s product suite and infrastructure automation tools (a plus).
  • You demonstrate strong leadership skills during periods of significant business impact, remaining calm and professional in high-pressure situations.
  • You have a strong desire to drive customer success with partner teams and management on high-profile issues critical to the long-term success of the business.
  • You have outstanding verbal and written communication skills, with the ability to convey information in a meaningful way to both engineers and executive-level management, during and outside of incidents.
  • You adapt to a wide variety of technologies and are capable of incident response and troubleshooting in complex, interconnected environments.