Sr. SRE - Incident Excellence (5+)
HashiCorp | 133 days ago | Bengaluru

In this role, you can expect to:

  • Own and drive incident management capabilities and culture.
  • Contribute to the incident command on-call rotation.
  • Build technical skills and relationships within a team of engineers and SREs.
  • Lead and refine our incident response strategy, ensuring rapid and effective response to operational disruptions.
  • Analyze incident trends and root causes to drive continuous improvements in system reliability and response processes.
  • Develop and maintain tools for incident detection, analysis, and resolution, automating responses where possible to minimize human intervention.
  • Create comprehensive incident response documentation and conduct training sessions to prepare all relevant teams for effective incident handling.
  • Work closely with development, operations, and security teams to coordinate incident response efforts and post-incident analyses.

You may be a good fit for our team if:

  • 5 to 8 years of experience in site reliability engineering, systems administration, or software engineering, with a significant focus on incident response and operational reliability.
  • 1+ years of experience managing, coordinating, and driving the resolution of major incidents.
  • Professional experience with incident management in cloud environments.
  • Enjoy working on a variety of scopes spanning software engineering, cloud infrastructure, and SRE.
  • Proven track record of managing and resolving incidents in cloud-based environments, with expertise in major public cloud platforms (AWS, GCP, Azure).
  • Understanding of fundamental network technologies such as DNS, load balancing, SSL, TCP/IP, and HTTP.
  • Strong understanding of monitoring and alerting systems, with the ability to develop metrics and alarms that accurately reflect system health and operational risks.
  • Experience with incident management tools and practices, including post-mortem analysis and root cause investigation.
  • Passion for consistently responding to and leading complex incidents in a 24x7x365 environment utilizing a globalized follow-the-sun model.
  • Customer-centric attitude with a focus on providing best-in-class incident response for customers and stakeholders.
  • Familiarity with HashiCorp’s product suite and infrastructure automation tools is a plus.
  • Demonstrate strong leadership during periods of significant business impact, remaining calm and professional in high-pressure situations.
  • A strong desire to drive customer success with partner teams and management on high-profile issues critical to the long-term success of the business.
  • Outstanding verbal and written communication skills, with the ability to convey information meaningfully to both engineers and executive-level management, during and outside of incidents.
  • Adaptable to a wide variety of technologies and capable of incident response and troubleshooting in complex, interconnected environments.

#LI-Hybrid