Site Reliability Engineering (NM+)

HSBC | 129 days ago | Hyderabad

In this role, you will:

Team Leadership: Lead and mentor a team of Site Reliability Engineers, fostering a culture of collaboration, continuous improvement, and operational excellence.
Reliability Strategy: Develop and implement strategies to improve the reliability and performance of applications and infrastructure, focusing on service level objectives (SLOs) and service level indicators (SLIs).
Incident Management: Oversee incident response processes, ensuring timely resolution of incidents and minimizing downtime. Conduct post-mortem analyses to identify root causes and implement preventive measures.
Automation and Tooling: Drive the automation of operational tasks and processes, leveraging tools and technologies to enhance efficiency and reduce manual intervention.
Monitoring and Alerting: Establish comprehensive monitoring and alerting systems to proactively identify and address performance issues, ensuring system health and availability.
Capacity Planning: Collaborate with engineering and product teams to forecast capacity needs and ensure that systems can scale effectively to meet demand.
Collaboration: Work closely with development teams to integrate reliability into the software development lifecycle, promoting best practices in coding, testing, and deployment.
Documentation: Maintain clear and comprehensive documentation of systems, processes, and procedures to facilitate knowledge sharing and compliance.
Security and Compliance: Ensure that all systems and processes adhere to security best practices and regulatory requirements, collaborating with security teams as needed.
Continuous Improvement: Identify opportunities for process improvements and lead initiatives to enhance the overall reliability and performance of systems and services.

Requirements

To be successful in this role, you should meet the following requirements:

Bachelor's degree in Computer Science, Information Technology, or a related field.
Proven experience in Site Reliability Engineering, DevOps, or a related field, with a strong understanding of system architecture and cloud technologies.
Experience with incident management and response, including post-mortem analysis and root cause identification.
Proficiency in automation and scripting languages (e.g., Python, Go, Bash) and experience with configuration management tools (e.g., Ansible, Puppet, Chef).
Strong knowledge of monitoring and observability tools (e.g., Prometheus, Grafana, ELK Stack) and cloud platforms (e.g., AWS, Google Cloud, Azure).
Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).
Excellent problem-solving skills and the ability to work under pressure in a fast-paced environment.
Strong communication skills, both verbal and written, with the ability to convey technical concepts to diverse audiences.
Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, CircleCI).
Relevant certifications (e.g., Google Professional Cloud DevOps Engineer, AWS Certified DevOps Engineer) are a plus.
