Job Responsibilities
Team Leadership & Development
· Lead and mentor a team of SREs, providing guidance, coaching, and support to foster growth and career development.
· Build and grow a high-performing team focused on operational excellence, reliability, and scalability.
· Establish and maintain a strong team culture of collaboration, accountability, and continuous improvement.
· Work with cross-functional teams (Engineering, Product and Project Management) to align priorities and build effective working relationships.
Service Reliability & Performance
· Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) for critical systems.
· Monitor and improve the reliability, availability, and performance of all production services and infrastructure.
· Own and drive efforts to improve incident management, root cause analysis, and postmortem documentation.
· Implement proactive monitoring, alerting, and incident response strategies.
System Automation & Scalability
· Lead efforts to automate and streamline operational processes, reduce manual toil, and improve system reliability.
· Identify and implement best practices for system design, capacity planning, and cost optimization.
· Work closely with engineering teams to build scalable, resilient, and efficient systems that can handle increasing load.
Collaboration & Cross-functional Engagement
· Collaborate with Engineering & Product teams to ensure reliability is baked into the development process, including reviewing code, design, and deployment practices.
· Advocate for reliability improvements across the engineering and product teams, ensuring a balance between speed and reliability.
· Work with other engineering managers to align on long-term goals, technical debt, and infrastructure investments.
Process & Efficiency Improvement
· Drive continuous improvements in incident management, deployment pipelines, and system observability.
· Champion the adoption of tools and processes that improve automation, monitoring, alerting, and reporting.
· Measure and track key operational metrics, using data to inform decision-making and drive improvements.
Qualifications
· 8 years of experience building, scaling, and supporting highly available systems and services.
· 3-4 years of experience managing and leading technical teams, including mentoring engineers and fostering team development.
· Strong experience with enterprise-grade middleware, e.g., web servers (Apache) and load balancers (NetScaler), hosted on a virtual machine cluster.
· Strong expertise in configuration management tools such as Puppet.
· Experience with Infrastructure-as-Code, Linux, VMware, and API integration. Familiarity with Terraform desired.
· Proficiency in at least one scripting or programming language or automation tool (Ansible, Python, Go, Ruby, etc.).
· Expertise in the delivery, maintenance, and support of Linux systems and infrastructure.
· Experience with cloud platforms (AWS), containerization (Docker), and orchestration (Kubernetes).
· Familiarity with observability tools (e.g., Prometheus, Grafana, ELK stack, CloudWatch, Splunk).
· Experience implementing solutions using SRE and DevOps principles.
· Familiarity with telemetry and with current monitoring and visualization tools.
· Expertise in promoting and driving system visibility to aid in the rapid detection and resolution of issues.
· Bachelor's or master's degree in Computer Science, Engineering, or a related field.
· Experience in industries with high uptime requirements (e.g., financial services, healthcare, SaaS).