Site Reliability Engineer (NM+)
spglobal | 72 days ago | Noida

1. Observability & Proactive System Health 

  • Design, build, and maintain a comprehensive observability platform using tools like Splunk and OpenTelemetry to provide deep insights into system health and performance. 

  • Leverage AIOps principles and platforms to enhance anomaly detection, automate event correlation, and enable predictive alerting, reducing mean time to detection (MTTD). 

  • Develop and manage robust alerting strategies and SLO-based dashboards to ensure critical issues are addressed before they impact customers. 

  • Foster a data-driven culture by giving engineering teams the visibility they need to understand the impact of their code in production.

2. Reliability & Resilience Engineering 

  • Design, implement, and conduct Chaos Engineering experiments to proactively identify and remediate system weaknesses, architectural flaws, and potential cascading failures. 

  • Partner with software engineering teams throughout the application lifecycle to architect for high availability, disaster recovery, and fault tolerance. 

  • Define, measure, and evangelize Service Level Indicators (SLIs) and Service Level Objectives (SLOs), and manage the associated error budgets to balance reliability with feature velocity. 

  • Lead blameless post-mortems for incidents, ensuring that root causes are addressed and preventive measures are implemented to avoid recurrence.

3. Performance & Efficiency Optimization 

  • Analyze performance metrics and distributed traces to identify and resolve latency bottlenecks across our infrastructure and applications. 

  • Implement cost optimization (FinOps) strategies by identifying and eliminating resource waste, optimizing cloud service usage, and promoting efficient architecture patterns. 

  • Work with development teams to conduct performance testing and ensure new features do not introduce performance regressions. 

4. Automation & Platform Engineering 

  • Identify and aggressively automate manual operational tasks (toil) by developing scripts, tools, and self-healing systems. 

  • Enhance and maintain our Infrastructure as Code (IaC) modules, promoting reusable patterns and best practices with Terraform. 

  • Improve and secure CI/CD pipelines (e.g., GitHub Actions, Azure DevOps) to enable safe, automated, and rapid deployment and rollback procedures. 

 

Requirements and Qualifications 

Core Technical Skills 

  • Experience: 4+ years in a Site Reliability, DevOps, or Cloud Engineering role, with demonstrable experience in a large-scale production environment. 

  • Cloud Proficiency: Deep experience with AWS services (EKS, ECS, EC2, S3, RDS, Lambda) and managing production workloads in the cloud. 

  • Observability: Proficient in application observability, monitoring, and logging. Hands-on experience with tools like Splunk, OpenTelemetry, Prometheus, Grafana, or Datadog is essential. 

  • Infrastructure as Code (IaC): Strong experience with Terraform for provisioning and managing cloud infrastructure. 

  • Containerization: Solid understanding of containerization technology, particularly with managed services like EKS or ECS.

  • CI/CD: Experience building and maintaining CI/CD pipelines using tools like GitHub Actions, Azure DevOps, or Jenkins. 

  • Scripting & Automation: Strong scripting skills in languages like Python, Bash, or PowerShell for automation and tooling. Familiarity with a higher-level language such as C# (.NET) is a plus. 

  • Modern Practices: Experience with, or a demonstrated understanding of, AIOps concepts and Chaos Engineering principles and tools (e.g., Gremlin, AWS Fault Injection Simulator).

Professional Attributes 

  • SRE Mindset: A true understanding of Site Reliability Engineering principles, including SLOs, error budgets, and the value of eliminating toil.
