Senior Site Reliability Engineer (5+ years)
Veeam | 144 days ago | Bengaluru

Reliability Engineering & Resilience

  • Design and evolve infrastructure to be highly available, fault tolerant, and scalable across public clouds (initially Azure, with plans to expand to other providers).

  • Establish and maintain SLIs, SLOs, and error budgets that define and enforce reliability objectives.

  • Lead incident response, analysis, blameless postmortems, and sharing sessions to maximize learning across the entire engineering team and drive changes to the whole socio-technical engineering system.

Observability & Operational Excellence

  • Drive adoption of deep observability practices, ensuring that telemetry (logs, metrics, and traces) is comprehensive and actionable.

  • Develop automation and self-healing tools to reduce toil and support Veeam’s fleet management strategy.

  • Participate in on-call rotations and lead operational excellence across the stack.

Engineering at Scale

  • Contribute to infrastructure as code (IaC), CI/CD systems, deployment automation, and scalable config management.

  • Integrate and extend monitoring and chaos engineering tools to validate reliability assumptions under load and failure conditions.

  • Implement testing strategies, canary deployments, and release validation pipelines to protect production environments and allow teams to safely deliver new features as quickly as possible.

Collaboration & Culture

  • Embed within product and platform teams to champion reliability from design through delivery.

  • Contribute to a learning culture focused on continuous improvement and proactive risk management.

  • Mentor engineers and advocate for DevOps/SRE best practices across global teams.

What we expect from you:

  • 5+ years of hands-on experience in a Software Engineering role, with at least 2 years in Site Reliability, Platform Engineering, or similar.

  • Deep experience building systems on public cloud providers (Azure preferred).

  • Strong programming skills in JavaScript/Node.js, TypeScript, Go, Java, C#, or similar.

  • Proven track record in delivering monitoring, alerting, and observability tooling (e.g., Prometheus, Grafana, OpenTelemetry).

  • Experience with IaC tools like Terraform/Pulumi, and container orchestration (e.g., Kubernetes).

  • Solid understanding of distributed systems, cloud networking, and cloud-native system design.

  • Excellent communication and collaboration skills across geographies and disciplines.

What will be an added advantage:

  • Experience working on large-scale B2B SaaS platforms.

  • Background in chaos engineering, resilience testing, performance testing, load testing, or incident learning programs.

  • Familiarity with compliance frameworks (e.g., ISO, SOC 2, GDPR, FedRAMP/CMMC).
