Drive continuous improvements to our usage of Kubernetes, our Operators, and the GitOps deployment paradigm.
Extend our networking, service mesh and Kubernetes systems to support connectivity between GCP, AWS and Azure.
Collaborate with Engineering teams to support services before they go live through activities such as system design consulting, developing software platforms and frameworks, monitoring/alerting, capacity planning, production readiness and service reviews.
Help define and instrument service level indicators and objectives (SLIs/SLOs) with service owners in the Engineering teams. Develop SLO-based on-call strategies for service owners and their teams.
Collaborate within our virtual Observability team: develop and improve observability (tracing, events, metrics, profiling, logging and exceptions) of the Dremio Cloud product.
Debug and optimize code written by others and automate routine tasks. You recognize complexity and know multiple techniques for managing it, but you also recognize the folly of complete rewrites.
Evangelize and advocate for resilience engineering and reliability practices across our organization.
Scale systems sustainably through automation and evolve systems by pushing for changes that improve reliability and velocity.
Join an on-call rotation for systems and services that the SRE team owns.
Practice sustainable incident response and post-incident investigation and analysis.
Drive the cultural, technical, and process changes needed to move toward a true continuous delivery model within the company.
What we’re looking for
3+ years of relevant experience in the following areas: SRE, DevOps, Distributed Systems, Cloud Operations, Software Engineering.
Familiarity with Kubernetes, Istio, Terraform, and ArgoCD/Flux.
Familiarity with software-defined networking infrastructure: dedicated and partner interconnects, VPNs, and BGP.
Excellent command of cloud services on GCP/AWS/Azure and of CI/CD pipelines.
Moderate to advanced experience in Python/Go, and at least a reading knowledge of Java.
You are interested in designing, analyzing and troubleshooting large-scale distributed systems.
You have a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership, drive, and determination.
You have a strong ability to debug and optimize code and to automate routine tasks.