Senior Site Reliability/DevOps Engineer (7+ years)

Equifax | 164 days ago | Trivandrum

What you’ll do

Architecture and Design: Participate in the design and architecture of highly scalable, resilient, and secure systems on Kubernetes. Contribute to the definition of SRE principles and best practices.
Automation: Develop and maintain automation frameworks for infrastructure provisioning, deployment, monitoring, and incident response using tools like Terraform, Ansible, Puppet, Chef, or similar.
Monitoring and Alerting: Design and implement comprehensive monitoring and alerting systems to proactively identify and resolve issues. Develop and maintain dashboards to track key performance indicators (KPIs).
Incident Management: Lead incident response efforts and conduct thorough post-incident reviews to identify root causes and implement preventative measures.
Capacity Planning: Proactively identify and address capacity constraints to ensure optimal system performance and availability.
Collaboration: Work closely with engineering, product, and security teams to align on system requirements and priorities.
Mentorship: Mentor and guide junior SRE/DevOps engineers, fostering a culture of continuous learning and improvement.
On-call Rotation: Participate in a rotating on-call schedule to provide 24/7 support for critical systems.
Security: Contribute to the security posture of our systems by implementing security best practices and participating in security audits and reviews.
Performance Optimization: Identify and resolve performance bottlenecks, optimizing system performance and resource utilization.

What experience you need

7+ years of experience as an SRE, DevOps Engineer, or in a similar role.
Deep understanding of cloud platforms such as GCP (AWS and Azure are a plus).
Extensive experience with containerization technologies like Docker and Kubernetes.
Proven experience with infrastructure-as-code and configuration management tools (e.g., Terraform, Ansible, Puppet, Chef).
Strong scripting skills (e.g., Python, Go, Bash or other shell scripting).
Experience with monitoring and logging tools (e.g., Datadog, Prometheus, Grafana, ELK stack).
Experience with CI/CD pipelines and tools (e.g., Jenkins, GitLab CI, CircleCI).
Experience with incident management and post-incident reviews.
Excellent problem-solving and troubleshooting skills.
Strong communication and collaboration skills.
Bachelor's degree in Computer Science or a related field; equivalent experience considered.
