Site Reliability Lead (8+)

Veeva | 88 days ago | Hyderabad

What You’ll Do

Lead and mentor a team of engineers, and provide onsite leadership
Rapidly build new applications on an existing, robust enterprise platform
Build new cloud infrastructure from scratch following the best practices in software development
Drive new features and improvements in a fast-changing environment
Partner with product management, design, and QA to deliver cutting-edge solutions and direct value to our customers
Work on multiple layers of our stack, including backend (primary), frontend, and infrastructure
Build tools and automation that eliminate work and reduce the time it takes to resolve an issue
You want to make the system better every day and are self-driven to learn all that is necessary to provide full-stack diagnostics and determine the root cause of problems
Ensure our platform meets the scalability and reliability needs of our customers
During an incident, lead the effort to triage and mitigate. You might need to perform periodic on-call duty if issues are escalated
Strategize with engineering teams on complex problems. You know how to support a system used by 3M users and can advise dev teams, before a feature ships, on what will work in production
Participate in engineering design reviews of new features. Drive focused initiatives that improve operational efficiency and scalability of the platform
Communicate effectively with engineering teams, and describe problems succinctly but with enough detail to hand off an ongoing problem to another team or a peer for completion. Engage in real-time communication during outages with both technical and non-technical audiences

Requirements

8+ years of experience in Java, preferably at an enterprise cloud software company
Proven ability to write clean, testable, readable code in a team environment
Hands-on experience with open-source technologies, such as Spring, MySQL, Hibernate, Solr, Maven, Git, Tomcat, Linux, AWS, Vagrant, Docker, Kubernetes
3+ years of experience in relational databases with a mastery of SQL
Demonstrated history of incident management and leadership ability
Experience in handling production outages and root-cause analysis
Hands-on operational experience in a high-volume or critical production service environment
Effective communication skills across all levels, whether talking to individual contributors or executives
Solid scripting skills; experience with Shell, Bash, Ansible, Python, Go, Ruby, etc.
Ability to handle periodic on-call duty
Fluent in English, both written and verbal
We are looking for strong mentors with a proven record of making their teams better
