Senior Site Reliability Engineer (4+)

okta | 171 days ago | Bengaluru

As a Senior Site Reliability Engineer at Spera, you will play a critical role in building, maintaining and scaling our platform. Your expertise and contributions will directly impact the success and effectiveness of our product. Specifically, your responsibilities will include:

Developing, operating, and maintaining critical infrastructure (EKS, ECS, Airflow, VPCs, Snowflake, MongoDB, etc.).
Developing and fully owning our infrastructure-as-code (IaC) framework.
Building, maintaining, and managing Docker images and repositories, including the CI/CD pipeline and deployment processes.
Integrating with third-party tools and other infrastructure in the Okta WIC environment (e.g. observability).
Evangelizing security best practices, leading initiatives to strengthen our security posture for critical infrastructure, and managing security & compliance requirements.
Developing and maintaining technical documentation, runbooks, and procedures.
Triaging and troubleshooting complex production issues to ensure reliability and performance.
Identifying and automating manual processes.
Promoting and applying best practices for building scalable and reliable services across engineering.
Supporting a 24x7 online environment as part of an on-call rotation: managing production incidents and determining how to prevent them in the future.

What you’ll bring to the role

4+ years of experience as a site reliability or platform engineer, preferably in a fast-scaling environment.
Proven hands-on experience with Docker and Kubernetes in production.
Experience with the deployment of production workloads on public cloud infrastructure (AWS and GCP).
Strong expertise in configuration management using infrastructure-as-code (IaC) tools such as Terraform and Helm.
Proficiency in ETL processes and the ability to run data pipelines efficiently and securely, including experience with orchestration tools such as Apache Airflow.
Experience in network engineering and security practices in AWS.
Experience managing CI/CD infrastructure, with strong proficiency in platforms such as GitHub Actions for streamlining deployment pipelines and software delivery.
Knowledge of observability tools such as Grafana, Prometheus, and Splunk, and experience implementing them.
Strong proficiency in Python for backend systems, with the ability to develop and maintain robust, scalable, and efficient software components that underpin the reliability and performance of the infrastructure.
Excellent problem-solving skills and a detail-oriented mindset.
Strong communication and collaboration abilities to work effectively within a team.
A passion for encouraging the development of engineering peers and leading by example.
