As a Senior Site Reliability Engineer at Spera, you will play a critical role in building, maintaining and scaling our platform. Your expertise and contributions will directly impact the success and effectiveness of our product. Specifically, your responsibilities will include:
- Developing, operating, and maintaining critical infrastructure (EKS, ECS, Airflow, VPCs, Snowflake, MongoDB, etc.).
- Developing and taking full ownership of our IaC (infrastructure-as-code) framework.
- Building, maintaining, and managing Docker images and repositories, including the CI/CD pipeline and deployment processes
- Integrating with third-party tools and other infrastructure in the Okta WIC environment (e.g. observability)
- Evangelizing security best practices, leading initiatives to strengthen our security posture for critical infrastructure, and managing security & compliance requirements.
- Developing and maintaining technical documentation, runbooks, and procedures
- Triaging and troubleshooting complex production issues to ensure reliability and performance
- Identifying and automating manual processes
- Promoting and applying best practices for building scalable and reliable services across engineering
- Supporting a 24x7 online environment as part of an on-call rotation, managing production incidents and determining how to prevent them from recurring
What you’ll bring to the role
- 4+ years of experience as a site reliability or platform engineer, preferably in a fast-scaling environment.
- Proven hands-on experience with Docker and Kubernetes in production
- Experience with the deployment of production workloads on public cloud infrastructure (AWS and GCP)
- Strong expertise in configuration management using IaC (infrastructure-as-code) tools such as Terraform and Helm
- Proficiency in ETL processes and in running data pipelines efficiently and securely, including experience with orchestration tools like Apache Airflow.
- Experience in network engineering and security practices in AWS.
- Experience managing CI/CD infrastructure, with strong proficiency in platforms like GitHub Actions for streamlining deployment pipelines and software delivery.
- Knowledge of observability tools such as Grafana, Prometheus, and Splunk, as well as their implementation
- Strong proficiency in Python for backend systems, with the ability to develop and maintain robust, scalable, and efficient software components that underpin the reliability and performance of the infrastructure.
- Excellent problem-solving skills and a detail-oriented mindset.
- Strong communication and collaboration abilities to work effectively within a team.
- Passion for encouraging the development of engineering peers and leading by example.