5 years of experience building and operating enterprise platforms in production in an agile development environment. 3 years of experience managing containerization and orchestration technologies, i.e. Kubernetes and Helm. Strong Linux system administration skills. Proficiency in CI/CD practices and tools such as GitLab, Jenkins, or similar. Familiarity with monitoring, logging, and alerting tools. Familiarity with code promotion processes and software lifecycles. Experience participating in and contributing to agile ceremonies such as sprint planning, refinement, and retrospectives. Strong programming skills in one or more languages (TypeScript, Python, Shell, etc.).
Preferred: Strong knowledge of build and test tools such as Maven, npm, Yarn, Gradle, SonarQube, linters, etc. Good Git skills (the ability to rebase, cherry-pick, etc. with ease). Knowledge of developer front ends such as Backstage or Port. Good experience debugging network communication issues.
Platform Reliability and Security: Ensure the platform meets its performance, availability, and reliability SLOs. Proactively identify and resolve performance bottlenecks and risks in production environments. Maintain and improve monitoring, logging, and alerting frameworks to detect and prevent incidents. Secure the platform using vendor and in-house best practices.
Incident Management: Act as the primary responder for critical incidents, ensuring rapid mitigation and resolution. Conduct post-incident reviews and implement corrective actions to prevent recurrence. Develop and maintain detailed runbooks and playbooks for operational excellence.
Automation and Efficiency: Build and maintain tools to automate routine tasks such as deployments, scaling, and failover. Contribute to CI/CD pipeline improvements for faster and more reliable software delivery. Write and maintain Infrastructure as Code (IaC) using tools like Pulumi or Terraform to provision and manage resources. Contribute to cost-optimisation efforts, tuning deployments to make the most efficient use of compute resources.
Collaboration and Mentorship: Collaborate with the SRE, CI/CD, Developer Experience, and Templates teams to improve the platform's reliability and usability. Mentor junior engineers by sharing knowledge and best practices in SRE and operational excellence. Partner with developers to help resolve application tuning, security, and networking issues. Provide code reviews for others.
Observability and Metrics: Implement and optimise observability tools like Dynatrace, Prometheus, or Grafana for deep visibility into system performance. Define key metrics and dashboards to track the health and reliability of platform components. Continuously analyse operational data to identify and prioritise areas for improvement. Build alerts for integration into PagerDuty and Dynatrace.
Supporting the Developers: Provide support to our developers, resolving build, test, and deployment issues. Troubleshoot network issues. Write documentation on the use and behaviour of the platform. Provide on-call support for PagerDuty incidents on the platform, usually one week in every eight or so.