Your Role and Responsibilities
In this Site Reliability Engineer role, you will work closely with several Data Centers, the entire Cloud organization, and IBM vendors to support, maintain, and continuously improve the IBM Cloud infrastructure. You will focus on the following key responsibilities:
- Design & implement automation/infrastructure solutions for IBM Cloud products and services.
- Partner with other SRE teams and dev leaders to deliver mission-critical services to IBM Cloud
- Build new tools to improve automated resolution of production issues
- Monitor and respond promptly to production alerts, and execute changes in production through automation
- Support the compliance and security integrity of the environment
- Continually improve systems and processes regarding automation and monitoring
- Work with Support and Development teams to identify root causes and resolve issues
- Discuss and plan continuous improvement in the stability of the production environment
- Guide other Infrastructure Operations teams and provide them with technical escalation support
Required Technical and Professional Expertise
- Excellent written and verbal communication skills.
- Overall 10+ years of experience in Public Cloud infrastructure
- At least 6 years of experience handling large production systems in a cloud environment
- Strong skills in Linux, scripting, and debugging complex issues in collaboration with other teams
- Ability to drive complex customer situations to resolution
- 7+ years of experience in virtualization technologies and automation / configuration management
- Automation and configuration management tools/solutions: Ansible, Python, Bash, Terraform, Golang, etc. (any two)
- Virtualization technologies: Citrix Xen Hypervisor (preferred), KVM (also preferred), libvirt, VMware vSphere, etc. (at least one)
- Monitoring technologies: Zabbix (preferred), Sysdig, Grafana, Nagios, Splunk, etc. (at least one)
- Strong skills in container technologies: Kubernetes, Docker, etc.
- Work with Engineering to:
  - Provide an initial assessment of, and possible workarounds for, production issues
  - Troubleshoot and resolve production issues
- Working knowledge of ServiceNow, JIRA, and GitHub
Preferred Technical and Professional Expertise
- Knowledge of compute, storage & networking systems in a public cloud environment
Official notification