Responsibilities
Command Center Design & Implementation:
• Architect and implement a centralized command center that provides comprehensive visibility into both infrastructure and application layers
• Establish standardized operational procedures, runbooks, and escalation protocols for incident management
• Design and implement monitoring solutions that provide real-time insights into system health, performance metrics, and business KPIs
Operations Management:
• Lead the development of automated remediation solutions for common operational issues
• Implement and maintain SLOs/SLIs across critical services and applications
• Drive continuous improvement in incident response times and system reliability metrics
• Collaborate with development teams to ensure applications are designed with operational excellence in mind
Tool Development & Integration:
• Develop and maintain monitoring dashboards that provide actionable insights for both technical and non-technical stakeholders
• Implement and customize monitoring tools for infrastructure and application performance monitoring
• Create automation scripts and tools to streamline operational processes
• Integrate various monitoring and alerting systems to provide a unified view of system health
Leadership & Collaboration:
• Mentor junior engineers in SRE practices and command center operations
• Collaborate with security, development, and infrastructure teams to ensure comprehensive monitoring coverage
• Partner with business stakeholders to align monitoring strategies with business objectives
• Lead post-incident reviews and drive implementation of learned improvements
Preferred Qualifications:
• Experience in designing and implementing enterprise-scale command centers
• Knowledge of AIOps and machine learning for IT operations
• Certification in relevant cloud platforms or technologies is good to have
• Experience with chaos engineering and resilience testing
• Background in implementing ITIL practices across any of the IT services
Technical and Professional Requirements:
• Bachelor's degree in Computer Science, Engineering, or related field
• 5+ years of experience in Site Reliability Engineering or similar roles
• Strong experience with cloud platforms (AWS/Azure/GCP) and infrastructure-as-code
• Extensive knowledge of monitoring tools (e.g., Prometheus, Grafana, ELK Stack)
• Proficiency in at least one programming language (Python, Go, or Java preferred)
• Experience with containerization and orchestration (Docker, Kubernetes)
• Strong understanding of networking, system design, and distributed systems
Preferred Skills:
Foundational->Service Management->ITIL
Domain->Telecom->Operations Management
Technology->Cloud Security->AWS - Infrastructure Security->AWS Network Security Groups (NSG)
Technology->Cloud Security->GCP - GRC
Technology->Cloud Platform->Azure Networking Services->Azure Bastion
Additional Responsibilities:
• Excellent problem-solving and analytical abilities
• Strong communication skills and ability to work with cross-functional teams
• Experience in incident management and on-call rotations
• Proven track record of improving system reliability and performance
• Ability to handle high-pressure situations and make quick decisions
• Strong documentation and technical writing skills