SRE Engineer (5+ years)

Infosys | 81 days ago | Bangalore

Responsibilities

Command Center Design & Implementation:
• Architect and implement a centralized command center that provides comprehensive visibility into both infrastructure and application layers
• Establish standardized operational procedures, runbooks, and escalation protocols for incident management
• Design and implement monitoring solutions that provide real-time insights into system health, performance metrics, and business KPIs

Operations Management:
• Lead the development of automated remediation solutions for common operational issues
• Implement and maintain SLOs/SLIs across critical services and applications
• Drive continuous improvement in incident response times and system reliability metrics
• Collaborate with development teams to ensure applications are designed with operational excellence in mind

Tool Development & Integration:
• Develop and maintain monitoring dashboards that provide actionable insights for both technical and non-technical stakeholders
• Implement and customize monitoring tools for infrastructure and application performance monitoring
• Create automation scripts and tools to streamline operational processes
• Integrate various monitoring and alerting systems to provide a unified view of system health

Leadership & Collaboration:
• Mentor junior engineers in SRE practices and command center operations
• Collaborate with security, development, and infrastructure teams to ensure comprehensive monitoring coverage
• Partner with business stakeholders to align monitoring strategies with business objectives
• Lead post-incident reviews and drive implementation of learned improvements

Preferred Qualifications:
• Experience in designing and implementing enterprise-scale command centers
• Knowledge of AIOps and machine learning for IT operations
• Certification in relevant cloud platforms or technologies is good to have
• Experience with chaos engineering and resilience testing
• Background in implementing ITIL practices across any of the IT services

Technical and Professional Requirements:

• Bachelor's degree in Computer Science, Engineering, or related field
• 5+ years of experience in Site Reliability Engineering or similar roles
• Strong experience with cloud platforms (AWS/Azure/GCP) and infrastructure-as-code
• Extensive knowledge of monitoring tools (e.g., Prometheus, Grafana, ELK Stack)
• Proficiency in at least one programming language (Python, Go, or Java preferred)
• Experience with containerization and orchestration (Docker, Kubernetes)
• Strong understanding of networking, system design, and distributed systems

Preferred Skills:

Foundational->Service Management->ITIL

Domain->Telecom->Operations Management

Technology->Cloud Security->AWS - Infrastructure Security->AWS Network Security Groups (NSG)

Technology->Cloud Security->GCP - GRC

Technology->Cloud Platform->Azure Networking Services-> Azure Bastion

Additional Responsibilities:

• Excellent problem-solving and analytical abilities
• Strong communication skills and ability to work with cross-functional teams
• Experience in incident management and on-call rotations
• Proven track record of improving system reliability and performance
• Ability to handle high-pressure situations and make quick decisions
• Strong documentation and technical writing skills
