Contribute to the design and development of comprehensive observability and monitoring strategies for infrastructure and complex engineering systems and applications.
Build and manage monitoring tools and platforms such as Prometheus, Grafana, Azure Monitor, AWS CloudWatch, Dynatrace, Datadog, and similar tools that form our AIOps stack.
Develop and maintain dashboards, alerts, and reports to provide real-time insights into system performance and health.
Collaborate with cross-functional teams to identify and resolve performance bottlenecks and reliability issues.
Automate monitoring and alerting processes to improve efficiency and reduce manual intervention.
Conduct root cause analysis of incidents and implement preventive measures to avoid recurrence.
Mentor and guide junior engineers in standard methodologies for observability and monitoring.
Stay up to date with the latest industry trends and technologies to continuously improve our monitoring capabilities.
Required Skills and Experience:
Bachelor’s degree in Computer Science, Engineering, or a related field, with demonstrated experience in observability and monitoring roles.
Proficiency in monitoring tools and platforms such as Prometheus, Grafana, AWS CloudWatch, Azure Monitor, Datadog, and Dynatrace.
Strong understanding of cloud environments (AWS, Azure, GCP), containers (Docker), and container orchestration (Kubernetes).
Experience with scripting and automation using languages such as Python, Bash, or similar.
Excellent problem-solving skills and attention to detail.
Strong communication and teamwork skills.
Ability to work in a fast-paced, multifaceted environment.