## Key Responsibilities
### Reliability and Performance Management
- Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
- Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
- Continuously optimize system performance and resource utilization across multiple cloud platforms.
- Fine-tune and optimize application performance by analyzing code, traces, and database queries.
### Incident Management and Troubleshooting
- Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
- Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
- Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.
### Observability and Monitoring
- Design and implement end-to-end observability solutions across our distributed systems.
- Develop and maintain comprehensive monitoring strategies using tools such as the ELK Stack, Prometheus, and Grafana.
- Create and optimize product status dashboards to provide real-time visibility into system health and performance.
### Automation and Infrastructure as Code (IaC)
- Implement Infrastructure as Code practices using tools like Terraform.
- Develop and maintain automated deployment pipelines and CI/CD workflows.
- Create self-healing systems and automate routine operational tasks to reduce manual intervention.
### Cloud-Agnostic Architecture
- Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
- Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka/Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).
- Implement and manage containerized applications using Kubernetes across different cloud environments.
### Continuous Improvement
- Regularly review and refine operational practices to enhance efficiency and reliability.
- Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
- Contribute to the development of internal tools and frameworks to support SRE practices.
## Requirements
- Strong knowledge of cloud platforms, particularly Azure, and their associated services on private networks.
- Expertise in observability tools (ELK Stack, Dynatrace, Prometheus).
- Expertise in containerization technologies such as Docker and Kubernetes.
- Understanding of event-driven architecture and database technologies (MongoDB Atlas, Azure SQL, PostgreSQL).
- Proficiency in IaC and CI/CD tools such as Terraform and GitHub Actions.
- Proficiency in one or more of Python, .NET, or Java.
- Strong understanding of networking concepts, load balancing, and security practices.