Site Reliability Engineer (10+)
Honeywell | 198 days ago | Hyderabad

## Key Responsibilities

### Reliability and Performance Management

- Design, implement, and maintain highly available, scalable, and resilient cloud-native architectures for mission-critical SaaS products.
- Develop and implement SLOs, SLIs, and SLAs to measure and improve service reliability.
- Continuously optimize system performance and resource utilization across multiple cloud platforms.
- Fine-tune and optimize application performance by analyzing code, traces, and database queries.

### Incident Management and Troubleshooting

- Lead incident response efforts, effectively troubleshooting complex issues to minimize downtime and impact.
- Reduce Mean Time to Recover (MTTR) through proactive monitoring, automated alerting, and efficient problem-solving techniques.
- Conduct thorough Root Cause Analysis (RCA) for all major incidents and implement preventive measures.

### Observability and Monitoring

- Design and implement end-to-end observability solutions across our distributed systems.
- Develop and maintain comprehensive monitoring strategies using tools such as the ELK Stack, Prometheus, and Grafana.
- Create and optimize product status dashboards to provide real-time visibility into system health and performance.

### Automation and Infrastructure as Code (IaC)

- Implement Infrastructure as Code practices using tools like Terraform.
- Develop and maintain automated deployment pipelines and CI/CD workflows.
- Create self-healing systems and automate routine operational tasks to reduce manual intervention.

### Cloud-Agnostic Architecture

- Design and implement cloud-agnostic solutions that can operate efficiently across multiple cloud providers.
- Develop expertise in event-driven architectures and related technologies (e.g., Apache Kafka, Azure Event Hubs, Redis, MongoDB Atlas, Azure IoT Hub).
- Implement and manage containerized applications using Kubernetes across different cloud environments.

### Continuous Improvement

- Regularly review and refine operational practices to enhance efficiency and reliability.
- Stay updated with the latest industry trends and technologies in SRE, cloud computing, and DevOps.
- Contribute to the development of internal tools and frameworks to support SRE practices.

## Requirements
- Strong knowledge of cloud platforms, particularly Azure, and their associated services on private networks.
- Expertise in observability tools (ELK Stack, Dynatrace, Prometheus).
- Expertise in containerization technologies such as Docker and Kubernetes.
- Understanding of event-driven architecture and database technologies (MongoDB Atlas, Azure SQL, PostgreSQL).
- Proficiency in IaC and automation tools such as Terraform and GitHub Actions.
- Proficiency in one or more programming languages: Python, .NET, or Java.
- Strong understanding of networking concepts, load balancing, and security practices.

HTSIND2022

YOU MUST HAVE

  • Bachelor’s degree with 10+ years of experience.

WE VALUE

  • Understanding of various software development lifecycle models
  • Some relevant experience
  • Knowledge of software configuration management and change management practices
  • Diverse and global teaming and collaboration
  • Effective communicator
  • Self-motivated individuals who can work with little supervision and consistently take the initiative, acting before being asked by others or forced to by events
  • Ability to consistently make timely decisions even in the face of complexity, balancing systematic analysis with decisiveness
  • Ability to quickly analyze, incorporate, and apply new information and concepts