Senior Site Reliability Engineer (NM+)
sirion | 4 days ago | Gurgaon

What You’ll Do

  • System Monitoring and Incident Management: Monitor the health and performance of critical systems, applications, and services. Respond to incidents, troubleshoot issues, and ensure timely resolution to minimize downtime and service disruptions. 
  • Automation and Scripting: Develop and maintain automation scripts and tools to streamline operational tasks, deployment processes, and infrastructure management. 
  • Infrastructure Management: Manage and scale the underlying infrastructure, including servers, cloud services, and network components. Implement best practices for configuration management, monitoring, and disaster recovery. 
  • Release Management: Collaborate with development teams to ensure smooth and reliable software releases. Participate in the design and implementation of deployment strategies. 
  • Performance Optimization: Identify performance bottlenecks and optimize the system to improve reliability and response times. 
  • Capacity Planning: Analyze system capacity and plan for future growth to meet increasing demands. 
  • Security and Compliance: Implement security best practices and ensure compliance with relevant industry standards and regulations. 
  • Collaboration and Documentation: Work closely with cross-functional teams, including developers, product managers, and operations, to ensure efficient communication and knowledge sharing. Document processes, procedures, and troubleshooting guides. 
  • On-Call Support: Participate in an on-call rotation to handle urgent issues and incidents outside regular business hours. 

What You’ll Need

 

  • Experience with Cloud Technologies: Proficiency in working with one or more cloud platforms like AWS, Google Cloud Platform, or Microsoft Azure. 
  • Programming and Scripting Skills: Strong knowledge of at least one programming language (e.g., Python or Java) and experience with shell scripting.
  • System Administration: Hands-on experience with Linux/Unix systems; knowledge of system administration and networking concepts is good to have.
  • Monitoring and Logging: Experience with monitoring tools such as Prometheus, Grafana, or Nagios, and with log management solutions like the ELK stack.
  • Infrastructure as Code (IaC): Knowledge of Infrastructure as Code tools like Terraform or CloudFormation. 
  • Automation and Configuration Management: Experience with tools like Ansible, Chef, or Puppet for automating infrastructure management. 
  • Version Control: Familiarity with version control systems like Git. 
  • Problem-Solving Skills: Ability to analyze and troubleshoot complex technical issues and to work with other teams to help streamline processes.
  • Communication Skills: Strong verbal and written communication skills to collaborate effectively with team members and stakeholders. 
  • KPIs/Metrics: Understanding of key SRE metrics such as availability, SLA/SLO, MTTA, and MTTR.
  • Any hands-on individual with a BCA/MCA or B.Tech background.