Senior Site Reliability Engineer (NM+)

sirion | 198 days ago | Gurgaon

What You’ll Do:

System Monitoring and Incident Management: Monitor the health and performance of critical systems, applications, and services. Respond to incidents, troubleshoot issues, and ensure timely resolution to minimize downtime and service disruptions.
Automation and Scripting: Develop and maintain automation scripts and tools to streamline operational tasks, deployment processes, and infrastructure management.
Infrastructure Management: Manage and scale the underlying infrastructure, including servers, cloud services, and network components. Implement best practices for configuration management, monitoring, and disaster recovery.
Release Management: Collaborate with development teams to ensure smooth and reliable software releases. Participate in the design and implementation of deployment strategies.
Performance Optimization: Identify performance bottlenecks and optimize the system to improve reliability and response times.
Capacity Planning: Analyze system capacity and plan for future growth to meet increasing demands.
Security and Compliance: Implement security best practices and ensure compliance with relevant industry standards and regulations.
Collaboration and Documentation: Work closely with cross-functional teams, including developers, product managers, and operations, to ensure efficient communication and knowledge sharing. Document processes, procedures, and troubleshooting guides.
On-Call Support: Participate in an on-call rotation to handle urgent issues and incidents outside regular business hours.

What You’ll Need:

Experience with Cloud Technologies: Proficiency in working with one or more cloud platforms like AWS, Google Cloud Platform, or Microsoft Azure.
Programming and Scripting Skills: Strong knowledge of at least one programming language (e.g., Python, Java) and experience with shell scripting.
System Administration: Hands-on experience with Linux/Unix systems; knowledge of administration and networking concepts is good to have.
Monitoring and Logging: Experience with monitoring tools such as Prometheus, Grafana, and Nagios, and with log management solutions like the ELK stack.
Infrastructure as Code (IaC): Knowledge of Infrastructure as Code tools like Terraform or CloudFormation.
Automation and Configuration Management: Experience with tools like Ansible, Chef, or Puppet for automating infrastructure management.
Version Control: Familiarity with version control systems like Git.
Problem-Solving Skills: Ability to analyze and troubleshoot complex technical issues and to work with other teams to help streamline processes.
Communication Skills: Strong verbal and written communication skills to collaborate effectively with team members and stakeholders.
KPI/Metrics: Understanding of key SRE metrics such as availability, SLA/SLO, MTTA, and MTTR.
Any hands-on individual with a BCA/MCA or B.Tech background.
