Staff Site Reliability Engineer (7+)
gevernova | 127 days ago | Bengaluru

Roles & Responsibilities:

 

In this role, you will:

  • Own, manage, and adapt effective monitoring and alerting systems for GEHC.

  • Develop and manage a single pane of glass that provides a unified view of GEHC ecosystem monitoring, including the top critical business applications, sites, and critical network devices.

  • Own, develop, and manage a world-class monitoring data platform that ingests all monitoring telemetry data across applications and infrastructure within GEHC and integrates with the AIOps platform.

  • Develop and product-manage automated solutions / SaaS products that maintain and optimize the availability and performance of critical business processes and services, and that address potential problems in the infrastructure and application ecosystem before they result in a service interruption.

  • Ensure top critical business applications and their ecosystems are effectively monitored, with appropriate alerting mechanisms integrated with event management systems for an effective “single pane of glass”.

  • Deliver self-service tools that rely on the monitoring platform / SRE, for example log and statistics visualization, monitoring dashboards, etc.

  • Collaborate closely with product teams, both internal GE product teams and monitoring/AIOps tool vendors, to ensure that the designed solution meets non-functional requirements such as availability, performance, security, and maintainability. Contribute to SLI, SLO, and SLA definition, monitoring, alerting, and reporting efforts.

  • Partner with and support other operations teams in investigating the root cause of major P1 and escalated P2 incidents through a monitoring lens.

  • Establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.

  • Continuously identify patterns that point to larger underlying problems, and solve them to avoid repeat issues.

  • Stay abreast of the latest trends in application and infrastructure monitoring, provisioning, maintenance, and uptime. Learn, prototype, and apply the newest tools and best practices in real-world settings to meet the goals of the SRE practice.

 

 

Education Qualification

 

  • Bachelor's Degree in Computer Science or “STEM” Majors (Science, Technology, Engineering and Math) with advanced experience.

  • 7+ years of relevant experience in the IT Operations / Site Reliability Engineering domain, with demonstrable expertise in architecting, designing, and implementing solutions for availability and/or performance.

  • Comprehensive understanding of application performance monitoring and cloud technologies, and the ability to design and implement Dynatrace solutions in complex enterprise environments.

  • Solid expertise in designing and implementing Dynatrace / Dynatrace extensions, or in managing an APM / observability solution.

  • Proficient in Dynatrace features and architecture design, with installation, fine-tuning, and implementation experience across various environments (Production, Test, Development, and Disaster Recovery).

  • Expertise in Dynatrace platform configuration, including host grouping, auto tagging, naming rules, management zones, RUM (Real User Monitoring), Synthetics, session properties, request attributes, user tags, log monitoring, alerting profiles, problem notifications, threshold tuning, and setting up integrations with other monitoring tools and ServiceNow.

  • Experience in implementing and configuring Dynatrace tools, setting up synthetic and transaction monitoring, and ensuring comprehensive infrastructure and application monitoring.

  • Create custom extensions in Dynatrace using shell, Python, and batch scripts based on REST APIs and logs.

  • Set up Dynatrace extension configurations, dashboards (including business dashboards), infrastructure, analytics, observability logs, and metrics data collection, and interpret the resulting data.

  • Proficiency in Dynatrace Query Language (DQL), creating custom dashboards as required.

  • Establish and foster visible architectural principles and practices to build reusable designs and systems that promote reliability, velocity, scale, security, and efficiency

  • Understand and improve applications, and plan for faster MTTD and MTTR and for auto-healing.

  • Understand reliability metrics and enhance automation solutions for auto-healing and incident resolution

  • Experience
