SRE Engineer (5+)
mccain | 209 days ago | New Delhi

JOB RESPONSIBILITIES:

 

  • Work with stakeholders such as product owners and Engineering to define service level objectives (SLOs) for system operations.
  • Track performance against SLOs in partnership with monitoring teams or other stakeholders, and ensure systems continue to meet SLOs over time.
  • Create dashboards and reports to communicate key metrics.
  • Create software to improve performance, scalability, and stability of systems.
  • Collaborate with development teams to promote the concept of reliability engineering during all phases of the software development lifecycle to detect and correct performance issues and meet availability goals.
  • Design, code, test, and deliver infrastructure software to automate manual operational work (i.e., “toil”).
  • Participate in operational support and on-call rotation shifts for supported systems and products.
  • Conduct blameless post mortems to troubleshoot priority incidents.
  • Perform analytics on previous incidents to understand root causes and better predict and prevent future issues.
  • Use automation to reduce the probability and/or impact of problem recurrence.
  • Identify, evaluate, and recommend monitoring tools and diagnostic techniques to improve system observability.
  • Participate in system design consulting, platform management, capacity planning and launch reviews.
  • Collaborate and share lessons learned regarding performance and reliability issues with all stakeholders including developers, other SREs, operations teams, and project management teams.
  • Participate in communities of practice to share knowledge and foster continuous improvement.
  • Remain current on site reliability engineering methods and trends such as observability-driven development and chaos engineering.
  • Drive continuous improvement in software quality and infrastructure reliability and resilience.
  • Oversee, design, implement, and manage DevOps capabilities using continuous integration/continuous delivery toolsets and automation.
  • The SRE engineer will focus on Application Performance Monitoring (APM), including design, solutioning, proofs of concept (POC), and profiling and tuning of application compute and data nodes and resources. Some key duties of this role are:
  • Assist in defining SRE and observability architecture and design
  • Analyze and implement new features of the SRE and observability platform
  • Full stack monitoring across all layers (Infrastructure/Network/Database/Application/Services/Third Party)
  • Provide hands-on technical leadership in the selection and implementation of open-source and commercial monitoring tools.
  • Implement an SRE-driven incident workflow: automated incident detection -> automated engagement -> triage/mitigation -> RCA/postmortems -> problem task remediation.
  • AI-driven correlation, de-duplication, noise reduction, and auto-remediation
  • Provide weekly monitoring and alert analysis and continuous improvement
  • Create a model of the run-time environment (discovery)
  • Profile the performance and behavior of user-defined transactions
  • Establish performance metrics for each technical component of the applications/systems (web server, app server, database, etc.)
  • Application performance management database
  • APM tool Administration and Support
  • Monitoring Tool design and implementation
  • APM Setup/Usage policies and guidelines
  • Capacity Planning and monitoring
  • Monitor selected application performance
  • Report vital statistics of application performance in production
  • Make recommendations for improvements with Service Desk
  • Make recommendations for adjustments to runtime resources to improve overall performance profile


 

KEY QUALIFICATION & EXPERIENCES:

  • Strong problem solving and analytical skills.
  • Strong interpersonal and written and verbal communication skills.
  • Highly adaptable to changing circumstances. Interest in continuously learning new skills and technologies.
  • Experience with programming and scripting languages (e.g. Java, C#, C++, Python, Bash, PowerShell).
  • Experience with incident and response management.
  • Experience with Agile and DevOps development methodologies.
  • Experience with container technologies and supporting tools (e.g. Docker Swarm, Podman, Kubernetes, Mesos).
  • Experience working in cloud ecosystems (Microsoft Azure, AWS, Google Cloud Platform).
  • Experience with monitoring and observability tools (e.g. Splunk, CloudWatch, AppDynamics, New Relic, ELK, Prometheus, OpenTeleme