Site Reliability Engineer (3+ years)
PubMatic | 100 days ago | Pune

Responsibilities:

  1. Operational Support  
    • Be a primary point of contact for operational support of multiple large-scale distributed software applications in the Ad Server environment.  
    • Monitor application availability, promptly detect anomalies, analyze their impact, debug production problems, and follow up on resolution by working closely with the engineering team.  
    • Maintain services once they are live by measuring and monitoring availability, latency, and overall system health.  
    • Diligently work with the engineering team to expedite the resolution of incidents and ensure a swift return to normal operations.  
    • Be innovative in building dashboards, adding metrics, writing automation scripts to reduce operational toil, and streamlining processes to enhance system reliability and stability.  
    • Design and construct software and systems to effectively manage the Ad Serving platform, its underlying infrastructure, and applications. 
  2. On Call Availability and Support  
    • Work in shifts to provide continuous on-call support for production systems, resolving issues independently using predefined handbooks.  
    • Show a sense of urgency for high-priority issues and arrange war rooms to resolve the problems.  
    • Provide timely updates on high-priority issues and perform handovers when a problem requires 24x7 attention.  
    • Conduct post-incident reviews to identify root causes, recommend preventive measures, and contribute to a culture of learning and improvement. 

Requirements:

  • Bachelor's degree in computer science or related disciplines  
  • 3+ years of total experience in software development  
  • Ability to program in languages such as C or C++ and in scripting languages such as Shell or Python  
  • Prior experience in technical engineering is good to have  
  • A proactive approach to identify the problems, performance bottlenecks, and areas of improvement  
  • Must know networking, database (MySQL), and Linux system concepts, as well as debugging and analyzing core dumps  
  • Hands-on experience with monitoring and observability tools like Grafana, Nagios, Influx, ELK, etc.  
  • Familiarity with tools like Docker and Grafana, and with incident management systems like Zenduty  
  • Excellent communication and collaboration skills, with the ability to work effectively across teams.  
  • A self-motivated, positive mindset when examining incidents 