AI/ML Ops (5+)
GlobalLogic | 133 days ago | Hyderabad

Requirements

Apply machine learning algorithms to existing operational data (logs, metrics, events) to predict system failures and proactively address potential incidents.
Implement automation for routine DevOps practices including automated scaling, resource optimization, and controlled restarts.
Develop and maintain self-healing systems to reduce manual intervention and enhance system reliability.
Build anomaly detection models to quickly identify and address unusual operational patterns.
Collaborate closely with SREs, developers, and infrastructure teams to continuously enhance the operational stability and performance of the system.
Deliver insights and recommend improvements through visualizations and reports built on AI-driven analytics.
Create a phased roadmap to incrementally enhance operational capabilities and align with strategic business goals.

Required Skills and Qualifications:
Strong experience with AI/ML frameworks and tools (e.g., TensorFlow, PyTorch, scikit-learn).
Proficiency in data processing and analytics tools (e.g., Splunk, Prometheus, Grafana, ELK stack).
Solid background in scripting and automation (Python, Bash, Ansible, etc.).
Experience with cloud environments and infrastructure automation.
Proven track record in implementing proactive monitoring, anomaly detection, and self-healing techniques.
Excellent analytical, problem-solving, and strategic planning skills.
Strong communication skills and the ability to collaborate effectively across teams.

Preferred Experience:
Background in DevOps/Site Reliability Engineering.
Familiarity with containerization and orchestration platforms (Kubernetes, Docker).
Experience in building scalable, distributed systems.
