Site Reliability Engineer (10+ years)

Gemraj Technologies Ltd | 184 days ago | India

Must-have experience:

Candidates should have moved from DevOps to MLOps.
Candidates working on GenAI with strong MLOps experience would also be a fit, but must have prior DevOps experience.
Candidates with experience in ETL data pipelines and MLOps would also fit the role.
Strong Python knowledge is a must for this role, and the candidate should be an individual contributor.

About the role:

Turing is looking for people to join us in building ML platforms for our Fortune 500 customers. You will be a key member of the Turing GenAI delivery organization, leading a team of Turing engineers with different skill sets.

Required skills

10+ years of professional experience building applications using cloud services. Prior experience building machine learning platforms using cloud services.
Cloud expertise: Deep knowledge of cloud platforms like AWS, Google Cloud Platform, or Azure, including their machine learning and data services (Azure preferred).
DevOps skills: Experience with CI/CD pipelines, infrastructure as code, and containerization technologies like Docker and Kubernetes.
Machine learning knowledge: Understanding of ML workflows, model training, and deployment processes.
Data engineering: Familiarity with data pipelines, ETL processes, and data storage solutions.
Software engineering: Strong programming skills, particularly in languages commonly used in ML like Python.
System design: Ability to architect scalable, reliable systems that integrate various services.
Automation: Expertise in automating workflows and processes across the ML lifecycle.
Security and compliance: Knowledge of best practices for securing ML pipelines and ensuring regulatory compliance.
Monitoring and logging: Experience setting up monitoring and logging for ML systems.
Collaboration: Ability to work with data scientists, software engineers, and other stakeholders.

Roles & responsibilities

Evaluate and select appropriate cloud services for each stage of the ML lifecycle
Design and implement the overall architecture of the MLOps platform
Set up automated pipelines for data preparation, model training, and deployment
Implement version control for code, data, and models
Ensure the platform is scalable, secure, and compliant with relevant regulations
Provide tools and interfaces for data scientists to easily leverage the platform
Continuously optimize the platform for performance and cost-efficiency

This role is crucial in bridging the gap between data science and operations, enabling organizations to efficiently develop, deploy, and maintain machine learning models at scale.
