What You’ll Be Doing:
Monitor and manage day-to-day operations of data pipelines, ETL jobs, and cloud-native data platforms (e.g., AWS, Databricks, Redshift).
Own incident response and resolution, including root cause analysis and post-mortem reporting for data failures and performance issues.
Perform regular system health checks, capacity planning, and cost optimization across operational environments.
Maintain and enhance logging, alerting, and monitoring frameworks using tools such as CloudWatch, Datadog, and Prometheus.
Collaborate with development teams to operationalize new data workflows, including CI/CD deployment, scheduling, and support documentation.
Ensure data quality by executing validation checks, reconciliation processes, and business rule compliance (a minimal reconciliation sketch follows this list).
Work with vendors (if applicable) and internal teams to support migrations, upgrades, and production releases.
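To give a flavor of the validation work above, here is a minimal reconciliation sketch in Python. It is an illustration only: the staging/target table names, the 0.1% drift tolerance, and the psycopg2 connection string are all assumptions, not a description of this team’s actual checks.

```python
# Minimal sketch of a post-load row-count reconciliation check.
# Table names, tolerance, and connection details are illustrative assumptions.
import psycopg2

SOURCE_COUNT_SQL = "SELECT COUNT(*) FROM staging.orders"
TARGET_COUNT_SQL = "SELECT COUNT(*) FROM analytics.orders"
TOLERANCE = 0.001  # allow 0.1% drift between staging and target

def row_count(cur, sql: str) -> int:
    cur.execute(sql)
    return cur.fetchone()[0]

def reconcile(conn) -> None:
    with conn.cursor() as cur:
        src = row_count(cur, SOURCE_COUNT_SQL)
        tgt = row_count(cur, TARGET_COUNT_SQL)
    drift = abs(src - tgt) / max(src, 1)
    if drift > TOLERANCE:
        # In practice this failure would page on-call via the alerting stack.
        raise RuntimeError(f"Row-count drift {drift:.2%}: staging={src}, target={tgt}")

if __name__ == "__main__":
    with psycopg2.connect("dbname=analytics host=redshift.example.internal") as conn:
        reconcile(conn)
```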
How You Will Succeed:
Automation and Self-Service Focus
Identify repetitive operational tasks and implement automation using Python, Airflow, Jenkins, or similar tools (see the Airflow sketch below).
Enable self-service capabilities and alerting for platform users and stakeholders.
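To make the automation point concrete, the sketch below turns a manual daily health check into a scheduled Airflow task (assuming Airflow 2.4+ for the `schedule` argument). The DAG id, cron schedule, and service endpoints are hypothetical.

```python
# Minimal Airflow sketch: a manual daily health check as a scheduled DAG.
# DAG id, schedule, and the checked endpoints are hypothetical examples.
from datetime import datetime, timedelta

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

SERVICES = {  # assumed internal endpoints; replace with real ones
    "ingest-api": "http://ingest.example.internal/health",
    "warehouse-proxy": "http://proxy.example.internal/health",
}

def check_services() -> None:
    failures = []
    for name, url in SERVICES.items():
        try:
            resp = requests.get(url, timeout=5)
            resp.raise_for_status()
        except requests.RequestException as exc:
            failures.append(f"{name}: {exc}")
    if failures:
        # Failing the task lets Airflow's own on-failure alerting fire.
        raise RuntimeError("Health check failed -> " + "; ".join(failures))

with DAG(
    dag_id="daily_platform_health_check",
    start_date=datetime(2024, 1, 1),
    schedule="0 6 * * *",  # every day at 06:00 UTC
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    PythonOperator(task_id="check_services", python_callable=check_services)
```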
AI-Ready Operations Mindset
Explore and propose how AI can be used to detect anomalies, predict issues, and accelerate root cause analysis (a simple statistical baseline is sketched after this subsection).
Collaborate with internal teams to experiment with LLMs, bots, or ML models for improving operational efficiency.
Stay informed on emerging AIOps tools and work toward integrating them gradually.
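As one low-cost starting point for the ideas above, a plain statistical baseline often precedes LLMs or commercial AIOps suites: the sketch below flags pipeline runs whose duration drifts sharply from the recent norm using a rolling z-score. The data source, window size, and 3-sigma threshold are illustrative assumptions.

```python
# Minimal anomaly-detection sketch: flag pipeline runs whose duration
# deviates sharply from the recent baseline (rolling z-score).
# The demo data, window, and 3-sigma threshold are illustrative assumptions.
import pandas as pd

WINDOW = 30      # look-back window of prior runs
THRESHOLD = 3.0  # flag runs more than 3 standard deviations from the mean

def flag_anomalies(runtimes: pd.Series) -> pd.Series:
    """Return a boolean Series marking anomalous run durations."""
    # shift(1) keeps the current run out of its own baseline
    mean = runtimes.rolling(WINDOW, min_periods=10).mean().shift(1)
    std = runtimes.rolling(WINDOW, min_periods=10).std().shift(1)
    z = (runtimes - mean) / std
    return z.abs() > THRESHOLD

if __name__ == "__main__":
    # In practice these durations would come from scheduler metadata or metrics.
    durations = pd.Series([300 + i % 7 for i in range(40)] + [1500, 303], name="seconds")
    print(durations[flag_anomalies(durations)])  # flags the 1500s outlier
```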
Continuous Optimization
Monitor pipeline performance and costs, and implement changes that optimize compute, memory, and storage usage (see the cost-report sketch after this list).
Recommend and trial AI/ML-based approaches for pipeline tuning, scheduling, or resource allocation.
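As an example of the cost half of this loop, a nightly job along the lines of the sketch below (assuming boto3 credentials with AWS Cost Explorer access) can surface day-over-day spend jumps per service before they reach an invoice; the 20% threshold and $1 noise floor are arbitrary illustrations.

```python
# Minimal sketch: flag AWS services whose daily spend jumped day-over-day.
# Assumes boto3 credentials with Cost Explorer (ce) access; the 20% jump
# threshold and $1 noise floor are arbitrary illustrations.
from datetime import date, timedelta

import boto3

THRESHOLD = 1.20  # flag services whose spend grew >20% day-over-day

def daily_costs_by_service(days_back: int = 3) -> list:
    ce = boto3.client("ce")
    end = date.today()
    start = end - timedelta(days=days_back)
    resp = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
    )
    return resp["ResultsByTime"]

def spend_jumps(results: list) -> list:
    alerts = []
    for prev, curr in zip(results, results[1:]):
        prev_spend = {g["Keys"][0]: float(g["Metrics"]["UnblendedCost"]["Amount"])
                      for g in prev["Groups"]}
        for g in curr["Groups"]:
            svc = g["Keys"][0]
            amount = float(g["Metrics"]["UnblendedCost"]["Amount"])
            baseline = prev_spend.get(svc, 0.0)
            if baseline > 1.0 and amount > baseline * THRESHOLD:
                alerts.append(f"{svc}: ${baseline:.2f} -> ${amount:.2f}")
    return alerts

if __name__ == "__main__":
    for line in spend_jumps(daily_costs_by_service()):
        print(line)
```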