Responsibilities
Reliability and Performance
Ensure the high availability, reliability, and performance of production systems and services
Implement and maintain disaster recovery plans and procedures
Monitor and manage system health using metrics, logs, and tracing to proactively identify and resolve issues
Automation and Infrastructure
Automate repetitive tasks, including deployment, scaling, monitoring, and remediation of systems
Build and maintain infrastructure as code (IaC) using tools like Terraform, CloudFormation, or similar
Incident Management
Participate in incident response and troubleshooting to minimize downtime and resolve issues quickly
Conduct root cause analysis for system failures and implement preventive measures to avoid recurrence
Maintain incident response playbooks and ensure efficient on-call rotations
Observability and Monitoring
Design and implement monitoring solutions using tools like Prometheus, Grafana, Datadog, or similar
Define and track SLIs, SLOs, and SLAs to measure and improve system performance
Collaboration
Work closely with development, QA, and operations teams to ensure smooth delivery of applications
Act as a bridge between software engineering and operations, advocating for DevOps best practices
Document system configurations, processes, and procedures to ensure knowledge sharing and maintain system integrity
Capacity and Scalability
Conduct capacity planning and optimize system scalability to meet future demands
Implement strategies for horizontal and vertical scaling of applications
Security and Compliance
Ensure infrastructure security by implementing best practices and addressing vulnerabilities
Collaborate with the security team to meet compliance standards and audits
Data Engineering and Automation
Design, develop, and maintain scalable and efficient data pipelines
Automate data workflows for ETL/ELT processes, integrating data from various sources into data warehouses and other storage solutions
Develop and maintain solutions for data transformation and data modelling, and automate the orchestration of data processing
Data Warehouse Management
Design, implement, and maintain modern data warehouse architectures, ensuring effective data storage, retrieval, and accessibility
Work with cloud-based data warehouses (e.g., BigQuery, Snowflake, Redshift) and optimize data models for analytics and reporting
Develop and manage dimensional models, star/snowflake schemas, and data marts for operational and analytical use cases
Real-time and Batch Data Processing
Build and manage real-time and batch data pipelines for high-volume data ingestion, processing, and analytics
Leverage technologies such as Apache Kafka, Apache Beam, Apache Spark, and Google Cloud Dataflow for streaming and batch processing
Qualifications
Experience
8+ years of experience in a data platform role, including Site Reliability Engineering, DevOps, or Systems Engineering
Technical Skills
Strong programming skills in languages such as Python, Java, or similar
Experience developing data ingestion pipelines, data governance, data quality, and automation
Proficiency in cloud platforms such as Google Cloud (mandatory), AWS, or Azure
Experience in leveraging AI/ML models to enhance efficiency in da