# Data & ML Ops

> Salla · Madinah, Saudi Arabia (Hybrid) · Full-time · Posted 2025-10-08

**Workplace:** hybrid

**Department:** Technology

## Description

We are looking for a Senior Site Reliability Engineer (SRE) to help design, scale, and secure our rapidly growing platform infrastructure.  
You will work across all critical systems — from customer-facing applications and APIs to internal platforms and data services — ensuring availability, performance, and cost efficiency at scale.

You’ll be hands-on with Kubernetes, observability, GitOps, automation, and cloud infrastructure, while partnering closely with application, platform, and data teams to deliver a highly reliable and self-healing environment.

This role is ideal for an engineer who thrives on complex distributed systems, loves to automate everything, and can balance speed, stability, and cost-efficiency in production.

-   Bachelor’s degree in Computer Science, Engineering, or a related field — or **equivalent work experience**.

### Platform & Infrastructure Reliability

-   Design, deploy, monitor, and maintain production workloads across Kubernetes (EKS/AKS/GKE) clusters.
-   Build self-healing, auto-scaling systems that minimize toil and manual intervention.
-   Optimize networking, ingress/egress traffic control, and service mesh for secure & performant communication.
-   Design and operate reliable database and storage platforms (SQL, NoSQL, and object stores) in Kubernetes environments.
-   Own backup, disaster recovery, replication, and failover strategies to meet RPO/RTO targets for critical data services.
-   Optimize storage performance and cost through multi-tier strategies, hot/cold data separation, and S3/offloading lifecycle policies.
-   Troubleshoot and recover Kubernetes Persistent Volumes confidently during incidents (StorageClasses, CSI drivers, PVC issues).
-   Secure and scale object storage platforms (e.g., MinIO/S3-compatible) and integrate with workloads for high-throughput data pipelines.
-   Work with block storage (EBS/io2/gp3) and shared file systems (EFS, NFS) to balance performance, resiliency, and cost.

### Automation & Delivery

-   Champion GitOps and CI/CD best practices (ArgoCD, Flux, GitHub Actions).
-   Build automation for infrastructure provisioning and upgrades using Terraform, Helm, and Kubernetes Operators.
-   Reduce release risk through progressive delivery strategies (blue/green, canary, rolling updates on spot instances).

### Observability & Incident Response

-   Own the monitoring and alerting stack (Prometheus, Grafana, Loki, VictoriaMetrics, OpenSearch).
-   Lead incident management and postmortems to prevent recurrence.
-   Provide real-time visibility into system health, performance, and cost metrics.

### Security & Compliance

-   Implement least-privilege IAM policies, secure service-to-service communication, and network ACLs/firewalls.
-   Enforce Kubernetes RBAC, secret management, and secure image supply chain.
-   Participate in audit readiness and compliance efforts.

### Performance & Cost Optimization

-   Analyze and tune system performance under scale (CPU/memory/IO).
-   Partner with product and platform teams to right-size clusters, databases, and storage tiers.
-   Introduce cost visibility dashboards for engineering leadership.

### Preferred Qualifications

-   Experience managing mission-critical systems at scale (high traffic, multi-region).
-   Proven cost optimization in cloud/K8s environments.
-   Familiarity with service mesh (Istio, Linkerd) or advanced networking/egress control.
-   Experience with data platform components (Airflow, Debezium, ClickHouse, etc.) is a plus but not required.
-   Strong communication and teamwork skills, with the ability to collaborate across engineering, DevOps, security, and product teams.

## Requirements

-   8+ years in SRE / DevOps / Infrastructure Engineering roles.
-   Deep Kubernetes expertise (multi-cluster, Helm chart development, advanced networking).
-   Strong experience with GitOps workflows using ArgoCD/Flux.
-   Expertise with AWS (preferred) or Azure/GCP, plus Infrastructure-as-Code (Terraform, Pulumi, CloudFormation).
-   Advanced knowledge of SQL & NoSQL databases (MySQL/Aurora, PostgreSQL, MongoDB, Redis).
-   Scripting/automation skills in Python, Bash, or Go.
-   Solid background in monitoring/observability (Prometheus, Grafana, Loki, ELK/OpenSearch, VictoriaMetrics).
-   Experience with CI/CD at scale and managing production incidents.
-   Experience with streaming/messaging (Kafka, RabbitMQ, or similar).

## Benefits

-   Comprehensive Training & Development programs.
-   Performance-based Bonus incentives.
-   Flexible Work From Home options.

## Apply

[Apply at Salla](https://apply.workable.com/salla/j/CC7B11DB9D/apply)

