# Sr. Site Reliability Engineer

> Tiger Analytics Inc. · Washington, United States · Posted 2026-05-08

**Workplace:** On-site

## Description

### **Role Overview**

We are seeking a high-caliber **Site Reliability Engineer (SRE)** to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on **MLOps**—bridging the gap between model development and production-grade reliability.

### **Key Responsibilities**

### **1\. Reliability & Performance Engineering**

-   **SLO/SLI Management:** Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
-   **Error Budgeting:** Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
-   **Scalability:** Architect and manage auto-scaling strategies for **Kubernetes (GKE)** to handle fluctuating workloads during model training and high-volume inference.

### **2\. MLOps & AI Infrastructure**

-   **Model Serving Reliability:** Ensure the high availability of **Vertex AI endpoints** and custom inference services.
-   **GPU/TPU Optimization:** Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
-   **Pipeline Resilience:** Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.

### **3\. Automation & Orchestration (Eliminating "Toil")**

-   **Infrastructure as Code (IaC):** Use **Terraform** or Pulumi to provision and manage consistent, version-controlled cloud environments.
-   **CI/CD & GitOps:** Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
-   **Task Automation:** Develop custom Python or Go scripts that automate repetitive operational tasks, implement self-healing mechanisms, and clean up unused resources.

### **4\. Monitoring, Alerting & Incident Response**

-   **Observability:** Build and manage comprehensive dashboards using **Prometheus, Grafana, or Google Cloud Operations Suite (formerly Stackdriver)**.
-   **Incident Management:** Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
-   **Blameless Post-Mortems:** Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

## Requirements

**Orchestration:** Expert-level knowledge of **Kubernetes (K8s)** and Docker.

**MLOps Stack:** Familiarity with tools such as **Kubeflow, Vertex AI, MLflow, or DVC**.

**Scripting:** Strong proficiency in **Python** (for automation) and Bash; knowledge of Go is a plus.

**Data Systems:** Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or vector databases such as Pinecone or Milvus).

**Networking:** Solid understanding of VPCs, load balancers, DNS, and secure service meshes (Istio/Anthos).

## Benefits

Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.

_**Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis as protected by federal, state, or local law.**_

## Apply

[Apply at Tiger Analytics Inc.](https://apply.workable.com/tiger-analytics/j/AF0D1A3499/apply)
