# Senior Site Reliability Engineer (SRE) - (GCP)

> Devsu · Peru (Remote) · Full-time · Posted 2026-05-18

**Salary:** USD 10,000–56,400

**Workplace:** remote

**Department:** Engineering Lucia

## Description

We are seeking a Senior Site Reliability Engineer (SRE) with deep expertise in monitoring, observability, and reliability engineering to support systems running across on-premises infrastructure and Google Cloud Platform (GCP).

This role is primarily responsible for designing, operating, and improving monitoring, alerting, and observability platforms, with a strong focus on Grafana and Kubernetes environments.

As a secondary responsibility, this role provides backup coverage for the Application Support team during periods of resource constraints or major incidents, offering L2/L3 technical support when required.

### Responsibilities

Monitoring & Observability (Core Focus)

-   Own and operate the monitoring and observability stack across on-prem and GCP environments
-   Design, build, and maintain Grafana dashboards for infrastructure, Kubernetes, and applications
-   Define, tune, and maintain alerts to ensure a high signal-to-noise ratio
-   Establish observability standards and best practices across teams
-   Improve visibility into system health, performance, and reliability

Site Reliability Engineering

-   Apply SRE principles to improve availability, performance, and resilience
-   Define and track SLIs, SLOs, and error budgets
-   Participate in on-call rotations and SEV incident response
-   Lead or contribute to incident investigations and root cause analysis (RCA)
-   Drive preventative actions to reduce repeat incidents

Kubernetes & Platform Reliability

-   Support and monitor Kubernetes environments (GKE and on-prem clusters)
-   Monitor cluster health, capacity, and resource utilization
-   Troubleshoot platform-level issues impacting application reliability
-   Collaborate with Platform and Engineering teams on reliability improvements

### Secondary Responsibilities (Backup Application Support)

These responsibilities are activated as needed and are not part of day-to-day operations.

-   Provide L2/L3 application support coverage during:
    -   Support team resource shortages
    -   High-severity incidents (SEVs)
    -   Peak support periods or escalations
-   Triage and troubleshoot application issues using existing runbooks and dashboards
-   Collaborate with Application Support and Engineering teams during incidents
-   Ensure all actions, findings, and resolutions are documented in ServiceNow (SNOW)

## Requirements

-   Strong experience as a **Site Reliability Engineer or Reliability Engineer**
-   Deep hands-on expertise with **Grafana** (dashboards, alerting, troubleshooting)
-   Solid experience with monitoring and observability systems
-   Production experience operating **Kubernetes** environments
-   Experience supporting systems in **GCP** and on-prem environments (mandatory)
-   Strong **Linux** systems and troubleshooting skills
-   Fluent **English** (written and spoken)
-   Ability to work in the **PST time zone**
-   Ability to participate in an **on-call rotation** that includes coverage for one weekend day. Time worked during the weekend is compensated with one day off during the week, in accordance with the established work schedule.

Technology Stack:

-   Observability: Grafana, Prometheus, logging platforms
-   Containers: Kubernetes (GKE and on-prem)
-   Cloud: Google Cloud Platform (GCP)
-   Operations: Linux, networking, infrastructure monitoring
-   Incident Tools: PagerDuty, ServiceNow, Slack (or equivalents)

Nice to have:

-   Experience supporting application teams during SEV incidents
-   Knowledge of capacity planning and performance tuning
-   Scripting skills (Python, Bash, etc.)
-   Experience with hybrid infrastructure environments

## Benefits

At Devsu, we believe in creating an environment where you can thrive both personally and professionally. By joining our team, you’ll enjoy:

-   A stable, long-term contract with opportunities for career growth
-   Private health insurance
-   A remote-friendly culture that promotes work-life balance
-   Continuous training, mentorship, and learning programs to keep you at the forefront of the industry
-   Free access to AI training resources and state-of-the-art AI tools to elevate your daily work
-   A flexible Paid Time Off (PTO) policy, as well as paid holidays
-   Challenging, world-class software projects for clients in the US and LatAm
-   Collaboration with some of the most talented software engineers in Latin America and the US, in a diverse work environment

Join Devsu and discover a workplace that values your growth, supports your well-being, and empowers you to make a global impact.

## Apply

[Apply at Devsu](https://apply.workable.com/devsu/j/0E0E694B25/apply)

