Principal Site Reliability Engineer

Salary: $175,000 - $225,000 per annum
Job Type: Permanent

Principal Site Reliability Engineer (Principal SRE)

Location: Remote (Must be based in USA)
Pay: $175k - $225k base + equity

Highly competitive base salary + equity + comprehensive benefits
(Exact compensation dependent on experience and location)

Overview

We are partnered with a well‑funded, high‑growth AI technology company building a modern, scalable platform that enables customers to deploy advanced machine learning capabilities directly into their own environments.
This organisation is moving away from rigid, monolithic architectures toward a more flexible, event‑driven and platform‑oriented model. Infrastructure, reliability, and governance are treated as first‑class concerns rather than operational afterthoughts.
The Principal Site Reliability Engineer is a senior architectural leadership role, reporting directly into the executive technology organisation. This position sits at the intersection of infrastructure, machine learning systems, and internal platform governance, with broad influence across engineering, data, ML, and product teams.
The focus is on architectural leverage, durable systems, and long‑term platform evolution, with hands‑on involvement where it matters most.

Responsibilities

  • Define and evolve the architectural direction of a large‑scale, event‑driven infrastructure platform running primarily on AWS.
  • Influence how reliability, scalability, and operational standards are applied across engineering, data science, and ML teams.
  • Design and operate distributed systems that support ML training, orchestration, and model serving at scale.
  • Own and guide the reliability strategy for containerised workloads deployed both internally and into customer or partner environments.
  • Establish platform‑level standards around idempotency, event replay, schema governance, and operational traceability.
  • Provide technical leadership across Kubernetes control planes, multi‑cluster environments, and multi‑tenant isolation strategies.
  • Define and mature infrastructure‑as‑code standards, CI/CD pipelines, and governance guardrails that reduce risk and increase deployment confidence.
  • Shape observability, incident response, and post‑incident learning practices to drive systemic improvement rather than reactive fixes.
  • Mentor senior engineers and help scale an SRE function capable of supporting the platform as the company grows.

Must‑Have Qualifications

  • 10-15+ years of experience designing, building, and operating production infrastructure at scale.
  • Deep hands‑on experience with AWS, including networking, compute, storage, identity, and event‑driven services.
  • Proven experience designing or operating event‑driven architectures (e.g. Kafka/MSK, Kinesis, EventBridge, RabbitMQ).
  • Strong production experience managing Kubernetes control planes and multi‑cluster environments.
  • Experience with Infrastructure as Code (Terraform preferred) and modern CI/CD practices.
  • Background working in data‑heavy or ML‑centric environments, with an understanding of the reliability challenges unique to ML systems.
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.

Contact

Bekim Doçaj, Principal Recruitment Consultant
