Principal Site Reliability Engineer

Salary: $175,000 - $225,000 per annum
Job type: Permanent

Principal Site Reliability Engineer (Principal SRE)

Location: Remote (must be based in the USA)
Pay: $175k - $225k base + equity

Highly competitive base salary + equity + comprehensive benefits
(Exact compensation dependent on experience and location)

Overview

We are partnered with a well‑funded, high‑growth AI technology company building a modern, scalable platform that enables customers to deploy advanced machine learning capabilities directly into their own environments.
This organisation is moving away from rigid, monolithic architectures toward a more flexible, event‑driven and platform‑oriented model. Infrastructure, reliability, and governance are treated as first‑class concerns rather than operational afterthoughts.
The Principal Site Reliability Engineer is a senior architectural leadership role, reporting directly into the executive technology organisation. This position sits at the intersection of infrastructure, machine learning systems, and internal platform governance, with broad influence across engineering, data, ML, and product teams.
The focus is on architectural leverage, durable systems, and long‑term platform evolution, with hands‑on involvement where it matters most.

Responsibilities

Define and evolve the architectural direction of a large‑scale, event‑driven infrastructure platform running primarily on AWS.
Influence how reliability, scalability, and operational standards are applied across engineering, data science, and ML teams.
Design and operate distributed systems that support ML training, orchestration, and model serving at scale.
Own and guide the reliability strategy for containerised workloads deployed both internally and into customer or partner environments.
Establish platform‑level standards around idempotency, event replay, schema governance, and operational traceability.
Provide technical leadership across Kubernetes control planes, multi‑cluster environments, and multi‑tenant isolation strategies.
Define and mature infrastructure‑as‑code standards, CI/CD pipelines, and governance guardrails that reduce risk and increase deployment confidence.
Shape observability, incident response, and post‑incident learning practices to drive systemic improvement rather than reactive fixes.
Mentor senior engineers and help scale an SRE function capable of supporting the platform as the company grows.

Must‑Have Qualifications

10 to 15 or more years of experience designing, building, and operating production infrastructure at scale.
Deep hands‑on experience with AWS, including networking, compute, storage, identity, and event‑driven services.
Proven experience designing or operating event‑driven architectures (e.g. Kafka/MSK, Kinesis, EventBridge, RabbitMQ).
Strong production experience managing Kubernetes control planes and multi‑cluster environments.
Experience with Infrastructure as Code (Terraform preferred) and modern CI/CD practices.
Background working in data‑heavy or ML‑centric environments, with an understanding of the reliability challenges unique to ML systems.
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience.

CONTACT

Bekim Doçaj

Principal Recruitment Consultant


