Staff Software Reliability Engineer (SRE)

Stairwell

Canada
Posted on Mar 18, 2026

About Stairwell

Stairwell is building the future of threat detection and malware analysis. Our platform helps security teams find threats that others miss by transforming how organizations collect, analyze, and act on file-based intelligence. We're a small, high-impact team solving hard problems at the intersection of cybersecurity and large-scale data systems.

The Role

We're looking for a Staff SRE who can own the reliability, scalability, and operational excellence of our platform. You'll work at the intersection of infrastructure and software engineering - building the systems, tooling, and practices that let our team ship confidently and operate at scale.

You'll set technical direction for how we approach infrastructure and reliability - making architectural decisions, evaluating tools and vendors, and establishing practices that scale with the team.

This is also a hands-on role. You'll be deep in Kubernetes, GCP, CI/CD pipelines, and our Bazel-powered build system. You'll help shape how we think about reliability as a company - defining SLOs, improving observability, and building the automation that keeps our platform running.

We are rapidly embracing AI-driven development and expect you to be fluent with tools like Claude Code, Copilot, or similar. You should have strong opinions about where AI accelerates your work and where it doesn't. We also highly value the human in the loop: engineers at Stairwell know their customers, the business, and the architecture well, so they can design and build effective products quickly.

What You'll Do

  • Set technical direction for infrastructure and reliability - evaluate approaches, make architectural decisions, and establish standards

  • Own and evolve our Kubernetes-based infrastructure on GCP

  • Build and maintain CI/CD pipelines, deployment tooling, and release processes

  • Maintain and simplify our build system (Bazel) for faster, more reliable builds across the org

  • Define and instrument SLIs/SLOs; build dashboards and alerting that surface real problems

  • Drive incident response, post-mortems, and reliability improvements

  • Partner with product engineers to design systems that are reliable and operable from day one

  • Contribute to our engineering culture around AI-augmented development - sharing patterns, workflows, and lessons learned

What We're Looking For

  • Significant experience in SRE, platform engineering, or infrastructure roles at scale

  • Demonstrated technical leadership: you've driven significant infrastructure or reliability initiatives, not just executed on them

  • Deep hands-on expertise with Kubernetes (GKE preferred) and GCP services

  • Strong programming skills - Go preferred

  • Experience with build systems (Bazel strongly preferred) and CI/CD tooling

  • Practical experience with AI coding assistants as part of your regular workflow - not just experimentation, but daily use

  • Ability to critically evaluate AI-generated code and infrastructure configs: you know when to trust it, when to revise it, and when to write it yourself

  • Track record of improving reliability through automation, observability, and good engineering practices

  • Comfort with ambiguity and ownership; we're a small team where engineers drive decisions

Nice to Have

  • Background in security, malware analysis, or threat detection

  • Experience with large-scale data systems (BigTable, Spanner, BigQuery)

  • Deep proficiency in Go

Why Stairwell

  • Hard technical problems with real security impact

  • Small team, huge impact, high autonomy, low process overhead

  • Opportunity to collaborate with world-class experts in cybersecurity

  • Work remotely in the USA or Canada, or use our co-working space in Santa Clara to collaborate with teammates in person