Staff Software Reliability Engineer (SRE)
Stairwell
About Stairwell
Stairwell is building the future of threat detection and malware analysis. Our platform helps security teams find threats that others miss by transforming how organizations collect, analyze, and act on file-based intelligence. We're a small, high-impact team solving hard problems at the intersection of cybersecurity and large-scale data systems.
The Role
We're looking for a Staff SRE who can own the reliability, scalability, and operational excellence of our platform. You'll work at the intersection of infrastructure and software engineering - building the systems, tooling, and practices that let our team ship confidently and operate at scale.
You'll set technical direction for how we approach infrastructure and reliability - making architectural decisions, evaluating tools and vendors, and establishing practices that scale with the team.
This is also a hands-on role. You'll be deep in Kubernetes, GCP, CI/CD pipelines, and our Bazel-powered build system. You'll help shape how we think about reliability as a company - defining SLOs, improving observability, and building the automation that keeps our platform running.
We are rapidly embracing AI-driven development and expect you to be fluent with tools like Claude Code, Copilot, or similar. You should have strong opinions about where AI accelerates your work and where it doesn't. We also highly value keeping a human in the loop: engineers at Stairwell know their customer, the business, and the architecture well, so they can design and build effective products quickly.
What You'll Do
Set technical direction for infrastructure and reliability - evaluate approaches, make architectural decisions, and establish standards
Own and evolve our Kubernetes-based infrastructure on GCP
Build and maintain CI/CD pipelines, deployment tooling, and release processes
Maintain and simplify our build system (Bazel) for faster, more reliable builds across the org
Define and instrument SLIs/SLOs; build dashboards and alerting that surface real problems
Drive incident response, post-mortems, and reliability improvements
Partner with product engineers to design systems that are reliable and operable from day one
Contribute to our engineering culture around AI-augmented development - sharing patterns, workflows, and lessons learned
What We're Looking For
Significant experience in SRE, platform engineering, or infrastructure roles at scale
Demonstrated technical leadership: you've driven significant infrastructure or reliability initiatives, not just executed on them
Deep hands-on expertise with Kubernetes (GKE preferred) and GCP services
Strong programming skills - Go preferred
Experience with build systems (Bazel strongly preferred) and CI/CD tooling
Practical experience with AI coding assistants as part of your regular workflow - not just experimentation, but daily use
Ability to critically evaluate AI-generated code and infrastructure configs: you know when to trust it, when to revise it, and when to write it yourself
Track record of improving reliability through automation, observability, and good engineering practices
Comfort with ambiguity and ownership; we're a small team where engineers drive decisions
Nice to Have
Background in security, malware analysis, or threat detection
Experience with large-scale data systems (BigTable, Spanner, BigQuery)
Deep proficiency in Go
Why Stairwell
Hard technical problems with real security impact
Small team, huge impact, high autonomy, low process overhead
Opportunity to collaborate with world-class experts in cybersecurity
Work remotely in the USA or Canada, or use our co-working space in Santa Clara to collaborate with teammates in person
