SRE Engineer

Redotpay

Other Engineering

Hong Kong

Posted on May 15, 2026

Role Overview

As a Site Reliability Engineer (SRE), you will be the guardian of our app and core business systems, ensuring their stability, availability, and recoverability. Through robust monitoring and alerting, incident response, release governance, capacity planning, automation, and disaster recovery drills, you will safeguard our end-user experience and maintain uninterrupted business continuity.

Core Responsibilities

App Stability Assurance

  • Own stability monitoring for critical user journeys, including login, homepage, trading, payments, deposits/withdrawals, and core APIs.
  • Define and track core Service Level Indicators (SLIs) such as user-side availability, API success/error rates, latency, and crash rates.
  • Promptly detect and address issues like app launch failures, API timeouts, service degradation, and regional access anomalies.

Monitoring, Alerting & Observability

  • Build and optimize comprehensive observability capabilities encompassing logs, metrics, distributed tracing, business probes, and Real User Monitoring (RUM).
  • Refine alerting rules to reduce noise/false positives and improve the accuracy of incident detection.
  • Establish and enforce tiered incident classification (P0/P1/P2), alongside clear notification, escalation, and response protocols.

Incident Response & Emergency Handling

  • Lead or actively participate in production incident triage, mitigation, recovery, and post-mortem analysis.
  • Develop and maintain emergency runbooks for critical scenarios (e.g., app downtime, core API failures, database anomalies, cloud service outages, network disruptions).
  • Drive Root Cause Analysis (RCA) and ensure the closed-loop implementation of corrective actions.

Release & Change Stability Governance

  • Participate in establishing best practices for production releases, canary/gray deployments, rollbacks, change windows, and post-release monitoring.
  • Identify and mitigate stability risks in the release pipeline to prevent incidents caused by deployments or configuration changes.
  • Champion the adoption of automated deployments, automated rollbacks, and advanced change risk controls.

Capacity, Performance & Resilience

  • Contribute to capacity planning, performance stress testing, resource utilization monitoring, and scaling strategies.
  • Drive the implementation of reliability patterns, including rate limiting, graceful degradation, circuit breaking, and backup/restore mechanisms.
  • Regularly organize or participate in chaos engineering/fault drills, disaster recovery exercises, and restoration validation.

Automation & Toil Reduction

  • Develop tools and platforms for automated health checks, alert analysis, and system self-healing.
  • Eliminate manual toil to make production issue resolution faster and more efficient.
  • Standardize operations by documenting Standard Operating Procedures (SOPs), runbooks, and post-mortem templates.

Qualifications

  • Solid understanding of core infrastructure components: Linux, networking, databases, caching, middleware, and cloud services.
  • Familiarity with common modern architectures: app backend services, API gateways, load balancing, CDNs, and Kubernetes/containerization.
  • Hands-on experience with one or more monitoring and observability ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, CloudWatch, APM, distributed tracing).
  • Proven track record in handling production incidents, with the ability to independently perform log analysis, trace debugging, performance profiling, and system recovery.
  • Strong understanding of SRE workflows, including deployments, canary releases, rollbacks, capacity planning, incident response, and post-mortems.
  • Proficiency in scripting or development (Shell, Python, or Go) to build automation tools.
  • Preferred: Experience ensuring the stability of global apps, or a background in payments, FinTech, Web3, or cross-border businesses.