Senior Site Reliability Engineer (SRE) - Messaging - Virtual Bank

Ascend Money

Thailand

3-5 Years

Posted 3 months ago

Job Description

We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join our growing team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production systems. You will leverage your deep understanding of infrastructure, automation, and observability to champion operational excellence and build a resilient platform.

Key Responsibilities:

Design, deploy, and maintain highly available and resilient messaging platforms and middleware (Kafka, RabbitMQ, Redis, etc.) across cloud and hybrid environments.
Define and enforce best practices for message reliability, schema evolution, data integrity, and partitioning strategies.
Develop automation tools and infrastructure-as-code (IaC) to manage messaging clusters and configurations at scale.
Monitor and optimize messaging performance (throughput, latency, backlog), and tune systems for scale and reliability.
Build and maintain observability into messaging pipelines, including custom metrics, logging, and distributed tracing.
Collaborate with development and data teams to support event-driven architectures and microservices communication.
Drive root cause analysis for messaging-related incidents and lead incident response.
Participate in our on-call rotation, troubleshoot and resolve production incidents efficiently, and implement preventative measures.
Manage and operate our Kubernetes platform, ensuring high availability, performance, and security.
Build and maintain a comprehensive observability stack (monitoring, logging, tracing) to proactively identify and resolve issues.
Implement and maintain proactive measures to ensure platform stability, performance optimization, and capacity planning.

Qualifications:

Positive attitude and empathy for others.
Passion for developing and maintaining reliable, scalable infrastructure.
At least 3 years of work experience in relevant areas.
Deep understanding of middleware technologies like RabbitMQ, Redis, and Kafka.
Experience in managing and operating Kubernetes in a production environment.
Experience with cloud platforms such as AWS or GCP.
Experience with highly available, large-scale, high-performance systems.
Understanding of cloud-native architectures.
Experience with DevSecOps practices.
Strong scripting and automation skills using languages like Python, Bash, or Go.
Proven experience in building and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI).
Deep understanding of monitoring, logging, and tracing tools and techniques.
Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
Strong understanding of Linux systems administration and networking concepts.
Excellent problem-solving and troubleshooting skills.
Excellent communication and collaboration skills.
Strong interest in, and ability to learn, new technical topics.