Senior Site Reliability Engineer (SRE) - Observability

Ascend Money

Thailand

3-5 Years

Job Description

We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join our growing team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production systems. You will leverage your deep understanding of infrastructure, automation, and observability to champion operational excellence and build a resilient platform.

Key Responsibilities:

Design, build and maintain a comprehensive observability stack (monitoring, logging, tracing) to proactively identify and resolve issues across a complex and dynamic infrastructure.
Develop and maintain dashboards, alerts, and automated monitoring solutions to detect and respond to incidents early.
Improve incident detection and root cause analysis by enabling high-fidelity, real-time observability.
Manage and operate our Kubernetes platform, ensuring high availability, performance, and security.
Design, develop, and implement automation solutions for operational tasks, infrastructure provisioning, and application deployment.
Implement and maintain proactive measures to ensure platform stability, performance optimization, and capacity planning.
Provide support and expertise for critical middleware tools such as RabbitMQ, Redis, and Kafka, ensuring their optimal performance and reliability.
Participate in our on-call rotation, troubleshoot and resolve production incidents efficiently, and implement preventative measures.
Collaborate effectively with development and other engineering teams.

Qualifications:

Positive attitude and empathy for others.
Passion for developing and maintaining reliable, scalable infrastructure.
A minimum of 3 years of working experience in SRE, DevOps, or infrastructure engineering roles.
Deep understanding of observability principles and practices: metrics, logs, traces, events.
Hands-on experience with tools such as Prometheus, Grafana, OpenTelemetry, ELK, Fluentd, Jaeger, Zipkin, Datadog, or similar.
Experience in managing and operating Kubernetes in a production environment.
Experience with cloud platforms such as AWS or GCP.
Experience with high-availability, large-scale, high-performance systems.
Understanding of cloud-native architectures.
Experience with DevSecOps practices.
Strong scripting and automation skills using languages like Python, Bash, or Go.
Proven experience in building and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI).
Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
Strong understanding of Linux systems administration and networking concepts.
Experience working with middleware technologies like RabbitMQ, Redis, and Kafka.
Excellent problem-solving and troubleshooting skills.
Excellent communication and collaboration skills.
Strong interest in, and ability to learn, new technical topics.