Ascend Money

Senior Site Reliability Engineer (SRE) - Observability

3-5 Years

This job is no longer accepting applications

  • Posted 3 months ago

Job Description

We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join our growing team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production systems. You will leverage your deep understanding of infrastructure, automation, and observability to champion operational excellence and build a resilient platform.

Key Responsibilities:

  • Design, build and maintain a comprehensive observability stack (monitoring, logging, tracing) to proactively identify and resolve issues across a complex and dynamic infrastructure.
  • Develop and maintain dashboards, alerts, and automated monitoring solutions to proactively detect and respond to incidents.
  • Improve incident detection and root cause analysis by enabling high-fidelity, real-time observability.
  • Manage and operate our Kubernetes platform, ensuring high availability, performance, and security.
  • Design, develop, and implement automation solutions for operational tasks, infrastructure provisioning, and application deployment.
  • Implement and maintain proactive measures to ensure platform stability, performance optimization, and capacity planning.
  • Provide support and expertise for critical middleware tools such as RabbitMQ, Redis, and Kafka, ensuring their optimal performance and reliability.
  • Participate in our on-call rotation, troubleshoot and resolve production incidents efficiently, and implement preventative measures.
  • Collaborate effectively with development and other engineering teams.

Qualifications:

  • Positive attitude and empathy for others.
  • Passion for developing and maintaining reliable, scalable infrastructure.
  • A minimum of 3 years of working experience in SRE, DevOps, or infrastructure engineering roles.
  • Deep understanding of observability principles and practices: metrics, logs, traces, events.
  • Hands-on experience with tools such as Prometheus, Grafana, OpenTelemetry, ELK, Fluentd, Jaeger, Zipkin, Datadog, or similar.
  • Experience in managing and operating Kubernetes in a production environment.
  • Experience with cloud platforms such as AWS or GCP.
  • Experience with high-availability, high-scale, and high-performance systems.
  • Understanding of cloud-native architectures.
  • Experience with DevSecOps practices.
  • Strong scripting and automation skills using languages like Python, Bash, or Go.
  • Proven experience in building and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI).
  • Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
  • Strong understanding of Linux systems administration and networking concepts.
  • Experience working with middleware technologies like RabbitMQ, Redis, and Kafka.
  • Excellent problem-solving and troubleshooting skills.
  • Excellent communication and collaboration skills.
  • Strong interest in, and ability to learn, new technical topics.

Job ID: 125598747