We are seeking a highly motivated and experienced Senior Site Reliability Engineer (SRE) to join our growing team. As a Senior SRE, you will play a critical role in ensuring the reliability, scalability, and performance of our production systems. You will leverage your deep understanding of infrastructure, automation, and observability to champion operational excellence and build a resilient platform.
Key Responsibilities:
- Design, deploy, and maintain highly available and resilient messaging platforms and middleware (Kafka, RabbitMQ, Redis, etc.) across cloud and hybrid environments.
- Define and enforce best practices for message reliability, schema evolution, data integrity, and partitioning strategies.
- Develop automation tools and infrastructure-as-code (IaC) to manage messaging clusters and configurations at scale.
- Monitor and optimize messaging performance (throughput, latency, backlog), and tune systems for scale and reliability.
- Build and maintain observability into messaging pipelines, including custom metrics, logging, and distributed tracing.
- Collaborate with development and data teams to support event-driven architectures and microservices communication.
- Drive root cause analysis for messaging-related incidents and lead incident response.
- Participate in our on-call rotation, troubleshoot and resolve production incidents efficiently, and implement preventative measures.
- Manage and operate our Kubernetes platform, ensuring high availability, performance, and security.
- Build and maintain a comprehensive observability stack (monitoring, logging, tracing) to proactively identify and resolve issues.
- Implement and maintain proactive measures to ensure platform stability, performance optimization, and capacity planning.
Qualifications:
- Positive attitude and empathy for others.
- Passion for developing and maintaining reliable, scalable infrastructure.
- At least 3 years of professional experience in relevant areas.
- Deep understanding of middleware technologies like RabbitMQ, Redis, and Kafka.
- Experience managing and operating Kubernetes in a production environment.
- Experience with cloud platforms such as AWS or GCP.
- Experience with highly available, large-scale, high-performance systems.
- Understanding of cloud-native architectures.
- Experience with DevSecOps practices.
- Strong scripting and automation skills using languages like Python, Bash, or Go.
- Proven experience in building and maintaining CI/CD pipelines (e.g., Jenkins, GitLab CI).
- Deep understanding of monitoring, logging, and tracing tools and techniques.
- Experience with infrastructure-as-code tools (e.g., Terraform, Ansible).
- Strong understanding of Linux systems administration and networking concepts.
- Excellent problem-solving and troubleshooting skills.
- Excellent communication and collaboration skills.
- Strong interest in, and ability to learn, new technical topics.