System Reliability Engineer (Platform)

5-7 Years

Job Description

AKS Cluster Health & Reliability

Monitor, maintain, and optimize Azure Kubernetes Service (AKS) cluster health, availability, and performance.
Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for all platform services.
Lead incident response across L1 (automated detection and recovery), L2 (root cause investigation), and L3 (deep engineering resolution); produce post-mortem documentation for P1/P2 incidents.
Implement and enforce cluster security policies, resource quotas, and namespace governance.
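
For illustration only, the error-budget tracking mentioned above can be sketched as simple arithmetic. This is a minimal sketch assuming a request-based availability SLO; the function name `error_budget` and its parameters are hypothetical and do not describe the team's actual tooling.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget consumption for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    # SLI: fraction of requests that succeeded in the window.
    sli = 1.0 - failed_requests / total_requests
    # Error budget: number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    # Share of the budget already spent (values above 1.0 mean the SLO is breached).
    consumed = failed_requests / allowed_failures

    return {
        "sli": sli,
        "allowed_failures": allowed_failures,
        "budget_consumed": consumed,
        "budget_remaining": 1.0 - consumed,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    print(error_budget(0.999, 1_000_000, 250))
```

With a 99.9% target over 1,000,000 requests, roughly 1,000 failures are tolerated, so 250 failures spend about a quarter of the budget.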

AI & API Platform Operations

Manage the central API gateway including authentication policies, rate limiting, blue-green traffic routing, and request monitoring; ensure zero-downtime deployments with instant rollback capability.
Operate the central abstraction layer for all Azure OpenAI and Azure AI Services — covering model version control, endpoint configuration, usage monitoring, and reliability mechanisms (retry, fallback, failover).
Monitor AI service usage and API traffic patterns daily; detect anomalies, adjust configurations proactively, and maintain SLAs for all platform-managed AI services.

Observability & Monitoring

Build and maintain an observability stack covering API metrics, resource metrics, log standards, and distributed tracing across all platform services.
Develop dashboards and alerting mechanisms across all support tiers; every alert must include service name, error type, likely cause, and trace ID.
Establish and maintain runbooks for L1 triage, L2 escalation, and L3 resolution across all critical platform components.
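
As a rough illustration of the alert requirement above (service name, error type, likely cause, and trace ID), a check along the following lines could reject incomplete alerts before they are routed. The field keys and the helper `missing_alert_fields` are assumptions made for this sketch, not an existing schema or API.

```python
# Hypothetical key names for the four fields every alert must carry.
REQUIRED_ALERT_FIELDS = ("service_name", "error_type", "likely_cause", "trace_id")


def missing_alert_fields(alert: dict) -> list[str]:
    """Return the required fields that are absent, None, or blank in an alert payload."""
    return [
        name
        for name in REQUIRED_ALERT_FIELDS
        if not str(alert.get(name) or "").strip()
    ]


if __name__ == "__main__":
    # Example payload with made-up values; likely_cause and trace_id are missing.
    incomplete = {"service_name": "api-gateway", "error_type": "HTTP 503"}
    print(missing_alert_fields(incomplete))
```

A check like this could run as a pipeline gate on alert rule definitions, so that an alert without a trace ID never reaches the on-call tier.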

CI/CD & DevOps Pipelines

Design, implement, and maintain CI/CD pipelines using Jenkins and Azure DevOps for automated deployment of data and AI services.
Enforce code quality, testing standards, and deployment governance through pipeline automation.
Manage Kubernetes-based application deployments across UAT and production environments.

Cost & Capacity Management

Monitor and optimize cloud resource utilization and spending across Azure services.
Conduct capacity planning and provide recommendations to support business growth and AI project scaling.
Implement resource tagging, cost allocation policies, and rightsizing recommendations.

Qualifications

Bachelor's degree or higher in Computer Science, Information Technology, Software Engineering, or a related field
5+ years of hands-on experience in data engineering, platform engineering, DevOps, or SRE roles
Demonstrated experience operating Kubernetes clusters in production environments, preferably on Azure (AKS)
Experience handling platform support across L1 (automated monitoring & recovery), L2 (escalation & triage), and L3 (deep engineering & resolution) tiers
Experience with API gateway management in a microservices environment is advantageous
Experience in a financial services or regulated environment is advantageous

Technical Skills: Required

Python: Proficient in writing automation scripts, tooling, and platform utilities
Bash / Shell Scripting: Capable of developing and maintaining operational scripts for infrastructure management
CI/CD Tooling: Hands-on experience with Jenkins and Azure DevOps for pipeline design and implementation
Observability: Experience with monitoring and logging tools covering metrics, logs, and distributed tracing (e.g., Prometheus, Grafana, Azure Monitor, or equivalent)

Technical Skills: Preferred

Helm & Kubernetes Manifests: Experience authoring, managing, and deploying Helm charts and Kubernetes manifests
API Gateway platforms (e.g., Kong, NGINX, Azure API Management, or equivalent)
Azure OpenAI or equivalent AI service operations, including model version management and usage monitoring
Terraform or equivalent Infrastructure-as-Code (IaC) tooling
Familiarity with the Azure CLI and Azure Kubernetes Service (AKS)
Databricks workspace administration and job management