Job Description
AKS Cluster Health & Reliability
- Monitor, maintain, and optimize Azure Kubernetes Service (AKS) cluster health, availability, and performance.
- Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for all platform services.
- Lead incident response across L1 (automated detection and recovery), L2 (root cause investigation), and L3 (deep engineering resolution); produce post-mortem documentation for P1/P2 incidents.
- Implement and enforce cluster security policies, resource quotas, and namespace governance.
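For illustration, the error-budget tracking mentioned above comes down to simple arithmetic. The sketch below assumes a request-based availability SLI; the function name, field names, and the 99.9% target are hypothetical and not taken from any actual platform standard.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Summarize error-budget usage for a request-based availability SLO.

    slo_target is a fraction, e.g. 0.999 for "99.9% of requests succeed".
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in the window.
    allowed_failures = (1.0 - slo_target) * total_requests
    consumed = failed_requests / allowed_failures if allowed_failures else float("inf")
    return {
        "sli": 1.0 - failed_requests / total_requests,  # observed success ratio
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": consumed,  # 1.0 means the budget is exhausted
    }


report = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
print(report)
```

With these hypothetical numbers, roughly a quarter of the budget is consumed, which is the kind of signal that would feed release and incident decisions.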
AI & API Platform Operations
- Manage the central API gateway including authentication policies, rate limiting, blue-green traffic routing, and request monitoring; ensure zero-downtime deployments with instant rollback capability.
- Operate the central abstraction layer for all Azure OpenAI and Azure AI Services — covering model version control, endpoint configuration, usage monitoring, and reliability mechanisms (retry, fallback, failover).
- Monitor AI service usage and API traffic patterns daily; detect anomalies, adjust configurations proactively, and maintain SLAs for all platform-managed AI services.
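As a rough sketch of the reliability mechanisms named above (retry, fallback, failover), the example below retries each endpoint with exponential backoff and then moves on to the next one. The function name, parameters, and the stand-in endpoint callables are all hypothetical; a real abstraction layer would wrap actual service clients and catch narrower exception types.

```python
import time
from typing import Callable, Optional, Sequence


def call_with_fallback(
    endpoints: Sequence[Callable[[], str]],
    retries: int = 2,
    base_delay: float = 0.5,
) -> str:
    """Try each endpoint in order, retrying with exponential backoff before falling back."""
    last_error: Optional[Exception] = None
    for call in endpoints:
        for attempt in range(retries + 1):
            try:
                return call()
            except Exception as exc:  # assumption: any exception counts as retryable
                last_error = exc
                if attempt < retries:
                    time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("all endpoints failed") from last_error


def primary() -> str:
    # Hypothetical primary endpoint that is currently unavailable.
    raise ConnectionError("primary unavailable")


def secondary() -> str:
    # Hypothetical fallback endpoint.
    return "response from secondary"


print(call_with_fallback([primary, secondary], retries=1, base_delay=0.0))
```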
Observability & Monitoring
- Build and maintain an observability stack covering API metrics, resource metrics, log standards, and distributed tracing across all platform services.
- Develop dashboards and alerting mechanisms across all support tiers; every alert must include service name, error type, likely cause, and trace ID.
- Establish and maintain runbooks for L1 triage, L2 escalation, and L3 resolution across all critical platform components.
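One way to enforce the alert requirement above (service name, error type, likely cause, and trace ID on every alert) is to validate the payload at construction time. This is a minimal sketch; the class name, field names, and sample values are hypothetical and not tied to any particular alerting tool.

```python
import json
from dataclasses import asdict, dataclass, fields


@dataclass(frozen=True)
class Alert:
    """Alert payload carrying the four fields every alert must include."""

    service_name: str
    error_type: str
    likely_cause: str
    trace_id: str

    def __post_init__(self) -> None:
        # Reject alerts with any missing or blank required field.
        for f in fields(self):
            value = getattr(self, f.name)
            if not isinstance(value, str) or not value.strip():
                raise ValueError(f"alert field '{f.name}' must be a non-empty string")

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


alert = Alert(
    service_name="api-gateway",          # hypothetical service
    error_type="HTTP_502",               # hypothetical error type
    likely_cause="upstream pod restart",
    trace_id="trace-0001",               # hypothetical trace ID
)
print(alert.to_json())
```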
CI/CD & DevOps Pipelines
- Design, implement, and maintain CI/CD pipelines using Jenkins and Azure DevOps for automated deployment of data and AI services.
- Enforce code quality, testing standards, and deployment governance through pipeline automation.
- Manage Kubernetes-based application deployments across UAT and production environments.
Cost & Capacity Management
- Monitor and optimize cloud resource utilization and spending across Azure services.
- Conduct capacity planning and provide recommendations to support business growth and AI project scaling.
- Implement resource tagging, cost allocation policies, and rightsizing recommendations.
Qualifications
- Bachelor's degree or higher in Computer Science, Information Technology, Software Engineering, or a related field
- 5+ years of hands-on experience in data engineering, platform engineering, DevOps, or SRE roles
- Demonstrated experience operating Kubernetes clusters in production environments, preferably on Azure (AKS)
- Experience handling platform support across L1 (automated monitoring & recovery), L2 (escalation & triage), and L3 (deep engineering & resolution) tiers
- Experience with API gateway management in a microservices environment is advantageous
- Experience in a financial services or regulated environment is advantageous
Technical Skills - Required
- Python: Proficient in writing automation scripts, tooling, and platform utilities
- Bash / Shell Scripting: Capable of developing and maintaining operational scripts for infrastructure management
- CI/CD Tooling: Hands-on experience with Jenkins and Azure DevOps for pipeline design and implementation
- Observability: Experience with monitoring and logging tools covering metrics, logs, and distributed tracing (e.g., Prometheus, Grafana, Azure Monitor, or equivalent)
Technical Skills - Preferred
- Helm & Kubernetes Manifests: Experience authoring, managing, and deploying Helm charts and Kubernetes manifests
- API Gateway platforms (e.g., Kong, NGINX, Azure API Management, or equivalent)
- Azure OpenAI or equivalent AI service operations, including model version management and usage monitoring
- Terraform or equivalent Infrastructure-as-Code (IaC) tooling
- Familiarity with the Azure CLI and Azure Kubernetes Service (AKS)
- Databricks workspace administration and job management