ttb bank

System Reliability Engineer (Platform)

5-7 Years

Job Description

AKS Cluster Health & Reliability

  • Monitor, maintain, and optimize Azure Kubernetes Service (AKS) cluster health, availability, and performance.
  • Define and track Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets for all platform services.
  • Lead incident response across L1 (automated detection and recovery), L2 (root cause investigation), and L3 (deep engineering resolution); produce post-mortem documentation for P1/P2 incidents.
  • Implement and enforce cluster security policies, resource quotas, and namespace governance.

AI & API Platform Operations

  • Manage the central API gateway including authentication policies, rate limiting, blue-green traffic routing, and request monitoring; ensure zero-downtime deployments with instant rollback capability.
  • Operate the central abstraction layer for all Azure OpenAI and Azure AI Services — covering model version control, endpoint configuration, usage monitoring, and reliability mechanisms (retry, fallback, failover).
  • Monitor AI service usage and API traffic patterns daily; detect anomalies, adjust configurations proactively, and maintain SLAs for all platform-managed AI services.

Observability & Monitoring

  • Build and maintain an observability stack covering API metrics, resource metrics, log standards, and distributed tracing across all platform services.
  • Develop dashboards and alerting mechanisms across all support tiers; every alert must include service name, error type, likely cause, and trace ID.
  • Establish and maintain runbooks for L1 triage, L2 escalation, and L3 resolution across all critical platform components.

CI/CD & DevOps Pipelines

  • Design, implement, and maintain CI/CD pipelines using Jenkins and Azure DevOps for automated deployment of data and AI services.
  • Enforce code quality, testing standards, and deployment governance through pipeline automation.
  • Manage Kubernetes-based application deployments across UAT and production environments.

Cost & Capacity Management

  • Monitor and optimize cloud resource utilization and spending across Azure services.
  • Conduct capacity planning and provide recommendations to support business growth and AI project scaling.
  • Implement resource tagging, cost allocation policies, and rightsizing recommendations.

Qualification

  • Bachelor's degree or higher in Computer Science, Information Technology, Software Engineering, or a related field
  • 5+ years of hands-on experience in data engineering, platform engineering, DevOps, or SRE roles
  • Demonstrated experience operating Kubernetes clusters in production environments, preferably on Azure (AKS)
  • Experience handling platform support across L1 (automated monitoring & recovery), L2 (escalation & triage), and L3 (deep engineering & resolution) tiers
  • Experience with API gateway management in a microservices environment is advantageous
  • Experience in a financial services or regulated environment is advantageous

Technical Skills: Required

  • Python: Proficient in writing automation scripts, tooling, and platform utilities
  • Bash / Shell Scripting: Capable of developing and maintaining operational scripts for infrastructure management
  • CI/CD Tooling: Hands-on experience with Jenkins and Azure DevOps for pipeline design and implementation
  • Observability: Experience with monitoring and logging tools covering metrics, logs, and distributed tracing (e.g., Prometheus, Grafana, Azure Monitor, or equivalent)

Technical Skills: Preferred

  • Helm & Kubernetes Manifests: Experience authoring, managing, and deploying Helm charts and Kubernetes manifests
  • API Gateway platforms (e.g., Kong, NGINX, Azure API Management, or equivalent)
  • Azure OpenAI or equivalent AI service operations, including model version management and usage monitoring
  • Terraform or equivalent Infrastructure-as-Code (IaC) tooling
  • Familiarity with Azure CLI and Azure Kubernetes Service (AKS)
  • Databricks workspace administration and job management


Job ID: 149100081