Qualifications
Technical Skills
- Strong Python programming skills with experience in production-grade development practices, including writing maintainable, scalable, and well-structured code. Knowledge of TypeScript or Go is a plus.
- Developer mindset with a solid understanding of backend development, microservices architecture, and RESTful API design.
- Experience building automation workflows and data pipelines using Python, ETL processes, Bash scripting, and scheduled jobs (e.g., CronJobs).
- Hands-on experience with Azure, including Azure AI Services, Azure AI Foundry, and related AI infrastructure.
- Experience with containerization and orchestration technologies, such as Docker and Kubernetes, for deploying scalable AI services.
- Good understanding of agentic AI architectures, AI agents, and modern LLM frameworks such as LangChain, LangGraph, and AI SDK.
- Ability to design and implement AI agents using:
  - Pro-code frameworks (e.g., Agent SDK)
  - Low-code / no-code solutions (e.g., Azure AI Foundry Agents)
AI Observability & Monitoring
- Experience implementing AI observability and monitoring systems for LLM-based applications and AI agents.
- Hands-on experience with AI observability tools such as Langfuse and logging platforms such as Azure Log Analytics.
- Experience building monitoring dashboards and system telemetry using Grafana.
- Strong understanding of logging, tracing, and performance monitoring for production AI systems.
- Familiarity with LLM evaluation techniques, including LLM-as-a-Judge frameworks to measure agent quality and response performance.
- Understanding of AI evaluation metrics, including:
  - model accuracy and response quality
  - latency and throughput
  - reliability and system health
  - token usage and cost efficiency
Operational Skills
- Experience with monitoring and observability platforms such as Prometheus, Grafana, and ELK Stack.
- Ability to design operational dashboards to track AI agent KPIs, system health, and service reliability.
- Understanding of model drift detection, data quality monitoring, and AI system observability practices.
- Strong communication and collaboration skills to work with AI engineers, risk teams, product teams, and business stakeholders.
Responsibilities
- Design and implement AI observability frameworks to monitor the performance, reliability, and behavior of AI agents and LLM-based systems in production environments.
- Use ETL pipelines and data integration workflows to integrate AI observability platforms with third-party systems (e.g., SAS solutions) for governance, compliance monitoring, and operational reporting.
- Build predictive analytics and automation frameworks to support AI system operations and decision-making.
- Develop real-time monitoring dashboards and operational analytics tools for AI systems, including capabilities such as:
  - anomaly detection
  - predictive forecasting
  - incident monitoring and alerting
  - root cause analysis for system failures or abnormal agent behavior
- Define and monitor AI agent KPIs and performance metrics, including quality evaluation using LLM-as-a-Judge approaches.
- Collaborate with AI Engineers, AI Scientists, and Risk teams to define:
  - experiment metrics
  - evaluation frameworks
  - continuous monitoring strategies for AI systems
- Work closely with Product teams, customers, and stakeholders to define business-level KPIs for AI agents and measure their impact on business outcomes.
- Feed operational insights, monitoring results, and evaluation metrics back into the AI development lifecycle to drive continuous improvement of models and agents.
- Manage AI platform operations, including platform upgrades, governance compliance, and SLA monitoring for production AI services.
- Design and maintain data pipelines and operational data infrastructure to ensure standardized, clean, and reliable data for analytics, monitoring, and reporting.
- Collaborate with DevOps, SRE, and IT teams on tooling, infrastructure, CI/CD pipelines, and deployment processes to ensure reliable AI system delivery.