AIOps Engineer

DataX

Thailand

Fresher

Posted 17 hours ago

Job Description

Qualifications

Technical Skills

Strong Python programming skills with experience in production-grade development practices, including writing maintainable, scalable, and well-structured code. Knowledge of TypeScript or Go is a plus.
Developer mindset with a solid understanding of backend development, microservices architecture, and RESTful API design.
Experience building automation workflows and data pipelines using Python, ETL processes, Bash scripting, and scheduled jobs (e.g., CronJobs).
Hands-on experience with cloud platforms (Azure), including Azure AI Services, Azure AI Foundry, and related AI infrastructure.
Experience with containerization and orchestration technologies, such as Docker and Kubernetes, for deploying scalable AI services.
Good understanding of Agentic AI architectures, AI Agents, and modern LLM frameworks such as LangChain, LangGraph, and AI SDK.
Ability to design and implement AI agents using both pro-code frameworks (e.g., Agent SDK) and low-code / no-code solutions (e.g., Azure AI Foundry Agents).

AI Observability & Monitoring

Experience implementing AI observability and monitoring systems for LLM-based applications and AI agents.
Hands-on experience with AI observability tools such as Langfuse and logging platforms such as Azure Log Analytics.
Experience building monitoring dashboards and system telemetry using Grafana.
Strong understanding of AI system logging, tracing, and performance monitoring for production AI systems.
Familiarity with LLM evaluation techniques, including LLM-as-a-Judge frameworks to measure agent quality and response performance.
Understanding of AI evaluation metrics, including model accuracy and response quality, latency and throughput, reliability and system health, and token usage and cost efficiency.

Operational Skills

Experience with monitoring and observability platforms such as Prometheus, Grafana, and the ELK Stack.
Ability to design operational dashboards to track AI agent KPIs, system health, and service reliability.
Understanding of model drift detection, data quality monitoring, and AI system observability practices.
Strong communication and collaboration skills to work with AI engineers, risk teams, product teams, and business stakeholders.

Responsibilities

Design and implement AI observability frameworks to monitor the performance, reliability, and behavior of AI agents and LLM-based systems in production environments.
Integrate AI observability platforms with third-party systems (e.g., SAS solutions) for governance, compliance monitoring, and operational reporting, using ETL pipelines and data integration workflows.
Build predictive analytics and automation frameworks to support AI system operations and operational decision-making.
Develop real-time monitoring dashboards and operational analytics tools for AI systems, including capabilities such as anomaly detection, predictive forecasting, incident monitoring and alerting, and root cause analysis for system failures or abnormal agent behavior.
Define and monitor AI agent KPIs and performance metrics, including quality evaluation using LLM-as-a-Judge approaches.
Collaborate with AI Engineers, AI Scientists, and Risk teams to define experiment metrics, evaluation frameworks, and continuous monitoring strategies for AI systems.
Work closely with Product teams, customers, and stakeholders to define business-level KPIs for AI agents and measure their impact on business outcomes.
Feed operational insights, monitoring results, and evaluation metrics back into the AI development lifecycle to drive continuous improvement of models and agents.
Manage AI platform operations, including platform upgrades, governance compliance, and SLA monitoring for production AI services.
Design and maintain data pipelines and operational data infrastructure to ensure standardized, clean, and reliable data for analytics, monitoring, and reporting.
Collaborate with DevOps, SRE, and IT teams on tooling, infrastructure, CI/CD pipelines, and deployment processes to ensure reliable AI system delivery.