DataX

AIOps Engineer

Fresher
  • Posted 17 hours ago

Job Description

Qualifications

Technical Skills

  • Strong Python programming skills with experience in production-grade development practices, including writing maintainable, scalable, and well-structured code. Knowledge of TypeScript or Go is a plus.
  • Developer mindset with solid understanding of backend development, microservices architecture, and RESTful API design.
  • Experience building automation workflows and data pipelines using Python, ETL processes, Bash scripting, and scheduled jobs (e.g., CronJobs).
  • Hands-on experience with cloud platforms (Azure), including Azure AI Services, Azure AI Foundry, and related AI infrastructure.
  • Experience with containerization and orchestration technologies, such as Docker and Kubernetes, for deploying scalable AI services.
  • Good understanding of Agentic AI architectures, AI Agents, and modern LLM frameworks such as LangChain, LangGraph, and AI SDK.
  • Ability to design and implement AI agents using:
      • Pro-code frameworks (e.g., Agent SDK)
      • Low-code / no-code solutions (e.g., Azure AI Foundry Agents)

AI Observability & Monitoring

  • Experience implementing AI observability and monitoring systems for LLM-based applications and AI agents.
  • Hands-on experience with AI observability tools such as Langfuse and logging platforms such as Azure Log Analytics.
  • Experience building monitoring dashboards and system telemetry using Grafana.
  • Strong understanding of AI system logging, tracing, and performance monitoring for production AI systems.
  • Familiarity with LLM evaluation techniques, including LLM-as-a-Judge frameworks to measure agent quality and response performance.
  • Understanding of AI evaluation metrics, including:
      • model accuracy and response quality
      • latency and throughput
      • reliability and system health
      • token usage and cost efficiency

Operational Skills

  • Experience with monitoring and observability platforms such as Prometheus, Grafana, and ELK Stack.
  • Ability to design operational dashboards to track AI agent KPIs, system health, and service reliability.
  • Understanding of model drift detection, data quality monitoring, and AI system observability practices.
  • Strong communication and collaboration skills to work with AI engineers, risk teams, product teams, and business stakeholders.

Responsibilities

  • Design and implement AI observability frameworks to monitor the performance, reliability, and behavior of AI agents and LLM-based systems in production environments.
  • Integrate AI observability platforms with third-party systems (e.g., SAS solutions) for governance, compliance monitoring, and operational reporting, using ETL pipelines and data integration workflows.
  • Build predictive analytics and automation frameworks to support AI system operations and operational decision-making.
  • Develop real-time monitoring dashboards and operational analytics tools for AI systems, including capabilities such as:
      • anomaly detection
      • predictive forecasting
      • incident monitoring and alerting
      • root cause analysis for system failures or abnormal agent behavior
  • Define and monitor AI agent KPIs and performance metrics, including quality evaluation using LLM-as-a-Judge approaches.
  • Collaborate with AI Engineers, AI Scientists, and Risk teams to define:
      • experiment metrics
      • evaluation frameworks
      • continuous monitoring strategies for AI systems
  • Work closely with Product teams, customers, and stakeholders to define business-level KPIs for AI agents and measure their impact on business outcomes.
  • Feed operational insights, monitoring results, and evaluation metrics back into the AI development lifecycle to drive continuous improvement of models and agents.
  • Manage AI platform operations, including platform upgrades, governance compliance, and SLA monitoring for production AI services.
  • Design and maintain data pipelines and operational data infrastructure to ensure standardized, clean, and reliable data for analytics, monitoring, and reporting.
  • Collaborate with DevOps, SRE, and IT teams on tooling, infrastructure, CI/CD pipelines, and deployment processes to ensure reliable AI system delivery.

More Info

Job Type:
Industry:
Function:
Employment Type:

About Company

Job ID: 145275831