Job Summary:
The Machine Learning Platform Engineer will work closely with our AI and gaming technology teams to design, build, and scale a robust platform for machine learning model training and online inference services. This role involves developing MLOps processes that integrate cutting-edge AI capabilities into our gaming products and improve the automation, scalability, and reliability of our ML workflows. You will be part of an agile team that values innovation, code quality, and continuous learning.
Key Responsibilities:
- Participate in the architecture design, development, and optimization of our AI/ML platform.
- Implement and enhance MLOps workflows, including model training pipelines, CI/CD for ML, monitoring, and automated deployment.
- Build and optimize high-performance inference services for LLMs, agents, RAG systems, and other machine learning models.
- Collaborate with cross-functional teams (AI research, backend engineering, data engineering, product teams) to deliver seamless ML solutions from data ingestion to production deployment.
- Integrate logging, monitoring, and observability tools to ensure system stability, reliability, and scalability.
- Research, evaluate, and adopt new technologies, frameworks, and methodologies that enhance platform performance and engineering productivity.
Job Requirements:
- Bachelor's degree or above in Computer Science, Software Engineering, AI/ML, or a related field.
- Fluent in both English and Mandarin Chinese (you will work in a global team).
- Solid experience in machine learning workflows, including:
  - Data preprocessing and feature engineering
  - Model training, tuning, and evaluation
  - Model serving and production deployment
- Strong programming skills in Python or Golang (1+ years of experience preferred).
- Familiarity with at least one ML framework: TensorFlow, PyTorch, scikit-learn, or Keras.
- Experience with MLOps tools and practices, such as:
  - Pipeline orchestration: Airflow, Kubeflow, Prefect, or MLflow
  - Model serving: TensorFlow Serving, TorchServe, ONNX Runtime, Triton Inference Server
  - Experiment tracking: MLflow, Weights & Biases
- Hands-on experience with containerization and cloud services, such as:
  - Docker, Kubernetes
  - AWS, GCP, or Azure ML services
- Knowledge of distributed systems, microservices, and performance optimization for large-scale ML workloads.
- Experience working with REST APIs, gRPC, and backend development is a plus.
Soft Skills:
- Ability to write clean, maintainable, and well-documented code.
- Strong analytical, debugging, and problem-solving skills.
- Proactive learning attitude and enthusiasm for exploring new technologies in AI/ML and infrastructure engineering.
- Strong collaboration and communication skills, capable of working effectively with diverse engineering teams.