CNCF-hosted Co-located Events Europe 2025 takes place on 1 April, in person at Excel London in London, England.

The Sched app allows you to build your schedule, but it is not a substitute for event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 and hold an All-Access pass to participate in the sessions.

To view the full event schedule for a specific CNCF-hosted Co-located event, you can use the right-hand navigation bar to sort and filter.

The schedule is subject to change.
Tuesday April 1, 2025 17:00 - 17:10 BST
Kubernetes is widely adopted for inference workloads, but distributed ML training still presents challenges, such as dynamic resource scaling, GPU scheduling, and efficient inter-node communication. Recent advancements, including KubeRay, Kubeflow, and Slurm integration, have expanded Kubernetes' capabilities for training workloads, making it a more viable option for complex, large-scale ML tasks.

This session focuses on the next step: benchmarking these tools to evaluate and optimize their performance for distributed ML training. We’ll review existing solutions, discuss the design and implementation of our benchmarking platform, and demonstrate how it provides actionable insights to improve throughput, scalability, and efficiency.
Speakers
Liang Yan

Sr. Software Engineer, Coreweave
Liang Yan is a senior software engineer at Coreweave, specializing in AI infrastructure, heterogeneous architecture acceleration, and distributed machine learning systems in the cloud. He collaborates closely with upstream communities and leading vendors like NVIDIA, AMD and ARM, delivering...
Level 1 | Hall Entrance N10 | Room G
