The CNCF-hosted Co-located Events Europe 2025 take place on 1 April. This event is happening in person at Excel London in London, England.

The Sched app allows you to build your schedule, but is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025, and have an All-Access pass in order to participate in the sessions.


The schedule is subject to change.
Tuesday April 1, 2025 13:30 - 13:55 BST
Optimizing the execution time of AI training and inference is crucial in the era of LLMs. These workloads often exchange huge amounts of data between Pods, making network throughput a bottleneck.

Data centers are organized hierarchically, with multiple layers such as racks or blocks. However, leveraging this fact in vanilla Kubernetes is challenging because the scheduler needs to be aware of both the workloads and the cluster topology. Kueue, as a Job-level scheduler, is already workload-aware. To tackle the second challenge, we propose a convention by which cloud providers or cluster administrators label nodes with their position in the topology. Leveraging this information, Kueue optimizes Pod placement within a cluster, ordering Pods by indices to enhance the performance of AI frameworks using NCCL.
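
To make the idea concrete, here is a minimal sketch of topology-aware placement. It is not Kueue's implementation: the label keys, the `free_slots` field, the node data, and the first-fit strategy are all assumptions made for illustration. The sketch looks for the narrowest topology domain (rack, then block, then the whole cluster) that can hold every Pod, then assigns Pods in index order so that consecutive indices land on the same or neighboring nodes.

```python
from collections import defaultdict

# Hypothetical label keys, ordered from the widest to the narrowest level.
# Real clusters would use whatever keys the provider or administrator sets.
LEVELS = ["example.com/topology-block", "example.com/topology-rack"]


def group_by_domain(nodes, depth):
    """Group nodes by the values of their first `depth` topology labels."""
    domains = defaultdict(list)
    for node in nodes:
        key = tuple(node["labels"][level] for level in LEVELS[:depth])
        domains[key].append(node)
    return domains


def place_pods(nodes, pod_count):
    """Return {pod_index: node_name}, or None if the Pods do not fit.

    First-fit sketch: try the narrowest level first, then widen.
    depth == 0 treats the whole cluster as a single domain.
    """
    for depth in range(len(LEVELS), -1, -1):
        domains = group_by_domain(nodes, depth)
        for key in sorted(domains):
            members = domains[key]
            if sum(n["free_slots"] for n in members) < pod_count:
                continue
            # Order nodes by their full topology path so that consecutive
            # Pod indices end up close to each other in the hierarchy.
            members = sorted(
                members,
                key=lambda n: ([n["labels"][lv] for lv in LEVELS], n["name"]),
            )
            assignment = {}
            index = 0
            for n in members:
                for _ in range(n["free_slots"]):
                    if index == pod_count:
                        break
                    assignment[index] = n["name"]
                    index += 1
            return assignment
    return None


def make_node(name, block, rack, free_slots):
    """Build an illustrative node record (not a Kubernetes API object)."""
    return {
        "name": name,
        "labels": {LEVELS[0]: block, LEVELS[1]: rack},
        "free_slots": free_slots,
    }


# Illustrative cluster: two blocks, three racks, four nodes.
example_nodes = [
    make_node("n1", "b1", "r1", 2),
    make_node("n2", "b1", "r1", 2),
    make_node("n3", "b1", "r2", 4),
    make_node("n4", "b2", "r3", 2),
]
```

With this data, a 4-Pod workload fits inside a single rack, while a 6-Pod workload has to span two racks of the same block. A production scheduler would also need to handle resource requests, preemption, and a tighter fitting strategy, none of which this sketch covers.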

In this session, we introduce the key concepts and machinery behind Topology-Aware Scheduling (TAS) in Kueue. We also compare TAS with alternatives and present results showing how it improves the execution time of AI workloads.
Speakers
Michał Woźniak

Software Engineer, Google
Michał is a software engineer with a background in computer science, a PhD in computational biology, and 5+ years of professional experience. In his current role, he focuses on enhancing support for batch workloads in the Kubernetes ecosystem. Outside of work he enjoys playing...
Yuki Iwai

Software Engineer, CyberAgent, Inc.
Yuki is a Software Engineer at CyberAgent, Inc. He works on the internal platform for machine-learning applications and high-performance computing. He is currently a Technical Lead for Kubeflow WG AutoML / Training. He is also an active member of Kubernetes WG Batch, Job API reviewer...
Level 1 | Hall Entrance N10 | Room G
