CNCF-hosted Co-located Events Europe 2025 takes place on 1 April. The event is held in person at Excel London in London, England.

The Sched app allows you to build your schedule, but it is not a substitute for your event registration. You must be registered for KubeCon + CloudNativeCon Europe 2025 and have an All-Access pass to participate in the sessions.

The schedule is subject to change.
Tuesday April 1, 2025 14:05 - 14:30 BST
Managing large GPU clusters dedicated to cloud native AI workloads poses new challenges. The workload mix is diverse, and GPUs must be effectively utilized and dynamically shared across multiple teams. Furthermore, GPUs are subject to a variety of performance degradations and faults that can severely impact multi-GPU jobs, so they require continuous monitoring and enhanced diagnostics. Cloud native tools such as Kubeflow, Kueue, and others are the building blocks for the large-scale GPU clusters that teams across IBM Research use for training, tuning, and inference jobs. In this talk, IBM Research will share and demonstrate lessons learnt in configuring large-scale GPU clusters and in developing Kubernetes-native automation that runs health checks on GPUs and reports their health. Finally, they will show how diagnostics enable both the dynamic adjustment of quotas to account for faulty GPUs and the automatic steering of new and existing workloads away from nodes with faulty GPUs.
Speakers
Claudia Misale

Staff Research Scientist, IBM Research
Claudia Misale is a Staff Research Scientist in the Hybrid Cloud Infrastructure Software group at IBM T.J. Watson Research Center (NY). Her research is focused on Kubernetes and targets monitoring, observability and scheduling for HPC and AI training workloads. She is mainly interested...
David Grove

Distinguished Research Scientist, IBM Research
David Grove is a Distinguished Research Scientist at IBM T.J. Watson, NY, USA. He has been a software systems researcher at IBM since 1998, specializing in programming language implementation and scalable runtime systems. His current research focuses on cloud-related technologies...
Level 1 | Hall Entrance N10 | Room G
