Microsoft Ignite Session ID: BRK120
November 20, 2025
Brendan Burns, Microsoft
Jorge Palma, Principal PM Lead, Microsoft
Durga Rachapudi, Principal Software Engineer, Microsoft
As Kubernetes and AI adoption grow, managing your clusters at scale becomes a strategic imperative. We’ll share practical lessons from operating large clusters with AKS, including how to keep things reliable, efficient & secure. Learn about tools like Azure Kubernetes Fleet Manager, smart scheduling for AI, and ways to simplify multi-cluster ops. Whether you are scaling to thousands of nodes or fine-tuning for performance, you’ll leave with practical tips to improve cluster ops.
TL;DR Summary
Big Picture
Kubernetes is the backbone of AI and modern apps. Microsoft 365 runs hyperscale workloads on AKS, achieving consistency, safety, and agility worldwide. Migration required deep rethinking, but AKS now delivers a reproducible, secure, and efficient platform. Two operating models (Standard vs Automatic) give customers flexibility in how they adopt AKS.
- Portability, scalability, and reliability make Kubernetes the backbone for startups, enterprises, and open-source AI ecosystems.
- Azure Kubernetes Service (AKS) = bridge: connects the Azure ecosystem (Entra ID, workload identity, security, reliability) with open-source tooling.
Microsoft 365 Case Study
- Scale
- Tens of millions of pods
- 100k+ nodes
- Thousands of clusters across all regions
- Why AKS?
- Standardization → avoids duplicated effort
- Containers → consistency & isolation
- AKS → managed control plane, Windows + Linux mix, resilience at hyperscale
- Fleet principles:
- Per-service isolation
- Clusters as failure domains
- Safe deployment practices (gradual rollouts, health probes, disruption budgets)
- Standardized configs across regions/clouds
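The fleet principles above (clusters as failure domains, gradual rollouts gated on health) can be illustrated with a minimal sketch. This is not how Microsoft 365 or AKS implements rollouts. The function names `plan_waves` and `rollout`, the `deploy` and `is_healthy` callbacks, and the geometric wave sizing are all hypothetical choices made for illustration only.

```python
from typing import Callable, List


def plan_waves(clusters: List[str], first_wave: int = 1, growth: int = 2) -> List[List[str]]:
    """Split clusters into waves whose sizes grow geometrically (1, 2, 4, ...).

    Hypothetical helper: a small first wave limits the blast radius of a bad change.
    """
    if first_wave < 1 or growth < 1:
        raise ValueError("first_wave and growth must be >= 1")
    waves: List[List[str]] = []
    start, size = 0, first_wave
    while start < len(clusters):
        waves.append(clusters[start:start + size])
        start += size
        size *= growth
    return waves


def rollout(
    waves: List[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Deploy wave by wave and halt if any cluster in a finished wave is unhealthy.

    Returns the clusters that received the change. Each cluster is treated as
    an independent failure domain, so later waves stay untouched after a halt.
    """
    updated: List[str] = []
    for wave in waves:
        for cluster in wave:
            deploy(cluster)
            updated.append(cluster)
        # Health gate (assumed to wrap probes/metrics): stop before the next wave.
        if not all(is_healthy(c) for c in wave):
            break
    return updated


if __name__ == "__main__":
    clusters = [f"c{i}" for i in range(1, 8)]
    waves = plan_waves(clusters)
    print(waves)  # [['c1'], ['c2', 'c3'], ['c4', 'c5', 'c6', 'c7']]
    # Simulate c3 failing its health check: the third wave is never started.
    print(rollout(waves, deploy=lambda c: None, is_healthy=lambda c: c != "c3"))
    # ['c1', 'c2', 'c3']
```

In this sketch a failed health check stops the rollout after the current wave, so most clusters keep running the previous version while the problem is investigated.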
Key Benefits
- Efficiency: Common toolchains, reduced learning curve, reproducible ops model
- Safety: Built-in gradual rollouts, monitoring, compliance enforcement
- Agility: Easy onboarding, fast region expansion, node pool snapshots, maintenance windows
- Observability: Prometheus + Grafana + Headlamp for real-time metrics
- Security: Defense in depth, least privilege, Azure Policy + Gatekeeper
Learnings
- Migration wasn’t “lift-and-shift”: it required rethinking Windows services as containers/pods
- Networking differences between Windows services and container networking required major adaptation
- Now all workloads run on AKS, freeing engineers to focus on features, not maintenance
What’s New in AKS
- Two operating models:
- Standard: Bring-your-own opinions, open-source integrations
- Automatic: Microsoft’s fully opinionated, paved-path model for production scale
Chapters:
00:00:00 – Kubernetes as foundational infrastructure for AI and modern applications
00:04:21 – M365 platform alignment with Azure and adoption of Kubernetes for efficiency
00:06:15 – AKS providing enterprise-grade scalability with millions of workloads and hybrid clusters
00:14:26 – Transition to Jorge introducing ‘What’s new in AKS’
00:16:06 – Introduction of two AKS operating models: Standard and Automatic
00:23:35 – Leveraging Azure Kubernetes Fleet Manager for Global Capacity and Cluster Placement
00:29:00 – Automated Workload Upgrade Rollouts Across Environments
00:33:36 – Consistent Multi-Cluster Management Across Hybrid and Edge Environments
00:37:38 – Introduction of Local DNS for Improved Latency and Reliability