Scaling Kubernetes securely and reliably with AKS (Microsoft Ignite 2025)

Microsoft Ignite Session ID: BRK120

November 20, 2025

Brendan Burns, Microsoft

Jorge Palma, Principal PM Lead, Microsoft

Durga Rachapudi, Principal Software Engineer, Microsoft

As Kubernetes and AI adoption grow, managing your clusters at scale becomes a strategic imperative. We’ll share practical lessons from operating large clusters with AKS, including how to keep things reliable, efficient & secure. Learn about tools like Azure Kubernetes Fleet Manager, smart scheduling for AI, and ways to simplify multi-cluster ops. Whether you are scaling to thousands of nodes or fine-tuning for performance, you’ll leave with practical tips to improve cluster ops.

TL;DR Summary

Big Picture

Kubernetes is the backbone of AI and modern apps. Microsoft 365 runs hyperscale workloads on AKS, achieving consistency, safety, and agility worldwide. Migration required deep rethinking, but AKS now delivers a reproducible, secure, and efficient platform. Two operating models (Standard vs Automatic) give customers flexibility in how they adopt AKS.

  • Portability, scalability, and reliability make Kubernetes the backbone for startups, enterprises, and open-source AI ecosystems.
  • Azure Kubernetes Service (AKS) = bridge
    Connects the Azure ecosystem (Entra ID, workload identity, security, reliability) with open-source tooling.

Microsoft 365 Case Study

  • Scale
    • 10s of millions of pods
    • 100k+ nodes
    • Thousands of clusters across all regions
  • Why AKS?
    • Standardization → avoids duplicated effort
    • Containers → consistency & isolation
    • AKS → managed control plane, Windows + Linux mix, resilience at hyperscale
  • Fleet principles:
    • Per-service isolation
    • Clusters as failure domains
    • Safe deployment practices (gradual rollouts, health probes, disruption budgets)
    • Standardized configs across regions and clouds
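
The fleet principles above (clusters as failure domains, gradual rollouts gated on health) can be illustrated with a small sketch. This is not how Microsoft 365 or Azure Kubernetes Fleet Manager implements rollouts; it is a minimal illustration in plain Python, and every name in it (`staged_rollout`, `RolloutResult`, the `deploy` and `is_healthy` callbacks, the cluster names) is hypothetical. The idea shown: update one cluster at a time, stage by stage, and stop at the first cluster that fails its health check so the failure stays inside one failure domain.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    updated: List[str]         # clusters that were updated and passed the health gate
    halted_at: Optional[str]   # cluster whose health gate failed, or None on full success


def staged_rollout(
    stages: List[List[str]],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> RolloutResult:
    """Roll a change out stage by stage, one cluster at a time.

    `deploy` and `is_healthy` are hypothetical callbacks supplied by the
    caller; in practice they would wrap whatever deployment tooling and
    health probes the platform uses.
    """
    updated: List[str] = []
    for stage in stages:
        for cluster in stage:
            deploy(cluster)
            if not is_healthy(cluster):
                # Stop immediately: later clusters and stages stay untouched.
                return RolloutResult(updated=updated, halted_at=cluster)
            updated.append(cluster)
    return RolloutResult(updated=updated, halted_at=None)


if __name__ == "__main__":
    # Hypothetical fleet: a canary cluster first, then progressively larger stages.
    stages = [["canary-1"], ["eu-1", "eu-2"], ["us-1", "us-2", "us-3"]]
    touched: List[str] = []
    result = staged_rollout(stages, touched.append, lambda c: c != "eu-2")
    print(result)
    print("clusters touched:", touched)
```

In the usage example the simulated health check fails on `eu-2`, so the rollout halts there and none of the `us-*` clusters are touched. Ordering the stages from smallest to largest is a design choice that limits how many clusters see a bad change before it is caught.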

Key Benefits

  • Efficiency: Common toolchains, reduced learning curve, reproducible ops model
  • Safety: Built-in gradual rollouts, monitoring, compliance enforcement
  • Agility: Easy onboarding, fast region expansion, node pool snapshots, maintenance windows
  • Observability: Prometheus + Grafana + Headlamp for real-time metrics
  • Security: Defense in depth, least privilege, Azure Policy + Gatekeeper
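
To make the "least privilege" and policy-enforcement point concrete, here is a rough sketch of the kind of check an admission policy performs. Real Gatekeeper policies are written in Rego and Azure Policy has its own definition format, so this Python function is only an illustration of the logic, not either product's API. The function name `pod_violations` and the two rules it applies are assumptions chosen for the example; the `securityContext` field names (`privileged`, `allowPrivilegeEscalation`) are the standard Kubernetes container fields.

```python
from typing import Any, Dict, List


def pod_violations(pod_spec: Dict[str, Any]) -> List[str]:
    """Return human-readable violations for a pod spec (hypothetical rules).

    Illustrative rules, assumed for this sketch:
      1. No container may run privileged.
      2. Every container must explicitly set allowPrivilegeEscalation to false.
    """
    problems: List[str] = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        security = container.get("securityContext") or {}
        if security.get("privileged") is True:
            problems.append(f"{name}: privileged containers are not allowed")
        if security.get("allowPrivilegeEscalation") is not False:
            problems.append(f"{name}: allowPrivilegeEscalation must be set to false")
    return problems


if __name__ == "__main__":
    spec = {
        "containers": [
            {"name": "web", "securityContext": {"allowPrivilegeEscalation": False}},
            {"name": "debug", "securityContext": {"privileged": True}},
        ]
    }
    for problem in pod_violations(spec):
        print("DENY:", problem)
```

An admission controller would reject the pod when the returned list is non-empty. Requiring the escalation flag to be set explicitly, rather than accepting an omitted field, is a deliberately strict choice in this sketch.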

Learnings

  • Migration wasn’t “lift-and-shift” → required rethinking Windows services → containers/pods
  • Networking differences between Windows services and container networking required major adaptation
  • Now all workloads run on AKS, freeing engineers to focus on features, not maintenance

What’s New in AKS

  • Two operating models:
    • Standard: Bring-your-own opinions, open-source integrations
    • Automatic: Microsoft’s fully opinionated, paved-path model for production scale

Chapters:

00:00:00 – Kubernetes as foundational infrastructure for AI and modern applications

00:04:21 – M365 platform alignment with Azure and adoption of Kubernetes for efficiency

00:06:15 – AKS providing enterprise-grade scalability with millions of workloads and hybrid clusters

00:14:26 – Transition to Jorge introducing ‘What’s new in AKS’

00:16:06 – Introduction of two AKS operating models: Standard and Automatic

00:23:35 – Leveraging Azure Kubernetes Fleet Manager for Global Capacity and Cluster Placement

00:29:00 – Automated Workload Upgrade Rollouts Across Environments

00:33:36 – Consistent Multi-Cluster Management Across Hybrid and Edge Environments

00:37:38 – Introduction of Local DNS for Improved Latency and Reliability