
KAITO on AKS

I recently added a new playground to cloud-playground-infra: a fully automated KAITO-on-AKS environment.

By the way, check out my other blog post for more on what cloud-playground-infra helps us do.


What is KAITO?

KAITO (Kubernetes AI Toolchain Operator) is a Kubernetes operator that automates AI/ML model inference and tuning workloads. Microsoft ships it on AKS as a managed add-on (the AI toolchain operator). KAITO simplifies running AI/ML inference by:

  • Automatic node provisioning - Spins up GPU/CPU nodes based on model requirements
  • Model lifecycle management - Downloads weights, manages inference server lifecycle
  • Preset models - Built-in support for popular models (Llama, Mistral, Falcon, Phi, etc.)
  • Custom models - Deploy your own models from HuggingFace, Azure Blob Storage, Azure Files, or Azure ML Model Registry
  • OpenAI-compatible API - Provides a standard interface for inference calls
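If you want to try it yourself, the managed add-on can be enabled straight from the Azure CLI. A minimal sketch, assuming an existing cluster and a recent CLI version (the add-on also relies on the OIDC issuer for its node provisioner):

# Enable the AI toolchain operator (KAITO) add-on on an existing cluster
az aks update \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --enable-oidc-issuer \
  --enable-ai-toolchain-operator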

KAITO on AKS vs. Microsoft Foundry

You might wonder: why use KAITO on AKS when Microsoft Foundry already offers thousands of models for inference? The two approaches solve different problems. Some teams benefit from having both available; others should choose carefully based on their requirements and the industry they serve. Microsoft Foundry is an excellent PaaS product: it gives customers a fully managed, secure, and production-ready platform for running LLMs without touching GPU infrastructure. So when does a Kubernetes-native approach like KAITO on AKS make sense?

Side-by-Side Overview

| Consideration | KAITO on AKS | Microsoft Foundry |
| --- | --- | --- |
| Service model | Self-managed on AKS - you run the cluster and model deployments | PaaS - you consume models via APIs |
| Model selection | Full control - any model from HuggingFace, Azure Blob/Files, Azure ML Registry, or private registries | Curated catalog with regional availability limitations (not all models are available in all regions) |
| Compliance | Easier to meet strict regulatory requirements (HIPAA, FedRAMP, etc.) | Depends on the service's compliance certifications |
| Data sovereignty | Models run in your cluster; data never leaves your network | Data is sent to Microsoft-managed endpoints |
| Cost model | Pay for VM compute only, no per-token charges | Pay-per-token or provisioned throughput |
| Customization | Full control over inference parameters, batching, quantization | Limited to provider-exposed options |
| Latency | In-cluster inference, minimal network hops | Network round-trip to an external endpoint |
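A useful way to reason about the cost row is break-even volume: a VM costs the same whether it serves one request or millions, while per-token pricing scales linearly. A quick sketch with purely illustrative prices (both numbers below are made up; substitute your actual VM rate and token price):

# Break-even tokens/month between a fixed-price VM and per-token pricing.
# Both prices are illustrative placeholders, not real quotes.
VM_PER_HOUR=4.00          # hypothetical GPU VM hourly rate (USD)
TOKEN_PRICE_PER_1K=0.002  # hypothetical price per 1K tokens (USD)

awk -v vm="$VM_PER_HOUR" -v tok="$TOKEN_PRICE_PER_1K" 'BEGIN {
  monthly_vm = vm * 730                 # ~730 hours in a month
  breakeven  = monthly_vm / tok * 1000  # tokens/month where the costs match
  printf "VM: $%.2f/month, break-even at %.0f tokens/month\n", monthly_vm, breakeven
}'

Above that volume the VM wins on cost; below it, per-token pricing is cheaper (ignoring idle time, scaling, and operational overhead).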

When KAITO on AKS Makes Sense

Use KAITO on AKS when you need data to remain in your environment, want consistent compute-based costs, have strict compliance requirements, or need deep customization of how models run.

When Microsoft Foundry Makes Sense

Use Foundry when you want a fully managed experience, access to proprietary models like GPT‑4 or Claude, consumption-based pricing, and no GPU or cluster management.


Architecture

KAITO Architecture

KAITO follows the classic Kubernetes CRD/controller pattern. Its major components are:

  • Workspace controller - Reconciles the Workspace custom resource, triggers node provisioning via NodeClaim CRDs, and creates inference/tuning workloads based on model preset configurations
  • Node provisioner controller (gpu-provisioner) - Uses Karpenter-core NodeClaim CRD to integrate with Azure Resource Manager APIs, automatically adding GPU nodes to AKS clusters

Source: Project KAITO
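Once the add-on is running, you can poke at these components directly. A quick look, assuming the managed add-on's default layout (namespace and resource names may vary by version):

# CRDs registered by KAITO
kubectl get crd | grep -i kaito

# Controller pods for the workspace controller and gpu-provisioner
kubectl get pods -A | grep -Ei 'kaito|gpu-provisioner'

# NodeClaims created when a Workspace triggers node provisioning
kubectl get nodeclaims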


Preset Models

AKS supports several open-source preset models that can be deployed with minimal configuration using KAITO. Instead of defining a custom inference template, you simply specify the preset name in your workspace manifest.

| Model Family | Examples |
| --- | --- |
| DeepSeek | deepseek-r1 |
| Falcon | falcon-7b, falcon-40b |
| Gemma 3 | gemma-3-4b, gemma-3-12b, gemma-3-27b |
| Llama 3 | llama-3-8b, llama-3-70b, llama-3.1-8b, llama-3.1-70b, llama-3.1-405b |
| Mistral | mistral-7b, mistral-nemo-12b, mistral-large-2-123b |
| Phi 3 | phi-3-mini, phi-3-medium |
| Phi 4 | phi-4 |
| Qwen | qwen-2.5-7b, qwen-2.5-72b, qwen-2.5-coder-32b |

See the full list: KAITO Supported Models

Note: Preset models require GPU-enabled node pools. The current minimum requirement is Standard_NC24ads_A100_v4. Ensure your Azure subscription has sufficient GPU quota. This POC uses a custom model on CPU instead, as GPU quota was not available.
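Before deploying a preset, it's worth confirming that quota up front. Something like this should show your current A100-family usage against the limit (quota family names vary slightly by region, so grep loosely):

# Check GPU quota for the A100 v4 family in your region
az vm list-usage --location <region> --output table | grep -i "A100"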

An example preset manifest is available at assets/kubernetes/kaito_preset_model.yaml.

apiVersion: kaito.sh/v1beta1
kind: Workspace
metadata:
  name: ${name}
  namespace: ${namespace}
  # annotations:
  #   kaito.sh/enablelb: "True"  # Creates LoadBalancer service automatically (testing only, not for production)

resource:
  instanceType: ${instanceType} # Must be a GPU-enabled instance type. Ensure your subscription has quota.
  labelSelector:
    matchLabels:
      apps: ${appLabel}

inference:
  preset:
    name: ${presetName}
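After your tooling substitutes real values for the ${...} placeholders, deployment is a single apply; the workspace status columns (shown later in this post) tell you when the model is ready:

# Deploy the preset workspace and watch it reconcile
kubectl apply -f assets/kubernetes/kaito_preset_model.yaml
kubectl get workspace -n <namespace> -w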

Custom Models

For more advanced deployments, see the example manifests in assets/kubernetes/:

| Manifest | Use Case |
| --- | --- |
| kaito_custom_cpu_model.yaml | Base template for public HuggingFace models for CPU VMs |
| kaito_option1_hf_private.yaml | Private/gated HuggingFace models with HF_TOKEN |
| kaito_option2_azure_volume.yaml | Models pre-loaded on Azure Blob/Files storage |
| kaito_option3_init_container_blob.yaml | Download from Azure Blob at startup |
| kaito_option4_azureml.yaml | Download from Azure ML Model Registry |

The custom manifests are considerably more involved than the presets. I encourage you to take a look at them in my repo under assets/kubernetes/.
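To give a flavor of the extra moving parts: option 1 needs a Kubernetes secret carrying your Hugging Face token before the workspace can pull a gated model. A minimal sketch, assuming the manifest expects a secret named hf-token (check the manifest for the exact name it references):

# Create the token secret referenced by kaito_option1_hf_private.yaml
kubectl create secret generic hf-token \
  --namespace <namespace> \
  --from-literal=HF_TOKEN=<your-huggingface-token>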


Testing the Model

Infrastructure Overview

The Terraform configuration (terraform/main.tf) provisions:

  • AKS Cluster - Kubernetes 1.34.2 with KAITO enabled
  • Kubernetes Namespace - kaito-custom-cpu-inference for isolating KAITO workloads
  • KAITO Workspace - Custom model deployment (bigscience/bloomz-560m) with kaito.sh/enablelb: "True" annotation for automatic LoadBalancer creation

Note: The kaito.sh/enablelb annotation automatically creates a LoadBalancer service with a public IP. This is for testing only and is NOT recommended for production. For production, use an Ingress Controller to safely expose the service.
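If you'd rather not expose any public IP while testing, port-forwarding is a lightweight alternative (a sketch: the service name matches the workspace name, and in my deployment the service listens on port 80):

# Tunnel the inference service to localhost instead of using a LoadBalancer
kubectl port-forward svc/bloomz-560m-workspace 8080:80 \
  -n kaito-custom-cpu-inference
# Then point your curl commands at http://localhost:8080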

POC Model Details

This POC uses bigscience/bloomz-560m, a small multilingual instruction-tuned model (~2.2GB) from Hugging Face. It runs on CPU for simplicity (no GPU quota required). If you haven't used Hugging Face before: it is essentially the open-source registry for modern AI models, the GitHub of models that developers can download, fine-tune, and run anywhere, including on AKS.

Configure kubectl

After deployment, configure kubectl to connect to your AKS cluster:

kaito@aks:~$ az aks get-credentials --resource-group <resource-group> --name <cluster-name>
Merged "aks-********" as current context in ****\.kube\config

Verify that the AKS cluster context was configured correctly:

kaito@aks:~$ kubectl config get-contexts
CURRENT   NAME                      CLUSTER                   AUTHINFO                                 NAMESPACE
*         aks-********              aks-********              clusterUser_*********_aks-********

kaito@aks:~$ kubectl get namespaces
NAME                         STATUS   AGE
default                      Active   13m
kaito-custom-cpu-inference   Active   11m
kube-node-lease              Active   13m
kube-public                  Active   13m
kube-system                  Active   13m

kaito@aks:~$ kubectl get workspaces -n kaito-custom-cpu-inference
NAME                    INSTANCE           RESOURCEREADY   INFERENCEREADY   JOBSTARTED   WORKSPACESUCCEEDED   AGE
bloomz-560m-workspace   Standard_D16s_v5   True            True                          True                 12m

kaito@aks:~$ kubectl get pods -n kaito-custom-cpu-inference
NAME                                     READY   STATUS    RESTARTS   AGE
bloomz-560m-workspace-78f597c8b8-q5m86   1/1     Running   0          11m

Testing with LoadBalancer

With the kaito.sh/enablelb: "True" annotation set, you can test the inference endpoint directly from your machine using curl:

1. Get the external IP:

# Get the external IP (service name matches workspace name)
kaito@aks:~$ KAITO_IP=$(kubectl get svc bloomz-560m-workspace -n kaito-custom-cpu-inference -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

kaito@aks:~$ echo "KAITO endpoint: http://$KAITO_IP"
KAITO endpoint: http://**.***.***.***

2. Check health:

kaito@aks:~$ curl http://$KAITO_IP/health
{
  "status":"Healthy"
}

3. Sample prompts:

# Question answering
kaito@aks:~$ curl --max-time 60 -X POST http://$KAITO_IP/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Is pineapple on a pizza acceptable?",
    "return_full_text": false,
    "generate_kwargs": {
      "max_new_tokens": 256,
      "do_sample": false
    }
  }'
{
  "Result":" no"
}

kaito@aks:~$ curl --max-time 60 -X POST http://$KAITO_IP/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Is a tomato a fruit or a vegetable?",
    "return_full_text": false,
    "generate_kwargs": {
      "max_new_tokens": 256,
      "do_sample": false
    }
  }'
{
  "Result":" vegetable"
}

kaito@aks:~$ curl --max-time 60 -X POST http://$KAITO_IP/chat \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Answer briefly: What is cloud computing?",
    "return_full_text": false,
    "generate_kwargs": {
      "max_new_tokens": 256,
      "do_sample": false
    }
  }'
{
  "Result":" Cloud computing is a service that allows users to access data and services from a central location."
}
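Since the payload is identical except for the prompt, a tiny shell helper makes experimenting less repetitive. A sketch assuming jq is installed, reusing the /chat request shape from above:

# ask "<prompt>": send a prompt to the KAITO /chat endpoint, print the result
ask() {
  curl -s --max-time 60 -X POST "http://$KAITO_IP/chat" \
    -H "Content-Type: application/json" \
    -d "{
      \"prompt\": \"$1\",
      \"return_full_text\": false,
      \"generate_kwargs\": {\"max_new_tokens\": 256, \"do_sample\": false}
    }" | jq -r '.Result'
}

ask "Answer briefly: What is cloud computing?"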

Final Thoughts

So now you’re asking: which should I use, KAITO on AKS or Microsoft Foundry? The answer: it depends.

KAITO on AKS isn’t meant to replace Microsoft Foundry, and it shouldn’t. Foundry is the right tool when you want a fully managed platform, access to premium proprietary models like GPT‑5 or Claude, simple pay‑as‑you‑go pricing, and zero responsibility for GPUs, cluster operations, or infrastructure overhead.

KAITO on AKS, on the other hand, is ideal when your data must stay inside your own environment, when you prefer predictable compute-only costs, when compliance is non‑negotiable, or when you need full control over how your models are configured and executed.