
I Tested MIG in Real-Life Azure – Did It Feel Like a Stuffed Cubicle?

September 15, 2025
Written by Evgeniy Golovashev

TL;DR 

I carved one Azure H100 into virtual “cubicles” using MIG (Multi-Instance GPU), compared it to an A100, ran Triton inference workloads, and captured both latency and cost.  

The verdict – The H100 with MIG delivers better latency and consistency, while the A100 is more cost-effective at scale, depending on your workload. 

Evgeniy Golovashev, Cloud Solution Architect @ 2bcloud 

Co‑Working Meets the H100 

Imagine walking into a co‑working space where every desk is a cubicle, and yet the air‑conditioning is shared. You’re not cramped, just sharing cooled air and strict time slots. That’s what using MIG on GPUs feels like: each model gets its own cubicle, but the hardware stays communal. 

That analogy made me curious: can an air‑conditioned cubicle setup really stand up to real-world throughput and costs? To find out, I took MIG for a spin on Azure. 

How I Set Up the H100 and A100 Testbeds in Azure 

I spun up a Standard_NC40ads_H100_v5 VM (H100, 94 GB per GPU, MIG-enabled) on Ubuntu-HPC with drivers pre-installed, plugged it into its own VNet, and gave it an API-friendly public IP. 

A simple script automated the setup; Azure support just had to unlock the quota in East and West US. 
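
If you're curious what that script boils down to, here's a minimal Python sketch wrapping the Azure CLI. The resource names are placeholders, and the Ubuntu-HPC image URN is my assumption; verify it with az vm image list before running:

```python
import subprocess

# Placeholder resource names; the image URN is my assumption for the
# Ubuntu-HPC image with NVIDIA drivers preinstalled. Verify it with:
#   az vm image list --publisher microsoft-dsvm --offer ubuntu-hpc --all
subprocess.run([
    "az", "vm", "create",
    "--resource-group", "mig-test-rg",
    "--name", "h100-mig-vm",
    "--size", "Standard_NC40ads_H100_v5",
    "--image", "microsoft-dsvm:ubuntu-hpc:2204:latest",
    "--vnet-name", "mig-test-vnet",
    "--public-ip-sku", "Standard",
    "--admin-username", "azureuser",
    "--generate-ssh-keys",
], check=True)
```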

From there, enabling MIG was just a matter of turning it on, choosing a profile, and assigning slices to models via Triton. 
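
For reference, here's roughly what "turning it on" looks like, scripted. The 3g.47gb profile name is an assumption for the 94 GB H100; check what nvidia-smi mig -lgip reports on your VM:

```python
import subprocess

def run(cmd: str) -> None:
    # Echo and execute a shell command, failing fast on errors.
    print(f"$ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# Enable MIG mode on GPU 0 (needs root; may require a GPU reset or reboot).
run("nvidia-smi -i 0 -mig 1")

# List the GPU instance profiles this card actually supports.
run("nvidia-smi mig -lgip")

# Create two GPU instances plus default compute instances in one shot.
# "3g.47gb" is an assumed profile name for the 94 GB H100; substitute
# whatever -lgip reported.
run("nvidia-smi mig -cgi 3g.47gb,3g.47gb -C")

# Confirm the slices exist; their MIG UUIDs are what you point Triton at
# (e.g. via CUDA_VISIBLE_DEVICES or your container runtime).
run("nvidia-smi -L")
```

Each slice then shows up as its own CUDA device, which is what lets Triton treat them as separate GPUs.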

For comparison, I ran the same on a Standard_NC24ads_A100_v4 (A100, ~80 GB, also MIG‑capable but with smaller profiles).  

The H100 felt like a heavy tank: dense, swift, low‑latency.  

The A100 was more of a workhorse: steady, predictable, and budget-savvy. 

Test Methodology with Punch 

To avoid Azure hiccups, I ran reproducible load tests: each configuration (H100 and A100, both MIG enabled and disabled) got at least three runs. 

Every run used constant input and a constant request count (about 100), covering both single- and dual-instance scenarios. 

I tracked average/median/min/max latency, success rate, and failure count. 
On the economics side: VM type, hourly price, calculated requests/hour, and calculated cost per 100 requests. 
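
A minimal sketch of the measurement loop, assuming a Triton endpoint speaking the KServe v2 HTTP protocol; the model name and address are placeholders:

```python
import statistics
import time

import requests

# Placeholder endpoint and model name; the body follows Triton's
# KServe v2 HTTP inference protocol.
URL = "http://<vm-public-ip>:8000/v2/models/my_model/infer"
PAYLOAD = {
    "inputs": [
        {"name": "INPUT0", "shape": [1, 8], "datatype": "FP32",
         "data": [0.0] * 8}
    ]
}

latencies, failures = [], 0
for _ in range(100):  # constant request count
    start = time.perf_counter()
    try:
        resp = requests.post(URL, json=PAYLOAD, timeout=120)
        resp.raise_for_status()
        latencies.append((time.perf_counter() - start) * 1000)  # ms
    except requests.RequestException:
        failures += 1

if latencies:
    print(f"avg    {statistics.mean(latencies):8.2f} ms")
    print(f"median {statistics.median(latencies):8.2f} ms")
    print(f"min    {min(latencies):8.2f} ms / max {max(latencies):8.2f} ms")
print(f"success {len(latencies)}/{len(latencies) + failures}")
```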

That gave me both raw performance and a lens on business efficiency under real-world conditions. 

Test Results 

Test                              Avg Latency (ms)  Median (ms)  VM               Calculated Cost per 100 Req ($)
H100 MIG (2 instances parallel)   14513.84          11465.17     NC40ads H100 v5  1.12
H100 MIG disabled                 6649.42           6659.76      NC40ads H100 v5  1.29
A100 MIG (2 instances parallel)   23998.51          22716.99     NC24ads A100 v4  1.51
A100 MIG disabled                 11078.25          11047.51     NC24ads A100 v4  1.47

What I Learned 

  • H100 with MIG wins on latency and consistency. 
    Even though slicing adds overhead, the H100's raw power crushes latency: about 6.6 s average per request with MIG disabled and 14.5 s across two parallel MIG slices, at roughly $1.12–$1.29 per 100 requests on demand. Opt for a 3-year reserved instance and we're looking at ~$0.61 per 100 req in East US. 
  • A100 is leaner, slower, but cheaper in the long run. 
    At 11–24 s average per request, cost per 100 requests dives from ~$1.47 on demand to ~$0.56 in West US with 3-year reserved pricing (the arithmetic is sketched below). 
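
The arithmetic behind those per-100-request figures is simple. A sketch, assuming one in-flight request per slice and a placeholder hourly price; plug in the live Azure rate for your region and commitment term:

```python
HOURLY_PRICE_USD = 7.00  # placeholder; look up the live NC40ads H100 v5 rate

def cost_per_100(hourly_price: float, avg_latency_ms: float,
                 parallel_instances: int = 1) -> float:
    # One request in flight per slice: 3.6M ms per hour divided by
    # per-request latency, times the number of parallel slices.
    requests_per_hour = 3_600_000 / avg_latency_ms * parallel_instances
    return hourly_price / requests_per_hour * 100

print(f"MIG disabled:   ${cost_per_100(HOURLY_PRICE_USD, 6649.42):.2f} per 100 req")
print(f"Two MIG slices: ${cost_per_100(HOURLY_PRICE_USD, 14513.84, 2):.2f} per 100 req")
```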

When Does MIG Make Sense? 

  • Multi‑tenant setups: multiple models or teams sharing a GPU – isolated performance, shared hardware. 
  • R&D sandboxes: parallel testing of different versions without needing multiple GPUs. 
  • Heterogeneous workloads: tiny model? Allocate a small profile. Big model? Give it a chunk (see the sketch after this list). 
  • ROI maximization: one pricey GPU, many isolated accelerators. 
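
On the heterogeneous point, a single command carves mixed slices. A sketch, assuming a 4g/2g/1g split on the 94 GB H100; exact profile names vary by card, so list yours first:

```python
import subprocess

# Mixed profiles on one H100: a big slice for the heavy model, smaller
# slices for the light ones. The 4g/2g/1g names are assumptions for the
# 94 GB card; list the real options first with "nvidia-smi mig -lgip".
subprocess.run("nvidia-smi mig -cgi 4g.47gb,2g.24gb,1g.12gb -C",
               shell=True, check=True)
```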

Business Scenarios 

  • Finance: run parallel risk models, fraud detection, credit scoring—all isolated. 
  • E‑commerce: multiple recommendation models, real‑time personalization under one hood. 
  • SaaS AI platforms: GPU‑as‑a‑Service with tenant isolation—MIG is your friend here. 

Hopper Is Fast, But You Can Still Squeeze More from Ampere  

The performance difference between Ampere (A100) and Hopper (H100) lines up with expectations: the H100 roughly halves inference time. But if you're stuck on A100, there are ways to optimize: 

  • Quantization or pruning—shrink models, speed inference (see the sketch after this list). 
  • ONNX Runtime or Olive—specialize your models for the hardware. 
  • Kernel‑level tuning—nail the bottlenecks if you’re pushing every last cycle. 
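
As a taste of the first option, here's a minimal dynamic-quantization sketch using ONNX Runtime's built-in tooling; the model file names are placeholders:

```python
# Dynamic INT8 quantization with ONNX Runtime's built-in tooling.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",        # placeholder: your exported model
    model_output="model.int8.onnx",  # smaller, faster artifact for serving
    weight_type=QuantType.QInt8,     # quantize weights to signed 8-bit
)
```

Weights shrink roughly 4x and inference often speeds up, but measure on your own traffic before committing.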

MIG on Azure Can Deliver Practical Isolation and Cost Control 

Setting up MIG on Azure felt more like getting your own cubicle in a comfy co‑working space than being herded into a crowded open‑plan bullpen.  

The H100 brought speed and predictability; the A100 brought thriftiness.  

Choose based on your immediate needs – latency or budget – and MIG becomes less of a techy buzzphrase and more of a practical tool. 

____________________________________

— Looking to explore further?

Check out the NVIDIA MIG User Guide or Azure's AKS multi-instance GPU node pools.

— Need help running your own tests?  

Evgeniy Golovashev, 2bcloud 
Solution Architect 

[email protected]