Inside Bria AI’s Cloud Strategy: Scaling Visual GenAI with Azure ML

Company Snapshot

  • Industry: Artificial Intelligence / Machine Learning
  • Region: Global
  • Cloud Vendor: Azure
  • Solutions Used: Azure Machine Learning, NDm_A100_v4 GPUs, Blob Storage, Azure Log Analytics, Azure Monitor, InfiniBand, Hugging Face Accelerate, PyTorch DDP, FSDP, Terraform, GitHub, Azure Container Registry


The Challenge

Bria AI, a pioneer in Visual Generative AI, was scaling fast, and with that came infrastructure pains. As their models grew in complexity, so did the need for more compute, faster training, and lower costs. Their existing infrastructure wasn’t cutting it anymore. Bria needed a new environment that could support large-scale GPU training without breaking the bank.

  • Training Costs: A100 GPU prices were skyrocketing. Bria needed a more cost-efficient alternative for large-scale model training.
  • Scalability & Performance: Training across hundreds of GPUs demanded ultra-low latency networking and smart workload balancing to keep GPU utilization >95%.
  • Data Bottlenecks: Streaming massive datasets from storage to compute clusters without starving the GPUs required highly efficient sharding and precomputed latents.
  • Data Movement: Bria needed to move data to Azure Blob and build a continuous training pipeline—without massive code rewrites or performance degradation.


What Bria Needed

  • Cost Optimization: Reduce cloud spend without compromising performance.
  • Expert Guidance: Tap into Azure ML and cloud-native experts to accelerate development.
  • Reliable Support: 24/7 hands-on help from engineers who understand their stack—and their business goals.

The Solution
Bria AI partnered with 2bcloud to migrate to Azure ML and build a scalable, high-performance ML training environment. Together, they fine-tuned GPU utilization, streamlined data workflows, and established cloud cost controls.


Key Elements of the Solution

  • Cloud Cost Optimization: Applied best practices to lower cloud spend while maintaining performance.
  • Seamless Scalability: Migrated to a distributed training architecture powered by Azure ML.
  • Expert Access: Direct collaboration with 2bcloud’s Azure-certified architects and engineers.
  • 24/7 Support: Around-the-clock troubleshooting and optimization by 2bcloud’s engineering team.


Architecture Overview

Kicking Off with a Minimum Viable Product (MVP)

Bria started small. The first step was to set up a secure, scalable landing zone in Azure. This MVP ensured governance and flexibility for scaling later.

Bria’s engineers worked hand-in-hand with 2bcloud’s DevOps team to containerize their models and orchestrate training jobs using Azure ML’s compute environment. The goal was to re-use their existing containers, minimize rework, and accelerate deployment.


[Figure: Bria architecture diagram]

Core Tech Stack

Model Training & Inference

Azure ML handled orchestration using custom containerized environments for both training and inference.
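
A job of this shape can be described declaratively with an Azure ML command-job YAML spec (v2 schema). The environment, compute, and script names below are illustrative placeholders, not Bria’s actual configuration:

```yaml
# Azure ML command job (v2 YAML schema); names are placeholders
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
code: ./src
command: accelerate launch train.py --config configs/train.yaml
environment: azureml:custom-training-env:1   # custom container image
compute: azureml:a100-cluster
resources:
  instance_count: 16                         # ND-series A100 nodes
distribution:
  type: pytorch
  process_count_per_instance: 8              # one process per A100 GPU
```

Submitted with `az ml job create -f job.yml`, this reuses an existing container image, which is how existing training code can move to Azure ML with minimal rework.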


Compute Infrastructure

  • Training at scale: Up to 128 NVIDIA A100 GPUs in production
  • NDm_A100_v4 Instances: 1.6 TB/s interconnect, 200 GB/s InfiniBand—enabling high-speed, low-latency multi-GPU training


Data Streaming & Management

  • Blob Storage as the primary data layer: This is optimized for handling large datasets, ensuring minimal latency in GPU processing.
  • FFM-based streaming from Blob Storage reduces bottlenecks.
  • Sharding strategy: Instead of downloading entire datasets, a window-based approach ensures each training step loads the necessary data, preventing idle GPUs.
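
The window-based approach above can be sketched in plain Python: each training step maps to a small window of upcoming shard files, so workers prefetch only what the next few steps need instead of the whole dataset. Shard naming and window size here are illustrative:

```python
def shard_window(step: int, shards: list[str], window: int = 4) -> list[str]:
    """Return the window of shard files needed around a training step.

    Each step consumes one shard; prefetching `window` shards ahead keeps
    the GPUs fed without downloading the entire dataset up front.
    """
    start = step % len(shards)
    # Wrap around the shard list so long runs cycle through the dataset.
    return [shards[(start + i) % len(shards)] for i in range(window)]

shards = [f"data/shard-{i:05d}.parquet" for i in range(10)]
print(shard_window(0, shards))  # first four shards
print(shard_window(9, shards))  # wraps around to the start of the list
```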


GPU Optimization Techniques

  • Task allocation per GPU: Instead of all GPUs iterating over the entire dataset, each GPU processes a specific subset, improving efficiency.
  • File sharding: Large 1.5GB parquet files reduce network overhead compared to numerous small file downloads.
  • Precomputed latents: Text-to-vector conversions are computed ahead of time, reducing real-time processing load on the GPUs.
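
The per-GPU task allocation can be sketched with a simple round-robin assignment: each rank (GPU) gets a disjoint subset of the shard files, so no two GPUs duplicate work. This is a minimal illustration, not Bria’s actual loader:

```python
def shards_for_rank(shards: list[str], rank: int, world_size: int) -> list[str]:
    """Assign each GPU (rank) a disjoint subset of shard files.

    Round-robin striding keeps per-rank workloads balanced even when the
    shard count is not divisible by the number of GPUs.
    """
    return shards[rank::world_size]

shards = [f"shard-{i:03d}.parquet" for i in range(10)]
# With 4 GPUs, ranks 0 and 1 get 3 shards each; ranks 2 and 3 get 2.
for rank in range(4):
    print(rank, shards_for_rank(shards, rank, 4))
```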


Distributed Training & Parallelization

  • InfiniBand layer enables high-speed GPU-to-GPU communication.
  • The Hugging Face Accelerate library integrates with PyTorch’s Distributed Data-Parallel (DDP) framework, ensuring efficient multi-GPU training synchronization.
  • FSDP (Fully Sharded Data Parallelism) enhances training efficiency by dynamically splitting model parameters across GPUs, reducing memory overhead.


Performance Optimization Metrics

  • GPU Utilization: Maintained > 95% efficiency through optimized data streaming and parallelized training.
  • Iterations per second: A progressive resolution-refinement approach balances performance, cost, and model convergence.
  • Cost Predictability: Deterministic training duration allowed better budget planning.
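
Cost predictability follows directly from a stable iteration rate: when iterations/sec is deterministic, wall-clock time (and therefore spend) is a simple multiplication. The GPU-hour price below is a placeholder for illustration, not an Azure quote:

```python
def training_cost(total_steps: int, steps_per_sec: float,
                  n_gpus: int, usd_per_gpu_hour: float) -> float:
    """Estimate the cost of a fixed-length training run.

    With a deterministic step rate, duration is known up front,
    so the budget can be computed before the run starts.
    """
    hours = total_steps / steps_per_sec / 3600
    return hours * n_gpus * usd_per_gpu_hour

# 500k steps at 2 steps/sec on 128 GPUs at a hypothetical $3/GPU-hour
print(f"${training_cost(500_000, 2.0, 128, 3.0):,.0f}")
```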


Overcoming Technical Hurdles

Even with strong GPU utilization, Bria and 2bcloud identified two critical bottlenecks:

  • InfiniBand throughput
  • Storage account limits


Optimization Steps

  • Ran NCCL performance tests and profiling
  • Monitored key metrics: GPU utilization (>95%), memory usage (>70%), energy (>1000 kJ), InfiniBand (>4 GB/s)
  • Upgraded to Torch 2.4.1 with CUDA 12.4 in PyTorch container
  • Enabled MSCCL (Microsoft Collective Communication Library) for network acceleration
  • Increased storage throughput to 120 GB/sec
  • Implemented automated checkpointing for fail-safe training resumption
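
Automated checkpointing of the kind listed above can be sketched with nothing beyond the standard library: persist training progress so an interrupted run resumes where it left off (real checkpoints would also save model and optimizer state). File name and interval are illustrative:

```python
import json
import os

CKPT = "checkpoint.json"

def save_checkpoint(step: int, path: str = CKPT) -> None:
    """Persist training progress atomically via write-then-rename."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, path)  # atomic rename: no half-written checkpoints

def load_checkpoint(path: str = CKPT) -> int:
    """Return the step to resume from, or 0 for a fresh run."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["step"]

if os.path.exists(CKPT):
    os.remove(CKPT)  # start fresh for this demo

start = load_checkpoint()
for step in range(start, start + 100):
    if step % 50 == 0:  # checkpoint interval
        save_checkpoint(step)
print("resumes at", load_checkpoint())
```

The atomic write-then-rename matters on shared storage: a crash mid-write must never leave a corrupt checkpoint as the only copy.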



Scaling to Production

After successful MVP validation, Bria scaled production to 16 ND96amsr_A100_v4 machines in Azure’s Italy North region. This deployment featured:

  • Full DevOps automation with Terraform and networking integration
  • Real-time monitoring (Azure Monitor, Application Insights, Alarms)
  • Security-first infrastructure setup

[Image: Bria in action]


Business Outcomes

  • ⚡ Optimized Workflows: Smart data streaming cut iteration times and improved throughput.
  • 🧠 Maxed-Out GPUs: 95%+ utilization ensured peak hardware efficiency.
  • 🚀 Faster Time to Market: Shorter training/inference cycles accelerated innovation.
  • 📈 Scalable Infra: Auto-scaling and observability ensured growth without performance bottlenecks.

Why It Matters

“Bria AI’s transition to Azure ML was a strategic leap toward building a more scalable, efficient, and cost-effective AI infrastructure. 2bcloud’s deep Azure expertise helped us scale faster and smarter. They’ve been a true partner in turning our GenAI vision into a production-ready platform.”

– Michael (Misha) Feinstein, CTO, Bria AI


Looking Ahead

With a modern ML infrastructure on Azure and 2bcloud by their side, Bria AI is set up to move faster, scale smarter, and innovate without limits.

They can now:

  • Launch experiments quickly and iterate in hours, not days
  • Scale globally with low-latency, region-aware deployments
  • Optimize costs and performance with hands-on cloud guidance
  • Push updates to production faster and deliver more value
  • Stay focused on AI innovation while 2bcloud handles the cloud complexity

Bria isn’t just ready to scale—they’re ready to lead.


Contact us

Take the first step toward mastering your cloud

Let’s discuss how we can help you make the most of your cloud.