Demystifying SemiAnalysis ClusterMAX™ and Achieving Platinum-Rated AI Infrastructure

Introduction

Enterprises face significant challenges when adopting GPUs from Neoclouds and NVIDIA Cloud Partners (NCPs). A primary concern is the lack of transparency around GPU quality and the complexity of qualifying different providers, which has created a need for a standardized evaluation system. One such system, SemiAnalysis' ClusterMAX™, has emerged as a promising rating framework. This blog post provides an overview of the ClusterMAX™ rating system and its implications for achieving high-quality AI infrastructure in Neocloud environments.

The ClusterMAX™ Solution

ClusterMAX™, developed by SemiAnalysis, is a comprehensive rating system designed to evaluate GPU Neoclouds. It aims to address the problem of inconsistent GPU quality and help enterprises make informed decisions. The system covers approximately 90% of the GPU market by volume and comprises over 50 requirements across nine categories, providing a clear framework for understanding the capabilities and reliability of different Neocloud offerings. ClusterMAX™ still has room to improve, e.g., covering more Neoclouds, weighing inference alongside training, and being more transparent about how scoring is conducted. Nevertheless, these drawbacks are minor in the grand scheme of things.

ClusterMAX™ Rating Tiers and Categories

The ClusterMAX™ rating system has five tiers: Underperform, Bronze, Silver, Gold, and Platinum, with Platinum being the highest. 

“Anyone can cobble together open-source components to hit Underperform, but moving beyond Bronze takes months of engineering effort—and Platinum can take years,” says Amar Kapadia, CEO & Co-founder of aarna.ml.

Achieving even a Silver or Gold rating requires substantial effort and a robust engineering foundation. The nine categories evaluated by ClusterMAX™ are:

  1. Security: This category assesses the security measures implemented by GPU cloud providers. Key security requirements include isolated Ethernet networks (with VLAN/VRF), InfiniBand isolation mechanisms (PKEYs), hard isolation (especially per-tenant Kubernetes clusters), audit logging, data encryption, relevant certifications (SOC 2, GDPR), security checklists for tenants, and DPU-based tenant isolation. Many Neoclouds are notably weak in areas such as InfiniBand isolation and tend to fall back on shared Kubernetes clusters (a simple isolation-audit sketch follows this list).
  2. Lifecycle and Technical Expertise: This category assesses the level of technical expertise of GPU cloud providers. It examines how providers handle the lifecycle of their services, from initial setup to ongoing maintenance, and the depth of technical knowledge and support available to users.
  3. SLURM and Kubernetes: This evaluates the integration and management of SLURM and Kubernetes, which are essential tools for orchestrating and managing GPU workloads. It considers how well providers support these platforms, including ease of deployment, scalability, and compatibility with various GPU configurations. A key objective is to minimize the "time to value" for end customers by providing robust platform features and AI/ML tools. Specific requirements include managed SLURM, topology-aware SLURM, SLURM plugins for containerized workloads (such as Pyxis), managed Kubernetes with autoscaling, automated Kubernetes lifecycle management, and separate Kubernetes clusters for each tenant (a topology.conf sketch follows this list).
  4. Storage: This category examines the storage solutions offered by GPU providers, including performance, capacity, and pricing. It looks at factors such as storage speed (IOPS, throughput), scalability, cost-effectiveness, and the availability of different storage tiers to meet varying workload requirements. Reliable and secure storage is crucial for staging training data and protecting sensitive artifacts such as model weights (a quick throughput-sizing sketch follows this list).
  5. NCCL/RCCL Networking Performance: This assesses the networking performance of GPU clusters, focusing on the efficiency of collective operations. NCCL (NVIDIA Collective Communications Library) and RCCL (ROCm Communication Collectives Library) are crucial for multi-GPU communication, and this category evaluates how well providers optimize network configurations for these operations. Optimal networking performance is essential for training and inference of both LLMs and SLMs. Requirements include validating performance across a range of message sizes, a non-blocking, rail-optimized fabric design, per-job throughput monitoring, dynamic fabric telemetry, topology-aware tenant allocation, and support for SHARP (Scalable Hierarchical Aggregation and Reduction Protocol); a bus-bandwidth acceptance sketch follows this list.
  6. Reliability and SLA: This category evaluates the reliability of GPU cloud services and the guarantees provided through SLAs. It considers factors such as uptime, fault tolerance, redundancy, and the comprehensiveness of SLAs in terms of compensation for downtime or performance issues. Key requirements include clearly defined public SLAs, active and passive health checks for auto-healing, hot-spare availability, job auto-stop recovery and rescheduling, disaster recovery backup policies, tenant SLA metrics monitoring, and service credit policies (a service-credit sketch follows this list).
  7. Automated Active and Passive Health Checks and Monitoring: This assesses the systems in place for monitoring the health and performance of GPU resources. It examines the extent to which providers use automated tools to proactively detect and address potential problems, as well as the availability of monitoring dashboards and alerts for users. Proactive health checks and monitoring are crucial to maintaining reliability and meeting SLAs (a minimal passive-probe sketch follows this list).
  8. Consumption Models, Price Per Value, and Availability: This category examines the pricing structures, consumption models, and availability of GPU resources. It evaluates the flexibility of pricing options, the overall cost-effectiveness of the services, and the ability of providers to meet the demand for GPU resources. Requirements include on-demand boot time SLAs, flexible reservation models (shorter durations), capacity blocks, spot instances with boot time SLAs, and transparent pricing (a cost-comparison sketch follows this list).
  9. Technical Partnerships: This evaluates the partnerships and collaborations that GPU providers have with other technology companies. It considers how these partnerships enhance the provider's offerings, such as access to cutting-edge hardware, software integrations, and specialized expertise. Strong technical partnerships are crucial in the complex GPU cloud domain. Partnerships with NVIDIA (for NCPs), AI/ML tool providers, and hardware vendors are essential for designing robust data centers, providing support, and offering a comprehensive solution.
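
To make the security category's isolation requirements concrete, here is a minimal Python sketch of the kind of audit a provider (or tenant) could run against its own inventory. The inventory format and field names are hypothetical; the point is simply that no VLAN, InfiniBand PKEY, or Kubernetes cluster should be shared across tenants.

```python
# Minimal tenant-isolation audit: flags tenants that share a VLAN,
# InfiniBand PKEY, or Kubernetes cluster with another tenant.
# The inventory format and field names below are hypothetical.
from collections import defaultdict

inventory = [
    {"tenant": "acme",    "vlan": 110, "ib_pkey": "0x8011", "k8s_cluster": "acme-prod"},
    {"tenant": "globex",  "vlan": 120, "ib_pkey": "0x8012", "k8s_cluster": "globex-prod"},
    {"tenant": "initech", "vlan": 120, "ib_pkey": "0x8013", "k8s_cluster": "shared-01"},  # VLAN clash
]

def audit(inventory):
    findings = []
    for field in ("vlan", "ib_pkey", "k8s_cluster"):
        owners = defaultdict(list)
        for entry in inventory:
            owners[entry[field]].append(entry["tenant"])
        for value, tenants in owners.items():
            if len(tenants) > 1:
                findings.append(f"{field}={value} shared by {', '.join(tenants)}")
    return findings

for finding in audit(inventory):
    print("ISOLATION VIOLATION:", finding)
```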
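For topology-aware SLURM, one building block is a topology.conf that describes which nodes hang off which leaf switches, so the scheduler can pack multi-node jobs under as few switches as possible. The sketch below generates one from a simple leaf-to-node map; the switch and node names are made up, and a real deployment would derive this from fabric discovery. With a plugin such as Pyxis installed, containerized workloads can then be launched directly through srun.

```python
# Sketch: emit a Slurm topology.conf (tree topology plugin) from a simple
# leaf-switch -> node-range map. Switch and node names are illustrative.
leaf_switches = {
    "leaf01": "gpu[001-008]",
    "leaf02": "gpu[009-016]",
    "leaf03": "gpu[017-024]",
    "leaf04": "gpu[025-032]",
}
spine = "spine01"

lines = [f"SwitchName={leaf} Nodes={nodes}" for leaf, nodes in leaf_switches.items()]
lines.append(f"SwitchName={spine} Switches={','.join(leaf_switches)}")

with open("topology.conf", "w") as f:
    f.write("\n".join(lines) + "\n")

print("\n".join(lines))
```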
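Storage sizing for training is ultimately arithmetic: aggregate read throughput has to keep every GPU's data loader fed. A back-of-the-envelope sketch, with all numbers being illustrative assumptions rather than measured values:

```python
# Back-of-the-envelope check: can the storage tier keep a training job fed?
# All numbers below are illustrative assumptions, not measured values.
num_gpus                = 256
samples_per_sec_per_gpu = 40       # assumed data-loader demand per GPU
bytes_per_sample        = 1.5e6    # assumed preprocessed sample size (~1.5 MB)

required_gbps  = num_gpus * samples_per_sec_per_gpu * bytes_per_sample * 8 / 1e9
delivered_gbps = 120.0             # assumed aggregate read throughput of the tier

print(f"required : {required_gbps:8.1f} Gb/s")
print(f"delivered: {delivered_gbps:8.1f} Gb/s")
print("OK" if delivered_gbps >= required_gbps else "UNDERSIZED storage tier")
```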
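For NCCL performance, a common acceptance approach is to run nccl-tests (e.g. all_reduce_perf) across a representative set of nodes and compare the reported bus bandwidth at large message sizes against a target derived from the per-GPU NIC line rate. The sketch below shows that comparison; the measured values are placeholders, the 85% efficiency target is an assumption, and real expectations depend on topology, the NCCL algorithm in use, and SHARP offload.

```python
# Sketch: check large-message all-reduce bus bandwidth against a target
# derived from the per-GPU NIC line rate. The measured values would come
# from nccl-tests (all_reduce_perf); the numbers here are placeholders.
line_rate_gbps_per_gpu = 400        # e.g. one 400 Gb/s NIC per GPU (assumption)
efficiency_target      = 0.85       # assumed acceptance threshold
min_size               = 1 * 2**30  # only judge messages of 1 GiB and above

# (message_size_bytes, measured_busbw_GBps) pairs, e.g. parsed from all_reduce_perf
measured = [
    (256 * 2**20, 40.2),
    (1 * 2**30, 43.5),
    (4 * 2**30, 44.1),
]

target = line_rate_gbps_per_gpu / 8 * efficiency_target   # GB/s
for size, busbw in measured:
    if size < min_size:
        continue
    status = "PASS" if busbw >= target else "FAIL"
    print(f"{size / 2**30:6.2f} GiB  busbw={busbw:5.1f} GB/s  target>={target:.1f} GB/s  {status}")
```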
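Service-credit policies reduce to a simple uptime calculation. The tiers below are purely illustrative and do not reflect SemiAnalysis' criteria or any provider's actual SLA:

```python
# Sketch: compute monthly uptime and a tiered service credit.
minutes_in_month = 30 * 24 * 60
downtime_minutes = 95                      # assumed incident total for the month

uptime_pct = 100 * (1 - downtime_minutes / minutes_in_month)

# (minimum uptime %, credit % of monthly spend) -- hypothetical tiers
credit_tiers = [(99.9, 0), (99.0, 10), (95.0, 25), (0.0, 100)]
credit = next(c for floor, c in credit_tiers if uptime_pct >= floor)

print(f"uptime: {uptime_pct:.3f}%  -> service credit: {credit}% of monthly spend")
```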
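A passive health check can be as simple as polling nvidia-smi and flagging anomalies; production systems typically layer DCGM diagnostics, ECC/XID error tracking, and fabric checks on top. A minimal sketch, with illustrative thresholds:

```python
# Minimal passive health probe: poll nvidia-smi and flag GPUs that are too
# hot or apparently idle. Thresholds are illustrative only.
import subprocess

FIELDS = "index,temperature.gpu,utilization.gpu"
out = subprocess.run(
    ["nvidia-smi", f"--query-gpu={FIELDS}", "--format=csv,noheader,nounits"],
    capture_output=True, text=True, check=True,
).stdout

for line in out.strip().splitlines():
    idx, temp, util = [x.strip() for x in line.split(",")]
    if float(temp) > 85:
        print(f"GPU {idx}: WARN temperature {temp} C")
    if float(util) < 5:
        print(f"GPU {idx}: INFO utilization {util}% (idle?)")
```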
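When comparing consumption models, the list price matters less than the effective $/GPU-hour once the expected utilization of a reservation is factored in. A small sketch with made-up prices:

```python
# Sketch: effective $/GPU-hour across consumption models, accounting for the
# fact that idle reserved hours are still paid for. All prices are made up.
options = {
    "on-demand":             {"price": 3.50, "utilization": 1.00},
    "1-month reservation":   {"price": 2.80, "utilization": 0.85},
    "12-month reservation":  {"price": 2.10, "utilization": 0.70},
    "spot":                  {"price": 1.60, "utilization": 1.00},
}

for name, o in options.items():
    effective = o["price"] / o["utilization"]
    print(f"{name:22s} list ${o['price']:.2f}/GPU-hr  effective ${effective:.2f}/GPU-hr")
```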

We at Aarna have created a spreadsheet that meticulously captures each requirement; see the screenshot below. If you would like a copy, please request it here.

The aarna.ml GPU Cloud Management Software

Aarna offers GPU cloud management software (GPU CMS) designed to help Neocloud providers meet the ClusterMAX™ requirements and achieve higher ratings. The GPU CMS interfaces with and configures the various hardware components (CPU+GPU servers, network fabric, storage, external gateway, etc.) and provides features such as hardware provisioning, an admin console (for topology management, tenant creation, and observability), a tenant portal (with admin and user personas), support for explicit intent (bare metal, VM, Kubernetes; an illustrative intent payload appears below), job submission, and serverless inferencing. It also includes a third-party catalog, billing integration, and role-based access control.
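
As a purely illustrative example of what an "explicit intent" request might look like, here is a hypothetical payload. The endpoint, field names, and values are invented for this post and do not reflect aarna.ml's actual API.

```python
# Hypothetical "explicit intent" payload -- field names and values are
# illustrative only, not aarna.ml's actual API schema.
import json

intent = {
    "tenant": "acme",
    "delivery_model": "kubernetes",      # could also be "bare_metal" or "vm"
    "gpu_type": "H100-SXM",
    "gpu_count": 64,
    "network": {"fabric": "infiniband", "isolation": "pkey"},
    "storage": {"tier": "nvme", "capacity_tb": 200},
}

print(json.dumps(intent, indent=2))
# In practice, an intent like this would be submitted to the GPU CMS, which
# would provision the bare metal, fabric partitions, and Kubernetes cluster.
```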

Conclusion

The ClusterMAX™ rating system offers a valuable framework for evaluating GPU Neoclouds, enabling enterprises to make informed decisions and drive successful AI initiatives. aarna.ml's GPU CMS can help Neocloud providers rapidly meet these requirements and achieve higher ratings.

To gain a deeper understanding of this topic, you can watch the full webinar recording. You are also encouraged to:

Get a copy of the SemiAnalysis ClusterMAX™ requirements spreadsheet