Aarna.ml


Blog

Amar Kapadia

NCP Technical Considerations for Building GPUaaS Cloud Infra

Introduction

To address the ever-growing industry demand for running AI workloads, many players are entering the market of hosting NVIDIA GPU-based infrastructure and offering it as a service, in some cases by becoming an NVIDIA Cloud Partner (NCP). Essentially, these entities need to offer GPU processing instances to their clients in a manner similar to how the hyperscalers offer their infrastructure -- API-driven, on-demand, elastic, secure, isolated, and usage-based.

Let’s delve into the various technical aspects of implementing such a GPU infrastructure, which NCPs need to offer “as a service”.

Multi-tenancy: The basic ask for any “as-a-service” offering

Any “as-a-service” offering needs to fundamentally support “multi-tenancy” at its core. The same physical infrastructure needs to be logically sliced and isolated for every tenant without compromising on the throughput and latency requirements of the tenant workloads.

This “slicing” of infrastructure needs to span all the layers: host hardware (CPU & GPU), platform software (e.g. Bare Metal as a Service i.e. BMaaS, virtualization, or Container as a Service i.e. CaaS), storage, and networking devices (switches & routers).

Logical isolation of such an infrastructure also needs to be elastic and dynamic. It should be completely software-driven with no manual steps. All the required resources for a tenant should be reserved and interconnected for the lifespan of that tenant instance, then released back to the common pool once the instance is deleted.

In summary, to offer GPU as a service, NCPs need to be able to dynamically provision all the layers of the GPU-based infrastructure -- hardware, networking, and software platforms -- based on API-driven requests from tenants.
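The reserve-and-release lifecycle described above can be sketched in a few lines of Python. This is an illustrative toy model -- the pool, slice, and function names are invented here, not an actual NCP implementation:

```python
# Toy sketch of the tenant-slice lifecycle: resources are reserved from a
# shared pool for the lifetime of a tenant instance and returned to the
# pool when the instance is deleted. All names are illustrative.

from contextlib import contextmanager

class ResourcePool:
    """A shared pool of GPUs that tenant slices draw from."""
    def __init__(self, total_gpus):
        self.free_gpus = total_gpus

    def reserve(self, count):
        if count > self.free_gpus:
            raise RuntimeError("insufficient free GPUs")
        self.free_gpus -= count
        return count

    def release(self, count):
        self.free_gpus += count

@contextmanager
def tenant_slice(pool, gpus):
    """Reserve an isolated slice for one tenant instance; release on delete."""
    reserved = pool.reserve(gpus)
    try:
        yield reserved          # tenant workloads run against the slice
    finally:
        pool.release(reserved)  # instance deleted: GPUs back to the pool

pool = ResourcePool(total_gpus=64)
with tenant_slice(pool, gpus=8):
    assert pool.free_gpus == 56   # 8 GPUs held for the tenant
assert pool.free_gpus == 64       # released back to the common pool
```

In a real system the reserve and release steps would, of course, fan out to BMaaS/CaaS provisioning, switch configuration, and storage ACLs rather than a simple counter.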

So what’s the list of functionalities that NCPs need to have for offering GPUaaS?

With this context, let’s now get to the precise functional capabilities that NCPs would need to implement in their infrastructure so that it could be offered in a GPUaaS model.

  1. Day 0 provisioning of the GPU-based DC -- This feature should bootstrap all the DC nodes and configure them with the appropriate OS versions, firmware, BIOS settings, GPU drivers, etc. It should also perform Day 0 provisioning of network switches, covering both InfiniBand (Quantum) and Ethernet-based (Spectrum) switches. If included in the configuration, this module should also provision the BlueField-3 (BF3) DPU. In summary, this module should automate provisioning of hosts, GPUs, and underlay networks, making the infrastructure ready for use by the NCP.
  2. Compute (CPU/GPU) allocation -- Next, the NCP needs the capability to allocate CPUs and GPUs per the tenant-requested parameters. Allocating CPUs for tenants is a solved problem (BMaaS, virtualization, or CaaS with correctly tuned OS packages and Kubernetes Operators); the main focus is on how GPUs can be allocated to tenants. Here, options ranging from fractional GPUs to multi-GPU allocations should be supported, based on the tenant workload requirements.
  3. Network isolation -- Tenant AI workloads may execute across multiple GPUs within a node and across nodes. The nodes may be connected using InfiniBand (Quantum) or Ethernet-based (Spectrum) switches (BF3 soft switches may also be involved). Per-tenant network isolation based on the underlying network capabilities (e.g. PKEY for InfiniBand and VXLAN for Ethernet) should be configured to ensure tenant workloads do not impact each other.
  4. Storage configuration -- Tenant-specific ACLs should be configured on the storage solution so that tenant workloads can access their subscribed storage quota. GPUDirect Storage should also be configured as required for tenants, enabling faster data transfer between GPU memory and the storage device.
  5. Job scheduling -- Many HPC and AI training workloads require batch scheduling tools and algorithms that differ from those for transactional CPU-based workloads. These schedulers maximize GPU utilization and optimize workload execution across tenants.
  6. RBAC -- Role-Based Access Control should support various personas (global admins, tenant admins, and tenant users) along with their respective privileges. The global admin should be able to create tenant admins and allocate a quota for every tenant. Tenant admins should be able to create enterprise-specific users and project hierarchies within their ambit. Tenant users should be able to manage and monitor their respective instances and workloads.
  7. Observability & monitoring -- Every user, based on their privileges, should be able to monitor statistics related to CPU, GPU, memory, CaaS, workloads, etc., and take any manual or automated actions required to maintain the health of their workloads.
  8. Usage metrics -- NCPs should get per-tenant usage metrics at their desired intervals for billing purposes. Based on the NCP’s BSS capabilities, batch or connection-oriented interfaces should be supported for passing on the tenant usage metrics.
  9. GPUaaS API -- Finally, all of these features should be accessible to tenants through an API. Tenants should be able to invoke the appropriate APIs on the NCP gateway for requesting infrastructure, submitting AI workloads, fetching usage metrics, etc. The same functionality should also be available through other means such as a GUI or Kubernetes CRs.
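As a rough illustration of how several of these capabilities might surface through the GPUaaS API (item 9), the sketch below assembles a hypothetical instance-request payload. Every field and value here is an assumption for illustration, not a real NCP or NVIDIA API:

```python
# Hypothetical tenant-facing request to a GPUaaS API gateway, tying together
# compute allocation (item 2), network isolation (item 3), and storage
# quota (item 4). All field names and values are illustrative only.

import json

def build_instance_request(tenant_id, gpu_count, gpu_profile,
                           network="vxlan", storage_quota_tb=10):
    """Assemble an API payload requesting an isolated GPU instance."""
    if gpu_count < 1:
        raise ValueError("gpu_count must be >= 1")
    return {
        "tenant": tenant_id,
        "compute": {"gpus": gpu_count, "profile": gpu_profile},
        "network": {"isolation": network},       # e.g. "vxlan" or "pkey"
        "storage": {"quota_tb": storage_quota_tb,
                    "gpu_direct": True},         # enable GPUDirect Storage
    }

payload = build_instance_request("tenant-42", gpu_count=8,
                                 gpu_profile="full-gpu")
print(json.dumps(payload, indent=2))  # body POSTed to the NCP API gateway
```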

In addition to this primary functionality, NCPs also need supporting capabilities such as an image service, key management, workflow orchestration, active & available inventory management, and a policy engine. These additional components automate Day N operations and also help NCP tenants optimally size the GPU-based infrastructure for their workloads.

Implementing the above functional modules will enable NCPs to offer a complete E2E GPUaaS infrastructure to their customers. It should be noted that various NVIDIA solution components and a few other third-party components do support some of the building blocks listed above, but they all need to be combined in a logical manner and supplemented with additional features to provide a truly “as-a-service” platform.

In our next blog, we shall delve into the solution blueprint that NCPs could implement to offer their GPU infrastructure to tenants “as a service”, in a fully software-driven mode.

Please reach out to [email protected] to learn more.

Amar Kapadia

99.X% Availability? Why Most GPUaaS SLAs Fall Short and How to Fix It

With the growth of GPUs, there has also been a significant increase in the number of GPU-as-a-Service (GPUaaS) providers. Conventional wisdom suggests that GPU users primarily care about cost and performance. While these are indeed crucial factors, other aspects are equally important, such as availability, data locality/sovereignty, service termination features (e.g., bulk data transfer options), disaster recovery, business continuity, data privacy, ease of use, reliability, data egress costs, carbon footprint, and more.

In this blog, we will focus on availability. According to BMC Software, availability is the percentage of time that the infrastructure, system, or solution is operational under normal circumstances. For example, AWS EC2 provides a 99.5% availability SLA (which is quite low, roughly 3.5 hours of downtime per month), with service credits issued if this SLA is not met. To be fair, AWS also offers a higher regional SLA of 99.99%, equating to approximately 4.5 minutes of downtime per month.
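The downtime figures quoted above follow directly from the SLA percentage. A quick sketch (assuming a 730-hour month):

```python
# Convert an availability SLA percentage into allowed monthly downtime,
# to sanity-check the figures above (assumes a 730-hour month).

def monthly_downtime_minutes(sla_percent, hours_per_month=730):
    return (1 - sla_percent / 100) * hours_per_month * 60

print(round(monthly_downtime_minutes(99.5) / 60, 2))  # ~3.65 hours for 99.5%
print(round(monthly_downtime_minutes(99.99), 2))      # ~4.38 minutes for 99.99%
```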

If you are a GPUaaS provider (or an aspiring one) or an NVIDIA Cloud Partner (NCP), you need to determine what level of availability suits your ideal customer profile. You’ll also need to establish how to measure this SLA and what credits (if any) to issue if the SLA is breached. As an aside, availability can be a key differentiator for your GPU cloud service.

Once you’ve set your availability criteria, the next step is to figure out how to meet the availability SLA. Here’s the equation to calculate availability:

Availability = MTBF / (MTBF + MTTR)

MTBF = Mean time between failures

MTTR= Mean time to repair

In other words, to calculate availability, you need to determine the MTBF for your GPU cloud and calculate the MTTR across all failure types. Automated failure resolution is typically rapid and nearly instantaneous, whereas manual resolution can take minutes or hours. The challenge is deciding which faults should be automated and which should be repaired manually so that the blend of repair strategies results in an MTTR that is equal to or lower than the required MTTR. At Aarna, we’ve developed an MTTR calculator to help address this question.
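Rearranging the availability equation gives the MTTR you must achieve for a target SLA. The sketch below uses an assumed illustrative MTBF, not measured fleet data:

```python
# Rearranging:  Availability = MTBF / (MTBF + MTTR)
#           =>  MTTR = MTBF * (1 - A) / A
# The MTBF below is an assumed illustrative figure, not measured data.

def required_mttr_minutes(availability_percent, mtbf_hours):
    a = availability_percent / 100
    return mtbf_hours * (1 - a) / a * 60

# e.g. with an assumed fleet-wide MTBF of 50 hours and a 99.999% target:
print(round(required_mttr_minutes(99.999, 50), 3))  # ~0.03 minutes (~1.8 s)
```

A sub-two-second repair budget is impossible to meet manually, which is why the high-frequency fault classes must be repaired automatically.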

The calculator uses data from Meta on GPU Cloud MTBF. With this data, you can align your repair strategy with your Availability SLA goals. The MTTR calculator requires two inputs:

  1. The required Availability SLA based on your (i.e. the GPUaaS provider or NCP) requirements.
  2. Average failure resolution time, assuming faults are identified and repaired manually.

After entering these inputs, the calculator will specify which fault repairs need to be automated and which can be managed manually.

For example, if your goal is 99.999% availability and it takes your operations team an average of 2 hours to identify and repair faults manually, you’ll need to automate the following types of faults:

  • Faulty GPU
  • GPU HBM3 memory
  • Software bug
  • Network Switch/Cable
  • Host Maintenance
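The calculator’s core decision can be sketched as a greedy selection: automate the most frequent fault types first until the blended MTTR meets the target. The failure rates below are illustrative placeholders, not Meta’s published data:

```python
# Sketch of the calculator's core logic. Given per-fault-type failure rates
# (placeholder values, not Meta's data), automate the most frequent fault
# types first until the blended MTTR meets the target.
# Repair times assumed: ~1 minute automated, 120 minutes manual.

# fault type -> assumed failures per 1000 GPU-hours (placeholders)
fault_rates = {
    "faulty_gpu": 3.0,
    "hbm3_memory": 2.5,
    "software_bug": 2.0,
    "network_switch_cable": 1.5,
    "host_maintenance": 1.0,
    "psu_failure": 0.3,
}

def automation_plan(rates, target_mttr_min, manual_min=120.0, auto_min=1.0):
    """Greedily pick fault types to automate, most frequent first."""
    automated = []
    total = sum(rates.values())
    for fault in sorted(rates, key=rates.get, reverse=True):
        # blended MTTR = rate-weighted average repair time over all faults
        blended = sum(
            rates[f] * (auto_min if f in automated else manual_min)
            for f in rates) / total
        if blended <= target_mttr_min:
            break
        automated.append(fault)
    return automated

plan = automation_plan(fault_rates, target_mttr_min=10.0)
print(plan)  # the fault types whose repair must be automated
```

With these placeholder rates and a 10-minute blended-MTTR target, the greedy pass selects the same five fault classes listed above.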

Feel free to experiment with the MTTR calculator and share your feedback. If you make any improvements, please let us know so we can update the tool for the benefit of the broader community.

MTTR Calculator for GPUaaS Providers

[Interactive calculator: enter the required availability SLA (%) and the average manual resolution time (hours); it outputs the required MTTR, the corresponding monthly downtime, and a GPUaaS failure analysis & MTBF breakdown.]

Additionally, our GPU Cloud Management Software (AMCOP) features fault management and correlation capabilities to aid in automating repairs. In the future, our product will also provide your BSS system with Availability SLA violation details and a list of affected tenants, enabling you to issue credits as needed. Contact us to explore these topics further.

About us: Aarna.ml is an NVIDIA- and venture-backed startup building software that helps GPU-as-a-Service companies build hyperscaler-grade cloud services with multi-tenancy and isolation.

Amar Kapadia

Aarna’s Role in Enabling Sovereign GPUaaS Providers in India

The government of India (India AI) issued a document titled, “Inviting Applications for Empanelment of Agencies for providing AI services on Cloud.” This document invites in-country GPUaaS providers to bid for sovereign opportunities. It is a detailed and thoughtful document and will no doubt spur innovation at all levels of the AI/ML stack within India.

If you are responding to this invitation or plan to, we would like to congratulate you! However, some of the requirements in sections 6.7 “Admin Portal”, 6.8 “Service Provisioning”, 6.9 “Operational Management”, and 6.12 “SLA Management” are complicated. They essentially require a GPU Cloud Management Software layer, and this software needs to be up and running within T0 + 6 months.

Let’s explore what your options are since it’s the classic “make” vs. “buy” situation. Here are the pros and cons of these two options.

“Make” Option
  Pros:
  • Full control of the software, with the ability to differentiate and customize (though it may not actually be possible to differentiate at the IaaS layer, so the differentiation argument might be questionable)
  Cons:
  • Requires very strong in-house development skills, especially given the tight development timelines
  • Matching ongoing feature requirements will become challenging in the long term
“Buy” Option
  Pros:
  • Access to a purpose-built 3rd-party product
  • Cost savings (a 3rd-party product will be less expensive than in-house development)
  • Precious development resources stay focused on AI/ML rather than infrastructure
  Cons:
  • Customization will be possible, but might be more difficult than with in-house software
If you are going for the “make” option, the rest of this blog is moot. However, if you want to explore the “buy” option, we can help you with the requirements below[1]:

General
  • Admin portal available within 6 months of LOI
  • Dynamically manage 1,000+ GPUs
6.7 “Admin Portal”
  • User registration/account creation
  • Service catalog and prices
  • Capacity dashboard
  • Utilization monitoring
  • Incident management
  • Service Health Dashboard
  • Ability to customize dashboard for the subsidy workflow
6.8 “Service Provisioning”
  • Online, on-demand instances that can be scaled up/down
  • Management portal
  • Public internet access with VPN
  • Support for BMaaS and VMs
  • MTTR SLAs and recovery
  • User notifications
  • Data destruction (so it cannot be forensically recovered)
6.9 “Operational Management”
  • Patch management
  • OS images with latest security patches
  • Root cause analysis and timely repairs
  • System usage
6.12 “SLA Management”
  • SLA measurement and MTTR improvement to meet incident management SLA (99.95% or higher)
  • Service availability measurement

Finally, to our knowledge, we are the only GPU Cloud Management Software company in the market. If this blog sounds interesting, learn more:

  • Our GPU Cloud Management Software demo

And please feel free to contact us.

Amar Kapadia

The Emerging GPU-as-a-Service Provider Industry

The introductory blog introduced the concept of GPU-as-a-Service (GPUaaS). This next blog classifies GPUaaS providers and describes the factors driving GPU demand.

Key GPUaaS Provider Classification: 

GPUs are offered either as bare metal (one or more physical machines) or as a virtual machine (VM) to be consumed via APIs. The bare metal or VM instance may optionally have a Container-as-a-Service (CaaS) layer on top managed by Kubernetes. 

We have classified key GPUaaS players to help end users choose the right provider: 

  1. Hyperscalers like Google, AWS, Azure, and Oracle, and newer entrants like Lambda Labs, CoreWeave, Digital Ocean, etc., who provide bare metal, VM, or CaaS solutions, often packaged with their own PaaS or SaaS layers such as LLM models, PyTorch, and LLMOps/MLOps/RunOps.
  2. Traditional telcos and data centers with large commitments to buying GPUs to build “Sovereign AI Clouds” in their host countries (discussed later in this blog). If based on NVIDIA and their GPU commitment is sufficiently large, these telcos and data centers are called “NVIDIA Cloud Partners” (NCPs). There are numerous NCPs consuming GPUs worth billions or tens of billions of dollars. The primary use case is LLM training, and for this reason they tend to favor bare metal instances.
  3. Small, regional, or edge data centers with smaller commitments and a focus on use cases beyond LLM training. They tend to offer bare metal and VM instances, optionally with a CaaS layer.
  4. Startups, particularly those that started with a cryptocurrency use case or have significant capital deployed in GPU acquisition -- these players may also choose to build “industry clouds”. Their offerings include bare metal and VM instances, but often go beyond by offering PaaS or SaaS layers. Where there is a vertical industry orientation, these PaaS or SaaS layers tend to be industry-specific, e.g. fintech, life sciences, and more.

What is driving GPU demand?

The primary use case behind the massive growth in GPUs is LLM training. Massive data sets (at internet scale) across languages are driving an insatiable demand to lock up the latest GPUs to train state-of-the-art models. A 2023 Goldman Sachs report predicted that training would drive most of NVIDIA’s revenues in 2024 and 2025.

 

[Figure: AI Compute Revenue Opportunity]

However, contrary to Goldman Sachs’ opinion, we expect inference use cases to show up much earlier and drive the next wave of GPU growth, because there is no choice: models must be deployed in end-user applications to generate ROI on the initial LLM investment. We also expect inferencing to run at a scale an order of magnitude greater than training (5x to 40x). It is anticipated that the inference use case will be fragmented across NVIDIA, AMD, Qualcomm, and emerging startups like Groq and Tenstorrent (additional reading here, and a topic for another blog). Over time, we also expect model fine-tuning and Small Language Models (SLMs) to drive additional GPU growth.

Another source that will continue to drive GPU demand is “Sovereign AI”. Sovereign AI, as defined by Michael Dell in his recent blog, is a nation’s capability to produce artificial intelligence using its own infrastructure and data. We expect national governments and the public sector to embrace this idea; most will be reluctant to use hyperscaler AI clouds for their AI initiatives.

In summary, we expect a step change in the way the market works: newer entrants, including telcos and data center companies, will shape the industry as they spot a unique window of opportunity to win local AI cloud business away from the hyperscalers. GPU demand and the number of GPUaaS providers will grow significantly in the next few years, and new use cases will further contribute to this trend.

The next blog will cover additional GPUaaS and NCP topics, both business and technical!


Amar Kapadia

GPU-as-a-Service Blog series : An Introduction

The Gen AI industry has spurred massive growth in the GPU market. NVIDIA, the most valuable company in the world, is at the forefront of this explosion. When it comes to GPU consumption models, a number of factors affect this industry – GPU supply (shortage vs. glut), cost vs. performance in the midst of the explosion of the various LLM models, complexity in the underlying technology choices, and the need for enterprises to experiment (do PoCs) as opposed to making long term commitments. 

The lack of certainty means enterprises and startups prefer to “rent” GPUs as opposed to buying dedicated hardware. This has created a new industry of “GPU-as-a-Service” or “AI Cloud” providers who rent GPUs to customers -- often bare metal GPUs, but sometimes integrated with sophisticated software and services packaged for customers. This nascent GPU-as-a-Service market is forecast to grow 16x to $80B over the next decade.

To date, a small set of hyperscalers, crypto mining companies, and startups offer GPU-as-a-Service (GPUaaS). Moving forward, the number is expected to grow massively. Sequoia Capital, the world’s leading venture capital firm, recently published a blog that likened the GPU capex buildout to the erstwhile railroad industry: “build the railroad and hope they will come”.

If you have decided to be in the GPUaaS business, you are looking at a great business opportunity. However, as with any attractive business, there is no free lunch. As of July 2024, there are 600 new competitors, intense margin pressure, and a complex tech stack to deal with. Bare metal GPU instances are already priced at around $2.30 per hour, and with supply pressure easing, these prices are likely to drop further.

Offering only bare metal-as-a-service is not a prudent ROI option for most AI Cloud providers. While it works for very large, long-term workloads like training LLMs, the majority of the market does not need such large capacities locked up for extended durations. Inferencing, fine-tuning, and training of smaller deep learning models account for a much larger market with “bursty” and dynamic requirements.

Given the rapid shifts in the GPU market, what should an AI Cloud provider do? If your customers are startups or enterprises building products that require training smaller models, inferencing or fine tuning existing off the shelf models, how should they go about it? These are some of the questions participants in the AI value chain are seeking answers to.

The lack of clarity in this emerging industry has given us at Aarna.ml an opportunity to provide an independent point of view to the GPUaaS providers. Over the next few weeks we will be publishing a series of posts on the GPU-as-a-service industry. We hope enterprises, startups, data center companies and AI Cloud providers can find our observations and opinions useful. 


Milind Jalwadi

Webinar Recap: Dynamically Orchestrating RAN and AI Workloads on a Common GPU Cloud

In the recent webinar, "Dynamically Orchestrating RAN and AI Workloads on a Common GPU Cloud," the presenters highlighted innovative strategies for optimizing GPU infrastructure. The focus was on leveraging dynamic orchestration to manage both RAN and AI workloads efficiently, maximizing resource utilization and ROI for Mobile Network Operators (MNOs).

Key Takeaways:

  1. RAN and AI workloads on the same GPU cloud:
    • 5G RAN L1-layer acceleration is achieved by using GPUs for processing, and the same GPU infrastructure can also be used for running AI workloads. By configuring the same GPU infrastructure for both types of workloads, MNOs can significantly improve utilization rates.
  2. Dynamic scaling:
    • The webinar demonstrated how dynamic scaling techniques allow RAN and AI workloads to scale in and out based on real-time traffic demands. This automation ensures optimal use of resources, reducing operational costs and enhancing performance.
  3. Monetizing unused capacity:
    • Traditional RAN infrastructure is provisioned for peak hours and hence is often underutilized during off-peak periods, wasting costly GPU compute. MNOs can capitalize on their infrastructure by selling unused GPU cycles as spot instances for running AI workloads. This additional revenue stream can significantly shorten the ROI period for existing investments.
  4. Automation and efficiency:
    • Automating the orchestration of RAN and AI workloads minimizes manual intervention, leading to greater efficiency and consistency. This approach also simplifies management and operational tasks, allowing MNOs to focus on strategic initiatives.
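The scale-in/scale-out idea in points 2 and 3 can be sketched as a simple allocation policy. The GPU counts, minimum reserve, and function names below are illustrative assumptions, not AMCOP’s actual logic:

```python
# Simplified sketch: assign spare GPU capacity to AI spot workloads when
# RAN traffic is low, and reclaim it at peak. All numbers are illustrative.

TOTAL_GPUS = 16

def allocate(ran_load):
    """Split GPUs between RAN L1 processing and AI spot instances.

    ran_load: current RAN traffic as a fraction of peak (0.0 - 1.0).
    """
    # Reserve enough GPUs to carry current RAN traffic, with a floor of 2
    # so the RAN never starves during a sudden traffic spike.
    ran_gpus = max(2, round(ran_load * TOTAL_GPUS))
    ai_gpus = TOTAL_GPUS - ran_gpus   # spare cycles sold as spot instances
    return {"ran": ran_gpus, "ai_spot": ai_gpus}

print(allocate(0.9))   # busy hour: most GPUs carry RAN traffic
print(allocate(0.1))   # off-peak: spare GPUs monetized for AI
```

A production orchestrator would drive this decision from live telemetry and preempt spot workloads gracefully, but the utilization argument from the webinar is exactly this trade.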

Demo Highlights:

The webinar included a live demonstration of dynamic orchestration in action, showcasing real-world applications and benefits. Attendees were able to see firsthand how automated scaling and resource management can transform infrastructure utilization.

Conclusion:

The integration of RAN and AI workloads on a common GPU cloud represents a significant advancement for MNOs, offering enhanced efficiency, reduced costs, and new revenue opportunities. As the telecom industry continues to evolve, adopting such innovative solutions will be crucial for staying competitive and maximizing infrastructure investments.

For those who missed the live session, you can watch the recorded webinar here.