Introduction
To meet the ever-growing industry demand for running AI workloads, many new players are entering the market for hosting NVIDIA GPU based infrastructure and providing it as a service, in some cases by becoming an NVIDIA Cloud Partner (NCP). Essentially, these entities need to offer GPU processing instances to their clients much as the hyperscalers offer their infrastructure: API driven, on-demand, elastic, secure, isolated and usage based.
Let’s delve into the technical aspects of implementing a GPU infrastructure that NCPs can offer “as a service”.
Multi-tenancy: The basic ask for any “as-a-service” offering
Any “as-a-service” offering needs to fundamentally support “multi-tenancy” at its core. The same physical infrastructure needs to be logically sliced and isolated for every tenant without compromising on the throughput and latency requirements of the tenant workloads.
This “slicing” of infrastructure needs to span all the layers: host hardware (CPU & GPU), platform software such as Bare Metal as a Service (BMaaS), virtualization or Container as a Service (CaaS), storage, and networking devices (switches & routers).
Logical isolation of such an infrastructure also needs to be elastic and dynamic. It should be completely software driven, with no manual steps. All the resources a tenant instance requires should be reserved and inter-connected for the lifespan of that instance, then released back to the common pool once the instance is deleted.
In summary, to offer GPU as a service, NCPs need to be able to dynamically provision all the layers of the GPU based infrastructure, including hardware, networking and software platforms, based on API driven requests from tenants.
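The reserve-and-release lifecycle described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `GpuPool` class, its method names and the GPU and instance identifiers are hypothetical, not part of any NVIDIA or NCP API, and a real implementation would also have to reserve network and storage resources.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GpuPool:
    """Hypothetical shared pool of GPU IDs (illustrative only)."""

    free: set[str]
    # Maps a tenant instance ID to the GPU IDs currently reserved for it.
    reserved: dict[str, set[str]] = field(default_factory=dict)

    def reserve(self, instance_id: str, count: int) -> set[str]:
        """Reserve `count` GPUs for a tenant instance, or fail without side effects."""
        if instance_id in self.reserved:
            raise ValueError(f"instance {instance_id} already exists")
        if count > len(self.free):
            raise RuntimeError("not enough free GPUs in the pool")
        picked = set(sorted(self.free)[:count])
        self.free -= picked
        self.reserved[instance_id] = picked
        return picked

    def release(self, instance_id: str) -> None:
        """Return all GPUs held by a deleted tenant instance to the common pool."""
        self.free |= self.reserved.pop(instance_id)


pool = GpuPool(free={"gpu-0", "gpu-1", "gpu-2", "gpu-3"})
a = pool.reserve("tenant-a/instance-1", 2)
b = pool.reserve("tenant-b/instance-1", 1)
assert a.isdisjoint(b)        # two tenants never hold the same GPU
pool.release("tenant-a/instance-1")
assert len(pool.free) == 3    # released GPUs are back in the common pool
```

The point of the sketch is that reservation and release are plain software operations on shared state, with no manual step between a tenant request and the resources being handed over or reclaimed.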
So what’s the list of functionalities that NCPs need to have for offering GPUaaS?
With this context, let’s now get to the precise functional capabilities that NCPs would need to implement in their infrastructure so that it could be offered in a GPUaaS model.
- Day 0 provisioning of the GPU based DC -- This module should bootstrap all the DC nodes and configure them with the appropriate OS versions, firmware, BIOS settings, GPU drivers etc. It should also perform day 0 provisioning of the network switches, covering both InfiniBand based Quantum and Ethernet based Spectrum switches. If the configuration includes BlueField-3 (BF3) DPUs, the module should provision those as well. In essence, it should automate provisioning of hosts, GPUs and underlay networks so that the infrastructure is ready for use by the NCP.
- Compute (CPU / GPU) allocations -- Next, the NCP needs the capability to allocate CPUs and GPUs according to tenant requested parameters. Allocation of CPUs for tenants is a solved problem (BMaaS, virtualization, or CaaS with correctly tuned OS packages and Kubernetes Operators), so the main focus is on how GPUs are allocated to tenants. Options ranging from fractional GPUs to multiple GPUs per tenant should be supported, based on the tenant workload requirements.
- Network isolation -- Tenant AI workloads may be executed across multiple GPUs within a node and across nodes. The nodes could be connected using InfiniBand based Quantum or Ethernet based Spectrum switches (BF3 soft switches may also be involved). Per tenant network isolation, based on the underlying network capabilities (e.g. PKEY for InfiniBand and VXLAN for Ethernet), should be configured to ensure tenant workloads do not impact each other.
- Storage configurations -- Tenant specific ACLs should be configured on the storage solution so that tenant workloads can access their subscribed storage quota. GPUDirect Storage, which enables faster data transfer between GPU memory and the storage device, should also be configured for tenants that require it.
- Job scheduling -- Many HPC and AI training workloads require batch scheduling tools and algorithms that differ from those used for transactional CPU based workloads. These schedulers maximize GPU utilization and optimize workload execution across tenants.
- RBAC -- Role Based Access Control should support personas for global admins, tenant admins and tenant users, each with the respective privileges. The global admin should be able to create tenant admins and allocate the quota for every tenant. The tenant admin should be able to create enterprise specific users and project hierarchies within its own tenant. The tenant users should be able to manage and monitor their respective instances and workloads.
- Observability & Monitoring -- Every user, based on their privileges, should be able to monitor statistics related to CPU, GPU, memory, CaaS, workloads etc. and take any manual or automated actions required to maintain the health of their workloads.
- Usage metrics -- NCPs should get per tenant usage metrics at their desired intervals for billing purposes. Depending on the NCP’s BSS capabilities, batch or connection oriented interfaces should be supported for passing on the tenant usage metrics.
- GPUaaS API -- Finally, all of these features should be made accessible to tenants through an API based interface. Tenants should be able to invoke the appropriate APIs on the NCP gateway to request infrastructure, submit AI workloads, get usage metrics etc. The same functionality should also be available through other means such as a GUI or Kubernetes CRs.
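To make a few of the items above concrete, here is a minimal sketch of how a gateway might tie together RBAC, per tenant quota, network isolation and usage metering. It is an illustration under stated assumptions, not a real NCP or NVIDIA interface: the `Gateway` and `Tenant` classes, the role name `global_admin`, the field names and the simple counter used in place of a real PKEY or VXLAN VNI allocator are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from itertools import count


@dataclass
class Tenant:
    name: str
    gpu_quota: int             # quota set by the global admin (hypothetical field)
    gpus_in_use: int = 0
    gpu_seconds: float = 0.0   # accumulated usage, later passed on for billing


class Gateway:
    """Hypothetical GPUaaS gateway; all names here are illustrative only."""

    def __init__(self) -> None:
        self.tenants: dict[str, Tenant] = {}
        # Assumption: a plain counter stands in for a PKEY / VXLAN VNI allocator.
        self._next_isolation_id = count(1)

    def create_tenant(self, role: str, name: str, gpu_quota: int) -> Tenant:
        # RBAC: only the global admin persona may create tenants and set quota.
        if role != "global_admin":
            raise PermissionError("only a global admin may create tenants")
        tenant = Tenant(name, gpu_quota)
        self.tenants[name] = tenant
        return tenant

    def request_instance(self, tenant_name: str, gpus: int, fabric: str) -> dict:
        tenant = self.tenants[tenant_name]
        if fabric not in ("infiniband", "ethernet"):
            raise ValueError(f"unknown fabric: {fabric}")
        if tenant.gpus_in_use + gpus > tenant.gpu_quota:
            raise RuntimeError("request exceeds tenant GPU quota")
        tenant.gpus_in_use += gpus
        # Isolation mechanism follows the fabric: PKEY on InfiniBand, VXLAN on Ethernet.
        kind = "pkey" if fabric == "infiniband" else "vxlan_vni"
        return {"tenant": tenant_name, "gpus": gpus, kind: next(self._next_isolation_id)}

    def record_usage(self, tenant_name: str, gpus: int, seconds: float) -> None:
        # Usage metric: GPU-seconds consumed, accumulated per tenant.
        self.tenants[tenant_name].gpu_seconds += gpus * seconds


gw = Gateway()
gw.create_tenant("global_admin", "acme", gpu_quota=8)
instance = gw.request_instance("acme", gpus=4, fabric="infiniband")
gw.record_usage("acme", gpus=4, seconds=3600)
print(instance)
print(gw.tenants["acme"].gpu_seconds)
```

In practice each of these steps would call out to the underlying fabric manager, storage system and BSS rather than update in-memory state, but the order of checks (role, quota, isolation, metering) is the part the sketch is meant to show.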
In addition to this primary functionality, NCPs need supporting functionality such as an image service, key management, workflow orchestration, active & available inventory management and a policy engine. These additional components automate day n operations and also help NCP tenants optimally size the GPU based infrastructure resources for their workloads.
Implementing the above functional modules enables NCPs to offer a complete E2E GPUaaS infrastructure to their customers. It should be noted that various NVIDIA solution components and a few third-party components already provide some building blocks from the above list, but they all need to be combined in a logical manner and supplemented with additional features to provide a truly “as a service” platform.
In our next blog, we shall delve into the solution blueprint that NCPs could implement to offer their GPU infrastructure to tenants “as a service” in a fully software driven mode.
Please reach out to [email protected] to learn more.