Introduction: The Growing Need for Edge Inference
Centralized inference has served us well. However, new use cases such as physical AI, real-time conversational chatbots, and computer vision require reduced latency, computation closer to data sources, and optimized bandwidth usage. These applications can’t tolerate the latency of round trips to centralized data centers, nor can they afford the cost of constantly transferring large volumes of data. Instead, they require inference that is geographically distributed at the edge, dynamically orchestrated, and tightly optimized for latency and bandwidth.
Inference latency requirements:
- Physical AI < 1 sec
- Conversational AI < 6 sec
- Computer vision < 1 sec
This is fueling a surge in demand for edge inference infrastructure that can run AI models across clusters of GPUs residing at the telco edge while maintaining cloud-like flexibility and scale. The edge inference market is poised for exceptional growth between 2025 and 2030, with projections indicating an expansion from USD 106.15 billion to USD 254.98 billion at a CAGR of 19.2%.
Why NVIDIA MGX Servers Are a Game-Changer for Edge Inference
NVIDIA MGX™ servers, based on a modular reference design, can be used for a wide variety of use cases, from remote visualization to supercomputing at the edge. MGX sets a new standard for modular server design by improving ROI and reducing time to market, and it is especially well suited to distributed inference. Some of the reasons for this are:
- Modular design allows core and edge sites to scale from 1 RU to multiple racks
- High performance per watt allows maximum GPU compute capacity to be packed at the distributed inference site
- Unified memory between CPU and GPU speeds up reasoning-based inference by up to 2x
- Integration with NVIDIA AI Enterprise stack (NVAIE) provides access to a large number of vertically oriented models and solutions
When combined with the NVIDIA Spectrum-X™ Ethernet networking platform for AI, customers can extract the full performance of the underlying GPUs.
Challenges in Building an Edge Inference Stack
While MGX servers along with Spectrum-X offer an integrated stack, edge inference presents a number of infrastructure challenges for a GPU-as-a-Service (GPUaaS) provider:
- Managing multiple sites: Edge GPUaaS providers typically operate multiple sites, often in lights-out environments. The infrastructure, consisting of compute, storage, networking, and WAN gateways, has to be managed remotely with the lowest possible OPEX.
- Managing isolation between multiple tenants (users): Distributed GPU sites host multiple tenants that demand the highest level of security between tenants. These tenants can range from 1st-party telco AI/ML applications to 3rd-party ISV or partner AI/ML applications and specialized workloads such as the 5G/6G RAN.
- Matching workloads to the correct GPU site: Workloads have to be mapped to the appropriate site for latency, bandwidth, compliance, or data gravity reasons.
- Maximizing utilization: Given the high cost of GPUs, utilization has to be as close to 100% as possible at all times. The edge GPUaaS provider requires dynamic scaling of tenant infrastructure for transactional jobs and efficient job scheduling for batch jobs. There may also be a default tenant that is registered with marketplaces such as the NVIDIA DGX Lepton™ Cloud and utilities such as NVIDIA Cloud Functions (NVCF) to ensure the utilization of unused GPUs.
The Need for Secure, Dynamic Tenancy and Isolation in AI Workloads
The above challenges require a software layer that provides secure, dynamic, hard isolation for distributed inference. The ideal software solution must offer:
- Zero touch management of the underlying hardware infrastructure potentially across 10,000s of edge sites to slash OPEX
- Isolation between tenants for security and compliance
- Dynamic resource scaling for maximizing GPU utilization
- Registration of underutilized resources with marketplaces for maximizing GPU utilization
5G/6G RAN Workload
A special non-AI/ML workload that is also relevant in this discussion pertains to 5G/6G mobile networks. A key component of the 5G/6G stack is the radio access network (RAN) software that connects mobile devices to the core network, utilizing sophisticated modulation technologies that enable wireless connectivity with increased speed, capacity, and efficiency. The RAN software is an edge workload that requires acceleration. For this reason, it makes sense to run the 5G/6G RAN software on the same edge inference infrastructure described above. Rather than requiring dedicated hardware that is often grossly underutilized (typically in the 20%-30% range), unified GPU hardware for AI-and-RAN improves the effective hardware utilization at RAN sites.
Introducing aarna.ml GPU Cloud Management Software
The aarna.ml GPU Cloud Management Software (CMS) provides the following functionality:

- On-demand isolation spanning CPU, GPU, network, storage, and the WAN gateway
- Bare metal, virtual machine, or container instances
- Automated infrastructure management for tenants with scale-out and scale-in
- Admin functionality to discover, observe, and manage the underlying hardware (compute, networking, storage) across 10,000s of sites
- Billing and user management with role-based access control (RBAC)
- Integration with open source frameworks (Ray, vLLM) or 3rd-party PaaS offerings (Red Hat OpenShift and more)
- Integration with NVIDIA DGX Lepton™ Cloud and NVCF to monetize unused capacity (ongoing)
- Centralized management and orchestration of multiple edge locations
Reference Architecture: NVIDIA + aarna.ml for Edge Inference
The aarna.ml GPU CMS, when coupled with NVIDIA MGX and Spectrum-X, solves the above-listed problems for edge GPUaaS providers. The high-level topology diagram for the distributed inference reference architecture (at each edge site) is shown below.

Figure: Simplified aarna.ml GPU CMS Topology View
The components for this architecture are:
- NVIDIA MGX servers with optional NVIDIA BlueField-3 DPUs
- NVIDIA Spectrum-X switches for in-band communication
- NVIDIA networking switches for out-of-band (OOB) management
- NVCF
- External High Performance Storage (HPS) from partner solutions (optional)
- aarna.ml GPU Cloud Management Software (CMS) - deployed at a centralized location with the ability to manage multiple edge locations
- Other components from the local IT infrastructure, such as an external gateway, DNS server, etc.
The infrastructure is installed at the edge locations, along with other software components including the CMS, and all the hardware-related tests are performed before onboarding the resources to the CMS.
Once this initial step is completed, the site administrator (Admin persona) discovers (or onboards) the infrastructure using the CMS and creates the underlay network using the Spectrum switches and the BlueField-3 SuperNICs or DPUs on the MGX servers. The Admin then goes on to create tenants, which are the logical entities that run different types of workloads (RAN or AI) on the same physical infrastructure. The tenants are allocated resources, which could be one or more MGX servers, or a fraction of a server or GPU using a combination of virtual machine and Multi-Instance GPU (MIG) technology, along with any additional external storage.
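To make the tenant-and-allocation flow concrete, here is a minimal sketch of what an Admin-driven interaction could look like. The endpoint paths, field names, and MIG profile are hypothetical placeholders for illustration only, not the actual aarna.ml GPU CMS API.

```python
# Hypothetical sketch of the Admin workflow: create a tenant, then allocate
# either a whole MGX server or a fractional (VM + MIG) slice to it.
# All URLs, paths, and payload fields are illustrative placeholders.
import requests

CMS = "https://cms.example.net/api/v1"            # assumed central CMS endpoint
AUTH = {"Authorization": "Bearer <admin-token>"}  # placeholder credential

# Create a tenant as a logical entity for one class of workloads (RAN or AI).
tenant = requests.post(f"{CMS}/tenants",
                       json={"name": "ran-operator-a", "workload_type": "ran"},
                       headers=AUTH).json()

# Allocate a full MGX server plus external storage to the RAN tenant.
requests.post(f"{CMS}/tenants/{tenant['id']}/allocations",
              json={"servers": ["mgx-edge-01"], "external_storage": "hps-pool-1"},
              headers=AUTH)

# Allocate a fraction of a GPU (a VM backed by a MIG slice) to an AI tenant.
requests.post(f"{CMS}/tenants/ai-tenant-b/allocations",
              json={"instance_type": "vm",
                    "gpu_fraction": {"mig_profile": "1g.10gb", "count": 2}},
              headers=AUTH)
```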
The important consideration while allocating these resources is that they need to be isolated, so that each tenant’s workload can run without any performance or security impact from other tenants and there is no “noisy neighbor” situation. The aarna.ml CMS ensures this by providing isolation at all levels: CPU, GPU, memory, network adapters, network switches, internal and external storage, all the way to the external gateway.

The CMS functionality does not end there. It also creates fully isolated and configured Kubernetes clusters, using upstream software or commercial distributions such as Red Hat OpenShift, with all the required Kubernetes controllers such as the GPU Operator deployed on these clusters in a per-tenant manner. This way, each cluster has its own set of dedicated resources: its control-plane (master) and worker nodes with the associated GPUs.
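As a rough illustration of what "fully configured" means in practice, the sketch below inspects a per-tenant cluster from its own kubeconfig context and confirms that GPUs are advertised to Kubernetes. The context name and the gpu-operator namespace are assumptions; adjust them to the actual deployment.

```python
# Sketch: verify a per-tenant Kubernetes cluster from its own kubeconfig
# context. Assumes the GPU Operator lives in a "gpu-operator" namespace and
# that the context name below exists; both are illustrative.
from kubernetes import client, config

config.load_kube_config(context="tenant-a-cluster")   # per-tenant context (assumed name)
core = client.CoreV1Api()

# Dedicated worker nodes expose their GPUs as the extended resource
# "nvidia.com/gpu" once the GPU Operator stack is up.
for node in core.list_node().items:
    gpus = node.status.allocatable.get("nvidia.com/gpu", "0")
    print(f"{node.metadata.name}: {gpus} allocatable GPU(s)")

# Confirm the GPU Operator components are running in this cluster.
for pod in core.list_namespaced_pod("gpu-operator").items:
    print(f"{pod.metadata.name}: {pod.status.phase}")
```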

The Admin can then use per-tenant resources to run various workloads with guaranteed performance and security. These workloads include 5G/6G functions that require GPUs for acceleration, such as RAN Distributed Unit (DU) components that can use the NVIDIA Aerial SDK to accelerate their L1 functionality on the GPUs. Managing the RAN DU brings additional complexity, such as managing the RAN Service Management and Orchestration (SMO) software, configuring fronthaul switches, and configuring the PTP (Precision Time Protocol) grandmaster. These functions are also performed by the aarna.ml GPU CMS.
The Admin can also dedicate some of the GPU capacity to running AI workloads. This can be done in two ways:
- Using the aarna.ml GPU CMS, the Admin can create one or more fully isolated Kubernetes clusters and register these clusters with NVCF. The NVCF service can then schedule distributed inference jobs on these clusters.
- Alternatively, using the aarna.ml GPU CMS, the Admins or the tenants can schedule any cloud-native inference workloads on these fully isolated Kubernetes clusters using the application catalog included with the aarna.ml GPU CMS.
Moreover, the Kubernetes clusters that are registered with NVCF, or those used by other tenants for AI/ML workloads, can be dynamically scaled out (by adding more worker nodes with GPUs) or scaled in (by removing existing worker nodes). This process can be automated by a programmable policy engine in the aarna.ml GPU CMS. As an example, a policy can query a GenAI model that predicts RAN traffic patterns and, based on its recommendation, the GPU CMS can scale in the cluster running RAN workloads and add the freed worker nodes (after re-provisioning them if needed) to the NVCF or DGX Lepton cluster. This automatically expands the capacity available for inference workloads during off-peak periods of RAN traffic, and performs the reverse operation when RAN traffic starts increasing.
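A minimal sketch of such a policy is shown below. The forecast source and the CMS scaling calls (scale_in, reprovision, scale_out) are hypothetical placeholders for whatever the policy engine is actually wired to, and the thresholds are arbitrary.

```python
# Illustrative scaling policy: shift GPU worker nodes between the RAN cluster
# and the inference (NVCF / DGX Lepton) cluster based on predicted RAN load.
# The `cms` client and its methods are hypothetical placeholders.
from dataclasses import dataclass

@dataclass
class Forecast:
    ran_load_pct: float   # predicted RAN utilization for the next window (0-100)

def rebalance(forecast: Forecast, cms, low: float = 30.0, high: float = 70.0, step: int = 1):
    if forecast.ran_load_pct < low:
        # Off-peak RAN traffic: free a worker node and hand it to inference.
        nodes = cms.scale_in(cluster="ran-du", count=step)
        cms.reprovision(nodes)                       # wipe and reimage if needed
        cms.scale_out(cluster="nvcf-inference", nodes=nodes)
    elif forecast.ran_load_pct > high:
        # RAN traffic rising again: reverse the move.
        nodes = cms.scale_in(cluster="nvcf-inference", count=step)
        cms.reprovision(nodes)
        cms.scale_out(cluster="ran-du", nodes=nodes)
```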
In both cases, the aarna.ml GPU CMS also takes care of providing external connectivity to these inference endpoints by configuring the necessary infrastructure, such as external gateways, L4 load balancers, and firewalls. This entire process is performed dynamically, with all the security considerations, and without any need for manual intervention. As an example, the inference endpoints can be exposed using DNS names that map to the public IP range available at the edge location. The public IP may then need to be translated to the internal IP of the cluster, and any site-specific firewall rules and security groups configured. All of this is handled seamlessly.
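The sketch below illustrates the sequence of steps involved: allocating a public IP, publishing a DNS record, adding the NAT translation, and opening the firewall. The function and method names are hypothetical; real deployments would drive the site DNS, gateway, and firewall through their own APIs.

```python
# Hypothetical sketch of exposing a tenant's inference endpoint externally.
# Every method on `cms`, plus the site and domain names, is an illustrative
# placeholder rather than a real API.
def publish_endpoint(cms, tenant: str, service: str) -> str:
    public_ip = cms.allocate_public_ip(site="edge-site-01")         # from the site's public range
    cms.create_dns_record(name=f"{service}.{tenant}.edge.example.com",
                          address=public_ip)                        # DNS name for the endpoint
    cms.add_nat_rule(public_ip=public_ip,
                     internal_ip=cms.service_vip(tenant, service),  # cluster-internal VIP
                     port=443)
    cms.add_firewall_rule(dest=public_ip, port=443, action="allow")  # site-specific rule
    return public_ip
```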
This functionality is depicted below:

Note that scaling Kubernetes clusters in this way is distinct from autoscaling workloads, where the number of pods is scaled in response to application demand.
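For contrast, this is what pod-level autoscaling looks like inside an existing tenant cluster (a standard Kubernetes HorizontalPodAutoscaler); the context and deployment names are illustrative. Cluster scaling, by comparison, adds or removes whole GPU worker nodes and is driven by the CMS rather than by Kubernetes itself.

```python
# Pod-level autoscaling inside an existing cluster, for contrast with
# CMS-driven cluster scaling. The context and deployment names are illustrative.
from kubernetes import client, config

config.load_kube_config(context="tenant-a-cluster")
hpa = client.V1HorizontalPodAutoscaler(
    metadata=client.V1ObjectMeta(name="inference-hpa"),
    spec=client.V1HorizontalPodAutoscalerSpec(
        scale_target_ref=client.V1CrossVersionObjectReference(
            api_version="apps/v1", kind="Deployment", name="inference-server"),
        min_replicas=1,
        max_replicas=8,
        target_cpu_utilization_percentage=70))
client.AutoscalingV1Api().create_namespaced_horizontal_pod_autoscaler(
    namespace="default", body=hpa)
```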
This approach enables all the edge locations to fully utilize their GPU infrastructure without worrying about the security and performance concerns associated with sharing common resources. The aarna.ml GPU CMS guarantees isolation while maintaining performance and maximizing utilization.
Conclusion: The Path to Scalable, Securely Isolated, Distributed AI
Edge inference at scale demands infrastructure that is remotely manageable, securely multi-tenant, and utilized to the fullest. NVIDIA MGX and Spectrum-X, combined with the aarna.ml GPU Cloud Management Software, provide a reference architecture that meets these requirements. To get started, learn more about the solution, try the reference architecture in your own edge environment, and join the ecosystem.