Edge AI Revolution: Unleashing Arm v9 and Boosted GEMM for Next-Generation IoT Devices

Arm is driving a transformation in edge artificial intelligence with the launch of its new Cortex-A320 CPU core, the first Arm v9 core specifically designed for the IoT. Paired with Arm’s Ethos-U85 NPU, this innovation is set to enable complex generative and agentic AI applications on edge devices—even supporting models with over one billion parameters. This advancement marks not just an incremental upgrade but a fundamental shift in how edge computing and AI processing are approached.

The Cortex-A320 and Arm v9: A New Era for Edge Computing


The Cortex-A320 brings the Arm v9 architecture to IoT-class processors for the first time. Compared to its predecessor, the Cortex-A35 on Arm v8, the new core delivers significantly higher AI performance and stronger security. Key improvements include support for larger, more flexible address spaces, which is crucial for handling complex AI workloads. The architecture introduces new security measures such as pointer authentication and branch target identification, which help mitigate jump- and return-oriented programming attacks. Additionally, Arm’s memory tagging extension improves protection against memory safety exploits. These features are vital for IoT devices, where secure and efficient processing is paramount.
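
These protections are largely toolchain-driven rather than something application code opts into line by line. As a minimal sketch (exact flag spellings vary by compiler version), any C translation unit built for an Armv9 target with GCC or Clang can request pointer authentication and BTI through the standard branch-protection option:

/* demo.c - pointer authentication and BTI are applied by the compiler.
   Build for an Armv9 target:
     cc -march=armv9-a -mbranch-protection=standard -O2 -S demo.c
   In the emitted assembly, functions that save the return address are
   typically bracketed with paciasp/autiasp (signing and authenticating
   the link register), and valid indirect-branch targets are marked with
   bti landing pads. */
#include <stdio.h>

static void greet(const char *who) {
    printf("hello, %s\n", who);   /* the call forces a saved, signed LR */
}

int main(void) {
    greet("edge");
    return 0;
}

Memory tagging is likewise enabled at build and run time rather than in source code, with the heap allocator and kernel cooperating to tag allocations and trap mismatched accesses.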

A notable upgrade in the Cortex-A320 is its native ability to drive the Ethos-U85 NPU. Previously, only Cortex-M cores drove such NPUs; now the Cortex-A320 can manage the NPU directly, giving it faster access to a much larger memory space. That headroom is essential for the intensive computations behind large-model inference. Arm projects an overall 8× performance uplift from the Cortex-A320 paired with the Ethos-U85, compared with older platforms such as a Cortex-M85 driving the NPU.
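
To illustrate why the larger address space matters, here is a deliberately simplified sketch. The types and functions below (npu_region_t, npu_map_region, npu_invoke) are hypothetical placeholders, not the real Ethos-U85 driver API; the point is that an A-class core can map the roughly gigabyte-scale weight buffer a billion-parameter int8 model needs, which a typical microcontroller memory map cannot.

/* Hypothetical placeholder driver -- NOT the real Ethos-U85 API. The
   stubs stand in for a vendor driver so the sketch is self-contained. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { void *base; size_t len; } npu_region_t;

static int npu_map_region(npu_region_t *r, size_t len) {
    r->base = malloc(len);   /* real driver: map DMA-visible memory   */
    r->len  = len;
    return r->base ? 0 : -1;
}

static int npu_invoke(npu_region_t *weights) {
    (void)weights;           /* real driver: start inference, wait IRQ */
    return 0;
}

int main(void) {
    npu_region_t weights;
    /* ~1 GB of int8 weights for a 1B-parameter model: addressable from
       an A-class core, far beyond a typical Cortex-M memory map. */
    if (npu_map_region(&weights, (size_t)1 << 30) != 0) {
        fprintf(stderr, "weight buffer allocation failed\n");
        return 1;
    }
    int rc = npu_invoke(&weights);
    free(weights.base);
    return rc;
}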

Boosted GEMM: Accelerating Matrix Multiplication for AI


One of the standout features of the new Arm v9 core is its ability to boost GEMM, or General Matrix Multiply—a fundamental operation for many AI and machine learning applications. Matrix multiplication is a core component in deep learning, used extensively in neural network computations. The Cortex-A320 incorporates new instructions that improve GEMM performance by an order of magnitude. This means that the mathematical operations behind AI models, such as convolutional neural networks and transformer architectures, can be executed much faster. In addition, scalar compute performance has been improved by 30%, further optimizing the processing of AI workloads. Enhanced GEMM performance not only accelerates model training and inference but also contributes to more energy-efficient computing on edge devices.
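
One concrete example of such an instruction is SMMLA, introduced with the Armv8.6-A i8mm extension and carried forward into Armv9 cores: it multiplies a 2×8 int8 matrix by an 8×2 int8 matrix and accumulates the 2×2 int32 result in a single instruction. The following micro-kernel sketch (assuming a toolchain targeting i8mm, e.g. -march=armv9-a) computes one such tile, which an optimized GEMM would issue across the whole output matrix:

/* 2x2 int32 tile of C += A * B using the SMMLA matrix-multiply
   instruction via its arm_neon.h intrinsic. B is supplied pre-transposed
   (two columns stored as two rows of 8), matching SMMLA's operand layout.
   Compile: cc -march=armv9-a -O2 -c smmla_tile.c */
#include <arm_neon.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_MATMUL_INT8)
void smmla_tile_2x2(const int8_t A[16],   /* rows i, i+1 of A (8 each)    */
                    const int8_t Bt[16],  /* columns j, j+1 of B (8 each) */
                    int32_t C[4])         /* 2x2 accumulator, row-major   */
{
    int32x4_t acc = vld1q_s32(C);         /* load existing accumulator    */
    int8x16_t a   = vld1q_s8(A);
    int8x16_t bt  = vld1q_s8(Bt);
    acc = vmmlaq_s32(acc, a, bt);         /* C2x2 += A2x8 * B8x2          */
    vst1q_s32(C, acc);
}
#endif

Each call performs 32 multiply-accumulates that a scalar loop would execute one at a time, which is roughly where the claimed order-of-magnitude uplift for quantized int8 GEMM comes from.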

Advancements in Software and AI Integration


The Cortex-A320 is designed to work seamlessly with Arm’s AI kernel libraries, collectively known as Kleidi AI. This suite of optimized libraries makes it easier for developers to deploy sophisticated AI workloads directly on the CPU, without the overhead of offloading tasks to the NPU. In practical scenarios, such as a camera system that processes always-on image data with the NPU and then uses the CPU for higher-level tasks on flagged images, running AI workloads directly on the Cortex-A320 can be more efficient. Compatibility with Linux, Android, and common real-time operating systems ensures that developers can migrate code developed for microcontrollers to systems with larger memory spaces, future-proofing today’s AI applications.
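
The camera scenario reduces to a simple dispatch decision in application code. In the sketch below, npu_detect and cpu_classify are hypothetical placeholders (a real pipeline would sit on the vendor’s NPU driver and CPU inference kernels such as those in Kleidi AI); the structure is the point: cheap always-on screening stays on the NPU, and the occasional heavier task runs directly on the Cortex-A320, avoiding offload overhead.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct { const uint8_t *pixels; int w, h; } frame_t;

/* Hypothetical pipeline stages -- placeholders, not a real API. */
static bool npu_detect(const frame_t *f) {
    (void)f;                  /* always-on, low-power screening on the NPU */
    return true;
}

static void cpu_classify(const frame_t *f) {
    (void)f;                  /* heavier model run with CPU-side kernels   */
    puts("flagged frame classified on the CPU");
}

static void process_frame(const frame_t *f) {
    if (npu_detect(f))        /* most frames stop at the cheap NPU check   */
        cpu_classify(f);      /* rare follow-up work stays on the CPU,
                                 with no NPU offload round-trip            */
}

int main(void) {
    frame_t f = { 0, 640, 480 };
    process_frame(&f);
    return 0;
}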

Educational Insights on Arm v9 Architecture


Arm v9 architecture represents a significant leap forward in CPU design. This new generation emphasizes both performance and security, making it well suited to the increasingly complex demands of AI and edge computing. Notably, Arm v9 makes the second-generation Scalable Vector Extension (SVE2) a core feature: SVE2 builds on the original SVE and extends its scalable, length-agnostic vector processing to the full range of workloads previously served by Arm’s Neon SIMD instructions. This enhances vector processing capabilities, which are critical for high-throughput applications like image processing and scientific computation. Support for AI-friendly data types, such as BF16, further optimizes the processing of neural network operations, paving the way for more efficient generative and agentic AI applications on edge devices.
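
What “scalable” means in practice is that SVE code is vector-length agnostic: the same binary runs correctly whether the hardware implements 128-bit or wider vectors. A short sketch using the standard ACLE intrinsics from arm_sve.h (base SVE operations, which any SVE2-capable Armv9 core executes):

/* Scaled vector accumulate, y += a * x, written once for any SVE width.
   Compile: cc -march=armv9-a -O2 -c axpy.c */
#include <arm_sve.h>
#include <stdint.h>

void axpy_f32(float a, const float *x, float *y, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);   /* lanes i..n-1 active  */
        svfloat32_t vx = svld1_f32(pg, x + i);   /* predicated loads     */
        svfloat32_t vy = svld1_f32(pg, y + i);
        vy = svmla_n_f32_x(pg, vy, vx, a);       /* y += a * x per lane  */
        svst1_f32(pg, y + i, vy);                /* predicated store     */
    }
}

The predicate produced by svwhilelt covers the final partial vector, so no scalar clean-up loop is needed.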

Understanding Boosted GEMM in the AI Landscape


General Matrix Multiply (GEMM) is a cornerstone of modern machine learning. It involves multiplying large matrices, an operation central to neural network computation in both deep learning and transformer models. The recent advancements in the Cortex-A320 include new instructions that significantly accelerate GEMM operations. Boosting matrix multiplication performance matters because it directly determines how quickly AI models can be trained and how quickly they can run inference. Faster GEMM means more complex models can be deployed on edge devices without compromising speed or energy efficiency, enabling real-time processing and decision-making across a wide range of IoT applications.
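
For readers who want the operation pinned down, this is all GEMM is in its unoptimized reference form: every output entry is a dot product of a row of A with a column of B. A transformer’s linear layers, for instance, spend most of their time in exactly this loop nest with M, N, and K in the hundreds or thousands, which is why hardware support for it pays off so directly.

/* Reference GEMM, C += A * B, for row-major matrices: A is MxK,
   B is KxN, C is MxN. Deliberately unoptimized; matrix instructions
   and tiled kernels accelerate exactly this computation. */
#include <stddef.h>

void gemm_ref(size_t M, size_t N, size_t K,
              const float *A, const float *B, float *C)
{
    for (size_t i = 0; i < M; i++)
        for (size_t j = 0; j < N; j++) {
            float acc = C[i * N + j];
            for (size_t k = 0; k < K; k++)
                acc += A[i * K + k] * B[k * N + j]; /* dot(row i, col j) */
            C[i * N + j] = acc;
        }
}
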
Arm’s unveiling of the Cortex-A320 within the new Arm v9 architecture signals a major milestone in edge AI technology. With substantial improvements in security, memory handling, and performance—especially through boosted GEMM—the Cortex-A320 is set to power the next generation of IoT devices. By seamlessly integrating with the Ethos-U85 NPU and supporting advanced AI kernel libraries, the new platform not only enhances current applications but also lays the groundwork for innovative, autonomous AI systems. This breakthrough offers a glimpse into a future where edge devices can execute sophisticated generative AI workloads securely and efficiently, revolutionizing how we interact with technology at the edge.

Michal Pukala
Electronics and telecommunications engineer with a Master’s degree in electro-energetics. Experienced lighting designer. Currently working in the IT industry.
