Running a Neural Network on the NPU of the i.MX 8M Plus: Linumiz Blog

Hello everyone. I am Santhosh, an embedded software engineer at Linumiz. In this article we look at running a neural network on the NPU of the i.MX 8M Plus: a vendor NPU, but one driven entirely by an open-source stack. This is the written companion to the talk; you can watch the full demo below.

Why use an NPU at all?

You can always run inference on the most powerful general-purpose block in the system: the CPU. The problem is that the CPU uses a general-purpose pipeline to perform the multiply-accumulate (MAC) operations at the heart of a neural network, so it takes far longer to produce a result.

An NPU, on the other hand, has dedicated MAC units. It runs the same inference many times faster while consuming much less power. A GPU sits in between - it has more cores than a CPU, but they are still general-purpose compared to the fixed-function MAC arrays of an NPU. For running inference, the NPU wins on both speed and power.

That is the whole reason we reach for the NPU instead of leaving the work on the CPU or GPU.
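
To make the term concrete: a multiply-accumulate is one multiplication whose product is added to a running sum, and a single output of a convolution or fully connected layer is a long chain of them. The sketch below is a plain-Python illustration, not code from the stack described in this article; the function name `mac_dot` and the sample values are made up for the example.

```python
def mac_dot(weights, inputs, bias=0):
    """Compute one neuron output as a chain of multiply-accumulate steps."""
    if len(weights) != len(inputs):
        raise ValueError("weights and inputs must have the same length")
    acc = bias
    for w, x in zip(weights, inputs):
        acc += w * x  # one MAC: multiply, then add to the accumulator
    return acc


# Integer sample values keep the arithmetic exact.
weights = [2, -1, 3, 0]
inputs = [5, 4, 1, 7]
result = mac_dot(weights, inputs, bias=1)
print(result)
```

A CPU works through this loop a few elements at a time, whereas an NPU has arrays of hardware MAC units that carry out many of these steps at once.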

The software stack: Mesa and Gallium3D

The foundation of the open-source approach is Mesa, the open-source graphics library used on Linux. Inside Mesa there is an architecture called Gallium3D. The key idea of Gallium3D is to split a driver into two halves:

The state tracker (the upper half) is what the application talks to through API calls.
The pipe driver (the lower half) is what talks to the hardware.

We reuse this same split to drive a neural network instead of graphics. The application, built on TensorFlow Lite, issues API calls to the upper half. In this path the upper half is a component called Teflon: it hands the work straight down to the pipe driver, and the pipe driver talks to the kernel to drive the hardware.

The Vivante GPU/NPU IP is driven by Etnaviv, the open-source driver for Vivante hardware. Etnaviv already uses Mesa to drive the graphics stack. To run inference, we reuse that pipe driver but plug in Teflon at the front end to drive the neural network. In other words, we use the Mesa architecture but skip the 3D graphics part entirely and route the work to the machine-learning path.

Teflon: the front end for inference

Teflon is the front-end layer that the TensorFlow Lite application calls directly. Its job is to offload supported operations to the NPU through the Gallium driver.

Here is how the handoff actually works:

The application runs the model using the C or Python TensorFlow Lite API and hands off to the TensorFlow Lite runtime.
TensorFlow Lite talks to Teflon (a delegate) to decide which operations are supported on the NPU and which must stay on the CPU.
The model is split into subgraphs. Supported subgraphs are routed to the NPU through Teflon; the rest run on the CPU through XNNPACK (the default CPU delegate).
The CPU and NPU each execute their own subgraphs, and the TensorFlow Lite runtime passes tensors between them to assemble the final result. Teflon itself is loaded into the runtime as a shared library.
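
The splitting step can be sketched as follows. This is a simplified illustration, not Teflon's actual partitioning code: it treats the model as a flat list of operation names (real models are graphs), and the set of NPU-supported operations is an assumption chosen for the example.

```python
from itertools import groupby

# Assumption for illustration: the ops the NPU back end accepts.
NPU_SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D"}


def partition(ops, supported=NPU_SUPPORTED):
    """Group consecutive ops into (target, [ops]) subgraphs."""
    def target(op):
        return "NPU" if op in supported else "CPU"
    # groupby starts a new group every time the target changes.
    return [(tgt, list(run)) for tgt, run in groupby(ops, key=target)]


# A toy op sequence, loosely shaped like the tail of a classifier.
model = ["CONV_2D", "DEPTHWISE_CONV_2D", "CONV_2D",
         "AVERAGE_POOL_2D", "CONV_2D", "RESHAPE", "SOFTMAX"]
subgraphs = partition(model)
for tgt, ops in subgraphs:
    print(tgt, ops)
```

Each unsupported operation in the middle of the model forces a switch back to the CPU and starts a new subgraph, which is why the number of supported operations matters for overall speed.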

Teflon is also called a "non-stick" layer, because it does not depend on a specific piece of hardware. Using the same Teflon front end you can drive either Vivante or Rockchip NPUs - the hardware-specific part lives in the back end (Etnaviv, for example). That separation is what makes the approach portable: supporting another NPU means writing a new back end, not a new front end.
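
The front-end/back-end separation can be pictured with a small interface sketch. Everything here is hypothetical and only mirrors the shape of the design: the class names, the `submit` method and the returned strings are invented for illustration and are not Mesa or Teflon APIs.

```python
from typing import Protocol


class PipeDriver(Protocol):
    """Lower half: the only layer that knows about a specific NPU."""
    name: str

    def submit(self, op: str) -> str: ...


class EtnavivBackend:
    """Stand-in for a Vivante back end."""
    name = "etnaviv"

    def submit(self, op: str) -> str:
        return f"{self.name}: ran {op}"


class RockchipBackend:
    """Stand-in for a Rockchip back end (hypothetical name)."""
    name = "rockchip"

    def submit(self, op: str) -> str:
        return f"{self.name}: ran {op}"


class FrontEnd:
    """Upper half: hardware-independent entry point (the role Teflon plays)."""

    def __init__(self, driver: PipeDriver):
        self.driver = driver

    def run(self, ops):
        # The front end never branches on the hardware type.
        return [self.driver.submit(op) for op in ops]


ops = ["CONV_2D", "DEPTHWISE_CONV_2D"]
vivante_log = FrontEnd(EtnavivBackend()).run(ops)
rockchip_log = FrontEnd(RockchipBackend()).run(ops)
print(vivante_log)
print(rockchip_log)
```

The front end is identical in both runs; only the object passed to it changes.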

Execution stack, end to end

Looking at the full execution stack from top to bottom:

Application - any model, such as image classification or object detection.
Runtime - TensorFlow Lite, which holds the delegate.
Delegate / front end - Teflon, which takes the supported operations and interfaces directly with the Mesa back end.
Back end and kernel - the same graphics back end (Etnaviv) and kernel driver that drive the hardware.

We are using the same graphics stack, but instead of rendering graphics we run a neural network. That is only possible because of the Teflon front end.

Results

Here are the numbers from our logs running the same inference:

Path	Total inference time
NPU (Teflon delegate)	about 8 ms
CPU (XNNPACK delegate)	about 167 ms

That is roughly a 22x speed-up on the NPU (about 166 ms against about 7.5 ms per iteration before rounding), at much lower power. The NPU has dedicated, parallel hardware for the MAC operations, while the CPU pushes the same work through general-purpose pipelines with far less parallelism. The figures are averages over four iterations.
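
The ratio is easy to check. The sketch below divides the CPU time by the NPU time, once with the rounded values from the table and once with the per-iteration averages reported in the demo section (about 166 ms and about 7.5 ms); the function name `speedup` is just for illustration.

```python
def speedup(cpu_ms, npu_ms):
    """Return how many times faster the NPU run is than the CPU run."""
    return cpu_ms / npu_ms


# Rounded figures from the results table.
table_ratio = speedup(167, 8)
# Per-iteration averages quoted in the demo section.
demo_ratio = speedup(166, 7.5)
print(f"table: {table_ratio:.1f}x, demo: {demo_ratio:.1f}x")
```

The rounded table values give a little under 21x, while the averages from the demo give a little over 22x, which is where the 22x figure comes from.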

The effort behind the stack

How was this stack actually built? It was reverse-engineered from the vendor blob. The vendor does not publish a manual for programming the NPU, but the command streams going to the hardware can be captured. The work involved intercepting those command streams, decoding them, and re-implementing them on top of the open-source stack.

This work was done by Tomeu Vizoso and sponsored by Ideas on Board - they made it possible to drive a vendor NPU entirely on an open-source stack.

Takeaway

We have a vendor NPU driven entirely by an open-source stack:

Mesa and Gallium3D make the architecture possible.
Teflon is the front end and Etnaviv is the back end that drives the hardware.
The result is about 22x faster inference at low power.

Status: this work is merged upstream in Mesa as of version 24.1, so you can run hardware-accelerated inference yourself.

Demo: image classification on the i.MX 8M Plus

For the demonstration I used a PHYTEC phyBOARD-Pollux board, which is based on the i.MX 8M Plus SoC and includes the NPU. Teflon is already included in the build.

I ran image classification on two images.

First, on the CPU. Running the classifier without passing the Teflon library uses the CPU delegate. The run shows four iterations going through XNNPACK (the default CPU delegate), with an average of about 166 ms per iteration. The total is quoted as about 600 ms, although four iterations at that average come to roughly 664 ms, so treat the total as a loose figure. Classifying the Grace Hopper image (a person in military uniform) returns the correct label with its confidence level; a second image is correctly classified as a snail. Both run on the CPU.

Then, on the NPU. Running the same Grace Hopper image, this time passing the Teflon library as an external delegate, routes the supported work to the NPU. The total time drops to about 30 ms, an average of about 7.5 ms per iteration. The predicted label is the same, but the run is about 22x faster than on the CPU.
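
The two runs differ only in whether the delegate library is loaded. The following is a minimal sketch of that difference using the Python `tflite_runtime` API (`Interpreter` and `load_delegate`); it is not the script used in the demo, and the helper names `make_interpreter`, `time_invocations` and `compare` are invented here. The TensorFlow Lite import is done lazily so that the timing helper can be tried on any machine.

```python
import time


def make_interpreter(model_path, delegate_path=None):
    """Build a TFLite interpreter, optionally with an external delegate."""
    # Imported lazily so the timing helper works without TFLite installed.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    # No delegate path means the default CPU path is used.
    delegates = [load_delegate(delegate_path)] if delegate_path else None
    interpreter = Interpreter(model_path=model_path,
                              experimental_delegates=delegates)
    interpreter.allocate_tensors()
    return interpreter


def time_invocations(invoke, iterations=4):
    """Call invoke() repeatedly; return (total_ms, average_ms)."""
    start = time.perf_counter()
    for _ in range(iterations):
        invoke()
    total_ms = (time.perf_counter() - start) * 1000.0
    return total_ms, total_ms / iterations


def compare(model_path, teflon_path, iterations=4):
    """Time the same model on the CPU path and then through the delegate."""
    results = {}
    for label, delegate in (("CPU", None), ("NPU", teflon_path)):
        interpreter = make_interpreter(model_path, delegate)
        # Inputs are left as allocated; a real run would set an image first.
        results[label] = time_invocations(interpreter.invoke, iterations)
    return results


# Stand-in workload so the timing helper can be tried without a board.
total_ms, average_ms = time_invocations(lambda: sum(range(1000)))
print(f"total {total_ms:.3f} ms, average {average_ms:.3f} ms")
```

On the board, calling `compare` with the path to a quantized model and the path to the installed Teflon library (both depend on your build) would return the total and average times for each path.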

If you want more detail, you can pass extra arguments to dump the logs. The output lists all of the operations involved in producing the result - around 20 to 30 of them. Some operations (such as average pool, reshape and softmax) are not supported on the NPU, so they fall back to the CPU automatically. The final result, confidence level and predicted class are then displayed, along with the total elapsed time and the average time per iteration.

That is the demonstration: hardware-accelerated inference on the i.MX 8M Plus NPU using the open-source Mesa and Teflon stack.

Thank you for reading.
