I’ve run the tests; here are the NPU active cycles measured on the MPS3 FPGA against the total cycles reported by Vela. As the network runs entirely on the NPU, the majority of the NPU’s idle cycles are due to cache invalidation on the CPU side.
Shared SRAM:
- Vela’s estimated performance - total cycles: 146909
- MPS3 FPGA run - NPU active: 103971
SRAM only:
- Vela’s estimated performance - total cycles: 22209
- MPS3 FPGA run - NPU active: 23158
I suspect you get different results on your side? If it helps, here are my steps (all commands are from the root of the repo unless specified):
- Optimise the rnnoise model for sram_only (the default shared_sram model is already in resources_downloaded):
$ source ./resources_downloaded/env/bin/activate
$ vela --list-config-files
$ vela --accelerator-config=ethos-u55-128 \
--optimise Performance \
--config Arm/vela.ini \
--output-dir resources_downloaded/noise_reduction/sram-only \
- Build the default shared SRAM configuration first:
$ cmake -B ./build-rnnoise-shared-sram \
$ cmake --build ./build-rnnoise-shared-sram/ -j
- Build the SRAM only configuration, using the model we optimised in the first step:
$ cmake -B ./build-rnnoise-sram-only \
$ cmake --build ./build-rnnoise-sram-only/ -j
- Deploy the resulting binaries on the MPS3 FPGA/FVP to see the NPU active cycles. As mentioned, for me the SRAM only build is about 4.5 times quicker than the shared SRAM one.
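For reference, the 4.5x figure can be reproduced from the cycle counts above with a quick calculation. This is only a sketch: it assumes the first pair of numbers is the shared SRAM run and the second pair is the SRAM only run (which is what the 4.5x ratio implies), and the variable names and the `ratio` helper are made up for illustration.

```shell
#!/bin/sh
# Cycle counts copied from the results above.
# Assumption: first pair = shared SRAM, second pair = SRAM only.
shared_vela=146909
shared_npu=103971
sram_only_vela=22209
sram_only_npu=23158

# Hypothetical helper: awk does the floating-point division
# that plain sh arithmetic cannot.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

echo "Speed-up, SRAM only vs shared SRAM (NPU active): $(ratio "$shared_npu" "$sram_only_npu")x"
echo "Shared SRAM, measured / Vela estimate: $(ratio "$shared_npu" "$shared_vela")"
echo "SRAM only, measured / Vela estimate: $(ratio "$sram_only_npu" "$sram_only_vela")"
```

With these numbers the speed-up comes out at 4.49x; the measured NPU active cycles are about 0.71 of Vela's estimate for shared SRAM and about 1.04 of it for SRAM only.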
Hope this is useful.