How to set U55 to 'Sram_Only' mode on FPGA?

Hi,

When I set the memory mode to ‘Sram_Only’ for U55, I see a significant cycle reduction in the Vela output compared to the default ‘Shared_Sram’. The application is the noise_reduction use case.

However, on the MPS3 FPGA, I don’t see a similar reduction. In my build, ETHOS_U_NPU_MEMORY_MODE=Sram_Only is set. Please let me know if I have missed something or done something wrong.

Thanks,
Kaiping

Hi @Kaipingli88,

Thanks for raising this. We will run the numbers on our side and share them with you.

It is important to note, however, that the Vela numbers are not always representative of what the actual hardware will show. The FPGA numbers should be definitive, but we will double-check that the build flow actually runs this configuration and that the numbers are better than what the Shared_Sram configuration gives us.

Hi @Kaipingli88,

I’ve run the tests and here are the results for NPU active cycles against the total cycles reported by Vela. As the network runs completely on the NPU, the majority of the NPU’s idle cycles are due to cache invalidation on the CPU side.

  • Shared_Sram:
    • Vela’s estimated performance - total cycles: 146909
    • MPS3 FPGA run - NPU active: 103971
  • Sram_Only:
    • Vela’s estimated performance - total cycles: 22209
    • MPS3 FPGA run - NPU active: 23158
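
To make the comparison explicit, here is a small sketch that derives the ratios from the four cycle counts listed above. The `ratio` helper and the variable names are only illustrative and are not part of the repo or of Vela.

```shell
# Cycle counts copied from the results above.
shared_vela=146909
shared_fpga=103971
sram_vela=22209
sram_fpga=23158

# ratio A B: print A / B rounded to two decimal places (illustrative helper).
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

echo "FPGA speedup, Shared_Sram vs Sram_Only: $(ratio "$shared_fpga" "$sram_fpga")"
echo "Vela speedup, Shared_Sram vs Sram_Only: $(ratio "$shared_vela" "$sram_vela")"
echo "Vela estimate / FPGA active, Shared_Sram: $(ratio "$shared_vela" "$shared_fpga")"
echo "Vela estimate / FPGA active, Sram_Only:   $(ratio "$sram_vela" "$sram_fpga")"
```

With these numbers the measured FPGA speedup is about 4.49, while the Vela estimates alone would suggest about 6.61, mostly because Vela overestimates the Shared_Sram case (ratio of about 1.41) and is close for Sram_Only (about 0.96).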

I suspect you get different results on your side? If it helps, here are my steps (all commands are from the root of the repo unless specified):

  1. Optimise the rnnoise model for Sram_Only (the default Shared_Sram model is already in resources_downloaded).
$ source ./resources_downloaded/env/bin/activate
$ vela --list-config-files
$ vela --accelerator-config=ethos-u55-128 \
     --optimise Performance \
     --config Arm/vela.ini \
     --memory-mode=Sram_Only \
     --system-config=Ethos_U55_High_End_Embedded \
     --output-dir resources_downloaded/noise_reduction/sram-only \
     resources_downloaded/noise_reduction/rnnoise_INT8.tflite
  2. Build the default shared SRAM configuration first:
$ cmake -B ./build-rnnoise-shared-sram \
    -DUSE_CASE_BUILD=noise_reduction \
    --preset=mps3-300-clang
$ cmake --build ./build-rnnoise-shared-sram/ -j
  3. Build the SRAM only configuration, using the model optimised in the first step:
$ cmake -B ./build-rnnoise-sram-only \
    -DUSE_CASE_BUILD=noise_reduction \
    --preset=mps3-300-clang \
    -Dnoise_reduction_MODEL_TFLITE_PATH=resources_downloaded/noise_reduction/sram-only/rnnoise_INT8_vela.tflite \
    -DETHOS_U_NPU_MEMORY_MODE=Sram_Only
$ cmake --build ./build-rnnoise-sram-only/ -j
  4. Deploy the binaries from these build directories on the MPS3 FPGA/FVP to see the NPU active cycles. As mentioned, for me, the SRAM only build is about 4.5 times quicker than the shared SRAM build.
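
If the two builds show no difference, one quick sanity check is to confirm what each build directory was actually configured with. The sketch below reads the values back from `CMakeCache.txt`, which CMake writes into every build directory with entries of the form `NAME:TYPE=VALUE`. The `cache_value` helper is hypothetical, and it assumes both options end up as cache entries; it only shows what was configured, not what the firmware does at run time.

```shell
# cache_value BUILD_DIR VAR: print the cached value of VAR from BUILD_DIR,
# or nothing if VAR is not a cache entry (illustrative helper).
cache_value() {
    sed -n "s/^$2:[A-Z]*=//p" "$1/CMakeCache.txt"
}

for dir in ./build-rnnoise-shared-sram ./build-rnnoise-sram-only; do
    if [ -f "$dir/CMakeCache.txt" ]; then
        echo "$dir:"
        echo "  ETHOS_U_NPU_MEMORY_MODE=$(cache_value "$dir" ETHOS_U_NPU_MEMORY_MODE)"
        echo "  noise_reduction_MODEL_TFLITE_PATH=$(cache_value "$dir" noise_reduction_MODEL_TFLITE_PATH)"
    else
        echo "$dir: no CMakeCache.txt found (not configured yet?)"
    fi
done
```

For the SRAM only build, I would expect the memory mode to read Sram_Only and the model path to point at the model optimised in the first step; a stale cache from an earlier configure is one possible reason for seeing no change.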

Hope this is useful.

Regards,
Kshitij

Hi Kshitij,

Thanks a lot for your help! I got the same results on my side.

However, I have a question about how you reached the conclusion that “As the network runs completely on NPU, majority of NPU’s idle cycles are due to cache invalidation on the CPU side.” Since the network runs entirely on the NPU, it seems the NPU total cycle count should be used as the performance metric. Could you explain your conclusion in more detail?

Regards,
Kaiping

Hi @Kaipingli88

Sorry, I missed this message. You are right, the NPU total cycle count can be used after the change in commit de54e1606b21d333e126525807414455d2ff1840. Previously, the idle count was too high because of several cache maintenance overheads, which have now been reduced to several thousand cycles. NPU active cycles will now be close to NPU total cycles.
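
To check whether a local checkout already includes that change, something like the following sketch can be run from the root of the repo. It relies on `git merge-base --is-ancestor`, which exits with status 0 when the given commit is reachable from HEAD; the `has_commit` helper name is only illustrative.

```shell
# has_commit REPO_DIR SHA: succeed if SHA is an ancestor of (or equal to)
# HEAD in REPO_DIR. Fails if the commit is unknown or REPO_DIR is not a repo.
has_commit() {
    git -C "$1" merge-base --is-ancestor "$2" HEAD 2>/dev/null
}

fix=de54e1606b21d333e126525807414455d2ff1840
if has_commit . "$fix"; then
    echo "This checkout includes $fix: NPU total cycles should be usable."
else
    echo "This checkout does not include $fix (or is not a git checkout)."
fi
```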

Hope this helps.

Thanks,
Kshitij

Hi Kshitij,

Thanks a lot for fixing this issue and letting me know. This topic can be closed.

Thanks,
Kaiping