Hello
I’ve been experimenting with the Arm ML Embedded Evaluation Kit for running inference on a deep learning model, but I’m running into memory constraints when deploying models that exceed a certain size.
What strategies or optimizations are recommended for managing memory usage effectively while maintaining performance?
Are there specific model compression techniques or memory management best practices suited to this platform? I have gone through the kit's documentation but still need help.
Also, has anyone had success running larger models by using quantization or pruning?
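For context, this is the kind of full-integer post-training quantization flow I have in mind (a minimal sketch using the standard TensorFlow Lite converter; the model path, input shape, and representative dataset here are placeholders, not my actual setup):

```python
import numpy as np
import tensorflow as tf

# Load the trained Keras model (path is a placeholder).
model = tf.keras.models.load_model("my_model.h5")

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]

# The representative dataset drives int8 calibration; replace the
# random tensors with a few hundred real input samples.
def representative_dataset():
    for _ in range(100):
        yield [np.random.rand(1, 224, 224, 3).astype(np.float32)]

converter.representative_dataset = representative_dataset
# Force full int8, since integer-only models are what the
# embedded targets in the evaluation kit expect.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

My understanding is that the resulting int8 .tflite file would then be compiled with Vela before deployment, and should be roughly a quarter the size of the float32 model. Is this the right approach, and does pruning on top of it buy much in practice?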
Thank you!