NPU config mismatch on U65

Hi, I’m working with the Ethos-U55/U65 NPUs and the ML Embedded Evaluation Kit.

I was measuring the latency of quantized convolution operations on these NPUs. The U55 worked without any error during my experiments, but the U65 fails at invoke time with an NPU config mismatch, even though (as far as I can tell) I set every build configuration to target the U65 correctly.
Below are the detailed steps to reproduce the error.

  • Environment
    vela version: 3.2.0
    ethos-u-vela version: 3.2.0
    tensorflow version: 2.5.0
    ml-embedded-evaluation-kit commit version: MLECO-2921 (ea8ce56630544600b112d24e6bf51307fcbb93ae)
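
For reference, these versions can be confirmed with commands like the following (the git command is assumed to be run from inside the evaluation-kit checkout):

vela --version                       # Vela reports its own version
pip show ethos-u-vela tensorflow     # package versions in the active Python environment
git rev-parse HEAD                   # evaluation-kit commit hash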

  • TFLite Model
    int8 quantized 3x3 convolution layer
    H, W: 28, input channels: 320, output channels: 3
    TFLite file: conv_in28_k3_s1_ci320_co3.tflite - Google Drive

  • TFLite Model Converted with Vela Compiler

vela {tflite_model_path} --accelerator-config=ethos-u65-256 --optimise Performance --config {ini file path} --memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End --output-dir={output_path}

  • Building NPU system

cmake {ml_kit_path} -DETHOS_U_NPU_MEMORY_MODE=Dedicated_Sram -DETHOS_U_NPU_CONFIG_ID=Y256 -DUSE_CASE_BUILD=inference_runner -Dinference_runner_MODEL_TFLITE_PATH={vela_tflite_path} -DEthos_U_NPU_ENABLED=1 -DETHOS_U_NPU_ID=U65

make -j4

  • Run FVP

{PATH}/FVP_Corstone_SSE-300_Ethos-U65 {Build_path}/bin/ethos-u-inference_runner.axf -C ethosu.num_macs=256

  • The Results (stdout of the FVP)

INFO - WARN - MPS3_SCC->CFG_ACLK reads 0. Assuming default clock of 32000000
Processor internal clock: 32000000Hz
INFO - V2M-MPS3 revision A
INFO - Application Note AN228, Revision C
INFO - MPS3 build 3
INFO - MPS3 core clock has been set to: 32000000Hz
INFO - CPU ID: 0x411fd220
INFO - CPU: Cortex-M55 r1p0
INFO - TA0 values set
INFO - TA1 values set
DEBUG - EthosU IRQ#: 56, Handler: 0x0x9e03
INFO - Ethos-U device initialised
INFO - Ethos-U version info:
INFO - Arch: v1.1.0
INFO - Driver: v0.16.0
INFO - MACs/cc: 256
INFO - Cmd stream: v0
INFO - Target system design: Arm Corstone-300 - AN552
DEBUG - system tick config ready
INFO - ARM ML Embedded Evaluation Kit
INFO - Version 22.2.0 Build date: Apr 15 2022 @ 05:23:20
INFO - Copyright (C) ARM Ltd 2021-2022. All rights reserved.
DEBUG - loading model from @ 0x0x70000000
DEBUG - loading op resolver
INFO - Creating allocator using tensor arena in DDR/DRAM
DEBUG - Created new allocator @ 0x0x70203230
INFO - Allocating tensors
INFO - Model INPUT tensors:
DEBUG - tensor is assigned to 0x0x70203170
INFO - tensor type is INT8
INFO - tensor occupies 250880 bytes with dimensions
INFO - 0: 1
INFO - 1: 28
INFO - 2: 28
INFO - 3: 320
INFO - Quant dimension: 0
INFO - Scale[0] = 0.996078
INFO - ZeroPoint[0] = -128
INFO - Model OUTPUT tensors:
DEBUG - tensor is assigned to 0x0x70203130
INFO - tensor type is INT8
INFO - tensor occupies 2028 bytes with dimensions
INFO - 0: 1
INFO - 1: 26
INFO - 2: 26
INFO - 3: 3
INFO - Quant dimension: 0
INFO - Scale[0] = 3.780673
INFO - ZeroPoint[0] = 60
INFO - Activation buffer (a.k.a tensor arena) size used: 253380
INFO - Number of operators: 1
INFO - Operator 0: ethos-u
DEBUG - Populating input tensor 0@0x70203170
DEBUG - Total input size to be populated: 250880
DEBUG - system tick config ready
DEBUG - NPU IDLE: 5 cycles
DEBUG - NPU AXI0_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU AXI0_WR_DATA_BEAT_WRITTEN: 0 beats
DEBUG - NPU AXI1_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU ACTIVE: 14 cycles
DEBUG - NPU TOTAL: 19 cycles
E: NPU config mismatch. npu.product=ERROR - Invoke failed.
DEBUG - NPU IDLE: 65 cycles
DEBUG - NPU AXI0_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU AXI0_WR_DATA_BEAT_WRITTEN: 0 beats
DEBUG - NPU AXI1_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU ACTIVE: 14 cycles
DEBUG - NPU TOTAL: 79 cycles
ERROR - Inference failed.
INFO - program terminating…

The execution ends with an NPU config mismatch invocation failure.

To clarify, my question is: which part of the process above did I get wrong? I’m quite confused, as all of the configurations above seem fine to me.

Again, changing every build configuration to target the U55 results in a clean run without any error.
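
For completeness, the equivalent U55 flow looks roughly like the sketch below; the MAC count (128) and memory mode (Shared_Sram) are assumed defaults, and the placeholders are the same as in the U65 commands above:

vela {tflite_model_path} --accelerator-config=ethos-u55-128 --optimise Performance --config {ini file path} --memory-mode=Shared_Sram --system-config=Ethos_U55_High_End --output-dir={output_path}

cmake {ml_kit_path} -DETHOS_U_NPU_MEMORY_MODE=Shared_Sram -DETHOS_U_NPU_CONFIG_ID=H128 -DUSE_CASE_BUILD=inference_runner -Dinference_runner_MODEL_TFLITE_PATH={vela_tflite_path} -DETHOS_U_NPU_ENABLED=1 -DETHOS_U_NPU_ID=U55

{PATH}/FVP_Corstone_SSE-300_Ethos-U55 {Build_path}/bin/ethos-u-inference_runner.axf -C ethosu.num_macs=128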

Thanks for your attention.

Hi,

Thanks for posting the question here and providing the details, which were useful in spotting the clues. There is nothing wrong with the flow you have described; the issue is the Vela version you are using (3.2.0). In the information you provided earlier, you mentioned that the default model for the Y256 configuration works, right? That is because the Python virtual environment created under resources_downloaded uses Vela 3.3.0 (see ethos-u-vela · PyPI). You should be able to use that same virtual environment to optimise your model with the same command as before. If you use this newly optimised model, the error should disappear.
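
For example, re-optimising from that environment could look like the following; the virtual-environment path is an assumption based on where the set-up script normally creates it, and the placeholders match the command you already used:

source {ml_kit_path}/resources_downloaded/env/bin/activate
vela --version    # should now report 3.3.0
vela {tflite_model_path} --accelerator-config=ethos-u65-256 --optimise Performance --config {ini file path} --memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End --output-dir={output_path}

Point -Dinference_runner_MODEL_TFLITE_PATH at the newly written model when you rebuild.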

We will raise the incomplete error message reported by the driver with the team. Ideally, the error should be verbose enough to point out that the generated command stream comes from an older/incompatible version. I suspect the truncated error string is also down to the binary being compiled with the Arm GNU Embedded toolchain and how it handles stderr, but we will see how it can be improved.

Hope this helps,
ML Ecosystem Team

Hi, thanks for your help! Changing the Vela compiler version removes the error above!
For your information, I’ll attach the standard output and standard error dumps produced with the VERBOSE=1 option.

However, I still have an issue with the Vela compiler with version > 3.3.0.
After compiling the TFLite model, Vela writes a summary of model statistics to a single CSV file, but the format of that CSV seems quite odd. (conv_in28_k3_s1_ci320_co3_summary_Ethos_U65_High_End.csv - Google Drive)
For example, you can see that the lengths of the data row and the header row do not match.
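
For reference, one quick way to see the mismatch is to compare the field counts of the header row and the data row (this assumes the summary CSV contains no quoted commas):

awk -F',' '{ printf "row %d: %d fields\n", NR, NF }' conv_in28_k3_s1_ci320_co3_summary_Ethos_U65_High_End.csv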

Could your team check this issue?

Sure, there is indeed an error in the generated CSV file. We will report this to the Vela team. Thanks for bringing it to our attention.
