Hi, I’m working with the Ethos-U55/U65 NPUs and the ML Embedded Evaluation Kit.
I was measuring the latency of the NPU devices with quantized convolution operations. The U55 device worked without any errors during my experiments. But the U65 device fails at invoke with an NPU config mismatch, even though (I think) I used the right configs throughout the build process to target the U65 device.
Below are the detailed steps to reproduce the error.
- Environment
vela version: 3.2.0
ethos-u-vela version: 3.2.0
tensorflow version: 2.5.0
ml-embedded-evaluation-kit commit version: MLECO-2921 (ea8ce56630544600b112d24e6bf51307fcbb93ae)
- TFLite Model
int8 quantized 3x3 convolution layer
H, W: 28, input channels: 320, output channels: 3
TFLite file: conv_in28_k3_s1_ci320_co3.tflite (Google Drive)
- TFLite Model Converted with Vela Compiler
vela {tflite_model_path} --accelerator-config=ethos-u65-256 --optimise Performance --config {ini file path} --memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End --output-dir={output_path}
- Building NPU system
cmake {ml_kit_path} -DETHOS_U_NPU_MEMORY_MODE=Dedicated_Sram -DETHOS_U_NPU_CONFIG_ID=Y256 -DUSE_CASE_BUILD=inference_runner -Dinference_runner_MODEL_TFLITE_PATH={vela_tflite_path} -DEthos_U_NPU_ENABLED=1 -DETHOS_U_NPU_ID=U65
make -j4
- Run FVP
{PATH}/FVP_Corstone_SSE-300_Ethos-U65 {Build_path}/bin/ethos-u-inference_runner.axf -C ethosu.num_macs=256
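As a sanity check, the NPU settings passed to the three tools above can be compared with a small script. This is only a minimal sketch: the function names are hypothetical, and it assumes that the digits in `ethos-u65-256`, `Y256` and `ethosu.num_macs=256` all denote the MAC count and that `U65` names the same product in each command. It looks at the command lines only, not at what actually ended up in the build directory.

```python
import re

# Command lines copied from this post (placeholders left as-is, unrelated flags dropped).
VELA_CMD = ("vela {tflite_model_path} --accelerator-config=ethos-u65-256 "
            "--memory-mode=Dedicated_Sram --system-config=Ethos_U65_High_End")
CMAKE_CMD = ("cmake {ml_kit_path} -DETHOS_U_NPU_MEMORY_MODE=Dedicated_Sram "
             "-DETHOS_U_NPU_CONFIG_ID=Y256 -DETHOS_U_NPU_ID=U65")
FVP_CMD = ("FVP_Corstone_SSE-300_Ethos-U65 ethos-u-inference_runner.axf "
           "-C ethosu.num_macs=256")


def find(pattern, text):
    """Return the regex match or raise if the setting is missing."""
    match = re.search(pattern, text)
    if match is None:
        raise ValueError(f"setting not found: {pattern}")
    return match


def npu_settings(vela_cmd, cmake_cmd, fvp_cmd):
    """Extract (product, MACs, memory mode) as passed to each tool."""
    accel = find(r"--accelerator-config=ethos-(u\d+)-(\d+)", vela_cmd)
    vela_mem = find(r"--memory-mode=(\w+)", vela_cmd).group(1)
    cmake_id = find(r"-DETHOS_U_NPU_ID=(\w+)", cmake_cmd).group(1)
    cmake_macs = find(r"-DETHOS_U_NPU_CONFIG_ID=[A-Za-z](\d+)", cmake_cmd).group(1)
    cmake_mem = find(r"-DETHOS_U_NPU_MEMORY_MODE=(\w+)", cmake_cmd).group(1)
    fvp_id = find(r"Ethos-(U\d+)", fvp_cmd).group(1)
    fvp_macs = find(r"ethosu\.num_macs=(\d+)", fvp_cmd).group(1)
    return {
        "vela": (accel.group(1).upper(), int(accel.group(2)), vela_mem),
        "cmake": (cmake_id, int(cmake_macs), cmake_mem),
        "fvp": (fvp_id, int(fvp_macs), None),  # no memory mode on the FVP command line
    }


def mismatches(settings):
    """List every setting that differs from what Vela was given."""
    problems = []
    ref_product, ref_macs, ref_mem = settings["vela"]
    for tool in ("cmake", "fvp"):
        product, macs, mem = settings[tool]
        if product != ref_product:
            problems.append(f"{tool}: product {product} != vela {ref_product}")
        if macs != ref_macs:
            problems.append(f"{tool}: MACs {macs} != vela {ref_macs}")
        if mem is not None and mem != ref_mem:
            problems.append(f"{tool}: memory mode {mem} != vela {ref_mem}")
    return problems


if __name__ == "__main__":
    found = mismatches(npu_settings(VELA_CMD, CMAKE_CMD, FVP_CMD))
    print(found or "all three commands agree")
```

With the commands from this post the check finds no mismatch, so at the command-line level the three steps look consistent to me.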
- Results (stdout of the FVP)
INFO - WARN - MPS3_SCC->CFG_ACLK reads 0. Assuming default clock of 32000000
Processor internal clock: 32000000Hz
INFO - V2M-MPS3 revision A
INFO - Application Note AN228, Revision C
INFO - MPS3 build 3
INFO - MPS3 core clock has been set to: 32000000Hz
INFO - CPU ID: 0x411fd220
INFO - CPU: Cortex-M55 r1p0
INFO - TA0 values set
INFO - TA1 values set
DEBUG - EthosU IRQ#: 56, Handler: 0x0x9e03
INFO - Ethos-U device initialised
INFO - Ethos-U version info:
INFO - Arch: v1.1.0
INFO - Driver: v0.16.0
INFO - MACs/cc: 256
INFO - Cmd stream: v0
INFO - Target system design: Arm Corstone-300 - AN552
DEBUG - system tick config ready
INFO - ARM ML Embedded Evaluation Kit
INFO - Version 22.2.0 Build date: Apr 15 2022 @ 05:23:20
INFO - Copyright (C) ARM Ltd 2021-2022. All rights reserved.
DEBUG - loading model from @ 0x0x70000000
DEBUG - loading op resolver
INFO - Creating allocator using tensor arena in DDR/DRAM
DEBUG - Created new allocator @ 0x0x70203230
INFO - Allocating tensors
INFO - Model INPUT tensors:
DEBUG - tensor is assigned to 0x0x70203170
INFO - tensor type is INT8
INFO - tensor occupies 250880 bytes with dimensions
INFO - 0: 1
INFO - 1: 28
INFO - 2: 28
INFO - 3: 320
INFO - Quant dimension: 0
INFO - Scale[0] = 0.996078
INFO - ZeroPoint[0] = -128
INFO - Model OUTPUT tensors:
DEBUG - tensor is assigned to 0x0x70203130
INFO - tensor type is INT8
INFO - tensor occupies 2028 bytes with dimensions
INFO - 0: 1
INFO - 1: 26
INFO - 2: 26
INFO - 3: 3
INFO - Quant dimension: 0
INFO - Scale[0] = 3.780673
INFO - ZeroPoint[0] = 60
INFO - Activation buffer (a.k.a tensor arena) size used: 253380
INFO - Number of operators: 1
INFO - Operator 0: ethos-u
DEBUG - Populating input tensor 0@0x70203170
DEBUG - Total input size to be populated: 250880
DEBUG - system tick config ready
DEBUG - NPU IDLE: 5 cycles
DEBUG - NPU AXI0_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU AXI0_WR_DATA_BEAT_WRITTEN: 0 beats
DEBUG - NPU AXI1_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU ACTIVE: 14 cycles
DEBUG - NPU TOTAL: 19 cycles
E: NPU config mismatch. npu.product=ERROR - Invoke failed.
DEBUG - NPU IDLE: 65 cycles
DEBUG - NPU AXI0_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU AXI0_WR_DATA_BEAT_WRITTEN: 0 beats
DEBUG - NPU AXI1_RD_DATA_BEAT_RECEIVED: 0 beats
DEBUG - NPU ACTIVE: 14 cycles
DEBUG - NPU TOTAL: 79 cycles
ERROR - Inference failed.
INFO - program terminating…
The execution ends with an NPU config mismatch invoke failure.
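For what it's worth, the tensor sizes reported in the log match the model I described, so the model itself seems to be loaded correctly. A quick check, assuming no padding and stride 1 (as `k3_s1` in the file name suggests) and taking the output channel count of 3 from the log and the file name; the helper name is only for illustration:

```python
def conv_output_size(size, kernel, stride=1):
    """Output height/width of a convolution without padding."""
    return (size - kernel) // stride + 1


# Model parameters from this post: 28x28 input, 320 input channels, 3x3 kernel.
in_hw, in_channels, out_channels, kernel, stride = 28, 320, 3, 3, 1

# int8 tensors use one byte per element, batch size 1.
input_bytes = 1 * in_hw * in_hw * in_channels
out_hw = conv_output_size(in_hw, kernel, stride)
output_bytes = 1 * out_hw * out_hw * out_channels

print(input_bytes, out_hw, output_bytes)  # 250880 26 2028
```

These agree with the 250880-byte input and the 2028-byte 1x26x26x3 output in the log.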
To be clear, my question is: which step in the process above did I get wrong? I’m quite confused, as all the configurations above seem fine to me.
Again, changing every build config to target the U55 device gives a clean run without any errors.
Thanks for your attention.