Exceptions in FVP Corstone SSE-300

I tried running an ML model on the M55/U55 using this platform, which produced the following exception message:

Ethos-U rev 136b7d75 --- Nov 25 2021 12:05:57
(C) COPYRIGHT 2019-2021 Arm Limited
ALL RIGHTS RESERVED

INFO - Processor internal clock: 32000000Hz
WARN - MPS3_SCC->CFG_ACLK reads 0. Assuming default clock of 32000000
sh: xterm: command not found
INFO - V2M-MPS3 revision A
INFO - Application Note AN228, Revision C
INFO - MPS3 build 3
INFO - MPS3 core clock has been set to: 32000000Hz
INFO - CPU ID: 0x411fd220
INFO - CPU: Cortex-M55 r1p0
INFO - Ethos-U device initialised
INFO - Ethos-U version info:
INFO - Arch: v1.1.0
INFO - Driver: v0.16.0
INFO - MACs/cc: 128
INFO - Cmd stream: v0
INFO - SHRAM size: 24
INFO - Arm Corstone-300 (SSE-300) platform initialised
INFO - Running boltNNApp with flatbuffers runtimeUsing flatbuffers runtime,
Exception caught by function BusFault_Handler
CTRL : 0x0000000c
IPSR : 0x00000005
APSR : 0xa0000000
xPSR : 0xa0000005
PSP : 0x00000000
MSP : 0x2007fb58
PRIMASK : 0x00000000
BASEPRI : 0x00000000
FAULTMSK: 0x00000000

How can I resolve this issue? The FVP configuration parameters used are:

mps3_board.uart0.out_file=-
mps3_board.visualisation.disable-visualisation=1
mps3_board.uart0.shutdown_on_eot=1

Hi @limintang,

Thanks for getting in touch. We would need more information on this before we can help:

  1. What is your CMake configuration command?
  2. Is this model publicly available for us to give it a try?
  3. What is the size of the model?
  4. Which toolchain (with version) are you building with?
  5. I don’t see the output from Main.cc line 38 anywhere, suggesting that the application code has been modified. Is this correct, and if so what changes have been made to the repo?
  6. Are you using the inference runner application or a custom use case that you have added?

As a first step, you can enable more verbose logging by adding -DLOG_LEVEL=LOG_LEVEL_TRACE to your CMake configuration. This might help narrow down where the error is triggered from. It doesn’t look like it is even reaching the inference stage; the fault is right after a statement that doesn’t exist in our vanilla code base.

Additionally, you can build the application for native (your host machine) target using the original TFLite file. If the issue shows up in that build, it will be easier to debug as you can use gdb or other tools.

Thanks,
Kshitij

Hi Kshitij,

Sorry for providing very little information in the post. For context: in our company, ML models are designed and trained using PyTorch, so we developed an internal toolchain to run ML inference of PyTorch models on ARM Ethos, leveraging ARM’s open source projects. While the high-level architecture of our SW stack is similar to TFLite, the implementation details and workflow are quite different.

To answer your questions one by one:

  1. We don’t use CMake; we use Buck. Would sharing the build configuration, e.g. preprocessor flags and compiler options, help?
  2. Unfortunately no. Besides, it is a PyTorch model and relies on our internal toolchain to run. We don’t use the PyTorch → ONNX → TFLite flow.
  3. 3375168 bytes to be exact.
  4. ARM GCC: arm-none-eabi-g++ (GNU Arm Embedded Toolchain 10-2020-q4-major) 10.2.1 20201103 (release)
  5. We don’t use the app in the repo; we developed our own. The app still depends on hal, though, which was built using Buck.
  6. Custom use case. We were able to successfully execute many other models using our internal workflow on this platform, but can hit this issue when model size is large.

As a first step, you can enable more verbose logging by adding -DLOG_LEVEL=LOG_LEVEL_TRACE

Thanks for the tip. I tried this, but no additional messages are printed.

It doesn’t look like it is even reaching the inference stage; the fault is right after a statement that doesn’t exist in our vanilla code base.

I believe so, too. The statement is from our custom app; it means the application has started to run. The next expected message would report that model loading succeeded, which doesn’t happen. It looks like the simulator faults when the model binary is loaded into memory.

you can build the application for native (your host machine)

You mean building and running ML inference of the model on x86? The model was developed by our ML engineer, who has already verified that it runs successfully on an x86 CPU.

Thanks for the help!

Regards,
Limin

Hi Limin,

Thanks for providing the details. It’s useful to get this context :).

So, if I understand correctly:

  • You have your own application level logic and are only using the HAL component from MLEK to drive the Corstone-300 platform.
  • You see this BusFault when running the above application.

Some general thoughts first:

  • You mention you see the issue when the model size is large - this model is about 3.2 MiB, which is smaller than our image classification model file (~3.5M for Ethos-U55-128, 4M for CPU only) or speech recognition model (~14M for Ethos-U55-128 and 21M for CPU only). So, the platform itself - and our original sources - are capable of running quite big models (at least from an embedded perspective).
  • The only thing that could cause issues for larger models, outside the usual RO storage, is the RW space allocated for intermediate computation during an inference. I suspect your application logic will also be reserving some space for such buffers (the TensorFlow Lite Micro equivalent is the tensor_arena); see the sketch right after this list. See also the Arm Compiler scatter file (mps3-sse-300.sct line 70) or the GNU linker script (mps3-sse-300.ld line 144). How big is this buffer for you, and more importantly, are you sure it is sufficient and that it does not breach the memory limits of the Corstone-300 memory system?
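
As a rough sketch of what I mean (the section name and size below are illustrative, not the exact MLEK sources), such a working buffer is typically a statically allocated array placed in a named section so the linker script can map it to the intended region:

/* Illustrative sketch only: section name and size are examples, not the
 * exact MLEK definitions. A statically allocated working buffer in a named
 * section lets the linker script place it in the intended RAM region. */
#include <cstddef>
#include <cstdint>

constexpr std::size_t kArenaSize = 0x00200000; /* 2 MiB - example size only */

__attribute__((section("activation_buf"), aligned(16)))
static uint8_t tensor_arena[kArenaSize];

You can then compare the size of such regions against the limits defined in the linker script or scatter file for the target.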

You mean building and running ML inference of the model on x86?

When I mentioned the native pipeline, I meant that our HAL logic supports building for the native target. But this would only work if your build system and your application-level logic are relatively platform-agnostic and only depend on HAL calls for any interaction with the actual target. For vanilla MLEK, you could configure the CMake project with -DTARGET_PLATFORM=native and then build it to get executables that will run on your host machine. This, for us, is a good way to verify only the ML or application-level logic. And if we find issues, the subsequent debugging is a lot easier :).

However, I understand that this path might not be a viable one for you if your application/build does not allow this. There are a few things to try/consider:

  • Can you get the equivalent of the model in tflite format so you can execute it via the vanilla MLEK inference runner? I understand this is not as useful for you, but just to get a reference and proof point.
  • Ensure that the memory system defined for the software (essentially the linker scripts) adheres to the memory model defined for the target. See the AN552 documentation - section 3 is the programmer’s model. This is important because even if we were to debug the application, knowing the point of failure might not help much otherwise.
  • In terms of actual debugging, there are a few ways:
    • If you have access to Arm Development Studio (ArmDS), you could build your application with debug flags and then kick it off in the ArmDS environment to debug it. The FVP can be executed with the -I flag to start the Iris server, which allows the ArmDS environment to connect to it and debug the application.
    • I guess you’d have tried this already - but you can add more prints in the application to see at which point it fails. The clue is also in the MSP : 0x2007fb58. Your build (provided the right linker flags have been passed in) should have generated a map file that should tell you what is placed around that region. If in doubt, you can always disassemble your application to check what that address refers to; arm-none-eabi-objdump -D <your-app.elf/axf> should do the trick.
    • If you have the actual MPS3 FPGA board, debugging could be easier as it has CMSIS-DAP - making it possible to debug using any IDE in conjunction with pyOCD.

I hope this is useful. But as the software stack is heavily modified, I don’t know if our team is best placed to offer assistance on actual debugging for your application. I would recommend getting in touch with your Arm representative and opening a ticket with them to get access to tools or general guidance on those topics.

Best regards,
Kshitij

Hi Kshitij,

Thanks for the guide and detailed instructions. I’m starting my vacation today and will resume debugging next year. Happy holidays!

Best regards,
Limin

Hi Kshitij,

We use the GNU linker script mps3-sse-300.ld. What I found is that we use std::malloc to allocate memory for the model binary at the model-loading stage, and this is what triggers the exception. The linker script shows a heap size of 960K (we use an older linker script); could this be the root cause?

Given the large model sizes you mentioned in your previous reply, std::malloc is probably not the correct way to allocate memory for the model binary. So what is the correct way, i.e. which memory region should we use and how is this done? Any reference to your original source code would be helpful.
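
For reference, our failing path is roughly equivalent to the simplified sketch below (names are illustrative, not our actual framework code); with a ~960K heap, an allocation of the full model size cannot succeed:

/* Simplified sketch of the failing path, assuming a fixed-size heap:
 * allocating ~3.3 MB from a ~960K heap cannot succeed, and an unchecked
 * result leads to a fault when the model is copied into the buffer. */
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

uint8_t* LoadModel(const uint8_t* src, std::size_t modelSize) /* e.g. 3375168 bytes */
{
    auto* dst = static_cast<uint8_t*>(std::malloc(modelSize));
    if (dst == nullptr) {
        return nullptr; /* with a 960K heap we end up here, or fault in the copy below */
    }
    std::memcpy(dst, src, modelSize);
    return dst;
}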

Best regards,
Limin

Hi Limin,

I’d like to understand the use of std::malloc further for the model binary. To me, this suggests that the model blob (the tflite or other format file with weights/biases/metadata) is not statically baked into the application and you are manually allocating space for it. Is this correct? The only case where I see the need for the model memory to be manually allocated is when the model is not known at build-time, and is being received over the network (or some other connection) by the application.

In our examples, the model is always baked in (ref: mps3-sse-300.ld line 188 and BufAttributes.hpp line 58). The model C++ file, generated at the CMake configuration stage, is created with a named section attribute to help the linker script refer to it and place it in the DDR.
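
In essence, the generated file boils down to something like the sketch below (names and bytes are placeholders, not the actual generated output); the named section is what allows the linker script to place the blob in DDR:

/* Sketch of a baked-in model blob: the named section attribute lets the
 * linker script place the array in the DDR region. The real generated
 * file contains the full model contents. */
#include <cstddef>
#include <cstdint>

__attribute__((section("nn_model"), aligned(16)))
const uint8_t nn_model[] = {
    0x00, 0x01, 0x02, 0x03 /* placeholder bytes; the real file has the full blob */
};

const std::size_t nn_model_len = sizeof(nn_model);

The application then references the array directly; no allocation or copying is needed.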

Hope this is useful.

Thanks,
Kshitij

Hi Kshitij,

I’d like to understand the use of std::malloc further for the model binary. To me, this suggests that the model blob (the tflite or other format file with weights/biases/metadata) is not statically baked into the application and you are manually allocating space for it. Is this correct? The only case where I see the need for the model memory to be manually allocated is when the model is not known at build-time, and is being received over the network (or some other connection) by the application.

Yes, this is the case. Models are not loaded over the network; however, they are indeed not known at application build time. The application is part of our test framework, which is model-agnostic. In our production build, the model blob is baked into the system image, so this issue doesn’t exist there.

Is there any way to work around this limitation on the FVP? I can use a similar approach to our production build, but it could break the generality of our test framework.

Best regards,
Limin

Hi Kshitij,

I tried the same approach as in our production build, baking the model binary into the application using __attribute__((section("nn_model"))), and the model executed successfully. However, afterwards the FVP throws the following exception:

INFO - releasing platform Arm Corstone-300 (SSE-300)
Exception caught by function HardFault_Handler
CTRL : 0x0000000c
IPSR : 0x00000003
APSR : 0xa0000000
xPSR : 0xa0000003
PSP : 0x00000000
MSP : 0x2007fe68
PRIMASK : 0x00000001
BASEPRI : 0x00000000
FAULTMSK: 0x00000000

Any idea what’s wrong? Is there any cleanup needed after execution when the model binary is baked into the memory section specified in the linker script?

Best regards,
Limin

Hi Limin,

Just to let you know we are looking at your latest replies but we might be a bit delayed in providing a response today.

In the meantime, your use case of a testing framework with the FVP and not knowing the model at build time sounds very similar to our inference_runner application with dynamic memory loading.

For this, the model is not baked in at build time but is instead loaded to a specific address at runtime. See here in the docs for more details; maybe something like this is what you are looking for? (Note, however, that it is only applicable when using the FVP.)
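
As a rough illustration (the address below is an example only; the actual one comes from the documentation and linker script for the target), the application then consumes the model in place rather than allocating a buffer and copying it:

/* Sketch only: example load address. With dynamic loading, the FVP places
 * the model file at an agreed DDR address and the application reads it in
 * place - no malloc, no copy. */
#include <cstdint>

constexpr std::uintptr_t kModelLoadAddress = 0x90000000U; /* example address only */

const uint8_t* GetDynamicallyLoadedModel()
{
    return reinterpret_cast<const uint8_t*>(kModelLoadAddress);
}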

Best regards,
Richard

Hi Limin,

Again, it would be very difficult to debug based on just that snapshot and no code :). But to answer the second part of the question - no, you do not need any clean-up when the model (or anything else) is baked in. It resides in memory as an RO region, just like your application code does. You don’t need to allocate or deallocate it. Where errors could happen, perhaps, is if you try to write to such memory.

Richard has rightly pointed you to an example where our model is also changed at runtime for FVP execution with the inference runner. I think this could be a good approach for you too, but as Richard said, it will only work for the FVP. It doesn’t translate well to the FPGA instance, which has limited load regions.

Hope this is useful.

Best regards,
Kshitij

Hi Richard,

Thanks for the reply! I found out we do use dynamic memory loading in our test framework; the culprit is somewhere else, where std::malloc is used to allocate memory for the model again and the dynamically loaded model is copied over. Eliminating this copy and using the address of the dynamically loaded model directly solves this issue.

Best regards,
Limin

Hi Kshitij,

I found out the reason. As mentioned in my reply to Richard, there is a double memory allocation for the model in our test framework. Statically baking the model into memory removes one of them, but the test framework is unaware of this and tries to free the statically baked memory region, causing the exception.
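
For illustration, the fix boils down to something like the sketch below (names are hypothetical, not our actual framework code): only heap-allocated buffers are freed, while statically baked or dynamically loaded regions are used in place and never passed to std::free.

/* Hypothetical sketch of the ownership guard: free only what came from
 * std::malloc; statically baked or dynamically loaded model regions were
 * never heap-allocated and must not be freed. */
#include <cstdint>
#include <cstdlib>

struct ModelBuffer {
    const uint8_t* data;
    bool heapOwned; /* true only if data came from std::malloc */
};

static void ReleaseModel(ModelBuffer& buf)
{
    if (buf.heapOwned) {
        std::free(const_cast<uint8_t*>(buf.data));
    }
    buf.data = nullptr;
    buf.heapOwned = false;
}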

As also mentioned in the reply to Richard, I identified the root cause and fixed the issue. Thanks for the great insights in all your replies; they were really helpful!

Best regards,
Limin