What does ethosu.cc do in TensorFlow Lite Micro?

I’m looking at the TensorFlow Lite Micro Ethos-U operator.
It seems like it just gets the command stream and the input/output tensor addresses/sizes, and uses that information to call ethosu_driver.
I saw this line:
“Ethos-U guarantees that the tensors that require a base pointer are among the 8 first tensors”
I wonder what it means. Why is it “8”? How is this guaranteed? What does this imply for the tensors?

@slhskgsdlfjslg The NPU currently has 8 regions. Vela (the offline compiler for Ethos-U) maps the tensors (command stream, IFM, OFM, weights, etc.) to these regions, and the regions are then further mapped to the physical AXI interface. The driver is responsible for mapping each region to one of the two AXI ports. This is done in the REGIONCFG register (ethosu_config.h).
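The region-to-port mapping above can be sketched as bit packing. This is only an illustrative sketch, assuming REGIONCFG packs one 2-bit field per region selecting the memory type/AXI port (as the Ethos-U55 driver's ethosu_config.h suggests); check the driver headers for the authoritative layout.

```python
# Illustrative sketch: packing a REGIONCFG-style register value.
# Assumption: each of the 8 regions gets a 2-bit field selecting its
# AXI port / memory type; the real field layout lives in the driver.

def pack_regioncfg(region_ports):
    """region_ports: list of 8 small ints, one port selector per region."""
    assert len(region_ports) == 8
    value = 0
    for region, port in enumerate(region_ports):
        assert 0 <= port < 4  # must fit in a 2-bit field
        value |= port << (2 * region)
    return value

# Example: regions 0-1 routed to port 0, regions 2-7 to port 1.
cfg = pack_regioncfg([0, 0, 1, 1, 1, 1, 1, 1])
```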

Now, in ethosu.cc (https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/kernels/ethos_u/ethosu.cc) we have the 8 base pointers, which are set from TFLM. The 8 base pointers are for the regions rather than for individual tensors. Something like:
0. Command stream
1. Weights and biases
2. Scratch area
3. Fast scratch area

The ethosu.cc layer stores pointers to the input and output tensors in an array. This array is then passed as an argument to ethosu_driver.c:ethosu_invoke().
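The driver-side idea can be sketched as follows: addresses in the command stream are effectively (region, offset) pairs, which the driver resolves against the base-pointer array it receives. This is a conceptual Python sketch, not the real C driver; the base-pointer values and region meanings below are hypothetical, following the region list above.

```python
# Conceptual sketch: resolving a region-relative address from the
# command stream against the base-pointer array passed to ethosu_invoke().

def resolve(base_pointers, region, offset):
    """Return the absolute address for an offset within one of the 8 regions."""
    assert 0 <= region < 8
    return base_pointers[region] + offset

# Hypothetical base pointers, mirroring the region list above:
# region 0 = command stream, 1 = weights, 2 = scratch, 3 = fast scratch.
bases = [0x2000_0000, 0x1000_0000, 0x2000_4000, 0x3000_0000, 0, 0, 0, 0]
addr = resolve(bases, 2, 0x40)  # a tensor 0x40 bytes into the scratch region
```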

Hope this makes it clear. You can reach us at support-ml@arm.com for anything specific or for more details.

Oh, I see…
Is that basically what rewrite_npu_call_ops is doing?
I saw there are 4 tensors: scratch_fast_tensor, scratch_tensor, flash_tensor, command_stream_tensor
https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-vela/+/refs/heads/master/ethosu/vela/npu_serialisation.py#128

@arm-ssingh
How does the Ethos-U get the information about the exact layout of the regions, e.g. which part holds the weights of a specific Conv op, which part is the LUT, and where to put the output?

Is that part of the command stream?
On the other hand, I saw that the command stream is generated before the scratch/scratch_fast/flash tensors are generated.
Doesn’t Vela need to know the layout of the scratch area before generating the command stream?

@slhskgsdlfjslg The Vela optimizers assign specific meaning to the above-mentioned tensors (command stream, weights, IFM/OFM, etc.) that are attached to the custom operator, and refer to them in the command stream with specific region ids.
For e.g:
cmd0.NPU_SET_IFM_REGION 1 (region)
cmd0.NPU_SET_WEIGHT_REGION 0 (region)

The Ethos-U driver understands this arrangement. Now, from the perspective of TFLM there are only two memory areas: Model and Arena. The model contains read-only data, while the arena holds read/write data and works like a heap; Vela places these tensors in the Model/Arena areas. Where Vela will put these tensors in memory depends on the system config which we feed to Vela via vela.ini (vela.ini - ml/ethos-u/ethos-u-vela - Gitiles). There you can check the various memory modes and how const_mem_area/arena_mem_area are mapped to the AXI buses.

In the Vela code, refer to high_level_command_to_npu_op.py:
<snip>
class BasePointerIndex(IntEnum):
    WeightTensor = 0  # base address index for the Weight tensor
    ScratchTensor = 1  # base address index for the Scratch_tensor in the TensorArena
    ScratchFastTensor = 2  # base address for the Scratch_fast_tensor
 
def get_region(mem_type: MemType, arch: ArchitectureFeatures) -> int:
    base_ptr_idx_map = {
        MemType.Permanent_NPU: BasePointerIndex.WeightTensor,
        MemType.Permanent_CPU: BasePointerIndex.WeightTensor,
        MemType.Scratch: BasePointerIndex.ScratchTensor,
    }
 
    if arch.is_spilling_enabled():
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchFastTensor
    else:
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchTensor
 
    return base_ptr_idx_map[mem_type].value
<snip>
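A self-contained version of the mapping above (with stub enums standing in for Vela's MemType and ArchitectureFeatures, purely so the snippet runs in isolation) makes it easy to see which base pointer indices are actually produced for each memory type:

```python
from enum import IntEnum

# Stub stand-ins for Vela's MemType / ArchitectureFeatures, so the
# get_region() mapping from the snippet above is runnable on its own.
class MemType(IntEnum):
    Permanent_NPU = 0
    Permanent_CPU = 1
    Scratch = 2
    Scratch_fast = 3

class BasePointerIndex(IntEnum):
    WeightTensor = 0       # base address index for the weight tensor
    ScratchTensor = 1      # scratch tensor in the TensorArena
    ScratchFastTensor = 2  # fast scratch tensor

def get_region(mem_type, spilling_enabled):
    base_ptr_idx_map = {
        MemType.Permanent_NPU: BasePointerIndex.WeightTensor,
        MemType.Permanent_CPU: BasePointerIndex.WeightTensor,
        MemType.Scratch: BasePointerIndex.ScratchTensor,
    }
    if spilling_enabled:
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchFastTensor
    else:
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchTensor
    return base_ptr_idx_map[mem_type].value

# Without spilling, Scratch_fast collapses onto the scratch region, so
# only regions 0 and 1 appear; with spilling, region 2 appears too.
regions_no_spill = {get_region(mt, False) for mt in MemType}
regions_spill = {get_region(mt, True) for mt in MemType}
```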

Hope this makes it clear. You can reach us at support-ml@arm.com for anything specific or for more details.

Thanks! @arm-ssingh
I think I understand the idea of different regions.
One thing I’m a bit confused about is: are we actually using at most 3 regions (instead of 8)?
WeightTensor, ScratchTensor, ScratchFastTensor


Consider this a design decision, with the future possibility of using the other regions.

Got it! Thanks!