I’m looking at the TensorFlow Lite Micro Ethos-U operator.
It seems like it just gets the command stream and the input/output tensors’ addresses/sizes, and uses that information to call ethosu_driver.
I saw this line:
“Ethos-U guarantees that the tensors that require a base pointer are among the 8 first tensors”
I wonder what that means. Why is it “8”? How is this guaranteed? What does this imply for the tensors?
@slhskgsdlfjslg The NPU currently has 8 regions. Vela (the offline compiler for Ethos-U) maps the command stream, IFM, OFM, weights etc. to these regions, and the regions are then further mapped to the physical AXI interfaces. The driver is responsible for mapping each region to one of the two AXI ports; this is done via the REGIONCFG register (ethosu_config.h).
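To make that concrete, here is a rough Python sketch (not the driver API) of what such a region-to-AXI assignment expresses. The region_to_axi map, the pack_regioncfg helper and the 2-bits-per-region packing are assumptions for illustration only; the real, build-time values live in the driver's ethosu_config.h.

# Illustrative sketch only: how the 8 NPU regions could be assigned to the
# two AXI ports. The packing below (2 bits per region, region 0 in the least
# significant bits) is an assumption, not the driver's actual register layout.
AXI0 = 0  # e.g. on-chip SRAM port
AXI1 = 1  # e.g. flash/DRAM port

# Hypothetical assignment: region 1 (read-only model data) on AXI1,
# everything else on AXI0.
region_to_axi = {0: AXI0, 1: AXI1, 2: AXI0, 3: AXI0,
                 4: AXI0, 5: AXI0, 6: AXI0, 7: AXI0}

def pack_regioncfg(mapping):
    """Pack a region->port map into a single register-style value."""
    value = 0
    for region, port in mapping.items():
        value |= (port & 0x3) << (2 * region)
    return value

print(hex(pack_regioncfg(region_to_axi)))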
Now, in ethosu.cc (https://github.com/tensorflow/tflite-micro/blob/main/tensorflow/lite/micro/kernels/ethos_u/ethosu.cc) we have the 8 base pointers, which are set from TFLM. The 8 base pointers are for the regions rather than for individual tensors, something like:
0. Command stream
1. Weights and biases
2. Scratch area
3. Fast scratch area
The ethosu.cc layer stores pointers to the input and output tensors in an array, and this array is then passed as an argument to ethosu_driver.c:ethosu_invoke().
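As a conceptual illustration (a sketch, not the actual ethosu.cc or driver code): the command stream refers to data as (region, offset) pairs, and the NPU resolves those against the base pointers that ethosu.cc collects and hands over for ethosu_driver.c:ethosu_invoke(). The addresses below are invented, and the region meanings simply follow the list above.

# Conceptual sketch only: resolving a (region, offset) reference from the
# command stream against the 8 base pointers. Addresses are made up.
base_addrs = [
    0x60000000,  # region 0: command stream
    0x60010000,  # region 1: weights and biases (read-only model data)
    0x20000000,  # region 2: scratch area (tensor arena)
    0x20040000,  # region 3: fast scratch area
    0, 0, 0, 0,  # regions 4-7: unused in this setup
]

def resolve(region: int, offset: int) -> int:
    """What a (region, offset) reference in the command stream boils down to."""
    return base_addrs[region] + offset

# e.g. an IFM placed 0x100 bytes into the scratch region:
print(hex(resolve(2, 0x100)))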
Hope this makes it clear. You can reach us at support-ml@arm.com for anything specific or for more details.
Oh, I see…
Is that basically what rewrite_npu_call_ops is doing?
I saw there are 4 tensors: scratch_fast_tensor, scratch_tensor, flash_tensor, command_stream_tensor
https://review.mlplatform.org/plugins/gitiles/ml/ethos-u/ethos-u-vela/+/refs/heads/master/ethosu/vela/npu_serialisation.py#128
@arm-ssingh
How does the Ethos-U get the information about the exact layout of the regions, e.g. which part holds the weights of a specific Conv op, which part is the LUT, and where to put the output?
Is that part of the command stream?
On the other hand, I saw that the command stream is generated before the scratch_tensor/scratch_fast_tensor/flash tensors are generated.
Doesn’t Vela need to know the layout of the scratch area before generating the command stream?
@slhskgsdlfjslg The Vela optimizers assign specific meanings to the above-mentioned tensors (command stream, weights, IFM/OFM etc.) that are attached to the custom operator, and reference them in the command stream with specific region IDs.
For example:
cmd0.NPU_SET_IFM_REGION 1 (region)
cmd0.NPU_SET_WEIGHT_REGION 0 (region)
The Ethos-U driver understands this arrangement. Now, from the perspective of TFLM there are only two memory areas: Model and Arena. The Model contains read-only data while the Arena holds read/write data and works like a heap, and Vela places these tensors in the Model/Arena areas. Where Vela puts these tensors in memory depends on the system configuration which we feed to Vela via vela.ini (vela.ini - ml/ethos-u/ethos-u-vela - Gitiles). You can check the various memory modes and how const_mem_area/arena_mem_area are mapped to the AXI buses.
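Since vela.ini is a plain INI file, one quick way to see what a memory mode defines is something like the sketch below. The section name Memory_Mode.Shared_Sram and the Axi0/Axi1 comments are assumptions for illustration; the real sections and values are in the vela.ini shipped with ethos-u-vela.

# Minimal sketch: inspect a memory mode in vela.ini with configparser.
import configparser

config = configparser.ConfigParser()
config.read("vela.ini")

section = "Memory_Mode.Shared_Sram"  # hypothetical memory mode name
if section in config:
    mode = config[section]
    # const_mem_area: where read-only data (weights, command stream) lives;
    # arena_mem_area: where the read/write tensor arena lives.
    print("const_mem_area:", mode.get("const_mem_area"))  # e.g. Axi1 (flash/DRAM port)
    print("arena_mem_area:", mode.get("arena_mem_area"))  # e.g. Axi0 (SRAM port)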
In the Vela code, refer to high_level_command_to_npu_op.py:
<snip>
class BasePointerIndex(IntEnum):
    WeightTensor = 0  # base address index for the Weight tensor
    ScratchTensor = 1  # base address index for the Scratch_tensor in the TensorArena
    ScratchFastTensor = 2  # base address for the Scratch_fast_tensor


def get_region(mem_type: MemType, arch: ArchitectureFeatures) -> int:
    base_ptr_idx_map = {
        MemType.Permanent_NPU: BasePointerIndex.WeightTensor,
        MemType.Permanent_CPU: BasePointerIndex.WeightTensor,
        MemType.Scratch: BasePointerIndex.ScratchTensor,
    }
    if arch.is_spilling_enabled():
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchFastTensor
    else:
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchTensor
    return base_ptr_idx_map[mem_type].value
<snip>
Hope this makes it clear. You can reach us at support-ml@arm.com for anything specific or for more details.
Thanks! @arm-ssingh
I think I understand the idea of different regions.
One thing I’m a bit confused about: are we actually using at most 3 regions (instead of 8)?
WeightTensor, ScratchTensor, ScratchFastTensor
In the Vela code, refer to high_level_command_to_npu_op.py:
class BasePointerIndex(IntEnum):
    WeightTensor = 0  # base address index for the Weight tensor
    ScratchTensor = 1  # base address index for the Scratch_tensor in the TensorArena
    ScratchFastTensor = 2  # base address for the Scratch_fast_tensor


def get_region(mem_type: MemType, arch: ArchitectureFeatures) -> int:
    base_ptr_idx_map = {
        MemType.Permanent_NPU: BasePointerIndex.WeightTensor,
        MemType.Permanent_CPU: BasePointerIndex.WeightTensor,
        MemType.Scratch: BasePointerIndex.ScratchTensor,
    }
    if arch.is_spilling_enabled():
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchFastTensor
    else:
        base_ptr_idx_map[MemType.Scratch_fast] = BasePointerIndex.ScratchTensor
    return base_ptr_idx_map[mem_type].value
Consider this a design decision with the future possibility of using other regions.
Got it! Thanks!