How is the block config decided, and how is one op split into several blocks?

Hi, I’m using Vela on one of my NN models and saw that the ops are split into many blocks (~1000 blocks for the model), which generates a very large command stream.

I read the “Block based operation” section of the U55 technical reference manual, but it doesn’t provide much detail on how an op is split.
https://developer.arm.com/documentation/102420/0200

Specifically, I’d like to know:

How is the “block_config” related to “ifm_box” and “ofm_box”? E.g. for an activation node I’m seeing:
block_config=[1, 2, 128, 128]>, ifm_box=<Box [0, 0, 0, 0] - [1, 1, 1, 1024]>
The block config seems to be able to smartly choose different dims.

Is there some heuristic formula I can use to calculate the number of blocks needed, so I can estimate the command stream size?
Does block dependency change the way an op is split? (If I have Conv-Conv, is the command stream the same size as for two separate subgraphs each containing one conv? I know the scheduling will be different.)
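For what it’s worth, a naive estimate of the block count is just the product of per-axis ceiling divisions of the OFM shape by the block config. This is my own sketch, not Vela’s actual heuristic, which also accounts for SHRAM limits and block dependencies:

```python
import math

def estimate_block_count(ofm_hwc, block_hwc):
    """Rough upper bound on block count: ceil-divide each OFM axis
    (height, width, channels) by the block dimension and multiply."""
    return math.prod(math.ceil(o / b) for o, b in zip(ofm_hwc, block_hwc))

# The conv example below: OFM 97x128x32 tiled by an 8x16x16 block config.
n = estimate_block_count((97, 128, 32), (8, 16, 16))
print(n)  # 13 * 8 * 2 = 208
```

Note the command stream size tracks the number of *stripes* (high-level commands), not blocks; the hardware iterates blocks within a stripe on its own.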

On a high level, I’m trying to find ways to bring down the size of the command stream (either by changing the model or by tuning Vela’s configs).

To give an example, for a convolution with input = [1, 97, 128, 1] and output = [1, 97, 128, 32] (NHWC),
it generates a single block.
But I was reading the “Block based operation” section of the manual:

• OFM_BLOCK_WIDTH must be in the range 1-64 and a multiple of the MIN_BLOCK_WIDTH.
• OFM_BLOCK_HEIGHT must be in the range 1-32 and a multiple of the MIN_BLOCK_HEIGHT.
• OFM_BLOCK_DEPTH must be in the range 1-128 and a multiple of MIN_BLOCK_DEPTH.

Here neither outputH nor outputW is in that range…

print_high_level_command_stream() <unnamed>_split_1
  0 <NPUStripe: ps=conv, ifm_box=<Box [0, 0, 0, 0] - [1, 97, 128, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 97, 128, 32]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 32]>, block_config=[8, 16, 8, 16]>
0 Conv2D, <NPUStripe: ps=conv, ifm_box=<Box [0, 0, 0, 0] - [1, 97, 128, 1]>, ifm2_box=<Box [] - []>, ofm_box=<Box [0, 0, 0, 0] - [1, 97, 128, 32]>, weight_box=<Box [0, 0, 0, 0] - [1, 1, 1, 32]>, block_config=[8, 16, 8, 16]>
      IFM: h=97,w=128,c=1, region=1, NHWC, UINT8, size=12416, scale: 0.011761300265789032, zero: 85
         Stride y/x/c: 128/1/1, tiles: w0=128, h0=97, h1=97, base=['0x0', '0x0', '0x0', '0x0']
      OFM: h=97,w=128,c=32, region=1, NHWC, UINT8, size=397312, scale: 0.006358124315738678, zero: 131
         Stride y/x/c: 4096/32/1, tiles: w0=128, h0=97, h1=97, base=['0x3080', '0x0', '0x0', '0x0']
      Kernel: w=3, h=3, stride=(1, 1), dilation=(1, 1)
      NpuPadding(top=1, left=1, bottom=1, right=1)
      Weights: (region=0, address=0x280, length=496)
      Scales: (region=0, address=0x140, length=320)
      NpuBlockTraversal.PART_KERNEL_FIRST
      Block config: h=8,w=16,c=16, NpuResamplingMode.NONE, NpuRoundingMode.TFL
Code:    Command:                       Param: Payload:
  0x010f cmd0.NPU_SET_IFM_REGION            1   -
  0x4000 cmd1.NPU_SET_IFM_BASE0             0   0x00000000 (0)
  0x4001 cmd1.NPU_SET_IFM_BASE1             0   0x00000000 (0)
  0x4002 cmd1.NPU_SET_IFM_BASE2             0   0x00000000 (0)
  0x4003 cmd1.NPU_SET_IFM_BASE3             0   0x00000000 (0)
  0x010b cmd0.NPU_SET_IFM_HEIGHT0_M1       96   -
  0x010c cmd0.NPU_SET_IFM_HEIGHT1_M1       96   -
  0x010a cmd0.NPU_SET_IFM_WIDTH0_M1       127   -
  0x0104 cmd0.NPU_SET_IFM_DEPTH_M1          0   -
  0x4006 cmd1.NPU_SET_IFM_STRIDE_C          0   0x00000001 (1)
  0x4005 cmd1.NPU_SET_IFM_STRIDE_Y          0   0x00000080 (128)
  0x4004 cmd1.NPU_SET_IFM_STRIDE_X          0   0x00000001 (1)
  0x0109 cmd0.NPU_SET_IFM_ZERO_POINT       85   -
  0x0105 cmd0.NPU_SET_IFM_PRECISION         0   -
  0x0107 cmd0.NPU_SET_IFM_UPSCALE           0   -
  0x0100 cmd0.NPU_SET_IFM_PAD_TOP           1   -
  0x0101 cmd0.NPU_SET_IFM_PAD_LEFT          1   -
  0x0103 cmd0.NPU_SET_IFM_PAD_BOTTOM        1   -
  0x0102 cmd0.NPU_SET_IFM_PAD_RIGHT         1   -
  0x011f cmd0.NPU_SET_OFM_REGION            1   -
  0x4010 cmd1.NPU_SET_OFM_BASE0             0   0x00003080 (12416)
  0x4011 cmd1.NPU_SET_OFM_BASE1             0   0x00000000 (0)
  0x4012 cmd1.NPU_SET_OFM_BASE2             0   0x00000000 (0)
  0x4013 cmd1.NPU_SET_OFM_BASE3             0   0x00000000 (0)
  0x011b cmd0.NPU_SET_OFM_HEIGHT0_M1       96   -
  0x011c cmd0.NPU_SET_OFM_HEIGHT1_M1       96   -
  0x011a cmd0.NPU_SET_OFM_WIDTH0_M1       127   -
  0x0112 cmd0.NPU_SET_OFM_HEIGHT_M1        96   -
  0x0111 cmd0.NPU_SET_OFM_WIDTH_M1        127   -
  0x0113 cmd0.NPU_SET_OFM_DEPTH_M1         31   -
  0x4016 cmd1.NPU_SET_OFM_STRIDE_C          0   0x00000001 (1)
  0x4015 cmd1.NPU_SET_OFM_STRIDE_Y          0   0x00001000 (4096)
  0x4014 cmd1.NPU_SET_OFM_STRIDE_X          0   0x00000020 (32)
  0x0118 cmd0.NPU_SET_OFM_ZERO_POINT      131   -
  0x0114 cmd0.NPU_SET_OFM_PRECISION         0   -
  0x0121 cmd0.NPU_SET_KERNEL_HEIGHT_M1      2   -
  0x0120 cmd0.NPU_SET_KERNEL_WIDTH_M1       2   -
  0x0122 cmd0.NPU_SET_KERNEL_STRIDE         4   -
  0x0128 cmd0.NPU_SET_WEIGHT_REGION         0   -
  0x4020 cmd1.NPU_SET_WEIGHT_BASE           0   0x00000280 (640)
  0x4021 cmd1.NPU_SET_WEIGHT_LENGTH         0   0x000001f0 (496)
  0x0129 cmd0.NPU_SET_SCALE_REGION          0   -
  0x4022 cmd1.NPU_SET_SCALE_BASE            0   0x00000140 (320)
  0x4023 cmd1.NPU_SET_SCALE_LENGTH          0   0x00000140 (320)
  0x0125 cmd0.NPU_SET_ACTIVATION            0   -
  0x0126 cmd0.NPU_SET_ACTIVATION_MIN        0   -
  0x0127 cmd0.NPU_SET_ACTIVATION_MAX      255   -
  0x0116 cmd0.NPU_SET_OFM_BLK_HEIGHT_M1     7   -
  0x0115 cmd0.NPU_SET_OFM_BLK_WIDTH_M1     15   -
  0x0117 cmd0.NPU_SET_OFM_BLK_DEPTH_M1     15   -
  0x010d cmd0.NPU_SET_IFM_IB_END            6   -
  0x012d cmd0.NPU_SET_AB_START              6   -
  0x0124 cmd0.NPU_SET_ACC_FORMAT            0   -
  0x012f cmd0.NPU_SET_BLOCKDEP              0   -
  0x0002 cmd0.NPU_OP_CONV                   0   -
  0x0000 cmd0.NPU_OP_STOP               65535   -
number of commands 56
command stream length in words 74
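One thing that helps when reading these dumps: the `_M1` registers use a “minus one” encoding, so the raw parameter is the dimension minus 1. A small sketch decoding the OFM values from the listing above:

```python
# "_M1" registers store (dimension - 1); add 1 to recover the real value.
# Raw parameters taken from the command stream dump above.
m1_regs = {
    "OFM_HEIGHT_M1": 96,       # -> OFM height 97
    "OFM_WIDTH_M1": 127,       # -> OFM width 128
    "OFM_DEPTH_M1": 31,        # -> OFM depth 32
    "OFM_BLK_HEIGHT_M1": 7,    # -> block height 8
    "OFM_BLK_WIDTH_M1": 15,    # -> block width 16
    "OFM_BLK_DEPTH_M1": 15,    # -> block depth 16
}
dims = {name: raw + 1 for name, raw in m1_regs.items()}
print(dims)
```

So the single stripe covers the whole 97x128x32 OFM, and the hardware walks it in 8x16x16 blocks internally; the block range constraints apply to the block, not the whole OFM.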

I tried looking into the code and found that it’s mainly about stripes:
e.g. I have an output of

OFM <Shape4D [1, 192, 1, 12]>

Stripe:

OFM Stripe   = <Shape4D [1, 1, 1, 12]>

Then it needs to iterate 192 times…
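Assuming one high-level command per stripe, the iteration count is the per-axis ceiling division of the OFM shape by the stripe shape (a sketch of my understanding, not Vela’s code):

```python
import math

ofm = (1, 192, 1, 12)     # N, H, W, C from the Shape4D above
stripe = (1, 1, 1, 12)    # OFM stripe shape

# One stripe per tile of the OFM -> one high-level command per stripe.
n_stripes = math.prod(math.ceil(o / s) for o, s in zip(ofm, stripe))
print(n_stripes)  # 192
```

This is why a tiny stripe shape blows up the command stream: every stripe re-emits a full set of register writes.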

Hi @slhskgsdlfjslg,

Thanks for the question; please bear with us while we answer it.


An update: I tried OptimizationStrategy.Performance vs OptimizationStrategy.Size;
OptimizationStrategy.Performance generated a much smaller command stream.
But the command stream is still pretty large: > 20KB for 20 conv layers.
I’m still looking into the details.

Oh, I think I wasn’t setting the Vela config correctly.
Using OptimizationStrategy.Performance and the right sram_target gives me a 4KB command stream, which seems reasonable.

I was using OptimizationStrategy.Size with sram_limit = 0, which just generates an unreasonable schedule.
Still, I’d like to learn more about the logic behind this if possible.

@slhskgsdlfjslg I need to investigate this further. Can you let me know which model/ops you used and the complete CLI options you provided to Vela? We also need to know the accelerator configuration you are using, as Vela selects the appropriate block size based on that. While you can calculate the MACs required for a conv manually, loading the IFM block into internal RAM is separate from running the conv. Vela has optimization strategies for “Size” and for “Performance”; details can be found in: Vela Options

Block size is set by OFM_BLK_HEIGHT_M1, OFM_BLK_WIDTH_M1 and OFM_BLK_DEPTH_M1, which here give a block of 8×16×16.
You can see that the accumulator Vela is choosing is 32-bit, based on “ACC_FORMAT”: 0 = 32-bit integer, 1 = 40-bit integer, 2 = s5.10 floating point. The block size table is defined in the TRM.

Now, in the NHWC case, if there are 16 channels both in the block and in the IFM itself, then we transfer the full block width in one burst and do not split it into 16-channel sections. If we do not have stride_x = depth = 16, we cannot do this optimization and we split into single transfers. As far as I understand, there is no formula from which you can derive the required block size.
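If I follow the explanation above correctly, the burst condition can be sketched as a simple predicate (a hypothetical helper for illustration, not Vela code):

```python
def full_width_burst(ifm_depth, block_depth, stride_x):
    """Sketch of the NHWC burst condition described above: the full block
    width can be fetched in one burst only when both the IFM and the block
    hold all 16 channels contiguously, i.e. stride_x = depth = 16.
    Otherwise the transfer degrades to singular (per-element) transfers."""
    return ifm_depth == 16 and block_depth >= 16 and stride_x == 16

print(full_width_burst(ifm_depth=16, block_depth=16, stride_x=16))  # True
print(full_width_burst(ifm_depth=1, block_depth=16, stride_x=1))    # False
```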

So, we would like to know: is the concern the amount of memory taken up by “large” command streams, or the perceived amount of time to execute them? We also need your complete CLI options to Vela.