Extending TOSA to Support Symbolic Shapes

Hi TOSA-authors, I took a stab at writing up an addendum to the TOSA specification which I believe would enable compilers and tools to be easily constructed for fixed-rank, dynamically shaped tensor programs of the form that TOSA seems at a level to specify. These definitions represent several years of learnings having implemented similar approaches in MHLO, IREE, and XLA – and encodes the decisions that I wish we had made from the get-go on those projects while the representations were new and unencumbered. There is much more that can be done in this area to enable increasingly dynamic programs, but I believe what I present below to strike the right balance with respect to what TOSA is trying to achieve in its current state, supporting lowerings for common programs, and preserving the simplicity and accessibility of the static form for implementations that are not dynamic.

I further believe that the existing LLVM/MLIR based implementation of TOSA (and the lowerings from TF/TFLite) could be fairly trivially extended to implement this, providing direct tooling for expressing symbolically-shaped tensor programs (and tooling to specialize, etc).

I wrote this in a form intended for evaluation/discussion, not necessarily direct contribution to the Spec. It would need to be reworked and folded into relevant sections of the specification.

For the lawyers: This thread can be considered a “Google Contribution to the TOSA Specification”.


Dynamic Shape Support

In its present form, TOSA defines operations sufficient to represent a wide variety of fixed-rank tensor programs. While this clearly includes fully statically shaped programs (where there are no unknown dimensions in any tensor), it is also useful for the representation to be complete for expressing fixed-rank symbolically shaped programs. History has shown that this is a useful subset (compared to the full generality of a completely rank-unknown representation).

Supporting symbolically shaped programs in detail is left to specific toolchain and compiler implementations. TOSA itself merely seeks to ensure that programs which are otherwise spec-compliant but have unknown, symbolic dimensions are unambiguous in a very direct fashion.

At the present time, there is no requirement that conforming implementations accept programs for execution with unknown, symbolic dimensions, but if they do so, they must do so completely/correctly in conformance with this specification.

Definitions

Fixed-rank

: A tensor whose rank is known. Specific dimensions of the tensor may be unknown.

Static-shaped

: A tensor or program consisting entirely of Fixed-rank tensors (or other primitive, unshaped datatypes) where there are no Dimension Symbols.

Dimension Symbol

: A dimension of a tensor that is unknown. The unknown dimension is uniquely identified by a reference to the tensor, combined with its 0-based offset in the dimension list.

Index Type

: A type sufficient to represent all legal values of a Dimension Symbol. While this can be a more restrictive type, it is assumed that a signed 64-bit integer is sufficient for all implementations.

Dimension Value

: An Index Typed value which represents the explicit quantity that a Dimension Symbol is assumed to take.

Shape Ambiguous

: An operation which, when any of its input or output dimensions are unknown, is ambiguous with respect to its output shape unless if it is parameterized with explicit values for each unknown Dimension Symbol.

Expansion Symbol

: For operations with broadcast semantics, a special symbol which signifies the the dimension expands relative to its broadcast-peers. At present, the only valid Expansion Symbol is the literal 1 dimension in a tensor dimension list. Notably, a generic Dimension Symbol can never be interpreted as an Expansion Symbol in the current spec version.

Approach

A majority of TOSA ops are not Shape Ambiguous. These operations require no further information in the presence of Dimension Symbols and are not discussed in this section, unless if there are specific caveats.

This leaves three categories of ops requiring further definition:

  • Shape Ambiguous operations: each is outlined here with the additional definitions to make it unambiguous.

  • Shape helper operations: the above (and common usage) requires a library of additional shape helper operations in order to be complete. These are specified in this section. Such operations should never exist in a Static-shaped program, and it is assumed that any conforming compiler which accepts symbolically shaped programs can eliminate them if all constraints are known.

  • Data Dependent Shaped Operations: A number of TOSA operations are defined such that they require constant (attribute) arguments for shape-impacting behavior. This section does not expand the semantics for these operations and assumes that a future version of the specification will define new operations, which are expected to decay to more primitive form when all symbols are known.

Shape Ambiguous Operations

Typically shape ambiguous operations are made unambiguous by requiring that they accept an additional list of Dimension Values for each Dimension Symbol. The order and any additional interpretation is provided below for each op. These addendums should be added to the overall op description prior to publishing the spec.

RESHAPE operation

If any of the shape dimensions is unknown, then the op must be parameterized with an additional shape_symbols variadic list of Dimension Value corresponding to each Dimension Symbol in the output shape. In this case, the new_shape attribute must have a -1 value for any Dimension Symbol, and all dimensions must be explicit (i.e. rescinds the rule "At most one dimension may be given as-1 to automatically calculate the dimension size).

Data Dependent Shaped Operations

In order to represent more symbolically shaped programs, it is expected that a future version of the specification will need to define several data-dependent variants of ops. Effort should be undertaken to make these not exhibit shape ambiguity and preserve simple transformations to purely static forms as possible.

  • IOTA: Creates linear increasing tensor of numeric values. Will require explicit variadic of IndexType for the range.

  • DATA_DEPENDENT_SLICE: Equivalent to SLICE but takes start and/or sizes dynamically. Extending in this way does not introduce shape ambiguity.

  • DATA_DEPENDENT_PAD: Equivalent to PAD but accepts padding as an explicit variadic of IndexType quantities for padding.

  • DATA_DEPENDENT_GATHER / DATA_DEPENDENT_SCATTER: Variants of GATHER and SCATTER which accept variadic of IndexType quantities for indices.

  • DATA_DEPENDENT_CONV2D / DATA_DEPENDENT_CONV3D / DATA_DEPENDENT_DEPTHWISE_CONV2D / DATA_DEPENDENT_TRANSPOSE_CONV2D: Variants of the conv ops which take all parameters as operands instead of compile time constants (pad, stride, dilations). Each should be a variadic of IndexType quantities suitable for the specific form.

In general, as the design space permits, new ops should be defined in ways that limit shape ambiguity and accept additional index/dimension arguments as variadic packs of IndexType that correspond to the overall constraints of the operation. Keeping such symbols as simple scalars helps distinguish the types of data dependence and simplifies analysis/compilation, especially for more constrained programs (i.e. as opposed to representing indices/dimensions as tensor values).

Defining operations with data dependent dimensions in this fashion allows straight-forward tooling to be constructed which can fixate Dimension Symbols in the graph, allowing all such data dependent operations to decay to purely static forms, leveraging simple transformations and shape transfer functions.

Shape Helper Operations

A small number of helper operations to allow programs conforming to the above definitions to be represented.

INDEX_CONST

Attributes:

  • value: IndexType value of the constant

Returns: An IndexType value.

GET_DIM

Inputs:

  • input: The tensor value to extract a dimension from.

Attributes:

  • dim_index: The 0-based index of the dimension to extract from input.

Returns: An IndexType value corresponding to the runtime size of the given dimension (which can either be static or symbolic).

EXTRACT_DIM

Inputs:

  • input: A 1D tensor of an integer type.

Attributes:

  • dim_index: The 0-based index of the element to extract from input.

Returns: An IndexType value corresponding to the integer value at input[dim_index]

DIMS_TO_TENSOR

Inputs:

  • dims: Variadic list of IndexType dimensions with an arity equal to the rank of the result.

Returns: A 1D tensor of an integer type containing the given list of dims, cast to the specified result element type.

EXPAND_DIMS

Inputs:

  • input: An arbitrary tensor.

Attributes:

  • indices: IndexType list of indices in the input tensor at which point to insert an Expansion Symbol (i.e. static 1 dim in the present implementation). Must be sorted in increasing order.

Returns:

  • A tensor with the same contents as input but with its dimension list expanded by adding a 1 dim at each location in the input shape matching a the next value in indices, proceeding left to right through that list.

COLLAPSE_DIMS

Inputs:

  • input: An arbitrary tensor.

Attributes:

  • indices: IndexType list of indices in the input tensor at which point to remove an Expansion Symbol (i.e. static 1 dim in the present implementation). Must be sorted in increasing order.

Returns:

  • A tensor with the same contents as input but with every dimension in indices removed, presuming that each is an Expansion Symbol (i.e. static 1 dim in the present implementation).
1 Like

Overall this sounds pretty good. One question is why EXTRACT_DIM / DIMS_TO_TENSOR are needed. I.e. why do we need extent tensors? We can use loose SSA values since we only support a ranked programming model.

A few nits below:

Should this somehow be formalized as a “profile” like the “Base Inference” profile? It seems a little wishy-washy to have this kind of additional “what it means to be spec compliant / optionally compliant” be open-coded here.

On first reading, it’s unclear what " parameterized with explicit values for each unknown Dimension Symbol ." means. Just that the size is explicit? (why the mention of Dimension Symbol’s then)

Nit: usually “data dependent” in this sort of context is for things like np.where where the output size depends on the tensor contents (i.e. “data”). Maybe call these variants “dynamically shape parameterized” (as opposed to the current situation where they are parameterized by static values – attributes).

1 Like

Thanks for the review, Sean. I’ll just comment here vs changing the source for now so that it reads consistently.

For better or for worse, because some of our favorite frontends dip in and out of “tensor land” for doing shape massaging, and this happens pretty extensively in some models (various attention mechanisms/tiling come to mind). My intent by having these two “bounding” ops was:

  • At least provide clear bracketing points in the IR where we cross these domains (tensor and index domain).
  • Have a high degree of representability for “normal” programs so that they don’t have to do this: it really is an escape valve for open coded shape massaging in high level frameworks.
  • Capturing the weird in a way that we can reason about in a safe way, error report on, etc.

I’m of the strong opinion that for compiler backends, the right answer for these kind of open coded shape-massaging-with-tensors use cases, is to scalarize them. If a compiler wanted to do that, having such bracketing ops makes it trivial to analyze and do that.

Yeah - I’ll leave the ARM folks to comment as I think they have ideas about different profiles and tiers of support. Agreed that should be formalized somehow.

I’m thinking of collapsing this with whatever we rename Data Dependent Shaped Operations to, and since there is only one thing in this category (RESHAPE), it might be best to have a DATA_DEPENDENT_RESHAPE in addition to the fully static form: that would add consistency. That also gives the DATA_DEPENDENT_* ops that nice quality as a set that a compiler supporting them directly needs to add input/bounds checking to them but the non DATA_DEPENDENT variants can all be statically verified.

+1 - should bike-shed the name. I’d also be ok if we fell back to calling these DYNAMIC, but that gets reused in various ways a lot and I was trying to avoid it. It would be good to call out the characteristics and dynamic correctness checks that must be carried out for all of the operations in this class.

It would be really nice if we could avoid introducing new DATA_DEPENDENT versions of the ops. I wonder if we can do an OffsetsSizesAndStrides-style mixed static/dynamic representation that cleanly bottoms out on the fully static case.

MATMUL needs dynamic correctness checks but doesn’t require any additional Index operands (so would not be DATA_DEPENDENT in your terminology).

Yeah, we’ll need to get the ARM folks take on that: the spec does not bottom out on MLIR for all use cases, and with their view of this as more of an “ISA”, I think there is some value in preferring to steer away from advanced encodings in favor of having structurally different things be different. Just because we can engineer a mixed representation in MLIR doesn’t mean it fits the grain for what TOSA is targeting. Just my 2 cents having witnessed some of the thinking that has gone in to defining things. Probably also depends on where they are on their progression towards 1.0 and other factors influencing how much of breaking changes are still tolerable.

When I did my (admittedly quick) audit of the ops to give the DATA_DEPENDENT set, RESHAPE was the only one I found that might just be able to be upgraded in-situ (i.e. it could take an additional, optional pack of indices as operands).

Good point. We just need to completely specify dynamic correctness constraints for everything then.

Thanks for this detailed proposal @stellaraccident! Given that this follows on some earlier conversation we had over email, I thought I’d start by summarizing those before commenting. The quickest way is a (very simplified) picture:

There are two goals here - somehow generate a statically shape resolved TOSA form to drive the reference model for verification (hardware and software golden reference), and enable general codegen supporting dynamic shapes.

The simple flow is roughly the existing verification focused flow. We start from networks that are already conditioned and shape resolved at frontend. This is not ideal but serves the immediate purpose of verification trivially.

The general solution would propagate dynamic shapes through TOSA legalization with the help of helper ops and definition semantics, intersecting with the subject of this contribution. This infrastructure would enable static or dynamic shape resolution at lower level codegen. A fully shape-resolved form could run on the reference model or with a simple runtime. Dynamic shape resolution would involve complementary APIs on the runtime enabling specification of Dimension Values at runtime.

Getting back to this contribution itself:

The TOSA specification doesn’t explicitly call for shapes to be resolved, it describes functional behavior in terms of resolved shapes, and does not specifically address the presence of unresolved shapes or how/when they must be resolved.

Therefore, a compiler flow that supports a particular profile (or multiple profiles) would itself define the extent of dynamic shape support. It should not require a different profile, since a compiler could iteratively support just models with pre-resolved shapes, emitting binaries that run on a simple runtime, and build levels of complexity both within the compiler and the complementary runtime artifacts.

For simpler compilers, resolving the ‘Dynamic Shaped TOSA’ form to ‘Shape Resolved TOSA’ form by leveraging this API to implement a shape inference pass would address the need to feed a fully-shape resolved TOSA form to a simpler compiler flow, while enabling the development of a more complex flow as well.

Thinking about this a little, something like this is a necessity as a TOSA implementation framework for a general purpose compiler capable of expressing dynamic shape support. However as mentioned, a TOSA compiler could just take a fully shape resolved network and that is still a valid TOSA form.

Right now, a TOSA software implementation isn’t mandated to support dynamic shapes, but an implementation specification that defines standard constructs (“if you’re going to implement a compiler with this capability, implement these APIs ?”) might be helpful, e.g. multiple compilers (need not all be MLIR based) can then emit code that runs on the same runtime in the process. Is that too optimistic, or aligned to the thought process behind this contribution ?

There’s clearly significant value in an infrastructure that lets a compiler define shape resolution algorithms, and this contribution doesn’t mandate a particular algorithmic construct, just the support infrastructure. To me, that nuance makes this really valuable.

I don’t have a set opinion as to whether this must be specification-defined or described as an implementation reference document of some form. The main reason is that specification defines what TOSA compliance mandates, while this defines specific infrastructure to express a carried capability should a compiler choose to implement it. @Eric ?

+1. ‘Dynamic’ is prone to overloading. I also agree with @_sean_silva that ‘data_dependent’ seems a little misleading, but this topic is covered further below:

Mechanically, these could readily populate the TosaUtilOps set of operations currently present in the MLIR implementation, but ideally given that they’re either ‘spec-defined’ or ‘implementation-defined’ (as the resolution of the nature of this contribution dictates), it might nicely sit in an appropriately named IR definition file.

Their description also fits well with my drawing - they’d be consumed by the shape resolution pass and would never be visible in the ‘Shape Resolved TOSA’ form.

I think all the quoted ops here reflect the same semantic question - what parameters must not be mutable entities from a spec perspective ? For example, the fully_connected op expects that weight are not mutable. Where they are, the matmul op is mandated, as listed in spec. Similarly, conv2d defines weights as mutable for MT but not inference profiles. Fundamentally this is tied to efficient underlying hardware design options.

This doesn’t necessarily extend to non-tensor entities of ops, e.g. start/size of tosa.slice or pad/stride/dilation of conv2d. The considerations behind the current design choices include:

  • they are compile time constants in the frontends too.
  • their definition as attributes enables meta-information definition. E.g. conv2d pad is a 4-element array defining pads in the order TBLR. How does its variadic IndexType construct look ?
  • What are rules for legalizing such attributes where the meta-information is differently organized by different frontends, e.g. framework A defines 4 pads TBLR, framework B defines pads TLBR, and framework C defines just TL pads ?
    I think this needs better definition. Within the dialect itself, these could implement variadic entities rather than MLIR attributes in our current construct, but with a clearer set of semantics. This would address:

Previously we’ve been open to having a clearer definition of this. In particular cases, e.g. fully_connected vs matmul described earlier there’s particular meaning to ‘attribute’ vs ‘input’ in the spec. But for non-tensor parameters, perhaps a term ‘parameter’ might be suitable. It just needs to be clearly defined and resolved at compile time where we already do get them from the frontend, e.f. pad/stride/dilation in TensorFlow / TF Lite conv2d.

Thanks for the comments. A few responses inline.

Diagram looks good to me!

Yes - I’ve been back and forth on whether dynamic shapes are spec-impacting or not. I eventually settled on maybe needing an addendum/implementors guide or something like I wrote up. The reason is that it is not possible to legalize common forms from frontends without additional structures.

To stretch the ISA analogy that you all use, we might be talking about the difference between the ISA (the TOSA spec as it is today) and the Assembly Language (which has additional structures and formalism for expanding the expressiveness of the raw underlying instructions in some ways).

The reasoning makes sense to me and I don’t know the answer as to spec-defined or implementation reference document (this feels like the latter). Practically: we have worked to make sure that the TOSA dialect in MLIR is as isomorphic to the spec as possible, and in the absence of an additional implementors reference which allows further structures, we are in a bit of an under-defined territory: we can move the MLIR dialect and supporting tools in this direction, but I would rather do it with something in writing that adds some formalism to the approach.

If we could come up with the right adjective to replace DYNAMIC/DATA_DEPENDENT, it would be nice to cluster all of these from a namespace perspective with that moniker. Let’s say for a moment that we had two namespaces, dynamic and shape. We could cluster some of these new helpers as:

  • tosa.dynamic.conv2d
  • tosa.dynamic.resize
  • etc

And then shape helpers:

  • tosa.shape.expand_dims
  • tosa.shape.extract_dim
  • tosa.shape.index_const
  • etc

These would all be in the tosa MLIR dialect but would be organized for clear distinction.

Yes, the spec takes specific opinions on this – as it should. However, my experience is that when interfacing to frontends, it is a bit of a losing game to try to predict exactly how static or dynamic they are going to go. It certainly helps implementors if there are variants of these ops that are fully dynamic so that we can get them into the ecosystem/tools and then apply further constraints and simplifications to reduce them to a pure spec-compliant form.

Some frontends, yes.

In MLIR, it is declared as a Variadic<IndexType> in either an operand or result. In PyTorch, these tends to be their own built-in list type. The TensorFlow frontends are a bit unique in that they really only have one data-type and that is a tensor. Others have lists and scalars and those map pretty directly.

Why does the reference model need static shapes? Is this essential or just an artifact of the current implementation?

I guess a broader question would be: what is the downside of making TOSA be intrinsically fixed-ranked+dynamically-shaped with the fully static case be the degenerate special case of that? Personally, I think we should be aiming for a future where that’s the default in the ML ecosystem at these levels of abstraction.

The tosa.dynamic.* subset seems like a weird mishmash of frontend-level (ops that need error handling, conv dilations/etc. dynamic parameters) and codegen-level (no dynamic size-1 broadcasting, fixed ranks).

Are we maybe overfitting to the current TFLite import use case that is motivating this?

I think we can separate these from my overall proposal: I was trying to be comprehensive wrt issues I see, but they can be a separate discussion. The only one that really needs attention as part of this topic is RESHAPE.

I’m in complete agreement here that this is necessary, and support having this defined formally. As to the two concerns:

  1. Is this a spec or implementation guide ? Regardless of what it is, this document can be built up.
  2. Why these specific helper ops ? Again, this can be defended in some document/rationale form.
    So these concerns are likely procedural and not in blocking; as an infrastructural entity this is useful to have.

To express more nuance on this, I’ll start by considering what’s being desired as ‘dynamic’ (for the lack of a better adjective) - parameters of a few ops, i.e. slice, pad, scatter/gather, conv* . An implementation (e.g. MLIR) need not define them using compile time constant structures (e.g. MLIR attributes). But they must be defined entitiies. For example, pad must be a 4-element entity that has a fixed sequence (T B L R in this case).

It’s possible for these to be expressed using a lazily evaluable structure like DimensionValue if there are accessors that let us receive the contents in a const array form if these are fully evaluated at the TOSA abstraction.

For example, Conv2d pad could be legalized into a DimensionValue instead of an Confined<I64ArrayAttr, [ArrayCount<4>]>, with the DimensionValue then capable of being tested for static-evaluated values, e.g. something like
PadDimensionValues.dyn_cast<I64ArrayAttr>() which would succeed on a statically evaluated content.

This would keep the intent of the spec and current MLIR implementation unchanged, while adding support for lazily evaluating these values.

For these, legalization would have to carry the transmission of DimensionValues to the TOSA form. Of course, where the network cannot be statically evaluated at all, this is a problem. Are there cases where a graph form would not be statically evaluable at compile time because of dynamically data-dependent evaluation of parameters ?

Another topic here is that the implementation/spec guide on this topic should clearly express the low level communication constructs that would be exposed between the compiled network and the runtime. This is a necessity so that microarchitectural optimizations can minimize communication of runtime information between host and device sides. So this is design information that would need a greater level of detailed specification and agreement upon.

The reference model isn’t intended to do more than run fully formed TOSA graphs. Yes it wouldn’t work on graphs where certain constructs are dynamically data-dependent, as described in the earlier part of this post. But that’s still alright - there’s significant value towards hw/sw design in being able to construct statically evaluable networks as conformance tests for verification purposes, and that’s what the reference model enables right now.

Probably no downside beyond the definition of the necessary implementation guide or specification detail needed to ensure how that degenerate form can then be constructed (and define when it cannot be done).

For compile design, this is a very pertinent question, and I think it’s easier to approach it from the perspective of taking an implicitly statically formed construct as we have now, and express generalizations like this, from which the static form can still be derived.

Good question. It sounds like there is enough interest in an implementation guide to write this out a bit more formally, and that including a rationale would be useful. The short answer for the rationale is a combination of principles (i.e. preserve the highest level common form that preserves the static information in the program without prematurely lowering to an overly dynamic form) and experience in finding these specific ops to be sufficient and useful for both representing and compiling a broad set of symbolically shaped programs directly, without complex analysis.

I can go into more detail when I formalize such a document (which will need to wait for a few weeks due to a leave). We have a lot of this implemented in IREE and upstream LLVM/MLIR, which has a mode where it can function as a TOSA based compiler. It is a bit spread out across layers and projects right now, and getting agreement on this kind of proposal would let us pull it all together into MLIR itself, support it directly from TensorFlow conversions and make it more generally useful (both to us and others).

That sounds quite reasonable :slight_smile: Yes I think a standard guide would be very useful. Ideally it would be agnostic to a compiler infrastructure but could use LLVM/MLIR for examples. It would also need to define a host-device abstraction to express what run time information is carried to the implementations of the operators as dynamically resolved/broadcasted shapes are known.

I also realized I missed the Iota op. It’s not clear why that could not just be tosa.const() compiler-generated construct. Looking at the description: https://www.tensorflow.org/xla/operation_semantics#iota it seems to define an intent to build a constant on device. Can it not use const ? If the range is run-time resolved, there’s going to be a runtime construction hit yes. It’s not clear how this is a ‘hot’ op indicating the need for a hardware supported capability in the form of a unique op.

Let’s just descope all of these DATA_DEPENDENT additions from this proposal (except for RESHAPE). We can work them as the cases arise. Me including them was more of having an inkling that they are needed for various advanced cases just based on history – but we can walk them forward then.

FWIW, when switching from static->symbolic shapes, a lot of things that were constants need to become actual ops that compose with transformations, and at least in the XLA world, IOTA is a primitive that gets used to compose various other higher level computations. We can present it separately with a bit more of a study/rationale.

Sounds good. With an effectve construct in place (a combination of semantics to define symbolic shapes, and a means to infer/extract static values from it that can apply to conventional tensor content as well as op parameters), it might follow that the data_dependent op forms may not be necessary.