Coordinate independent CNNs on Riemannian manifolds

An introduction to equivariant & coordinate independent CNNs – Part 5

Maurice Weiler

May 3, 2024 24 min read

Observers and their internal coordinate systems

This is the last post in my series on equivariant deep learning and coordinate independent CNNs.

Part 1: Equivariant neural networks – what, why and how ?
Part 2: Convolutional networks & translation equivariance
Part 3: Equivariant CNNs & G-steerable kernels
Part 4: Data gauging, co variance and equi variance
Part 5: Coordinate independent CNNs on Riemannian manifolds

In this post we investigate how convolutional networks are generalized to the differential geometric setting, that is, to process feature fields on manifolds (curved spaces). There are numerous applications of such networks, for instance, to classify, segment or deform meshes, or to predict physical quantities like the wall shear stress on an artery surface or tensor fields in curved spacetime. A differential geometric formulation establishes furthermore a unified framework for more classical models like spherical and Euclidean CNNs.

Manifolds do in general not come with a canonical choice of coordinates. CNNs are therefore naturally formulated as a gauge field theory, where the gauge freedom is given by choices of local reference frames, relative to which features and network layers are expressed.

Demanding the coordinate independence (gauge independence) of such CNNs leads inevitably to their equivariance under local gauge transformations. These gauge equivariance requirements correspond exactly to the $G$-steerability constraints from the third post, however, they are here derived in a more general setting.

This post is structured in the following five sections, which cover:

An intuitive introduction from an engineering viewpoint. It identifies the gauge freedom of choosing reference frames with an ambiguity in aligning convolution kernels on manifolds.
Coordinate independent feature spaces, which may be represented in arbitrary gauges, and are characterized by their transformation laws when transforming frames.
The necessity for the gauge equivariance of neural network layers.
The global isometry equivariance of these operations.
Applications on different manifolds and with various equivariance properties.

This post’s content is more thoroughly covered in our book Equivariant and Coordinate Independent CNNs, specifically in part II (simplified formulation), part III (fiber bundle formulation), and part IV (applications).

To define CNNs on manifolds, one needs to come up with a reasonable definition of convolution operations. As discussed in the second post of this series, convolutions on Euclidean spaces can be defined as those linear maps that share synapse weights across space, i.e. apply the same kernel at each location.

Kernel alignments as gauge freedom

As it turns out, finding a consistent definition of spatial weight sharing on manifolds is quite tricky. The central issue is the following:

The geometric alignment of convolution kernels on manifolds is inherently ambiguous.

For instance, on the monkey’s head below, it is unclear in which rotation a given kernel should be applied.

Rotationally ambiguous kernel alignment on a mesh of spherical topology.

The specific level of ambiguity depends on the manifold’s geometry. For example, the Möbius strip allows for kernels to be aligned along the strip’s circular direction, disambiguating rotations. However, as the strip is twisted, it is a non-orientable manifold, i.e. it does not have a well-defined inside and outside. This implies that the kernel’s reflection remains ambiguous.

Ambiguous kernel orientation on the Möbius strip.

As a third example, consider Euclidean vector spaces $\mathbb{R}^d$. They come canonically with Cartesian coordinates, along which one can uniquely align kernels without any remaining ambiguity. Transformations between “different” alignments are hence trivial, i.e. restricted to the identity map.

Canonical kernel alignment on Euclidean vector spaces.

In each of these examples, kernel alignments are specified up to transformations in some matrix group $G\leq \mathrm{GL}(d)$, e.g. rotations $G=\mathrm{SO}(2)$, reflections $G=\{\pm1\}$, or the trivial group $G=\{e\}$. The specific group $G$ depends hereby on the mathematical structure of the manifold. Since we are assuming Riemannian manifolds, we always have access to a metric structure, which allows to align kernels without stretching or shearing them but leaves their rotation and reflection ambiguous, i.e. $G=\mathrm{O}(d)$. That we could reduce $G$ further in the above examples implies that we assumed additional geometric structure, e.g. an orientation (inside/outside) on the monkey head. Any mathematical structure which disambiguates kernel alignments up to transformations in $G$ is called G-structure.

Steerable kernels as gauge independent operations

To remain general, we assume any Riemannian manifolds with any additional $G$-structure for arbitrary $G\leq\mathrm{GL}(d)$. From a practical viewpoint this means that we need to address $G$-ambiguities in kernel alignments when defining convolutions.

Given the context of steerable CNNs from the third post, an obvious solution is to use $G$-steerable kernels. We introduced these kernels there as being $G$-equivariant in the sense that any $G$-transformations of their field of view results in a corresponding $G$-transformations of their response feature vector.

Equivariance of G-steerable kernels under active transformations of their fields of view.

Here we have the slightly different situation of $G$-transformations of the kernels’ alignments, while keeping their fields of view fixed. However, viewed from a kernel’s frame of reference, these two situations are actually indistinguishable! Different alignments of steerable kernels are therefore guaranteed to result in $G$-transformed responses. Such features can hence be viewed as different (covariant) numerical representations of the same abstract feature, just being expressed relative to different frames of reference.

Covariance of G-steerable kernels under passive gauge transformations.

In short, coordinate independent CNNs are just neural networks which apply $G$-steerable kernels, biases or nonlinearities on manifolds with a $G$-structure. The covariance of kernel responses guarantees the independence of the encoded information from particular choices among $G$-ambiguous kernel alignments.

As the visualizations above already suggest, $G$-ambiguities of kernel alignments relate to $G$-ambiguities in choosing reference frames on manifolds.

$G\mkern-2mu=\mkern-2mu\{e\}$ (trivial)

$G\mkern-2mu=\mkern-2mu\{\pm1\}$ (reflections)

$G\mkern-2mu=\mkern-2mu\mathrm{SO}(2)$ (rotations)

$G\mkern-2mu=\mkern-2mu\{\mathbb{R}_{>0},*\}$ (scaling)

Any geometric quantities, in particular feature vectors, are required to be $G$-covariant, that is, coordinate independent in the sense that they are expressible relative to any of the ambiguous frames. That convolutional networks need to apply $G$-steerable kernels, biases or nonlinearities follows then by demanding that their layers respect the features’ coordinate independence.

To explain what I mean with “coordinate independent feature spaces”, this section discusses

tangent spaces and their frames of reference,
$G$-structures as bundles of geometrically preferred frames, and
coordinate independent feature vectors.

Formally, these quantities are described by fiber bundles that are $G$-associated to each other. An intuitive introduction to the content discussed here is given in chapters seven and eight of our book.

Tangent vectors and reference frames

The idea of coordinate independence is best illustrated by the example of tangent vectors. Click on the ❯ arrow to uncover the diagram and the explanatory bullet points step by step.

Gauges as coordinate representations of a single tangent space TpM, slide 1

Gauges as coordinate representations of a single tangent space TpM, slide 2

Gauges as coordinate representations of a single tangent space TpM, slide 3

Gauges as coordinate representations of a single tangent space TpM, slide 4

❮ ❯

The tangent spaces of a $d$-dimensional manifold $M$ are vector spaces $T_pM\cong\mathbb{R}^d$. A-priori, they are coordinate free, that is, they come without any choice of basis.

To represent an abstract tangent vector $v\in T_pM$ numerically one chooses any reference frame $A$ (ordered basis), relative to which $v$ has coefficients $v^A\in\mathbb{R}^d$. Choosing a frame is equivalent to specifying a vector space isomorphism $\psi_p^A: T_pM \xrightarrow{\,\sim\,} \mathbb{R}^d$, which we call gauge.

The choice of frame (or isomorphism) is in general not canonical. One could therefore have picked another frame $B$, relative to which $v\in T_pM$ has different coefficients $v^B = \psi_p^B(v) \in \mathbb{R}^d$.

Numerical coefficients in different gauges are related by gauge transformations $v^B = g_p^{BA} v^A$ where $g_p^{BA} := \psi_p^B\circ (\psi_p^A)^{-1}: \mathbb{R}^d\to \mathbb{R}^d$. This is just a change of basis as well-known from linear algebra.

Note how transformations of reference frames and tangent vector coefficients are coupled to each other. This is what is meant when saying that they are associated to each other.

A more formal mathematical description would consider the whole tangent bundle instead of a single tangent space since this allows to capture concepts like the continuity or smoothness of tangent and feature fields. Gauges (local bundle trivializations) correspond then to smooth frame fields (gauge fields) on local neighborhoods $U^A$ or $U^B \subseteq M$ and gauge transformations are defined on overlaps $U^A\cap U^B$ of these neighborhoods.

If one picks a coordinate chart $x^A\mkern-3mu: U^A\to V^A\mkern-2mu\subseteq\mkern-1mu\mathbb{R}^d$, frames are induced as coordinate bases with axes $\frac{\partial}{\partial x^A_\mu} \Big|_p$, gauge maps are given by chart differentials $(\psi_p^A)_\mu = dx_\mu^A\big|_p$, and gauge transformations correspond to Jacobians $(g_p^{BA})_{\mu\nu} = \frac{\partial x^B_\mu}{\partial x^A_\nu}$ of chart transition maps. However, the gauge formalism in terms of fiber bundles is more general and more suitable for our purpose.

$G$-structures

Recall how the manifold’s mathematical structure reduces the ambiguity in kernel alignments such that transformations between them took values in some subgroup $G\leq\mathrm{GL}(d)$. This is made technically precise by $G$-structures, which are $G$-bundles of structurally distinguished frames.

A-priori, a smooth manifold has no additional structure that would prefer any reference frame. One therefore considers sets $F_pM$ of all possible frames of tangent spaces $T_pM$. Gauge transformations between such general frames are any invertible linear maps, i.e. take values in the general linear group $\mathrm{GL}(d)$. Taken together, the sets of frames $F_pM$ from all points $p\in M$ form the frame bundle $FM$.

Additional structure on smooth manifolds allows to restrict attention to specific subsets of frames. For instance, a Riemannian metric allows to measure distances and angles, and hence to single out orthonormal frames, which are mutually related by rotations and reflections in $G=\mathrm{O}(d)$. Conversely, any $\mathrm{O}(d)$-subbundle of frames determines a unique metric since such sets of orthonormal frames allow for consistent angle and distance measurements. Such equivalences

$G$-structure $\mkern10mu\iff\mkern10mu$ $G$-subbundle of frames

hold for other structure groups $G$ as well. Some more examples:

$\boldsymbol{G}$-structure	$\boldsymbol{G}$-subbundle of frames	structure group $\boldsymbol{G\leq\mathrm{GL}(d)}$
smooth structure only	any frames	$\mathrm{GL}(d)$
orientation	right-handed frames	$\mathrm{GL}^+(d)$
volume form	unit-volume frames	$\mathrm{SL}(d)$
Riemannian metric	orthonormal frames	$\mathrm{O}(d)$
pseudo-Riemannian metric	Lorentz frames	$\mathrm{O}(1,\,d\mkern-2mu-\mkern-2mu1)$
metric + orientation	right-handed orthonormal frames	$\mathrm{SO}(d)$
parallelization	frame field (unique frames)	$\{e\}$

The graphics below give a visual intuition for $G$-structures on different manifolds $M$ and various structure groups $G$. Note that one may have different $G$-structures for the same manifold and group, similar to how one can have different metrics, orientations or volume forms on a manifold.

As explained below, each of these $G$-structures implies corresponding convolution operations, whose local gauge equivariance depends on the structure group $G$ and whose global equivariance is determined by the $G$-structure’s global symmetries.

The manifold’s topology may obstruct the existence of a continuous $G$-structure for groups beyond an irreducible structure group. For instance, the Möbius strip is non-orientable, which means that there is no way to disambiguate reflections without introducing a discontinuity. Similarly, the hairy ball theorem implies that frame fields on the sphere ($G=\{e\}$) will inevitably have singularities. This implies:

Any CNN will necessarily have to be $G$-covariant w.r.t. the manifolds irreducible structure group $G$ if the continuity of its predictions are desired.

The manifold's topology might therefore make the use of $G$-steerable kernels strictly necessary!

Coordinate independent feature vectors

Feature vectors on a manifold with $G$-structure need to be $G$-covariant, i.e. expressible in any frame of the $G$-structure. This requires them to be equipped with a $G$-representation $\rho$, called feature field type, which specifies their gauge transformations when transitioning between frames. Specifically, when $f(p)$ is an abstract $c$-dimensional feature at $p$, it is represented numerically by coefficient vectors $f^A(p)$ or $f^B(p)$ in $\mathbb{R}^c$, which are related by $$f^B(p) \,=\, \rho\big(g_p^{BA}\big) f^A(p)$$ when changing from frame $A$ to frame $B$ via $g_p^{BA}\in G$. Feature vector gauges and gauge transformations between them are again concisely represented via a commutative diagram. Note the similarity to the gauge diagram for tangent vectors above!

Gauges and gauge transformations of feature vectors.

This construction allows to model scalar, vector, tensor, or more general feature fields:

feature field	field type $\boldsymbol{\rho}$
scalar field	trivial representation $\rho(g)=1$
vector field	standard representation $\rho(g)=g$
tensor field	tensor representation $\rho(g) = (g^{-\top})^{\otimes s} \otimes g^{\otimes r}$
irrep field	irreducible representation
regular feature field	regular representation

For a geometric interpretation and specific examples of feature field types, have a look at the examples given in the third post on Euclidean steerable CNNs. In fact, the feature fields introduced here are the differential geometric generalization of the fields discussed there.

Overall, we have the following $G$-associated gauge transformations of objects on the manifold:

frames transform according to a right action of $(g_p^{BA})^{-1}$
tangent vector coefficients get left-multiplied by $g_p^{BA}$
feature vector coefficients transform according to $\rho(g_p^{BA})$

Furthermore, these objects have by construction compatible parallel transporters and isometry pushforwards (global group actions).

Coordinate independent CNNs are built from layers that are 1) coordinate independent and 2) share synapse weights between spatial locations (synapse weights referring e.g. to kernels, biases or nonlinearities). Together, these two requirements enforce the shared weight’s steerability, that is, their equivariance under $G$-valued gauge transformations: $$ \begin{array}{c} \textup{coordinate independence} \\[2pt] \textup{spatial weight sharing} \end{array} \mkern24mu\bigg]\mkern-11mu\Longrightarrow\mkern16mu \textup{$G$-steerability / gauge equivariance} $$

Kernels in geodesic normal coordinates

In contrast to Euclidean spaces, the local geometry of a Riemannian manifold might vary from point to point. It is therefore not immediately clear how convolution kernels should be defined on it and how they could be shared between different locations. A common solution is to define kernels as usual on flat Euclidean space and to apply them on tangent spaces instead of the manifold itself.

Kernel on tangent space and projection via the Riemannian exponential map.

To match the kernel with feature fields it needs to be projected to the manifold, for which we leverage the Riemannian exponential map. Equivalently, one can think about this as pulling back the feature field from the manifold to the tangent spaces. When being expressed in a gauge, this corresponds to applying the kernel in geodesic normal coordinates.

Gauge equivariance

The equivariance requirement on kernels follows by the same logic as discussed in the previous post: a-priori, $G$-covariance just requires consistent gauge transformation laws of kernels, but weight sharing can only remain coordinate independent when kernels are constrained to be $G$-steerable.

G-covariance : Assume that we are given a coordinate free kernel $\mathcal{K}_p$ on a tangent space $T_pM$. It can of course be expressed in different frames of reference. The coordinate expressions $\mathcal{K}_p^A$ and $\mathcal{K}_p^B$, defined on $\mathbb{R}^d$, are then related by some gauge transformation law which is derived here.

It is important to note that this $G$-covariant formulation does not yet impose any symmetry constraints on kernels, but is merely tying their coordinate representations in a consistent manner.

G-equivariance : In the case of convolutions, there is no initial kernel $\mathcal{K}_p$ on $T_pM$ given, but rather a kernel $K$ on $\mathbb{R}^d$ which should be shared over all tangent spaces. Choosing kernel alignments $A$ or $B$ corresponds mathematically to defining $\mathcal{K}_p$ by setting $\mathcal{K}_p^A=K$ or $\mathcal{K}_p^B=K$. However, the choice of gauge/frame/alignment is ambiguous, and one will in general obtain incompatible results.

Weight sharing of a given kernel along different gauges.

Aligning $K$ in any single gauge would prefer that specific gauge, and therefore break coordinate independence. The solution is to treat all gauges equivalently, that is to set $$ \mathcal{K}_p^X = K \mkern24mu\textup{for }\textit{any }\textup{gauge $X$ of the $G$-structure,} $$ which can be interpreted as an additional weight sharing over all frames of the $G$-structure. Doing so turns the covariance conditions of the form $$ \mathcal{K}_p^B\ =\ g_p^{BA} \circ \mathcal{K}_p^A \circ \big( g_p^{BA} \big)^{-1} $$ for any gauges $A$ and $B$ into $G$-equivariance constraints $$ K\ =\ g \circ K \circ g^{-1} \qquad\forall\ g\in G. \mkern-43mu $$ These are exactly the $G$-steerability constraints known from steerable CNNs on Euclidean spaces.

Example : To get an intuition for the role of steerable kernels, let's consider the example of a reflection group structure. As discussed in the third post, two possible field types $\rho$ for the reflection group are the trivial and the sign-flip representation. They correspond to scalar and pseudoscalar fields, whose numerical coefficients stay invariant and negate under frame reflections, respectively. Reflection-steerable kernels that map from scalar to pseudoscalar fields were constrained to be antisymmetric.

Applying such an antisymmetric kernel in some gauge $A$ to a scalar field results in some response field in gauge $A$. If the kernel is instead applied in the reflected gauge $B$, the response field will due to the kernel's antisymmetry end up negated. This transformation behavior does indeed identify the response field as being of pseudoscalar type.

Weight sharing of reflection steerable kernels along different gauges of a reflection group structure.

You can check equivalent properties for any of the other pairs of field types and their steerable kernels from the third post. The difference here is that we are considering passive transformations of frames and kernel alignments instead of active transformations of signals. As only the relative alignment of kernels and signals matters, the behavior is ultimately equivalent.

Similar $G$-steerability requirements as for convolution kernels can be derived for ${1\mkern-6.5mu\times\mkern-5.5mu1}$-convolutions, bias vectors and nonlinearities.

Classically, convolutional networks are those networks that are equivariant w.r.t. symmetries of the space they are operating on. For instance, conventional CNNs on Euclidean spaces commute with translations, Euclidean steerable CNNs commute with affine groups, and spherical CNNs commute with $\mathrm{SO}(3)$ rotations. Similarly, one may ask in how far coordinate independent CNNs are equivariant w.r.t. isometries, which are the symmetries of Riemannian manifolds (distance preserving diffeomorphisms).

Commutative diagram visualizing the isometry equivariance of coordinate independent convolutions.

The following section investigates the prerequisites for a layer’s isometry equivariance prior to convolutional weight sharing. In the one thereafter we apply the results to coordinate independent convolutions.

Isometry invariant kernel fields

Our main theorem regarding the isometry equivariance of coordinate independent CNNs establishes the mutual implication

isometry equivariant layer $\mkern10mu\iff\mkern10mu$ isometry invariant neural connectivity ,

which applies either to the manifold's full isometry group $\mathrm{Isom}(M)$ or to any of its subgroups. For linear layers this requires

weight sharing of kernels across isometry orbits (points related by the isometry action) and
the kernels' steerability w.r.t. their respective stabilizer subgroup.

Two examples are shown below. The first one considers the rotational isometries of an egg-shaped manifold, whose orbits are rings at different heights and the north and south pole. In principle, equivariance does not require weight sharing across the whole manifold, but just on the rings, allowing for different kernels on different rings. The stabilizer subgroups on the rings are trivial, leaving the kernels themselves unconstrained. The second example considers an $\mathrm{O}(2)$-invariant kernel field. While the orbits remain the same, their stabilizer subgroups are extended by reflections, such that the kernels are required to become reflection-steerable. The specific steerability constraints depend of course on the field types between which the kernels are supposed to map.

Isometry invariant kernel fields on an egg-shaped manifold.

A special case are manifolds like Euclidean spaces or the sphere $S^2$ – they are homogeneous spaces, which means that their isometry group acts transitively, i.e. allows to map any point $p$ to any other point $q$. Consequently, there exists only one single orbit and kernel. Another one of our theorems asserts:

Isometry equivariant linear layers on homogeneous spaces are necessarily convolutions.

Isometry equivariance of convolutions

Coordinate independent convolutions rely on specific convolutional kernel fields, constructed by sharing a single $G$-steerable kernel along the frames of a $G$-structure. The convolutions’ isometry equivariance depends consequently on the extent of invariance of these kernel fields.

As a first example, consider the $\{e\}$-structure (frame field) in the left figure below. Its invariance under horizontal translations leads to an equivalent symmetry in kernel fields – the corresponding convolutions are therefore equivariant under horizontal translations. The translation and reflection invariance of the reflection-structure in the right figure is similarly carried over to its kernel fields, thus implying translation and reflection equivariant convolutions.

A convolutional kernel field on Euclidean space that is invariant under horizontal translations.

A convolutional kernel field on Euclidean space that is invariant under translations and reflections.

The observation that convolutional kernel fields inherit the symmetries of their underlying $G$-structure holds in general. Based on this insight, we have the following theorem:

Coordinate independent CNNs are equivariant w.r.t. those isometries that are symmetries of the $G$-structure.

This allows us to design equivariant convolutions on manifolds by designing $G$-structures with the appropriate symmetries! More examples of $G$-structures and the implied equivariance properties are discussed in the applications section below.

Diffeomorphism and affine group equivariance

Beyond isometries, one could consider general diffeomorphisms. Any operation that acts pointwise, for instance bias summation or ${1\mkern-6.5mu\times\mkern-5.5mu1}$-convolutions, can indeed be made fully diffeomorphism equivariant by choosing $G=\mathrm{GL}(d)$. However, that does not apply to convolutions with spatially extended kernels as their projection to the manifold via the exponential map depends on the metric structure and does therefore only commute with isometries, i.e. metric preserving diffeomorphisms. To achieve diffeomorphism equivariance, steerable kernels would have to be replaced by steerable partial differential operators.

Specifically on Euclidean spaces, the Riemannian exponential map does not only commute with isometries in the Euclidean group $\mathrm{E}(d) =$ $(\mathbb{R}^d,+)\rtimes \mathrm{O}(d)$, but also with more general affine groups $\mathrm{Aff}(G)$ $=$ $(\mathbb{R}^d,+)\rtimes G$ for arbitrary $G\leq\mathrm{GL}(d)$. Choosing $\mathrm{Aff}(G)$-invariant $G$-structures on Euclidean spaces leads therefore to $\mathrm{Aff}(G)$-equivariant convolutions which turn out to be exactly the Euclidean steerable CNNs from the third post.

Our gauge theoretic formulation of feature vector fields and network layers is quite general: we were able to identify more than 100 models from the literature as specific instantiations of coordinate independent CNNs. While the authors did not formulate their models in terms of $G$-structures and field types $\rho$, these geometric properties follow from the weight sharing patterns and kernel symmetries that they proposed.

Table of coordinate independent CNNs in the literature.

The next sections give a brief overview of the different model categories in this literature review. As the networks’ local and global equivariance properties correspond to symmetries of the underlying $G$-structures, they are intuitively visualized by plots of the latter.

Euclidean steerable CNNs

All of the models in the first 30 lines are steerable CNNs on Euclidean spaces, that is, conventional convolutions with $G$-steerable kernels. From the differential geometric viewpoint, they correspond to $G$-structures that are $\mathrm{Aff}(G)$-invariant, i.e. invariant under translations and $G$-transformations. This affine symmetry of the $G$-structures implies the convolutions’ $\mathrm{Aff}(G)$-equivariance. Aside from the choice of structure group, these models differ mainly in the feature fields types $\rho$ that they operate on.

The major new insight in comparison to the classical formulation of equivariant CNNs is that coordinate independent CNNs do not only describe the models’ global $\mathrm{Aff}(G)$-equivariance, but also their local gauge equivariance, i.e. generalization over local $G$-transformations of patterns.

Comparison of global and local transformations

For more details on Euclidean steerable CNNs, have a look at the third post of this series.

Polar and hyperspherical convolutions on Euclidean spaces

An alternative is to perform convolutions in polar coordinates, which are defined everywhere but at the origin of $\mathbb{R}^2$ (line 31). As evident from the visualization of the corresponding $\{e\}$-structure, such convolutions are $\mathrm{SO}(2)$ equivariant around the origin, but no longer translation equivariant. Adding reflected frames and reflection-steerable kernels implies similar $\mathrm{O}(2)$-equivariant convolutions.

Instead of using polar coordinates with an isometric radial part one may use log-polar coordinates (line 32), whose frames scale exponentially with the radius. This $\{e\}$-structure is not only rotation invariant but also invariant under rescaling of $\mathbb{R}^2\backslash\{0\}$ and implies hence rotation and scale equivariant convolutions (Esteves et al., 2018). It is easily implemented as a conventional Euclidean convolution after resampling the feature field in the coordinate chart (pulling it from the left to the right side of the figure). Rotations and scaling correspond in the chart to horizontal and vertical translations, respectively.

Higher-dimensional analogs of these architectures on $\mathbb{R}^d\backslash\{0\}$ rely on convolutions on $d\mkern-3mu-\mkern-4.5mu1$-dimensional spherical shells at different radii (lines 33,34); see here for more infos and visualizations.

Spherical CNNs

Spherical CNNs are relevant for processing omnidirectional images from 360° cameras, the cosmic microwave background, or climate patterns on the earth’s surface. They come in two main flavors:

The first one are fully $\mathrm{SO}(3)$-rotation equivariant models which rely on an $\mathrm{SO}(3)$-invariant $\mathrm{SO}(2)$-structure on the $2$-sphere $S^2$. A seeming difference to related work is that we define steerable kernels on tangent spaces instead of defining them directly on the sphere. However, we could prove that the resulting convolution operations are ultimately equivalent since the kernel spaces can, up to a set of zero measure, be mapped to each other. A related variant are $\mathrm{O}(3)$-equivariant spherical CNNs that apply $\mathrm{O}(2)$-steerable kernels. Spherical CNNs come of course again with different field types, listed in lines 35-37 in the table above.

The second flavor are spherical CNNs that are only $\mathrm{SO}(2)$-rotation equivariant around a preferred polar axis (line 38). They correspond to the visualized $\mathrm{SO}(2)$-invariant $\{e\}$-structure, whose frames $\big[\frac{\partial}{\partial\theta},\, \frac{1}{\cos\theta}\frac{\partial}{\partial\varphi}\big]$ are induced by spherical coordinates $(\theta,\varphi)$ $\mapsto$ $\big( \cos\theta\cos\varphi,$ $\cos\theta\sin\varphi,$ $\sin\theta \big)^\top\mkern-4mu$. These convolutions are not defined at the poles, which manifests in our formulation in the singularities at the (cut out) poles. Kernels are here commonly projected via the gnomonic projection, which we could prove to be a special case of our projection via the exponential map after a radial kernel warping.

The spherical geometry may furthermore be approximated by that of an icosahedron (lines 39,40). An advantage of this approach is that the icosahedron is locally flat and allows for an efficient implementation via Euclidean convolutions on the five visualized charts. The non-trivial topology and geometry manifests in parallel feature transporters along the cut edges (colored chart borders). Icosahedral CNNs appear again in the two flavors above, where $\mathrm{SO}(2)$ is typically approximated by the cyclic group $\mathrm{C}_6$, which is a symmetry of the utilized hexagonal lattice.

Möbius CNNs

A simple toy model for illustrating the ideas of coordinate independent CNNs are convolutions on the Möbius strip. Due to the strip's non-orientability, such convolutions need to be orientation independent, i.e. they need to apply reflection-steerable kernels. The visualized reflection-structure implies convolutions that are equivariant w.r.t. translations of features along the strip's circular dimension. Note that this includes in particular reflections since they coincide with translations once around the strip.

Assuming the strip to be flat (have zero curvature), such convolutions are conveniently implemented in isometric coordinate charts. When splitting the strip in two charts as shown below, their transition map will at one end be trivial and at the other be related by a reflection. In an implementation, one can glue the two chart codomains at their trivial transition ($\mathrm{id}$). The only difference to reflection-steerable convolutions on Euclidean space is that the strip needs to be glued at the non-trivial cut ($\mathrm{flip}$). This is implemented via a parallel transport padding operation which pads both ends of the strip with spatially reflected and gauge reflected features from the respective other end.

If you are interested in learning more about Möbius convolutions, check out our implementation on github and the explanations and derivations in chapter 10 of our book.

General surfaces

Lastly, there are implementations on general surfaces ($d=2$), represented for instance by triangle meshes or as point clouds. Since surfaces are in general lacking symmetries, the requirement for isometry-equivariance does no longer guide the choice of $G$-structure. One should therefore merely assume the metric structure, i.e. use $\mathrm{O}(2)$-steerable kernels. If the surfaces are oriented, the structure group may be reduced further to $G=\mathrm{SO}(2)$. Most models in the literature rely on such rotation-steerable kernels, implicitly assuming what we identify as scalar, irrep or regular feature fields (lines 41-44).

An alternative approach is to address the ambiguity of kernel alignments on surfaces by computing them via some sort of heuristic. The examples in line 45 of the table above do this by aligning kernels along principal curvature directions of their embedding in $\mathbb{R}^3$ or along the embedding space’s z-axis, by parallel transporting kernels, via a local PCA of nodes, or by convolving over texture maps. In the framework of coordinate independent CNNs, these heuristics are interpreted as specifying $\{e\}$-structures on the manifold. Note that the heuristics may be instable under deformations, may not everywhere be well defined, and are likely to have singularities - the latter is actually unavoidable on non-parallelizable surfaces!

The main points discussed in this post are:

The geometric alignment of convolution kernels on a manifolds is often inherently ambiguous. This ambiguity can be identified with the gauge freedom of choosing reference frames.
The specific level of ambiguity depends on the manifold's mathematical structure. $G$-structures disambiguate frames up to $G$-valued gauge transformations.
Feature vectors and other mathematical objects on the manifold should be $G$-covariant, i.e. expressible relative to any frame from the $G$-structure (coordinate independent). They transform according to a $G$-representation $\rho$, called field type. Gauge transformations of frames, tangent and feature vector coefficients are synchronized, that is, their fiber bundles are $G$-associated.
In order for the spatial weight sharing of a kernel to remain coordinate independent, the kernel is required to be $G$-steerable, i.e. equivariant under gauge transformations. The same holds for other shared operations like bias summation or nonlinearities.
A layer is isometry equivariant iff its neural connectivity is invariant under isometry actions.
For convolutions, this neural connectivity is given by a kernel field whose symmetries coincide by construction with those of the $G$-structure. Convolutions are therefore equivariant under those isometries that are symmetries of the $G$-structure.

While being somewhat abstract, our differential geometric formulation of coordinate independent CNNs in terms of fiber bundles is highly flexible and allows to unify a wide range of related work in a common framework. It even includes completely non-equivariant models like those in line 45 of the table above – they correspond in our framework to asymmetric $\{e\}$-structures.

Of course there are neural networks for processing feature fields that are not explained by our formulation of coordinate independent CNNs. Such models could, for instance, rely on spectral operations, involve multi-linear correlators of feature vectors, operate on renderings, be based on graph neural networks, or on stochastic PDEs like diffusion processes, to name but a few alternatives. Importantly, these approaches are compatible with our definition of feature spaces in terms of associated $G$-bundles, they are just processing these features in a different way.

An interesting extension would be to formulate a differential version of coordinate independent CNNs, replacing our spatially extended steerable kernels by steerable partial differential operators. As mentioned above, this would allow for diffeomorphism equivariant CNNs.

Lastly, I would like to mention gauge equivariant neural networks for lattice gauge theories in fundamental physics, for instance (Boyda et al., 2021) or (Katsman et al., 2021). The main difference to our work is that their gauge transformations operate in an “internal” quantum space instead of spatial dimensions. However, both are naturally formulated in terms of associated fiber bundles. Their models are furthermore spatially equivariant and are in this sense compatible with our gauge equivariant CNNs.

Image references

Lizards and butterflies adapted under the Creative Commons Attribution 4.0 International license by courtesy of Twitter.
Mesh segmentation rendering from Sidi et al. (2021).
Artery wall stress rendering from Shiba et al. (2017).
Spacetime visualization from WGBH.
Cosmic microwave background adapted from Tegmark et al. (2023).
Electron microscopy of neural tissue adapted from the ISBI 2012 EM segmentation challenge.
Lobsters adapted under the Apache license 2.0 by courtesy of Google.
Owl teacher adapted from Freepik

Other visualizations and any adaptations of graphics are my own.