Data gauging, covariance and equivariance
An introduction to equivariant & coordinate independent CNNs – Part 4
This post is the fourth in a series on equivariant deep learning and coordinate independent CNNs.
- Part 1: Equivariant neural networks – what, why and how?
- Part 2: Convolutional networks & translation equivariance
- Part 3: Equivariant CNNs & G-steerable kernels
- Part 4: Data gauging, covariance and equivariance
- Part 5: Coordinate independent CNNs on Riemannian manifolds
The fundamental theories of nature, as well as equivariant convolutional networks, are adequately formulated as gauge field theories. But what exactly is a gauge, why are they required, and how do they relate to equivariant networks? This post clarifies these questions from a deep learning viewpoint. We discuss:
- Data gauging, which is the process of arbitrarily picking a specific numerical representation of data among a set of equivalent (redundant) representations.
- The covariance and equivariance of neural networks w.r.t. gauge transformations between different data representations, and how this relates to Einstein's theory of relativity.
This post is meant to give a first introduction to the concepts of gauges and gauge transformations in deep learning, applying to arbitrary data modalities and neural networks in general. A gauge theoretic formulation of convolutional neural networks will be covered in the next post.
Gauge theories regulate redundancies in the mathematical description of data or other mathematical objects. A simple example from everyday life is that one may use different yet equivalent units of measurement, e.g. centimeters vs. inches for distances, or kilograms vs. pounds to measure weight. The specific choice of units (the gauge) is ultimately arbitrary since we can always convert between them (perform gauge transformations).
Neural networks crunch numbers – it is therefore necessary to numerically quantify data like physical lengths or masses by picking some gauge. A consistent gauge theoretic description simply ensures that the predictions of a network or theory remain independent of the particular choice of gauge.
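To make the unit example concrete, here is a minimal numerical sketch (the values are hypothetical, and the snippet assumes nothing beyond plain Python): the same physical length is represented in two gauges, centimeters and inches, and the gauge transformation is simply the conversion factor between them.

```python
# Two gauges for the same physical length: a numerical value only makes sense
# relative to a chosen unit of measurement.
length_cm = 2.54             # gauge A: the length expressed in centimeters
g_BA = 1 / 2.54              # gauge transformation: converts centimeters to inches

length_in = g_BA * length_cm     # gauge B: the very same length expressed in inches

# The physical length itself is gauge independent; only its numerical
# representation changes when transforming between units.
assert abs(length_in - 1.0) < 1e-12
```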
The following paragraphs clarify the concepts of gauges and gauge transformations with simple examples, like measuring distances, ordering set elements or the nodes of a graph, or assigning coordinates to Euclidean space. Each example will be explained as a different instantiation of the same abstract commutative diagram shown below. Click on the ❯ arrow to uncover the diagram and the explanatory bullet points step by step.
Why should gauge transformations form a group in the first place? To clarify this question, note that the inherent ambiguity in picking gauges for an object comes from its symmetries. For instance, the identification of the circle below with numerical angles is only unique up to, e.g., rotations.
Rotations are just one example of what could be meant by “symmetries of a circle”. The specific level of symmetries of an object depends on the mathematical structure with which it is equipped. Let’s look at typical mathematical structures and the implied covariance groups for our example of gauging a circle.
| mathematical structure | covariance group $\boldsymbol{G}$ | |
|---|---|---|
| topology | $\mathrm{Homeo}(S^1)$ | continuous symmetries (homeomorphism group) |
| smooth structure | $\mathrm{Diff}(S^1)$ | smooth symmetries (diffeomorphism group) |
| Riemannian metric | $\mathrm{O}(2)$ | rotations and reflections (isometry group) |
| oriented Riemannian metric | $\mathrm{SO}(2)$ | rotations (orientation-preserving isometries) |
| angles explicitly given | $\{e\}$ | trivial group (there is a single unique gauge) |
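For the oriented Riemannian case, the following sketch (hypothetical angles, NumPy only) illustrates the $\mathrm{SO}(2)$-ambiguity: two valid gauges assign angles relative to different reference directions, and the gauge transformation relating them is a single global rotation offset.

```python
import numpy as np

# Hypothetical points on the circle, identified with numerical angles by two
# different gauges that measure angles from different reference directions.
true_angles = np.random.rand(5) * 2 * np.pi            # unknown "abstract" positions
offset_A, offset_B = 0.7, 2.1                           # two arbitrary reference directions

angles_A = (true_angles - offset_A) % (2 * np.pi)       # representation in gauge A
angles_B = (true_angles - offset_B) % (2 * np.pi)       # representation in gauge B

# The gauge transformation g^BA is one rotation, i.e. an element of SO(2),
# applied to all points simultaneously.
g_BA = (offset_A - offset_B) % (2 * np.pi)
assert np.allclose((angles_A + g_BA) % (2 * np.pi), angles_B)
```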
In the context of deep learning, such $G$-ambiguities in the numerical representation of data need to be addressed by making neural networks $G$-covariant or $G$-equivariant. Before coming to equivariant networks, we give several examples of gauging data that commonly appears in deep learning applications.
Example : Ordering set elements
Data often comes in the form of sets (or multisets) of features without additional mathematical structure. Neural networks for such data, like Deep Sets (Zaheer et al., 2017), are typically permutation equivariant. Let’s see how these permutations arise in our gauge theoretic viewpoint.
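As a small illustration (hypothetical features, NumPy only): storing a set as an array requires picking one of several equivalent orderings, and any two such gauges are related by a permutation. A permutation invariant readout, as used in Deep Sets, gives the same answer in either ordering.

```python
import numpy as np

# A hypothetical set of three feature vectors; it has no intrinsic order.
features = {"a": [1., 0.], "b": [0., 2.], "c": [3., 1.]}

order_A = ["a", "b", "c"]                              # gauge A: one arbitrary ordering
order_B = ["c", "a", "b"]                              # gauge B: another arbitrary ordering
X_A = np.array([features[k] for k in order_A])
X_B = np.array([features[k] for k in order_B])

# The gauge transformation g^BA is the permutation relating the two orderings.
perm = [order_A.index(k) for k in order_B]
assert np.allclose(X_A[perm], X_B)

# A permutation invariant readout (as in Deep Sets) is insensitive to the gauge.
assert np.allclose(X_A.sum(axis=0), X_B.sum(axis=0))
```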
Example : Labeling graph vertices
Graphs $\mathcal{G}=(V,E)$ are sets of vertices $V$, equipped with an additional edge structure $E \subseteq V\mkern-4mu\times\mkern-3muV$. This edge structure reduces the ambiguity of gauges such that gauge transformations take values in graph automorphisms (the symmetries of graphs) instead of more general vertex label permutations.
Some exemplary graphs and their automorphisms are shown below. Differently colored nodes are distinguished by their edge structure, while those of the same color remain indistinguishable. Gauges are accordingly disambiguated such that the remaining gauge transformations are only permutations within nodes of the same color. The simple graph in the slideshow above would only require two gauges, differing solely in how they assign the labels 3 and 4 to the two nodes on the right.
Some graph neural networks in the literature are fully permutation equivariant (Maron et al. 2019a), (2019b), while others explicitly leverage the edge structure, reducing the equivariance to automorphisms (de Haan et al. 2020), (Thiede et al. 2021). From our gauge theoretic perspective, we interpret these models as different choices of covariance groups.
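The restriction from general permutations to automorphisms is easy to check numerically. The following sketch (a hypothetical 4-node graph, NumPy only) enumerates all vertex relabelings and keeps those that leave the adjacency matrix, i.e. the edge structure, unchanged.

```python
import numpy as np
from itertools import permutations

# Hypothetical graph: vertex 0 is attached to vertex 1, and vertices 1, 2, 3
# form a triangle. Only relabelings preserving this edge structure are allowed.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [0, 1, 1, 0]])

automorphisms = []
for perm in permutations(range(4)):
    P = np.eye(4, dtype=int)[list(perm)]      # permutation matrix of the relabeling
    if np.array_equal(P @ A @ P.T, A):        # relabeling leaves the edges intact
        automorphisms.append(perm)

# Only the identity and the swap of the two interchangeable vertices 2 and 3
# survive, i.e. the covariance group is much smaller than the full group S_4.
print(automorphisms)    # [(0, 1, 2, 3), (0, 1, 3, 2)]
```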
Example : Assigning coordinates to Euclidean spaces
Convolutional neural networks (CNNs) process spatial signals, in the simplest case on Euclidean spaces. Gauges correspond here to choices of coordinate systems, relative to which feature vector fields are represented numerically.
If you have never thought about choices of coordinate charts in the context of convolutional networks, this is because numerical samples are usually already expressed in some gauge. In practice, this choice of chart finds expression in the specific pixel grid of an image or the coordinates of a point cloud. In the case of photos, it is the photographer who selects a gauge by picking some camera alignment relative to the scene.
When processing such data, it is important to ask to what extent the choice of gauge was canonical. In the forest photo below, the photographer aligned their camera horizontally, which is canonically possible due to the preferred “up/down” direction implied by gravity. However, for aerial or microscopy images, the camera rotation is arbitrary, i.e. it represents just one of a whole set of equivalent gauges.
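To connect this to array representations, a short sketch (a random stand-in for an aerial image, NumPy only): a rotated camera simply yields the same scene expressed in a different chart, with the two pixel grids related by the corresponding gauge transformation.

```python
import numpy as np

# A hypothetical aerial image, sampled on the pixel grid of chart A, i.e. for
# one particular (arbitrary) camera rotation.
scene_chart_A = np.random.rand(64, 64)

# Had the camera been rotated by 90 degrees, the same scene would have been
# sampled in chart B instead; the arrays are related by a rotation of the grid.
scene_chart_B = np.rot90(scene_chart_A)

# Both arrays represent the same underlying scene. Since neither chart is
# canonical, a network processing them should account for this ambiguity.
```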
As explained in the following section, such gauge ambiguities need to be addressed by “gauge equivariant” neural networks.
So far, we discussed gauges and gauge transformations of data or other mathematical objects. We turn now to the gauging of neural network layers, i.e. of the functions that map between data (morphisms). Just as for the data itself, concrete numerical representations of an abstract neural operation will (in general) differ among gauges, and are related by gauge transformations. The specific choice of gauge is ultimately arbitrary, since the outputs of such numerical representations of network layers are guaranteed to transform as expected under gauge transformations.
It is important to distinguish two closely related concepts:
- Covariance refers to the consistency of a mathematical theory under gauge transformations. All computations, i.e. all objects and morphisms, can be expressed in any gauge.
- Equivariance is a stronger condition which imposes symmetry constraints on network layers. It can either be understood as the property of the layer to commute with (gauge) transformations, or, equivalently, as the invariance of the layer's numerical representation under (gauge) transformations.
The next paragraphs clarify what it means for a network layer to be covariant. After that, we explain how this relates to equivariant layers, when or why equivariance is actually required, and how it relates to Einstein’s theory of relativity.
Covariant network operations
The principle of covariance has a long history in physics, usually referring specifically to the coordinate independence of a theory. Albert Einstein (1916) formulated it as follows:
“Universal laws of nature are to be expressed by equations which hold good for all systems of coordinates, that is, are covariant with respect to any substitutions whatever.”
Adapted to machine learning and to more general data gauges, this principle becomes:
“Machine learning models are to be expressed by equations which hold good for all numerical data representations, that is, are covariant with respect to any gauge transformations.”
This is best explained with a few examples.
Example : Covariant convolutions
As it is geometrically very intuitive – almost tautological – our first example is that of a convolution layer, which can be expressed in different coordinate charts.
Note that covariance merely explains how different coordinate expressions $\mathrm{conv}^A$ and $\mathrm{conv}^B$ of the convolution relate, but does not require any constraints on the convolution itself! The outer commutative square is in particular not an equivariance diagram, as this would require that the top and bottom arrows are the same (invariant) operation. The corresponding equivariance diagram is discussed further below.
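The following sketch makes the commutative square numerical (NumPy/SciPy, with a 90 degree grid rotation standing in for a general chart transition): the expressions $\mathrm{conv}^A$ and $\mathrm{conv}^B$ of one and the same abstract convolution use correspondingly transformed kernels, and are related by conjugation with the gauge transformation.

```python
import numpy as np
from scipy.signal import convolve2d

def g_BA(x):                       # gauge transformation between charts A and B,
    return np.rot90(x)             # here a 90 degree rotation of the pixel grid

def g_AB(x):                       # the inverse gauge transformation
    return np.rot90(x, k=-1)

image_A = np.random.rand(32, 32)   # the signal expressed in chart A
kernel_A = np.random.rand(3, 3)    # the kernel expressed in chart A
kernel_B = g_BA(kernel_A)          # the same kernel expressed in chart B

conv_A = lambda f: convolve2d(f, kernel_A, mode="same")   # conv expressed in chart A
conv_B = lambda f: convolve2d(f, kernel_B, mode="same")   # conv expressed in chart B

# Covariance: the two numerical expressions of the abstract convolution are
# related by conjugation with the gauge transformation,
#     conv^B  =  g^BA ∘ conv^A ∘ (g^BA)^{-1}
image_B = g_BA(image_A)
assert np.allclose(conv_B(image_B), g_BA(conv_A(g_AB(image_B))))
```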
Example : Covariant operations on sets of features
Analogous considerations apply to the other examples above, for instance to layers that map between unordered sets of feature vectors.
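A corresponding sketch for sets (hypothetical layer and features, NumPy only): given any numerical operation expressed in one ordering, covariance merely defines its expression in every other ordering by conjugation with the permutation, without constraining the operation itself.

```python
import numpy as np

# A hypothetical, not permutation equivariant layer, expressed in ordering A:
# it treats the first set element specially.
W = np.array([[1., 2.], [0., 1.]])
def op_A(X):
    return X @ W + X[0]

perm = np.array([2, 0, 1])                      # gauge transformation between orderings
g_BA = lambda X: X[perm]
g_AB = lambda X: X[np.argsort(perm)]            # inverse permutation

# Covariance defines the expression of the same abstract map in ordering B.
op_B = lambda X: g_BA(op_A(g_AB(X)))

X_A = np.array([[1., 0.], [0., 2.], [3., 1.]])  # the set expressed in ordering A
X_B = g_BA(X_A)                                 # the same set expressed in ordering B

# The two expressions agree up to the gauge transformation, i.e. they describe
# the same abstract operation, even though op_A was in no way constrained.
assert np.allclose(op_B(X_B), g_BA(op_A(X_A)))
```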
Equivariant network operations
In the above analysis we assumed some abstract operation to be given (the middle horizontal arrow), and studied its numerical representations in various gauges (the top and bottom horizontal arrows). In practice, the situation differs in that we are instead given a concrete numerical operation, e.g. a kernel or weight matrix optimized on previous samples, and need to apply it to the next sample. The fundamental issue here is that the gauges that we could pick to represent this next sample are inherently arbitrary, and that
applying a given (fixed) numerical operation in different gauges yields in general different results!
The usual naive “solution” is to ignore the arbitrariness, and simply go ahead with the specific gauge in which the sample happens to be stored in the dataset. A clean solution should instead respect the structural equivalence of gauges – as we will see, this leads to gauge equivariant operations.
(Counter)Example : Applying a kernel in different charts
Let’s get back to the concrete example of convolutions to make this issue more intuitive. Assume that we are given a learned numerical kernel. We apply it to an image, which the photographer arbitrarily decided to shoot in some chart $A$. This concrete numerical operation, given by the top horizontal arrow in the diagram below, implies the corresponding abstract convolution operation given by the bottom horizontal arrow.
However, the photographer could equally well have chosen any other camera alignment aka chart $B$. Applying the kernel in this gauge instead implies a different abstract convolution operation with a correspondingly transformed kernel.
The issue is that applying the kernel in any particular chart breaks the structural equivalence of gauges. To prevent this, we need to find a way to apply the kernel in such a way that all gauges are treated equivalently.
How can the equivalence of gauges be preserved? Denoting the given numerical operation that we would like to apply by $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$, we either assigned $$ \mathrm{op}^A\ :=\ \widetilde{\textup{op}\vphantom{\rule{0pt}{5.2pt}}} \mkern40mu\textit{or}\mkern40mu \mathrm{op}^B\ :=\ \widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}\mkern2mu, $$ thus arbitrarily preferring one of the gauges $A$ or $B$. Such symmetry breaking choices can be avoided by assigning $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$ to all gauges simultaneously, implying in particular $$\mathrm{op}^A\ =\ \mathrm{op}^B\ :=\ \widetilde{\textup{op}\vphantom{\rule{0pt}{5.2pt}}}$$ for any two gauges $A$ and $B$. One may think about this approach as a form of parameter sharing across gauges, or as the invariance of the operation’s numerical representations under gauge transformations.
An immediate consequence of this parameter sharing across gauges is that $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$ needs to be equivariant under gauge transformations. To derive this result, recall that the numerical expressions of an operation in two gauges are related by: $$\mkern{93mu} \mathrm{op}^B\ =\ g^{BA}\,\circ\, \mathrm{op}^A \,\circ\, \big(g^{BA}\big)^{-1} \mkern57mu(\textit{co}\textup{variance}) $$ Substituting $\mathrm{op}^A$ and $\mathrm{op}^B$ with the shared operation $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$ implies a gauge equivariance constraint : $$\begin{align} \widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}} \ &=\ g^{BA}\,\circ\, \widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}} \,\circ\, \big(g^{BA}\big)^{-1} \\[6pt] \iff\mkern40mu \widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}} \,\circ\, g^{BA} \ &=\ g^{BA}\,\circ\, \widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}} \mkern{150mu}(\textit{equi}\textup{variance}) \end{align}\mkern{20mu} $$ The relation between covariant and equivariant operations is concisely summarized by their commutative diagrams:
Intuitively, one may apply a gauge equivariant operation $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$ in any choice of gauge $A$ without breaking the structural equivalence of gauges: When $\widetilde{\mathrm{op}\vphantom{\rule{0pt}{5.2pt}}}$ is applied in any other gauge $B$, the resulting features are guaranteed to transform by $g^{BA}$, i.e. they are covariant (gauge independent).
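A quick numerical check of this constraint (hypothetical layers, NumPy only): an operation that treats all set elements alike commutes with reorderings and may safely be shared across gauges, while an operation that singles out the first element violates the equivariance constraint.

```python
import numpy as np

W = np.array([[1., 2.], [0., 1.]])

def shared_op(X):                 # hypothetical Deep Sets style layer: acts on each
    return np.maximum(X @ W, 0) + X.mean(axis=0)   # element alike, plus a pooled term

def broken_op(X):                 # hypothetical layer that singles out element 0
    return X + X[0]

perm = np.array([2, 0, 1])
g = lambda X: X[perm]             # a gauge transformation (reordering of the set)

X = np.array([[1., 0.], [0., 2.], [3., 1.]])
assert np.allclose(shared_op(g(X)), g(shared_op(X)))        # equivariant: safe to share
assert not np.allclose(broken_op(g(X)), g(broken_op(X)))    # gauge dependent: not safe
```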
Example : Transition map equivariant convolutions
Let the gauges again be charts that assign coordinates to Euclidean space and let the chart transition maps be affine transformations $\mathrm{Aff}(G)$. Any numerical operation whose responses should be equivalent (covariant) when applied in different charts needs to be $\mathrm{Aff}(G)$-equivariant. This brings us back to the steerable CNNs from the previous post. For instance, any coordinate chart independent linear map is necessarily a convolution with a $G$-steerable kernel.
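As a minimal sketch of this constraint (NumPy/SciPy, restricted to scalar features and the subgroup of 90 degree rotations): a kernel that is invariant under these rotations yields convolution responses that transform covariantly when the same kernel is applied in rotated charts.

```python
import numpy as np
from scipy.signal import convolve2d

# A kernel that is invariant under 90 degree rotations of the grid.
kernel = np.array([[0., 1., 0.],
                   [1., 4., 1.],
                   [0., 1., 0.]])
assert np.allclose(np.rot90(kernel), kernel)     # the steerability constraint for scalars

conv = lambda f: convolve2d(f, kernel, mode="same")
g = lambda f: np.rot90(f)                        # chart transition (gauge transformation)

image = np.random.rand(32, 32)
# Applying the same kernel in either chart yields responses that are related by
# the gauge transformation: the operation is transition map equivariant.
assert np.allclose(conv(g(image)), g(conv(image)))
```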
An advantage of the gauge theoretic viewpoint is that it clarifies how equivariant networks imply a consistent coordinate free operation. More importantly, only this viewpoint extends to convolutions on more general manifolds and to local gauge transformations (gauge fields).
Other examples of (gauge) equivariant operations, which may be applied in arbitrary gauges, are Deep Sets (Zaheer et al., 2017) or the models of (Maron et al. 2019a), (2019b) for full permutation equivariance, or the graph neural networks by (de Haan et al. 2020) and (Thiede et al. 2021) for graph automorphisms. They correspond to similar commutative diagrams as that in the previous example.
On equivariance and relativity
The considerations in this post show remarkable analogies to Einstein’s theory of relativity. A physical theory should be covariant in the sense that any physical quantity and the system dynamics should be expressible in arbitrary coordinates. We apply the same principle of covariance to more general data and functions.
It is often emphasized that the principle of covariance is merely a statement about the mathematical formulation of physics (its coordinate independence), but does not impact the laws of nature themselves. Similarly, covariant network layers are consistently expressed in different gauges, but are themselves not constrained.
The special principle of relativity requires that the laws of nature take the same form in all inertial frames of reference, i.e., that they are invariant under Lorentz (or Poincaré) transformations. A simple example is the metric $\eta$ of Minkowski space, which takes the same form $\eta^A = \eta^B = \widetilde{\eta} := \mathrm{diag}(1,-1,-1,-1)$ in any two inertial frames $A$ and $B$. Our analogs are gauge equivariant network operations, which take the same invariant expression $\mathrm{op}^A = \mathrm{op}^B = \widetilde{\textup{op}\vphantom{\rule{0pt}{5.2pt}}}$ in all gauges. The invariant laws of nature correspond similarly to an equivariant system dynamics in the sense that Lorentz transformed initial conditions result in accordingly transformed evolved states.
In physics, the invariance of the laws of nature implies that an experimenter can only perform relative measurements. $G$-equivariant network layers in deep learning are similarly only sensitive to the $G$-relative pose of feature vectors. For instance, one of our theorems requires that convolutional network features depend only on the relative distance between features but not on their absolute position.
The main takeaways of this post are:
- Numerical representations of data are not necessarily unique. A specific choice among the equivalent numerical representations is called a gauge. Formally, gauges are choices of non-canonical isomorphisms $\psi^X: D \to R$ from the data $D$ to a numerical reference object $R$.
- Gauge transformations $g^{BA} := \psi^B \mkern-2mu\circ\mkern-2mu (\psi^A)^{-1}:\ R\to R$ allow one to transform back and forth between different numerical representations $A$ and $B$.
- How ambiguous gauges are depends on the mathematical structure of the data. The level of ambiguity is captured by the covariance group $\boldsymbol{G}$, consisting of all gauge transformations. This group may be larger, but cannot canonically be reduced beyond the symmetries of the data.
- Gauges fix not only the numerical representation of data (objects), but also that of neural network layers (morphisms). The consistent transformation behavior of a theory is referred to as covariance. Covariance does not impose a constraint on the network layers, but merely expresses their gauge independence.
- Given a numerical operation, it is unclear in which of the equivalent gauges it should be applied. If the output of the operation should be independent from the specific choice of gauge, i.e. transform covariantly, the operation needs to be equivariant under gauge transformations. This result explains common equivariant networks from a novel viewpoint.
- Equivariant network layers can be viewed as invariants under gauge transformations. They take the same role as invariant laws of nature in physics. Equivariant deep learning can therefore be viewed as a theory of relativity for neural networks. The formulation in terms of gauges makes these analogies explicit.
Further reading :
Despite the large body of literature on equivariant neural networks, there are few publications discussing the relation to the principle of covariance. A notable exception is "Towards fully covariant machine learning" (Villar et al., 2023), which establishes additional links to dimensional analysis and causal inference. However, they do not make the same distinction between covariance and equivariance as I do, but define covariance as equivariance w.r.t. passive (gauge) transformations.
Kristiadi et al. (2023) investigate "The Geometry of Neural Nets' Parameter Spaces Under Reparametrization". While my blog post focused on the covariance of neural networks under gauge transformations of feature spaces, they investigate their covariance under coordinate transformations of parameter spaces. In both cases, covariance ensures that the theory remains independent from any choices of gauges or coordinates.
Outlook :
This post focused on "global" gauges, which identify data samples as a whole with a numerical reference object. Gauge fields, on the other hand, consist of an independent gauge for each point of space.
An example is given by local trivializations of the tangent bundle $TM$ of a manifold $M$. Instead of assigning global coordinates to the space as a whole, such trivializations assign local reference frames to the individual tangent spaces $T_pM$, thus identifying them with vector spaces $\mathbb{R}^d \cong T_pM$. Gauge transformations become fields of group elements, each of which transforms its corresponding frame.
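A small sketch of this local viewpoint (hypothetical tangent vector coefficients, NumPy only): each point carries its own reference frame, so the gauge transformation between two frame fields is a field of rotation matrices, acting on each tangent space independently.

```python
import numpy as np

def rotation(t):
    return np.array([[np.cos(t), -np.sin(t)],
                     [np.sin(t),  np.cos(t)]])

num_points = 5
vectors_A = np.random.rand(num_points, 2)   # tangent vector coefficients in frame field A

# Gauge transformation field: an independent rotation at every point of the manifold.
angles = np.random.rand(num_points) * 2 * np.pi
vectors_B = np.stack([rotation(t) @ v for t, v in zip(angles, vectors_A)])

# vectors_B holds the same tangent vectors, now expressed relative to the frames
# of field B; each coefficient vector was transformed by its own local gauge
# transformation rather than by one global group element.
```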
Such a local description becomes necessary when describing CNNs on topologically non-trivial manifolds like the sphere. The implications are similar to those in this post, however, now applying locally:
- Individual feature vectors need to be covariant, which requires that they are associated with a group representation $\rho$ of the covariance group $G$ (in this context called "structure group").
- Network layers need to be covariant as well, here in the sense that the local neural connectivity, e.g. a kernel or bias, needs to be expressible relative to different local reference frames.
- The application of a given local numerical operation is only independent from the chosen gauge if it is locally gauge equivariant. This requires e.g. $G$-steerable kernels.
The next blog post investigates such coordinate independent CNNs on manifolds in greater detail.
Another example is given by local graph neighborhood gauges, which assign an independent subgraph labeling to each node neighborhood. Demanding the gauge independence of a local message passing operation requires it to be equivariant under the automorphisms of these graph neighborhoods. Using a slightly different formulation, such "natural graph networks" are described in (de Haan et al., 2020).
Image references
- Dragonfly adapted from macrovector on Freepik.
- Ruler adapted from macrovector on Freepik.
- Strawberry, roller-skate and teddy graphics adapted under the Apache license 2.0 by courtesy of Google.
- Graph automorphism figure adapted from Thiede et al. (2021).
- Lizards adapted under the Creative Commons Attribution 4.0 International license by courtesy of Twitter.
- Camera adapted from Freepik.
- Histology image from Maria Tretiakova on PathologyOutlines.com.
- Forest photo from Stefan Haderlein on Pexels.