Note on the SimSiam objective

Notation

We use notation similar to the SimSiam paper. For a single input image \(x\), the model generates two randomly augmented views \(x_1 = \mathcal{T}_1(x)\) and \(x_2 = \mathcal{T}_2(x)\). These views are then fed into an encoder \(\mathcal{F}_{\phi}\) parameterized by \(\phi\) \[\begin{align*} z_i &= \mathcal{F}_{\phi}(\mathcal{T}_i(x)),\ i = 1,2 \end{align*}\] Predictions are produced by a separate predictor network \(h\) parameterized by \(\theta\) \[\begin{align*} p_i = h_\theta(z_i),\ i=1,2 \end{align*}\] Finally, the loss is computed as \[\begin{align} \mathcal{L}_{\text{SimSiam}}(z_1, z_2) &= \frac12 \mathcal{D}(p_1, \text{SG}(z_2)) + \frac12 \mathcal{D}(p_2, \text{SG}(z_1)) \label{eq:loss} \end{align}\] where \(\text{SG}\) is the stop-gradient operator and \(\mathcal{D}\) is some dissimilarity measure (e.g., negative cosine similarity or squared L2 distance).
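As a concrete reference, here is a minimal NumPy sketch of the symmetric loss above, taking \(\mathcal{D}\) to be the negative cosine similarity used in the SimSiam paper. The toy vectors are illustrative; in an autodiff framework \(\text{SG}\) would be a detach on \(z_1, z_2\), which does not change the loss value itself, only the gradient flow.

```python
import numpy as np

def neg_cosine(p, z):
    """D(p, z): negative cosine similarity between prediction p and target z."""
    p = p / np.linalg.norm(p)
    z = z / np.linalg.norm(z)
    return -float(np.dot(p, z))

def simsiam_loss(p1, p2, z1, z2):
    """Symmetric SimSiam loss: 1/2 D(p1, SG(z2)) + 1/2 D(p2, SG(z1)).
    SG is a no-op for the forward value; it only blocks gradients to z1, z2."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)

# Toy encodings and predictions (stand-ins for F_phi and h_theta outputs).
z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
p1, p2 = np.array([0.0, 2.0]), np.array([3.0, 0.0])
print(simsiam_loss(p1, p2, z1, z2))  # -1.0: each prediction aligns with the other view
```

Since cosine similarity ignores magnitude, \(p_1\) aligning with \(z_2\) and \(p_2\) with \(z_1\) already attains the minimum of \(-1\).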

For the subsequent derivations we assume the input image \(x\) is fixed, while the transformations \(\mathcal{T}_1, \mathcal{T}_2\) are sampled randomly and independently. In this context, the view encodings \(z_1\) and \(z_2\) are also random variables.

SimSiam and Mutual Information

We are going to show that minimizing \(\mathcal{L}_{\text{SimSiam}}\) is equivalent to maximizing a lower bound on the mutual information between the two view encodings \(z_1, z_2\) of the same image \(x\). In other words, \[\begin{align} \mathrm{E}_{\mathcal{T}_1, \mathcal{T}_2}[\mathcal{L}_{\text{SimSiam}}] &\ge \text{constant} - \mathcal{I}(z_1, z_2) \label{eq:simsiam_is_mi} \end{align}\] where the bound becomes tighter as the predictor \(h_\theta\) approaches optimality. This makes the training objective very similar to Contrastive Predictive Coding, also known as InfoNCE (or SimCLR in the context of images).

First, let \(Q(\cdot\,; \mu)\) be a probability distribution over view encodings parameterized by some \(\mu\). Then, \[\begin{align} \mathcal{D}(h_\theta(z_1), z_2) &= -\log Q(z_2 ; \mu = h_\theta(z_1)) \label{eq:metric_is_nll} \end{align}\] for an appropriate choice of \(Q\). For example, when \(\mathcal{D}\) is the squared L2 distance scaled by \(\frac12\), \(Q\) is a multivariate Gaussian distribution with identity covariance and mean \(\mu\), up to an additive constant that does not affect optimization.
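A quick numerical check of this correspondence, assuming the \(\frac12\)-scaled squared L2 distance: the negative log-likelihood of \(\mathcal{N}(\mu, I)\) differs from the distance only by a constant that depends on the dimension, not on \(z\) or \(\mu\).

```python
import numpy as np

def gaussian_nll(z, mu):
    """-log Q(z; mu) for Q = N(mu, I) in d dimensions."""
    d = len(z)
    return 0.5 * np.sum((z - mu) ** 2) + 0.5 * d * np.log(2 * np.pi)

rng = np.random.default_rng(0)
z, mu = rng.normal(size=4), rng.normal(size=4)

# The NLL and the (1/2-scaled) squared L2 distance differ by a constant:
diff = gaussian_nll(z, mu) - 0.5 * np.sum((z - mu) ** 2)
print(diff)  # 0.5 * 4 * log(2*pi), independent of z and mu
```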

Our first observation is that \(Q(\cdot\,; \mu = h_\theta(z_1))\) is trained1 to approximate \(P(z_2 \mid z_1)\), since the objective function is the cross-entropy between these two distributions. More formally, consider the expected loss conditioned on a known \(z_1\) \[\begin{align} \mathrm{E}_{\mathcal{T}_2} \left[\mathcal{D}(h_\theta(z_1), z_2) \mid z_1 \right] &= \mathrm{E}_{\mathcal{T}_2} \left[-\log Q(z_2 ; \mu = h_\theta(z_1)) \mid z_1 \right]\nonumber\\ &= \mathrm{E}_{\mathcal{T}_2} \left[-\log Q(z_2 ; \mu = h_\theta(z_1)) + \log P(z_2 \mid z_1) - \log P(z_2 \mid z_1)\mid z_1 \right]\nonumber\\ &= \mathrm{E}_{\mathcal{T}_2} \left[-\log P(z_2 \mid z_1) \mid z_1 \right] + D_{\mathrm{KL}}\left( P(z_2 \mid z_1) \,\|\, Q(z_2 ; \mu = h_\theta(z_1)) \right)\nonumber\\ &\ge \mathrm{E}_{\mathcal{T}_2} \left[-\log P(z_2 \mid z_1) \mid z_1 \right] \end{align}\] where the inequality becomes tighter as \(Q(\cdot\,; \mu = h_\theta(z_1))\) approximates \(P(z_2 \mid z_1)\) better, i.e., as the parameters \(\theta\) approach the optimum. This matches the empirical evidence from SimSiam and follow-up papers that the model benefits from keeping the predictor \(h_\theta\) near-optimal, for example by making several gradient updates or using a higher learning rate just for \(\theta\).
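The decomposition above is the standard cross-entropy identity \(\mathrm{CE}(P, Q) = \mathcal{H}(P) + D_{\mathrm{KL}}(P \,\|\, Q)\). A small discrete sanity check, with toy distributions standing in for \(P(z_2 \mid z_1)\) and \(Q(\cdot\,; \mu = h_\theta(z_1))\):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

def cross_entropy(p, q):
    return -np.sum(p * np.log(q))

def kl(p, q):
    return np.sum(p * np.log(p / q))

P = np.array([0.6, 0.3, 0.1])    # plays the role of P(z2 | z1)
Q = np.array([0.5, 0.25, 0.25])  # plays the role of Q(.; mu = h_theta(z1))

# E[-log Q] = E[-log P] + KL(P || Q) >= E[-log P], with equality iff Q = P.
print(cross_entropy(P, Q), entropy(P) + kl(P, Q))
```

Minimizing the cross-entropy over \(Q\) therefore drives the KL term to zero, which is exactly why the bound tightens as the predictor becomes optimal.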

Taking the expectation with respect to \(\mathcal{T}_1\) as well, we obtain \[\begin{align} \mathrm{E}_{\mathcal{T}_1,\mathcal{T}_2}[\mathcal{D}(p_1, z_2)] &\ge \mathrm{E}_{\mathcal{T}_1, \mathcal{T}_2}\left[-\log P(z_2 \mid z_1)\right]\nonumber\\ &= \mathrm{E}_{\mathcal{T}_1, \mathcal{T}_2}\left[-\log \frac{P(z_1, z_2)}{P(z_1)}\right]\nonumber\\ &= \mathrm{E}_{\mathcal{T}_1,\mathcal{T}_2}\left[-\log \frac{P(z_1, z_2)}{P(z_1)P(z_2)} - \log P(z_2)\right]\nonumber\\ &= \mathcal{H}(z_2) - \mathcal{I}(z_1, z_2) \end{align}\]
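The chain above is the identity \(\mathrm{E}[-\log P(z_2 \mid z_1)] = \mathcal{H}(z_2 \mid z_1) = \mathcal{H}(z_2) - \mathcal{I}(z_1, z_2)\). It can be verified numerically on a toy discrete joint distribution standing in for \(P(z_1, z_2)\):

```python
import numpy as np

# Toy joint distribution P(z1, z2) over two discrete "encodings".
P = np.array([[0.3, 0.1],
              [0.1, 0.5]])
P1, P2 = P.sum(axis=1), P.sum(axis=0)  # marginals P(z1), P(z2)

H2 = -np.sum(P2 * np.log(P2))                      # H(z2)
H2_given_1 = -np.sum(P * np.log(P / P1[:, None]))  # E[-log P(z2 | z1)]
I = np.sum(P * np.log(P / np.outer(P1, P2)))       # I(z1, z2)

print(H2_given_1, H2 - I)  # the two quantities coincide
```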

Finally, we can substitute this bound back into \(\mathcal{L}_{\text{SimSiam}}\) and add the SG operations \[\begin{align} \mathrm{E}_{\mathcal{T}_1,\mathcal{T}_2}\left[\frac12\mathcal{D}(p_1, \text{SG}(z_2)) + \frac12\mathcal{D}(p_2, \text{SG}(z_1))\right] &\ge \frac12\mathcal{H}(\text{SG}(z_1)) + \frac12\mathcal{H}(\text{SG}(z_2))\nonumber\\ &- \frac12\left(\mathcal{I}(z_1, \text{SG}(z_2)) + \mathcal{I}(\text{SG}(z_1), z_2)\right)\nonumber\\ &= \frac12 \mathcal{H}(\text{SG}(z_1)) + \frac12 \mathcal{H}(\text{SG}(z_2)) - \mathcal{I}(z_1, z_2) \end{align}\] where \(\mathcal{H}(\text{SG}(z_1))\) and \(\mathcal{H}(\text{SG}(z_2))\) can be treated as constants because of the \(\text{SG}\) operations, which gives exactly the bound in \eqref{eq:simsiam_is_mi}.


  1. meaning that the predictor \(h_\theta\) is being optimized↩︎