We investigate late-layer circuits on CLIP's vision side to understand how they glue abstract concepts together and build more general multimodal neurons.

Context

We build on the work done in Multimodal Neurons in Artificial Neural Networks, diving deeper into the features in the hidden layers of the CLIP model. In the spirit of the circuits agenda, we looked at individual units and their connections to attempt to unravel a small part of the algorithm CLIP uses to identify images.

For context, the CLIP model we studied (CLIP-RN50, a smaller model than the RN50-4x version studied by Goh et al. 2021, as weights for RN50-4x were not publicly available at the time of research) has two parts: a ResNet vision model and a Transformer language model. We investigated the ResNet side, which has as its final layer a relu with 2048 channels that combines, via elementwise addition, the activations from a conv2d layer and the previous relu layer (the residual connection).
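To make this structure concrete, here is a minimal sketch of the residual addition (illustrative names in torch style, not the authors' code):

```python
import torch.nn.functional as F

def final_bottleneck_output(x, conv_branch):
    # conv_branch: the bottleneck's conv stack, ending in a 1x1 conv2d
    # that produces 2048 channels.
    conv_out = conv_branch(x)           # the "conv side"
    shortcut = x                        # output of the previous relu
    return F.relu(conv_out + shortcut)  # elementwise residual addition
```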

We confirm significant similarities between the final layer and that of RN50-4x, including features for letters and crystallized concepts, such as "Trump," "Buddha," and "hat." Our primary goal was to find enrichment circuits, a term that deserves more rigorous definition, but that we shall tentatively describe as a circuit involving a unit in the final layer that combines one facet or modality from the conv side with another from the residual connection. To do so, we investigated both case studies of circuits in the final layer and macroscopic properties of the network.

Terminology

The terms "neuron," "unit," and "channel" are often used ambiguously in writing about circuits. For clarity, we explain below the terminology we use in this article. Our terminology comes from Feature Visualization.

Most references to a given feature in this article are to a unit as a whole. To refer to a unit, we use the notation layer/bottleneck/index/channel; for example, 4/2/6/1 corresponds to unit 1 of layer 4/2/Add_6 on OpenAI Microscope. Units are highlighted in purple; you can click the unit number to open it in Microscope or hover over the image to see a feature visualization preview.

Case Studies

Below, we present our findings from backtracing the contributions to several units in the final relu layer. The amount one unit contributes to another is roughly the magnitude of the weights connecting the two units. We present contributions as both a magnitude and a mean: the magnitude shows the size of the contribution, and the mean roughly tells us whether the connection is inhibitory or excitatory.
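As a rough sketch of how these two statistics might be computed, given a weight matrix W connecting a lower unit to an upper unit (W stands in for the expanded-weights matrix described under Methodology):

```python
import torch

def contribution_stats(W: torch.Tensor):
    # Magnitude: overall size of the contribution.
    magnitude = W.norm().item()
    # Mean: sign suggests excitatory (+) vs. inhibitory (-) influence.
    mean = W.mean().item()
    return magnitude, mean
```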

Note: some neurons studied activate on explicit content. Explicit text and images will be blurred until manually revealed.

You"/"Bloody Wound" Unit

FIGURE A simplified diagram of the residual connections between 4/2/6/34, 4/2/5/34, and 4/1/6/34. The unit to the right represents the final layer of a convolutional bottleneck, while the unit on the bottom represents the output of the previous relu.

In each figure, you can click any unit to open it in OpenAI Microscope.

We view this unit as a clear "enrichment circuit" — it portrays sharply different concepts in its relu and conv contributions, with both contributing to the final unit. Looking at the feature visualization and dataset examples, the final unit appears to mix "you" text detection with bloody wound detection; the previous conv unit acts primarily as a bloody wound detector, while the previous relu appears to detect "you."

While the rationale for combining "you" + bloody wound into a single neuron is unknown, we find it interesting how decomposable the two concepts are by looking at earlier units, and we conjecture the same decomposability may apply to other units, e.g. "text detector for Trump" + "image detector for Trump" = "Trump unit."

To see the relative contributions of the relu unit versus the convolutional bottleneck unit to 4/2/6/34, we also developed a technique for measuring the percent relu contribution, based on the relative activations of the two units on the feature visualization of the final unit. While we have uncertainties about the reliability of this technique, in this case we find that the relu unit contributes 99.66% of the final activation, which matches the observation that all the dataset examples and the majority of the feature visualizations are of "you."
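A minimal sketch of this measurement, assuming PyTorch forward hooks on the two contributing layers (the layer handles and the feature-visualization image tensor are supplied by the caller; this is our reading of the technique, not verbatim code):

```python
import torch

def percent_relu_contribution(model, prev_relu, prev_conv, channel, feature_vis):
    """Compare the two units' activations on the final unit's feature
    visualization; returns the relu unit's share of the residual sum."""
    acts = {}
    def save(name):
        def hook(module, inputs, output):
            # Mean activation of this channel over spatial positions.
            acts[name] = output[0, channel].mean().item()
        return hook
    handles = [prev_relu.register_forward_hook(save("relu")),
               prev_conv.register_forward_hook(save("conv"))]
    with torch.no_grad():
        model(feature_vis)  # feature_vis: a 1x3x224x224 image tensor
    for h in handles:
        h.remove()
    return acts["relu"] / (acts["relu"] + acts["conv"])
```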

"Cap" Unit

FIGURE Diagram showing identity connections (grey) and convolutional contributions (green/red) discovered via expanded weights. Single images in squares represent the feature visualization of that particular unit, while quadruplets of images represent highly-activating dataset examples taken from Yahoo Flickr Creative Commons.

One of the first coherent units we found was 4/2/6/2, which seems to activate highly on pictures of hats, as well as on the words "cap," "famous," "savage," "pussy," and "caps."

4/2/6/2 receives contributions from 4/1/6/2 (prev relu) and 4/2/5/2 (prev conv). Interestingly, these two contributors represent radically different concepts: 4/1/6/2 activates on typical pictures of hats, while 4/2/5/2 activates on images of trucks and swears (along with hats to a lesser extent). This suggests that while 4/2/6/2 activates primarily on hats, it gains novel facets from 4/2/5/2, which explains the explicit words it activates on.

Diving deeper, 4/1/6/2 is itself the result of a residual addition between 4/1/5/2 and the relu of the previous bottleneck. 4/1/5/2 appears to be a hat detector with an emphasis on woolen texture. We then used expanded weights to find its most excitatory and inhibitory connections. We found that the most positive connections were for "dome" and "clothes rack/aisle," while the most negative connections were for "sombrero hat" and "boots." These results seem to imply that the "cap" neuron may learn some of its roundness from "dome" detection, and that it is strongly distinct from other types of hats like sombreros.

Moving to the previous conv unit, we can again use expanded weights to see that it has strong positive connections to "Skateboarder" and "Cruise ship," and strong negative connections to "?" (a unit that appears to have no single interpretable meaning) and "computer chip" (we call this the "computer chip" unit, but it also activates for wind turbines, oil, shipping containers, and a variety of other features, suggesting it is not distinct in meaning). This explains to some extent the examples of skateboarders and heavy machinery in 4/2/5/2, although the precise functionality of the contributing units (especially the inhibitory ones) is unknown.

Macroscopic Properties of CLIP

Not only did we look at specific circuits, but we also tried to understand a few higher-level properties of the network — what we call "feature versatility" and "bottleneck learning."

Forward Contributions

Just as we can see which units contribute to a given later unit and to what degree, we can see which later units a given unit contributes to and to what degree. When analyzing the network from the latter point of view, we use the term "forward contributions" (as opposed to "backward contributions" or just "contributions"). We found that in contrast to backward contributions, which exhibit an exponential distribution, the forward contributions we looked at exhibited a less spiky distribution, closer to normal. Additionally, we found that units from earlier and earlier layers were less and less likely to have forward contributions to the final conv layer (4/2/5) near zero, tending instead toward a bimodal distribution of forward contributions. The earlier units are more "versatile" in the sense that they contribute to more features to a greater degree.
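A sketch of how one such histogram might be produced, assuming a hypothetical expanded_weights(lower_unit, upper_layer) helper that yields one weight matrix per unit in the upper layer:

```python
import matplotlib.pyplot as plt

def plot_forward_contributions(lower_unit, upper_layer, expanded_weights):
    # One signed strength per upper unit: magnitude of the expanded-weights
    # matrix, signed by its mean (one plausible convention).
    strengths = [W.norm().item() * (1.0 if W.mean() > 0 else -1.0)
                 for W in expanded_weights(lower_unit, upper_layer)]
    plt.hist(strengths, bins=50)
    plt.xlabel("forward contribution strength")
    plt.ylabel("frequency")
    plt.show()
```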


FIGURE Histograms showing the frequency (y-axis) of various contribution strengths (x-axis) from a given lower unit to each unit in the later layer. Starting earlier in the network leads to fewer contributions near zero as well as a roughly balanced degree of positive and negative contributions. By varying the lower unit, we can see a gradient from unimodal to bimodal forward contributions.

This isn't surprising, since earlier parts of the network tend to form simpler features. What may be surprising, though, is that earlier features tend to contribute strongly both positively and negatively. From our limited investigations, it seems that features have roughly balanced positive and negative forward contributions on average.

A Peculiarity

Above, we examined forward contributions by fixing the upper layer and then seeing how contributions varied as the lower unit varied. But what if we instead vary the upper layer and fix the lower unit? Before doing this, we expected to see a gradual shift from unimodal to bimodal contribution distributions, just as in the above histograms. Instead, we found that the shift to bimodality is sudden.


FIGURE Histograms of forward contributions to various upper layers, where the upper layer varies from left to right and the lower unit is held constant. In all three cases (each case designated by a row in this grid), we see a sudden shift to bimodality at layer 4/2/5, the final conv layer in the network.

Additionally, we only found bimodal forward contributions toward layer 4/2/5. For example, we didn't see bimodality for forward contributions to the last conv layer of any other bottleneck, though more investigation is needed to completely verify this as we only looked at a small set of examples. If layer 4/2/5 is indeed the only layer that exhibits contribution bimodality, this might mean that the rest of the network is optimized for building "versatile" features, whereas the late parts of the network are tasked with actually putting together all of these features (hence having so few contributions near zero).

Residual Investigation and Bottleneck Learning

The skip connections in the ResNet made it difficult to understand whether contributions from bottlenecks were actually important to the final output of the network, or whether any concepts learned in bottlenecks throughout the network were "retained." For instance, we would often see a set of highly activating dataset examples for some channel in the 1×1 convolution layer preceding the ultimate relu layer, but these same dataset examples would not strongly activate the corresponding channel in the relu layer. Instead, the channel in the relu layer closely resembled the same channel in the previous relu (in the sense that there was significant overlap in their sets of highly activating dataset examples). This indicates that visual concepts were learned within the last bottleneck but were "discarded," or dominated by the added output from the previous relu.

0.98 × 4/1/6/295 + 0.02 × 4/2/5/295 = 4/2/6/295
FIGURE Unit 295 from layers 4/1/6 (a ReLU), 4/2/5 (a 1×1 conv), and 4/2/6 (the final ReLU). The units in 4/1/6 and 4/2/6 activate strongly on images of daisies. Unit 295 in 4/2/5 activates strongly on monarch butterflies, but these butterflies aren't seen in the top dataset examples in 4/2/6.

To determine which channels learned new information within a bottleneck, we scraped OpenAI Microscope for the feature visualizations of several bottlenecks' relu layers. The expectation was that these would be visual representations of the combination of concepts contributed by the residual function and shortcut. We then computed the activations of units in the last convolutional layer and previous relu layer on their corresponding feature visualization (the feature visualization for the relu channel both of these units contribute to). Next, we divided the activation of the shortcut by the sum of the output of the residual mapping and the shortcut; this gave us a rough sense of how much the skip connection was responsible for the bottleneck's final output.
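This is essentially the percent-relu-contribution measurement from the case studies applied at scale. A sketch of the per-channel loop, reusing the percent_relu_contribution sketch from above (feature_vis_for(c) stands in for the scraped Microscope visualization of channel c):

```python
# Fraction of the final relu's input attributable to the skip connection,
# per channel, evaluated on that channel's scraped feature visualization.
fractions = [
    percent_relu_contribution(model, prev_relu, last_conv, c, feature_vis_for(c))
    for c in range(2048)
]
# Share of channels where the shortcut dominates (a near-identity mapping).
near_identity = sum(f > 0.9 for f in fractions) / len(fractions)
```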

0.15 × 4/1/6/11 + 0.85 × 4/2/5/11 = 4/2/6/11
FIGURE Unit 11 from layers 4/1/6 (a ReLU), 4/2/5 (a 1×1 conv), and 4/2/6 (the final ReLU). The unit in 4/1/6 seems to detect NFL/football-related images, while the same unit in 4/2/5 activates highly on babies. Unit 11 in 4/2/6 appears to activate most strongly on babies, but also detects NFL/footballs.

It appeared that on most channels, the network learned a near-identity mapping over the ultimate bottleneck; around 86% of the channels, when evaluated on the respective feature visualization of the last ReLU, had over 90% of the input to the last ReLU coming from the shortcut. The penultimate bottleneck exhibited similar qualities. On channels that learned near-identity mappings, the feature visualizations and dataset examples from the two ReLUs (one at the end of the bottleneck and one at the end of the previous bottleneck) looked relatively similar, while the feature visualizations and dataset examples from the corresponding 1×1 convolutional layer would often represent entirely unrelated concepts. Likewise, on channels that appeared to learn within the bottleneck, the dataset examples from the two ReLUs told somewhat different stories, and the examples from the preceding convolutional layer were more prevalent in the ultimate ReLU.

We take this to mean that enrichment circuits are the exception, not the rule — by the last couple of bottlenecks, the network's representations have mostly solidified, with only a few abstract concepts contributing to more general neurons in the last layer. This interpretation should be qualified by our uncertainty over the accuracy of using feature visualizations to analyze contributions. While this method appeared to yield results that were consistent with dataset examples, we'd like to repeat this experiment in the future using e.g. the top 100 dataset examples of the bottleneck's relu and averaging the contributions of the convolutional layer and previous ReLU for each example.

Methodology

Discovering Units of Interest

In the initial stage of our research, we focused on developing tools to make sense of the 2,048 channels in CLIP-RN50's final layer. These tools, listed below, allowed us to find units of interest such as those that activated most highly for a given text.

Text Rasterization and the Channel Difference Engine

One initial goal was to identify the unit that activated most highly for a given word, e.g. "Trump." We rasterized a 224×224 image of the text "Trump" and used Captum's LayerActivation module to track activations for each layer. Then, we took the mean over the x, y spatial positions to get the average activation per channel in each layer. We then investigated the maximally-activating channels produced by this process; however, we found that this approach merely gave us units tuned for detecting text rather than any units of interest.

To resolve this, we calculated a baseline activation distribution by rasterizing the meaningless text "eeeeee" and subtracting its activations from those of our target word; this allowed us to find meaningful units in the difference between the two activations.
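A minimal sketch of the rasterization and difference step (font path, text size, and layer handle are illustrative; Captum's LayerActivation returns the raw activations of the instrumented layer):

```python
import torch
import torchvision.transforms as T
from PIL import Image, ImageDraw, ImageFont
from captum.attr import LayerActivation

def rasterize(text, size=224, font_path="arial.ttf"):
    # Draw the text centered on a white 224x224 canvas.
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 48)
    draw.text((size // 2, size // 2), text, fill="black", font=font, anchor="mm")
    return T.ToTensor()(img).unsqueeze(0)

def channel_differences(model, layer, word, baseline="eeeeee"):
    layer_act = LayerActivation(model, layer)
    word_acts = layer_act.attribute(rasterize(word))
    base_acts = layer_act.attribute(rasterize(baseline))
    # Mean over spatial positions gives one activation per channel;
    # subtracting the baseline removes the generic text-detection response.
    return (word_acts - base_acts).mean(dim=(2, 3)).squeeze(0)
```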

Activation Databases

After developing the channel difference engine, we used a list of the 10,000 most common English words to create mappings from "word → most highly activating unit" and "unit → most highly activating word." We found this useful for exploring concepts in the final layer through an interactive interface.
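Using the channel_differences sketch above, the two mappings might be built roughly as follows:

```python
import torch

def build_activation_databases(model, layer, words):
    # diffs[i, c]: baseline-subtracted activation of channel c on word i.
    diffs = torch.stack([channel_differences(model, layer, w) for w in words])
    word_to_unit = {w: int(diffs[i].argmax()) for i, w in enumerate(words)}
    unit_to_word = {c: words[int(diffs[:, c].argmax())]
                    for c in range(diffs.shape[1])}
    return word_to_unit, unit_to_word
```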

Credit to Chelsea Voss for the inspiration and design behind these tools.

Contributions

To understand feature contributions, we relied on expanded weights, which take in two units (from any two layers) and return a matrix representing the strength of the connection between them. As in Voss et al., this matrix is spatially meaningful, though for our purposes we did not make much use of this property — to find the top contributing units from one layer to some unit in a later layer, we simply looked at which units had expanded-weights matrices with the highest magnitudes. Our tool also allows for finding the top units ordered by mean, though we found these two measures of contribution strength to roughly correspond.
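A sketch of this ranking step (num_channels and the expanded_weights helper are again hypothetical):

```python
def top_contributors(expanded_weights, lower_layer, upper_unit, num_channels, k=5):
    scored = []
    for u in range(num_channels):
        W = expanded_weights(lower_layer, u, upper_unit)
        # Magnitude ranks the contribution; the mean gives its sign.
        scored.append((W.norm().item(), W.mean().item(), u))
    scored.sort(reverse=True)
    return scored[:k]
```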

Source Code

The libraries and notebooks used to discover the above results can be found on GitHub.

Acknowledgements

Many thanks to Chelsea Voss for mentoring us and providing valuable feedback on technical issues, methodology, and written drafts. Thanks to Chris Olah for initial guidance on circuit investigations, SERI for funding and application feedback, and Distill contributors for the publication format and valuable discussions in the #circuits channel.

Contributions

Sidney Hough: Implemented reading activations from the network and expanded weights. Came up with the idea to compare the residual and conv contributions, and implemented a scraper to download feature visualizations from Microscope. Performed investigation of circuits for the case studies. For the final writeup, created visualizations and styled all figures shown above.

Kevin Liu: Implemented input generation for typographic attacks, image optimization, and initial expanded weights algorithm. Developed channel difference engine. Worked jointly on developing a database of activations from text to unit and unit to text. Performed investigation of circuits for the case studies.

Jack Ryan: Developed contributions and versatility tools to visualize several properties of a given unit, as well as a tool to visualize the images corresponding to a certain level of activation of a unit (activation distribution tool). Came up with the idea to study large-scale forward contributions across layers and implemented an algorithm to do so using expanded weights. Performed investigation of circuits for the case studies.

† equal contributors