We investigate late circuits on CLIP's vision side to understand how they glue abstract concepts together and build more general multimodal neurons.
We build off of the work done in Multimodal Neurons in Artificial Neural Networks
For context, the CLIP model we studied (CLIP-RN50
relu layer with 2048 channels that combines via pairwise addition the activations from a
conv2d layer and the previous
relu layer (the residual connection).
We confirm significant similarities in the final layer to RN50-4x, including features for letters and crystallized concepts, such as "
conv side with another from the residual connection. To do so, we investigated both case studies of circuits in the final layer as well as macroscopic properties of the network.
The terms "neuron," "unit," and "channel" can often be used ambiguously in writing about circuits. For clarity, we explain the terminology we use in this article below. Our terminology comes from Feature Visualization
convlayers, to one filter) is detecting.
Most references to a given feature in this article are to a unit as a whole. To refer to a unit, we use the notation
layer/bottleneck/index/channel; for example,
Below, we present our findings from backtracing the contributions to several units in the final
relu layer. The amount that one unit contributes to another is roughly the magnitude of the weights connecting one unit to the other. We present contributions as both a magnitude and a mean, where the magnitude shows the size of the contribution, and the mean roughly tells us whether the weight is inhibitory or excitatory.
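As a rough illustration of the two summary numbers described above, the sketch below reduces the weights connecting one unit to another to a magnitude and a mean. The function name, the tensor layout, and the choice of an L2 norm for the magnitude are assumptions made for illustration; this is not the code used to produce the results in this article.

```python
import numpy as np

def contribution_stats(weights, out_unit, in_unit):
    """Summarize the weights connecting in_unit to out_unit.

    weights: array of shape (out_channels, in_channels, kh, kw),
    e.g. a conv2d kernel or an expanded-weight tensor (assumed layout).
    """
    kernel = weights[out_unit, in_unit]        # (kh, kw) slice for this pair
    magnitude = float(np.linalg.norm(kernel))  # assumption: L2 (Frobenius) norm
    mean = float(kernel.mean())                # sign hints at excitatory vs. inhibitory
    return magnitude, mean

# Toy weights: 2 output units, 3 input units, 1x1 kernels.
w = np.array([[[[0.5]], [[-2.0]], [[0.0]]],
              [[[1.0]], [[0.25]], [[-0.5]]]])

mag, mean = contribution_stats(w, out_unit=0, in_unit=1)
print(mag, mean)  # 2.0 -2.0
```

Here the large magnitude with a negative mean would be read as a strong inhibitory contribution.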
Note: some neurons studied activate on explicit content. Explicit text and images will be blurred until manually revealed.
We view this unit as a clear "enrichment circuit" — it portrays sharply different concepts in its
conv contributions, with both contributing to the final unit. Looking at the feature vis and dataset examples, the final unit appears to be a mix of "you" text detection and a bloody wound detector; looking at the previous
While the rationale for combining "you" + bloody wound into a single neuron is unknown, we find it interesting how decomposable the two concepts are by looking at earlier units, and we conjecture the same decomposability may apply to other units, e.g. "text detector for Trump" + "image detector for Trump" = "Trump unit."
To see the relative contributions of the
relu unit versus the convolutional bottleneck unit to 4/2/6/34, we also developed a technique for measuring the percent
relu contribution based on the relative activations of the two units on the feature visualization of the final unit. While we have uncertainties about the reliability of this technique, in this case we find that the
99.66% to the final activation, which matches the observation that all the dataset examples and the majority of the feature visualizations are of "you."
One of the first coherent units we found was 4/2/6/2, which seems to activate highly on pictures of hats, as well as the words "cap," "famous," "savage," "pussy," and "caps."
4/2/6/2 receives contributions from
Not only did we look at specific circuits, but we also tried to understand a few higher-level properties of the network — what we call "feature versatility" and "bottleneck learning."
Just as we can see which units contribute to a given later unit and to what degree, we can see which later units a given unit contributes to and to what degree. When analyzing the network from the latter point of view, we will use the terminology "forward contributions" (as opposed to "backward contributions" or just "contributions"). We found that in contrast to backward contributions, which exhibit an exponential distribution, the forward contributions we looked at seemed to exhibit a less spiky distribution, closer to normal. Additionally, we found that units from earlier and earlier layers were less and less likely to have forward contributions to the final conv layer (4/2/5) near zero, tending toward a bimodal distribution of forward contributions. The earlier units are more "versatile" in the sense that they contribute to more features to a greater degree.
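One way to picture the two viewpoints is as rows and columns of a single contribution matrix. The sketch below assumes a hypothetical matrix `contrib[out_unit, in_unit]` of signed contributions (filled with synthetic random values here); the helper names and the near-zero tolerance are likewise illustrative assumptions.

```python
import numpy as np

def backward_contributions(contrib, out_unit):
    """Contributions flowing into one later unit (a row of the matrix)."""
    return contrib[out_unit, :]

def forward_contributions(contrib, in_unit):
    """Contributions flowing out of one earlier unit (a column of the matrix)."""
    return contrib[:, in_unit]

def fraction_near_zero(values, tol):
    """Share of contributions whose absolute value is below tol."""
    return float(np.mean(np.abs(values) < tol))

# Synthetic stand-in for real contributions: 64 later units, 32 earlier units.
rng = np.random.default_rng(0)
contrib = rng.normal(size=(64, 32))

fwd = forward_contributions(contrib, in_unit=7)
counts, edges = np.histogram(fwd, bins=10)  # data for a histogram of forward contributions
print(fraction_near_zero(fwd, tol=0.1))
```

A low near-zero fraction together with mass on both sides of zero is what a bimodal distribution of forward contributions would look like in this representation.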
This isn't surprising, since earlier parts of the network tend to form simpler features. What may be surprising, though, is that earlier features tend to contribute strongly both positively and negatively. From our limited investigations, it seems that features have roughly balanced positive and negative forward contributions on average.
Above, we examined forward contributions by fixing the upper layer and then seeing how contributions varied as the lower unit varied. But what if we instead vary the upper layer and fix the lower unit? Before doing this, we expected to see a gradual shift from unimodal to bimodal contribution distributions, just as in the above histograms. Instead, we found that the shift to bimodality is sudden.
Additionally, we only found bimodal forward contributions toward layer 4/2/5. For example, we didn't see bimodality for forward contributions to the last conv layer of any other bottleneck, though more investigation is needed to completely verify this as we only looked at a small set of examples. If layer 4/2/5 is indeed the only layer that exhibits contribution bimodality, this might mean that the rest of the network is optimized for building "versatile" features, whereas the late parts of the network are tasked with actually putting together all of these features (hence having so few contributions near zero).
The skip connections in the ResNet made it difficult to understand if contributions from bottlenecks were actually important
to the final output of the network, or if any concepts learned in bottlenecks throughout the network were "retained." For instance,
we would often see a set of highly activating dataset examples for some channel in the
1×1 convolution layer preceding the ultimate
relu layer, but these same dataset examples would not strongly activate the corresponding channel in the
relu layer. Instead,
the channel in the
relu layer closely resembled the same channel in the previous
relu (in the sense that there was significant overlap
in their sets of highly activating dataset examples). This indicates that visual concepts were learned within the last bottleneck but were "discarded" or
dominated by the added output from the previous ReLU.
To determine which channels learned new information within a bottleneck, we scraped OpenAI Microscope
for the feature visualizations of several bottlenecks'
relu layers. The expectation was that these would be visual representations of the combination of
concepts contributed by the residual function and shortcut. We then computed the activations of units in the last convolutional layer and previous
relu layer on their corresponding feature visualization (the feature visualization for the
relu channel both of these units contribute to).
Next, we divided the activation of the shortcut by the sum of the output of the residual mapping and the shortcut;
this gave us a rough sense of how much the skip connection was responsible for the bottleneck's final output.
It appeared that on most channels, the network learned a near-identity mapping over the ultimate bottleneck; around 86% of the channels, when activated on the respective feature visualization of the last ReLU, had over 90% of the input to the last ReLU coming from the shortcut. The penultimate bottleneck exhibited similar qualities. On channels that learned near-identity mappings, we would see that the feature visualizations and dataset examples from the two ReLUs (one at the end of the bottleneck and one at the end of the previous bottleneck) looked relatively similar, while the feature visualizations and dataset examples from the corresponding 1×1 convolutional layer would often resemble very unrelated concepts. Likewise, on channels that appeared to learn within the bottleneck, we would see the dataset examples from the two ReLUs tell somewhat different stories, and the examples from the preceding convolutional layer appear more prominently in the ultimate ReLU.
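The ratio described above, and the count of channels that exceed a threshold, can be sketched as follows. The function names and toy activations are hypothetical, and the sketch makes one simplifying assumption worth flagging: it treats both activations as non-negative, whereas a real pre-ReLU convolution output can be negative and push the ratio outside the 0 to 1 range.

```python
import numpy as np

def shortcut_fraction(shortcut_act, residual_act):
    """Per-channel share of the summed input to the ReLU that comes from the shortcut.

    Channels whose sum is exactly zero have no defined share and yield NaN.
    """
    shortcut_act = np.asarray(shortcut_act, dtype=float)
    residual_act = np.asarray(residual_act, dtype=float)
    total = shortcut_act + residual_act
    safe_total = np.where(total == 0, np.nan, total)
    return shortcut_act / safe_total

def share_of_channels_above(fractions, threshold=0.9):
    """Fraction of channels whose shortcut share exceeds threshold."""
    return float(np.mean(np.asarray(fractions) > threshold))

# Toy activations for four channels (illustrative values only).
shortcut = [9.5, 1.0, 4.0, 2.0]
residual = [0.5, 3.0, 0.0, 2.0]

frac = shortcut_fraction(shortcut, residual)
print(share_of_channels_above(frac))  # 0.5
```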
We take this to mean that enrichment circuits are the exception, not the rule — by the last couple of bottlenecks,
the network's representations have mostly solidified, with only a few abstract concepts contributing to more general neurons in the last layer.
This interpretation should be qualified by our uncertainty over the accuracy of using feature visualizations to analyze contributions.
While this method appeared to yield results that were consistent with dataset examples, we'd like to repeat this experiment in the future
using e.g. the top 100 dataset examples of the bottleneck's
relu and averaging the contributions of the convolutional layer and previous ReLU
for each example.
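A minimal sketch of that proposed follow-up, for a single channel, might look like the following. It assumes we already have one activation per dataset example for the bottleneck's ReLU, the shortcut, and the convolutional (residual) branch; the function name and toy values are hypothetical, and the experiment itself has not been run here.

```python
import numpy as np

def mean_shortcut_fraction_topk(relu_acts, shortcut_acts, residual_acts, k=100):
    """Average shortcut share over the k examples that most activate the ReLU."""
    relu_acts = np.asarray(relu_acts, dtype=float)
    top = np.argsort(relu_acts)[::-1][:k]  # indices of the top-k examples
    s = np.asarray(shortcut_acts, dtype=float)[top]
    r = np.asarray(residual_acts, dtype=float)[top]
    total = s + r
    valid = total != 0                     # skip examples with an undefined share
    return float(np.mean(s[valid] / total[valid]))

# Toy data: four dataset examples for one channel.
relu = [5.0, 1.0, 4.0, 0.5]
shortcut = [4.0, 1.0, 1.0, 0.5]
residual = [1.0, 0.0, 3.0, 0.0]

# With k=2, only the examples at indices 0 and 2 are averaged.
print(mean_shortcut_fraction_topk(relu, shortcut, residual, k=2))
```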
In the initial stage of our research, we focused on developing tools to make sense of the 2,048 channels in CLIP-RN50's final layer. These tools, listed below, allowed us to find units of interest such as those that activated most highly for a given text.
One initial goal was to identify the unit that activated most highly for a given word, e.g.
LayerActivation module to rasterize a 224×224 image of the text "Trump", and then instrumented the network to track activations for each layer. Then, we took the mean of the
x, y spatial positions to get the average activation per channel in each layer. We then investigated the maximally activating channels produced from this process; however, we found that this approach gave us
To resolve this, we decided to calculate a baseline activation distribution by rasterizing the meaningless text "eeeeee" and subtracting it from the activations of our target word; this allowed us to find meaningful units in the difference between the two activations.
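The baseline subtraction can be sketched in a few lines. The example below assumes activations for one layer are available as a `(channels, height, width)` array for both the target word and the baseline text; the function names and toy activations are illustrative assumptions rather than the actual channel difference engine.

```python
import numpy as np

def channel_means(activations):
    """Average a (channels, height, width) activation map over x, y positions."""
    return activations.mean(axis=(1, 2))

def top_channels_vs_baseline(target_acts, baseline_acts, k=5):
    """Rank channels by how much more they fire on the target than on the baseline."""
    diff = channel_means(target_acts) - channel_means(baseline_acts)
    order = np.argsort(diff)[::-1][:k]
    return [(int(c), float(diff[c])) for c in order]

# Toy 3-channel, 2x2 activation maps. Channel 0 fires strongly on both inputs.
target = np.stack([np.full((2, 2), v) for v in (5.0, 1.0, 3.0)])
baseline = np.stack([np.full((2, 2), v) for v in (5.0, 0.0, 0.5)])

print(top_channels_vs_baseline(target, baseline, k=2))  # [(2, 2.5), (1, 1.0)]
```

In this toy case, channel 0 has the highest raw activation but drops out of the ranking because it responds equally to the baseline text.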
After developing the channel difference engine, we used a list of the top 10,000 most common English words to create mappings from "word → most highly activating unit" and "unit → most highly activating word." We found this useful for exploring concepts in the final layer through an interactive interface.
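Given baseline-subtracted activations for every word, both mappings reduce to an argmax along one axis. The sketch below assumes a hypothetical matrix `diff_matrix[i, c]` holding the difference for word `i` and channel `c`; the function name, the word list, and the values are made up for illustration.

```python
import numpy as np

def build_mappings(words, diff_matrix):
    """Build word -> unit and unit -> word lookups from a (words, channels) matrix."""
    best_unit = diff_matrix.argmax(axis=1)  # strongest channel for each word
    best_word = diff_matrix.argmax(axis=0)  # strongest word for each channel
    word_to_unit = {w: int(u) for w, u in zip(words, best_unit)}
    unit_to_word = {c: words[int(i)] for c, i in enumerate(best_word)}
    return word_to_unit, unit_to_word

# Toy example: three words, three channels.
words = ["hat", "you", "cap"]
diff_matrix = np.array([[0.1, 2.0, 0.3],
                        [1.5, 0.2, 0.1],
                        [0.0, 1.0, 0.9]])

word_to_unit, unit_to_word = build_mappings(words, diff_matrix)
print(word_to_unit)  # {'hat': 1, 'you': 0, 'cap': 1}
print(unit_to_word)  # {0: 'you', 1: 'hat', 2: 'cap'}
```

Note that the two mappings are not inverses of each other: several words can share a best unit, while each unit reports only its single strongest word.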
Credit to Chelsea Voss for the inspiration and design behind these tools.
To understand feature contributions, we relied on expanded weights
The libraries and notebooks used to discover the above results can be found here on GitHub.
Many thanks to Chelsea Voss for mentoring us and providing valuable feedback on technical issues, methodology, and written drafts. Thanks to Chris Olah for initial guidance on circuit investigations, SERI for funding and application feedback, and Distill contributors for the publication format and valuable discussions in the #circuits channel.
Sidney Hough: Implemented reading activations from the network and expanded weights. Came up with the idea to compare the residual and conv contributions, and implemented a scraper to download feature visualizations from Microscope. Performed investigation of circuits for the case studies. For the final writeup, created visualizations and styled all figures shown above.
Kevin Liu: Implemented input generation for typographic attacks, image optimization, and initial expanded weights algorithm. Developed channel difference engine. Worked jointly on developing a database of activations from text to unit and unit to text. Performed investigation of circuits for the case studies.
Jack Ryan: Developed contributions and versatility tools to visualize several properties of a given unit, as well as a tool to visualize the images corresponding to a certain level of activation of a unit (activation distribution tool). Came up with the idea to study large-scale forward contributions across layers and implemented an algorithm to do so using expanded weights. Performed investigation of circuits for the case studies.
† equal contributors