Pretraining and Attention Infrastructure Made Vision Transformers Practical
Isaac Robinson of Roboflow argues that transformers overtook convolutional networks in vision not because images stopped needing visual structure, but because that structure moved from hand-built architecture into pretraining, scaling and tooling. In his account, ViT-style models first lacked the inductive biases and efficiency that made CNNs dominant, but self-supervised vision pretraining and attention infrastructure from the LLM world made the simpler architecture practical. Robinson frames the next problem as deployment: turning large foundation backbones into model families that can meet real latency, cost and hardware constraints.

Transformers won vision by learning the biases they lacked
Isaac Robinson frames the modern shift in vision models as a counterintuitive result: the architecture that looked least naturally suited to images ended up beating the architecture built around image structure.
Convolutional neural networks began with what Robinson calls an “excellent inductive bias.” A convolutional filter moves across an image, and the same activation can fire regardless of whether the object appears in the upper left or bottom right. A person is a person wherever they appear. That locality and translation invariance made CNNs the obvious substrate for vision, and it supported the long line of hierarchical image models, including ResNets.
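To make that property concrete, here is a minimal PyTorch sketch (illustrative only, not from the talk): the same filter produces the same response to an object whether it sits in the upper left or the lower right of the frame.

```python
import torch
import torch.nn.functional as F

# A single Sobel-like vertical-edge filter.
kernel = torch.tensor([[[[-1., 0., 1.],
                         [-2., 0., 2.],
                         [-1., 0., 1.]]]])

def place_square(row, col, size=4):
    """Put a bright square at a chosen location in an otherwise black image."""
    img = torch.zeros(1, 1, 32, 32)
    img[:, :, row:row + size, col:col + size] = 1.0
    return img

resp_top_left = F.conv2d(place_square(2, 2), kernel, padding=1)
resp_bottom_right = F.conv2d(place_square(24, 24), kernel, padding=1)

# The response pattern is identical, just shifted along with the object.
print(resp_top_left.abs().max(), resp_bottom_right.abs().max())
```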
Transformers entered with the opposite profile. In Robinson’s description, a Transformer takes “just a set of tokens” and processes them with a set-to-set operation. It has no native visual bias. Any bias has to be injected. In an autoregressive language model, for example, the causal mask imposes sequence modeling. In vision, the original Vision Transformer did the simplest possible thing: split an image into patches, linearly project the flattened patches, add learned positional encodings, and pass the resulting tokens into a Transformer encoder.
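A minimal sketch of that recipe, with dimensions borrowed from the standard ViT-Base configuration for illustration:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and linearly project each one."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is the usual way to implement "flatten each
        # patch and apply the same linear projection".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches, dim))

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, dim, H/16, W/16)
        x = x.flatten(2).transpose(1, 2)       # (B, num_patches, dim)
        return x + self.pos_embed              # tokens ready for a Transformer encoder

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])
```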
That simplicity carried two obvious liabilities. First, it removed the image-specific bias that had made CNNs so effective. The model did not inherently know that the same feature should behave similarly in different image locations. Second, Robinson says, the compute scaling was ugly. With 16-by-16 patches, an image of side length n produces n/16 patches along each dimension, so (n/16)² tokens in total, and full attention compares every pair of tokens, which he describes as roughly fourth-power scaling with resolution.
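Back-of-the-envelope arithmetic (illustrative numbers, not Robinson’s) shows why that scaling hurts:

```python
# Token counts and pairwise attention interactions at a few resolutions.
patch = 16
for side in (224, 448, 896):
    tokens = (side // patch) ** 2   # patches grow quadratically with side length
    pairs = tokens ** 2             # full attention compares every pair of tokens
    print(f"{side}px -> {tokens} tokens, {pairs:,} attention pairs")
# Doubling the resolution quadruples the tokens and multiplies attention work
# by roughly 16x: fourth-power growth in image side length.
```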
The natural comparison, then, was not flattering to ViT: a high-inductive-bias convolutional network versus a no-inductive-bias Vision Transformer. Robinson’s punchline is that the Vision Transformer won anyway: “So as everyone would expect, it’s the ViT.”
His explanation has two parts. Vision Transformers benefited from massive, ViT-specific pretraining that recovered useful visual structure from data. They also inherited infrastructure and speedups built for large language models, especially attention optimizations such as Flash Attention. The original objections did not disappear; scale, pretraining, and tooling changed which tradeoffs mattered.
The field kept adding vision bias, then stripping it back out
Isaac Robinson describes the architecture history as a loop: ViT to Swin, then ConvNeXt, then Hiera, and finally back to ViT. The loop matters because each step tried to solve a real weakness in the preceding design, only for the field to return to the simple architecture that scaled and pretrained well.
Swin Transformer addressed the most obvious Vision Transformer problem by restoring locality. Instead of applying global attention across all patches, Swin performs attention inside local windows. If the windows stayed fixed, tokens in different windows would never interact, so later layers shift the windows. That creates overlapping regions and allows information to move across the image. Robinson points out that the operation begins to resemble convolution: similar computation applied locally across sections of the image, with overlap between application locations.
Swin therefore solved two problems at once. It added a locality inductive bias, and because the window size stays fixed regardless of image resolution, attention cost grows only with the number of windows, bringing the scaling back down to quadratic in image side length. In Robinson’s words, “that seems logical, that makes sense.”
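A rough sketch of the window-partition-and-shift mechanics described above (attention itself, and Swin’s masking of tokens that wrap around under the shift, are omitted):

```python
import torch

def window_partition(x, window=7):
    """Split a (B, H, W, C) feature map into non-overlapping window x window groups."""
    B, H, W, C = x.shape
    x = x.view(B, H // window, window, W // window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)

feat = torch.randn(1, 56, 56, 96)            # patch features, Swin-T-like dimensions
windows = window_partition(feat)             # (64, 49, 96): attention runs inside each window
shifted = torch.roll(feat, shifts=(-3, -3), dims=(1, 2))
shifted_windows = window_partition(shifted)  # next layer: shifted windows let tokens mix

# Cost per window is (tokens per window)^2 with the window size fixed, so total
# cost grows only linearly with the number of windows, i.e. with image area.
print(windows.shape, shifted_windows.shape)
```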
ConvNeXt represented a different response: if Transformer machinery had no inherent relationship to vision, perhaps the better move was to take the lessons learned from Vision Transformers and put them back into a convolutional network. Robinson describes ConvNeXt as borrowing the Transformer pattern of alternating a spatial mixer and a feed-forward block. In a Transformer, self-attention mixes spatial information. In ConvNeXt, convolution plays that spatial-mixing role. The model keeps a hierarchical convolutional structure and adds details learned from Transformer practice, including LayerNorm and other innovations.
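A simplified block in that spirit (layer scale, stochastic depth, and other ConvNeXt details are omitted; the dimensions are illustrative):

```python
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    """Spatial mixing via depthwise convolution, then a per-position feed-forward,
    mirroring the attention + MLP alternation of a Transformer block."""
    def __init__(self, dim=96):
        super().__init__()
        self.spatial_mix = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)  # LayerNorm rather than BatchNorm
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                   # x: (B, C, H, W)
        residual = x
        x = self.spatial_mix(x)             # convolution plays the spatial-mixing role
        x = x.permute(0, 2, 3, 1)           # channels-last for LayerNorm / Linear
        x = self.ffn(self.norm(x))
        return residual + x.permute(0, 3, 1, 2)

out = ConvNeXtStyleBlock()(torch.randn(1, 96, 56, 56))
print(out.shape)  # torch.Size([1, 96, 56, 56])
```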
On standard ImageNet comparisons, Robinson says, ConvNeXt beat ViT and Swin. For a moment, that result seemed to restore architectural common sense: a vision-specific model, updated with Transformer-era lessons, could outperform the less-biased Transformer.
But the next step, Hiera, shifted the conclusion. Robinson attributes this move to Meta: take a strong, inductively biased Transformer model, remove the specialized biases one by one, and use pretraining to learn the bias instead. The motivation was speed as well as performance. Specialized architectural machinery can help, but it also adds overhead. If pretraining can recover the same useful behavior, stripping out the manually injected bias can produce a faster model without losing the capabilities that mattered.
| Stage | Architectural move | Robinson’s interpretation |
|---|---|---|
| ViT | Patchify the image and apply a standard Transformer encoder | Simple, scalable, but initially lacks visual inductive bias and has expensive attention scaling |
| Swin | Apply attention in shifted local windows | Adds locality and reduces resolution scaling when window size is fixed |
| ConvNeXt | Bring Transformer-era design patterns back into CNNs | Shows CNNs can absorb Transformer lessons and perform strongly on ImageNet |
| Hiera | Strip out injected biases and rely on pretraining to recover them | Shows the tradeoff between handcrafted bias and learned bias |
| ViT again | Return to the simple architecture with massive pretraining and attention infrastructure | The simple thing that scales well wins |
For Robinson, Hiera is the cleanest example of the balance between inductive bias and pretraining. The architecture does not need to hard-code as much visual structure if the training objective can force the model to learn it.
Masked and self-supervised pretraining made ViTs useful before task training
Isaac Robinson uses MAE, the Masked Autoencoder, as the central pretraining example. The procedure is straightforward: split an image into patches, drop many of the patches, and train the model to reconstruct what should have been there from the remaining context. Robinson compares it to BERT for an audience familiar with language models.
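A minimal sketch of the masking step, assuming patch tokens have already been produced; the lightweight reconstruction decoder and pixel targets of real MAE training are omitted:

```python
import torch

def mae_mask(tokens, mask_ratio=0.75):
    """Keep a random subset of patch tokens; the encoder sees only the kept ones,
    and a decoder is trained to reconstruct the pixels of the dropped patches."""
    B, N, D = tokens.shape
    num_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                          # random score per patch
    keep_idx = noise.argsort(dim=1)[:, :num_keep]     # indices of patches to keep
    kept = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return kept, keep_idx                             # encoder input + bookkeeping

patch_tokens = torch.randn(2, 196, 768)               # 14x14 patches, ViT-Base width
visible, idx = mae_mask(patch_tokens)
print(visible.shape)                                  # torch.Size([2, 49, 768])
```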
The point is not just that MAE is a good training objective. It is ViT-specific in a way that matters to the architecture debate. A Vision Transformer naturally handles image patches as tokens, so dropping patches and reconstructing them fits the model’s representation. Robinson argues that the same objective cannot be applied in the same way to a convolutional network: if convolution is invariant across patches, “how do you drop out a patch” in the equivalent sense? The pretraining technique supplies a learned version of the visual bias the architecture itself does not contain.
DINOv2 and DINOv3 push this further. Robinson describes them as “really, again, ViT-specific pretraining techniques” that produce not merely better image-processing tendencies but rich feature maps out of the box. A PCA decomposition of feature maps from a DINOv3-pretrained ViT, shown in Robinson’s materials, displays semantically coherent structure: cat paws appear in distinct colors and are tracked across different cats; satellite imagery decomposes in a semantically meaningful way.
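One plausible way such a PCA visualization can be computed from patch features (a sketch, not Robinson’s code; the features below are random placeholders standing in for real DINO outputs):

```python
import torch

def pca_rgb(patch_features, grid_size):
    """Project per-patch features onto their top 3 principal components and
    reshape them into an image-like grid for pseudo-color display."""
    feats = patch_features - patch_features.mean(dim=0, keepdim=True)
    _, _, v = torch.pca_lowrank(feats, q=3)           # top-3 principal directions
    rgb = feats @ v                                    # (num_patches, 3)
    rgb = (rgb - rgb.min(0).values) / (rgb.max(0).values - rgb.min(0).values + 1e-6)
    return rgb.reshape(grid_size, grid_size, 3)

fake_features = torch.randn(256, 1024)                 # e.g. a 16x16 patch grid from a ViT
print(pca_rgb(fake_features, grid_size=16).shape)      # torch.Size([16, 16, 3])
```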
The implication is that the frozen representation is already close to useful before downstream training. Robinson says that under linear probing — freezing the feature extractor and training only a linear projection on top — self-supervised learning is approaching the best results available from fully supervised learning. The model has not simply learned to classify a benchmark. It has learned visual structure that transfers.
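A minimal sketch of linear probing, where `backbone` is a stand-in for any frozen pretrained feature extractor that returns a pooled feature vector:

```python
import torch
import torch.nn as nn

def linear_probe_step(backbone, images, labels, probe, optimizer):
    """One training step: gradients flow only into the linear probe."""
    backbone.eval()
    with torch.no_grad():                  # the pretrained features stay frozen
        feats = backbone(images)           # (B, feature_dim)
    logits = probe(feats)                  # only this linear layer is trained
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

feature_dim, num_classes = 768, 1000
probe = nn.Linear(feature_dim, num_classes)
optimizer = torch.optim.SGD(probe.parameters(), lr=0.1)
```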
LLM attention infrastructure weakened the speed argument against ViT
Even if pretraining recovers visual structure, global attention over image patches remains expensive. Isaac Robinson argues that the LLM boom changed the economics of attention itself.
He points specifically to Flash Attention as the largest example of tooling developed because “people care a lot about attention” in language models. Hiera had shown speedups against ViT at the same accuracy, but Robinson notes that the Hiera paper explicitly did not measure with Flash Attention. Once attention-specific speedups from the LLM world are added back into the comparison, the penalty of the “very silly” setup matters less.
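In PyTorch, for example, the same fused attention path that serves LLMs is available to a ViT through `scaled_dot_product_attention`; the sketch below assumes PyTorch 2.3+ and a CUDA GPU, and the kernel-selection context manager simply pins the FlashAttention backend.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Shapes roughly matching one ViT-Base layer over 14x14 = 196 patch tokens.
B, heads, tokens, head_dim = 8, 12, 196, 64
q = torch.randn(B, heads, tokens, head_dim, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# scaled_dot_product_attention dispatches to a fused FlashAttention kernel when
# one is eligible; the context manager restricts dispatch to that backend.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([8, 12, 196, 64])
```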
This is a subtle part of Robinson’s thesis. Vision Transformers did not win only because they were better vision models in isolation. They won because their core operation was the same operation receiving enormous optimization pressure elsewhere. A convolutional architecture could be carefully engineered for images, but the broader industry was making attention faster, more scalable, and better supported.
The architecture with fewer built-in visual assumptions therefore gained two external advantages at once: self-supervised pretraining that taught it image structure, and infrastructure from language models that made its expensive computation more practical.
SAM repeated the same architecture cycle
The Segment Anything Model family supplies Isaac Robinson’s practical example of the broader pattern. From his perspective, SAM was “one of the most important foundation model series in vision, period.” Its backbones trace the same loop from simple ViT to specialized alternatives and back.
SAM used a ViT trained with MAE. MobileSAM replaced that backbone with TinyViT, which Robinson describes as a specialized convolutional-Transformer hybrid, reflecting the familiar instinct that the plain ViT could not possibly be the best option. SAM2 used Hiera with MAE pretraining. For SAM3, Robinson says the series “just gives up on the architecture ablation” and puts in a massively pretrained backbone; his slide identifies the sequence as moving to “ViT (PE).”
| Model | Backbone described in the talk | Pattern |
|---|---|---|
| SAM | ViT with MAE | Foundation model begins with a pretrained Vision Transformer |
| MobileSAM | TinyViT | A specialized convolutional-Transformer hybrid replaces the ViT backbone |
| SAM2 | Hiera with MAE | Bias is reduced and recovered through pretraining |
| SAM3 | ViT (PE), which Robinson describes as a massively pretrained backbone | The series returns to a large pretrained ViT-style backbone |
The practical consequence is mixed. SAM3 is powerful, but Robinson says it is also about 800 million parameters and takes about 300 milliseconds to run on a T4 GPU. That makes it unusable in many cases, especially the low-power edge deployments and resource-constrained scenarios that have historically mattered in computer vision.
For Robinson, this is the new problem created by the very strategy that made Transformers win vision. If performance depends on enormous pretraining runs and large general-purpose backbones, deployment flexibility suffers. The model becomes one-size-fits-all. It may be strong in aggregate but poorly matched to the latency, power, cost, or hardware constraints of a specific application.
Roboflow’s answer is to keep the foundation model, then search for deployable variants
Isaac Robinson positions Roboflow’s RF-DETR work as an attempt to preserve the gains from foundation-model pretraining while restoring deployment flexibility. The premise is that fixed foundation models need to be transformed into model families that can be matched to target data and target hardware.
He says Roboflow introduced RF100-VL to measure how well foundation models transfer to diverse downstream tasks for object detection, which he calls one of the canonical vision-centric tasks. In the comparison he shows, RF-DETR is plotted against SAM3 across T4 latency and COCO instance mAP. Robinson reports roughly a 40x speedup at the same accuracy versus fine-tuning SAM3. He adds that even at a “merely” 15x speedup, RF-DETR produces a meaningful improvement.
The latency/mAP chart matters because Robinson’s claim is not simply that a smaller model is faster. It is that the accuracy-latency frontier can be shifted while still drawing on foundation-model pretraining. The chart places RF-DETR and SAM3 in the same latency-accuracy space, with RF-DETR sitting far toward the low-latency side while matching or exceeding accuracy at the relevant operating point.
Robinson also makes a related claim about real-time instance segmentation. He says that, at the time RF-DETR was published, the models shown along that comparison line were the best real-time instance segmentation models, and that Roboflow outperformed them in a meaningful way. The talk therefore keeps several evaluation frames in view: RF100-VL for downstream transfer on object detection, COCO instance mAP on the latency chart, and a comparison to real-time instance segmentation models.
| Evaluation frame | What Robinson says it is used for | Claim in the talk |
|---|---|---|
| RF100-VL | A dataset for measuring transfer of foundation models to diverse downstream object-detection tasks | Used to evaluate whether foundation models remain useful after adaptation |
| T4 latency vs. COCO instance mAP chart | A speed-accuracy comparison between RF-DETR and SAM3 | RF-DETR shows about a 40x speedup at the same accuracy versus fine-tuning SAM3, and a meaningful improvement at about 15x speedup |
| Real-time instance segmentation comparison | A comparison to the best real-time instance segmentation models at the time of RF-DETR’s publication | Roboflow outperformed those models meaningfully, according to Robinson |
The architectural mechanism is pretraining-compatible neural architecture search. Robinson says the models in the RF-DETR family use the same foundation model, but Roboflow modifies that model through neural architecture search to generate a family of high-performance models in one go. The goal is not to discard the pretrained representation. It is to adapt the deployment profile without losing the advantages of the foundation backbone.
The tunable knobs he lists are all framed as drop-in compatible with existing foundation-model infrastructure. They include changing input resolution, interpolating patch embeddings, dropping queries, varying decoder layers, interpolating resolution, and changing the number of windows. Each knob exposes a speed-accuracy tradeoff: fewer patches, fewer queries, fewer layers, lower resolution, or fewer windows make inference faster and less accurate; the reverse makes it slower and more accurate.
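One of those knobs, changing resolution by resampling learned embeddings, can be sketched as follows; this is an illustration of the general idea, not Roboflow’s implementation, and the dimensions are assumed for the example.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, old_grid, new_grid):
    """Resample a ViT's learned positional embeddings so a backbone pretrained at
    one resolution can run at another without retraining from scratch."""
    dim = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

pretrained = torch.randn(1, 14 * 14, 768)           # embeddings learned at 224px / 16px patches
faster = interpolate_pos_embed(pretrained, 14, 10)   # run at 160px: fewer tokens, lower latency
slower = interpolate_pos_embed(pretrained, 14, 28)   # run at 448px: more tokens, higher accuracy
print(faster.shape, slower.shape)                    # (1, 100, 768) (1, 784, 768)
```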
That framing turns the Transformer victory from a benchmark story into a deployment story. Massive pretraining and LLM-derived attention infrastructure explain why ViTs became strong enough to challenge older convolutional approaches. But without a way to tune those models for concrete hardware constraints, the victory is incomplete. Robinson’s final formula is therefore: massive ViT-specific pretraining, speedups borrowed from LLMs, and pretraining-compatible neural architecture search.
Video, text, and world-model pretraining remain unsettled
Asked whether architectures already support unified video, image, and text pretraining, Isaac Robinson answers cautiously: many people are working on many combinations. He points again to SAM3 as a useful example, not because it is a fully unified answer, but because it shows large-scale training applied beyond static image perception.
For vision-specific video processing, Robinson says SAM3 handles video from the perspective of tracking objects through video. It uses massive-scale pretraining, perception-encoder pretraining for the backbone, and substantial downstream training. That is not presented as the final unified architecture, but as evidence that the field is already blending image and video objectives around foundation-model backbones.
Another audience member asks about JEPA and V-JEPA. Robinson describes them as another variety of foundation model. His assessment is explicitly an impression from what he has seen: for single-image-centric pretraining, JEPA does not seem to outperform many of the other approaches; for Video JEPA, he has not yet seen it used meaningfully in a video context for downstream transfer. He leaves the question open: “we’ll see.”
That uncertainty is consistent with the rest of the talk. Robinson is confident about the mechanism that brought Transformers to the center of vision: pretraining learned missing biases, attention infrastructure reduced the cost, and deployment adaptation became the next frontier. He is less definitive about which multimodal or video pretraining approach will dominate next.
