Zero-Shot Image Classification: An ML Whitepaper

The global computer vision market was estimated at USD 19.82 billion in 2024 and is projected to reach USD 58.29 billion by 2030, growing at a 19.8% CAGR (Grand View Research, 2025). A growing share of that market is being captured by models that assign images to classes they were never explicitly trained to recognize, a capability known as zero-shot image classification. Three concurrent forces have made zero-shot classification practical at production scale: (i) contrastive vision-language pretraining at web scale, which aligns images and text in a shared embedding space (Radford et al., 2021; Jia et al., 2021); (ii) self-supervised foundation models that learn powerful visual features from unlabeled images alone (Caron et al., 2021; Oquab et al., 2024; Siméoni et al., 2025); and (iii) multimodal large language models that reason about images in open-ended natural language (Alayrac et al., 2022; Liu et al., 2023; OpenAI, 2023).

This whitepaper covers the formal problem definitions, the end-to-end mechanics, the dominant model families with verified benchmark numbers, production engineering trade-offs, and six application domains, with citations to the primary literature throughout.

What is Zero-Shot Image Classification?

Let S be the set of seen classes (those with training signal) and U the set of unseen classes, with S ∩ U = ∅. In classic zero-shot learning (ZSL), the classifier maps inputs only onto unseen classes, f: X → U. In generalized zero-shot learning (GZSL), the test label space is the union, f: X → S ∪ U, a harder and more realistic setting, because models tend to over-predict seen classes (Chao et al., 2016; Xian et al., 2018a).

The bridge between seen and unseen classes is a semantic embedding space: each class y is represented by a vector φ(y) (an attribute vector, a word embedding, or a text-encoder output) and each image x by θ(x). Many ZSL methods learn a compatibility function

F(x, y) = θ(x)ᵀ W φ(y)

and predict the class with the highest compatibility.
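As a minimal sketch of the bilinear compatibility score above, the following pure-Python snippet evaluates F(x, y) = θ(x)ᵀ W φ(y) for two candidate classes and picks the higher-scoring one. The feature vector, the matrix W, and the class embeddings are hand-picked toy values used only for illustration; they do not come from any trained model.

```python
def compatibility(theta_x, W, phi_y):
    """Bilinear score F(x, y) = theta(x)^T W phi(y)."""
    # W has one row per image-feature dimension and one column per
    # class-embedding dimension, so W @ phi_y lives in image-feature space.
    w_phi = [sum(w * p for w, p in zip(row, phi_y)) for row in W]
    return sum(t * v for t, v in zip(theta_x, w_phi))


def predict(theta_x, W, class_embeddings):
    """Return the class whose embedding has the highest compatibility."""
    return max(
        class_embeddings,
        key=lambda y: compatibility(theta_x, W, class_embeddings[y]),
    )


# Toy values (assumptions for illustration only).
theta_x = [1.0, 0.0, 2.0]                    # image feature, dimension 3
W = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]    # 3 x 2 compatibility matrix
class_embeddings = {                         # phi(y), dimension 2
    "zebra": [1.0, 0.0],
    "horse": [0.0, 1.0],
}

scores = {y: compatibility(theta_x, W, phi) for y, phi in class_embeddings.items()}
prediction = predict(theta_x, W, class_embeddings)
print(scores, prediction)
```

In a real ZSL method, W would be learned on seen classes and then applied unchanged to the embeddings of unseen classes.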

Historical milestones. Attribute-based prediction launched the field (Lampert et al., 2009; Farhadi et al., 2009; Lampert et al., 2014), followed by cross-modal and embedding transfer (Socher et al., 2013; Frome et al., 2013, DeViSE; Norouzi et al., 2014, ConSE), and then the contrastive web-scale paradigm shift of CLIP (Radford et al., 2021) and ALIGN (Jia et al., 2021), which turned zero-shot classification into open-vocabulary classification: any class expressible as a text prompt becomes classifiable at inference time.

A useful working taxonomy in 2026: classic ZSL (unseen-only test space, AwA2/CUB/SUN benchmarks), generalized ZSL (seen + unseen, harmonic-mean evaluation), and open-vocabulary classification (arbitrary text labels at inference, the CLIP regime).
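The harmonic-mean evaluation used for generalized ZSL can be sketched in a few lines. The harmonic mean H = 2·s·u / (s + u) of seen-class accuracy s and unseen-class accuracy u penalizes a model that does well on seen classes only. The accuracy values below are invented for illustration.

```python
def harmonic_mean(acc_seen, acc_unseen):
    """Harmonic mean of seen-class and unseen-class accuracy."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)


# A model biased toward seen classes versus a balanced one (toy numbers).
biased = harmonic_mean(0.90, 0.10)     # arithmetic mean would be 0.50
balanced = harmonic_mean(0.60, 0.60)
print(round(biased, 2), round(balanced, 2))
```

The biased model has an arithmetic mean of 0.50 but a harmonic mean of only 0.18, which is why the harmonic mean is the preferred summary when seen-class over-prediction is the main failure mode.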

A critical correction: linear probing of self-supervised features (DINOv2, DINOv3) is not zero-shot classification. A linear probe is a classifier trained on labeled examples of the target classes on top of frozen features; it requires labels and is therefore supervised. The accurate statement is not “self-supervised features are zero-shot” but “self-supervised features need either a trained probe (supervised) or a separate text-alignment module (e.g., dino.txt) to perform zero-shot classification.” Conflating the two leads to roadmaps that promise label-free deployment and then discover an annotation budget at integration time.

How Does Zero-Shot Image Classification Work?

How Contrastive Pretraining Powers Zero-Shot Image Classification

CLIP trains an image encoder and a text encoder jointly so that matching image–text pairs have high cosine similarity and non-matching pairs low similarity, using a symmetric InfoNCE/softmax contrastive loss with a learned temperature τ:

L = ½ (L_image→text + L_text→image)

where each term is a cross-entropy over in-batch similarities scaled by 1/τ. CLIP was trained on WIT-400M, 400 million image–text pairs collected from the internet (Radford et al., 2021). ALIGN scaled the recipe to 1.8 billion noisy alt-text pairs (Jia et al., 2021); the open LAION-5B dataset provides 5.85 billion CLIP-filtered pairs (Schuhmann et al., 2022); and DataComp systematized multimodal dataset design as a benchmark in its own right (Gadre et al., 2023). SigLIP replaces the softmax with a pairwise sigmoid loss, which removes the global normalization over the batch and performs better at modest batch sizes; the authors found that gains from batch-size scaling saturate, with ~32k being sufficient (Zhai et al., 2023).
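The two loss families described above can be compared on a tiny similarity matrix. This is a sketch in pure Python, not a training implementation: `sim[i][j]` is assumed to be the cosine similarity between image i and text j with matching pairs on the diagonal, `tau` is the softmax temperature, and `t` and `b` are the temperature and bias of the sigmoid loss. All numeric values are toy choices.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]


def clip_loss(sim, tau):
    """Symmetric InfoNCE: average of image-to-text and text-to-image cross-entropy."""
    n = len(sim)
    logits = [[s / tau for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]   # transpose for text-to-image
    loss_i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def sigmoid_loss(sim, t, b):
    """Pairwise sigmoid loss: every (image, text) pair is its own binary problem."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0              # +1 for matches, -1 otherwise
            # -log(sigmoid(z * (t * sim + b))) written as log(1 + exp(-...))
            total += math.log1p(math.exp(-z * (t * sim[i][j] + b)))
    return total / n


aligned = [[1.0, 0.0], [0.0, 1.0]]      # matching pairs are most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # matching pairs are least similar

print(clip_loss(aligned, tau=0.07), clip_loss(misaligned, tau=0.07))
print(sigmoid_loss(aligned, t=10.0, b=-5.0), sigmoid_loss(misaligned, t=10.0, b=-5.0))
```

Note that the softmax loss normalizes each row and column over the whole batch, while the sigmoid loss touches each pair independently, which is the property that removes the global normalization.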

ResNet vs. ViT Encoders in Zero-Shot Image Classification

CLIP’s 2021 release shipped both convolutional (ResNet; He et al., 2016) and Vision Transformer (Dosovitskiy et al., 2021) image encoders, five ResNets and three ViTs, and its best model is ViT-L/14@336px. The difference between the two is not that “ViTs see the whole image while CNNs cannot”, since both ultimately integrate the entire image; it is that ViT self-attention (Vaswani et al., 2017) is global from the very first layer, while a CNN’s effective receptive field expands gradually with depth. ViT backbones have since become the default for the strongest zero-shot models because they scale predictably with data and compute (Cherti et al., 2023).
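To make the gradual growth of a CNN's receptive field concrete, the sketch below computes the theoretical receptive field after each convolutional layer using the standard recurrence (each layer adds (kernel − 1) × the accumulated stride). The layer lists are illustrative assumptions, not the configuration of any specific ResNet, and the theoretical value is only an upper bound on the effective receptive field mentioned above.

```python
def receptive_field(layers):
    """Theoretical receptive field after each layer.

    layers: list of (kernel_size, stride) tuples, applied in order.
    """
    rf, jump, history = 1, 1, []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # widen by the kernel extent in input pixels
        jump *= stride              # distance between adjacent outputs, in input pixels
        history.append(rf)
    return history


# Four stacked 3x3 stride-1 convolutions: growth is linear in depth.
plain_stack = receptive_field([(3, 1)] * 4)
# A strided stem followed by 3x3 convolutions grows faster, but still layer by layer.
strided_stack = receptive_field([(7, 2), (3, 2), (3, 1), (3, 1)])
print(plain_stack, strided_stack)
```

By contrast, a single self-attention layer lets every patch token attend to every other patch, so its receptive field covers the whole image at depth one.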

How CLIP Zero-Shot Inference Works

Given class names, build prompts (e.g., “a photo of a {label}”), encode them to text vectors t_y, encode the image to v, and predict

ŷ = argmax_y cos(v, t_y)

No retraining or labeled examples of the target classes are needed. Prompt engineering matters quantitatively: in the CLIP paper, using the template “a photo of a {label}” instead of the raw class name added 1.3 points of ImageNet accuracy, and ensembling 80 hand-written prompts added a further 3.5 points (Radford et al., 2021). LLM-generated prompts extend the idea: CuPL queries a language model for customized class descriptions (Pratt et al., 2023), and description-based classification scores images against LLM-generated visual descriptors, adding interpretability (Menon & Vondrick, 2023).
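The inference rule and prompt ensembling can be sketched as follows. The snippet assumes the text and image embeddings have already been produced by the model's encoders; the vectors below are toy stand-ins, and the helper names (`l2_normalize`, `ensemble_class_embedding`, `zero_shot_predict`) are hypothetical. Ensembling is done here by averaging unit-normalized prompt embeddings per class and renormalizing.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def ensemble_class_embedding(prompt_embeddings):
    """Average the unit-normalized prompt embeddings, then renormalize."""
    unit = [l2_normalize(e) for e in prompt_embeddings]
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalize(mean)


def zero_shot_predict(image_embedding, class_text_embeddings):
    """Return (best label, per-class cosine scores)."""
    v = l2_normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(v, t))
        for label, t in class_text_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Toy stand-ins for text-encoder outputs of several prompts per class,
# e.g. "a photo of a {label}" and "a close-up photo of a {label}".
prompt_vectors = {
    "cat": [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1]],
    "dog": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.2]],
}
class_text_embeddings = {
    label: ensemble_class_embedding(vectors)
    for label, vectors in prompt_vectors.items()
}

label, scores = zero_shot_predict([0.7, 0.2, 0.1], class_text_embeddings)
print(label, scores)
```

Because the class embeddings depend only on the label set, they can be computed once and reused for every image.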

Classic Zero-Shot Learning Methods

Attribute-based (DAP): learn attribute classifiers (“has stripes,” “can fly”) on seen classes, then compose them for unseen classes (Lampert et al., 2009; Lampert et al., 2014). Standard benchmarks: Animals with Attributes / AwA2 (Xian et al., 2018a), CUB-200-2011 (Wah et al., 2011), aPY (Farhadi et al., 2009).
Embedding-based: project images into a semantic space and classify by nearest class embedding (Socher et al., 2013; DeViSE, Frome et al., 2013; ConSE, Norouzi et al., 2014). These methods suffer from the hubness problem: in high-dimensional spaces, a few points become the nearest neighbor of disproportionately many queries (Radovanović et al., 2010). Regression-direction and ridge-based mitigations were analyzed by Shigeto et al. (2015).
Generative: synthesize visual features for unseen classes with conditional GANs or VAEs (f-CLSWGAN, Xian et al., 2018b; f-VAEGAN-D2, Xian et al., 2019), then train an ordinary classifier on real seen-class plus synthetic unseen-class features. This directly attacks the seen-class bias of GZSL.
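As an illustration of the attribute-based approach in the list above, here is a simplified DAP-style scoring rule. It assumes attribute classifiers trained on seen classes have already produced per-attribute probabilities for an image, and it scores each unseen class by how well those probabilities match the class's binary attribute signature. The attribute names, signatures, and probabilities are invented, and this sketch omits the attribute-prior normalization used in the original formulation.

```python
import math


def dap_score(attr_probs, class_signature, eps=1e-6):
    """Log-likelihood that the predicted attributes match a binary class signature."""
    score = 0.0
    for p, present in zip(attr_probs, class_signature):
        p = min(max(p, eps), 1.0 - eps)              # avoid log(0)
        score += math.log(p if present == 1 else 1.0 - p)
    return score


attributes = ["has stripes", "can fly", "has hooves"]
unseen_signatures = {                # one 0/1 entry per attribute (toy values)
    "zebra": [1, 0, 1],
    "eagle": [0, 1, 0],
}
# Hypothetical outputs of attribute classifiers for one test image.
attr_probs = [0.90, 0.05, 0.80]

scores = {
    name: dap_score(attr_probs, signature)
    for name, signature in unseen_signatures.items()
}
prediction = max(scores, key=scores.get)
print(prediction, scores)
```

No image of a zebra or an eagle is needed to build the classifier; only the attribute signatures of the unseen classes are required.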

How to Improve Zero-Shot Image Classification Accuracy

Linear probe: train a linear head on frozen features (supervised; see the correction under “What is Zero-Shot Image Classification?”).
Prompt tuning: CoOp learns continuous context vectors in place of hand-written prompt words; CoCoOp conditions them on the image to generalize better to unseen classes (Zhou et al., 2022a; Zhou et al., 2022b).
Adapters: CLIP-Adapter (Gao et al., 2024) and training-free Tip-Adapter (Zhang et al., 2022) add small modules over frozen encoders; LoRA injects low-rank weight updates (Hu et al., 2022).
Robust fine-tuning: naive fine-tuning of a zero-shot model erodes its distribution-shift robustness; WiSE-FT interpolates zero-shot and fine-tuned weights, gaining 4–6 points under distribution shift while preserving or improving target accuracy (Wortsman et al., 2022).
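The weight-space interpolation behind WiSE-FT is simple enough to sketch directly. The snippet assumes the zero-shot and fine-tuned models share an architecture and represents each model as a dictionary of flat parameter lists; the parameter names and values are toy placeholders, and `alpha` is the mixing coefficient (0 returns the zero-shot weights, 1 the fine-tuned weights).

```python
def wise_ft(zero_shot_weights, fine_tuned_weights, alpha):
    """Per-parameter interpolation: (1 - alpha) * zero_shot + alpha * fine_tuned."""
    if zero_shot_weights.keys() != fine_tuned_weights.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [
            (1.0 - alpha) * z + alpha * f
            for z, f in zip(zero_shot_weights[name], fine_tuned_weights[name])
        ]
        for name in zero_shot_weights
    }


# Toy parameters (assumptions for illustration only).
zero_shot = {"proj.weight": [0.0, 2.0], "proj.bias": [1.0]}
fine_tuned = {"proj.weight": [1.0, 4.0], "proj.bias": [3.0]}

merged = wise_ft(zero_shot, fine_tuned, alpha=0.5)
print(merged)
```

In practice `alpha` would be chosen on held-out data, trading target-distribution accuracy against robustness under shift.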

Production Considerations for Zero-Shot Image Classification

Embedding caching: precompute and cache text embeddings for fixed label sets; only the image encoder runs per request.
Vector search: for very large or open-label spaces, index label embeddings with approximate nearest-neighbor search for sub-linear lookup.
Quantization and distillation: INT8 post-training quantization and distillation to smaller backbones cut GPU memory and latency with minor accuracy cost.
Batching: server-side dynamic batching amortizes kernel launches at the cost of tail latency.
Monitoring: zero-shot accuracy degrades on inputs unlike the pretraining distribution; track input drift and per-class confidence calibration in production.
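The embedding-caching item above can be sketched with the standard library's `functools.lru_cache`. Here `encode_text` is a hypothetical stand-in for an expensive text-encoder forward pass (it returns a toy tuple rather than a real embedding), and the call counter exists only to show that the encoder runs once per label set.

```python
from functools import lru_cache

encoder_calls = {"count": 0}


def encode_text(prompt):
    """Hypothetical placeholder for a text-encoder forward pass."""
    encoder_calls["count"] += 1
    # Toy deterministic "embedding"; a real encoder returns a dense vector.
    return (float(len(prompt)), float(sum(map(ord, prompt)) % 97))


@lru_cache(maxsize=1024)
def label_set_embeddings(labels):
    """Embed a fixed label set once; `labels` must be a hashable tuple."""
    return tuple(encode_text(f"a photo of a {label}") for label in labels)


labels = ("cat", "dog", "zebra")
first = label_set_embeddings(labels)    # runs the encoder three times
second = label_set_embeddings(labels)   # served from the cache
print(encoder_calls["count"], first is second)
```

With the text side cached, only the image encoder and a small matrix of dot products run per request.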

Best Zero-Shot Image Classification Models and Benchmarks


CLIP, SigLIP, ALIGN, and Other Vision-Language Models

The takeaway: zero-shot ImageNet accuracy has climbed roughly ten points since CLIP, from 76.2% to the 85–86% range, driven by data scale (LAION-5B, DataComp), locked-image tuning (LiT), captioning-plus-contrastive objectives (CoCa), loss design (SigLIP), and parameter scaling (EVA-CLIP-18B).

In 2026, the practical open defaults are SigLIP-family and EVA-CLIP-family checkpoints; CoCa’s 86.3% remains the headline contrastive zero-shot number.

Model	Params	Pretraining data	Zero-shot ImageNet top-1	Source
CLIP ViT-B/32	~151M	WIT-400M	~63%	Radford et al., 2021
CLIP ViT-L/14@336	~428M	WIT-400M	76.2%	Radford et al., 2021
ALIGN (EfficientNet-L2 + BERT)	—	1.8B noisy pairs	76.4%	Jia et al., 2021
Florence	~893M	FLD-900M	83.7%	Yuan et al., 2021
LiT (ViT-g/14)	—	~4B pairs	85.2%	Zhai et al., 2022
BASIC	—	6.6B pairs	85.7%	Pham et al., 2021
CoCa	2.1B	JFT-3B + ALIGN	86.3%	Yu et al., 2022
OpenCLIP ViT-G/14	~1.8B	LAION-2B	80.1%	Cherti et al., 2023
SigLiT ViT-g/14 (sigmoid loss)	—	—	84.5%	Zhai et al., 2023
EVA-CLIP-18B	18B	Merged-2B	83.8% (80.7% avg / 27 benchmarks)	Sun et al., 2024

Linear Probing vs. Zero-Shot Classification

Model	Backbone	ImageNet linear probe	Source
DINO	ViT-B/16	~78.2%	Caron et al., 2021
MAE	ViT-L/16	strong fine-tune transfer	He et al., 2022
DINOv2	ViT-L/14 (distilled from ViT-g)	86.3%	Oquab et al., 2024
DINOv3	ViT-7B, 1.7B curated images	SOTA frozen features across classification, segmentation, depth, detection	Siméoni et al., 2025


DINOv3 (Siméoni et al., 2025, arXiv:2508.10104) is trained on ~1.7 billion curated Instagram images with a ~7B-parameter ViT, combining teacher–student self-distillation, an iBOT-style patch objective, and a new Gram anchoring loss that stabilizes dense features at long training schedules. Its frozen features lead on dense prediction tasks without fine-tuning.

Every number in this table is a linear-probe result: a classifier trained on labeled ImageNet examples over frozen features, and therefore supervised, not zero-shot. For genuine zero-shot classification from DINO-style features, a separate text-alignment module (the dino.txt framework) attaches a text encoder to the frozen vision backbone; otherwise, use a contrastive VLM (see “CLIP, SigLIP, ALIGN, and Other Vision-Language Models” above).

Multimodal LLMs for Image Classification

Flamingo (Alayrac et al., 2022), BLIP-2 (Li et al., 2023), LLaVA (Liu et al., 2023), and GPT-4V (OpenAI, 2023) perform open-ended, VQA-style image understanding.

Strengths: compositional reasoning, attribute extraction, free-form labels, instruction following.

Costs: higher latency and inference price than a dual-encoder, and outputs that are generated text rather than a calibrated similarity score over a fixed label set. They are not evaluated on the contrastive zero-shot ImageNet protocol; treating them as drop-in CLIP replacements for high-throughput classification is a category error.

The right mental model: CLIP/SigLIP for fast, calibrated open-vocabulary classification; multimodal LLMs for reasoning and description.

How Zero-Shot Models Handle Distribution Shift

A signature CLIP result is effective robustness: at matched in-distribution accuracy, zero-shot CLIP degrades far less under natural distribution shift than supervised models, closing up to 75% of the robustness gap (Radford et al., 2021; Taori et al., 2020). CLIP ViT-L/14@336 scores 76.2% on ImageNet, 70.1% on ImageNet-V2 (Recht et al., 2019), 88.9% on ImageNet-R, 77.2% on ImageNet-A, and 60.2% on ImageNet-Sketch - a much flatter degradation profile than a supervised ResNet at equivalent in-distribution accuracy. This robustness is one of the strongest practical arguments for zero-shot deployment, and it is precisely what naive fine-tuning erodes and WiSE-FT preserves (Wortsman et al., 2022).
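The degradation profile quoted above can be read off with a little arithmetic. The sketch below uses only the CLIP ViT-L/14@336 accuracies stated in the text and computes each shifted benchmark's difference from the ImageNet score, plus the mean over the four shifted sets.

```python
imagenet = 76.2
shifted = {
    "ImageNet-V2": 70.1,
    "ImageNet-R": 88.9,
    "ImageNet-A": 77.2,
    "ImageNet-Sketch": 60.2,
}

# Percentage-point difference relative to in-distribution ImageNet accuracy.
deltas = {name: round(acc - imagenet, 1) for name, acc in shifted.items()}
mean_shifted = round(sum(shifted.values()) / len(shifted), 1)

print(deltas)
print(mean_shifted, round(mean_shifted - imagenet, 1))
```

On these figures the model loses 6.1 points on ImageNet-V2 and 16.0 on ImageNet-Sketch, gains 12.7 on ImageNet-R and 1.0 on ImageNet-A, and averages 74.1% across the four shifted sets, about 2.1 points below its ImageNet score.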

Zero-Shot Image Classification Use Cases

Zero-Shot Medical Imaging

The rigorous reference is CheXzero (Tiu et al., 2022, Nature Biomedical Engineering 6(12):1399–1406), a CLIP-style model trained on chest X-rays paired with their radiology reports and no explicit labels. It performs zero-shot multi-label pathology classification at a level statistically indistinguishable from board-certified radiologists on MCC and F1, with a mean AUC of 0.889 on CheXpert, only 0.042 below the best fully supervised model.

Domain-specific contrastive pretraining (e.g., BiomedCLIP-style models trained on biomedical figure–caption pairs) extends the recipe across modalities. Caveats are non-negotiable: distribution shift across hospitals and scanners is the dominant failure mode, and clinical deployment requires prospective validation and regulatory clearance (FDA 510(k)/De Novo, CE under EU MDR). Conventional CNN transfer-learning studies (including most COVID-19 chest X-ray work) are supervised, not zero-shot, and should not be cited as ZSL.

E-Commerce and Product Tagging

Open-vocabulary tagging lets catalogs absorb new products, seasonal items, and regional variants by writing text descriptions, with no labeling lag and no retraining cycle. Attribute prompts (“v-neck,” “floral print,” “leather strap”) compose naturally with category prompts. The size of the accuracy benefit depends entirely on the catalog, taxonomy, and baseline; it should be measured per deployment rather than quoted as a single industry-wide percentage.

Content Moderation

Prompt-driven moderation (“graphic violence,” “hate symbols”) flags candidates without new labels, but two engineering realities apply. First, throughput depends on model size and hardware; “real time” is a property of the deployment, not the model. Second, contrastive VLMs are vulnerable to typographic attacks: text rendered into an image can hijack the prediction (Goh et al., 2021). Production moderation, therefore, pairs zero-shot flagging with human review, adversarial testing, and threshold calibration.
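One way to approach the threshold calibration mentioned above is to pick, on a labeled validation set, the lowest score threshold whose flagged items still meet a target precision. This is a sketch under that assumption: the function name, the validation scores, and the labels are all hypothetical, and a real deployment would calibrate per policy category on far more data.

```python
def threshold_for_precision(scores, labels, target_precision):
    """Lowest threshold whose flagged set meets the target precision, or None.

    scores: similarity scores for a moderation prompt on validation images.
    labels: 1 if the image truly violates the policy, else 0.
    """
    best = None
    for threshold in sorted(set(scores), reverse=True):
        flagged = [y for s, y in zip(scores, labels) if s >= threshold]
        precision = sum(flagged) / len(flagged)
        if precision >= target_precision:
            best = threshold   # keep lowering while the target still holds
    return best


# Toy validation data (assumptions for illustration only).
val_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
val_labels = [1, 1, 0, 1, 0]

chosen = threshold_for_precision(val_scores, val_labels, target_precision=0.75)
print(chosen)
```

Items scoring at or above the chosen threshold would go to human review rather than being actioned automatically.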

Wildlife and Biodiversity Monitoring

Fine-grained species recognition is the historical home of attribute ZSL (CUB-200-2011; Wah et al., 2011) and remains a strong fit: descriptive prompts encoding plumage, pattern, and morphology let contrastive models triage camera-trap and drone imagery where labeled data is scarce, with expert verification on the long tail.

Autonomous Perception and Open-Vocabulary Detection

The correct framing for driving scenes is open-vocabulary detection, not whole-image classification: OWL-ViT (Minderer et al., 2022), Grounding DINO (Liu et al., 2023b), and YOLO-World (Cheng et al., 2024) localize and label novel object categories from text queries, which is what “recognizing an object the system has never seen” actually requires on the road. Safety-critical use still demands extensive validation; zero-shot output feeds perception research and data mining more than it feeds the runtime control loop.

Data Labeling and Annotation Workflows

Zero-shot models pre-label raw datasets to bootstrap annotation: a CLIP/SigLIP pass proposes labels, annotators verify and correct, and active learning prioritizes the uncertain tail. Combined with generative feature synthesis for rare classes (see “Classic Zero-Shot Learning Methods”), this shortens prototyping cycles substantially, with human verification as the quality gate.
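A pre-labeling pass of this kind might be sketched as follows. The snippet turns per-class cosine scores into probabilities with a temperature-scaled softmax, proposes the top label, and routes each image to a lighter or heavier human-review queue; low-margin items are reviewed first, as a simple stand-in for active-learning prioritization. The temperature, threshold, queue names, and scores are illustrative assumptions.

```python
import math


def softmax(scores, temperature):
    """Temperature-scaled softmax over a {label: score} mapping."""
    top = max(scores.values())
    exps = {k: math.exp((v - top) / temperature) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: e / total for k, e in exps.items()}


def route(cosine_scores, temperature=0.01, accept_threshold=0.9):
    """Propose a label and choose a review queue (needs at least two labels)."""
    probs = softmax(cosine_scores, temperature)
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    (label, top), (_, runner_up) = ranked[0], ranked[1]
    queue = "spot_check" if top >= accept_threshold else "full_review"
    return {"label": label, "confidence": top, "margin": top - runner_up, "queue": queue}


# Toy cosine scores per image (assumptions for illustration only).
items = {
    "img_001": {"cat": 0.30, "dog": 0.20},
    "img_002": {"cat": 0.250, "dog": 0.245},
}
routed = {name: route(scores) for name, scores in items.items()}
# Smallest margin first, so annotators see the most uncertain images earliest.
review_order = sorted(routed, key=lambda name: routed[name]["margin"])
print(routed, review_order)
```

Both queues end with a human decision; the routing only changes how much annotator attention each image receives.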

How Azumo Helps Build Zero-Shot Image Classification Systems

Azumo builds production zero-shot and multimodal vision systems across classification, detection, visual search, moderation, and annotation tooling. Typical engagements select the architecture by requirement: contrastive VLMs (CLIP/SigLIP) for calibrated open-vocabulary classification, DINOv3 frozen features plus trained heads for dense and transfer tasks, and multimodal LLMs (LLaVA/GPT-4V) for reasoning and description. The chosen model is then adapted with prompt tuning (CoOp/CoCoOp), adapters, or LoRA, paired with CVAT-based pre-labeling pipelines, and deployed with quantization, embedding caches, vector search, and drift monitoring.

References

Alayrac, J.-B., et al. (2022). Flamingo: A Visual Language Model for Few-Shot Learning. NeurIPS 2022. arXiv:2204.14198.

Caron, M., et al. (2021). Emerging Properties in Self-Supervised Vision Transformers (DINO). ICCV 2021. arXiv:2104.14294.

Chao, W.-L., Changpinyo, S., Gong, B., & Sha, F. (2016). An Empirical Study and Analysis of Generalized Zero-Shot Learning for Object Recognition in the Wild. ECCV 2016.

Cheng, T., et al. (2024). YOLO-World: Real-Time Open-Vocabulary Object Detection. CVPR 2024. arXiv:2401.17270.

Cherti, M., et al. (2023). Reproducible Scaling Laws for Contrastive Language-Image Learning (OpenCLIP). CVPR 2023. arXiv:2212.07143.

Dosovitskiy, A., et al. (2021). An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale (ViT). ICLR 2021. arXiv:2010.11929.

Farhadi, A., Endres, I., Hoiem, D., & Forsyth, D. (2009). Describing Objects by their Attributes. CVPR 2009.

Frome, A., et al. (2013). DeViSE: A Deep Visual-Semantic Embedding Model. NeurIPS 2013.

Gadre, S. Y., et al. (2023). DataComp: In Search of the Next Generation of Multimodal Datasets. NeurIPS 2023 Datasets & Benchmarks. arXiv:2304.14108.

Gao, P., et al. (2024). CLIP-Adapter: Better Vision-Language Models with Feature Adapters. IJCV. arXiv:2110.04544.

Goh, G., et al. (2021). Multimodal Neurons in Artificial Neural Networks. Distill, 6(3), e30.

Grand View Research. (2025). Computer Vision Market Size, Share & Trends Report, 2025–2030. Industry report.

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition (ResNet). CVPR 2016. arXiv:1512.03385.

He, K., et al. (2022). Masked Autoencoders Are Scalable Vision Learners (MAE). CVPR 2022. arXiv:2111.06377.

Hu, E. J., et al. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022. arXiv:2106.09685.

Jia, C., et al. (2021). Scaling Up Visual and Vision-Language Representation Learning with Noisy Text Supervision (ALIGN). ICML 2021. arXiv:2102.05918.

Lampert, C. H., Nickisch, H., & Harmeling, S. (2009). Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer. CVPR 2009.

Lampert, C. H., Nickisch, H., & Harmeling, S. (2014). Attribute-Based Classification for Zero-Shot Visual Object Categorization. IEEE TPAMI, 36(3), 453–465.

Li, J., Li, D., Savarese, S., & Hoi, S. (2023). BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. ICML 2023. arXiv:2301.12597.

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual Instruction Tuning (LLaVA). NeurIPS 2023. arXiv:2304.08485.

Liu, S., et al. (2023b). Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection. arXiv:2303.05499.

Menon, S., & Vondrick, C. (2023). Visual Classification via Description from Large Language Models. ICLR 2023. arXiv:2210.07183.

Minderer, M., et al. (2022). Simple Open-Vocabulary Object Detection with Vision Transformers (OWL-ViT). ECCV 2022. arXiv:2205.06230.

Norouzi, M., et al. (2014). Zero-Shot Learning by Convex Combination of Semantic Embeddings (ConSE). ICLR 2014. arXiv:1312.5650.

OpenAI. (2023). GPT-4V(ision) System Card.

Oquab, M., et al. (2024). DINOv2: Learning Robust Visual Features without Supervision. Transactions on Machine Learning Research. arXiv:2304.07193.

Pham, H., et al. (2021). Combined Scaling for Zero-Shot Transfer Learning (BASIC). arXiv:2111.10050.

Pratt, S., Covert, I., Liu, R., & Farhadi, A. (2023). What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification (CuPL). ICCV 2023. arXiv:2209.03320.

Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). ICML 2021. arXiv:2103.00020.

Radovanović, M., Nanopoulos, A., & Ivanović, M. (2010). Hubs in Space: Popular Nearest Neighbors in High-Dimensional Data. JMLR, 11, 2487–2531.

Recht, B., Roelofs, R., Schmidt, L., & Shankar, V. (2019). Do ImageNet Classifiers Generalize to ImageNet? (ImageNet-V2). ICML 2019.

Schuhmann, C., et al. (2022). LAION-5B: An Open Large-Scale Dataset for Training Next Generation Image-Text Models. NeurIPS 2022 Datasets & Benchmarks. arXiv:2210.08402.

Shigeto, Y., Suzuki, I., Hara, K., Shimbo, M., & Matsumoto, Y. (2015). Ridge Regression, Hubness, and Zero-Shot Learning. ECML PKDD 2015.

Siméoni, O., et al. (2025). DINOv3. arXiv:2508.10104.

Socher, R., Ganjoo, M., Manning, C. D., & Ng, A. Y. (2013). Zero-Shot Learning Through Cross-Modal Transfer. NeurIPS 2013.

Sun, Q., et al. (2024). EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters. arXiv:2402.04252.

Taori, R., et al. (2020). Measuring Robustness to Natural Distribution Shifts in Image Classification. NeurIPS 2020.

Tiu, E., Talius, E., Patel, P., Langlotz, C. P., Ng, A. Y., & Rajpurkar, P. (2022). Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning (CheXzero). Nature Biomedical Engineering, 6(12), 1399–1406.

Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.

Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 Dataset. Caltech Technical Report CNS-TR-2011-001.

Wortsman, M., et al. (2022). Robust Fine-Tuning of Zero-Shot Models (WiSE-FT). CVPR 2022. arXiv:2109.01903.

Xian, Y., Lampert, C. H., Schiele, B., & Akata, Z. (2018a). Zero-Shot Learning — A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE TPAMI, 41(9), 2251–2265.

Xian, Y., Lorenz, T., Schiele, B., & Akata, Z. (2018b). Feature Generating Networks for Zero-Shot Learning (f-CLSWGAN). CVPR 2018.

Xian, Y., Sharma, S., Schiele, B., & Akata, Z. (2019). f-VAEGAN-D2: A Feature Generating Framework for Any-Shot Learning. CVPR 2019.

Yu, J., et al. (2022). CoCa: Contrastive Captioners are Image-Text Foundation Models. Transactions on Machine Learning Research. arXiv:2205.01917.

Yuan, L., et al. (2021). Florence: A New Foundation Model for Computer Vision. arXiv:2111.11432.

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., & Beyer, L. (2022). LiT: Zero-Shot Transfer with Locked-image Text Tuning. CVPR 2022. arXiv:2111.07991.

Zhai, X., Mustafa, B., Kolesnikov, A., & Beyer, L. (2023). Sigmoid Loss for Language Image Pre-Training (SigLIP). ICCV 2023. arXiv:2303.15343.

Zhang, R., et al. (2022). Tip-Adapter: Training-free Adaption of CLIP for Few-Shot Classification. ECCV 2022. arXiv:2207.09519.

Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022a). Learning to Prompt for Vision-Language Models (CoOp). IJCV, 130(9), 2337–2348.

Zhou, K., Yang, J., Loy, C. C., & Liu, Z. (2022b). Conditional Prompt Learning for Vision-Language Models (CoCoOp). CVPR 2022.

Authored as a technical reference for ML engineers and researchers. Benchmark numbers are drawn from the cited primary papers; all cited papers were verified at arXiv, the CVF Open Access archive, Nature, JMLR, and venue proceedings prior to publication.

Frequently Asked Questions

Q: What is the difference between computer vision and image processing?
Image processing transforms an image into another image (denoising, color correction, super-resolution). Computer vision extracts symbolic or structured information (labels, boxes, masks, embeddings, 3D structure) from images. Modern CV pipelines almost always include some image processing as preprocessing.
Q: Do I need a Vision Transformer, or is a CNN still fine?
Both are competitive. For ImageNet-scale supervised training, ConvNeXt and Swin are within 1–2% of each other. For self-supervised pretraining at scale (DINOv2, MAE), ViTs currently dominate. For mobile/edge with strict latency budgets, MobileNetV3 and MobileViT are typically still the right choice.
Q: How much labeled data do I need?
For a transfer-learning fine-tune from a strong pretrained backbone, useful prototypes can be built with 50–500 labeled examples per class. Production systems usually need 1k–10k+ examples per class with diverse coverage of operating conditions. Self-supervised and foundation-model approaches (DINOv2, SAM, CLIP) reduce this requirement substantially.
Q: Is YOLO still the best detector?
“Best” is no longer well-defined. Modern YOLOs (YOLOv8/v9/v10, YOLO11) and RT-DETR are all competitive; on COCO they cluster around 53–55% AP at 70–120 FPS on a T4. Choose RT-DETR if NMS-free end-to-end inference matters; choose modern YOLO if you want a smaller dependency footprint and well-trodden fine-tuning recipes.
Q: Why does “regularization” mean more than dropout?
Dropout is one regularization technique. Others include L2/weight decay, early stopping, label smoothing, BatchNorm, stochastic depth, Mixup, CutMix, and data augmentation. Modern recipes typically combine several; dropout is sometimes omitted entirely (e.g., in standard ResNets).
Q: Can LiDAR or radar replace cameras?
No. LiDAR and radar provide depth, 3D geometry, and velocity, but they cannot read text or fine-grained semantics. Reading a stop sign or a speed-limit number is a camera-and-CV task. Production AVs fuse all three modalities.
Q: What’s the right tool for tracking ML experiments?
PyTorch and TensorFlow are training frameworks, not trackers. Use MLflow, Weights & Biases, Neptune, or Comet for experiment tracking, dataset versioning, and artifact lineage.
Q: When should I use a foundation model vs. training from scratch?
Almost always start from a pretrained foundation model in 2026 (DINOv2 for general features, CLIP for text-aligned features, SAM/SAM 2 for segmentation, a strong ImageNet checkpoint for legacy compatibility). Training from scratch is justified only for very large novel domains (satellite, medical 3D) where pretrained features transfer poorly, and even then, domain-adaptive pretraining from a foundation model is usually better than scratch.