

Meta: DINOv3 is self-supervised learning for vision at unprecedented scale


Below is the original post from Meta's website; enjoy.
Having read it, my (Tan's) one-line take: the performance is indeed excellent, but the Apache license has been replaced with a commercial license. In other words, what could previously be used, modified, and even commercialized for free under the Apache license is now subject to payment or additional restrictions; anyone who wants to keep using it has to follow the new terms.

Open Source

DINOv3: Self-supervised learning for vision at unprecedented scale

August 14, 2025

Takeaways:

  • We’re introducing DINOv3, which scales self-supervised learning for images to create universal vision backbones that achieve absolute state-of-the-art performance across diverse domains, including web and satellite imagery.
  • DINOv3 backbones produce powerful, high-resolution image features that make it easy to train lightweight adapters. This leads to exceptional performance on a broad array of downstream vision tasks, including image classification, semantic segmentation, and object tracking in video.
  • We’ve incorporated valuable community feedback, enhancing the versatility of DINOv3 by shipping smaller models that outperform comparable CLIP-based derivatives across a broad evaluation suite, as well as alternative ConvNeXt architectures for resource-constrained use cases.
  • We’re releasing the DINOv3 training code and pre-trained backbones under a commercial license to help drive innovation and advancements in the computer vision and multimodal ecosystem.

Self-supervised learning (SSL) —the concept that AI models can learn independently without human supervision—has emerged as the dominant paradigm in modern machine learning. It has driven the rise of large language models that acquire universal representations by pre-training on massive text corpora. However, progress in computer vision has lagged behind, as the most powerful image encoding models still rely heavily on human-generated metadata, such as web captions, for training.
自監(jiān)督學習(SSL)-AI 模型可以在沒有人類監(jiān)督的情況下獨立學習的概念-已成為現代機器學習的主導范式。它推動了大型語言模型的興起,這些模型通過在大量文本語料庫上進行預訓練來獲得通用表示。然而,計算機視覺的進展卻落后了,因為最強大的圖像編碼模型仍然嚴重依賴于人類生成的元數據,例如網絡標題。

Today, we’re releasing DINOv3, a generalist, state-of-the-art computer vision model trained with SSL that produces superior high-resolution visual features. For the first time, a single frozen vision backbone outperforms specialized solutions on multiple long-standing dense prediction tasks including object detection and semantic segmentation.
今天,我們發(fā)布了 DINOv3,這是一個通用的、最先進的計算機視覺模型,使用 SSL 進行訓練,可以產生上級高分辨率的視覺特征。這是第一次,單一的凍結視覺骨干在多個長期存在的密集預測任務(包括對象檢測和語義分割)上的表現優(yōu)于專業(yè)解決方案。

DINOv3’s breakthrough performance is driven by innovative SSL techniques that eliminate the need for labeled data—drastically reducing the time and resources required for training and enabling us to scale training data to 1.7B images and model size to 7B parameters. This label-free approach enables applications where annotations are scarce, costly, or impossible.

For example, our research shows that DINOv3 backbones pre-trained on satellite imagery achieve exceptional performance on downstream tasks such as canopy height estimation.

We believe DINOv3 will help accelerate existing use cases and also unlock new ones, leading to advancements in industries such as healthcare, environmental monitoring, autonomous vehicles, retail, and manufacturing—enabling more accurate and efficient visual understanding at scale.

We’re releasing DINOv3 with a comprehensive suite of open sourced backbones under a commercial license, including a satellite backbone trained on MAXAR imagery. We’re also sharing a subset of our downstream evaluation heads, enabling the community to reproduce our results and build upon them. Additionally, we’re providing sample notebooks so the community has detailed documentation to help them start building with DINOv3 today.

Unlocking high-impact applications with self-supervised learning

DINOv3 achieves a new milestone by demonstrating, for the first time, that SSL models can outperform their weakly supervised counterparts across a wide range of tasks.

While previous DINO models set a significant lead in dense prediction tasks, such as segmentation and monocular depth estimation, DINOv3 surpasses these accomplishments.

Our models match or exceed the performance of the strongest recent models such as SigLIP 2 and Perception Encoder on many image classification benchmarks, and at the same time, they drastically widen the performance gap for dense prediction tasks.



DINOv3 builds on the breakthrough DINO algorithm, requiring no metadata input, consuming only a fraction of the training compute compared to prior methods, and still delivering exceptionally strong vision foundation models.

The novel refinements introduced in DINOv3 lead to state-of-the-art performance on competitive downstream tasks such as object detection under the severe constraint of frozen weights. This eliminates the need for researchers and developers to fine-tune the model for specific tasks, enabling broader and more efficient application.

Finally, because the DINO approach is not specifically tailored to any image modality, the same algorithm can be applied beyond web imagery to other domains where labeling is prohibitively difficult or expensive. DINOv2 already leverages vast amounts of unlabeled data to support diagnostic and research efforts in histology, endoscopy, and medical imaging. In satellite and aerial imagery, the overwhelming volume and complexity of data make manual labeling impractical.

With DINOv3, we make it possible for these rich datasets to be used to train a single backbone that can then be used across satellite types, enabling general applications in environmental monitoring, urban planning, and disaster response.

DINOv3 is already having real-world impact.

The World Resources Institute (WRI) is using our latest model to monitor deforestation and support restoration, helping local groups protect vulnerable ecosystems. WRI uses DINOv3 to analyze satellite images and detect tree loss and land-use changes in affected ecosystems. The accuracy gains from DINOv3 support automating climate finance payments by verifying restoration outcomes, reducing transaction costs, and accelerating funding to small, local groups.

For example, compared to DINOv2, DINOv3 trained on satellite and aerial imagery reduces the average error in measuring tree canopy height in a region of Kenya from 4.1 meters to 1.2 meters. WRI is now able to scale support for thousands of farmers and conservation projects more efficiently.



Scalable and efficient visual modeling without fine-tuning

We built DINOv3 by training a 7x larger model on a 12x larger dataset than its predecessor, DINOv2. To showcase the model’s versatility, we evaluate it across 15 diverse visual tasks and more than 60 benchmarks. The DINOv3 backbone particularly shines on all dense prediction tasks, showing an exceptional understanding of the scene layout and underlying physics.

The rich, dense features capture measurable attributes or characteristics of each pixel in an image and are represented as vectors of floating-point numbers. These features are capable of parsing objects into finer parts, even generalizing across instances and categories. This dense representation power makes it easy to train lightweight adapters with minimal annotations on top of DINOv3, meaning a few annotations and a linear model are sufficient to obtain robust dense predictions.

Pushing things further and using a more sophisticated decoder, we show that it’s possible to achieve state-of-the-art performance on long-standing core computer vision tasks without fine-tuning the backbone.

We show such results on object detection, semantic segmentation, and relative depth estimation.
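This recipe can be sketched in a few lines. The following is a minimal, illustrative example of fitting a linear head on top of frozen per-patch features; the "backbone" here is a stand-in random projection rather than the released DINOv3 model, and all dimensions and names are assumptions for the sake of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen backbone: maps an image (given as patch vectors)
# to a grid of patch features via a fixed projection. In practice this
# would be a forward pass through the released DINOv3 model.
D_IN, D_FEAT, N_PATCH, N_CLASS = 768, 64, 196, 3
W_frozen = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

def backbone(images):
    # images: (batch, patches, D_IN) -> frozen features (batch, patches, D_FEAT)
    return images @ W_frozen

# A handful of "annotated" images with per-patch class labels,
# mimicking a low-annotation dense prediction setup.
X = rng.normal(size=(8, N_PATCH, D_IN))
y = rng.integers(0, N_CLASS, size=(8, N_PATCH))

# Train only a linear head on the frozen features (least squares against
# one-hot targets) -- the "lightweight adapter" described above.
F = backbone(X).reshape(-1, D_FEAT)
Y = np.eye(N_CLASS)[y.reshape(-1)]
head, *_ = np.linalg.lstsq(F, Y, rcond=None)

# Dense prediction: one class per patch, backbone weights untouched.
preds = (backbone(X) @ head).argmax(-1)
print(preds.shape)  # (8, 196)
```

The point of the sketch is the division of labor: all learning happens in `head`, while the backbone stays frozen.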

Because state-of-the-art results can be achieved without fine-tuning the backbone, a single forward pass can serve multiple applications simultaneously.

This enables the inference cost of the backbone to be shared across tasks, which is especially critical for edge applications that often require running many predictions at once.

DINOv3’s versatility and efficiency make it the perfect candidate for such deployment scenarios, as demonstrated by NASA’s Jet Propulsion Laboratory (JPL), which is already using DINOv2 to build exploration robots for Mars, enabling multiple vision tasks with minimal compute.
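The amortization argument above can be made concrete with a small sketch: one forward pass through a frozen backbone, whose features are then reused by several lightweight task heads. The backbone is again a stand-in random projection, and the head shapes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "backbone" stand-in (not the released model): one projection
# whose cost is paid once per image.
D_IN, D_FEAT = 768, 64
W_frozen = rng.normal(size=(D_IN, D_FEAT)) / np.sqrt(D_IN)

def backbone(x):
    return x @ W_frozen

# Lightweight per-task heads (illustrative output sizes).
heads = {
    "classification": rng.normal(size=(D_FEAT, 10)),  # 10 classes
    "depth":          rng.normal(size=(D_FEAT, 1)),   # scalar per patch
    "segmentation":   rng.normal(size=(D_FEAT, 21)),  # 21 classes
}

image_patches = rng.normal(size=(196, D_IN))

features = backbone(image_patches)  # computed once, shared by all tasks
outputs = {task: features @ W for task, W in heads.items()}

for task, out in outputs.items():
    print(task, out.shape)
```

Adding a new task only adds a cheap head on top of the cached `features`; the expensive backbone pass is not repeated.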

A family of deployment-friendly models

Scaling DINOv3 to 7B parameters shows SSL’s full potential. However, a 7B model is impractical for many downstream applications. Following feedback from the community, we built a family of models spanning a large range of inference compute requirements to empower researchers and developers across diverse use cases.

By distilling the ViT-7B model into smaller, high-performing variants like ViT-B and ViT-L, DINOv3 outperforms comparable CLIP-based models across a broad evaluation suite.

Additionally, we introduce alternative ConvNeXt architectures (T, S, B, L) distilled from ViT-7B, that can accommodate varying compute constraints. We’re also releasing our distillation pipeline to enable the community to build upon this foundation.
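As a toy illustration of the distillation idea, the snippet below trains a small "student" projection to reproduce a larger frozen "teacher's" embeddings on unlabeled data. The plain MSE objective, the dimensions, and the gradient-descent loop are all illustrative simplifications; the released pipeline is considerably more involved:

```python
import numpy as np

rng = np.random.default_rng(2)

# Frozen teacher and randomly initialized student (toy linear models).
D_IN, D_OUT = 128, 32
W_teacher = rng.normal(size=(D_IN, D_OUT)) / np.sqrt(D_IN)  # frozen
W_student = rng.normal(size=(D_IN, D_OUT)) * 0.01           # trainable

# Unlabeled data: distillation needs no annotations, only the
# teacher's outputs as regression targets.
X = rng.normal(size=(512, D_IN))
target = X @ W_teacher

initial_mse = np.mean((X @ W_student - target) ** 2)

# Gradient descent on the feature-matching MSE.
lr = 0.1
for step in range(300):
    pred = X @ W_student
    grad = X.T @ (pred - target) / len(X)  # gradient of the squared error
    W_student -= lr * grad

final_mse = np.mean((X @ W_student - target) ** 2)
print(initial_mse, final_mse)
```

The student ends up mimicking the teacher's feature map at a fraction of the (notional) capacity, which is the property that makes the distilled ViT and ConvNeXt variants useful under tight compute budgets.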

Note: The commentary above is the uploader's own and is provided for reference only.


Notice: The content above (including the pictures and videos if any) is uploaded and posted by a user of NetEase Hao, which is a social media platform and only provides information storage services.
