Technique in neural networks for learning joint representations of text and images

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective.

Developer(s): OpenAI
Initial release: January 5, 2021
Repository: https://github.com/OpenAI/CLIP
Written in: Python
License: MIT License
Website: openai.com/research/clip
Publication history

CLIP was first announced on OpenAI's official blog on January 5, 2021, with a report served directly through OpenAI's CDN and a GitHub repository. The paper was delivered on arXiv on 26 February 2021.

The report (with some details removed, and its appendix cut out to a "Supplementary PDF") was published in the Proceedings of the 38th International Conference on Machine Learning (PMLR), which had a submission deadline of February 2021.

Concurrent to CLIP was ALIGN, published at the same conference by researchers at Google, using essentially the same algorithm.
Algorithm

[Figure: Architecture overview of CLIP.]

The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart.

To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of N image-caption pairs. Let the outputs from the text and image models be respectively v_1, ..., v_N and w_1, ..., w_N. Two vectors are considered "similar" if their dot product is large.

The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over the similarity scores:

    -\frac{1}{N}\sum_i \ln\frac{e^{v_i \cdot w_i / T}}{\sum_j e^{v_i \cdot w_j / T}} \; - \; \frac{1}{N}\sum_j \ln\frac{e^{v_j \cdot w_j / T}}{\sum_i e^{v_i \cdot w_j / T}}

In essence, this loss function encourages the dot product between matching image and text vectors (v_i \cdot w_i) to be high, while discouraging high dot products between non-matching pairs. The parameter T > 0 is the temperature, which is parameterized in the original CLIP model as T = e^{-\tau}, where \tau \in \mathbb{R} is a learned parameter.
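The following is a minimal PyTorch sketch of this symmetric loss, written for illustration rather than taken from the original report; v and w are assumed to be the batch of text and image embeddings, and tau the learned log-temperature (CLIP compares unit-normalized embeddings, so the sketch normalizes both inputs).

    import torch
    import torch.nn.functional as F

    def clip_loss(v: torch.Tensor, w: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        """Symmetric multi-class N-pair loss on text embeddings v and image
        embeddings w, both of shape (N, d), mirroring the displayed formula."""
        v = F.normalize(v, dim=-1)           # unit-normalize so dot product = cosine similarity
        w = F.normalize(w, dim=-1)
        logits = (v @ w.T) * torch.exp(tau)  # similarity matrix scaled by 1/T = e^tau
        labels = torch.arange(v.shape[0], device=v.device)
        # cross-entropy over rows (text -> image) plus over columns (image -> text)
        return F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)

Because tau is learned, the model can sharpen or soften the softmax over the batch as training progresses.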
Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function:

    L = \frac{1}{N}\sum_{i,j \in 1:N} f\!\left((2\delta_{i,j}-1)\left(e^{\tau} w_i \cdot v_j + b\right)\right)

where f(x) = \ln(1 + e^{-x}) is the negative log sigmoid loss.
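A minimal PyTorch sketch of this pairwise sigmoid loss, again as an illustration rather than the reference implementation; b is the learned bias from the formula above.

    import torch
    import torch.nn.functional as F

    def siglip_loss(v, w, tau, b):
        """Pairwise sigmoid loss over all (image, text) pairs in the batch.
        Matching pairs (i == j) get sign +1, all other pairs get sign -1."""
        n = v.shape[0]
        logits = torch.exp(tau) * (w @ v.T) + b        # e^tau * (w_i . v_j) + b
        signs = 2 * torch.eye(n, device=v.device) - 1  # +1 on the diagonal, -1 elsewhere
        # f(x) = ln(1 + e^{-x}) = softplus(-x)
        return F.softplus(-signs * logits).sum() / n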
CLIP models

While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well.

Models released by OpenAI:

Model name      Resolution  Parameters (total, millions)  Parameters (vision)  Parameters (text)  Embedding dimension  Size (MB)  Release date
RN50            224         102                           38.3                 63.1               1024                 244        2021-01
RN101           224         120                           56.3                 63.1               512                  278        2021-03
RN50x4          288         178                           87.1                 90.7               640                  402        2021-03
RN50x16         384         291                           167.3                123.0              768                  630        2021-07
RN50x64         448         623                           420.4                201.8              1024                 1260       2022-01
ViT-B/32        224         151                           87.8                 63.1               512                  338        2021-01
ViT-B/16        224         150                           86.2                 63.1               512                  335        2021-07
ViT-L/14        224         428                           304.0                123.0              768                  890        2022-01
ViT-L/14@336px  336         428                           304.3                123.0              768                  891        2022-04
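The released checkpoints can also be inspected programmatically. The sketch below is an illustration using the openai/CLIP Python package (installable with pip install git+https://github.com/openai/CLIP.git) and assumes a working PyTorch install; each clip.load call downloads the corresponding checkpoint on first use.

    import clip   # the openai/CLIP package
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for name in clip.available_models():          # e.g. ['RN50', ..., 'ViT-L/14@336px']
        model, preprocess = clip.load(name, device=device)
        n_vision = sum(p.numel() for p in model.visual.parameters())
        n_total = sum(p.numel() for p in model.parameters())
        print(f"{name}: resolution {model.visual.input_resolution}, "
              f"context length {model.context_length}, vocab size {model.vocab_size}, "
              f"vision params {n_vision / 1e6:.1f}M, "
              f"other params {(n_total - n_vision) / 1e6:.1f}M")
        del model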
Image model

The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer; at the native 224x224 resolution this gives (224/14) x (224/14) = 256 patches. The size indicator ranges over B, L, H and G (base, large, huge and giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI) or ConvNeXt (in the OpenCLIP model series).

Since the output vectors of the image model and the text model are compared by dot product, they must have exactly the same length. Both models therefore produce fixed-length vector outputs, whose length is called the "embedding dimension" in the original report. For example, in the original OpenAI models, the ResNets have embedding dimensions ranging from 512 to 1024, and the ViTs from 512 to 768.

CLIP's implementation of ViT was the same as the original Vision Transformer architecture, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. The output vector at the <CLS> token is used as the image encoding for CLIP.

Its implementation of ResNet was the same as the original one, with 3 modifications:
- At the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by an earlier report on training tricks for CNNs.
- There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling, following the terminology of Zhang (2019)). This has the effect of blurring images before downsampling, for antialiasing.
- The final convolutional layer is followed by a multiheaded attention pooling.

ALIGN, by contrast, used EfficientNet of various sizes, a kind of convolutional neural network.
Text model

The text encoding models used in CLIP are typically Transformers.

In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, with 8 attention heads) operating on lower-cased byte pair encoding (BPE) with a vocabulary size of 49152. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention; its architecture is the same as GPT-2. The Transformer used in the CLIP text encoder was made by removing the cross-attention module from the original Transformer decoder layer, then stacking the resulting module 12 times.

Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). Take the activations of the highest layer of the transformer at the [EOS] token, apply LayerNorm, then a final linear map; this is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408.

ALIGN, by contrast, used BERT of various sizes.
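As an illustration (assuming the openai/CLIP package from the sketch above), text is tokenized to the fixed context length of 77 and encoded into the shared embedding space:

    import clip
    import torch

    device = "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    tokens = clip.tokenize(["a photo of a dog", "a diagram of the CLIP architecture"])
    print(tokens.shape)            # torch.Size([2, 77]) -- [SOS] ... [EOS], padded to 77

    with torch.no_grad():
        text_features = model.encode_text(tokens.to(device))
    print(text_features.shape)     # torch.Size([2, 512]) for ViT-B/32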
Dataset

WebImageText

The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data.

The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets.

The dataset is private and has not been released to the public, and there is no further information on it.

Others

ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied frequency-based filtering.

Later models trained by other organizations had published datasets. For example, LAION trained OpenCLIP with the published datasets LAION-400M, LAION-2B, and DataComp-1B.
Data preprocessing

For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting the per-channel means (0.48145466, 0.4578275, 0.40821073) and dividing by the per-channel standard deviations (0.26862954, 0.26130258, 0.27577711). The rationale was that these are the means and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225).

If the input image does not have the same resolution as the native resolution (224x224 for all models except ViT-L/14@336px, which has 336x336 resolution), then the input image is scaled down by bicubic interpolation so that its shorter side matches the native resolution, and the central square of the image is then cropped out.
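A sketch of an equivalent preprocessing pipeline in torchvision, for illustration (the openai/CLIP package returns essentially this transform from clip.load as preprocess):

    from torchvision import transforms
    from torchvision.transforms import InterpolationMode

    def clip_preprocess(native_resolution: int = 224) -> transforms.Compose:
        """Resize (bicubic, shorter side = native resolution), center-crop the
        central square, scale to [0, 1], then normalize with the WIT statistics."""
        return transforms.Compose([
            transforms.Resize(native_resolution, interpolation=InterpolationMode.BICUBIC),
            transforms.CenterCrop(native_resolution),
            transforms.ToTensor(),   # PIL image -> float tensor in [0, 1]
            transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                                 std=(0.26862954, 0.26130258, 0.27577711)),
        ])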
Training

In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs, and the largest ViT model took 12 days on 256 V100 GPUs.

All ViT models were trained at 224x224 image resolution. The ViT-L/14 was then boosted to 336x336 resolution by FixRes, resulting in a further model, ViT-L/14@336px. They found this was the best-performing model.

In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs, for a total of 32B samples seen.
Applications

CLIP has found wide applications in various domains.

A trained image encoder of a CLIP pair can be used as a pre-trained image featurizer. Its output can then be fed into other AI models.

For text-to-image generation, Stable Diffusion uses the text encoder of CLIP ViT-L/14 to transform text prompts into an embedding space, as these embeddings provide detailed semantic information. CLIP can also be used as a gradient signal for directly guiding diffusion ("CLIP guidance") or other generative art.

A finetuned model based on CLIP can be used to rank images by their aesthetic quality, which can be used for improving dataset quality.

CLIP can retrieve images based on textual descriptions. This is possible even if the images were not explicitly tagged with those keywords.

Given an image, CLIP can generate captions. This is done by finding the text input that maximizes the similarity score with the image embedding.

Image classification

CLIP can perform zero-shot image classification tasks, i.e. without any explicit training on those specific classes. This is achieved by prompting the text encoder with class names and selecting the class whose embedding is closest to the image embedding. For example, to classify an image, the authors compared the embedding of the image with the embedding of the text "A photo of a {class}.", and the {class} that results in the highest dot product is outputted.

The authors found that zero-shot ImageNet classification would fail in cases of polysemy, requiring a moderate amount of prompt engineering. For example, in ImageNet the class "kite" refers to a kind of bird, not a toy, so the authors changed "kite" to "kite (bird of prey)".
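A minimal sketch of this zero-shot procedure with the openai/CLIP package; the class names, prompt template and image path are illustrative, not taken from the report.

    import clip
    import torch
    from PIL import Image

    device = "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    classes = ["cat", "dog", "kite (bird of prey)"]   # illustrative class names
    text = clip.tokenize([f"A photo of a {c}." for c in classes]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # cosine similarity = dot product of unit-normalized embeddings
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(0)

    print(classes[scores.argmax().item()])            # predicted class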
Multimodality

CLIP has been used as a component in multimodal learning. For example, during the training of Google DeepMind's Flamingo (2022), the authors trained a CLIP pair, with BERT as the text encoder and NormalizerFree ResNet F6 as the image encoder. The image encoder of the CLIP pair was taken with parameters frozen and the text encoder was discarded. The frozen image encoder was then combined with a frozen Chinchilla language model, by finetuning some further parameters that connect the two frozen models.

Notes

- The "embedding dimension" here is analogous to the embedding dimension of text embeddings in Transformer models.
- The report referred to this model as both ViT-L/14-336px and ViT-L/14@336px, inconsistently.
- The WebImageText dataset is not the same as the Wikipedia-based Image Text dataset, which is also abbreviated "WIT".
References

Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021). "Learning Transferable Visual Models From Natural Language Supervision". Proceedings of the 38th International Conference on Machine Learning. PMLR. pp. 8748-8763. arXiv:2103.00020.
"CLIP: Connecting text and images". OpenAI. January 5, 2021.
"initial commit · openai/CLIP@b1c4b6b". GitHub. 5 January 2021. Archived from the original on 9 Feb 2021. Retrieved 2024-09-06.
Original report (archived): https://web.archive.org/web/20210105204011/https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf
"ICML 2021 Call for Papers". icml.cc. Retrieved 2024-09-06.
"openai/CLIP". OpenAI. Retrieved 2024-09-06.
Jia, Chao; Yang, Yinfei; Xia, Ye; Chen, Yi-Ting; Parekh, Zarana; Pham, Hieu; Le, Quoc; Sung, Yun-Hsuan; Li, Zhen; Duerig, Tom (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Proceedings of the 38th International Conference on Machine Learning. PMLR: 4904-4916.
Sohn, Kihyuk (2016). "Improved Deep Metric Learning with Multi-class N-pair Loss Objective". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas (2023). "Sigmoid Loss for Language Image Pre-Training". IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11975-11986.
Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11976-11986.
Ilharco, Gabriel; Wortsman, Mitchell; Wightman, Ross; Gordon, Cade; Carlini, Nicholas; Taori, Rohan; Dave, Achal; Shankar, Vaishaal; Namkoong, Hongseok (July 2021). OpenCLIP. doi:10.5281/zenodo.5143773. Retrieved 2024-09-06.
Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929.
He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Deep Residual Learning for Image Recognition". arXiv:1512.03385.
He, Tong; Zhang, Zhi; Zhang, Hang; Zhang, Zhongyue; Xie, Junyuan; Li, Mu (2018). "Bag of Tricks for Image Classification with Convolutional Neural Networks". arXiv:1812.01187.
Zhang, Richard (2019). "Making Convolutional Networks Shift-Invariant Again". arXiv:1904.11486.
Tan, Mingxing; Le, Quoc V. (2020). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". arXiv:1905.11946.
Radford, Alec; Wu, Jeff; Child, R.; Luan, D.; Amodei, Dario; Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners". S2CID 160025533.
Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc (2021). "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2443-2449. arXiv:2103.01913. doi:10.1145/3404835.3463257. ISBN 978-1-4503-8037-9.
"std and mean for image normalization different from ImageNet · Issue #20 · openai/CLIP". GitHub. Retrieved 2024-09-19.
Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu (July 2018). "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics. pp. 2556-2565. doi:10.18653/v1/P18-1238.
Cherti, Mehdi; Beaumont, Romain; Wightman, Ross; Wortsman, Mitchell; Ilharco, Gabriel; Gordon, Cade; Schuhmann, Christoph; Schmidt, Ludwig; Jitsev, Jenia (June 2023). "Reproducible Scaling Laws for Contrastive Language-Image Learning". 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2818-2829. arXiv:2212.07143. doi:10.1109/CVPR52729.2023.00276. ISBN 979-8-3503-0129-8.
"laion/CLIP-ViT-L-14-laion2B-s32B-b82K · Hugging Face". huggingface.co. 2023-09-10. Retrieved 2024-09-06.
Touvron, Hugo; Vedaldi, Andrea; Douze, Matthijs; Jegou, Herve (2019). "Fixing the train-test resolution discrepancy". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
Brock, Andy; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021). "High-Performance Large-Scale Image Recognition Without Normalization". Proceedings of the 38th International Conference on Machine Learning. PMLR: 1059-1071.
"Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.
Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125.
Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741.
Whitaker, Jonathan (2022-05-22). "Fun With Neural Cellular Automata". W&B. Retrieved 2024-09-08.
"LAION-AI/aesthetic-predictor". LAION AI. Retrieved 2024-09-08.
Haltakov, Vladimir (2024-09-03). "haltakov/natural-language-image-search". Retrieved 2024-09-06.
Beaumont, Romain (2024-09-07). "rom1504/clip-retrieval". Retrieved 2024-09-08.
Mokady, Ron; Hertz, Amir; Bermano, Amit H. (2021). "ClipCap: CLIP Prefix for Image Captioning". arXiv:2111.09734.
"CLIP/notebooks/Prompt_Engineering_for_ImageNet.ipynb at main · openai/CLIP". GitHub. Retrieved 2024-09-19.
Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". Advances in Neural Information Processing Systems. 35: 23716-23736.

External links

OpenAI's CLIP webpage: openai.com/research/clip
Official repository: https://github.com/OpenAI/CLIP
Arora, Aman (2023-03-11). "The Annotated CLIP (Part-2)". amaarora.github.io. Retrieved 2024-09-11.

Categories: Machine learning; Computer vision; Artificial neural networks; Natural language processing