Technique in neural networks for learning joint representations of text and images

Contrastive Language-Image Pre-training (CLIP) is a technique for training a pair of neural network models, one for image understanding and one for text understanding, using a contrastive objective.

Developer(s): OpenAI
Initial release: January 5, 2021
Repository: https://github.com/OpenAI/CLIP
Written in: Python
License: MIT License
Website: openai.com/research/clip
Publication history

CLIP was first announced on OpenAI's official blog on January 5, 2021, with a report served directly through OpenAI's CDN and a GitHub repository. The paper was delivered on arXiv on 26 February 2021.

The report (with some details removed, and its appendix cut out to a "Supplementary PDF") was published in the Proceedings of the 38th International Conference on Machine Learning (PMLR), which had a submission deadline of February 2021.

Concurrent to CLIP was ALIGN, published at the same conference by researchers at Google, using essentially the same algorithm.
Algorithm

[Figure: Architecture overview of CLIP.]

The CLIP method trains a pair of models contrastively. One model takes in a piece of text as input and outputs a single vector representing its semantic content. The other model takes in an image and similarly outputs a single vector representing its visual content. The models are trained so that the vectors corresponding to semantically similar text-image pairs are close together in the shared vector space, while those corresponding to dissimilar pairs are far apart.

To train a pair of CLIP models, one would start by preparing a large dataset of image-caption pairs. During training, the models are presented with batches of N image-caption pairs. Let the outputs from the text and image models be respectively v_1, ..., v_N and w_1, ..., w_N. Two vectors are considered "similar" if their dot product is large.

The loss incurred on this batch is the multi-class N-pair loss, which is a symmetric cross-entropy loss over the similarity scores:

    -\frac{1}{N}\sum_i \ln\frac{e^{v_i \cdot w_i / T}}{\sum_j e^{v_i \cdot w_j / T}} \; - \; \frac{1}{N}\sum_j \ln\frac{e^{v_j \cdot w_j / T}}{\sum_i e^{v_i \cdot w_j / T}}

In essence, this loss function encourages the dot product between matching image and text vectors (v_i \cdot w_i) to be high, while discouraging high dot products between non-matching pairs. The parameter T > 0 is the temperature, which is parameterized in the original CLIP model as T = e^{-\tau}, where \tau \in \mathbb{R} is a learned parameter.
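The following is a minimal PyTorch sketch of this symmetric loss, written for illustration rather than taken from the original report; v and w are assumed to be the batch of text and image embeddings, and tau the learned log-temperature (CLIP compares unit-normalized embeddings, so the sketch normalizes both inputs).

    import torch
    import torch.nn.functional as F

    def clip_loss(v: torch.Tensor, w: torch.Tensor, tau: torch.Tensor) -> torch.Tensor:
        """Symmetric multi-class N-pair loss on text embeddings v and image
        embeddings w, both of shape (N, d), mirroring the displayed formula."""
        v = F.normalize(v, dim=-1)           # unit-normalize so dot product = cosine similarity
        w = F.normalize(w, dim=-1)
        logits = (v @ w.T) * torch.exp(tau)  # similarity matrix scaled by 1/T = e^tau
        labels = torch.arange(v.shape[0], device=v.device)
        # cross-entropy over rows (text -> image) plus over columns (image -> text)
        return F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)

Because tau is learned, the model can sharpen or soften the softmax over the batch as training progresses.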
Other loss functions are possible. For example, Sigmoid CLIP (SigLIP) proposes the following loss function:

    L = \frac{1}{N}\sum_{i,j \in 1:N} f\!\left((2\delta_{i,j}-1)\left(e^{\tau} w_i \cdot v_j + b\right)\right)

where f(x) = \ln(1 + e^{-x}) is the negative log sigmoid loss.
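A minimal PyTorch sketch of this pairwise sigmoid loss, again as an illustration rather than the reference implementation; b is the learned bias from the formula above.

    import torch
    import torch.nn.functional as F

    def siglip_loss(v, w, tau, b):
        """Pairwise sigmoid loss over all (image, text) pairs in the batch.
        Matching pairs (i == j) get sign +1, all other pairs get sign -1."""
        n = v.shape[0]
        logits = torch.exp(tau) * (w @ v.T) + b        # e^tau * (w_i . v_j) + b
        signs = 2 * torch.eye(n, device=v.device) - 1  # +1 on the diagonal, -1 elsewhere
        # f(x) = ln(1 + e^{-x}) = softplus(-x)
        return F.softplus(-signs * logits).sum() / n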
CLIP models

While the original model was developed by OpenAI, subsequent models have been trained by other organizations as well.

Models released by OpenAI:

Model name      Resolution  Parameters (total, millions)  Parameters (vision)  Parameters (text)  Embedding dimension  Size (MB)  Release date
RN50            224         102                           38.3                 63.1               1024                 244        2021-01
RN101           224         120                           56.3                 63.1               512                  278        2021-03
RN50x4          288         178                           87.1                 90.7               640                  402        2021-03
RN50x16         384         291                           167.3                123.0              768                  630        2021-07
RN50x64         448         623                           420.4                201.8              1024                 1260       2022-01
ViT-B/32        224         151                           87.8                 63.1               512                  338        2021-01
ViT-B/16        224         150                           86.2                 63.1               512                  335        2021-07
ViT-L/14        224         428                           304.0                123.0              768                  890        2022-01
ViT-L/14@336px  336         428                           304.3                123.0              768                  891        2022-04
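The released checkpoints can also be inspected programmatically. The sketch below is an illustration using the openai/CLIP Python package (installable with pip install git+https://github.com/openai/CLIP.git) and assumes a working PyTorch install; each clip.load call downloads the corresponding checkpoint on first use.

    import clip   # the openai/CLIP package
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    for name in clip.available_models():          # e.g. ['RN50', ..., 'ViT-L/14@336px']
        model, preprocess = clip.load(name, device=device)
        n_vision = sum(p.numel() for p in model.visual.parameters())
        n_total = sum(p.numel() for p in model.parameters())
        print(f"{name}: resolution {model.visual.input_resolution}, "
              f"context length {model.context_length}, vocab size {model.vocab_size}, "
              f"vision params {n_vision / 1e6:.1f}M, "
              f"other params {(n_total - n_vision) / 1e6:.1f}M")
        del model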
Image model

The image encoding models used in CLIP are typically vision transformers (ViT). The naming convention for these models often reflects the specific ViT architecture used. For instance, "ViT-L/14" means a "vision transformer large" (compared to other models in the same series) with a patch size of 14, meaning that the image is divided into 14-by-14 pixel patches before being processed by the transformer; at the native 224x224 resolution this gives (224/14) x (224/14) = 256 patches. The size indicator ranges over B, L, H and G (base, large, huge and giant), in that order. Other than ViT, the image model is typically a convolutional neural network, such as ResNet (in the original series by OpenAI) or ConvNeXt (in the OpenCLIP model series).

Since the output vectors of the image model and the text model are compared by dot product, they must have exactly the same length. Both models therefore produce fixed-length vector outputs, whose length is called the "embedding dimension" in the original report. For example, in the original OpenAI models, the ResNets have embedding dimensions ranging from 512 to 1024, and the ViTs from 512 to 768.

CLIP's implementation of ViT was the same as the original Vision Transformer architecture, with one modification: after position embeddings are added to the initial patch embeddings, there is a LayerNorm. The output vector at the <CLS> token is used as the image encoding for CLIP.

Its implementation of ResNet was the same as the original one, with 3 modifications:
- At the start of the CNN (the "stem"), they used three stacked 3x3 convolutions instead of a single 7x7 convolution, as suggested by an earlier report on training tricks for CNNs.
- There is an average pooling of stride 2 at the start of each downsampling convolutional layer (they called it rect-2 blur pooling, following the terminology of Zhang (2019)). This has the effect of blurring images before downsampling, for antialiasing.
- The final convolutional layer is followed by a multiheaded attention pooling.

ALIGN, by contrast, used EfficientNet of various sizes, a kind of convolutional neural network.
Text model

The text encoding models used in CLIP are typically Transformers.

In the original OpenAI report, they reported using a Transformer (63M-parameter, 12-layer, 512-wide, with 8 attention heads) operating on lower-cased byte pair encoding (BPE) with a vocabulary size of 49152. Context length was capped at 76 for efficiency. Like GPT, it was decoder-only, with only causally-masked self-attention; its architecture is the same as GPT-2. The Transformer used in the CLIP text encoder was made by removing the cross-attention module from the original Transformer decoder layer, then stacking the resulting module 12 times.

Like BERT, the text sequence is bracketed by two special tokens [SOS] and [EOS] ("start of sequence" and "end of sequence"). Take the activations of the highest layer of the transformer at the [EOS] token, apply LayerNorm, then a final linear map; this is the text encoding of the input sequence. The final linear map has output dimension equal to the embedding dimension of whatever image encoder it is paired with. These models all had context length 77 and vocabulary size 49408.

ALIGN, by contrast, used BERT of various sizes.
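As an illustration (assuming the openai/CLIP package from the sketch above), text is tokenized to the fixed context length of 77 and encoded into the shared embedding space:

    import clip
    import torch

    device = "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    tokens = clip.tokenize(["a photo of a dog", "a diagram of the CLIP architecture"])
    print(tokens.shape)            # torch.Size([2, 77]) -- [SOS] ... [EOS], padded to 77

    with torch.no_grad():
        text_features = model.encode_text(tokens.to(device))
    print(text_features.shape)     # torch.Size([2, 512]) for ViT-B/32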
Dataset

WebImageText

The CLIP models released by OpenAI were trained on a dataset called "WebImageText" (WIT) containing 400 million pairs of images and their corresponding captions scraped from the internet. The total number of words in this dataset is similar in scale to the WebText dataset used for training GPT-2, which contains about 40 gigabytes of text data.

The dataset contains 500,000 text-queries, with up to 20,000 (image, text) pairs per query. The text-queries were generated by starting with all words occurring at least 100 times in English Wikipedia, then extended by bigrams with high mutual information, names of all Wikipedia articles above a certain search volume, and WordNet synsets.

The dataset is private and has not been released to the public, and there is no further information on it.

Others

ALIGN used over one billion image-text pairs, obtained by extracting images and their alt-tags from online crawling. The method was described as similar to how the Conceptual Captions dataset was constructed, but instead of complex filtering, they only applied frequency-based filtering.

Later models trained by other organizations had published datasets. For example, LAION trained OpenCLIP with the published datasets LAION-400M, LAION-2B, and DataComp-1B.
Data preprocessing

For the CLIP image models, the input images are preprocessed by first dividing each of the R, G, B values of an image by the maximum possible value, so that these values fall between 0 and 1, then subtracting the per-channel means (0.48145466, 0.4578275, 0.40821073) and dividing by the per-channel standard deviations (0.26862954, 0.26130258, 0.27577711). The rationale was that these are the means and standard deviations of the images in the WebImageText dataset, so this preprocessing step roughly whitens the image tensor. These numbers slightly differ from the standard preprocessing for ImageNet, which uses means (0.485, 0.456, 0.406) and standard deviations (0.229, 0.224, 0.225).

If the input image does not have the same resolution as the native resolution (224x224 for all models except ViT-L/14@336px, which has 336x336 resolution), then the input image is scaled down by bicubic interpolation so that its shorter side matches the native resolution, and the central square of the image is then cropped out.
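A sketch of an equivalent preprocessing pipeline in torchvision, for illustration (the openai/CLIP package returns essentially this transform from clip.load as preprocess):

    from torchvision import transforms
    from torchvision.transforms import InterpolationMode

    def clip_preprocess(native_resolution: int = 224) -> transforms.Compose:
        """Resize (bicubic, shorter side = native resolution), center-crop the
        central square, scale to [0, 1], then normalize with the WIT statistics."""
        return transforms.Compose([
            transforms.Resize(native_resolution, interpolation=InterpolationMode.BICUBIC),
            transforms.CenterCrop(native_resolution),
            transforms.ToTensor(),   # PIL image -> float tensor in [0, 1]
            transforms.Normalize(mean=(0.48145466, 0.4578275, 0.40821073),
                                 std=(0.26862954, 0.26130258, 0.27577711)),
        ])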
Training

In the original OpenAI CLIP report, they reported training 5 ResNet and 3 ViT models (ViT-B/32, ViT-B/16, ViT-L/14). Each was trained for 32 epochs. The largest ResNet model took 18 days to train on 592 V100 GPUs, and the largest ViT model took 12 days on 256 V100 GPUs.

All ViT models were trained at 224x224 image resolution. The ViT-L/14 was then boosted to 336x336 resolution by FixRes, resulting in a further model, ViT-L/14@336px. They found this was the best-performing model.

In the OpenCLIP series, the ViT-L/14 model was trained on 384 A100 GPUs on the LAION-2B dataset, for 160 epochs, for a total of 32B samples seen.
Applications

CLIP has found wide applications in various domains.

A trained image encoder of a CLIP pair can be used as a pre-trained image featurizer. Its output can then be fed into other AI models.

For text-to-image generation, Stable Diffusion uses the text encoder of CLIP ViT-L/14 to transform text prompts into an embedding space, as these embeddings provide detailed semantic information. CLIP can also be used as a gradient signal for directly guiding diffusion ("CLIP guidance") or other generative art.

A finetuned model based on CLIP can be used to rank images by their aesthetic quality, which can be used for improving dataset quality.

CLIP can retrieve images based on textual descriptions. This is possible even if the images were not explicitly tagged with those keywords.

Given an image, CLIP can generate captions. This is done by finding the text input that maximizes the similarity score with the image embedding.

Image classification

CLIP can perform zero-shot image classification tasks, i.e. without any explicit training on those specific classes. This is achieved by prompting the text encoder with class names and selecting the class whose embedding is closest to the image embedding. For example, to classify an image, the authors compared the embedding of the image with the embedding of the text "A photo of a {class}.", and the {class} that results in the highest dot product is outputted.

The authors found that zero-shot ImageNet classification would fail in cases of polysemy, requiring a moderate amount of prompt engineering. For example, in ImageNet the class "kite" refers to a kind of bird, not a toy, so the authors changed "kite" to "kite (bird of prey)".
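A minimal sketch of this zero-shot procedure with the openai/CLIP package; the class names, prompt template and image path are illustrative, not taken from the report.

    import clip
    import torch
    from PIL import Image

    device = "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    classes = ["cat", "dog", "kite (bird of prey)"]   # illustrative class names
    text = clip.tokenize([f"A photo of a {c}." for c in classes]).to(device)
    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)

    with torch.no_grad():
        image_features = model.encode_image(image)
        text_features = model.encode_text(text)
        # cosine similarity = dot product of unit-normalized embeddings
        image_features /= image_features.norm(dim=-1, keepdim=True)
        text_features /= text_features.norm(dim=-1, keepdim=True)
        scores = (image_features @ text_features.T).squeeze(0)

    print(classes[scores.argmax().item()])            # predicted class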
Multimodality

CLIP has been used as a component in multimodal learning. For example, during the training of Google DeepMind's Flamingo (2022), the authors trained a CLIP pair, with BERT as the text encoder and NormalizerFree ResNet F6 as the image encoder. The image encoder of the CLIP pair was taken with parameters frozen and the text encoder was discarded. The frozen image encoder was then combined with a frozen Chinchilla language model, by finetuning some further parameters that connect the two frozen models.

Notes

- The "embedding dimension" here is analogous to the embedding dimension of text embeddings in Transformer models.
- The report referred to this model as both ViT-L/14-336px and ViT-L/14@336px, inconsistently.
- The WebImageText dataset is not the same as the Wikipedia-based Image Text dataset, which is also abbreviated "WIT".
References

Radford, Alec; Kim, Jong Wook; Hallacy, Chris; Ramesh, Aditya; Goh, Gabriel; Agarwal, Sandhini; Sastry, Girish; Askell, Amanda; Mishkin, Pamela; Clark, Jack; Krueger, Gretchen; Sutskever, Ilya (2021). "Learning Transferable Visual Models From Natural Language Supervision". Proceedings of the 38th International Conference on Machine Learning. PMLR. pp. 8748-8763. arXiv:2103.00020.
"CLIP: Connecting text and images". OpenAI. January 5, 2021.
"initial commit · openai/CLIP@b1c4b6b". GitHub. 5 January 2021. Archived from the original on 9 Feb 2021. Retrieved 2024-09-06.
Original report (archived): https://web.archive.org/web/20210105204011/https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language.pdf
"ICML 2021 Call for Papers". icml.cc. Retrieved 2024-09-06.
"openai/CLIP". OpenAI. Retrieved 2024-09-06.
Jia, Chao; Yang, Yinfei; Xia, Ye; Chen, Yi-Ting; Parekh, Zarana; Pham, Hieu; Le, Quoc; Sung, Yun-Hsuan; Li, Zhen; Duerig, Tom (2021). "Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision". Proceedings of the 38th International Conference on Machine Learning. PMLR: 4904-4916.
Sohn, Kihyuk (2016). "Improved Deep Metric Learning with Multi-class N-pair Loss Objective". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
Zhai, Xiaohua; Mustafa, Basil; Kolesnikov, Alexander; Beyer, Lucas (2023). "Sigmoid Loss for Language Image Pre-Training". IEEE/CVF International Conference on Computer Vision (ICCV). pp. 11975-11986.
Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11976-11986.
Ilharco, Gabriel; Wortsman, Mitchell; Wightman, Ross; Gordon, Cade; Carlini, Nicholas; Taori, Rohan; Dave, Achal; Shankar, Vaishaal; Namkoong, Hongseok (July 2021). OpenCLIP. doi:10.5281/zenodo.5143773. Retrieved 2024-09-06.
Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob (2021). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv:2010.11929.
He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Deep Residual Learning for Image Recognition". arXiv:1512.03385.
He, Tong; Zhang, Zhi; Zhang, Hang; Zhang, Zhongyue; Xie, Junyuan; Li, Mu (2018). "Bag of Tricks for Image Classification with Convolutional Neural Networks". arXiv:1812.01187.
Zhang, Richard (2019). "Making Convolutional Networks Shift-Invariant Again". arXiv:1904.11486.
Tan, Mingxing; Le, Quoc V. (2020). "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks". arXiv:1905.11946.
Radford, Alec; Wu, Jeff; Child, R.; Luan, D.; Amodei, Dario; Sutskever, I. (2019). "Language Models are Unsupervised Multitask Learners". S2CID 160025533.
Srinivasan, Krishna; Raman, Karthik; Chen, Jiecao; Bendersky, Michael; Najork, Marc (2021). "WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning". Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 2443-2449. arXiv:2103.01913. doi:10.1145/3404835.3463257. ISBN 978-1-4503-8037-9.
"std and mean for image normalization different from ImageNet · Issue #20 · openai/CLIP". GitHub. Retrieved 2024-09-19.
Sharma, Piyush; Ding, Nan; Goodman, Sebastian; Soricut, Radu (July 2018). "Conceptual Captions: A Cleaned, Hypernymed, Image Alt-text Dataset For Automatic Image Captioning". Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics. pp. 2556-2565. doi:10.18653/v1/P18-1238.
Cherti, Mehdi; Beaumont, Romain; Wightman, Ross; Wortsman, Mitchell; Ilharco, Gabriel; Gordon, Cade; Schuhmann, Christoph; Schmidt, Ludwig; Jitsev, Jenia (June 2023). "Reproducible Scaling Laws for Contrastive Language-Image Learning". 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2818-2829. arXiv:2212.07143. doi:10.1109/CVPR52729.2023.00276. ISBN 979-8-3503-0129-8.
"laion/CLIP-ViT-L-14-laion2B-s32B-b82K · Hugging Face". huggingface.co. 2023-09-10. Retrieved 2024-09-06.
Touvron, Hugo; Vedaldi, Andrea; Douze, Matthijs; Jegou, Herve (2019). "Fixing the train-test resolution discrepancy". Advances in Neural Information Processing Systems. 32. Curran Associates, Inc.
Brock, Andy; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021). "High-Performance Large-Scale Image Recognition Without Normalization". Proceedings of the 38th International Conference on Machine Learning. PMLR: 1059-1071.
"Stable Diffusion Repository on GitHub". CompVis - Machine Vision and Learning Research Group, LMU Munich. 17 September 2022. Archived from the original on January 18, 2023. Retrieved 17 September 2022.
Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv:2204.06125.
Nichol, Alex; Dhariwal, Prafulla; Ramesh, Aditya; Shyam, Pranav; Mishkin, Pamela; McGrew, Bob; Sutskever, Ilya; Chen, Mark (2022). "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models". arXiv:2112.10741.
Whitaker, Jonathan (2022-05-22). "Fun With Neural Cellular Automata". W&B. Retrieved 2024-09-08.
"LAION-AI/aesthetic-predictor". LAION AI. Retrieved 2024-09-08.
Haltakov, Vladimir (2024-09-03). "haltakov/natural-language-image-search". Retrieved 2024-09-06.
Beaumont, Romain (2024-09-07). "rom1504/clip-retrieval". Retrieved 2024-09-08.
Mokady, Ron; Hertz, Amir; Bermano, Amit H. (2021). "ClipCap: CLIP Prefix for Image Captioning". arXiv:2111.09734.
"CLIP/notebooks/Prompt_Engineering_for_ImageNet.ipynb at main · openai/CLIP". GitHub. Retrieved 2024-09-19.
Alayrac, Jean-Baptiste; Donahue, Jeff; Luc, Pauline; Miech, Antoine; Barr, Iain; Hasson, Yana; Lenc, Karel; Mensch, Arthur; Millican, Katherine; Reynolds, Malcolm; Ring, Roman; Rutherford, Eliza; Cabi, Serkan; Han, Tengda; Gong, Zhitao (2022). "Flamingo: a Visual Language Model for Few-Shot Learning". Advances in Neural Information Processing Systems. 35: 23716-23736.

External links

OpenAI's CLIP webpage: openai.com/research/clip
Official repository: https://github.com/OpenAI/CLIP
Arora, Aman (2023-03-11). "The Annotated CLIP (Part-2)". amaarora.github.io. Retrieved 2024-09-11.

Categories: Machine learning; Computer vision; Artificial neural networks; Natural language processing