• DALL·E: Zero-Shot Text-to-Image Generation


    Reference posts (Zhihu):
    - DALL·E, from text to image, a surrealist image generator: https://zhuanlan.zhihu.com/p/394467135
    - How is the DALL-E model implemented? OpenAI never published the full implementation; the code on GitHub contains only the dVAE, i.e. only half the system: https://www.zhihu.com/question/447757686/answer/2326092032

    Background on VAE and VQ-VAE (from continuous to discrete latent distributions): https://zhuanlan.zhihu.com/p/388299884

    DALL-E is a multi-stage pipeline built from three models: a dVAE, the DALL-E transformer itself, and CLIP. The dVAE encoder extracts discrete token features from the image; DALL-E is an autoregressive language model over the combined image and text features. This point is easy to misread: from the code it can look like a CLIP-style proxy task, but it is not. The text and image features are concatenated and modeled left-to-right with a causal transformer, essentially a GPT. At inference, the input text autoregressively produces image tokens, the dVAE decoder turns them into an image, and CLIP ranks the generated candidates for output. The three parts are trained independently. In practice CLIP is usually not retrained at all, since a pretrained one works; alternatively, as with GANs, one can simply generate a batch of images and select among them.

    Training:
    1) Stage One: train the dVAE on its own (yielding the encoder, the visual codebook, and the decoder);
    2) Stage Two: train the transformer. Text and image are tokenized separately, concatenated, and modeled with a GPT-3-style left-to-right autoregressive LM. A small but important detail: text goes on the left and image on the right, so that generating an image from text at inference time follows naturally.
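    The Stage Two setup can be sketched as follows. This is a hypothetical minimal illustration (random logits stand in for the causal transformer; the vocab sizes mirror the trace later in this post): text tokens on the left, image tokens on the right, one shared softmax, standard next-token cross-entropy.

```python
# Hedged sketch of the Stage-Two objective: one left-to-right LM over the
# concatenation of text tokens (left) and image tokens (right).
import torch
import torch.nn.functional as F

B, T_TEXT, T_IMG = 2, 256, 1024
V_TEXT, V_IMG = 10256, 8192
V = V_TEXT + V_IMG                                    # one softmax covers both vocabs

text = torch.randint(0, V_TEXT, (B, T_TEXT))
image = torch.randint(0, V_IMG, (B, T_IMG)) + V_TEXT  # image ids offset past text vocab
tokens = torch.cat([text, image], dim=1)              # (B, 1280), text first

logits = torch.randn(B, tokens.shape[1], V)           # stand-in for a causal transformer
# next-token prediction: position t predicts token t+1
loss = F.cross_entropy(logits[:, :-1].reshape(-1, V), tokens[:, 1:].reshape(-1))
print(tokens.shape, loss.shape)
```

    With text on the left, conditioning on a prompt at inference is just prefixing the sequence and continuing to decode.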

    Inference:
    Two input settings: 1) text only; 2) text + image.
    1) Text only: encode the text, autoregressively decode image tokens with the transformer, map the generated tokens to latent codes through the dVAE codebook, then run the dVAE decoder to produce the image;
    2) Text + image: the same pipeline, except the provided image contributes prefix tokens when decoding image tokens (the code appears to default to the first 14*32 tokens). My understanding is that this makes generation more controllable; everything else is unchanged.

    Finally, CLIP ranks the generated images. Why are there multiple images? Because image tokens are sampled from the predicted distribution rather than greedily decoded with argmax. To generate n images, run n decodes (these can be batched in parallel); since each decode samples differently, this yields n distinct image-token sequences.
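    A minimal illustration of why repeated decodes give different images: each position samples from the softmax distribution (`torch.multinomial`) instead of taking the greedy argmax. The shapes here are illustrative, not the paper's exact decode loop.

```python
# Sampling vs greedy decoding over image-token logits: multinomial draws
# differ between calls, while argmax is deterministic.
import torch

torch.manual_seed(0)
logits = torch.randn(1024, 8192)              # per-position logits over image tokens
probs = torch.softmax(logits, dim=-1)

greedy = probs.argmax(dim=-1)                 # always the same sequence
sample_a = torch.multinomial(probs, 1).squeeze(-1)
sample_b = torch.multinomial(probs, 1).squeeze(-1)

print(greedy.shape, (sample_a != sample_b).any().item())
```

    The n sampled sequences are then decoded to images and scored with CLIP; the top-scoring candidates are returned.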

    1. Introduction

            Using a GAN instead of a VAE can improve image fidelity; in generation tasks, including super-resolution, a GAN decoder is common for exactly this reason. But GANs have their own problems: samples can suffer severe artifacts such as object distortion, illogical object placement, or unnatural blending of foreground and background elements. In my reading of the super-resolution literature, CNN decoders tend to produce visibly smooth images without sharp edges, while GAN-based methods can generate content unrelated to the original image.

    2. Method

    Stage 1: train a dVAE that compresses a 256x256 input image into a 32x32 grid of image tokens, each taking one of 8192 possible values. The dVAE encoder therefore outputs logits of shape 32x32x8192; these logits index into the codebook, whose embeddings are learnable, and the selected embeddings are combined for decoding.

    Stage 2: encode the text with a BPE encoder into at most 256 text tokens (padding when shorter), concatenate the 256 text tokens with the 1024 image tokens into a sequence of length 1280, and feed the concatenated sequence to the transformer for autoregressive training.

    The dVAE is a VQ-VAE. A VQ-VAE differs from a VAE: a VAE learns the mean and variance of a Gaussian posterior, constrains that posterior against the prior with a KL term, reparameterizes a sample from the Gaussian, and decodes it. A VQ-VAE instead learns an intermediate encoding, maps it via nearest-neighbor search to one of the K codebook vectors, and reconstructs from that latent code with the decoder. The nearest-neighbor lookup uses argmax over codebook indices, which is not differentiable; DALL-E uses the Gumbel-softmax trick to work around this: argmax is non-differentiable, softmax approximates max, and the softmax relaxation is differentiable.
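    The relaxation can be sketched in a few lines. This is a toy example (small 8x8 grid, random codebook) showing that `F.gumbel_softmax` replaces the hard one-hot lookup with a differentiable soft mixture over codebook vectors, so gradients reach the encoder logits.

```python
# Gumbel-softmax relaxation of the codebook lookup: a weighted sum over all
# codebook vectors instead of a single non-differentiable argmax index.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 8192, 8, 8, requires_grad=True)  # toy encoder output
soft_one_hot = F.gumbel_softmax(logits, tau=0.9, hard=False, dim=1)
codebook = torch.randn(8192, 512)                        # toy learnable codebook

sampled = torch.einsum('bnhw,nd->bdhw', soft_one_hot, codebook)  # (1, 512, 8, 8)
sampled.sum().backward()                                 # gradients flow to logits
print(sampled.shape, logits.grad is not None)
```

    As the temperature `tau` decreases, the soft mixture approaches a hard one-hot selection.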

    In the ELBO, the first term is the reconstruction produced by the decoder; inside the KL term, the first distribution comes from the encoder (the posterior) and the second is the prior.
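    For reference, the bound being described is the paper's Eq. 1, with $x$ the image, $y$ the caption, $z$ the image tokens; $q_\phi$ is the dVAE encoder posterior, $p_\theta$ the dVAE decoder, and $p_\psi$ the transformer prior:

```latex
\ln p_{\theta,\psi}(x, y) \;\geq\;
\mathbb{E}_{z \sim q_\phi(z \mid x)}
\Big( \ln p_\theta(x \mid y, z)
      \;-\; \beta \, D_{\mathrm{KL}}\!\big( q_\phi(y, z \mid x) \,\Vert\, p_\psi(y, z) \big) \Big)
```

    Maximizing the first term trains the reconstruction; the KL term ties the encoder posterior to the prior.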

    2.1 Stage 1: Learning the Visual Codebook

    KL weight = 6.6, codebook size K = 8192.

    2.2 Stage 2: Learning the Prior

    This part is the DALL-E model proper: a prior-learning stage implemented with an autoregressive transformer (in DALL-E 2 this prior became a diffusion model). The transformer's input is the BPE-encoded text concatenated with the dVAE-encoded image tokens, trained with a standard autoregressive cross-entropy loss; despite first appearances in the code, this is not a CLIP-style contrastive objective.

    2.3 Inference

    At inference only the dVAE decoder is used; CLIP then selects the most suitable of the generated candidates for output.

    2.4 Data Collection

    12 billion parameters. The small-scale experiments used 3.3 million text-image pairs (Conceptual Captions); the full model was trained on 250 million pairs.

    3. Code

    VAE:

    vae = DiscreteVAE(
        image_size=256,
        num_layers=3,           # number of downsamples - ex. 256 / (2 ** 3) = (32 x 32 feature map)
        num_tokens=8192,        # number of visual tokens. in the paper, they used 8192, but could be smaller for downsized projects
        codebook_dim=512,       # codebook dimension
        hidden_dim=64,          # hidden dimension
        num_resnet_blocks=1,    # number of resnet blocks
        temperature=0.9,        # gumbel softmax temperature, the lower this is, the harder the discretization
        straight_through=False, # straight-through for gumbel softmax. unclear if it is better one way or the other
    )

    Forward-pass shape trace:
    img: (4, 3, 256, 256) -> norm -> logits = encoder(img): (4, 8192, 32, 32)
    -> soft_one_hot = F.gumbel_softmax(logits): (4, 8192, 32, 32)
    -> sampled = einsum('b n h w, n d -> b d h w', soft_one_hot, codebook.weight (8192, 512)): (4, 512, 32, 32)
    -> out = decoder(sampled): (4, 3, 256, 256)
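    The downsampling arithmetic in this trace can be sanity-checked with a stripped-down encoder (a sketch with random weights; the ResBlock is omitted since it does not change spatial size): three stride-2 convs take 256x256 to 32x32, and a final 1x1 conv emits the 8192 per-position logits.

```python
# Minimal stand-in for the printed encoder stack: verify 256 -> 32 spatial
# reduction and the 8192-way logits per grid position.
import torch
import torch.nn as nn

enc = nn.Sequential(
    nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),   # 256 -> 128
    nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # 128 -> 64
    nn.Conv2d(64, 64, 4, stride=2, padding=1), nn.ReLU(),  # 64 -> 32
    nn.Conv2d(64, 8192, 1),                                # logits over the codebook
)
with torch.no_grad():
    logits = enc(torch.randn(1, 3, 256, 256))
print(logits.shape)
```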

    DiscreteVAE(
      (codebook): Embedding(8192, 512)
      (encoder): Sequential(
        (0): Sequential(
          (0): Conv2d(3, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (1): Sequential(
          (0): Conv2d(64, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (2): Sequential(
          (0): Conv2d(64, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (3): ResBlock(
          (net): Sequential(
            (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (1): ReLU()
            (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (3): ReLU()
            (4): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
          )
        )
        (4): Conv2d(64, 8192, kernel_size=(1, 1), stride=(1, 1))
      )
      (decoder): Sequential(
        (0): Conv2d(512, 64, kernel_size=(1, 1), stride=(1, 1))
        (1): ResBlock(
          (net): Sequential(
            (0): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (1): ReLU()
            (2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
            (3): ReLU()
            (4): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1))
          )
        )
        (2): Sequential(
          (0): ConvTranspose2d(64, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (3): Sequential(
          (0): ConvTranspose2d(64, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (4): Sequential(
          (0): ConvTranspose2d(64, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))
          (1): ReLU()
        )
        (5): Conv2d(64, 3, kernel_size=(1, 1), stride=(1, 1))
      )
    )

    dalle:

    dalle = DALLE(
        dim=1024,
        vae=vae,               # automatically infer (1) image sequence length and (2) number of image tokens
        num_text_tokens=10000, # vocab size for text
        text_seq_len=256,      # text sequence length
        depth=12,              # should aim to be 64
        heads=16,              # attention heads
        dim_head=64,           # attention head dimension
        attn_dropout=0.1,      # attention dropout
        ff_dropout=0.1         # feedforward dropout
    )

    Forward-pass shape trace:
    image: (4, 3, 256, 256) / text: (4, 256)
    -> text_range: 256, seq_len: 1280, num_image_tokens: 8192, num_text_tokens: 10256
    -> text = F.pad(text): (4, 257) -> tokens = text_emb(text): (4, 257, 1024)
    -> image = vae.get_codebook_indices(image) -> logits = self(image): (4, 8192, 32, 32)
    -> codebook_indices = logits.argmax: (4, 1024) -> image_emb = image_emb(image): (4, 1024, 1024)
    -> tokens: (4, 1281, 1024) -> out = self.transformer(tokens[:, :1280]): (4, 1280, 1024)
    -> logits: (4, 1280, 18448)
    -> offsetted_image: (4, 1024), text: (4, 257), labels: (4, 1280) -> logits: (4, 18448, 1280)
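    The vocabulary layout implied by this trace can be sketched directly: image token ids are shifted up by the text vocab size, so one softmax of 10256 + 8192 = 18448 logits covers both token types (shapes illustrative).

```python
# Offsetting image token ids past the text vocab so labels share one vocab.
import torch

NUM_TEXT, NUM_IMAGE = 10256, 8192
image_tokens = torch.randint(0, NUM_IMAGE, (4, 1024))
offsetted_image = image_tokens + NUM_TEXT              # ids now in [10256, 18448)
text_tokens = torch.randint(0, NUM_TEXT, (4, 256))
labels = torch.cat([text_tokens, offsetted_image], dim=1)  # (4, 1280)
print(labels.shape, int(labels.max()) < NUM_TEXT + NUM_IMAGE)
```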

    DALLE(
      (vae): DiscreteVAE(
        (codebook): Embedding(8192, 1024)
        (encoder): ...  # identical to the DiscreteVAE encoder printed above
        (decoder): ...  # identical to the decoder above, except (0) is Conv2d(1024, 64, kernel_size=(1, 1), stride=(1, 1))
      )
      (transformer): Transformer(
        (layers): SequentialSequence(
          (layers): ModuleList(
            (0): ModuleList(
              (0): LayerScale(
                (fn): PreNorm(
                  (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                  (norm_out): Identity()
                  (fn): CachedAs(
                    (fn): PreShiftToken(
                      (fn): CachedAs(
                        (fn): Attention(
                          (to_qkv): Linear(in_features=1024, out_features=3072, bias=False)
                          (to_out): Sequential(
                            (0): Linear(in_features=1024, out_features=1024, bias=True)
                            (1): Dropout(p=0.1, inplace=False)
                          )
                        )
                      )
                    )
                  )
                )
              )
              (1): LayerScale(
                (fn): PreNorm(
                  (norm): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
                  (norm_out): Identity()
                  (fn): CachedAs(
                    (fn): PreShiftToken(
                      (fn): FeedForward(
                        (net): Sequential(
                          (0): Linear(in_features=1024, out_features=8192, bias=True)
                          (1): GEGLU()
                          (2): Dropout(p=0.1, inplace=False)
                          (3): Linear(in_features=4096, out_features=1024, bias=True)
                        )
                      )
                    )
                  )
                )
              )
            )
            (1)-(11): ...  # eleven more ModuleLists, each identical to (0)
          )
        )
      )
      (to_logits): Sequential(
        (0): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
        (1): Linear(in_features=1024, out_features=18448, bias=True)
      )
      (text_emb): Embedding(10256, 1024)
      (image_emb): Embedding(8192, 1024)
    )

  • Original post: https://blog.csdn.net/u012193416/article/details/126108145