research ideas

Last updated on April 17, 2024

Weak, few-shot, zero-shot, unsupervised learning

CLIP: Contrastive Language-Image Pre-training (Learning Transferable Visual Models From Natural Language Supervision)
Some thoughts of my own

What matters about this 2021 paper is not the benchmark numbers or the zero-shot ability; what matters is that it points to a future direction for deep learning and machine learning, toward AGI and toward something closer to how humans learn: driven by knowledge rather than by tasks. The benchmark gains it brings should be seen as a by-product of the model. Now that task-driven models in each sub-field are gradually maturing, I think knowledge-driven learning is an inevitable direction if the field is to keep moving toward AGI.

As for what kind of learning signal to use, for example whether it should be driven by natural language, that still needs study. In my understanding, knowledge should be abstract; natural language is only one carrier of knowledge, or an interface, but as the way humans interact with models it is indispensable. Then again, the way a model learns is like a teacher and a student, and teacher-student interaction also relies on language, so for now natural-language supervision does look like the best option.

<2023/7/14>

paper reading 1

From the paper: “We do not use the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We instead use only a linear projection to map from each encoder’s representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image only in self-supervised representation learning methods.”

I have run into something similar in my own experiments. A few days ago, on a whim, I replaced two shared weight matrices in an LSTM with two MLPs: with hidden_size = 1000, what was originally a 1000 -> 1000 linear layer became a 1000 -> 500 -> ReLU -> 500 -> 1000 structure, keeping the parameter count the same as the original model; and since the only addition is a ReLU, the change has essentially no effect on compute time. But on a task like video captioning it dropped about 10 points, and the model became harder to train, with a higher loss. I was still wondering why when I came across this passage describing someone else who had tried the same thing. The paper does not give a very intuitive explanation either, though.
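For reference, a minimal PyTorch sketch of the swap described above; the dimensions follow my note, and the variable names are mine and purely illustrative:

```python
import torch.nn as nn

hidden_size = 1000

# Original: a single 1000 -> 1000 linear map.
linear_proj = nn.Linear(hidden_size, hidden_size)

# Modified: a 1000 -> 500 -> ReLU -> 500 -> 1000 bottleneck MLP,
# chosen so the parameter count stays essentially the same.
mlp_proj = nn.Sequential(
    nn.Linear(hidden_size, 500),
    nn.ReLU(),
    nn.Linear(500, hidden_size),
)

n_params = lambda m: sum(p.numel() for p in m.parameters())
print(n_params(linear_proj), n_params(mlp_proj))  # 1,001,000 vs. 1,001,500
```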

What puzzles me is that the FFN in “Attention Is All You Need” is essentially a structure like this, so why does it work there while a single linear layer does not? It seems odd; I need to go back and reread that paper.
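For comparison, the position-wise FFN in the base Transformer uses d_model = 512 and d_ff = 2048, so it expands the width rather than bottlenecking it, and it sits inside a residual connection with layer normalization; a rough sketch of that sub-layer (post-LN form, as in the original paper):

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048  # base-model sizes from "Attention Is All You Need"

ffn = nn.Sequential(
    nn.Linear(d_model, d_ff),   # expands the width instead of bottlenecking it
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)
norm = nn.LayerNorm(d_model)

def ffn_sublayer(x: torch.Tensor) -> torch.Tensor:
    # The FFN is wrapped in a residual connection plus layer norm,
    # unlike the standalone projection in the LSTM experiment above.
    return norm(x + ffn(x))
```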

<<2023/7/14>>

paper reading 2

A few training tricks; I will probably write these up in a separate blog post later.

“We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyperparameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used.”
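A rough PyTorch sketch of two of these tricks as I understand them; the learning rate and weight-decay values are placeholders, and the exact clipping mechanics may differ from OpenAI's implementation:

```python
import math
import torch
import torch.nn as nn

# Learnable temperature: initialised to the equivalent of 0.07 and capped so the
# logits are never scaled by more than 100 (one simple way to enforce the clip).
logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

def contrastive_logits(image_features, text_features):
    # assumes both feature matrices are already L2-normalised
    scale = logit_scale.exp().clamp(max=100.0)
    return scale * image_features @ text_features.t()

# Decoupled weight decay applied only to weights that are not gains or biases.
def param_groups(model, weight_decay=0.2):
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        (no_decay if p.ndim <= 1 or name.endswith(".bias") else decay).append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

encoders = nn.Sequential(nn.Linear(512, 512))  # stand-in for the image/text encoders
# In real training, logit_scale would also be passed to the optimizer (no weight decay).
optimizer = torch.optim.AdamW(param_groups(encoders), lr=5e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10_000)
```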

some ideas
  1. Use external knowledge to enhance representation ability.
  2. Retrieval-based enhancement.
  3. Open domain.

Multi-Modal learning

Video understanding
  • The main difference between video and image lies in how temporal relations are handled; in principle the two can share the same visual module, and the key question is how to get the model to “remember” contextual links. Is the so-called temporal relation a bidirectional interaction between frames, or something more like memorizing content in temporal order? If the former, it can be done with a transformer encoder; if the latter, could it be implemented with a transformer decoder?
  • Pre-training methods for temporal relations such as VideoMAE seem unreasonable to me: they just learn all of a video's visual features indiscriminately. One could instead design a time-aware method built on top of existing image feature extraction, but how to construct such a loss and such a dataset is the problem.
  • Is it reasonable to encode each frame into a single token and have a language model make sense of the sequence of image tokens? In other words, is the contextual structure of text the same as that of video? I clearly think not: video frames lack the structured features of textual context, so a language model applied directly to video will likely misread the difference between the two kinds of context (a rough sketch of the frame-as-token setup follows this list).
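A minimal PyTorch sketch of the frame-as-token idea from the last point, assuming some image backbone has already turned each frame into a feature vector; the class name, dimensions, and the choice of a bidirectional temporal encoder are all mine, purely illustrative:

```python
import torch
import torch.nn as nn

class TemporalOverFrameTokens(nn.Module):
    """Treat each frame's image feature as one token and model temporal relations over them."""

    def __init__(self, frame_dim=768, num_layers=4, num_heads=8, max_frames=64):
        super().__init__()
        # Learned temporal position embeddings so the model can see frame order.
        self.temporal_pos = nn.Parameter(torch.zeros(1, max_frames, frame_dim))
        layer = nn.TransformerEncoderLayer(d_model=frame_dim, nhead=num_heads, batch_first=True)
        # Bidirectional frame-to-frame interaction (the "encoder" option above);
        # a causal mask or a decoder would correspond to the "memory in order" option.
        self.temporal_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, frame_dim), one token per frame
        # produced by a frozen image encoder (e.g. a CLIP visual backbone).
        t = frame_tokens.size(1)
        return self.temporal_encoder(frame_tokens + self.temporal_pos[:, :t])

# 2 clips of 16 frames, each frame already encoded into a 768-d feature
frame_tokens = torch.randn(2, 16, 768)
video_features = TemporalOverFrameTokens()(frame_tokens)  # (2, 16, 768)
```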