Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, and Lijuan Wang, in Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS), 2022.

Bib Entry

@inproceedings{dou2022fiber,
  title = {Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone},
  author = {Dou, Zi-Yi and Kamath, Aishwarya and Gan, Zhe and Zhang, Pengchuan and Wang, Jianfeng and Li, Linjie and Liu, Zicheng and Liu, Ce and LeCun, Yann and Peng, Nanyun and Gao, Jianfeng and Wang, Lijuan},
  booktitle = {Proceedings of the Thirty-Sixth Conference on Neural Information Processing Systems (NeurIPS)},
  year = {2022}
}

Related Publications

  • An Empirical Study of Training End-to-End Vision-and-Language Transformers

    Zi-Yi Dou, Yichong Xu, Zhe Gan, Jianfeng Wang, Shuohang Wang, Lijuan Wang, Chenguang Zhu, Pengchuan Zhang, Lu Yuan, Nanyun Peng, Zicheng Liu, and Michael Zeng, in The Conference on Computer Vision and Pattern Recognition (CVPR-22), 2022.
    Vision-and-language (VL) pre-training has proven to be highly effective on various VL downstream tasks. While recent work has shown that fully transformer-based VL models can be more efficient than previous region-feature-based methods, their performance on downstream tasks often degrades significantly. In this paper, we present METER, a Multimodal End-to-end TransformER framework, through which we investigate how to design and pre-train a fully transformer-based VL model in an end-to-end manner. Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion module (e.g., merged attention vs. co-attention), architectural design (e.g., encoder-only vs. encoder-decoder), and pre-training objectives (e.g., masked image modeling). We conduct comprehensive experiments and provide insights on how to train a performant VL transformer while maintaining fast inference speed. Notably, our best model achieves an accuracy of 77.64% on the VQAv2 test-std set using only 4M images for pre-training, surpassing the state-of-the-art region-feature-based model by 1.04%, and outperforming the previous best fully transformer-based model by 1.6%.
    @inproceedings{dou2022meter,
      title = {An Empirical Study of Training End-to-End Vision-and-Language Transformers},
      author = {Dou, Zi-Yi and Xu, Yichong and Gan, Zhe and Wang, Jianfeng and Wang, Shuohang and Wang, Lijuan and Zhu, Chenguang and Zhang, Pengchuan and Yuan, Lu and Peng, Nanyun and Liu, Zicheng and Zeng, Michael},
      booktitle = {The Conference on Computer Vision and Pattern Recognition (CVPR-22)},
      year = {2022}
    }
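The fusion-module comparison in the abstract above (merged attention vs. co-attention) can be sketched minimally in PyTorch. This is an illustrative toy, not METER's actual implementation: the class names, head count, and dimensions are assumptions, and residual connections, layer norms, and feed-forward sublayers are omitted.

```python
import torch
import torch.nn as nn

class MergedAttention(nn.Module):
    """Merged attention: concatenate text and image tokens into one
    sequence and run a single shared self-attention over the union."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        x = torch.cat([text, image], dim=1)   # one joint token sequence
        out, _ = self.attn(x, x, x)
        # split the joint sequence back into per-modality outputs
        return out[:, :text.size(1)], out[:, text.size(1):]

class CoAttention(nn.Module):
    """Co-attention: each modality keeps its own self-attention,
    plus a cross-attention into the other modality."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.self_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_t = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text, image):
        t, _ = self.self_t(text, text, text)
        v, _ = self.self_v(image, image, image)
        t, _ = self.cross_t(t, v, v)          # text queries attend to image
        v, _ = self.cross_v(v, t, t)          # image queries attend to text
        return t, v

text = torch.randn(2, 16, 64)    # (batch, text tokens, hidden dim)
image = torch.randn(2, 49, 64)   # (batch, image patches, hidden dim)
t1, v1 = MergedAttention(64)(text, image)
t2, v2 = CoAttention(64)(text, image)
print(t1.shape, v1.shape)  # torch.Size([2, 16, 64]) torch.Size([2, 49, 64])
```

Merged attention shares one set of parameters across both modalities, while co-attention roughly doubles the fusion parameters but keeps the per-modality sequences separate; the paper's experiments weigh exactly this kind of trade-off.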