See Also
Links
- “CT Foundation: Taking Medical Imaging Embeddings 3D”, Kiraly & Traverse 2024
- “Long-Term Tracking of Social Structure in Groups of Rats”, Nagy et al 2024
- “Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-Modal LLMs in Video Analysis”, Fu et al 2024
- “InternVid: A Large-Scale Video-Text Dataset for Multimodal Understanding and Generation”, Wang et al 2023
- “Test-Time Training on Video Streams”, Wang et al 2023
- “Magenta Green Screen: Spectrally Multiplexed Alpha Matting With Deep Colorization”, Smirnov et al 2023
- “PaLI-X: On Scaling up a Multilingual Vision and Language Model”, Chen et al 2023
- “ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
- “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- “VideoCoCa: Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
- “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Cheng et al 2022
- “Videogenic: Video Highlights via Photogenic Moments”, Lin et al 2022
- “AnimeRun: 2D Animation Visual Correspondence from Open Source 3D Movies”, Siyao et al 2022
- “Vision-Language Pre-Training: Basics, Recent Advances, and Future Trends”, Gan et al 2022
- “TVLT: Textless Vision-Language Transformer”, Tang et al 2022
- “EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
- “X-CLIP: Expanding Language-Image Pretrained Models for General Video Recognition”, Ni et al 2022
- “X-CLIP: End-To-End Multi-Grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
- “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
- “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
- “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- “MLP-3D: A MLP-Like 3D Architecture With Grouped Time Mixing”, Qiu et al 2022
- “Uni-Perceiver-MoE: Learning Sparse Generalist Models With Conditional MoEs”, Zhu et al 2022
- “Revisiting the "Video" in Video-Language Understanding”, Buch et al 2022
- “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
- “Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
- “Imitating, Fast and Slow: Robust Learning from Demonstrations via Decision-Time Planning”, Qi et al 2022
- “ViS4mer: Long Movie Clip Classification With State-Space Video Models”, Islam & Bertasius 2022
- “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
- “Reinforcement Learning With Action-Free Pre-Training from Videos”, Seo et al 2022
- “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Taesiri et al 2022
- “Robot Peels Banana With Goal-Conditioned Dual-Action Deep Imitation Learning”, Kim et al 2022
- “Hierarchical Perceiver”, Carreira et al 2022
- “MuZero With Self-Competition for Rate Control in VP9 Video Compression”, Mandhane et al 2022
- “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- “MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition”, Wu et al 2022
- “CAST: Character Labeling in Animation Using Self-Supervision by Tracking”, Nir et al 2022
- “AV-HuBERT: Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction”, Shi et al 2022
- “Noether Networks: Meta-Learning Useful Conserved Quantities”, Alet et al 2021
- “MAD: A Scalable Dataset for Language Grounding in Videos from Movie Audio Descriptions”, Soldan et al 2021
- “MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video”, Zhang et al 2021
- “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
- “Scaling ASR Improves Zero and Few Shot Learning”, Xiao et al 2021
- “ADOP: Approximate Differentiable One-Pixel Point Rendering”, Rückert et al 2021
- “VideoCLIP: Contrastive Pre-Training for Zero-Shot Video-Text Understanding”, Xu et al 2021
- “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
- “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
- “Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
- “Learning from Videos to Understand the World”, Zweig et al 2021
- “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- “Video Transformer Network”, Neimark et al 2021
- “Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning”, Lee et al 2021
- “MSR-VTT: A Large Video Description Dataset for Bridging Video and Language”, Xu et al 2021
- “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- “Transformers in Vision: A Survey”, Khan et al 2021
- “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- “Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
- “Self-Supervised Learning through the Eyes of a Child”, Orhan et al 2020
- “Gesticulator: A Framework for Semantically-Aware Speech-Driven Gesture Generation”, Kucherenko et al 2020
- “SAYCam: A Large, Longitudinal Audiovisual Dataset Recorded from the Infant’s Perspective”, Sullivan et al 2020
- “Axial Attention in Multidimensional Transformers”, Ho et al 2019
- “CATER: A Diagnostic Dataset for Compositional Actions and TEmporal Reasoning”, Girdhar & Ramanan 2019
- “CLEVRER: CoLlision Events for Video REpresentation and Reasoning”, Yi et al 2019
- “Training Kinetics in 15 Minutes: Large-Scale Distributed Training on Videos”, Lin et al 2019
- “A Short Note on the Kinetics-700 Human Action Dataset”, Carreira et al 2019
- “Billion-Scale Semi-Supervised Learning for Image Classification”, Yalniz et al 2019
- “VideoBERT: A Joint Model for Video and Language Representation Learning”, Sun et al 2019
- “Real-Time Continuous Transcription With Live Transcribe”, Savla 2019
- “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
- “Evolving Space-Time Neural Architectures for Videos”, Piergiovanni et al 2018
- “Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow”, Peng et al 2018
- “A Short Note about Kinetics-600”, Carreira et al 2018
- “Large-Scale Visual Speech Recognition”, Shillingford et al 2018
- “Playing Hard Exploration Games by Watching YouTube”, Aytar et al 2018
- “BDD100K: A Diverse Driving Dataset for Heterogeneous Multitask Learning”, Yu et al 2018
- “The Sound of Pixels”, Zhao et al 2018
- “One-Shot Imitation from Observing Humans via Domain-Adaptive Meta-Learning”, Yu et al 2018
- “Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition”, Ye et al 2017
- “Reinforced Video Captioning With Entailment Rewards”, Pasunuru & Bansal 2017
- “Tracking As Online Decision-Making: Learning a Policy from Streaming Videos With Reinforcement Learning”, Supančič & Ramanan 2017
- “Learning to Learn from Noisy Web Videos”, Yeung et al 2017
- “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
- “The Kinetics Human Action Video Dataset”, Kay et al 2017
- “Dense-Captioning Events in Videos”, Krishna et al 2017
- “Time-Contrastive Networks: Self-Supervised Learning from Video”, Sermanet et al 2017
- “LipNet: End-To-End Sentence-Level Lipreading”, Assael et al 2016
- “Deep Visual Foresight for Planning Robot Motion”, Finn & Levine 2016
- “Temporal Convolutional Networks: A Unified Approach to Action Segmentation”, Lea et al 2016
- “Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016
- “Artistic Style Transfer for Videos”, Ruder et al 2016
- “YFCC100M: The New Data in Multimedia Research”, Thomee et al 2015
- “UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild”, Soomro et al 2012
Sort By Magic
Annotations sorted by machine learning into inferred 'tags'. This provides an alternative way to browse: instead of by date order, one can browse in topic order. The 'sorted' list has been automatically clustered into multiple sections & auto-labeled for easier browsing.
Beginning with the newest annotation, it uses each annotation’s embedding to build a chain of nearest-neighbor annotations, yielding a progression of topics; a minimal illustrative sketch of this greedy ordering follows the tag list below. For more details, see the link.
- video-imitation
- video-language
- video-representation
- video-embedding
- action-recognition
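The sketch below illustrates the greedy embedding-based ordering described above: start from the newest annotation and repeatedly append the most similar not-yet-used annotation by cosine similarity. The input format (title/date/embedding triples) and the function name are assumptions for illustration only, not the actual pipeline behind this page.

```python
# Illustrative sketch only: greedy nearest-neighbor ordering of annotations by
# embedding similarity, starting from the newest annotation. The data format and
# function name are assumed for the example; the real pipeline may differ.
import numpy as np

def magic_sort(titles, dates, embeddings):
    """Order annotations so each entry is followed by its most similar unused neighbor."""
    emb = np.asarray(embeddings, dtype=float)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True)   # unit-normalize: dot product = cosine similarity
    remaining = list(range(len(titles)))
    current = max(remaining, key=lambda i: dates[i])     # begin with the newest annotation
    order = [current]
    remaining.remove(current)
    while remaining:
        sims = emb[remaining] @ emb[current]             # similarity of each remaining candidate to the current item
        current = remaining[int(np.argmax(sims))]
        order.append(current)
        remaining.remove(current)
    return [titles[i] for i in order]

# Toy usage with made-up 3-D vectors; real annotation embeddings would come from
# a text-embedding model run over each annotation's abstract.
titles = ["VideoCLIP", "CLIP2Video", "Kinetics-700", "UCF101"]
dates = [2021, 2021, 2019, 2012]
vectors = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.2], [0.2, 0.8, 0.1]]
print(magic_sort(titles, dates, vectors))   # topically adjacent titles end up next to each other
```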
Miscellaneous
Bibliography
- https://research.google/blog/taking-medical-imaging-embeddings-3d/ : “CT Foundation: Taking Medical Imaging Embeddings 3D”, Kiraly & Traverse 2024
- https://arxiv.org/abs/2307.05014 : “Test-Time Training on Video Streams”, Wang et al 2023
- https://arxiv.org/abs/2305.05665#facebook : “ImageBind: One Embedding Space To Bind Them All”, Girdhar et al 2023
- https://arxiv.org/abs/2302.05442#google : “Scaling Vision Transformers to 22 Billion Parameters”, Dehghani et al 2023
- https://arxiv.org/abs/2212.04979#google : “VideoCoCa: Video-Text Modeling With Zero-Shot Transfer from Contrastive Captioners”, Yan et al 2022
- https://arxiv.org/abs/2212.05051 : “VindLU: A Recipe for Effective Video-And-Language Pretraining”, Cheng et al 2022
- https://arxiv.org/abs/2209.14156 : “TVLT: Textless Vision-Language Transformer”, Tang et al 2022
- https://arxiv.org/abs/2208.03550 : “EVL: Frozen CLIP Models Are Efficient Video Learners”, Lin et al 2022
- https://arxiv.org/abs/2207.07285#alibaba : “X-CLIP: End-To-End Multi-Grained Contrastive Learning for Video-Text Retrieval”, Ma et al 2022
- https://arxiv.org/abs/2206.11795#openai : “Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos”, Baker et al 2022
- https://arxiv.org/abs/2206.08356#facebook : “OmniMAE: Single Model Masked Pretraining on Images and Videos”, Girdhar et al 2022
- https://arxiv.org/abs/2206.07160#microsoft : “LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling”, Li et al 2022
- https://arxiv.org/abs/2205.10747 : “VidIL: Language Models With Image Descriptors Are Strong Few-Shot Video-Language Learners”, Wang et al 2022
- https://arxiv.org/abs/2205.09113#facebook : “Masked Autoencoders As Spatiotemporal Learners”, Feichtenhofer et al 2022
- https://arxiv.org/abs/2204.00598#google : “Socratic Models: Composing Zero-Shot Multimodal Reasoning With Language”, Zeng et al 2022
- https://arxiv.org/abs/2203.11096 : “CLIP Meets GamePhysics: Towards Bug Identification in Gameplay Videos Using Zero-Shot Transfer Learning”, Taesiri et al 2022
- https://arxiv.org/abs/2201.12086#salesforce : “BLIP: Bootstrapping Language-Image Pre-Training for Unified Vision-Language Understanding and Generation”, Li et al 2022
- https://arxiv.org/abs/2111.11432#microsoft : “Florence: A New Foundation Model for Computer Vision”, Yuan et al 2021
- https://arxiv.org/abs/2107.14795#deepmind : “Perceiver IO: A General Architecture for Structured Inputs & Outputs”, Jaegle et al 2021
- https://arxiv.org/abs/2107.00650 : “CLIP-It! Language-Guided Video Summarization”, Narasimhan et al 2021
- https://arxiv.org/abs/2106.11097 : “CLIP2Video: Mastering Video-Text Retrieval via Image CLIP”, Fang et al 2021
- https://arxiv.org/abs/2103.07579#google : “Revisiting ResNets: Improved Training and Scaling Strategies”, Bello et al 2021
- https://ai.facebook.com/blog/learning-from-videos-to-understand-the-world/ : “Learning from Videos to Understand the World”, Zweig et al 2021
- https://arxiv.org/abs/2103.03206#deepmind : “Perceiver: General Perception With Iterative Attention”, Jaegle et al 2021
- https://cdn.openai.com/papers/Learning_Transferable_Visual_Models_From_Natural_Language_Supervision.pdf : “CLIP: Learning Transferable Visual Models From Natural Language Supervision”, Radford et al 2021
- https://arxiv.org/abs/2012.08508#deepmind : “Object-Based Attention for Spatio-Temporal Reasoning: Outperforming Neuro-Symbolic Models With Flexible Distributed Architectures”, Ding et al 2020
- https://arxiv.org/abs/2008.09037 : “Accuracy and Performance Comparison of Video Action Recognition Approaches”, Hutchinson et al 2020
- https://arxiv.org/abs/1905.00546#facebook : “Billion-Scale Semi-Supervised Learning for Image Classification”, Yalniz et al 2019
- https://arxiv.org/abs/1811.11721 : “CCNet: Criss-Cross Attention for Semantic Segmentation”, Huang et al 2018
- https://arxiv.org/abs/1808.01340#deepmind : “A Short Note about Kinetics-600”, Carreira et al 2018
- https://arxiv.org/abs/1705.07750#deepmind : “Quo Vadis, Action Recognition? A New Model I3D and the Kinetics Dataset”, Carreira & Zisserman 2017
- https://arxiv.org/abs/1608.03609 : “Clockwork Convnets for Video Semantic Segmentation”, Shelhamer et al 2016