MemexQA: Visual Memex Question Answering
Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alexander Hauptmann
Carnegie Mellon University, Customer Service AI, Yahoo

Demo Video
Introduction
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help the user recover their memory of events captured in the collection. Towards solving the task, we 1) present the MemexQA dataset, a large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions and answers, and 2) propose MemexNet, a unified, end-to-end trainable network architecture for image, text, and video question answering. Experimental results on the MemexQA dataset demonstrate that MemexNet outperforms strong baselines and yields the state of the art on this novel and challenging task. Promising results on TextQA and VideoQA further suggest MemexNet's efficacy and scalability across various QA tasks.
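To make the task setup concrete, the sketch below shows one plausible shape for a MemexQA example and the accuracy metric over multiple-choice questions. The field names (`question`, `albums`, `choices`, `answer`) are illustrative assumptions, not the official dataset schema.

```python
# A minimal sketch of the MemexQA task format and accuracy metric.
# Field names here are illustrative assumptions, not the dataset's schema.

def accuracy(model, examples):
    """Fraction of questions answered correctly.

    `model` maps (question, albums, choices) -> index of the chosen answer.
    """
    correct = 0
    for ex in examples:
        pred = model(ex["question"], ex["albums"], ex["choices"])
        if ex["choices"][pred] == ex["answer"]:
            correct += 1
    return correct / len(examples)


# Toy example with a trivial model that always picks the first choice.
examples = [
    {
        "question": "Where did we go last summer?",
        "albums": [{"title": "Summer trip", "photos": ["p1.jpg", "p2.jpg"]}],
        "choices": ["The beach", "The mountains", "The city", "The lake"],
        "answer": "The beach",
    },
]
first_choice = lambda question, albums, choices: 0
print(accuracy(first_choice, examples))  # → 1.0
```

A real model would replace `first_choice` with a network that reads the question and grounds it in the photos and their metadata before scoring each candidate answer.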
Figure: MemexQA examples. Top: sampled personal photos of a Flickr user; each photo collage corresponds to an album. Bottom: representative question-answer pairs.
Dataset
Table: Comparison with the representative VQA and Visual7W dataset.
Coming soon. In the meantime, you can explore the dataset.
Experiments
Table: Performance comparison on the MemexQA dataset.
Human Experiments We first examine human performance on MemexQA. We are interested in measuring 1) how well humans can perform on the MemexQA task, 2) how much each modality contributes to helping users answer questions, and 3) how long it takes humans to answer a MemexQA question.
Model Experiments We compare MemexNet with both classical and state-of-the-art VQA models. The table above reports the overall performance. MemexNet outperforms all baseline methods, with statistically significant differences, indicating that it is a promising network for vision-and-language reasoning on this new task. The significant gap between human and model accuracy, however, shows that MemexQA remains a very challenging AI task. To analyze the contribution of the MMLookupNet, we replace it with the average embedding of concepts and metadata and report the resulting accuracy. The notable performance drop, especially on "what" and "when" questions, suggests that MMLookupNet is beneficial in training MemexNet.
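The ablation baseline mentioned above, representing a photo by the average embedding of its concepts and metadata, can be sketched as simple mean pooling over an embedding table. The embedding table, token names, and dimensionality below are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Sketch of the ablation baseline: replace the learned MMLookupNet with the
# average embedding of a photo's concept and metadata tokens.
# The embedding table and tokens below are illustrative assumptions.

def average_embedding(tokens, table, dim=4):
    """Mean-pool the embeddings of known tokens; zero vector if none match."""
    vecs = [table[t] for t in tokens if t in table]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

table = {
    "beach": np.array([1.0, 0.0, 0.0, 0.0]),  # visual concept
    "2014":  np.array([0.0, 1.0, 0.0, 0.0]),  # metadata token (year)
}
v = average_embedding(["beach", "2014", "unknown"], table)
print(v)  # mean of the two known token embeddings
```

Unlike a learned lookup, this pooling treats all tokens equally, which plausibly explains the drop on "what" and "when" questions where individual concepts and timestamps matter most.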
Release Log