MemexQA: Visual Memex Question Answering
Lu Jiang, Junwei Liang, Liangliang Cao, Yannis Kalantidis, Sachin Farfade, Alexander Hauptmann
Carnegie Mellon University, Customer Service AI, Yahoo

Demo Video
This paper proposes a new task, MemexQA: given a collection of photos or videos from a user, the goal is to automatically answer questions that help users recover their memory about events captured in the collection. Towards solving the task, we 1) present the MemexQA dataset, a large, realistic multimodal dataset consisting of real personal photos and crowd-sourced questions/answers, and 2) propose MemexNet, a unified, end-to-end trainable network architecture for image, text and video question answering. Experimental results on the MemexQA dataset demonstrate that MemexNet outperforms strong baselines and achieves state-of-the-art results on this novel and challenging task. The promising results on TextQA and VideoQA suggest MemexNet's efficacy and scalability across various QA tasks.
Figure: MemexQA examples. Top: sampled personal photos of a Flickr user; each photo collage corresponds to an album. Bottom: representative question and answer pairs.
Table: Comparison with the representative VQA and Visual7W datasets.
Here we provide download links for the question-answer pairs with candidate answers for each question. The candidate answers are mainly automatically generated. We also provide raw images of the albums as well as image features.
Current version: 1.1
qas.json (Pretty print version) The question-answer pairs in JSON format. Total 20563 QAs.
album_info.json Album metadata in JSON format.
test_question.ids Question ID list for test set.
Full photo collection:
     all_photos.tgz (15GB) All photos used in the data collection.
5k photo collection:
      shown_photos.tgz (7.5GB) The 5k photo collection used in the QA verification. Please refer to the paper for further information.
      photos_inception_resnet_v2_l2norm.npz Image features for this photo collection.
Additional data:
      gcloud_concepts.p Visual concepts in the photos by Google Vision API.
      gcloud_ocrs.p OCR transcriptions from the photos by Google Vision API.
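
As a rough illustration of how qas.json and test_question.ids might be used together, the sketch below splits the QA pairs into training and test sets. It rests on assumptions that should be checked against the downloaded files: that qas.json is a top-level JSON list of objects, that each object carries its ID under a key named "question_id", and that test_question.ids holds one ID per line. The function names are hypothetical helpers, not part of the release.

```python
import json


def load_question_ids(path):
    """Read one question ID per line, skipping blank lines (assumed layout)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def split_qas(qas, test_ids, id_key="question_id"):
    """Partition QA records into (train, test) by membership in test_ids.

    id_key is an assumed field name; adjust it to match the actual file.
    """
    train, test = [], []
    for qa in qas:
        # IDs are compared as strings because the ID file is plain text.
        (test if str(qa[id_key]) in test_ids else train).append(qa)
    return train, test


def load_split(qas_path, ids_path, id_key="question_id"):
    """Load the QA file and the test ID list, then return (train, test)."""
    with open(qas_path, encoding="utf-8") as f:
        qas = json.load(f)  # assumed to be a list of dicts
    return split_qas(qas, load_question_ids(ids_path), id_key)
```

A call such as `train, test = load_split("qas.json", "test_question.ids")` would then give the two partitions, provided the assumptions above hold.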

Terms of use: by downloading the image data you agree to the following terms:
  1. You will NOT distribute the above images.
  2. Carnegie Mellon University makes no representations or warranties regarding the data, including but not limited to warranties of non-infringement or fitness for a particular purpose.
  3. You accept full responsibility for your use of the data and shall defend and indemnify Carnegie Mellon University, including its employees, officers and agents, against any and all claims arising from your use of the data, including but not limited to your use of any copies of copyrighted images that you may create from the data.
Table: Performance comparison on the MemexQA dataset.
Human Experiments We first examine human performance on MemexQA. We are interested in measuring 1) how well humans can perform in the MemexQA task, 2) how much each modality contributes to helping users answer questions, and 3) how long it takes humans to answer a MemexQA question.
Model Experiments We compare MemexNet with both classical and state-of-the-art VQA models. The table above reports the overall performance. MemexNet outperforms all baseline methods, and the differences are statistically significant. This result indicates that MemexNet is a promising network for vision and language reasoning on this new task. The significant gap between human and model accuracy, however, indicates that MemexQA remains a very challenging AI task. To analyze the contribution of MMLookupNet, we replace it with the average embedding of concepts and metadata and report the accuracy. The notable performance drop, especially on “what” and “when” questions, suggests that MMLookupNet is beneficial in training MemexNet.
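
To make the ablation baseline concrete, the following is a minimal sketch of an "average embedding" of concepts and metadata: each token is looked up in an embedding table and the resulting vectors are mean-pooled into one fixed-size vector. This is an illustration only, not the authors' implementation; the function name, the toy embedding table, and the token IDs are all hypothetical.

```python
def average_embedding(token_ids, embedding_table):
    """Mean-pool the embedding rows selected by token_ids.

    embedding_table is a list of equal-length float vectors, one per token.
    Returns a zero vector when token_ids is empty, so a photo without
    concepts or metadata still gets a fixed-size representation.
    """
    dim = len(embedding_table[0])
    if not token_ids:
        return [0.0] * dim
    rows = [embedding_table[i] for i in token_ids]
    # zip(*rows) iterates over columns, so each output entry is a per-dimension mean.
    return [sum(col) / len(rows) for col in zip(*rows)]


# Hypothetical toy setup: three tokens with 2-d embeddings.
table = [[1.0, 0.0], [0.0, 1.0], [3.0, 5.0]]

concept_ids = [0]    # e.g. IDs of visual concept tokens for one photo
metadata_ids = [2]   # e.g. IDs of metadata tokens for the same photo
photo_vec = average_embedding(concept_ids + metadata_ids, table)
```

Mean pooling of this kind ignores token order and weights every token equally, which is one plausible reason a learned lookup module could do better on questions that hinge on specific metadata.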
Release Log