Defining Image Memorability using
the Visual Memory Schema

The memorability of an image is a characteristic determined by human observers' ability to remember images they have seen. We propose a new concept called the Visual Memory Schema (VMS), which refers to an organization of image components that human observers share when encoding and recognizing images. The concept of VMS is operationalised by asking human observers to define memorable regions of images they were asked to remember during an episodic memory test. We then statistically assess the consistency of VMSs across observers for either correctly or incorrectly recognised images. The associations of the VMSs with eye fixations and saliency are analysed separately as well. Finally, we adapt two different deep learning architectures to predict the image memorability selections made by human subjects, and analyse the results when using transfer learning at the outputs of various network layers.

author = {Erdem Akagündüz and Adrian G. Bors and Karla K. Evans},
title = {Defining Image Memorability using the Visual Memory Schema},
journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
year = {2018},
volume = {??},
pages = {????--????},
number = {?},
month = {????}}


Dr. Erdem Akagündüz

Assistant Professor
Dept. of Electrical and Electronics Engineering
Çankaya University, Turkey


Image Categories

The memorability experiment we conducted used 800 images selected from the Fine-Grained Image Memorability (FIGRIM) set [Bylinskii et al. 2015]. The FIGRIM image set is composed of 1754 target images (i.e. images with memorability scores obtained from human observers) from 21 different scene categories, each with at least 300 images of 700x700 pixels or higher resolution, selected from the SUN image set. A subset of target images from the FIGRIM dataset additionally includes corresponding mappings of the observers' eye-movement locations recorded during the memory test. For the FIGRIM memory experiments, 120 images representing a mix of target and filler images were presented to human observers for 1s each. Both within-category and across-category experiments were conducted, so two separate memorability scores exist for each image [Bylinskii et al. 2015].

The 800 images used in our experiment, to which we refer as the VISCHEMA image set, are selected from 12 image categories of the FIGRIM/SUN image sets. These 12 categories are characterized by either very low or very high memorability scores, such as the Mountain and Playground categories, respectively. We redefine the image categories by producing a hierarchical structure in which images are first categorized as Indoor or Outdoor scenes. Indoor scenes are then labelled as either Private or Public, and Outdoor scenes as either Man-made or Natural, as shown in the category hierarchy figure. The categorization continues with a further division into subordinate FIGRIM/SUN categories: Kitchen (100), Living room (100), Air terminal (100), Conference room (100), Amusement park (44), Playground (56), House (66), Skyscraper (34), Golf course (58), Pasture (42), Badlands (47), Mountain (53), where the number of images in each subcategory is indicated in parentheses. As seen in the figure below, there are 100 images for each of the 8 leaf-categories, namely kitchen, living-room, big, small, public-entertainment, work-home, populated and isolated. Each leaf-category includes images from one or more categories of the FIGRIM/SUN image sets.
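As a quick consistency check on the counts listed above, the following Python sketch records the 12 FIGRIM/SUN subcategories and confirms that they add up to the 800 images of the VISCHEMA set. The variable names are illustrative only, and no particular mapping from subcategories to leaf-categories is assumed.

```python
# Image counts per FIGRIM/SUN subcategory, copied from the text above.
# The dictionary name is illustrative and not part of the released dataset.
subcategory_counts = {
    "Kitchen": 100,
    "Living room": 100,
    "Air terminal": 100,
    "Conference room": 100,
    "Amusement park": 44,
    "Playground": 56,
    "House": 66,
    "Skyscraper": 34,
    "Golf course": 58,
    "Pasture": 42,
    "Badlands": 47,
    "Mountain": 53,
}

num_subcategories = len(subcategory_counts)
total_images = sum(subcategory_counts.values())

# With 8 leaf-categories, this corresponds to 100 images per leaf-category.
images_per_leaf = total_images // 8

print(num_subcategories, total_images, images_per_leaf)
```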

The Memory Experiment

To collect the VMSs for the 800 images, 90 subjects aged between 19 and 30, recruited from the students and staff of the University of York, UK, took part in a memory experiment that consisted of two stages. During the first stage, or study phase, all participants were shown 400 images from the 8 leaf categories in a randomized order. Each image was shown for 3 seconds, so the study phase lasted a total of 20 minutes. The participants were asked to do their best to memorize the images they saw on a computer screen, in a quiet and darkened room.

The first stage was immediately followed by the second stage (or test phase), in which the participants were shown another group of 400 images in a randomized order, 200 of which were repetitions from the first stage. As in the first stage, the category distribution was uniform, such that 50 out of the 100 images from each of the 8 leaf categories were shown. During the test phase, 1) the participants were asked to rate how well they remembered the image using a continuous rating bar from "not seen" to "definitely seen", and 2) if they thought they remembered the image well enough (i.e. by placing the rating bar above the predefined threshold of 30%), they were asked to select at least 1 and at most 3 rectangular regions, of a size determined by the observer, that made them remember that image.

Each participant saw 600 different images in a single experiment: 200 repeat images, 200 non-repeat images (first-stage fillers) and 200 new images (second-stage fillers). Each image was shown in the test phase for region selection approximately 45 times across participants (90 subjects × 400 test-phase images / 800 total images), ensuring an equal probability of observation for each image.
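The arithmetic behind these figures can be reproduced with a few lines of Python. This is only a sketch of the bookkeeping described above; the variable names are illustrative and the numbers are taken directly from the text.

```python
# Experiment parameters as stated in the text.
num_subjects = 90
total_images = 800
study_images_per_subject = 400   # first stage (study phase)
test_images_per_subject = 400    # second stage (test phase)
repeats_per_subject = 200        # test images already seen in the study phase
seconds_per_study_image = 3

# Average number of test-phase presentations per image across participants.
views_per_image = num_subjects * test_images_per_subject / total_images

# Distinct images seen by one participant: study images plus new test images.
new_test_images = test_images_per_subject - repeats_per_subject
distinct_images_per_subject = study_images_per_subject + new_test_images

# Duration of the study phase in minutes.
study_minutes = study_images_per_subject * seconds_per_study_image / 60

print(views_per_image, distinct_images_per_subject, study_minutes)
```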

Sample VMS maps

VMSs are single-channel maps with the same resolution as the image. When constructing a VMS for an image, the human annotations are added on top of each other and normalized by the number of participants that annotated the image. Thus, a VMS is a 2D probability distribution function (PDF) over pixel locations, tied to the specific scene information visualized in the image. It indicates the probability that a specific image region is selected by an observer as memorable. In other words, the brighter a VMS pixel is, the more likely that location is to be remembered by a human observer. It is important to note that the VMS is a map constructed from human observer responses that define the most memorable regions of images, unlike the memorability maps in Khosla et al. (2012), which were based on automatic machine computations. Furthermore, the VMS represents both true and false memorability of a region, which provides a different and improved concept of region memorability compared to previous studies on the subject.
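The accumulation and normalization described above can be sketched in Python with NumPy. This is a minimal illustration rather than the code used for the published maps: the function name `build_vms`, the rectangle format `(x0, y0, x1, y1)` with exclusive end coordinates, and the choice to count overlapping rectangles from the same participant only once are all assumptions made for this example.

```python
import numpy as np

def build_vms(image_shape, annotations):
    """Accumulate rectangular selections into a single-channel map.

    image_shape: (height, width) of the image.
    annotations: one list of rectangles per participant who annotated
        the image; each rectangle is (x0, y0, x1, y1) in pixels, with
        x1 and y1 exclusive (an assumed format).
    Returns a float map in [0, 1] with the same resolution as the image.
    """
    height, width = image_shape
    vms = np.zeros((height, width), dtype=np.float64)
    for rectangles in annotations:
        # Assumption: each participant contributes at most 1 per pixel,
        # even if their own rectangles overlap.
        mask = np.zeros((height, width), dtype=bool)
        for x0, y0, x1, y1 in rectangles:
            mask[y0:y1, x0:x1] = True
        vms += mask
    if annotations:
        # Normalize by the number of participants who annotated the image.
        vms /= len(annotations)
    return vms

# Toy example: two participants annotate a 4x4 image.
example_annotations = [
    [(0, 0, 2, 2)],   # participant 1 selects the top-left 2x2 block
    [(1, 1, 3, 3)],   # participant 2 selects a block shifted by one pixel
]
example_vms = build_vms((4, 4), example_annotations)
```

In this toy case the only pixel selected by both participants, at row 1 and column 1, receives the value 1.0, pixels selected by one participant receive 0.5, and the rest remain 0. The same function could be applied separately to the selections made on correctly and incorrectly recognised images to obtain the true and false maps.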

The least memorable, a moderately memorable and the most memorable image for each leaf category are shown together with their true and false VMSs. The hit rate (HR) and false alarm rate (FAR) scores are indicated as well.


The set of original images used in the experiment, extracted from the SUN image database, and the memorisation selections made on them during the experiments:

Download Original Image Dataset

Download Memorised Selections from Images

Download Matlab Code used for these images