A Synchronization Ground Truth for the Jiku Mobile Video Dataset

M. Guggenberger, M. Lux, and L. Böszörmenyi. A Synchronization Ground Truth for the Jiku Mobile Video Dataset. In MultiMedia Modeling, Lecture Notes in Computer Science, 8936. Springer International Publishing, 2015.

Paper | Git repository | ZIP archive

Synchronization Overview

The following pictures provide, for each event, a graphical overview of the synchronized tracks ordered on a timeline. The colors of the tracks mark groups of overlapping recordings (clusters). Tracks in white with only a horizontal line in the middle are muted tracks that have been excluded from the ground truth because they are non-continuous recordings. Each cluster contains one track with a red background; these are the recordings whose original timestamps have been used to order the clusters correctly in time.

GT_090912

NAF_160312

NAF_230312

RAF_100812

SAF_290512

Geodata Map

The geodata map does not belong to the actual ground truth described in the paper but is supplied as an additional resource.

Because sound travels far more slowly than light, every recording has an offset between its audio and video tracks that depends on the distance from which the scene was recorded. A distance of 10 meters corresponds to an offset of about 30 milliseconds. While this by itself is usually not noticeable, especially since the audio lags behind the video, which the human brain is used to, it can become noticeable when synchronizing multiple recordings. Since the ground truth is based on synchronizing the audio tracks of the recordings, the video tracks are necessarily offset from one another, because they were recorded at varying distances from the respective target scene.
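The arithmetic behind the 30 millisecond figure can be sketched in a few lines of Python. This is only an illustration: the speed of sound of 343 m/s (dry air at roughly 20 °C) is an assumption not stated in the paper, the travel time of light is neglected, and the function names are hypothetical.

```python
# Assumption: speed of sound in dry air at about 20 degrees Celsius.
SPEED_OF_SOUND_M_PER_S = 343.0


def audio_delay_ms(distance_m, speed_of_sound=SPEED_OF_SOUND_M_PER_S):
    """Delay of the audio behind the video for a scene recorded from distance_m.

    The travel time of light is neglected, so the delay is simply the
    travel time of sound over the given distance, in milliseconds.
    """
    if distance_m < 0:
        raise ValueError("distance must be non-negative")
    return distance_m / speed_of_sound * 1000.0


def relative_video_offset_ms(distance_a_m, distance_b_m):
    """Video offset between two recordings whose audio tracks are aligned.

    A positive result means the audio of recording A was delayed more than
    that of recording B, so after audio alignment the video of A shows the
    same visual event earlier than the video of B by this amount.
    """
    return audio_delay_ms(distance_a_m) - audio_delay_ms(distance_b_m)


if __name__ == "__main__":
    print(round(audio_delay_ms(10.0), 1))                   # about 29.2 ms
    print(round(relative_video_offset_ms(30.0, 10.0), 1))   # about 58.3 ms
```

Two recordings made from 30 and 10 meters would thus end up with video tracks roughly 58 milliseconds apart once their audio tracks are synchronized, which is more than one frame at common frame rates.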

Since the Jiku dataset also contains additional metadata, including the geographical recording positions and the compass direction the devices were pointed at, the idea was to calculate the distances to the recorded scenes and compensate for the video offsets. We hypothesized that the target scene (e.g. a stage) could be located in the area where the line-of-sight vectors show their maximum overlap or concentration. The map below is the graphical representation of this idea and served as a first visual validation step.
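One simple way to turn "maximum concentration of line-of-sight vectors" into a number is a least-squares intersection: the point that minimizes the sum of squared perpendicular distances to all lines of sight. The sketch below is not the method used for the map; it assumes that recording positions have already been projected to local metric coordinates (east and north, in meters) and that compass bearings are given in degrees clockwise from north. It also treats each line of sight as an infinite line, so it ignores which way along the line the device was actually facing. All names are hypothetical.

```python
import math


def bearing_to_unit_vector(bearing_deg):
    """Convert a compass bearing (degrees clockwise from north) to an
    (east, north) unit vector."""
    rad = math.radians(bearing_deg)
    return math.sin(rad), math.cos(rad)


def estimate_scene_position(observations):
    """Least-squares intersection of lines of sight in the plane.

    observations: iterable of (east_m, north_m, bearing_deg) tuples.
    Returns the (east_m, north_m) point with the smallest sum of squared
    perpendicular distances to all lines of sight.
    """
    a11 = a12 = a22 = b1 = b2 = 0.0
    for east, north, bearing in observations:
        dx, dy = bearing_to_unit_vector(bearing)
        # Projector onto the normal of the line of sight: I - d d^T.
        m11, m12, m22 = 1.0 - dx * dx, -dx * dy, 1.0 - dy * dy
        a11 += m11
        a12 += m12
        a22 += m22
        b1 += m11 * east + m12 * north
        b2 += m12 * east + m22 * north
    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-9:
        raise ValueError("lines of sight are (nearly) parallel")
    # Solve the symmetric 2x2 system with Cramer's rule.
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y


if __name__ == "__main__":
    # Three hypothetical devices all pointing at the spot (10, 10).
    devices = [(0.0, 0.0, 45.0), (20.0, 0.0, 315.0), (10.0, 0.0, 0.0)]
    east, north = estimate_scene_position(devices)
    print(round(east, 3), round(north, 3))
```

With clean input the lines meet in a single point and the estimate is exact. With noisy or inconsistent bearings the estimate still exists but says little on its own, so the residual distances would need to be inspected before trusting the result.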

Unfortunately, the map demonstrates that our hypothesis probably does not hold. Additionally, the supplied metadata is very incomplete. The GPS location data is missing for about half of the recordings, and the location accuracy values do not make much sense. No unit is specified, the values are spread between 0 and 700000 (which device gets a GPS fix with an inaccuracy of 700000 meters?), and, unlike for the compass data, there is no raw location data to help in reconstructing and understanding the values.