NOTE: A Newer version of this tutorial may exist. Click here to go to the never version

<--Back to Projects

Automatic Foreground Segmentation in Surveillance Videos

Chris Tralie

ELE 488 Fall 2010 Final Project

Development Environment: MS Visual Studio 2008 under Windows 7


The purpose of this project is to implement and compare two techniques for automatic foreground detection in videos, primarily in surveillance videos where the camera position doesn't vary much. Most automatic foreground segmentation techniques will try to learn some sort of a background model and classify foreground pixels as those that are "different enough" from that model as objects move through the scene. The challenges come with learning the background model from scratch in the face of data that is both part of the foreground and background, adapting the background model to deal with illumination or other changes, and determining what triggers a pixel to be classified as a foreground pixel.

Some of the earlier methods tried to come up with a stochastic background model in advance and then proceeded to use these static models to classify objects throughout [4]. However, these methods break for even the slightest change in the background that validates the stationarity assumption. Clearly some sort of adaptive technique is needed to model the background. Kalman filter approaches have been used as an adaptive technique to solve this problem [4]. However, just like with any application of the Kalman filter, its application suffers from some of the limitations of a unimodal distribution. For instance, if the room changes illumination (different times of day), the Kalman Filter can only represent one of the illumination states at any given time. Hence, it will take a long time to converge to the new mean, etc.

The first technique I implemented is an extension of the Kalman Filter methods to tracking a multimodal distribution at each pixel, using a Mixture of Gaussians[1][2]. A (semi) real-time approximation to expectation maximization is used to track the means and standard deviations of the multiple hypotheses over time. The idea is to learn models for K different states over time, some of which can represent different foreground models that pass through the scene, and others for representing various background techniques. Criticism of this technique has been its slow convergence, which I will explore in this project.

The other technique I implemented is a completely different approach to the problem entitled "A transform domain approach to real-time foreground segmentation in video sequences"3 (which was actually co-written by the professor of this course back in 2005). I refer to this technique more compactly as the DCT-Based Segmentation in this writeup, because it relies solely upon the DCT coefficients of each 8x8 block to do the segmentation. The motivation for this will become clear later when I get into the guts of this method.

Test Videos

I tried to get a variety of test cases that span different types of illumination changes, different types of scene changes, and different distributions of color content to rigorously test my implementations of the algorithms. All videos can be found in this directory. The descriptive list is below, where I ordered the videos in ascending order of projected difficulty:


Before I can use my programs, my video files need to be split up into a sequency of JPEG images using ffmpeg. I relied heavily on two libraries to load these JPEG images into memory various ways. The first is the CImage library[5], which I used to load images from JPEG files, to create a user interface that shows real-time segmentation, and to save segmentation results to a sequence of bitmap files.

The second library I used was the Independent JPEG Group (IJG)'s library libjpeg[6]. This library allowed me to conveniently extract the luminance component of the 64 DCT coefficients in each block. This was a little tricky and required some Googling, but once I got it to work it allowed me to take advantage of the fact that the DCT coefficients are already computed in JPEG images, so that I didn't have to do re-do the forward DCT myself in the DCT-Based Tracker (hence saving valuable computation time).

As the segmentation takes place, I output a sequence of bitmap images that can later be consolidated back into a single video file using ffmpeg.

Mixture of Gaussians Segmentation

Stauffer and Grimson of MIT's AI Labs introduced the first multiple hypothesis tracker for a background models using a Mixture of Gaussians[1]. I also drew on a paper a few years later that took a more careful theoretical look at the algorithm to help me choose values for the parameters mentioned in the Stauffer/Grimson paper[2]. Let me now briefly review the algorithm in pseudocode:

Algorithm Pseudocode
Algorithm Results
This algorithm runs at about 0.5 frames per second on my laptop (so it's not feasible as a real-time algorithm

NOTE: Since I intialize the modes to be pure red, green, blue, white, and black, it's unlikely that any of them will completely represent what's actually going on in the scene. So it usually takes about 10-20 frames for the model to latch onto the background; a new mode is generated at each pixel that actually represents the actual background color at the first frame and its weight slowly increase proportionally to Alpha. In the mean time, almost everything gets classified as foreground (the first 10-20 frames are very green, and then the method suddenly snaps into place). Here are the results for each video:

I was overall impressed with the results of the MOG method; they exceeded my expectations. I was warned that MOG takes an extremely long time to converge but with my choice of Alpha = 0.01, it only ended up taking between 10-20 frames for the new background to be modeled properly. Of course, this can be too long and very noticeable, especially in with changes of illumination, so hopefully the next method will deal better with illumination changes.

DCT-Based Segmentation

The DCT-based segmentation technique presented in [3] approaches the problem from a completely different angle. Instead of trying to keep multiple hypotheses to track different states of illumination, this tries to deal with illumination offsets directly by using information in the frequency domain. The idea is that illumination should primarily only affect the zeroth DCT coefficient (the DC offset), while the AC coefficients should remain the same. Since the algorithm only relies on DCT coefficients, which can be precomputed and stored in JPEG images, and since it deals with 8x8 blocks at a time instead of pixel by pixel (a factor of 64 decrease in the amount of objects that need to be tracked in the same video), it can run real-time. Here is pseudocode for the algorithm in more detail:

Algorithm Pseudocode
Algorithm Results
NOTE: In all of these examples, a red X is drawn over a block with both fAC and fDC triggered, while a green X is drawn over a block with the fDC threshold triggered only.

Overall, this technique does much better than the MOG technique at resisting false positives due to changes in illumination (not suprising since that what it was designed to do), and it's over an order of magnitude faster (real time on my laptop). However, it doesn't seem to have much of an advantage with scene changes, nor in static scenes with very little illumination change. Also, unfortunately, my implementation also has an unresolved problem with other noisy false positives, but that could probably be resolved with some more implementation tweaking.

Program Usage / Code Setup

I set up two MS Visual Studio 2008 projects for the two different techniques implemented. The first project, bgmog, contains the following files:

bgdct.cppMain program, responsible for training on the first 20 frames, adapting the parameters, and classifying based on the parameters
SegImage.cpp, SegImage.hA class that stores the fac and fdc coefficient arrays and the classification results, along with functions to draw the classification results (red and green Xs).

bgmog.cppMain program; all it does is load the filesand set up a single MOGTracker
MOGTracker.cpp, MOGTracker.hTracks the multiple hypotheses for each pixel. Accepts an image per frame, updates the image, and draws the segmentation results as green blobs once it has finished updating the models based on this frame

A typical execution run for either bgmog or bgdct is as follows:
bgmog/bgdct [image directory in] [image directory out] [n frames]
The programs expect that "image directory in" contains a set of .jpg frame images numbered starting at 1, and they expect that the image directory out has alreay been created


[1] Stauffer, C.; Grimson, W.E.L.; , "Adaptive background mixture models for real-time tracking," Computer Vision and Pattern Recognition, 1999. IEEE Computer Society Conference on. , vol.2, no., pp.2 vol. (xxiii+637+663), 1999
doi: 10.1109/CVPR.1999.784637

[2] P.W. Power, J.A. Schoonees, "Understanding background mixture models for foreground segmentation," in: Proceedings of the Image and Vision Computing New Zealand, 2002, pp. 267-271.
[3] Zhu J, Schwartz SC, Liu B (2005) "A transform domain approach to real-time foreground segmentation in video sequences." In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Philadelphia, USA, pp. 685-688.
[4] Lecture notes on tracking in images by Szymon Rusinkiewicz, COS 429 Computer Vision Fall 2009
[5] The open-source "CImage Library."
[6] "LibJPEG". The Independent JPEG Group.
[7] "19 ffmpeg commands for all needs."