In this blog, we will learn about object tracking using TLD. TLD stands for Tracking, Learning and Detection.
What is object tracking?
Locating an object in successive frames of a video is called object tracking. A related term is object detection.
Object detection is the task of localizing objects in an input image. The definition of an “object” varies: it can be a single instance or a whole class of objects. Object detection methods are typically based on local image features or a sliding window.
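To make the sliding-window idea concrete, here is a minimal sketch in Python. The function name `sliding_windows`, the window size, and the step are illustrative choices, not part of TLD or of any library. A detector would score each yielded window with a classifier.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    step: int) -> Iterator[Box]:
    """Yield every win_w x win_h window that fits fully inside the image."""
    for y in range(0, img_h - win_h + 1, step):
        for x in range(0, img_w - win_w + 1, step):
            yield (x, y, win_w, win_h)


# Example values only: an 8x6 image scanned with a 4x4 window, step 2.
windows = list(sliding_windows(img_w=8, img_h=6, win_w=4, win_h=4, step=2))
print(len(windows), windows[0], windows[-1])
```

A smaller step finds objects more precisely but produces many more windows to classify, which is why real detectors try to reject most windows cheaply.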
How can we do object tracking?
There are many algorithms for object tracking, and every algorithm has its own pros and cons. In this article, we will discuss only the TLD algorithm.
Pros of TLD
1. It works well under occlusion.
2. It is good at learning the appearance of the object.
Cons of TLD
1. It does not work well when the object rotates by about 90 degrees or more.
2. It can struggle when the object disappears from the frame.
TLD works as follows: we mark the object in the first frame with a rectangle (a bounding box) to indicate its location. The object is then tracked in subsequent frames using the tracking algorithm.
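The "mark the first frame, then follow" workflow can be sketched as a small loop. The sketch assumes a tracker object with `init(frame, box)` and `update(frame)` methods, where `update` returns a success flag and a box. OpenCV's contrib build offers a TLD tracker with calls of this shape (`cv2.legacy.TrackerTLD_create()` in recent 4.x releases, `cv2.TrackerTLD_create()` in older ones), but check your installed version. `track_sequence` and `ShiftTracker` are hypothetical names used only for this illustration.

```python
from typing import Iterable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def track_sequence(tracker, frames: Iterable, init_box: Box) -> List[Optional[Box]]:
    """Initialize on the first frame, then update on every later frame.

    Returns one entry per later frame: a box, or None when the tracker
    reports that the object was not found.
    """
    frames = iter(frames)
    first = next(frames)
    tracker.init(first, init_box)
    results: List[Optional[Box]] = []
    for frame in frames:
        ok, box = tracker.update(frame)
        results.append(tuple(int(v) for v in box) if ok else None)
    return results


class ShiftTracker:
    """Hypothetical stand-in tracker, used only to exercise the loop.

    It moves the box one pixel right per frame and reports failure on
    any frame equal to the string "hidden".
    """

    def init(self, frame, box):
        self.box = box

    def update(self, frame):
        if frame == "hidden":
            return False, self.box
        x, y, w, h = self.box
        self.box = (x + 1, y, w, h)
        return True, self.box


demo = track_sequence(ShiftTracker(), ["f0", "f1", "hidden", "f3"], (10, 20, 30, 40))
print(demo)
```

With a real video, the frames would come from a video reader and the stand-in would be replaced by an actual tracker object.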
First, we define our goal.
Objective: Given a bounding box defining the object of interest in a single frame, our goal is to automatically determine the object’s bounding box, or indicate that the object is not visible, in every frame that follows.
The video stream is to be processed at frame rate, and the process should run indefinitely. We refer to this task as long-term tracking.
Frame rate: the frequency at which consecutive images, called frames, appear on a display.
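Processing at frame rate means that each frame has to be handled before the next one arrives, so the time budget per frame is 1 / fps. A quick calculation, with 25 frames per second picked purely as an example value:

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame, in milliseconds."""
    return 1000.0 / fps


budget = frame_budget_ms(25)  # 25 fps is an example value, not a TLD requirement
print(budget)
```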
A long-term tracker has to:
1. Detect the object when it reappears in the camera’s field of view.
2. Handle scale and illumination changes.
3. Handle background clutter.
4. Handle partial occlusions.
5. Operate in real time.
Long-term tracking can be approached from either a tracking or a detection perspective.
More about TLD
1. The tracker follows the object from frame to frame.
2. The learning component estimates the detector’s errors and updates the detector to avoid these errors in the future.
3. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary.
The block diagram of the TLD framework is below.
Why tracking and detection together?
The starting point of TLD is the acceptance of the fact that neither tracking nor detection can solve the long-term tracking task independently. However, if they operate simultaneously, each can benefit from the other.
1. A tracker can provide weakly labeled training data for a detector and
thus improve it during run-time.
2. A detector can re-initialize a tracker and thus minimize the tracking
failures.
3. Each sub-task is addressed by a single component and the
components operate simultaneously.
4. The tracker follows the object from frame to frame. The detector
localizes all appearances that have been observed so far and corrects the
tracker if necessary.
5. The learning component estimates the detector’s errors and updates the detector to avoid these errors in the future.
More about TLD Framework
Framework: a basic structure underlying a system, concept, or text.
1. TLD is a framework designed for long-term tracking of an unknown object in a video stream.
2. The tracker estimates the object’s motion between consecutive frames under the assumption that the frame-to-frame motion is limited and the object is visible.
3. The tracker is likely to fail and never recover if the object moves out of the camera view.
4. The detector treats every frame as independent and performs a full scan of the image to localize all appearances that have been observed and learned in the past.
5. The learning component observes the performance of both the tracker and the detector, estimates the detector’s errors, and generates training examples to avoid these errors in the future. The learning component assumes that both the tracker and the detector can fail.
6. By virtue of the learning, the detector generalizes to more object appearances and discriminates against the background.
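The interplay of the three components can be sketched as a single per-frame step. This is a simplified illustration, not the reference implementation: the function name `tld_step`, the callables `track`, `detect` and `learn`, and the rule of keeping the single most confident candidate are all assumptions made for the example.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]   # (x, y, width, height)
Scored = Tuple[Box, float]        # (box, confidence)


def tld_step(frame,
             prev_box: Optional[Box],
             track: Callable[[object, Box], Optional[Scored]],
             detect: Callable[[object], List[Scored]],
             learn: Callable[[object, Box, List[Scored]], None]) -> Optional[Box]:
    """Run tracker, detector and learning once; return the box or None."""
    # Tracker: needs the previous location, so it cannot recover on its own.
    tracked = track(frame, prev_box) if prev_box is not None else None
    # Detector: scans the whole frame, independently of earlier frames.
    detections = detect(frame)
    candidates = ([tracked] if tracked is not None else []) + detections
    if not candidates:
        return None  # object reported as not visible
    best_box, _ = max(candidates, key=lambda c: c[1])
    # Learning: simplified here to "update whenever the tracker produced a box".
    if tracked is not None:
        learn(frame, best_box, detections)
    return best_box


# Hypothetical stand-ins, used only to exercise the step function.
def demo_track(frame, box):
    x, y, w, h = box
    return ((x + 1, y, w, h), 0.6)


def demo_detect(frame):
    return [((50, 50, 10, 10), 0.9)] if frame == "reappear" else []


learned = []


def demo_learn(frame, box, detections):
    learned.append(box)


box_a = tld_step("f1", (0, 0, 10, 10), demo_track, demo_detect, demo_learn)
box_b = tld_step("f2", None, demo_track, demo_detect, demo_learn)
box_c = tld_step("reappear", None, demo_track, demo_detect, demo_learn)
print(box_a, box_b, box_c)
```

In the last call the tracker has no previous box, so only the detector can bring the object back. This is the re-initialization role described above.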
We have talked about tracking and detection. Now we will talk about the learning portion.
In TLD, we use P-N learning, or more precisely, P-expert and N-expert learning.
P-N learning estimates the detector’s errors with a pair of “experts”:
1. The P-expert estimates missed detections (false negatives).
2. The N-expert estimates false alarms (false positives).
The learning process is modeled as a discrete dynamical system, and the conditions under which the learning guarantees improvement are found.
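As a rough illustration of what "discrete dynamical system" means here, the sketch below iterates a simple linear recursion on the number of false positives (`alpha`) and false negatives (`beta`) of the detector. The form of the recursion follows our reading of the P-N learning formulation, and the precision and recall values of the experts are made up for the example, so treat both as assumptions. `pn_errors` is a hypothetical helper name.

```python
def pn_errors(alpha, beta, p_pos, r_pos, p_neg, r_neg, steps):
    """Iterate the error recursion and return (alpha, beta) after `steps`.

    alpha: false positives of the detector, beta: false negatives.
    p_pos, r_pos: precision and recall of the P-expert.
    p_neg, r_neg: precision and recall of the N-expert.
    """
    for _ in range(steps):
        alpha, beta = (
            # N-expert removes false positives; P-expert mistakes add new ones.
            (1 - r_neg) * alpha + (1 - p_pos) / p_pos * r_pos * beta,
            # P-expert removes false negatives; N-expert mistakes add new ones.
            (1 - p_neg) / p_neg * r_neg * alpha + (1 - r_pos) * beta,
        )
    return alpha, beta


# Example values only: fairly precise experts versus imprecise experts.
good = pn_errors(100.0, 100.0, p_pos=0.9, r_pos=0.5, p_neg=0.9, r_neg=0.5, steps=50)
bad = pn_errors(100.0, 100.0, p_pos=0.3, r_pos=0.5, p_neg=0.3, r_neg=0.5, steps=50)
print(good, bad)
```

With the precise experts both error counts shrink towards zero, while with the imprecise ones they grow. This is the intuition behind having conditions under which learning improves the detector.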
P-expert
1. The P-expert exploits the temporal structure in the video and assumes that the object moves along a trajectory.
2. The P-expert remembers the location of the object in the previous frame and estimates the object location in the current frame using a frame-to-frame tracker.
3. If the detector labeled the current location as negative (i.e. made a false negative error), the P-expert generates a positive example.
4. The goal of the P-expert is to discover new appearances of the object and thus increase the generalization of the object detector.
5. The P-expert can exploit the fact that the object moves on a trajectory and add positive examples extracted from such a trajectory.
6. In every frame, the P-expert outputs a decision about the reliability of the current location (the P-expert is an online process). If the current location is reliable, the P-expert generates a set of positive examples that update the object model and the ensemble classifier.
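The core rule of the P-expert (points 3 and 6 above) can be sketched in a few lines. This is a simplified illustration: `p_expert`, `detector_label` and the `tracker_reliable` flag are hypothetical names, and a real implementation would generate several warped examples rather than storing a single box.

```python
def p_expert(tracked_box, tracker_reliable, detector_label, positives):
    """Add tracked_box to `positives` if the detector missed it.

    detector_label(box) returns True when the detector classifies the
    box as the object. Returns True when a new example was added.
    """
    if tracked_box is None or not tracker_reliable:
        return False  # no trustworthy location, so do not learn from it
    if detector_label(tracked_box):
        return False  # detector already agrees, nothing to correct
    positives.append(tracked_box)  # false negative found: new positive example
    return True


positives = []
# Hypothetical detector that only recognizes boxes starting at x == 0.
knows_only_origin = lambda box: box[0] == 0

added_1 = p_expert((5, 5, 10, 10), True, knows_only_origin, positives)
added_2 = p_expert((0, 5, 10, 10), True, knows_only_origin, positives)
added_3 = p_expert((9, 9, 10, 10), False, knows_only_origin, positives)
print(positives)
```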
N-expert
1. The N-expert exploits the spatial structure in the video and assumes that the object can appear at a single location only.
2. The N-expert analyzes all responses of the detector in the current frame and the response produced by the tracker, and selects the one that is the most confident.
3. Patches that do not overlap with the maximally confident patch are labeled as negative. The maximally confident patch re-initializes the location of the tracker.
4. The N-expert generates negative training examples. Its goal is to discover clutter in the background against which the detector should discriminate.
5. The key assumption of the N-expert is that the object can occupy at most one location in the image. Therefore, if the object location is known, the surroundings of that location are labeled as negative.
6. The N-expert is applied at the same time as the P-expert, i.e., if the trajectory is reliable.
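The N-expert rule can be sketched with a standard overlap measure (intersection over union). The helper names `overlap` and `n_expert`, and the overlap threshold of 0.2 below which a patch counts as "not overlapping", are assumptions chosen for this illustration.

```python
def overlap(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def n_expert(responses, max_overlap=0.2):
    """Pick the most confident response and label far-away ones negative.

    responses: list of (box, confidence) from the detector and the tracker.
    Returns (best_box, negatives); best_box is None if there are no responses.
    """
    if not responses:
        return None, []
    best_box, _ = max(responses, key=lambda r: r[1])
    negatives = [box for box, _ in responses
                 if overlap(box, best_box) < max_overlap]
    return best_box, negatives


responses = [((0, 0, 10, 10), 0.9),    # most confident patch
             ((1, 1, 10, 10), 0.7),    # overlaps heavily, so it is left alone
             ((50, 50, 10, 10), 0.6)]  # far away, so it becomes a negative
best, negatives = n_expert(responses)
print(best, negatives)
```

The heavily overlapping patch is neither positive nor negative here, which avoids punishing the detector for slightly shifted but correct responses.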
For the
update of the object detector and the ensemble classifier, we consider only
those patches that were not rejected by either the variance filter or the
ensemble classifier.
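A variance filter is a cheap first test that rejects patches with too little contrast, such as sky or a plain wall. Below is a minimal sketch, assuming grayscale patches given as nested lists and a rejection threshold of 50% of the variance of the initially selected patch. That ratio is our understanding of the usual TLD setup, so treat it as an assumption. The ensemble classifier stage is not shown.

```python
def variance(patch):
    """Gray-value variance of a patch given as a list of pixel rows."""
    pixels = [p for row in patch for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def passes_variance_filter(patch, reference_variance, ratio=0.5):
    """Keep the patch only if it has enough contrast relative to the object."""
    return variance(patch) >= ratio * reference_variance


# Example patches only: a high-contrast "object" and an almost flat background.
object_patch = [[0, 255], [255, 0]]
flat_patch = [[100, 100], [100, 101]]
reference = variance(object_patch)

keep_object = passes_variance_filter(object_patch, reference)
keep_flat = passes_variance_filter(flat_patch, reference)
print(keep_object, keep_flat)
```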