Monday, July 8, 2019

TLD (Tracking, Learning and Detection): Complete Overview with Python Code



In this blog, we will learn about object tracking using TLD. TLD stands for Tracking, Learning and Detection.

What is object tracking?

Locating an object in successive frames of a video is called object tracking. A related term is object detection.


Object detection is the task of localizing objects in an input image. The definition of an “object” varies: it can be a single instance or a whole class of objects.

Object detection methods are typically based on local image features or a sliding window.

How can we do object tracking?

There are many algorithms for object tracking, and each has its own pros and cons. In this article, we will discuss only the TLD algorithm.

Pros of TLD

1. It handles occlusion well.
2. It is good at learning the appearance of the object.

Cons of TLD

1. It does not work well when the object rotates by about 90 degrees or more.
2. It has trouble when the object disappears from the frame.

Acquisition: an asset or object bought or obtained. 

To use TLD, we mark the first frame with a rectangle that indicates the location of the object we want to track. The object is then tracked in subsequent frames by the tracking algorithm.
First, we define our goal.
Objective: Given a bounding box defining the object of interest in a single frame, our goal is to automatically determine the object’s bounding box, or indicate that the object is not visible, in every frame that follows.
The video stream is to be processed at frame rate, and the process should run indefinitely. We refer to this task as long-term tracking.
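
The workflow above (mark the object in the first frame, then get a bounding box or a "not visible" answer for every later frame) can be tried with OpenCV. This is a minimal sketch, assuming the opencv-contrib-python package is installed: recent builds expose the TLD tracker as cv2.legacy.TrackerTLD_create, older ones as cv2.TrackerTLD_create. The helper names and the video path are hypothetical and only serve the illustration.

```python
def create_tld_tracker():
    """Return an OpenCV TLD tracker, whichever API location this build uses."""
    import cv2  # imported lazily; needs the opencv-contrib-python package
    if hasattr(cv2, "legacy") and hasattr(cv2.legacy, "TrackerTLD_create"):
        return cv2.legacy.TrackerTLD_create()
    return cv2.TrackerTLD_create()


def box_to_corners(box):
    """Convert an (x, y, w, h) box to integer top-left / bottom-right corners."""
    x, y, w, h = (int(round(v)) for v in box)
    return (x, y), (x + w, y + h)


def track_video(video_path):
    """Let the user mark the object in the first frame, then track it."""
    import cv2
    capture = cv2.VideoCapture(video_path)
    ok, frame = capture.read()
    if not ok:
        raise IOError("could not read the first frame of " + video_path)

    # Draw a rectangle around the object with the mouse, then press ENTER.
    initial_box = cv2.selectROI("TLD", frame, False)
    tracker = create_tld_tracker()
    tracker.init(frame, initial_box)

    while True:
        ok, frame = capture.read()
        if not ok:
            break  # end of the video
        found, box = tracker.update(frame)
        if found:
            top_left, bottom_right = box_to_corners(box)
            cv2.rectangle(frame, top_left, bottom_right, (0, 255, 0), 2)
        else:
            cv2.putText(frame, "object not visible", (20, 40),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
        cv2.imshow("TLD", frame)
        if cv2.waitKey(1) & 0xFF == 27:  # ESC stops the loop
            break

    capture.release()
    cv2.destroyAllWindows()
```

Calling track_video("video.mp4") with a path to your own video opens a window where you select the object once; after that the tracker reports a box or "object not visible" for each frame.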

Frame rate:
Frame rate is the frequency at which consecutive images, called frames, appear on a display.

A long-term tracker should:
1. Detect the object when it reappears in the camera’s field of view
2. Handle scale and illumination changes
3. Handle background clutter
4. Handle partial occlusions
5. Operate in real time

Long-term tracking can be approached from either a tracking or a detection perspective.

More about TLD

1. The tracker follows the object from frame to frame.

2. The learning component estimates the detector’s errors and updates the detector to avoid these errors in the future.

3. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary.

The Block Diagram of the TLD framework is below.

The starting point of TLD is the acceptance of the fact that neither tracking nor detection can solve the long-term tracking task independently.

Why Tracking and Detection together?

If the tracker and the detector operate simultaneously, each can benefit from the other.

1. A tracker can provide weakly labeled training data for a detector and thus improve it during run-time.

2. A detector can re-initialize a tracker and thus minimize the tracking failures.

3. Each sub-task is addressed by a single component, and the components operate simultaneously.

4. The tracker follows the object from frame to frame. The detector localizes all appearances that have been observed so far and corrects the tracker if necessary.

5. The learning component estimates the detector’s errors and updates the detector to avoid these errors in the future.
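
How the two components can help each other in a single frame can be sketched as one small decision function. This is a simplified, hypothetical combination rule and not the exact integrator used by TLD: it assumes each response comes with a confidence score, picks the most confident one, and reports "not visible" when nothing is confident enough. The function name and the 0.5 threshold are assumptions.

```python
def integrate(tracker_result, detections, min_confidence=0.5):
    """Pick the object location for the current frame.

    tracker_result: (box, confidence), or None when the tracker failed.
    detections:     list of (box, confidence) pairs from the detector.
    Returns (box, source), or (None, "not visible").
    """
    candidates = [(box, conf, "detector") for box, conf in detections]
    if tracker_result is not None:
        box, conf = tracker_result
        candidates.append((box, conf, "tracker"))

    if not candidates:
        return None, "not visible"

    # The most confident response wins, whichever component produced it.
    best_box, best_conf, source = max(candidates, key=lambda c: c[1])
    if best_conf < min_confidence:
        return None, "not visible"
    return best_box, source
```

When the returned source is "detector", the tracker would be re-initialized at that box; when it is "tracker", the box can serve as a weakly labeled training example for the detector.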

More about TLD Framework 

Framework means a basic structure underlying a system, concept, or text.

1. TLD is a framework designed for long-term tracking of an unknown object in a video stream.

2. The tracker estimates the object’s motion between consecutive frames under the assumption that the frame-to-frame motion is limited and the object is visible.

3. The tracker is likely to fail and never recover if the object moves out of the camera view.

4. The detector treats every frame as independent and performs a full scan of the image to localize all appearances that have been observed and learned in the past.

5. The learning component observes the performance of both the tracker and the detector, estimates the detector’s errors, and generates training examples to avoid these errors in the future. The learning component assumes that both the tracker and the detector can fail.

6. By virtue of the learning, the detector generalizes to more object appearances and discriminates against the background.
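
The "full scan of the image" done by the detector can be pictured as a grid of candidate windows over all positions and a few scales. The following is a minimal sketch of such a grid; the scale factors, the 10% step and the minimum window size are assumptions chosen for illustration, not values taken from this article.

```python
def scanning_windows(image_w, image_h, box_w, box_h,
                     scales=(0.8, 1.0, 1.2), step_fraction=0.1, min_size=20):
    """Return (x, y, w, h) windows covering the image at several scales."""
    windows = []
    for scale in scales:
        w = int(round(box_w * scale))
        h = int(round(box_h * scale))
        if w < min_size or h < min_size or w > image_w or h > image_h:
            continue  # skip windows that are too small or do not fit
        # Shift the window by a fraction of its own size in each direction.
        step_x = max(1, int(w * step_fraction))
        step_y = max(1, int(h * step_fraction))
        for y in range(0, image_h - h + 1, step_y):
            for x in range(0, image_w - w + 1, step_x):
                windows.append((x, y, w, h))
    return windows
```

Every window produced this way is a candidate patch that the detector has to classify as object or background, which is why the number of windows directly affects the speed of the detector.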

We have talked about tracking and detection. Now we will talk about the learning portion.

In TLD, we use P-N learning. More precisely, you can call it P-expert and N-expert learning.

P-N learning estimates the errors with a pair of “experts”:

1. The P-expert estimates missed detections.

2. The N-expert estimates false alarms.

The learning process is modeled as a discrete dynamical system and the conditions under which the learning guarantees improvement are found.


P-expert 

1. The P-expert exploits the temporal structure in the video and assumes that the object moves along a trajectory.

2. The P-expert remembers the location of the object in the previous frame and estimates the object location in the current frame using a frame-to-frame tracker.

3. If the detector labeled the current location as negative (i.e. made a false negative error), the P-expert generates a positive example.

4. The goal of the P-expert is to discover new appearances of the object and thus increase the generalization of the object detector.

5. The P-expert can exploit the fact that the object moves on a trajectory and add positive examples extracted from such a trajectory.

6. In every frame, the P-expert outputs a decision about the reliability of the current location (the P-expert is an online process). If the current location is reliable, the P-expert generates a set of positive examples that update the object model and the ensemble classifier.
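
The core rule of the P-expert can be sketched in a few lines. This is a simplified illustration, not the full procedure: it stores the tracked box itself as the positive example (a real implementation would store image patches), and the reliability flag and the detector's label are assumed to be computed elsewhere. All names are hypothetical.

```python
def p_expert(tracked_box, trajectory_is_reliable, detector_label, positives):
    """Add the tracked location as a positive example when the detector missed it.

    detector_label: label the detector gave to tracked_box in this frame
                    (True = object, False = background).
    positives:      list of positive training examples, modified in place.
    Returns True when a missed detection (false negative) was corrected.
    """
    if not trajectory_is_reliable:
        return False  # do not learn from an unreliable trajectory
    if detector_label:
        return False  # the detector already agrees, nothing to correct
    positives.append(tracked_box)
    return True
```

For example, with positives = [], the call p_expert((5, 5, 10, 10), True, False, positives) returns True and leaves the box in positives, because the trajectory is reliable but the detector labeled the location as background.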

N-expert

1. The N-expert exploits the spatial structure in the video and assumes that the object can appear at a single location only.

2. The N-expert analyzes all responses of the detector in the current frame and the response produced by the tracker, and selects the one that is the most confident.

3. Patches that do not overlap with the maximally confident patch are labeled as negative. The maximally confident patch re-initializes the location of the tracker.

4. The N-expert generates negative training examples. Its goal is to discover clutter in the background against which the detector should discriminate.

5. The key assumption of the N-expert is that the object can occupy at most one location in the image. Therefore, if the object location is known, the surroundings of that location are labeled as negative.

6. The N-expert is applied at the same time as the P-expert, i.e., if the trajectory is reliable.

For the update of the object detector and the ensemble classifier, we consider only those patches that were not rejected by either the variance filter or the ensemble classifier.
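
The N-expert rule (keep the most confident patch, label non-overlapping patches as negative) can be sketched as follows. This is a minimal sketch under stated assumptions: overlap is measured as intersection over union, and the 0.2 threshold below which a patch counts as "not overlapping" is an assumed value. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def n_expert(responses, max_overlap=0.2):
    """Split responses into the most confident box and negative examples.

    responses: non-empty list of (box, confidence) pairs coming from the
               detector and the tracker in the current frame.
    Returns (best_box, negatives), where negatives are the boxes that
    barely overlap with best_box.
    """
    best_box, _ = max(responses, key=lambda r: r[1])
    negatives = [box for box, _ in responses
                 if overlap(box, best_box) < max_overlap]
    return best_box, negatives
```

The returned best_box is the location that would re-initialize the tracker, and the negatives are the background clutter the detector should learn to reject. Boxes that overlap strongly with best_box are left unlabeled, since they may still show the object.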






