
Shriram

Real Time Speed Estimation Using Computer Vision

Speed estimation is one of the popular emerging use cases of computer vision. I recently learned some really interesting concepts while designing a pedestrian speed estimator, and I am writing this blog to share some of those learnings.

To estimate the speed of any object, we need to detect it, track it, and then calculate the speed from the data tracked across frames. Hence we need to perform the following steps in sequence.

  • Object recognition
  • Object tracking
  • World plane rectification
  • Speed estimation
    Object recognition

    The first step is detecting the objects in the frame. There are a lot of object detection algorithms out there that do the job with reasonable accuracy out of the box with pre-trained weights, but the one most suited for this application is YOLO (You Only Look Once), since it is faster than the rest. There are many open-source implementations of the algorithm, with newer versions coming in every year with minor improvements. I have chosen to go with yolov5 for my network. Also, we can selectively choose the object classes that we want the algorithm to detect, such as cars, people etc. Our detection algorithm takes an image from a sampled video frame as input and returns bounding boxes (bbox) for the chosen object classes as output.
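    The class-selection idea can be sketched independently of any particular detector. The function name and the tuple layout below are made up for illustration (yolov5 returns its own results object); the point is simply that raw detections get filtered down to the classes and confidences we care about:

    ```python
    # Each detection is (class_name, confidence, bbox) with bbox = (x1, y1, x2, y2).
    def filter_detections(detections, wanted_classes, conf_threshold=0.5):
        """Keep only detections of the wanted classes with high enough confidence."""
        return [d for d in detections
                if d[0] in wanted_classes and d[1] >= conf_threshold]

    raw = [
        ("person", 0.91, (10, 20, 50, 120)),
        ("car",    0.88, (60, 30, 200, 110)),
        ("dog",    0.40, (5, 5, 25, 40)),
    ]
    people = filter_detections(raw, {"person"})   # → only the "person" detection
    ```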

    Typical output from a YOLO network. Credits: MTheiler, CC BY-SA 4.0

    Object tracking

    For our next step, we need to track the detected objects across frames for the speed calculation. The baseline algorithm for multi-object tracking (MOT) is the SORT algorithm, which is based on the Kalman filter. A Kalman filter is an optimal state estimator used for predicting the state of a stochastic system. SORT takes the bbox data as input and generates unique ids for the detected objects. One disadvantage of this algorithm, however, is that it cannot re-identify a person who was previously identified, causing duplication of ids. When people walk in the streets, they can frequently be hidden from the camera's view by a vehicle or a group of people.

    There is an extended version of SORT called DeepSORT that doesn't suffer from this issue. It uses a pre-trained deep association metric to re-identify objects, at the expense of being slightly slower than basic SORT. Choosing the right tracker is highly problem specific. I went with DeepSORT for this particular problem, since the network can then also give an estimate of the total number of footfalls per day in addition to real-time speed estimation. While the steps discussed so far apply to a wide variety of computer vision problems, the next few steps are specific to the speed estimation problem at hand.
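    DeepSORT itself is too large to inline here, but the id-assignment idea behind the SORT baseline can be sketched with greedy IoU matching standing in for the Kalman filter. This is a toy illustration, not the real algorithm: a box in a new frame inherits the id of the existing track it overlaps most, and otherwise gets a fresh id:

    ```python
    def iou(a, b):
        """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter)

    class NaiveTracker:
        """Assign stable ids to boxes across frames by greedy IoU matching."""
        def __init__(self, iou_threshold=0.3):
            self.iou_threshold = iou_threshold
            self.next_id = 0
            self.tracks = {}          # id -> last seen bbox

        def update(self, boxes):
            assigned = {}
            unmatched = dict(self.tracks)
            for box in boxes:
                # Match against the best-overlapping existing track.
                best_id, best_iou = None, self.iou_threshold
                for tid, prev in unmatched.items():
                    score = iou(box, prev)
                    if score > best_iou:
                        best_id, best_iou = tid, score
                if best_id is None:   # no good match -> start a new track
                    best_id = self.next_id
                    self.next_id += 1
                else:
                    del unmatched[best_id]
                assigned[best_id] = box
            self.tracks = assigned
            return assigned
    ```

    A box at (0, 0, 10, 10) in one frame and (1, 1, 11, 11) in the next overlap heavily, so both get id 0; this is exactly the matching that breaks down under occlusion, which is what DeepSORT's appearance metric repairs.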

    World plane rectification

    We need to perform a rectification (2D planar transformation) step before using the output of the previous step to calculate speeds, since our bbox coordinates are not in the real-world plane but in an image plane determined by the resolution of the video feed. If we know the exact location, we can perform rectification using its latitude/longitude data; if the location is unknown, we can use reasonable assumptions about the dimensions of the scene instead.

    We need the coordinates of four points (preferably the corners) in the video (image plane) and the corresponding points in the real-world plane. The accuracy of our calculations improves as the area covered by these four points increases and as the location data gets more precise. The points I chose for my problem are as follows:

    image

    In the pixel plane we already know where each of the above points lies. For the real-world plane, we can use the width and height of the road to get the corresponding points.

    image

    The formula for calculating the mapping matrix, also called the homography (H) matrix, is as follows:

    $$PH = \begin{bmatrix} -x_1 & -y_1 & -1 & 0 & 0 & 0 & x_1x_1' & y_1x_1' & x_1' \\ 0 & 0 & 0 & -x_1 & -y_1 & -1 & x_1y_1' & y_1y_1' & y_1' \\ -x_2 & -y_2 & -1 & 0 & 0 & 0 & x_2x_2' & y_2x_2' & x_2' \\ 0 & 0 & 0 & -x_2 & -y_2 & -1 & x_2y_2' & y_2y_2' & y_2' \\ -x_3 & -y_3 & -1 & 0 & 0 & 0 & x_3x_3' & y_3x_3' & x_3' \\ 0 & 0 & 0 & -x_3 & -y_3 & -1 & x_3y_3' & y_3y_3' & y_3' \\ -x_4 & -y_4 & -1 & 0 & 0 & 0 & x_4x_4' & y_4x_4' & x_4' \\ 0 & 0 & 0 & -x_4 & -y_4 & -1 & x_4y_4' & y_4y_4' & y_4' \end{bmatrix} \begin{bmatrix} h_1 \\ h_2 \\ h_3 \\ h_4 \\ h_5 \\ h_6 \\ h_7 \\ h_8 \\ h_9 \end{bmatrix} = 0$$

    The above equation for the matrix H can be solved either using SVD or via \(X=A^{-1}B\) as discussed here. Once we have the H matrix, we can use it to map any point from the pixel plane to the world plane using the following formula:

    $$\left[\begin{array}{c}{x^{\prime} * \lambda} \\ {y^{\prime} * \lambda} \\ {\lambda}\end{array}\right]=\left[\begin{array}{lll}{h_{11}} & {h_{12}} & {h_{13}} \\ {h_{21}} & {h_{22}} & {h_{23}} \\ {h_{31}} & {h_{32}} & {h_{33}}\end{array}\right] \cdot\left[\begin{array}{l}{x} \\ {y} \\ {1}\end{array}\right]$$
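    Both steps can be sketched in numpy. The corner coordinates below are made up for illustration (four pixel points matched to the corners of a hypothetical 7 m × 12 m stretch of road): H is the right singular vector of P for its smallest singular value, and projecting a point means multiplying by H and dividing out λ:

    ```python
    import numpy as np

    def compute_homography(src_pts, dst_pts):
        """Estimate H mapping pixel-plane points to world-plane points (DLT + SVD)."""
        rows = []
        for (x, y), (xp, yp) in zip(src_pts, dst_pts):
            rows.append([-x, -y, -1, 0, 0, 0, x * xp, y * xp, xp])
            rows.append([0, 0, 0, -x, -y, -1, x * yp, y * yp, yp])
        # h is the right singular vector of P for the smallest singular value.
        _, _, vt = np.linalg.svd(np.array(rows))
        H = vt[-1].reshape(3, 3)
        return H / H[2, 2]          # normalise so the bottom-right entry is 1

    def pixel_to_world(H, x, y):
        """Project a pixel-plane point into the world plane, dividing out λ."""
        v = H @ np.array([x, y, 1.0])
        return v[0] / v[2], v[1] / v[2]

    # Hypothetical correspondences: pixel corners and their world-plane points.
    pixel = [(100, 400), (500, 400), (550, 700), (50, 700)]
    world = [(0, 0), (7, 0), (7, 12), (0, 12)]
    H = compute_homography(pixel, world)
    ```

    Since the four correspondences are exact, projecting any of the pixel corners through H recovers its world coordinates.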

    Speed estimation

    For the final step, we have to choose a reference point within the bbox (popular choices are the top or bottom midpoint). By calculating the change in coordinates of a person's bbox reference point (dx, dy) in real-world units (say metres), using the output of the rectification step, and dividing by the time elapsed (in seconds), we finally get the speed!
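    In code this last step is just Euclidean distance over time. A toy sketch, where p1 and p2 stand for the world-plane positions of the bbox reference point at two sampled frames:

    ```python
    def speed_mps(p1, p2, dt_seconds):
        """Speed in metres per second from two world-plane points and elapsed time."""
        dx = p2[0] - p1[0]
        dy = p2[1] - p1[1]
        return (dx ** 2 + dy ** 2) ** 0.5 / dt_seconds

    # Moving 3 m along x and 4 m along y over 5 s gives 1 m/s.
    speed_mps((0.0, 0.0), (3.0, 4.0), 5.0)   # → 1.0
    ```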

    Overview

    Now that we have understood the individual steps needed to implement the solution, let’s take a look at the overall block diagram of the network.

    image

    Due to processing constraints, frame sampling is usually implemented so that the complete network runs only every N_JUMP frames. In my case, I had to use an N_JUMP of 6 frames or above to run the entire network in real time on a 25 fps video using an Nvidia GPU with compute capability 7.5.
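    The sampling logic itself is simple; a sketch (the function and loop shape are illustrative, not the repo's exact code):

    ```python
    N_JUMP = 6  # run the full detect/track/estimate pipeline every 6th frame

    def sampled_indices(total_frames, n_jump=N_JUMP):
        """Indices of the frames on which the full network actually runs."""
        return [i for i in range(total_frames) if i % n_jump == 0]

    # On one second of 25 fps video, the network runs only ~4 times.
    sampled_indices(25)   # → [0, 6, 12, 18, 24]
    ```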

    Source code

    You can take a look at my implementation of the complete network in Python at the following link: shriram1998/PedestrianSpeedEstimator: Pedestrian Speed Estimation using yolov5, deepsort and homography matrix (github.com).

    Thus we have discussed one possible solution to the complex problem of speed estimation in computer vision. Thank you for reading 🙂

    Further reading

  • YOLO paper
  • DeepSORT paper