Stereographic Depth Estimation with OpenCV¶
For some time in my career, I have been working with depth estimation using stereo cameras. My first exposure to this was at my previous company, where I took part in the development of an underwater stereo camera system used to estimate the weight of fish for aquaculture. This proved very challenging at the time, as I was very green to the concept. With so little background information on the subject, it was difficult to even ask the right questions to find the answers I was looking for.
Five years later, I've developed a comprehensive understanding of the subject and would like to share a "no bullshit" guide to it using the OpenCV library. If you want to calibrate a stereographic camera system from scratch for practical applications, this is the guide for you.
Brief Overview to Stereographic Depth Estimation¶
It's actually really simple. You take two cameras, place them next to each other, take a picture at the same time, and now you have two points of reference to judge how far away something is. It's literally the same as how your eyes work (try closing one eye and walking around; you'll notice you feel a little funny because you don't necessarily know how far away things are).
The math is a bit more complex but only really requires some understanding of trigonometry and linear algebra. Further, the OpenCV library has a lot of the functions you need to do this already built in. But conceptually, you'll want to understand the following:
Two points of reference, camera 1 and camera 2, form the baseline of a triangle that you will use to calculate the distance to an object. The third corner of the triangle, the thing you're looking at, is what we call the "world point".
The "world point" is a point in 3D space. But, on an image, there is only two dimensions. So, we convert the "world point" to the "image point" through projection.
The process of projecting the "world point" onto the camera's "image plane" needs a set of parameters. We call these the "intrinsic" and "extrinsic" parameters. The "intrinsic" parameters are properties of the camera itself (focal length, principal point, etc.) and the "extrinsic" parameters describe the camera in relation to the world (rotation and translation matrices).
The stereo camera system produces two images, each with an associated distortion. This distortion is unique to each camera (meaning every stereo camera setup, and even each camera within a single setup) and is a consequence of the physical properties of the glass of the lens. This distortion needs to be corrected before we can calculate the depth.
To correct the distortion and rectify the images, so that we can eventually produce a single view with the depth associated with it, we need to calibrate the cameras. This process discovers the "intrinsic" and "extrinsic" parameters using known features in the images. A typical feature is a chessboard pattern, as it has known geometry.
Once calibration and rectification are complete, we use a "block matching" algorithm to calculate the disparity between the two images. The disparity is the difference in the position of the same point in the two images, and it is used together with the focal length and a fixed parameter unique to each camera setup called the "baseline" (the distance between the two cameras) to calculate the depth of the point; see the short sketch below.
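If it helps to see that relationship written down, here is a minimal sketch of the depth-from-disparity formula we will use at the end of this guide (the example numbers are placeholders, not values from this setup):
def depth_from_disparity(disparity_px, focal_length_px, baseline_mm):
    # focal length is in pixels, baseline in millimeters, so depth comes out in millimeters.
    # a zero (or negative) disparity means the match failed or the point is effectively at infinity.
    if disparity_px <= 0:
        return float('inf')
    return focal_length_px * baseline_mm / disparity_px

# placeholder example: ~700 px focal length, 120 mm baseline, 50 px disparity
print(depth_from_disparity(50, 700.0, 120.0))  # -> 1680.0 mm, i.e. ~1.7 m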
import matplotlib.pyplot as plt
import cv2
frame = cv2.imread('images/stereo-camera-model.jpg')
plt.imshow(frame)
plt.axis('off')
plt.show()
In the diagram above, you can see a basic setup of a stereo camera system. A few important things to note are "f", the focal length, which is the distance from the camera's optical center to the image plane, and "b", the "baseline", which is simply the distance between the two cameras. A lot of the time, the manufacturer of the camera will provide this information. If not, you can estimate it yourself with a ruler.
For this tutorial, I utilized a stereo camera system called the "ZED" camera. This is quite an expensive beast, but a very well calibrated one. It even comes with a host of open-source tools provided by the manufacturer. With an embedded NVIDIA GPU, it can do real-time depth estimation and point-cloud generation. But, for this tutorial, I will be using OpenCV in Python (in this notebook) to do the same thing.
Now, to make matters even more interesting, this camera was mounted inside a custom underwater housing developed by Sexton, who makes custom high quality camera housings for underwater and industrial applications. Because seawater has a different refraction index than air, and because of the lens distortion of the custom housing, the calibration process is a bit more complex than for a standard stereo camera setup. It's hard to just throw a camera underwater, expect it to survive, and take clear pictures of a checkerboard all at the same time. I've had some interesting experiences with this process in the past and it can be quite challenging.
But, with the help of BlueRobotics and their underwater ROV system, and one of their expert engineers who has also been a colleague of mine for many years, we were able to mount the camera on the ROV and take pictures of a checkerboard pattern underwater.
frame = cv2.cvtColor(cv2.imread('images/rov_zed_setup.jpg'), cv2.COLOR_BGR2RGB)
plt.figure(figsize=(10, 10))
plt.imshow(frame)
plt.axis('off')
plt.show()
Certainly an impressive setup, but the calibration process is the same as any other stereo camera system. So, let's get started.
Step 1: Capture Checkerboard Images¶
We need to capture checkerboard images. From these, we can extract the geometry of the checkerboard and use it to calibrate the camera system. But any known geometry could be used in this scenario. If we dropped a cube underwater, we could deploy an object detection model to find the cube, detect its corners, and use that instead. But, we happen to have a printed checkerboard pattern that we suspended underwater for this purpose.
frame = cv2.cvtColor(cv2.imread('images/checkerboard.jpg'), cv2.COLOR_BGR2RGB)
frame = cv2.rotate(frame, cv2.ROTATE_90_COUNTERCLOCKWISE)
plt.imshow(frame)
plt.axis('off')
plt.show()
For the geometry to be detected and used in the calibration process, we need the checkerboard to be visible in both images. To do this, we have to be able to see where we are driving the rover. Since the ZED camera has its own processing unit (based on a Linux OS and an ARM processor), I wrote a little web application that allows me to record and stream the video feed from the camera to any laptop. And this isn't utilizing some HLS streaming, as that would be far too slow for real-time viewing. Instead, I utilize an FFMPEG process that records the video in high quality and pipes a second stream, encoded with the MPEG1 video codec, to a UDP port. This is picked up by another Socket.IO server that streams the packets to the web application. Finally, a fantastic library called JSMPEG is used to decode the video packets and display the video feed in the browser. It's a beautiful solution for real-time video streaming without the need for a dedicated RTSP server or a complex WebRTC setup. A rough sketch of the relay idea follows the diagram below.
import graphviz as gv
dot = gv.Digraph()
dot.node('A', 'Zed Camera')
dot.node('B', 'FFMPEG')
dot.node('C', 'Pipe to MPEG1 and UDP stream')
dot.node('D', 'Pipe to MJPEG and disk')
dot.node('E', 'Socket.IO Server')
dot.node('F', 'Web Browser')
dot.edges(['AB', 'BC', 'BD', 'CE', 'EF'])
dot
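For reference, here is a minimal sketch of the relay idea. My actual implementation uses a Socket.IO server as shown in the diagram; this sketch swaps that for the plain `websockets` package (a recent version that provides `websockets.broadcast`), and the port numbers are just placeholders:
import asyncio
import socket
import websockets

CLIENTS = set()

async def register(ws):
    # keep track of connected browsers so packets can be fanned out to them
    CLIENTS.add(ws)
    try:
        await ws.wait_closed()
    finally:
        CLIENTS.remove(ws)

async def pump_udp(port=3131):
    # receive the MPEG-TS packets that ffmpeg pushes to udp://0.0.0.0:3131
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(('0.0.0.0', port))
    sock.setblocking(False)
    loop = asyncio.get_running_loop()
    while True:
        data = await loop.sock_recv(sock, 65536)
        # forward the raw packets; JSMPEG decodes them client-side in the browser
        websockets.broadcast(CLIENTS, data)

async def main():
    async with websockets.serve(register, '0.0.0.0', 8082):
        await pump_udp()

asyncio.run(main())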
frame = cv2.cvtColor(cv2.imread('images/distortion.jpeg'), cv2.COLOR_BGR2RGB)
plt.imshow(frame)
plt.title('GUI for Monitoring While Driving ROV (shows distortion as well)')
plt.axis('off')
plt.show()
An FFMPEG pipeline like this was used to record the video:
ffmpeg \
-an \
-f v4l2 \
-video_size 2560x720 \
-framerate 30 \
-i /dev/video0 \
-threads 4 \
-map 0:v \
-f segment \
-vcodec mjpeg \
-qscale:v 3 \
-segment_time 10 \
-reset_timestamps 1 \
-strftime 1 \
recording_%s.mp4 \
-map 0:v \
-f mpegts \
-vcodec mpeg1video \
-qscale:v 18 \
-vf "crop=(iw/2):ih:0:0" \
udp://0.0.0.0:3131
Couple of notes on this pipeline:
- The recorded video feed was output as MJPEG, which compresses each frame independently, and the quality scale (-qscale:v) was set to 3 to keep the file size down while also maintaining a high quality image.
- The video feed was split into 10 second segments to keep the file size down and make it easier to work with.
- The second pipe output MPEG1 video with a much higher quality scale (meaning lower quality), as it is being streamed to the web application. I also cropped the image in half to only show the left camera feed. However, the pilot did mention that the stream quality could still have been reduced further, and that having visibility of both views would have been helpful.
Once the recording was done, we can utilize FFMPEG again to compile the video segments into a single video file:
# output the file names to a text file sorted chronologically
ls | sort | sed 's/^/file /' > video.txt
# concatenate the files into a single video
ffmpeg -f concat -safe 0 -i video.txt -c copy compiled.mp4
With all of this out of the way, it's time to drive the ROV around and capture some images of the checkerboard. For the best calibration results, you want to drive the ROV around the checkerboard at different distances from the board and at different angles (all angles of pitch, yaw, and roll). Capturing the geometry in all corners of the image plane is also important. For the sake of completeness, the scene should also be free of high-contrast lighting conditions, reflections, and particulates in the water, as these can make identifying the checkerboard corners difficult and inaccurate. This will become a lot clearer in the code outline below.
import numpy as np
import pandas as pd
from tqdm import tqdm
file_path = '/home/jack/Mounts/DiskOne/stereo/zed_calibration/zed_compiled.mp4'
capture = cv2.VideoCapture(file_path)
writer = cv2.VideoWriter('checkerboard.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 120, (1280, 360))
frame_count = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
cv2.namedWindow('calibrate')
# this is the size of the checkerboard pattern. We have 8 rows and 11 columns
checkerboard_size = (8, 11)
points = []
last_found = 0
frame_index = 0
# nice little progress bar. Always useful.
loader = tqdm(total=frame_count, position=0, leave=True, desc='Finding chessboard corners')
while True:
# fast forward if needed to increase the speed of finding corners.
# If you have a lengthy video, this can be slow on a CPU
if frame_index - last_found > 100:
frame_index += 50
capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
loader.update(50)
elif frame_index - last_found > 10:
frame_index += 10
capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
loader.update(10)
else:
frame_index += 1
capture.set(cv2.CAP_PROP_POS_FRAMES, frame_index)
loader.update(1)
ret, frame = capture.read()
if not ret:
break
left_frame = frame[:, :int(frame.shape[1] // 2)]
right_frame = frame[:, int(frame.shape[1] // 2):]
left_found, left_corners = cv2.findChessboardCorners(left_frame, checkerboard_size)
if left_found:
right_found, right_corners = cv2.findChessboardCorners(right_frame, checkerboard_size)
if right_found:
last_found = frame_index
left_gray = cv2.cvtColor(left_frame, cv2.COLOR_BGR2GRAY)
right_gray = cv2.cvtColor(right_frame, cv2.COLOR_BGR2GRAY)
# increase accuracy of the corner detection by subpixel refinement
left_corners = cv2.cornerSubPix(left_gray, left_corners, (11, 11), (-1, -1), (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 30, 0.01))
right_corners = cv2.cornerSubPix(right_gray, right_corners, (11, 11), (-1, -1), (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 30, 0.01))
# draw the checkerboard corners on the frame for display
left_frame = cv2.drawChessboardCorners(left_frame, checkerboard_size, left_corners, True)
right_frame = cv2.drawChessboardCorners(right_frame, checkerboard_size, right_corners, True)
points.append({'frame_index': frame_index, 'left': left_corners, 'right': right_corners})
loader.set_postfix({'points': len(points)})
# stack the frames horizontally
display = np.hstack((left_frame, right_frame))
# resize in half for display purposes
display = cv2.resize(display, fx=0.5, fy=0.5, dsize=(0, 0))
cv2.imshow('calibrate', display)
cv2.setWindowTitle('calibrate', f'calibrate: {frame_index}')
if left_found and right_found:
writer.write(display)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
loader.close()
cv2.destroyAllWindows()
capture.release()
writer.release()
Finding chessboard corners: 19780it [22:52, 14.41it/s, points=6327]
from IPython.display import Video
Video('https://jackmead515.github.io/videos/checkerboard.mp4', html_attributes='loop autoplay muted playsinline width="100%"')
The video should provide a reference for how the checkerboard patterns should be captured. All of these points were collected from the video, so let's save them now so we don't have to reprocess the video again.
points = pd.DataFrame(points)
points['left'] = points['left'].apply(lambda x: x.flatten())
points['right'] = points['right'].apply(lambda x: x.flatten())
points.to_parquet('points.parquet')
Step 2: Rectification and Calibration¶
Now we can begin to analyze the set of checkerboard points we have obtained and use them to calibrate the camera system. Referring back to step 1, we should analyze the distribution of our data points to see how well we have captured the geometry across the image plane.
import seaborn as sns
sns.set_theme(style="whitegrid")
points = pd.read_parquet('points.parquet')
points['left'] = points['left'].apply(lambda x: x.reshape(-1, 2))
points['right'] = points['right'].apply(lambda x: x.reshape(-1, 2))
pl = np.stack(points.left.values, axis=0)
pl.shape = (pl.shape[0] * pl.shape[1], 2)
pr = np.stack(points.right.values, axis=0)
pr.shape = (pr.shape[0] * pr.shape[1], 2)
print('Total number of points:', pl.shape[0])
Total number of points: 556776
sns.jointplot(x=pl[:, 0], y=pl[:, 1], kind='hist', color='g', fill=True)
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.suptitle('Distribution of Left Checkerboard Points')
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)
plt.show()
sns.jointplot(x=pr[:, 0], y=pr[:, 1], kind='hist', color='r', fill=True)
plt.xlabel('X Coordinate')
plt.ylabel('Y Coordinate')
plt.suptitle('Distribution of Right Checkerboard Points')
plt.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)
plt.show()
Viewing these jointplots, we can see that the data points have a wide distribution across the image plane. Note that we should have points in the range of 0-1280 and 0-720 for the x and y axes respectively. But we seem to have missed some points from 0-200 on the x axis of the left image plane. Hopefully, the estimation will still have enough reference points to calibrate the system regardless.
Now, what is the most appropriate way to select a sample of points to use for calibration? We could use all the points, but this is extremely inefficient and could actually lead to a worse calibration. Instead, we should select a random set of samples that are evenly distributed across the image plane. To select such a sample evenly, we can write a small algorithm. We will aim for around 200 sets of points, as that should be enough to calibrate the camera system without being too computationally expensive (just known from experience).
Note that the `left` and `right` columns store the checkerboard points as flattened numpy arrays. To find the "top left" corner in the matrix, we have to reshape the matrix back to its original form.
# 8x11 grid, meaning 8 rows and 11 columns
checkerboard_size = (8, 11)
points.left = points.left.apply(lambda x: x.reshape(checkerboard_size[1], checkerboard_size[0], 2))
points.right = points.right.apply(lambda x: x.reshape(checkerboard_size[1], checkerboard_size[0], 2))
# indexing is by column first and then row
top_left = lambda x: x[0, checkerboard_size[0]-1]
bottom_right = lambda x: x[checkerboard_size[1]-1, 0]
m = points.left.iloc[0]
tl = top_left(m)
br = bottom_right(m)
plt.scatter(m[:, :, 0], m[:, :, 1], color='r')
plt.scatter(tl[0], tl[1], color='g', s=100)
plt.scatter(br[0], br[1], color='b', s=100)
# we need to flip the y-axis to match the image
# since image pixels start from the top left
plt.gca().invert_yaxis()
plt.show()
# taking 5% of the image size as the region size so we have 20x20 regions.
# although, we won't get a point set for every region
region_w = int(1280*0.05)
region_h = int(720*0.05)
def within_region(point_set, x, y):
tl = top_left(point_set)
return tl[0] >= x and tl[0] <= x + region_w and tl[1] >= y and tl[1] <= y + region_h
x_range = np.arange(0, 1280, region_w)
y_range = np.arange(0, 720, region_h)
plt.figure(figsize=(6, 6))
total_samples = 0
for x in x_range:
for y in y_range:
points['in_region'] = points.left.apply(lambda ps: within_region(ps, x, y))
sample = points[points.in_region]
if len(sample) > 0:
sample = sample.sample(1)
pl = sample.left.values[0].copy()
pr = sample.right.values[0].copy()
pl.shape = (pl.shape[0] * pl.shape[1], 2)
pr.shape = (pr.shape[0] * pr.shape[1], 2)
plt.scatter(pl[:, 0], pl[:, 1], color='g', s=1)
plt.scatter(pr[:, 0], pr[:, 1], color='r', s=1)
total_samples += 1
print('Total samples:', total_samples)
for x in x_range:
plt.axvline(x, c='b', alpha=0.5)
for y in y_range:
plt.axhline(y, c='b', alpha=0.5)
plt.grid(False)
plt.show()
Total samples: 125
Note that this also indirectly shows you the difference in perspective between the two cameras: the left camera captured points more towards the right side of the image, and inversely for the right camera. This is literally what "disparity" is. And the next steps will be to estimate that difference.
Now we can actually calibrate the camera system using this sample set of data. We will use this sampling method to get several sets and calibrate each of them individually. Using basic statistics, this will show us what the average reprojection error is, so we can select an appropriate calibration set. This will become clearer further on.
sample_batches = []
for b in range(0, 10):
for x in x_range:
for y in y_range:
points['in_region'] = points.left.apply(lambda ps: within_region(ps, x, y))
sample = points[points.in_region]
if len(sample) > 0:
sample = sample.sample(1)
pl = sample.left.values[0].copy()
pr = sample.right.values[0].copy()
sample_batches.append({'left': pl.flatten(), 'right': pr.flatten(), 'batch': b })
sample_batches = pd.DataFrame(sample_batches)
sample_batches.to_parquet('sample_batches.parquet')
sample_batches = pd.read_parquet('sample_batches.parquet')
sample_batches['left'] = sample_batches['left'].apply(lambda x: x.reshape(-1, 2))
sample_batches['right'] = sample_batches['right'].apply(lambda x: x.reshape(-1, 2))
This next bit of code is a function that performs the entire calibration process that OpenCV provides. I can't go too in depth into the math here, as that would fill many pages. But the OpenCV documentation actually provides a very good explanation of each process. I highly recommend reading it and seeking external guides if you want to understand the math behind this process. Or, even better, deep dive into the OpenCV source code.
def calibrate_stereo(sample_points, square_size=3, checkerboard_size=(8, 11), image_size=(1280, 720)):
"""
Calibrates and rectifies a stereo camera setup using a set of checkerboard points. Returns the intrinsic and extrinsic parameters
:param sample_points: A DataFrame containing the left and right checkerboard points
:param square_size: The size of the square in the checkerboard pattern in centimeters
:param checkerboard_size: The size of the checkerboard pattern, (rows, columns)
:param image_size: The size of one half of the stereo image in pixels (width, height)
"""
# termination criteria for the calibration. EPS is the accuracy of the parameters
criteria = (cv2.TERM_CRITERIA_MAX_ITER + cv2.TERM_CRITERIA_EPS, 100, 1e-5)
# flags for the stereo calibration. The OpenCV documentation has more details on these
# https://docs.opencv.org/4.x/d9/d0c/group__calib3d.html#ga687a1ab946686f0d85ae0363b5af1d7b
flags = (cv2.CALIB_FIX_ASPECT_RATIO + cv2.CALIB_ZERO_TANGENT_DIST + cv2.CALIB_SAME_FOCAL_LENGTH)
# create a matrix that contains the coordinates of the checkerboard corners
# and using the square size, calculate the 3D coordinates of the corners
corner_coordinates = np.zeros((np.prod(checkerboard_size), 3), np.float32)
corner_coordinates[:, :2] = np.indices(checkerboard_size).T.reshape(-1, 2)
corner_coordinates *= square_size
object_points = np.array([corner_coordinates] * len(sample_points), np.float32)
# stack the left and right image points
left_image_points = np.stack(sample_points.left.values, axis=0)
right_image_points = np.stack(sample_points.right.values, axis=0)
# calibrate the stereo camera setup, producing the intrinsic and extrinsic parameters and the reprojection error
reprojection_error, left_camera_matrix, left_dist_coeffs, right_camera_matrix, right_dist_coeffs, R, T, E, F = cv2.stereoCalibrate(
object_points,
left_image_points,
right_image_points,
None,
None,
None,
None,
image_size,
None,
None,
None,
None,
criteria=criteria,
flags=flags
)
# rectify the stereo camera setup, producing the rectification matrices and projection matrices
rectification_left, rectification_right, proj_mats_left, proj_mats_right, disp_to_depth_mat, valid_boxes_left, valid_boxes_right = cv2.stereoRectify(
left_camera_matrix,
left_dist_coeffs,
right_camera_matrix,
right_dist_coeffs,
image_size,
R,
T,
flags=0
)
# create the undistortion and rectification maps for the left and right cameras
# this will be used to remove distortion from the images
undistortion_map_left, rectification_map_left = cv2.initUndistortRectifyMap(
left_camera_matrix,
left_dist_coeffs,
rectification_left,
proj_mats_left,
image_size,
cv2.CV_32FC1
)
undistortion_map_right, rectification_map_right = cv2.initUndistortRectifyMap(
right_camera_matrix,
right_dist_coeffs,
rectification_right,
proj_mats_right,
image_size,
cv2.CV_32FC1
)
return {
'reprojection_error': reprojection_error,
'left_camera_matrix': left_camera_matrix,
'left_dist_coeffs': left_dist_coeffs,
'right_camera_matrix': right_camera_matrix,
'right_dist_coeffs': right_dist_coeffs,
'R': R,
'T': T,
'E': E,
'F': F,
'rectification_left': rectification_left,
'rectification_right': rectification_right,
'proj_mats_left': proj_mats_left,
'proj_mats_right': proj_mats_right,
'disp_to_depth_mat': disp_to_depth_mat,
'valid_boxes_left': valid_boxes_left,
'valid_boxes_right': valid_boxes_right,
'undistortion_map_left': undistortion_map_left,
'rectification_map_left': rectification_map_left,
'undistortion_map_right': undistortion_map_right,
'rectification_map_right': rectification_map_right,
'focal_length': proj_mats_left[0, 0],
}
Since this is a very computationally expensive process, I'm multi-threading each batch.
from concurrent.futures import ThreadPoolExecutor, as_completed
pool = ThreadPoolExecutor(max_workers=8)
calibrations = []
futures = []
for _, sample in sample_batches.groupby('batch'):
futures.append(pool.submit(calibrate_stereo, sample))
loader = tqdm(total=len(futures), position=0, leave=True, desc='Calibrating Stereo Cameras')
for future in as_completed(futures):
calibrations.append(future.result())
loader.update(1)
pool.shutdown()
loader.close()
calibrations = pd.DataFrame(calibrations)
calibrations[['reprojection_error', 'focal_length']]
Calibrating Stereo Cameras: 100%|██████████| 10/10 [01:16<00:00, 7.66s/it]
| | reprojection_error | focal_length |
|---|---|---|
| 0 | 1.593398 | 719.472766 |
| 1 | 1.515538 | 726.201919 |
| 2 | 1.535844 | 720.017902 |
| 3 | 1.465996 | 723.502134 |
| 4 | 1.461528 | 727.751580 |
| 5 | 1.556814 | 729.101834 |
| 6 | 1.500000 | 729.223363 |
| 7 | 1.469965 | 729.106474 |
| 8 | 1.543690 | 730.283654 |
| 9 | 1.430410 | 723.477894 |
Amazing! We see that the average reprojection error hovers around 1.5. Typically, values below 2 indicate a good calibration, though it is achievable to get sub-0.1 pixel error with a very good set of calibration points; it depends on the setup and the desired accuracy. We can then select the calibration with the error closest to the average. It's also worth taking note of the estimated focal length, which varies from 719.4 to 730.2! Since this number varies a bit, it's best to look at the average error to judge whether the calibration is good. I have sometimes seen a calibration with a very low error but a focal length that is way off, which usually produces distorted images even after the undistortion process. So it's important to look at both the error and the focal length.
mean_reprojection_error = calibrations['reprojection_error'].mean()
# find the calibration closest to the mean reprojection error
calibration = calibrations.iloc[(calibrations['reprojection_error'] - mean_reprojection_error).abs().argsort()[:1]]
calibration.to_pickle('calibration.pkl')
calibration
reprojection_error | left_camera_matrix | left_dist_coeffs | right_camera_matrix | right_dist_coeffs | R | T | E | F | rectification_left | ... | proj_mats_left | proj_mats_right | disp_to_depth_mat | valid_boxes_left | valid_boxes_right | undistortion_map_left | rectification_map_left | undistortion_map_right | rectification_map_right | focal_length | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
6 | 1.5 | [[729.2233630145528, 0.0, 644.681831986751], [... | [[0.16833925952241344, 0.858350543438188, 0.0,... | [[729.2233630145528, 0.0, 636.7249355556893], ... | [[0.14579477804156685, 0.9499134750111304, 0.0... | [[0.9999930589292869, -0.001999907374329684, 0... | [[-12.030477889512992], [-0.04024151553958756]... | [[8.476479628961037e-05, -0.020777986528176015... | [[1.3492669791807235e-09, -3.3073932037239e-07... | [[0.9999980987491265, 0.0013401141230613106, 0... | ... | [[729.2233630145528, 0.0, 641.3035507202148, 0... | [[729.2233630145528, 0.0, 639.5215301513672, -... | [[1.0, 0.0, 0.0, -641.3035507202148], [0.0, 1.... | (134, 74, 1012, 568) | (132, 78, 1016, 569) | [[-172.49211, -171.83736, -171.16911, -170.487... | [[-95.96282, -96.30531, -96.64135, -96.97099, ... | [[-145.11618, -144.74399, -144.35585, -143.951... | [[-88.009125, -88.48905, -88.96143, -89.426346... | 729.223363 |
1 rows × 21 columns
Step 3: Tuning the Block Matching Algorithm¶
Now that we have a set of calibration parameters, we can use them to rectify the images and calculate disparity. In case it wasn't clear, disparity is the pixel-level difference in the position of the same point between the two images. Since we know the focal length and baseline, we can use this disparity to translate "pixels" into metric units. This is the depth estimation process.
But the algorithm to calculate the disparity is not so straightforward. The OpenCV library provides an implementation of the "block matching" algorithm. Essentially, this algorithm uses template matching to find the best match between the two images using a small "block" of pixels (we refer to that as the "block size"). But there are a lot more parameters to the algorithm that need to be tuned manually to get the best results for the scene you intend to capture. For our underwater scene, we need to tune the algorithm to be able to see the reef systems, fish, and other objects, as you will see in the video below.
Fun fact: there is a live stream known as the TheMegaLab camera that shows a view of the underwater ecosystem off the shore of Kailua-Kona, Hawaii. Part of the video we utilized to tune the block matching algorithm shows this camera. You can view it here on Youtube
So let's get to the code and see how we can setup OpenCV to provide a GUI to tune the block matching algorithm.
import json
calibration = pd.read_pickle('calibration.pkl')
undistortion_map_left = calibration['undistortion_map_left'].values[0]
rectification_map_left = calibration['rectification_map_left'].values[0]
undistortion_map_right = calibration['undistortion_map_right'].values[0]
rectification_map_right = calibration['rectification_map_right'].values[0]
disparity_to_depth_map = calibration['disp_to_depth_mat'].values[0]
# initialize the stereo block matching algorithm here. We will tune the parameters later
stereo = cv2.StereoBM_create(numDisparities=16, blockSize=21)
def no_op(*args):
pass
file_path = '/home/jack/Mounts/DiskOne/stereo/themegalab/compiled.mp4'
capture = cv2.VideoCapture(file_path)
cv2.namedWindow('disparity')
# create trackbars for the stereo block matching parameters
cv2.createTrackbar('numDisparities', 'disparity', 1, 20, no_op)
cv2.createTrackbar('blockSize', 'disparity', 5, 50, no_op)
cv2.createTrackbar('preFilterType', 'disparity', 1, 1, no_op)
cv2.createTrackbar('preFilterSize', 'disparity', 2, 25, no_op)
cv2.createTrackbar('preFilterCap', 'disparity', 5, 63, no_op)
cv2.createTrackbar('textureThreshold', 'disparity', 10, 500, no_op)
cv2.createTrackbar('uniquenessRatio', 'disparity', 1, 100, no_op)
cv2.createTrackbar('speckleRange', 'disparity', 0, 100, no_op)
cv2.createTrackbar('speckleWindowSize', 'disparity', 3, 25, no_op)
cv2.createTrackbar('maxSpeckleSize', 'disparity', 100, 5000, no_op)
cv2.createTrackbar('maxSpeckleDiff', 'disparity', 32, 256, no_op)
cv2.createTrackbar('disp12MaxDiff', 'disparity', 5, 25, no_op)
cv2.createTrackbar('minDisparity', 'disparity', 5, 50, no_op)
while True:
ret, frame = capture.read()
if not ret:
break
left_frame = frame[:, :int(frame.shape[1] // 2)]
right_frame = frame[:, int(frame.shape[1] // 2):]
# remap the frames using the undistortion and rectification maps.
# This removes distortion from the images
remapped_left = cv2.remap(left_frame, undistortion_map_left, rectification_map_left, cv2.INTER_LINEAR)
remapped_right = cv2.remap(right_frame, undistortion_map_right, rectification_map_right, cv2.INTER_LINEAR)
remapped_left_gray = cv2.cvtColor(remapped_left, cv2.COLOR_BGR2GRAY)
remapped_right_gray = cv2.cvtColor(remapped_right, cv2.COLOR_BGR2GRAY)
# get the parameters from the trackbars. Mind the scaling as these
# numbers need to be in specific ranges. The documentation has more details
numDisparities = cv2.getTrackbarPos('numDisparities','disparity')*16
blockSize = cv2.getTrackbarPos('blockSize','disparity')*2 + 5
preFilterType = cv2.getTrackbarPos('preFilterType','disparity')
preFilterSize = cv2.getTrackbarPos('preFilterSize','disparity')*2 + 5
preFilterCap = cv2.getTrackbarPos('preFilterCap','disparity')
textureThreshold = cv2.getTrackbarPos('textureThreshold','disparity')
uniquenessRatio = cv2.getTrackbarPos('uniquenessRatio','disparity')
speckleRange = cv2.getTrackbarPos('speckleRange','disparity')
speckleWindowSize = cv2.getTrackbarPos('speckleWindowSize','disparity')*2
maxSpeckleSize = cv2.getTrackbarPos('maxSpeckleSize','disparity')
maxSpeckleDiff = cv2.getTrackbarPos('maxSpeckleDiff','disparity')
disp12MaxDiff = cv2.getTrackbarPos('disp12MaxDiff','disparity')
minDisparity = cv2.getTrackbarPos('minDisparity','disparity')
# Update the stereo block matching parameters
stereo.setNumDisparities(numDisparities if numDisparities > 0 else 16)
stereo.setBlockSize(blockSize if blockSize > 5 else 5)
stereo.setPreFilterType(preFilterType if preFilterType > 0 else 1)
stereo.setPreFilterSize(preFilterSize if preFilterSize > 5 else 5)
stereo.setPreFilterCap(preFilterCap if preFilterCap > 0 else 5)
stereo.setTextureThreshold(textureThreshold if textureThreshold > 0 else 10)
stereo.setUniquenessRatio(uniquenessRatio if uniquenessRatio > 0 else 1)
stereo.setSpeckleRange(speckleRange if speckleRange > 0 else 0)
stereo.setSpeckleWindowSize(speckleWindowSize if speckleWindowSize > 0 else 3)
stereo.setDisp12MaxDiff(disp12MaxDiff if disp12MaxDiff > 0 else 5)
stereo.setMinDisparity(minDisparity if minDisparity > 0 else 5)
# compute the disparity map!!
disparity = stereo.compute(remapped_left_gray, remapped_right_gray)
# filter the disparity map to remove noise. This is also known as speckle filtering
# and we can tune this parameter as well.
disparity, _ = cv2.filterSpeckles(disparity, 0, maxSpeckleSize, maxSpeckleDiff)
# Normalize the disparity so we can visualize it.
disparity = disparity.astype(np.float32) / 16.0
disparity = cv2.normalize(disparity, disparity, 0, 255, cv2.NORM_MINMAX)
disparity = disparity.astype(np.uint8)
disparity = cv2.applyColorMap(disparity, cv2.COLORMAP_JET)
display = cv2.hconcat([remapped_left, disparity])
display = cv2.resize(display, fx=0.5, fy=0.5, dsize=(0, 0))
cv2.imshow('disparity', display)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
cv2.destroyAllWindows()
capture.release()
stereo_parameters = {
'numDisparities': numDisparities,
'blockSize': blockSize,
'preFilterType': preFilterType,
'preFilterSize': preFilterSize,
'preFilterCap': preFilterCap,
'textureThreshold': textureThreshold,
'uniquenessRatio': uniquenessRatio,
'speckleRange': speckleRange,
'speckleWindowSize': speckleWindowSize,
'maxSpeckleSize': maxSpeckleSize,
'maxSpeckleDiff': maxSpeckleDiff,
'disp12MaxDiff': disp12MaxDiff,
'minDisparity': minDisparity
}
with open('stereo_parameters.json', 'w') as f:
json.dump(stereo_parameters, f)
from IPython.display import Video
Video('https://jackmead515.github.io/videos/tuning-process.mp4', html_attributes='loop autoplay muted playsinline width="100%"')
The code rather speaks for itself. But, the important parameters to tune are:
- `numDisparities`: the number of disparities to search over. This sets the range of disparity (and hence depth) that the algorithm can resolve: the higher the number, the wider the range of depth you can see, but the more computationally expensive the process is. This number must be divisible by 16.
- `blockSize`: the size of the block to match. This is the size of the template that the algorithm will use to search for the best match. For larger block sizes, you will have less noise but also less detail. The converse is true for smaller block sizes.
- `uniquenessRatio`: defines how unique the best match has to be. The higher the number, the more unique the match has to be. This is useful for reducing noise in the disparity map.
- `speckleWindowSize` and `speckleRange`: these parameters are used to remove noise in the form of "speckles" from the disparity map. `speckleWindowSize` is the maximum size of the speckle in pixels and `speckleRange` is the maximum disparity variation within the speckle. `maxSpeckleSize` and `maxSpeckleDiff` are complementary parameters (used with `cv2.filterSpeckles`) that filter out speckles that are unfortunately produced anyway.
But once we have the parameters tuned, we have a fully functional stereo camera system that can estimate depth in real time (at least on my PC)! If you wanted to deploy this on a Raspberry Pi, you would have to calculate disparity only on frames that contain a useful feature. For instance, you could deploy a small object detection model that detects fish or other objects and only calculate disparity on those frames. The ZED has a built-in NVIDIA GPU, so we could also make use of the C++ API for OpenCV and CUDA optimizations to make this process real-time on all frames.
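As a rough sketch of that idea (the `detect_objects` function here is hypothetical, standing in for whatever lightweight detector you deploy):
def compute_disparity_if_interesting(stereo, left_gray, right_gray, detect_objects):
    # run the cheap detector first; skip the expensive block matching
    # entirely when nothing of interest is in view
    detections = detect_objects(left_gray)
    if not detections:
        return None, detections
    disparity = stereo.compute(left_gray, right_gray).astype(np.float32) / 16.0
    return disparity, detections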
Step 4: Depth Estimation¶
Now we are on the final step of this notebook, which is to estimate depth in metric units. With a simple two-camera stereo system, depth is given by the formula:
depth = (focal_length * baseline) / disparity
Where `focal_length` is the focal length of the camera in pixels (we use the calculated focal length from the calibration), `baseline` is the distance between the two cameras, and `disparity` is the difference in the position of the same point in the two images, which is what we tuned the block matcher to produce in the previous step.
Given our `baseline` is in units of millimeters, we get millimeters as our output depth. For this specific camera, the manufacturer (in the link above) gives this value as 120mm.
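As a quick sanity check of the formula: with the calibrated focal length of roughly 729 px and the 120mm baseline, a point with a disparity of 48 px works out to 729 * 120 / 48 ≈ 1823mm, or about 1.8m away.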
The code below shows this process and calculates the depth in meters for an arbitrary box in the center of the image.
with open('stereo_parameters.json', 'r') as f:
stereo_parameters = json.load(f)
print(stereo_parameters)
stereo = cv2.StereoBM_create()
stereo.setNumDisparities(stereo_parameters['numDisparities'])
stereo.setBlockSize(stereo_parameters['blockSize'])
stereo.setPreFilterType(stereo_parameters['preFilterType'])
stereo.setPreFilterSize(stereo_parameters['preFilterSize'])
stereo.setPreFilterCap(stereo_parameters['preFilterCap'])
stereo.setTextureThreshold(stereo_parameters['textureThreshold'])
stereo.setUniquenessRatio(stereo_parameters['uniquenessRatio'])
stereo.setSpeckleRange(stereo_parameters['speckleRange'])
stereo.setSpeckleWindowSize(stereo_parameters['speckleWindowSize'])
stereo.setDisp12MaxDiff(stereo_parameters['disp12MaxDiff'])
stereo.setMinDisparity(stereo_parameters['minDisparity'])
file_path = '/home/jack/Mounts/DiskOne/stereo/themegalab/compiled.mp4'
capture = cv2.VideoCapture(file_path)
writer = cv2.VideoWriter('depth.mp4', cv2.VideoWriter_fourcc(*'mp4v'), 30, (1280, 360))
undistortion_map_left = calibration['undistortion_map_left'].values[0]
rectification_map_left = calibration['rectification_map_left'].values[0]
undistortion_map_right = calibration['undistortion_map_right'].values[0]
rectification_map_right = calibration['rectification_map_right'].values[0]
disparity_to_depth_map = calibration['disp_to_depth_mat'].values[0]
focal_length = calibration['focal_length'].values[0]
# this could be saved in the stereoparameters.json file as well.
baseline_mm = 120.0
capture.set(cv2.CAP_PROP_POS_FRAMES, 10500)
cv2.namedWindow('depth')
index = 0
while True:
ret, frame = capture.read()
index += 1
if not ret or index > 500:
break
left_frame = frame[:, :int(frame.shape[1] // 2)]
right_frame = frame[:, int(frame.shape[1] // 2):]
# remap for undistortion
remapped_left = cv2.remap(left_frame, undistortion_map_left, rectification_map_left, cv2.INTER_LINEAR)
remapped_right = cv2.remap(right_frame, undistortion_map_right, rectification_map_right, cv2.INTER_LINEAR)
remapped_left_gray = cv2.cvtColor(remapped_left, cv2.COLOR_BGR2GRAY)
remapped_right_gray = cv2.cvtColor(remapped_right, cv2.COLOR_BGR2GRAY)
# compute the disparity map
disparity = stereo.compute(remapped_left_gray, remapped_right_gray)
disparity, _ = cv2.filterSpeckles(disparity, 0, stereo_parameters['maxSpeckleSize'], stereo_parameters['maxSpeckleDiff'])
disparity = disparity.astype(np.float32) / 16.0
# calculate the depth map using the disparity map.
depth_mm = np.full(disparity.shape, np.nan)
depth_mm[disparity > 0] = baseline_mm * focal_length / disparity[disparity > 0]
# we pick an arbitrary target for this example. This will target the view's center.
target = (
int(remapped_left.shape[1] / 2 - 50),
int(remapped_left.shape[0] / 2 - 50),
int(remapped_left.shape[1] / 2 + 50),
int(remapped_left.shape[0] / 2 + 50)
)
x1, y1, x2, y2 = target
# calculate the median depth of the target region. We convert the
# millimeters to meters for display purposes.
direct_depth_m = np.nanmedian(depth_mm[y1:y2, x1:x2]) * 0.001
# draw a rectangle around the target and display the depth
cv2.rectangle(remapped_left, (x1, y1), (x2, y2), (0, 255, 0), 2)
cv2.putText(remapped_left, f'{direct_depth_m:.2f}m', (x1, y1-10), cv2.FONT_HERSHEY_SIMPLEX, 0.8, (0, 0, 255), 2)
disparity = cv2.normalize(disparity, disparity, 0, 255, cv2.NORM_MINMAX)
disparity = disparity.astype(np.uint8)
disparity = cv2.applyColorMap(disparity, cv2.COLORMAP_JET)
display = cv2.hconcat([remapped_left, disparity])
display = cv2.resize(display, fx=0.5, fy=0.5, dsize=(0, 0))
writer.write(display)
cv2.imshow('depth', display)
if cv2.waitKey(1) & 0xFF == ord('q'):
break
capture.release()
writer.release()
cv2.destroyAllWindows()
{'numDisparities': 48, 'blockSize': 45, 'preFilterType': 1, 'preFilterSize': 17, 'preFilterCap': 16, 'textureThreshold': 385, 'uniquenessRatio': 3, 'speckleRange': 23, 'speckleWindowSize': 22, 'maxSpeckleSize': 4191, 'maxSpeckleDiff': 55, 'disp12MaxDiff': 5, 'minDisparity': 5}
Video('https://jackmead515.github.io/videos/depth.mp4', html_attributes='loop autoplay muted playsinline width="100%"')
Conclusion¶
Of course, this isn't even close to the end of what we could do. These are actually just the first steps toward point cloud generation for 3D reconstruction: think photogrammetry or robotic environment mapping. In fact, using this footage obtained from the ROV, my engineer colleague generated a 3D model of the pipeline you see in the video above. You can imagine how useful that would be for inspection purposes: generating a full 3D model that is accurate to the millimeter.
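If you want a taste of where that would start: the Q matrix (the disp_to_depth_mat we computed during rectification) can reproject a disparity map straight into a 3D point cloud. A minimal sketch, assuming `disparity` is the float32 map from the loop above and `remapped_left` is the rectified left frame (note the coordinates come out in the calibration's square_size units, centimeters in this case):
# reproject every pixel with a valid disparity into 3D using the Q matrix
points_3d = cv2.reprojectImageTo3D(disparity, disparity_to_depth_map)
mask = disparity > disparity.min()
xyz = points_3d[mask]
colors = cv2.cvtColor(remapped_left, cv2.COLOR_BGR2RGB)[mask]
# xyz and colors can now be written to a PLY file or fed into a point cloud viewer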
Perhaps my next notebook will cover that process... :)
A special thanks goes to Sexton and BlueRobotics for providing the equipment to make this possible.
And an extra special thanks to my colleague who is the mastermind behind the ROV and hardware setup. He has a pretty rad Youtube channel. I highly recommend checking it out if you're interested in robotics and underwater exploration.