Toyota is sponsoring the development of new algorithms for surface reconstruction and real-time localization in PCL, as part of the first PCL code sprint.
For a complete list of all the present and past PCL code sprints please visit http://www.pointclouds.org/blog.
Thomas Whelan and John McDonald, two of our collaborators from the National University of Ireland in Maynooth, have been feverishly working away on extensions and improvements to PCL’s implementation of KinectFusion - KinFu.
The algorithm, which Tom calls Kintinuous, overcomes two of the major limitations of KinectFusion.
Tom has created some amazing mesh reconstructions of apartments, corridors, lecture theaters and even outdoors (at night) - all of which can be produced in real time. Below is a video overview showing some of these reconstructions.
We are hoping to integrate Kintinuous into a full SLAM system (iSAM) with loop-closures to create fully consistent 3D meshes of entire buildings. More details including the output PCD files and a technical paper are available here.
As a final blog post for this Toyota Code Sprint, I am attaching the final report I have written for the sponsors.
The presentation slides associated with the report:
And the midterm report:
Just a quick note that the visual odometry library we have been using within our particle filter - called FoVis (Fast Visual Odometry) - has been released. It supports both stereo cameras (including the Point Grey Bumblebee) and depth cameras (MS Kinect, Asus Xtion Pro Live) and achieves 30Hz using a single CPU core. I haven't used any other VO system, but it comes highly recommended.
FoVis was developed by Nick Roy's group here at MIT. More details can be found in this paper:
- Albert S. Huang, Abraham Bachrach, Peter Henry, Michael Krainin, Daniel Maturana, Dieter Fox, and Nicholas Roy: Visual Odometry and Mapping for Autonomous Flight Using an RGB-D Camera. In: Int. Symposium on Robotics Research (ISRR), Flagstaff, Arizona, USA, Aug. 2011
So I've been doing some analysis to determine the overall benefit, within our application, of the improvements mentioned below. This should also give a feel for where future improvements will have the most impact.
High Level Timing
First, let's look at the high-level compute usage. We ran several tests measuring elapsed time for different numbers of particles, comparing (1) likelihood evaluation on the GPU (using the shader) against (2) evaluation on the CPU (as previously).
The data in this case was at 10Hz, so real-time performance amounts to the sum of each set of bars being less than 0.1 seconds. For low numbers of particles, the visual odometry library, FoVis, represents a significant share of the computation: more than 50 percent at 100 particles. As the number of particles increases, however, the balance of computation shifts to the likelihood function.
The visual odometry and the likelihood function could be moved to separate threads and overlapped to hide latency, but we haven't explored that yet.
Other components, such as particle propagation and resampling, account for very small fractions of the overall computation and are practically negligible.
In particular, you can see that the likelihood evaluation time grows much more quickly in the CPU case.
Low Level Timing of Likelihood
The major elements in the likelihood are the rendering of the model view and the scoring of the simulated depths - again using the results of these same runs.
Basically, the cost of rendering is fixed: a direct function of the (building) model complexity and the number of particles. We currently insert the ENTIRE model into OpenGL on each iteration, so there is a good saving to be had by determining the potentially visible set of polygons. This requires clustering the particles and extra caching at launch time, but could yield an order of magnitude improvement.
It's very interesting to see how significant the cost of the scoring is here. Hordur put a lot of work into adding OpenGL Shading Language (GLSL) support. The effect is that for large numbers of particles (e.g. 900), scoring takes only 5% of the available time (5% of 0.1 seconds); doing it on the CPU would take 30%.
For a simplified model this will be even more important, as the scoring cost will remain as is while the render time falls substantially.
NOTE: I think that some of these runs are slightly unreliable. For portions of operation there was a characteristic chirp in computation. I suspect this was the GPU being clocked down, or perhaps my OpenGL-based viewer soaking up the GPU. I'll need to re-run to check.
I finally managed to port Misha's Poisson reconstruction implementation to PCL. It now works properly and passes the unit tests, with the same behavior as the original application. Furthermore, we are adapting it to the PCL coding style.
A new pcl::surface base class has been introduced, prompted by some confusion communicated via the pcl-users mailing list. The new CloudSurfaceProcessing class is the base class for algorithms that take a point cloud as input and produce a new output cloud, modified towards a better surface representation. Such algorithms include surface smoothing, hole filling, cloud upsampling, etc.
In this category we have the MovingLeastSquares algorithm, with the additional features mentioned in previous blog posts, and the all-new BilateralUpsampling algorithm, based on the bilateral filter (for more details on the workings of the algorithm, please see the previous post). Suat was kind enough to modify the OpenNIGrabber so that it outputs 1280x1024 PointXYZRGB clouds when the Kinect is set to high-resolution RGB; this means every second row and column contains NaN depth values. Here is where BilateralUpsampling comes in, using the RGB information to fill in the missing depth data, as visualized in the example image below.
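Roughly, using it looks like this - a minimal sketch against the trunk interface, where the window size and sigma values are illustrative assumptions rather than tuned defaults:

```cpp
#include <pcl/point_types.h>
#include <pcl/surface/bilateral_upsampling.h>

// Fill in the NaN depths of a high-resolution organized RGB-D cloud,
// using the RGB channels to guide the interpolation.
pcl::PointCloud<pcl::PointXYZRGB>
upsampleDepth (const pcl::PointCloud<pcl::PointXYZRGB>::ConstPtr& cloud)
{
  pcl::BilateralUpsampling<pcl::PointXYZRGB, pcl::PointXYZRGB> bu;
  bu.setInputCloud (cloud);
  bu.setWindowSize (15);    // filter window (pixels); illustrative value
  bu.setSigmaColor (15.0f); // range sigma on RGB differences
  bu.setSigmaDepth (1.5f);  // range sigma on depth differences
  // Depending on the trunk version, the camera projection matrix may
  // also need to be provided for the reprojection step.

  pcl::PointCloud<pcl::PointXYZRGB> output;
  bu.process (output);
  return output;
}
```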
In this post I’m going to talk about some GPU optimizations which have increased the speed of the likelihood function evaluation. Hordur Johannsson was the one who did most of this OpenGL magic.
Evaluating Individual Measurement to Model Associations
This is the primary approach of this method. Essentially, by using the Z-buffer and assuming a generative ray-based cost function, we can evaluate the likelihood along the ray rather than as a distance in Euclidean space. This was discussed in a previous post below; I mention it here only for context.
Computing Per-pixel Likelihood Functions on the GPU using a Shader
The likelihood function we are currently evaluating is a normalized Gaussian with an added noise floor:
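In assumed notation, with measured depth $z$, simulated depth $\hat{z}$, noise floor $\alpha$ and noise scale $\sigma$, such a function has the form:

$$p(z \mid \hat{z}) \;=\; \alpha \;+\; \frac{1-\alpha}{\sqrt{2\pi}\,\sigma}\,\exp\!\left(-\frac{(z-\hat{z})^2}{2\sigma^2}\right)$$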
Previously this was computed on the CPU by transferring the depth buffer back to the CPU from the GPU. We have instead implemented a GPU shader to compute this on the GPU.
Currently we read this function from a pre-computed look-up table. Next we want to evaluate it analytically in the shader, so as to explore optimizing the function's shape as well as things like image decimation. (Currently we decimate the depth image to 20x15 pixels.)
Summing the Per-pixel Likelihood on the GPU
Having evaluated the log likelihood per pixel, the next step is to combine them into a single log likelihood per particle by summation:
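In the notation assumed above, with $p(z_i \mid \hat{z}_i)$ the per-pixel likelihood:

$$\log L_{\text{particle}} \;=\; \sum_{i \in \text{pixels}} \log p\,(z_i \mid \hat{z}_i)$$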
This can be optimized by parallel summation of the pixel images, e.g. from 32x32 to 16x16 and so on down to 1x1 (the final particle likelihood). In addition, there may be some accuracy improvement from summing similarly sized values, avoiding floating point rounding errors: e.g. (a+b) + (c+d) instead of ((a+b)+c) + d.
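For illustration, here is a CPU sketch of the same pairwise idea (not the GPU shader code):

```cpp
#include <vector>
#include <cstddef>

// Pairwise (tree) reduction: sums values as (a+b)+(c+d) rather than
// ((a+b)+c)+d, mirroring the mipmap-style GPU reduction. Keeping the
// summands at similar magnitudes reduces floating point rounding error.
float pairwiseSum (std::vector<float> v)
{
  for (std::size_t n = v.size (); n > 1; n = (n + 1) / 2)
  {
    for (std::size_t i = 0; i < n / 2; ++i)
      v[i] = v[2 * i] + v[2 * i + 1];
    if (n % 2 == 1)        // odd element left over: carry it through
      v[n / 2] = v[n - 1];
  }
  return v.empty () ? 0.0f : v[0];
}
```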
We’re currently working on this but the speed up is unlikely to be as substantial as the improvement in the previous section.
Single Transfer of Model to GPU
In addition to the above: previously we used a mixed polygon model, but we have now switched to a model made up only of triangles. This allows us to buffer the model on the GPU once, and instead transmit just the indices of the triangles to be rendered in a particular iteration (currently all of them).
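A minimal sketch of this buffering pattern in fixed-function OpenGL (the names and data layout are assumptions, not our actual code):

```cpp
#include <GL/gl.h>   // plus an extension loader (e.g. GLEW) in practice

GLuint vbo = 0, ibo = 0;

// One-off, at startup: push the full triangle model into GPU memory.
void uploadModel (const float* vertexXYZ, int numVertices,
                  const unsigned int* triIndices, int numTriangles)
{
  glGenBuffers (1, &vbo);
  glBindBuffer (GL_ARRAY_BUFFER, vbo);
  glBufferData (GL_ARRAY_BUFFER, numVertices * 3 * sizeof (float),
                vertexXYZ, GL_STATIC_DRAW);

  glGenBuffers (1, &ibo);
  glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, ibo);
  glBufferData (GL_ELEMENT_ARRAY_BUFFER, numTriangles * 3 * sizeof (unsigned int),
                triIndices, GL_STATIC_DRAW);
}

// Per iteration: draw some subset of the buffered triangles; currently
// the "subset" is simply all of them.
void drawModel (int numTriangles)
{
  glBindBuffer (GL_ARRAY_BUFFER, vbo);
  glEnableClientState (GL_VERTEX_ARRAY);
  glVertexPointer (3, GL_FLOAT, 0, 0);
  glBindBuffer (GL_ELEMENT_ARRAY_BUFFER, ibo);
  glDrawElements (GL_TRIANGLES, numTriangles * 3, GL_UNSIGNED_INT, 0);
}
```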
Future work will look at determining the set of potentially visible polygons, perhaps using a voxel-based index grid. This is more complex, as it requires either implicit or explicit clustering of the particle set.
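As a sketch of what such an index grid could look like (purely illustrative; this is not implemented):

```cpp
#include <map>
#include <vector>
#include <cmath>

// Map each voxel of a coarse 3D grid to the triangles near it. At query
// time, the voxels around the particle cluster give a conservative
// potentially-visible set of triangle indices to pass to OpenGL.
struct VoxelKey
{
  int x, y, z;
  bool operator< (const VoxelKey& o) const
  { return x != o.x ? x < o.x : (y != o.y ? y < o.y : z < o.z); }
};

class VoxelIndexGrid
{
public:
  explicit VoxelIndexGrid (float voxelSize) : size_ (voxelSize) {}

  // Crude but cheap: bin each triangle by its centroid.
  void addTriangle (int triIdx, float cx, float cy, float cz)
  { grid_[key (cx, cy, cz)].push_back (triIdx); }

  // Triangles in the voxel containing (x, y, z), or NULL if empty.
  const std::vector<int>* trianglesNear (float x, float y, float z) const
  {
    std::map<VoxelKey, std::vector<int> >::const_iterator it =
        grid_.find (key (x, y, z));
    return it == grid_.end () ? 0 : &it->second;
  }

private:
  VoxelKey key (float x, float y, float z) const
  {
    VoxelKey k = { (int)std::floor (x / size_),
                   (int)std::floor (y / size_),
                   (int)std::floor (z / size_) };
    return k;
  }
  float size_;
  std::map<VoxelKey, std::vector<int> > grid_;
};
```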
In addition to the above, we now render into an off-screen buffer, which lets us render virtual images without being limited by the machine's specific screen resolution.
Putting it all together
We've benchmarked the improvements using a single-threaded application which carries out particle propagation (including visual odometry), rendering, likelihood scoring and resampling. The test log file was 94 seconds long at about 10Hz (about 950 frames in total). We focus here on increasing the number of particles while retaining real-time operation with all modules fully computed.
The biggest improvement we found came from using the triangle model. This resulted in roughly a 4x increase in processing speed. Yikes!
Using the triangle model, we then tested with 100, 400 and 900 particles the effect of computing the likelihood on the GPU using a shader:
This results in a 20-40% improvement, although we would like to collect more sample points to verify it. The upshot is that we can now achieve real-time performance with 1000 particles. For this log file we didn't observe significant improvement in accuracy beyond about 200 particles, so the next step is to start working with more aggressive motion and data with motion blur, where being able to support extra particles will become important.
In my next post I’ll talk about how that computation is distributed across the various elements of the application.
Time for a new blog post. Lately, I have been working on image-based approaches to our smoothing and surface reconstruction problems. A straightforward but very effective method I have wanted to implement for a long time is the one in:
The idea behind it is to use the RGB image to enhance the depth image, via a joint bilateral filter of the form

$$\tilde{D}_p = \frac{1}{W_p} \sum_{q \in \Omega} D_q\, f(\lVert p - q \rVert)\, g(\lVert I_p - I_q \rVert),$$

where, in our case, $D$ is the depth image and $I$ is the RGB image ($f$ and $g$ are the spatial and range Gaussian kernels, and $W_p$ is the normalizing weight).
The nice thing about it is that we can use the 15Hz mode of the Kinect to produce high-resolution (1280x1024 px) RGB images alongside normal (640x480 px) depth images; with this method we can then obtain 1280x960 depth images. I ran into some problems registering the depth to the high-resolution RGB images, so the results below show only the 640x480 depth and color case.
As you can see, there are a lot of new points in the filtered point cloud (168152 vs 141511 in the input), no noisy points, and their positions are coherent with their neighbors.
Just as a teaser, an example of the high-resolution image-based upsampling (will come back with more results once we solve the registration problem mentioned above):
Also, in the meantime, I have spent a frustratingly large amount of time fixing the Poisson implementation we had in PCL. It seems there was a memory error in the version a Google Summer of Code participant introduced, so in the coming days we shall try to bring in the newer version.
Lately, I have concentrated on coming up with new and better upsampling methods for the Moving Least Squares algorithm. Also, a lot of the issues with the ones presented last time have been solved.
Everything has been committed to trunk (along with a complete interface and documentation) and should be included in the next PCL release. The upsampling methods are the following: NONE, SAMPLE_LOCAL_PLANE, RANDOM_UNIFORM_DENSITY, and VOXEL_GRID_DILATION.
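For reference, a minimal usage sketch against the trunk interface (the radii and step size are illustrative assumptions):

```cpp
#include <pcl/point_types.h>
#include <pcl/search/kdtree.h>
#include <pcl/surface/mls.h>

typedef pcl::PointXYZ PointT;

// Smooth and upsample a cloud with the SAMPLE_LOCAL_PLANE method.
pcl::PointCloud<PointT>
upsample (const pcl::PointCloud<PointT>::ConstPtr& cloud)
{
  pcl::search::KdTree<PointT>::Ptr tree (new pcl::search::KdTree<PointT>);
  pcl::MovingLeastSquares<PointT, PointT> mls;
  mls.setInputCloud (cloud);
  mls.setSearchMethod (tree);
  mls.setSearchRadius (0.03);        // MLS fitting radius (meters)
  mls.setPolynomialFit (true);
  mls.setUpsamplingMethod (pcl::MovingLeastSquares<PointT, PointT>::SAMPLE_LOCAL_PLANE);
  mls.setUpsamplingRadius (0.01);    // radius of the sampled local disc
  mls.setUpsamplingStepSize (0.005); // spacing between generated samples

  pcl::PointCloud<PointT> output;
  mls.process (output);
  return output;
}
```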
A quick timing analysis shows that the running times are well within the 2 minutes mentioned in the project requirements. Importantly, the bulk of the time (~35s) is spent computing the MLS surface, and only about 1-3s on the actual upsampling. We can therefore conclude that the quality improvements of the upsampling are well worth the additional ~5% increase in execution time.
| Upsampling method | Time (s) | Resulting # points |
A more conclusive test is to take a Kinect cloud of a wall at a distance where the noise is accentuated (~3m) and try to fit a plane to each of the resulting upsampled clouds. To make the experiment more realistic, we took the picture of the wall at an angle, so that the quantization effects increase along the wall. The numeric results are the following:
| Upsampling method | Cloud # points | % inliers |
Unfortunately, these numerical values do not reflect the actual quality of the fit, because of the varying point density across the cloud under the different upsampling methods (the parts of the wall closer to the sensor have higher density and precision in the original cloud, and sparsity and noise increase with distance from the sensor; but in VOXEL_GRID_DILATION and RANDOM_UNIFORM_DENSITY the density is constant across the cloud, meaning the noisy part of the wall has the same number of points as the more precise part).
As such, in order to analyze the quality of the fit, we do a visual analysis of the inliers/outliers ratio, as shown in the following pictures:
Original cloud and its plane inliers
NONE cloud and its plane inliers
SAMPLE_LOCAL_PLANE cloud and its plane inliers
RANDOM_UNIFORM_DENSITY cloud and its plane inliers
VOXEL_GRID_DILATION cloud and its plane inliers
The conclusion is that the VOXEL_GRID_DILATION method behaves best, as it has the fewest holes of all the options.
So this is a wrap-up for the MLS smoothing. Next, I shall be looking into image-based hole filling and how it can be applied to our problem. This will involve some experiments in MATLAB and adding some code to the PCL I/O framework.
An overview of the system we are developing is illustrated at the bottom of this post. Sensor output is illustrated in lemon color, although only RGB-D data is considered in this project. Modules we have been actively developing are colored blue. The module in orange is our incremental motion estimation algorithm; it currently utilizes a feature-based visual odometry algorithm. One alternative would be the approach of Steinbrucker et al. (ICCV Workshops 2011), which was developed specifically for RGB-D.
We have also illustrated in pink three modules which would naturally augment the point cloud-based localization system by providing redundancy and recovery methods. We have been developing these modules independently of PCL. They are illustrated here to demonstrate the flexibility of an SMC-based approach to localization.
Some typical VO performance for motion along a corridor. The boxes are 5m in size. We used a Kinect and the Freenect driver for this.
I'm currently finalizing a particle filter which uses Eigen's Isometry3d and Boost. My old version used GSL for random number generation, which I want to move away from.
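A minimal sketch of what the Boost-based sampler could look like (the sigma value is a placeholder, and rotation noise is omitted for brevity):

```cpp
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/normal_distribution.hpp>
#include <boost/random/variate_generator.hpp>
#include <Eigen/Geometry>

// Perturb a particle pose with zero-mean Gaussian noise on translation,
// replacing the old GSL-based sampling with Boost.Random.
Eigen::Isometry3d propagate (const Eigen::Isometry3d& pose, boost::mt19937& rng)
{
  boost::normal_distribution<double> dist (0.0, 0.01);   // sigma = 1 cm
  boost::variate_generator<boost::mt19937&, boost::normal_distribution<double> >
      gauss (rng, dist);

  Eigen::Isometry3d noisy = pose;
  noisy.translation () += Eigen::Vector3d (gauss (), gauss (), gauss ());
  return noisy;
}
```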
With some very good advice from Zoltan and a lot of hacking, we now have 3 upsampling methods for the MLS algorithm.
1. NONE
No additional points are created here: the input point cloud is simply projected onto its own MLS surface. This is exactly what we previously tested and presented in a recent blog post.
2. SAMPLE_LOCAL_PLANE
For each point, we sample its local plane by creating points inside a circle with fixed radius and fixed step size. Then, using the fitted polynomial, we compute the normal at that position and add the displacement along the normal. To reject noisy points, we increased the threshold on the number of points needed to estimate the local polynomial fit; this guarantees that points with a 'weak neighborhood' (i.e., noise) do not appear in the output.
And a few results:
The first picture is the reconstruction of coke bottles on a table. Please notice the correction for the quantization effects. The table surface is now planar and the objects look ‘grippable’. The second picture is a similar scenario, now using tupperware. The last picture shows how well the door handle is reconstructed. We conclude that visually, the results are very good.
An immediate problem is that this method adds the same number of new samples to every point, without taking the local point density into account. An improvement is to filter the result with a voxel grid in order to obtain a uniform point density. This operation is optional, though, and is useful mainly for saving memory (see the picture below for a comparison between upsampled and upsampled + voxel grid).
3. RANDOM_UNIFORM_DENSITY
This method takes as input a desired point density within a neighborhood of fixed radius. For each point, based on the density of its vicinity, we add more points on the local plane using a random number generator with a uniform distribution, then apply the same procedure as for 2 to project each point onto the MLS surface.
The results are satisfying. Compared to 2, we no longer need to apply the expensive voxel grid filter. One issue is that, because we generate the points with a random number generator, the output point cloud looks a bit messy (compared to 2, where points are generated on a grid determined by the step size), but the surface is still well preserved. Also, the time performance is poor because of the RNG.
We are also experimenting with a fourth method. It makes sense theoretically, but in practice we are having serious issues optimizing it to fit into main memory. The idea behind it is to take each point pair within a fixed-radius neighborhood and sample the line connecting the two points. Ideally, this would fill up any small holes inside the cloud. The downside is that it also creates a lot of additional points in already dense areas. Handling this elegantly is something we need to think about.
In this blog post, we shall inspect the mesh construction methods available in PCL.
Marching Cubes
The algorithm was first presented 25 years ago in:
- William E. Lorensen, Harvey E. Cline: Marching Cubes: A high resolution 3D surface construction algorithm. In: Computer Graphics, Vol. 21, Nr. 4, July 1987
In PCL, these are implemented in the MarchingCubes class, with the variants MarchingCubesGreedy and MarchingCubesGreedyDot. The 'greedy' comes from the way the voxelization is done: starting from a point cloud, we create a voxel grid in which we mark voxels as occupied if a point is close enough to the center. Obviously, this allows us to create meshes with a variable number of vertices (i.e., to subsample or upsample the input cloud). We are interested in the performance of this algorithm on noisy Kinect data. Time-wise, the algorithm ran in about 2-3 seconds for a 640x480 cloud.
The following figure shows the results for various leaf sizes (from left to right, bottom to top: leaf size of 0.5 cm, 1 cm, 3 cm, and 6 cm, respectively):
And a close-up on the highest resolution cloud:
We conclude that the results are not satisfactory: the upsampling is 'artificial' and does not inherit the properties of the underlying surface. Furthermore, there is no noise-removal mechanism, and the blocking artifacts are disturbing.
Naive Algorithm for Organized Point Cloud Triangulation
This algorithm is implemented in the OrganizedFastMesh class in PCL. The idea behind it is very simple: take each point in the inherent 2D grid of a Kinect cloud and triangulate it with its immediate neighbors in the grid. One can quickly see that NaN points (points not captured by the sensor) will result in holes in the mesh. This is a pure mesh construction method: it outputs a mesh with exactly the same vertices as the input cloud, and does not handle noise or NaN values in any way.
A screenshot of the output can be seen in the following figure. Visually, the result is decent, considering that the processing time is extremely small: just a single pass through all the points of the cloud.
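A minimal usage sketch (TRIANGLE_ADAPTIVE_CUT is the option we use for the VTK smoothing tests):

```cpp
#include <pcl/point_types.h>
#include <pcl/PolygonMesh.h>
#include <pcl/surface/organized_fast_mesh.h>

typedef pcl::PointXYZ PointT;

// Triangulate an organized (e.g. 640x480 Kinect) cloud in a single pass.
pcl::PolygonMesh
triangulate (const pcl::PointCloud<PointT>::ConstPtr& cloud)
{
  pcl::OrganizedFastMesh<PointT> ofm;
  ofm.setInputCloud (cloud);
  ofm.setTriangulationType (pcl::OrganizedFastMesh<PointT>::TRIANGLE_ADAPTIVE_CUT);

  pcl::PolygonMesh mesh;  // vertices are exactly the input points; NaNs leave holes
  ofm.reconstruct (mesh);
  return mesh;
}
```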
We have managed to collect all the datasets required by Toyota. For a complete description, please visit the following link (also accessible from my main page).
Programming-wise, we have spent time fixing bugs and beautifying the pcl_surface module. After I finish my exams next week, I shall start looking into implementing some new algorithms.
So as to quantify the benefit of using our depth image simulation method for localization, we need to do some benchmarking. There are a number of parameters we need to test:
And some metrics:
Where error is measured using LIDAR-based ground truth.
In addition to this, we also developed a second type of likelihood function. This method is essentially equivalent to ICP, without the iteration: each pixel's location is compared to the nearest point in Euclidean space.
In this figure we try to illustrate some examples where the scores would be substantially different. We have an RGB-D sensor taking two depth samples (purple) of something like the end of a corridor (green, seen from above):
For the depth image simulation method (blue, ray-to-plane), the likelihood is formed by comparing against the difference in depth (blue star). For the ICP-type method (red, point-to-plane), the distance can be very different: in the lower case the distances are approximately the same, but in the upper case the association is very different. Also, the ICP-type method requires searching over the entire set of planes for the explicit correspondence, which makes it quite expensive.
We wanted to test what effect this choice has on accuracy. Below are some figures showing the results of an extensive set of experiments. The figures were produced on my two-year-old 4-core 2.53GHz Core 2 machine with an NVIDIA Quadro 1700M (32 GPU cores). Each result is the average of 20 independent Monte Carlo runs; the total testing is equivalent to 16 hours of runtime.
First of all, error broadly converges to about the same accuracy as the number of particles increases:
The 50cm figure is largely because we aim to maintain a multi-modal estimate, so as to be more robust when tracking. To get optimal performance for each frame, one could run ICP using the most likely particle as the starting condition; we haven't done that here.
This figure shows the timing performance as the number of particles increases:
Our implementation of the nearest-plane lookup was pretty slow. However, our target framerate of 10Hz was achieved with 100 particles for the depth image likelihood function (ray-to-plane). As you saw above, the error has typically converged by 100 particles, so that has been sufficient for real-time operation with the type of motion shown in the figure below.
For all of these figures we decimated the original RGB-D image by a factor of 32. Next we would like to look at what effect that has - especially now that my new laptop has 256 GPU cores.
Additionally, we want to look at sub-dividing the full model, so that only the planes near the particle poses (within 20m) are passed to OpenGL for testing. This is likely to give a substantial performance improvement. I believe that 1000 particles at 10Hz should be achievable.
Interestingly, the approach developed by Ryohei Ueda during his internship at PCL/Willow Garage is very close to what we do here: the tracking paradigm is essentially inverted. Later in this project it would be interesting to apply the depth image simulation method to his tracking algorithm and see if it can be sped up. Given that the object models he uses are much smaller than a building model, it should.
For the VTK smoothing tests, we took the raw clouds, triangulated them using the OrganizedFastMesh triangulation with the TRIANGLE_ADAPTIVE_CUT option, and then fed this to the 3 smoothing algorithms from the VTK library.
The first one tested is MeshSmoothingLaplacianVTK, with the default parameters recommended by VTK but with the number of iterations increased from 20 to 100.
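In code, the pipeline looks roughly like this (a sketch assuming the trunk header layout; only the iteration count differs from the defaults):

```cpp
#include <pcl/PolygonMesh.h>
#include <pcl/surface/vtk_smoothing/vtk_mesh_smoothing_laplacian.h>

// Laplacian-smooth a triangulated mesh (e.g. the OrganizedFastMesh
// output); everything but the iteration count is left at the defaults.
pcl::PolygonMesh
smooth (const pcl::PolygonMesh::ConstPtr& mesh)
{
  pcl::MeshSmoothingLaplacianVTK smoother;
  smoother.setInputMesh (mesh);
  smoother.setNumIter (100);   // raised from the default of 20

  pcl::PolygonMesh output;
  smoother.process (output);
  return output;
}
```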
Here the results are satisfactory: in both cases the quantization artifacts are reduced, though still visible.
Also, if we look at the corresponding mesh, the reconstruction after smoothing looks more natural, with a better surface curvature.
Bottles and Tupperware Datasets
In this case, Laplacian smoothing no longer works well: the quantization and the high noise level are still present in both the bottles and the Tupperware datasets. The main reason is that the objects of interest were quite far from the sensor, so the quantization artifacts are quite accentuated (i.e., there are large gaps between points belonging to the same object).
The mesh subdivision schemes provided by the VTK library are not of great use for our scenarios, as they just split up the triangles in the mesh, inheriting their artifacts. Furthermore, these schemes are highly dependent on the quality of the initial triangulation, which in our case is the simple OrganizedFastMesh and does not yield excellent results. They basically just resample points on the triangles present in the input mesh, without taking into consideration any more complex information about the vertex neighborhoods.
Another thing we tried was to combine simple subdivision with the previous Laplacian smoothing; the results are visually decent, as shown in the next figure. Again, we inherit the problems of the subdivision scheme (the holes caused by the incorrect triangulation).
In the meantime, I have worked on solving some issues with Zoltan Marton's Greedy Projection Triangulation. Two Trac issues regarding this were resolved, but its current state does not allow us to reconstruct Kinect scans; once we solve this, I will benchmark the GP3 algorithm too. Other time-consuming fixes were done for OrganizedFastMesh.
A direction we definitely need to look into is algorithms that also add points during reconstruction. The original MLS and GP3 papers mention this possibility, but it has not been implemented in PCL yet. It is clear that we do not yet have the Holy Grail of smoothing.