PCL Developers blog

Andreas Mützel

This is my personal page

email: amuetzel (at) uni-koblenz.de
project: KdTrees on GPU using CUDA
mentor: Marius Muja (Stéphane Magnenat [Radu B. Rusu, Michael Dixon])

About me

I’m a master’s student at the University of Koblenz-Landau in the Computational Visualistics program. There, my main interests are image processing and robotics, and this is my first serious GPU programming project.

Apart from computer stuff, such as programming and games, my hobbies include playing the guitar, listening to heavy metal music and (recently) rock climbing.

Roadmap

Here is a brief outline of my GSoC project milestones:

  • Get familiar with FLANN - http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
    • FLANN is currently used in PCL for k-NN search and for radius search
    • it contains several index types; of special interest is the low-dimensionality kd-tree index (KDTreeSingleIndex)
    • have a look at the FLANN public interface (the flann::Index class); we would like the GPU implementation to use the same interface (see the sketch after this list). If you think this interface needs to change in some way to allow better performance, we can discuss it and come up with the best solution
  • KDTree on GPU
    • Get familiar with an existing kd-tree implementation on GPU: http://www.cs.unc.edu/~shawndb/
    • Benchmark against the CPU implementation and FLANN’s KDTreeSingleIndex
  • 3D/4D kdtree GPU implementation in FLANN
    • Create a GPU kd-tree implementation (may use lessons learned from Shawn's implementation and from the CPU implementation in FLANN)
    • Analyze the trade-offs and changes that a GPU implementation requires vs. a CPU implementation
    • Profile the GPU implementation and the CPU implementation to understand the bottlenecks in each case
    • Document and write a tutorial on how to use the GPU implementation
  • High dimensional GPU implementation of existing FLANN algorithms
    • Is it feasible to implement any of the higher dimensional approximate nearest neighbor search algorithms from FLANN on GPU?
    • If yes, implement it and benchmark it against the CPU implementation.
    • Document and write a tutorial on how to use the GPU implementation
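
For orientation, here is a minimal sketch of how the flann::Index interface mentioned above is used for k-NN search; it is hedged, since matrix and parameter types differ slightly between FLANN versions:

  #include <flann/flann.hpp>
  #include <vector>

  int main()
  {
      // 1000 3D points and a single query point (zero-filled here for brevity).
      std::vector<float> data(1000 * 3, 0.0f);
      std::vector<float> query(3, 0.0f);

      flann::Matrix<float> dataset(&data[0], 1000, 3);
      flann::Matrix<float> queries(&query[0], 1, 3);

      // The low-dimensionality kd-tree index mentioned above.
      flann::Index<flann::L2<float> > index(dataset, flann::KDTreeSingleIndexParams());
      index.buildIndex();

      // Result buffers: one row per query, one column per neighbor.
      const int knn = 4;
      std::vector<int> idx(knn);
      std::vector<float> dist(knn);
      flann::Matrix<int> indices(&idx[0], 1, knn);
      flann::Matrix<float> dists(&dist[0], 1, knn);

      index.knnSearch(queries, indices, dists, knn, flann::SearchParams());
      return 0;
  }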

Click here for a more detailed roadmap

Recent status updates

GPU KD Tree Integrated Into FLANN
Tuesday, August 16, 2011

A few days ago, I sent a patch containing my GSoC work to Marius, and it is now in FLANN's git head. If you want to try it, you just have to install CUDA and build FLANN from git. Instructions on how to use the new index type can be found in the manual. In short, I finished all three mandatory points of my roadmap, but sadly there wasn't enough time to implement high-dimensional index structures in FLANN.
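
To give an idea of what this looks like in code, here is a hedged sketch, assuming the CUDA index is selected via flann::KDTreeCuda3dIndexParams as in git head; the manual is the authoritative reference:

  #include <flann/flann.hpp>

  // Assumes FLANN was built from git with CUDA support enabled.
  void gpu_knn(flann::Matrix<float>& dataset, flann::Matrix<float>& queries,
               flann::Matrix<int>& indices, flann::Matrix<float>& dists, int knn)
  {
      // Same flann::Index interface as the CPU indices; only the index
      // parameters select the CUDA kd-tree.
      flann::Index<flann::L2<float> > index(dataset, flann::KDTreeCuda3dIndexParams());
      index.buildIndex();

      index.knnSearch(queries, indices, dists, knn, flann::SearchParams());
  }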

This was probably my final GSoC update, since I'm leaving on a trip tomorrow and returning on Sunday. I really enjoyed working on this project, so I guess you can expect some more improvements to the CUDA FLANN kd-tree after GSoC :-)

Nearly Done
Sunday, August 07, 2011

Sorry for the long stretch without updates; I was working on the GPU tree build and didn't have any interesting results until now. The bottom line: building the kd-tree on the GPU can be about 2-3 times faster than FLANN's CPU build method, though I haven't run any complete benchmarks yet.

A nice side effect is that the GPU tree build leads to a slightly different memory layout that makes tree traversal a bit (~5%) faster than before.

I’m currently starting to polish the code, document it and write tests to verify that it works correctly. If everything goes the way I planned, it should be ready for a merge with FLANN HEAD by the end of the week.
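
The tests essentially boil down to comparing the CUDA index against FLANN's exact linear scan on the same data. A hedged sketch of such a check (the function name, buffer setup, and tolerance are placeholders of mine, not the actual test code):

  #include <flann/flann.hpp>
  #include <cassert>
  #include <cmath>

  // Sanity check: the CUDA index and an exact linear scan should return
  // (nearly) identical neighbor distances for the same queries.
  void checkAgainstBruteforce(flann::Matrix<float>& dataset, flann::Matrix<float>& queries,
                              flann::Matrix<int>& gpuIdx, flann::Matrix<float>& gpuDist,
                              flann::Matrix<int>& refIdx, flann::Matrix<float>& refDist,
                              int knn)
  {
      flann::Index<flann::L2<float> > gpu(dataset, flann::KDTreeCuda3dIndexParams());
      gpu.buildIndex();
      gpu.knnSearch(queries, gpuIdx, gpuDist, knn, flann::SearchParams());

      flann::Index<flann::L2<float> > ref(dataset, flann::LinearIndexParams());
      ref.buildIndex();
      ref.knnSearch(queries, refIdx, refDist, knn, flann::SearchParams());

      // Compare distances, not indices: ties may be ordered differently, and
      // GPU floating-point results can differ in the last bits.
      for (size_t q = 0; q < queries.rows; ++q)
          for (int k = 0; k < knn; ++k)
              assert(std::fabs(gpuDist[q][k] - refDist[q][k]) < 1e-5f);
  }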

Radius Search and GPU Tree Build
Tuesday, July 26, 2011

My current status: radius search works, but its speed could still use some improvement. The naive GPU tree build basically works as well, but it would only be about 25-50% faster than the CPU version. Zhou et al. achieved a speedup of about 8x, so I'm starting over with an approach that is more similar to theirs. (I'm not trying to implement their approach exactly, as the paper leaves out some implementation details, so I don't know for sure how they implemented it...)
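
For the new attempt, the core primitive of a breadth-first build in the style of Zhou et al. is splitting each node's points against a split plane, which maps nicely to a parallel partition. A rough Thrust illustration of that building block (a sketch only, not their actual algorithm; the point type and the choice of split plane are placeholders):

  #include <thrust/device_vector.h>
  #include <thrust/partition.h>

  struct Point3 { float x, y, z; };

  // Predicate: does the point lie on the left side of the split plane?
  struct LeftOfSplit
  {
      int axis;    // 0 = x, 1 = y, 2 = z
      float value; // split position along that axis

      __host__ __device__
      bool operator()(const Point3& p) const
      {
          const float c = (axis == 0) ? p.x : (axis == 1) ? p.y : p.z;
          return c < value;
      }
  };

  // One step of a breadth-first GPU build: reorder a node's points in place
  // so that points left of the split plane come first; the returned iterator
  // marks where the right child's points begin.
  thrust::device_vector<Point3>::iterator
  splitNode(thrust::device_vector<Point3>& pts, int axis, float value)
  {
      LeftOfSplit pred;
      pred.axis = axis;
      pred.value = value;
      return thrust::partition(pts.begin(), pts.end(), pred);
  }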

Status Update
Wednesday, July 20, 2011

Last week I was waiting for a FLANN API change that is necessary for radius search on the GPU. As the next steps (implementation of radius search, integration into FLANN and documentation) depend on it, I started with the next independent task: kd-tree construction on the GPU. This isn't as easy as the GPU search was, but I think it's about half done. Most of the individual functions should work, but not all of them are finished yet (and of course, the pieces still need to be put together).

I just noticed that the API change has landed in git, so I'll put the GPU build on hold and finish radius search first.
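
For reference, radius search goes through the same flann::Index interface as k-NN search. A rough sketch (note that for the L2 distance FLANN works in squared-distance units, so the radius should be squared as well; signatures may differ between versions):

  #include <flann/flann.hpp>

  // Find all neighbors within `radius` of each query point; the number of
  // neighbors stored per query is bounded by the width of the result buffers.
  int radiusSearchExample(flann::Index<flann::L2<float> >& index,
                          flann::Matrix<float>& queries,
                          flann::Matrix<int>& indices, flann::Matrix<float>& dists,
                          float radius)
  {
      // For L2, FLANN interprets the radius as a squared distance.
      return index.radiusSearch(queries, indices, dists, radius * radius,
                                flann::SearchParams());
  }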

Performance In Comparison To Other Implementations
Monday, July 11, 2011

Basically, my kd-tree is up and running now. Its performance is consistently better than FLANN's, but the speedup varies: in some tests it is as low as 8x, usually it is around 10x-15x, and in some rare peak cases it reaches almost 25x. I tried to find other GPU NN search implementations to compare my results against, but they are quite rare. One thing I found was http://nghiaho.com/?p=437, but its search was slightly slower than my implementation on synthetic data sets and much slower on real-world data.

In “GPU-accelerated Nearest Neighbor Search for 3D Registration” by Deyuan Qiu, Stefan May, and Andreas Nüchter, a GPU kd-tree is used for ICP and, via an approximate search, achieves a speedup of up to 88x, though this figure is for the complete ICP procedure, measured on a slower CPU and a slightly faster GPU than my system. Also, no information about the NN search time alone is given, so to compare the speeds I would have to test the complete ICP setup. If I interpret the results and Fig. 4 correctly, their search algorithm is faster than my implementation when only a coarse approximation of the nearest neighbor is needed, but when an exact nearest neighbor is required, I suspect that my code should be a lot faster.