PCL Developers blog

All blog posts for Andreas Mützel

GPU KD Tree Integrated Into FLANN
Tuesday, August 16, 2011

A few days ago, I sent a patch containing my GSoC work to Marius, and it can be found in FLANN (git head) now. If you want to try it, you simply have to install CUDA and then build FLANN from git. Instructions on how to use the new index type can be found in the manual. In short, I finished all three mandatory points of my roadmap, but sadly there wasn’t enough time for the implementation of high-dimensional index structures in FLANN.

This was probably my final GSoC update, since I'm leaving on a trip tomorrow and returning on Sunday. I really enjoyed working on this project, so I guess you can expect some more improvements to the CUDA FLANN kd tree after GSoC :-)

Nearly Done
Sunday, August 07, 2011

Sorry for the long time without updates; I was working on the GPU tree build and didn't have any interesting results until now. The bottom line: building the kd tree on the GPU can be about 2-3 times faster than the FLANN CPU build method, though I haven't run any complete benchmarks yet.

A nice side effect is that the GPU tree build leads to a slightly different memory layout that makes tree traversal a bit (~5%) faster than before.

I’m currently starting to polish the code, document it and write tests to verify that it works correctly. If everything goes the way I planned, it should be ready for a merge with FLANN HEAD by the end of the week.

Radius Search and GPU Tree Build
Tuesday, July 26, 2011

My current status: radius search works, but could use some improvement in terms of speed... The GPU tree build: a naive implementation basically works, but it would only be about 25-50% faster than the CPU version. Zhou et al. achieved a speedup of about 8x, so I'm starting over with an approach that is closer to theirs. (I'm not trying to implement their approach exactly, as some implementation details are missing from the paper, so I don't know for sure how they did it...)

Status Update
Wednesday, July 20, 2011

Last week I was waiting for a FLANN API change that is necessary for radius search on the GPU. As the next steps (implementation of radius search, integration into FLANN and documentation) depend on it, I started with the next independent task: kd tree construction on the GPU. This isn't as easy as the GPU search was, but I think it's about half done. Most of the individual functions should work, but not all of them are finished yet (and of course, the pieces still need to be put together).

I just noticed that the API change is done in GIT, so I’ll put the GPU build on hold and finish radius search first.

Performance In Comparison To Other Implementations
Monday, July 11, 2011

My kd tree is basically up and running now. Its performance is always better than FLANN's, but the speed advantage varies: in some tests it is as low as 8x, usually it is around 10x-15x, and in some rare peak cases it reaches almost 25x. I tried to find other GPU NN search implementations to compare my results to, but they are quite rare. One thing I found was http://nghiaho.com/?p=437, but that search was slightly slower than my implementation on synthetic data sets and much slower on real-world data.

In “GPU-accelerated Nearest Neighbor Search for 3D Registration” by Deyuan Qiu, Stefan May, and Andreas Nüchter, a GPU kd tree is used for ICP and, via an approximate search, achieves a speedup of up to 88x, though this is for the complete ICP procedure, with a slower CPU and a slightly faster GPU than my system. Also, no figures are given for the NN search time alone, so to compare the speeds I would have to test the complete ICP setup. If I interpret the results and Fig. 4 correctly, their search algorithm is faster than my implementation when only a coarse approximation of the nearest neighbor is needed, but when an exact nearest neighbor is required, I suspect that my code should be a lot faster.

Optimizations
Sunday, July 10, 2011

Since the speedup in comparison to the KDTree CPU implementation didn’t seem that high to me, I tried several things to optimize the search algorithm. These are:

  • Structure of Arrays (SoA): split the node structure into several (32-bit and 128-bit) structs and store each part in its own array (see the sketch after this list). Normally this is supposed to yield performance gains on GPUs, but here it consistently decreased performance.
  • Recursive search: in the beginning, I re-wrote the tree traversal to work without a stack, but for that I needed to store the parent index of each node, which increased the struct size by 128 bits due to alignment. I tried instead to keep a stack in shared memory and perform the search the recursive way. This was also about 10% slower in the benchmarks presented in the last post.
  • Rotating split axes: another way to save 128 bits of node struct size was to rotate the split axis cyclically and drop that information from the struct. Result: again, no speed improvement.
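
To make the SoA experiment a bit more concrete, here is a rough sketch of the kind of split that was tried. The field names and the exact partition into a 32-bit and a 128-bit part are made up for illustration; the real structs in the code differ.

    // Array of Structures: one record per node (field set is hypothetical).
    struct NodeAoS
    {
        int   parent;       // example 32-bit extra field
        int   left_child;   // child indices (-1 for a leaf)
        int   right_child;
        int   split_axis;
        float split_value;
    };
    // tree stored as a single array: NodeAoS* nodes;

    // Structure of Arrays: the same data split into a 32-bit part and a 128-bit
    // part, each stored in its own array, so that the 128-bit part can be fetched
    // with one 16-byte load. In theory this improves coalescing; in these
    // benchmarks it was consistently slower.
    struct NodeSplit          // 128-bit part, loadable as a single int4/float4
    {
        int   left_child;
        int   right_child;
        int   split_axis;
        float split_value;
    };
    struct TreeSoA
    {
        int*       parent;    // 32-bit part
        NodeSplit* split;     // 128-bit part
    };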

So it seems like I came up with the best solution on the first try. I still have one or two ideas that might increase speed, but my hopes aren’t that high.

I think that kd tree traversal just isn't that well suited to GPU computation because of the large number of random memory accesses and, more generally, the high ratio of memory accesses to actual computation.

Real Data Benchmark
Saturday, July 09, 2011

So far, I had only benchmarked the algorithms on synthetic data, where the search and query points were randomly scattered inside a 3D unit cube. There, the minimal kd tree was faster than FLANN by a factor of about 20 in the 1NN case, as shown in my last benchmark post. In some (unpublished) benchmarks, the FLANN kd tree ported to the GPU was always about 8 times faster than FLANN on the same random data set.

In this post, I'm going to show some results with more realistic data sets. I modified Nick's benchmarking code to test the kNN search of FLANN, my FLANN GPU port and my minimal kd tree on the GPU. Two pcd files are loaded; the first one is used to build the NN index, while the points from the second one are used as query points. All tests used data from the office dataset available in the PCL data repository and were performed on a Phenom II 955 CPU and a GTX 260 GPU.
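
As a rough sketch of what one benchmark run boils down to (this is not Nick's actual code, and the choice of k and index parameters is arbitrary): load both clouds, build the index on the first, and query with the points of the second. A real benchmark would also time the individual steps and filter out the NaN points that Kinect clouds contain.

    #include <pcl/io/pcd_io.h>
    #include <pcl/point_types.h>
    #include <flann/flann.hpp>
    #include <vector>

    int main (int argc, char** argv)
    {
        if (argc < 3) return 1;
        pcl::PointCloud<pcl::PointXYZ> search_cloud, query_cloud;
        pcl::io::loadPCDFile<pcl::PointXYZ> (argv[1], search_cloud);   // e.g. office2.pcd
        pcl::io::loadPCDFile<pcl::PointXYZ> (argv[2], query_cloud);

        // repack the points into tightly packed float arrays (3 floats per point)
        std::vector<float> search_pts (search_cloud.size () * 3), query_pts (query_cloud.size () * 3);
        for (size_t i = 0; i < search_cloud.size (); ++i)
        {
            search_pts[3 * i]     = search_cloud[i].x;
            search_pts[3 * i + 1] = search_cloud[i].y;
            search_pts[3 * i + 2] = search_cloud[i].z;
        }
        for (size_t i = 0; i < query_cloud.size (); ++i)
        {
            query_pts[3 * i]     = query_cloud[i].x;
            query_pts[3 * i + 1] = query_cloud[i].y;
            query_pts[3 * i + 2] = query_cloud[i].z;
        }
        flann::Matrix<float> dataset (&search_pts[0], search_cloud.size (), 3);
        flann::Matrix<float> queries (&query_pts[0],  query_cloud.size (), 3);

        // build the index on the first cloud (swap the index params to test the GPU variants)
        flann::Index<flann::L2<float> > index (dataset, flann::KDTreeSingleIndexParams ());
        index.buildIndex ();

        // k nearest neighbors for every point of the second cloud
        const int k = 8;
        std::vector<int>   idx_buf (queries.rows * k);
        std::vector<float> dist_buf (queries.rows * k);
        flann::Matrix<int>   indices (&idx_buf[0],  queries.rows, k);
        flann::Matrix<float> dists   (&dist_buf[0], queries.rows, k);
        index.knnSearch (queries, indices, dists, k, flann::SearchParams ());
        return 0;
    }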

For the first test, ‘office2.pcd’ was used both as search and query points. The result is this:

../../_images/bench1_office22.png

As you can see, with real-world data the FLANN GPU kd tree is always faster than both FLANN on the CPU and the minimal tree. The build time of the GPU FLANN algorithm is about 3% higher than that of the normal FLANN implementation, and both build about twice as fast as the minimal tree. (The GPU build times include the upload to the GPU, and the GPU search times include the transfers to and from the GPU.) This pattern, with the two FLANN variants building about twice as fast as the minimal tree, repeated in all the tests, so build times are not shown in the other benchmarks.

The search time of the FLANN GPU algorithm is the shortest: for k=1, it is about 6x faster than FLANN and 30% faster than the minimal kd tree. At k=64, both are already equally fast. This is because the GPU is bad at handling the complexity of maintaining the result heap used in the kNN search, which takes more and more time as k increases.

In the second test, ‘office1.pcd’ was used as the search points, while the query points came from ‘office2.pcd’:

../../_images/bench1_office12.png

Here, the minimal kd tree was not tested, as it already timed out at k=32. With k=1, the FLANN GPU search takes only 0.025 s, while the FLANN search takes 0.86 s, which is 34x the GPU search time! In this test, the decreasing speed advantage with increasing k shows up again: with k=128, FLANN takes 19 s and CUDA FLANN 8 s. To show the differences at low values of k, the following image shows only the tests up to k=16:

../../_images/bench1_office12_k16.png

The last test used ‘office3.pcd’ to build the tree and searched for the points in ‘office4.pcd’. Here, only the FLANN and GPU FLANN results are shown, as the minimal tree started to time out on the GPU very soon.

../../_images/bench1_office34.png

Here, the GPU search was about 10x faster in the k=1 case, but the speed advantage decreased to a factor of 7 at k=8 and 2 at k=128.

To repeat the benchmark with your own pcd files, please download, build and install my patched FLANN from svn+ssh://svn@svn.pointclouds.org/flann/amuetzel/trunk/flann. Then, build the benchmarking code from svn+ssh://svn@svn.pointclouds.org/flann/amuetzel/trunk/benchmark. Finally, you can run the benchmark as "nnlib_benchmark <searchdata.pcd> <querydata.pcd>" and you'll get a nice output of the search times, plus an output text file that can be plotted using gnuplot. If you want to use the code in your own application, have a look at the instructions in the 'More FLANN progress' post below.

The result did not really surprise me. I guess the main reason for the fast results is the spatial organization of the search and query points: as both are (organized) point clouds from a Kinect, query points that are next to each other in the cloud array are likely also spatially close. This leads to more coherent memory accesses, which is a good thing, since the search is computationally cheap but severely limited by memory access performance.

The slow memory accesses also explain the huge slowdowns at large values of k. There is not enough (fast) on-chip memory to store the neighbor heaps, so they have to be stored in (slow) GPU RAM. Inserting new neighbors then becomes expensive, as data often has to be moved around in the heap structure. So as a conclusion, I think I can say: if you need few nearest neighbors, use your GPU; if you need many nearest neighbors or have some spare CPU cores to parallelize the search, use the CPU.
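
To illustrate why the cost per neighbor grows with k: each query maintains a bounded result set organized as a max-heap, where the root holds the current worst of the k best candidates. Below is a simplified, CPU-style sketch of such a heap (the actual GPU code is organized differently): every improving candidate replaces the root and sifts down, i.e. O(log k) scattered reads and writes, and when the heap lives in off-chip GPU RAM, that is exactly the data movement that hurts.

    #include <utility>

    // Bounded k-NN result set as a max-heap on the (squared) distance.
    // All dist[] entries are assumed to be initialized to FLT_MAX before the search.
    struct ResultHeap
    {
        int*   idx;    // k candidate indices
        float* dist;   // k candidate distances; dist[0] is the current worst
        int    k;

        float worst () const { return dist[0]; }

        void insert (int index, float d)
        {
            if (d >= dist[0])
                return;                      // not better than the current worst
            dist[0] = d;
            idx[0]  = index;
            int i = 0;                       // sift the new entry down
            for (;;)
            {
                int child = 2 * i + 1;
                if (child >= k) break;
                if (child + 1 < k && dist[child + 1] > dist[child]) ++child;
                if (dist[child] <= dist[i]) break;
                std::swap (dist[i], dist[child]);
                std::swap (idx[i],  idx[child]);
                i = child;
            }
        }
    };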

If you test the code, I would really appreciate it if you tell me about the results, along with your system specs (GPU+CPU)! Of course, if you have any trouble building it, you can also contact me ;-) I would be especially interested in any benchmarks on Fermi-series cards, to see if the cache found on these cards improves performance.

Some more benchmarks will follow, for example for the approximate NN search and with other data sets.

More FLANN progress
Sunday, July 03, 2011

Last week, I didn’t have that much time for the GSoC work. But at least I finished the 1NN and kNN searches, integrated them into FLANN and uploaded the code to svn.pointclouds.org/flann/amuetzel/trunk/flann. To use it, first build the modified flann library you can find there.

The only difference from using the normal FLANN KDTreeSingleIndex is to create the index like this:

flann::Index<flann::L2<float> > flannindex( dataset, flann::KDTreeCudaLowdimIndexParams() );

If you do this, you will likely get an exception about an unknown index type as soon as you run the code. To solve this, #define FLANN_USE_CUDA before including flann.hpp and link libflann_cuda into the executable. The CUDA index types are enabled this way so that you don't have to link that library when no CUDA index is going to be used.

If you want to try the code, there is one other thing to keep in mind: Right now, it only works with the flann::L2<float> distance on 3D vectors.
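
Putting the pieces together, a minimal usage sketch could look like the following. It simply follows the description above; the exact class names and parameters in svn may still change, so treat it as an illustration rather than a reference.

    #define FLANN_USE_CUDA               // enable the CUDA index types
    #include <flann/flann.hpp>
    #include <vector>

    int main ()
    {
        const size_t n = 100000, d = 3, k = 8;
        std::vector<float> points (n * d);    // fill with your 3D points (x y z, x y z, ...)
        std::vector<float> queries (n * d);   // fill with your query points

        flann::Matrix<float> dataset (&points[0],  n, d);
        flann::Matrix<float> query   (&queries[0], n, d);

        // the only difference to the CPU KDTreeSingleIndex: the index params
        flann::Index<flann::L2<float> > index (dataset, flann::KDTreeCudaLowdimIndexParams ());
        index.buildIndex ();

        std::vector<int>   idx_buf (n * k);
        std::vector<float> dist_buf (n * k);
        flann::Matrix<int>   indices (&idx_buf[0],  n, k);
        flann::Matrix<float> dists   (&dist_buf[0], n, k);
        index.knnSearch (query, indices, dists, k, flann::SearchParams ());
        return 0;
    }
    // ...and remember to link the executable against libflann_cuda.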

In the next days, I will start working on four things, likely in this order: restructuring the GPU kernel code to make it easier to adapt to other distance types, porting the minimal kd tree to the FLANN API, making the interface work with 2D and 4D vectors, and finally creating a benchmark with real-world data sets. I hope to be able to reuse some parts of Nick's benchmark code for that, as he is doing the comparison of the NN search libraries at the moment.

Next Experiments
Saturday, June 25, 2011

Over the last days, I started working on translating the FLANN KDTreeSingleIndex to a GPU version. FLANN's implementation of the kd tree has two main advantages over mine: first, the worst-case performance when the query points lie far away from the closest search points is much better; second, it can store multiple points in a single leaf node, which is not possible with the minimal tree representation.

I hope that both points will be beneficial in the GPU implementation. In the last days, I worked on the tricky problem of translating the search algorithm to a non-recursive version, along with the necessary changes in the data structure. Right now, I am porting this to CUDA and will report on the performance as soon as it is done.
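
For reference, one common way to make such a traversal non-recursive is to replace the recursion into the "far" child with a small explicit stack of (node, lower bound) pairs; on the GPU, that stack would live in shared memory. The following is only a CPU-side sketch with a made-up node layout, not the actual port:

    #include <cfloat>

    // Hypothetical node layout: inner nodes split along one axis (child1 covers
    // the values below split_value), leaves store a contiguous range of point indices.
    struct Node
    {
        int   split_axis;              // -1 marks a leaf
        float split_value;
        int   child1, child2;          // inner node: child indices
        int   point_begin, point_end;  // leaf: range of points (xyz interleaved elsewhere)
    };

    struct StackEntry { int node; float mindist; };

    // Exact 1-NN search without recursion; best/bestDist must be initialized to
    // -1 / FLT_MAX by the caller. The stack is sized for the tree depth; a real
    // implementation would guard against overflow.
    void searchNonRecursive (const Node* nodes, const float* points,
                             const float q[3], int& best, float& bestDist)
    {
        StackEntry stack[64];
        int top = 0;
        int cur = 0;                   // start at the root
        for (;;)
        {
            const Node& n = nodes[cur];
            if (n.split_axis < 0)
            {
                // leaf: linear scan over the points stored in it
                for (int i = n.point_begin; i < n.point_end; ++i)
                {
                    float d = 0.0f;
                    for (int a = 0; a < 3; ++a)
                    {
                        float t = points[3 * i + a] - q[a];
                        d += t * t;
                    }
                    if (d < bestDist) { bestDist = d; best = i; }
                }
                // pop the next subtree that can still contain a closer point
                do
                {
                    if (top == 0) return;
                    --top;
                } while (stack[top].mindist >= bestDist);
                cur = stack[top].node;
            }
            else
            {
                float diff = q[n.split_axis] - n.split_value;
                int near = (diff < 0.0f) ? n.child1 : n.child2;
                int far  = (diff < 0.0f) ? n.child2 : n.child1;
                // remember the far child with a conservative lower bound on its
                // distance: the squared distance to the splitting plane
                stack[top].node    = far;
                stack[top].mindist = diff * diff;
                ++top;
                cur = near;
            }
        }
    }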

First Code Uploaded
Thursday, June 16, 2011

I finally cleaned up the code a lot and restructured it so that it can be used in regular cpp files that are not processed by nvcc. (It’s not integrated into PCL, but will go straight into FLANN.)

If you want to try it, even though I wouldn't consider it production-ready: the code is located at http://dl.dropbox.com/u/32615544/kdtree.zip until it is integrated into FLANN. The interface is somewhat similar to FLANN's, but it can't be used as a drop-in replacement in its current form. You will likely need to change one include path in the CMakeLists.txt, as it points to a directory in my own home directory; apart from that, it should work out of the box.

Any benchmark results would be welcome, together with information about your GPU and CPU! (I usually execute the tests with 1M points.)

GPU vs CPU Benchmarking Results
Monday, June 13, 2011

In the last days, I finally figured out why Shawn Brown's CUDA code (http://www.cs.unc.edu/~shawndb/) crashed on my PC (it was actually my fault), so here are some benchmarking results! Four implementations were tested: FLANN (KDTREE_SINGLE_INDEX), the existing CUDA code, my CUDA code and my CPU fallback. The tests were done on a Phenom II 955 CPU with 12 GB of RAM and a GTX 260 GPU.

The main difference between the original CUDA code and mine is that I don't use a stack for tree traversal; instead, when moving back up the tree, I re-evaluate the decision about which branch to follow. So in the tables below, CUDA means Shawn Brown's code and Stackless CUDA is mine. Both use minimal trees with rotating split axes, while FLANN tries to select the optimal split axes and has to store extra information in the nodes.
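
To make the stackless idea concrete: with the minimal tree, the children, the parent and the split axis of a node all follow from its index and depth, so the traversal can back-track by walking up to the parent and re-evaluating on which side of the split the query lies. Below is a simplified CPU-side sketch of this scheme for an implicit, left-balanced tree in heap order (children at 2i+1/2i+2, split axis cycling with the depth); the real CUDA kernel differs in the details.

    #include <cfloat>

    struct Point3 { float v[3]; };

    static inline float dist2 (const Point3& a, const Point3& b)
    {
        float d = 0.0f;
        for (int k = 0; k < 3; ++k) { float t = a.v[k] - b.v[k]; d += t * t; }
        return d;
    }

    // Exact 1-NN on a minimal left-balanced kd tree stored in heap order:
    // node i holds pts[i]; returns the index of the nearest point, or -1 if n == 0.
    int nearestStackless (const Point3* pts, int n, const Point3& q)
    {
        if (n <= 0) return -1;
        int   best = -1;
        float bestDist = FLT_MAX;

        int cur = 0, prev = -1, depth = 0;   // prev == parent(cur) means "entering from above"
        while (cur >= 0)
        {
            const int   parent = (cur == 0) ? -1 : (cur - 1) / 2;
            const int   axis   = depth % 3;
            const float diff   = q.v[axis] - pts[cur].v[axis];
            const int   near   = (diff < 0.0f) ? 2 * cur + 1 : 2 * cur + 2;
            const int   far    = (diff < 0.0f) ? 2 * cur + 2 : 2 * cur + 1;

            int next;
            if (prev == parent)
            {
                // first visit: test this node's point, then descend to the near side
                const float d = dist2 (q, pts[cur]);
                if (d < bestDist) { bestDist = d; best = cur; }
                next = (near < n) ? near
                     : (far < n && diff * diff < bestDist) ? far
                     : parent;
            }
            else if (prev == near)
            {
                // back from the near side: only visit the far side if it can still
                // contain a closer point than the current best
                next = (far < n && diff * diff < bestDist) ? far : parent;
            }
            else
            {
                next = parent;               // back from the far side: subtree finished
            }

            depth += (next == parent) ? -1 : 1;
            prev = cur;
            cur  = next;
        }
        return best;
    }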

I used two artificial test cases. Both consist of randomly distributed 3D points, but the distribution differs between the tests.

Test 1: In the first test, all points have x, y and z coordinates in the range [0,1[. In all tests, the number of query and search points is the same, but the points themselves are different. (All times are in seconds.)

Algorithm         Build time (10k points)   1-NN       16-NN      Build time (100k points)   1-NN       16-NN
FLANN             0.003662                  0.008056   0.045166   0.048583                    0.141357   0.667724
CUDA              0.011676                  0.001118   0.014650   0.143819                    0.014667   0.115384
Stackless CUDA    0.057128                  0.002929   0.013916   0.102539                    0.010742   0.152343
CPU Fallback      0.057128                  0.011230   n/a        0.102539                    0.138671   n/a

Algorithm         Build time (1M points)    1-NN       16-NN      Build time (4M points)      1-NN       16-NN
FLANN             0.681396                  2.213623   8.890380   3.179687                    10.11523   37.945800
CUDA              1.651866                  0.133243   1.100975   7.380264                    0.547088   OOM
Stackless CUDA    0.654541                  0.107910   1.694580   3.307128                    0.468261   OOM
CPU Fallback      0.654541                  1.908203   n/a        3.307128                    9.09375    n/a

All GPU search times include the transfer times, and the build time of the last two algorithms is always the same, as they work on the same tree. (This build time also includes the transfer of the tree to the GPU.)

The 16-NN test could not be performed with my CPU fallback because it is not implemented yet. The CUDA implementations failed on the largest data set for a simple reason: my GTX 260 has 768 MB of RAM, but the buffers for storing the tree and returning the results would need more than 500 MB. The result: std::bad_alloc on allocation of the thrust::device_vector.
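
As a rough sanity check (assuming 4 bytes each for a returned neighbor index and its distance): the 16-NN result buffers for 4M queries alone already need 4,000,000 * 16 * 8 bytes = 512 MB, before counting the tree itself, so the allocation has no chance of fitting into the remaining memory of a 768 MB card.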

Test 2: Here, the x coordinate of the search points was limited to [0,0.1[; everything else remained the same. About 90% of the query points now lie outside the range of the search points.

Algorithm         Build time (10k points)   1-NN       16-NN      Build time (100k points)   1-NN       16-NN
FLANN             0.003906                  0.019042   0.077636   0.050292                    0.348632   1.320312
CUDA              0.011676                  0.013418   0.053049   0.148135                    0.365875   1.450399
Stackless CUDA    0.068359                  0.008300   0.024902   0.093261                    0.213378   0.470214
CPU Fallback      0.068359                  0.081054   n/a        0.093261                    2.822753   n/a

Algorithm         Build time (1M points)    1-NN       16-NN
FLANN             0.705566                  7.083007   25.00878
CUDA              1676.332                  timeout    OOM
Stackless CUDA    0.655761                  11.65576   OOM
CPU Fallback      0.655761                  147.4165   n/a

In this test, FLANN was noticeably slower than in the first test, but the CUDA results for the 1M point test were even slower than FLANN, and the 4M point test always timed out. Since CUDA usually kills GPU kernels that run longer than about 4 seconds, some of the tests could not be completed.

In conclusion, for "uniform" data sets the GPU implementation is faster than FLANN by a factor of about 20, but for skewed data sets like the one in the second test, the performance drops significantly. Storing explicit information about the splits might bring a speed gain on the GPU as well, even though it might also hurt performance due to the additional memory accesses, so I think it would be interesting to implement and benchmark that too. But first, my next step is to test a more "real-world"-like case. This could be done by taking two Kinect frames and searching for the neighbors of frame 1 in frame 2, which approximates the searches performed in the ICP algorithm.

For this, I think I'll have to do some restructuring of my code first. At the moment, it requires that the calling code is also compiled via nvcc, which doesn't work for some PCL code, especially files that include Eigen headers; somehow nvcc doesn't like template-heavy code... As soon as this is done, I'll publish the code somewhere, which should happen by the end of the week. I hope I can start integrating it into FLANN by then, too.

More GPU Programming
Friday, June 10, 2011

Over the last days, I continued working on the kd tree. I spent quite a few hours on a problem where the 1NN GPU code would find a slightly different point than the CPU implementation, and finally traced it down to floating point imprecisions...

Additionally, the kNN search works now. All of this is still a naive implementation, but it can already be faster than FLANN in some cases. Benchmarks will hopefully follow this week, next week otherwise.

Discussing Things and Learning CUDA
Tuesday, June 07, 2011

It's been a busy week. So far, I've been discussing PCL and FLANN API changes for the GPU NN search on the mailing list and with Marius, and I've started getting familiar with CUDA and Thrust. Additionally, I got the existing CUDA NN search to compile after getting help from Shawn, but it still produces runtime errors on my PC...

I also just ported my toy 3D OpenCL kd tree to CUDA. Nice to see that the "porting" was basically a copy&paste job, and it runs about 30-50% faster than before on the same graphics card. Good to see that it's faster than the OpenCL version, but also kind of sad, since it won't work on AMD graphics cards.

Start of the Second Week
Tuesday, May 31, 2011

The good news: CUDA and PCL are working now on my ArchLinux PC, so no Ubuntu for me :-) The way it works is kind of hackish: I compile the CPU code with GCC 4.6, but pass the option -DCUDA_NVCC_FLAGS="--compiler-bindir=/opt/gcc-4.4" to cmake, so that nvcc uses GCC 4.4 for the CPU parts of the .cu files. At least it works...

The bad news: the GPU implementation from http://www.cs.unc.edu/~shawndb/ doesn't build on my PC. There are missing header files in the project, as well as nvcc errors such as "error: attribute "global" does not apply here". Seems like I have to skip the benchmarking and start directly with my own implementation, without learning anything from that code...

In the meantime, I’m figuring out the optimal NN search API to allow for GPU and CPU searches. See the mailing lists ;-)

CUDA Madness
Friday, May 27, 2011

I seriously started wondering why Linux people even look at CUDA... I'm trying to install it right now, but it seems like only horribly outdated distros with GCC < 4.5 are supported. (And in turn, most of those aren't really supported any more.) I hope I'll get GCC 4.4 running on ArchLinux, otherwise I'll have to switch to an unsupported Fedora. Or is OpenSUSE 11.2 still supported?

Update: GCC 4.4 is running now, but I still get tons of errors in the PCL CUDA code. Never thought that this would be such a dependency nightmare :-( Seems like I have to try Ubuntu 10.04 then...

Late Start
Thursday, May 26, 2011

Today I finally returned to Germany and started with my GSoC work. I’m one of the GPU NN guys and, of course, developing on Linux. I already compiled PCL on my laptop, but currently I’m working on getting all the stuff (incl. CUDA) to run on my desktop PC, since the laptop doesn’t have a Nvidia graphics card.

In the meantime, I’m having a look at the FLANN KD-Tree implementation and the CUDA one mentioned in my roadmap and already encountered the first problem: the CUDA implementation was developed with Visual Studio, which means: no Linux Makefile, meh...

A first observation from comparing my own kd tree implementation with FLANN: I use a left-balanced median split scheme that stores the points of the tree without any overhead and cycles through the split axes. With randomly generated tree and query points in the 3D unit cube, my CPU implementation is slightly faster than FLANN, and the OpenCL version I wrote a few months ago is about 8 times faster than FLANN (GTX 260 vs. Phenom II 955).

But with configurations where, for example, the x coordinate of all query points is restricted to [0,0.5] while the y coordinate of all tree points can only take values in [0,0.25], the performance degrades badly: my CPU implementation takes about 30 times longer, and so does the OpenCL one, while FLANN's search time does not change. So I guess the flexibility of choosing optimal split planes is worth the overhead of storing additional info, which is what FLANN does, if I read the code correctly.
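
For reference, the "no overhead" property of that layout comes from the fact that everything about a node follows from its array index and depth. One common convention (the actual code may index differently) looks like this:

    // Left-balanced kd tree stored as a flat array of points, in heap order.
    inline int leftChild  (int i)     { return 2 * i + 1; }
    inline int rightChild (int i)     { return 2 * i + 2; }
    inline int parentOf   (int i)     { return (i - 1) / 2; }
    inline int splitAxis  (int depth) { return depth % 3; }   // cycle through x, y, z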

In the next days, I will focus on the first milestone, the definition of a suitable interface for GPU KD-Tree search, should the current FLANN interface not prove sufficient.