PCL Developers blog

Nick Vanbaelen

Welcome to my personal, official PCL Google Summer of Code page. I am excited to get working for PCL and you will find all project progress and specifics on this page!

email:nickon (at) acm.org
project:KdTree on GPU using CUDA
mentor:Marius Muja (Stéphane Magnenat [Radu B. Rusu, Michael Dixon])

About me

I’m a master’s student at the Catholic University of Leuven (KU Leuven) and my field of study is Civil Engineering: Computer Science, with a specialization in Artificial Intelligence. My interests in computer science range from theoretical aspects to software design. In particular I’m interested in distributed and parallel computing and in optimising software for the underlying hardware platform.

Besides computer science, my hobby is carp fishing. I’ve been doing this since I was a kid and it has grown to be a passion of mine. It’s a great way to relax and think about nothing other than trying to outsmart the one giant carp wandering around in the lake.


The main goal of my project is to first benchmark different NN libraries such as nnn, libnabo and FLANN (currently used in PCL) and to integrate the best parts of each into FLANN. When this integration is done I will explore possibilities to optimize FLANN for multi-core architectures, exploiting the maximum amount of parallelism possible on each underlying hardware platform. I will be working closely with Marius Muja, who is the mentor for this project and also the developer of the FLANN library.

In the meantime I will also be working together with my GSoC colleague Andreas, who will be starting the development of a NN search on the GPU. The aim is for me to be involved in the design decisions and in the integration of this NN GPU implementation into FLANN, so that towards the end of the GSoC period (or, if the timespan won’t allow it, after the GSoC period) I can help develop this GPU implementation.

You can find the more detailed roadmap of my GSoC project by clicking here.

Recent status updates

Multithreaded search available in FLANN
Friday, August 26, 2011

Today Marius has applied the patch for multithreaded search to the git head of FLANN. So, multithreaded search is available now! Give it a try! If you encounter any problems, contact the mailing list and we will try to resolve them as quickly as possible :)

This will also be my last blogpost as this is the end of GSoC. I would like to thank all PCL developers (especially Marius, my mentor) for their help during the summer and for the opportunity they gave me by accepting me for GSoC 2011! It was a blast and you’ll probably see me around!

Wrapping up
Thursday, August 18, 2011

Ah, long time no see! I’ve been busy with the redo of my exam (business skills, a broad view on economics; wasn’t really my cup of tea, but I think I passed it this time, thumbs up). Back to GSoC then! Currently I’m wrapping up for the final submission to the Google repos and for the merge with the git head of FLANN: doxygen comments, unit tests using GTest, and a paragraph in the FLANN manual on how to use the multithreaded search.

I’ll send out another blog post when this is finished and the multithreaded search is merged successfully with the git head of FLANN!

Multithreaded FLANN
Friday, August 12, 2011

It’s been a while since I’ve posted a blog update, but I haven’t been sitting around lately. Since my last blogpost I completed the multithreading of FLANN by making use of the Intel Threading Building Blocks library. This means that all knn- and radiusSearches now make use of multiple cores when the user requests it. The parallelism can be exploited when multiple query points are passed through the knnSearch or radiusSearch interface and the number of cores is specified in the SearchParams struct.
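Conceptually, the dispatch works by splitting the batch of query points over a pool of worker threads, since each query is independent. Here is a minimal self-contained sketch of that scheme using std::thread instead of TBB; the brute-force 1-D `nearest` function and all names are placeholders, not actual FLANN code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for FLANN's SearchParams: the user requests a core count.
struct SearchParams { int cores = 1; };

// Brute-force 1-NN over a flat set of 1-D points (stand-in for a kd-tree query).
static std::size_t nearest(const std::vector<float>& data, float q) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < data.size(); ++i)
        if (std::fabs(data[i] - q) < std::fabs(data[best] - q)) best = i;
    return best;
}

// Split the query set into contiguous chunks, one per worker thread.
std::vector<std::size_t> knnSearch(const std::vector<float>& data,
                                   const std::vector<float>& queries,
                                   const SearchParams& params) {
    std::vector<std::size_t> result(queries.size());
    const int cores = std::max(1, params.cores);
    const std::size_t chunk = (queries.size() + cores - 1) / cores;
    std::vector<std::thread> workers;
    for (int c = 0; c < cores; ++c) {
        const std::size_t begin = std::min(queries.size(), c * chunk);
        const std::size_t end   = std::min(queries.size(), begin + chunk);
        workers.emplace_back([&, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                result[i] = nearest(data, queries[i]);
        });
    }
    for (auto& w : workers) w.join();
    return result;
}
```

In FLANN itself the per-query work is of course a kd-tree traversal, and TBB takes care of the chunking and thread pool; the point here is only that the query points are independent and can be partitioned freely over cores.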

Because using Intel TBB introduces some overhead (function objects, spawning threads(!), ...) over the normal single-core FLANN implementation, it is expected to pay off only when enough work is done by the spawned worker threads. How is work increased? Of course, as the k or radius parameter increases and/or more query points are passed through the interface, more work is introduced, and at a particular point we start benefiting from the multithreaded version. Cf. the GPU implementation of FLANN (Andreas).

Multithreading results

As opposed to the GPU implementation, multithreading starts to pay off at much smaller workloads. Depending on the k/radius parameter, multithreaded search can already start paying off at as few as 100 query points. You can have a look at the benchmarks I did on my i5 450M processor (2 physical cores with Hyper-Threading, i.e. 4 logical cores): FLANN TBB payoff comparison. This comparison is actually for knnSearch, but I also ran the benchmarks for radiusSearch and they are exactly the same. This of course makes sense, as the radius also just acts as a “work increasing” parameter.

We can see that a fairly consistent speedup of 2x can be achieved once we cross the payoff barrier. I’ve also made a benchmark comparing traditional FLANN with FLANN using 2 cores and FLANN using 4 cores: FLANN number of cores comparison. It is interesting to see that doubling the number of cores doesn’t always halve the search time, as we might all intuitively think. Going from 2 to 4 cores yields an extra gain in search time, but not as significant a speedup as going from 1 to 2 cores.

Future steps

Currently I’m wrapping up for the GSoC period, meaning that I’m cleaning code and integrating it nicely into FLANN, writing documentation and unit tests for it and so on. Once this is done the deadline of GSoC will probably be reached.

Things I want to perfect / examine later on include having a look at how the caching mechanisms in Intel TBB could increase the performance of multithreaded FLANN even further. For example, a first thing to look at is, when building the kd-tree, the effect of allocating each node (or the points in that node) on dedicated cache lines instead of using the standard malloc procedure, which offers no such guarantee. This way we might already see an improvement in search time because the tree fits into, and makes more effective use of, the cache.
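As an illustration of the idea, C++ lets you request cache-line-aligned nodes directly; the `Node` layout below is hypothetical and not FLANN’s actual node structure:

```cpp
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical kd-tree node, padded by alignas so that one node occupies
// exactly one 64-byte cache line (64 bytes is the line size on most x86 CPUs).
struct alignas(64) Node {
    float split_value;
    int   split_dim;
    Node* left;
    Node* right;
};

// Allocate a node on its own cache line. std::aligned_alloc (C++17) requires
// the size to be a multiple of the alignment, which the alignas(64) padding
// already guarantees.
inline Node* allocate_node() {
    return static_cast<Node*>(std::aligned_alloc(64, sizeof(Node)));
}

inline void free_node(Node* n) { std::free(n); }
```

A plain malloc only guarantees alignment suitable for any scalar type (typically 16 bytes), so two nodes can straddle a cache line boundary; the aligned variant makes sure a node never pulls in two lines per access.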

Another more challenging but logical extension of this first optimization is to take into account which query points were assigned to which CPU, so that the warmed-up on-processor cache can be reused when searching through the tree. A potentially major speedup can be obtained if large portions of the kd-tree fit into the on-processor cache. This is detailed optimization work, but truly interesting stuff to look at in order to speed up FLANN even further.

Conclusions for integration after some more detailed benchmarking
Wednesday, July 27, 2011

Since I mentioned in my earlier blogpost that I would investigate the smaller build times of SC2 and Libnabo compared to those of FLANN, here is the conclusion on integration (and some other interesting stuff I picked up along the way):

The larger build times of FLANN

SplitCloud2 and Libnabo seemed to have smaller build times than FLANN, but this was due to using the PCL-wrapped FLANN version instead of using FLANN directly. The larger build time of the PCL-wrapped FLANN is due to a copy of the point cloud being made in the PCL kd-tree wrapper, which really should be avoided. You can see the difference in build time between wrapped and direct FLANN in these benchmarking figures (SplitCloud2 is also considered):


You can now see that it is in no way beneficial (in realistic scenarios) to choose SplitCloud2 over FLANN.

Equivalent kdtrees in libnabo and FLANN

What I would like to mention is that I did some refactoring of the benchmark framework in my trunk. For the libnabo and FLANN benchmarks I added a way to easily build kd-trees with the same leaf size (number of points in a leaf node), so that benchmarks can be created in which the two libraries always compete on equal footing. For the default leaf size currently used in PCL (=15), we can see that FLANN performs better in both measurements (build and search time):


For different meaningful leafsizes, FLANN keeps outperforming Libnabo, meaning that integrating Libnabo (or certain aspects of it) would not be beneficial in any way.

Optimal leafsize for kdtree

It also turns out that some more performance can be gained by changing the leaf size used for building the FLANN kd-tree. Changing the leaf size from 15 to 50 [1] improves the kd-tree build time by approx. 20%, while radius search times only incur an increase of approx. 5% and knn search times an increase of approx. 5-10%. This is a valuable insight that can be used to choose a leaf size that balances build and search times for its particular use in PCL. The results of these benchmarks are linked below for those interested:
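A rough way to see why a larger leaf size helps the build: the number of leaves (and thus the number of splitting planes computed during construction) drops roughly linearly with the leaf size, while a query only scans a few more candidate points per visited leaf. A back-of-the-envelope sketch, assuming perfectly balanced splits (which real trees only approximate):

```cpp
#include <cstddef>

// For a balanced kd-tree over n points that stops splitting once a node holds
// at most leaf_size points, estimate the number of leaves and the tree depth.
// Uniform halving is assumed; real trees deviate from this idealized model.
struct TreeShape { std::size_t leaves; std::size_t depth; };

inline TreeShape estimate(std::size_t n, std::size_t leaf_size) {
    TreeShape s{1, 0};
    while (n > leaf_size * s.leaves) {  // keep splitting until leaves fit
        s.leaves *= 2;
        ++s.depth;
    }
    return s;
}
```

With 100k points, going from leaf size 15 to 50 cuts the leaf count from 8192 to 2048 in this idealized model, which is consistent with the build getting noticeably cheaper while searches become only slightly more expensive.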




Multithreading of FLANN

Sorry for the late update on the final conclusion on integration for libnabo and SC2; this is because I have been context switching between multithreading development and interpreting benchmark results. Currently I’ve set up a test project that locates Intel TBB using pkg-config and a FindTBB.cmake file I wrote. This is a clean and basic project that I can easily expand while developing a multithreaded variant of the KDTreeSingleIndex in FLANN. I’m working on this as we speak. The goal is to highly optimise this index: when CMake detects that Intel TBB is installed on the system, it will maximally exploit the capabilities of the hardware platform.
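The detection logic in such a project looks roughly like this (a hedged sketch, not my actual FindTBB.cmake; the target name `flann_test` and the `HAVE_TBB` define are illustrative):

```cmake
# Optional TBB support: try pkg-config first, fall back to a FindTBB module.
find_package(PkgConfig QUIET)
if(PKG_CONFIG_FOUND)
  pkg_check_modules(TBB tbb)
endif()
if(NOT TBB_FOUND)
  find_package(TBB QUIET)  # uses the custom FindTBB.cmake on CMAKE_MODULE_PATH
endif()

if(TBB_FOUND)
  add_definitions(-DHAVE_TBB)  # compile the multithreaded search paths
  include_directories(${TBB_INCLUDE_DIRS})
  target_link_libraries(flann_test ${TBB_LIBRARIES})
endif()
```

Keeping TBB optional this way means the single-core code path still builds unchanged on systems without TBB installed.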

[1] I noted from the benchmarks that a leaf size of 50 provides a good balance between kd-tree build time improvement and search time degradation. Whether the current leaf size value should be changed is currently being discussed on the developer mailing list. Some more fine-grained benchmarks could be generated to reach a well-founded consensus.

All benchmarking results in one post!
Thursday, July 21, 2011

UPDATE: a debug build sneaked up on me and polluted the results... Hence this final benchmarking update. Now it’s a fair battle between the libraries and FLANN :)

The automated benchmarking framework is finished and you can all find it here: http://svn.pointclouds.org/flann/nickon/trunk/automated%20benchmark/. Please read the README file that is included and everything should be clear after that :) The included python scripts are able to draw graphs of the benchmarking results and bundle all of them into a single html page. Have a look at the scripts I used for benchmarking libnabo and nnn and you should be ready to easily create your own benchmark automation.


The results for libnabo can be found by clicking on this link: Libnabo vs. FLANN. Brute-force search times for the libnabo library are intentionally not plotted in the graphs: they would force a different y-axis scale, which would make it impossible to decently compare the kd-tree implementations of libnabo to the one in FLANN. The actual search times of the brute-force search can still be found in the results table above the respective graph.

The results indicate that FLANN is still the faster library for knn search, with respect to libnabo. For almost every point cloud FLANN ‘outperforms’ the libnabo library in terms of search times. Outperforming might be an overstatement here, because the search times are really close, but FLANN still performs a tad better than libnabo. When we look at the build times, libnabo seems to gain some ground. Of course the build time of the brute-force search is just a fraction of a second (the NearestNeighbourSearch object is merely initialized), but the kd-tree methods of libnabo are consistently built faster than the kd-tree index built by FLANN.

The better search time performance of FLANN could be due to a differently, and hence better, constructed kd-tree. Otherwise the faster search times can be attributed to other optimizations in FLANN, in which case we could benefit from building the tree the way it is done in the libnabo library.


The results for NNN can be found by clicking on this link: NNN vs. FLANN (realistic range). The same comment applies here. To not overload the graph, only the search times of the optimized versions of NNN, SplitCloud (SC) and SplitCloud2 (SC2) are plotted. The radius ranges considered in this ‘realistic’ scenario contain up to 15% of the entire point cloud.

We can see that for all datasets, FLANN performs better for smaller radii. This is because the kd-tree exploits the sparsity of the point cloud. Noticeable is that the SplitCloud2 method comes pretty near the search times of FLANN! We can also see that the build time of SplitCloud2 is always smaller than the build time of the FLANN kd-tree. Depending on the point cloud distribution it can even be half of FLANN’s build time.

To get a view of the behaviour of the naive methods on larger radii, I’ve also included a ‘zoomed out’ view of the graphs. You can find them over here: NNN vs. FLANN (full range). We notice that when we move to larger radii, the benefit of the more naive methods (especially NNN, which doesn’t use any data structure at all) shows [1]. However, these radii are not very realistic scenarios, so this is really just to show the trend of these methods.

So, to conclude the benchmarking of NNN: for realistic radius ranges, the kd-tree approach of FLANN performs better than any naive method. We must however note that the SplitCloud2 method is a good tradeoff, since its build time is noticeably smaller than that of FLANN. For few searches this might prove valuable.


Now, since phase 2 is all about the integration of the previously mentioned libraries, but they don’t introduce any real performance gains w.r.t. FLANN (except for the faster build times of libnabo, which I will investigate), I can immediately move forward to phase 3, the multithreading of FLANN. I already initiated this phase (sorry for the late blogpost on the libnabo results, my bad) and am currently looking into and evaluating Intel TBB for managing optimal thread spawning, cache coherency, ... for the FLANN library.

[1] I already mentioned that the larger the radii become, the less benefit the SplitCloud methods will have, because they will construct almost entirely overlapping point clouds.