Honda Research Institute is sponsoring the development of 3D technologies for driver-assistance systems in PCL.
For a complete list of all present and past PCL code sprints, please visit http://www.pointclouds.org/blog.
Click on any of the links below to find out more about our team of PCL developers that are participating in the sprint:
The Honda Research Code Sprint for ground segmentation from stereo has been completed. PCL now includes tools for generating disparity images and point clouds from stereo data, courtesy of Federico Tombari, as well as tools that I contributed for segmenting a ground surface from such point clouds. Attached is a report detailing the additions to PCL and the results, as well as a video overview of the project. A demo is available in the trunk apps as pcl_stereo_ground_segmentation.
This week I committed the code to perform face detection in PCL trunk and wrote the final report summarizing the work done as well as how to use the module.
Over the last few months I have been working on a new meta-global descriptor called OUR-CVFH (http://rd.springer.com/chapter/10.1007/978-3-642-32717-9_12) which, as you can imagine, is an extension of CVFH, which is in turn an extension of VFH (I am not very original at naming things). I have also committed some tools and pipelines to pcl/apps/3d_rec_framework (still unstable and not very well documented).
Tomorrow we are having a TV demo in the lab where we will show recent work on recognition/classification and grasping of unknown objects. So, as usually happens, I had to finish some things for it, and I would like to show how OUR-CVFH makes it possible to do scale-invariant recognition and estimate the 6DOF pose plus scale. The training objects are in this case downloaded from 3d-net.org (unit scale, whatever the unit is) and they usually do not fit the test objects accurately.
Apart from this, I have also extended OUR-CVFH to use color information and integrate it into the histogram. Basically, the reference frame obtained in OUR-CVFH is used to create color distributions that depend on the spatial distribution of the points. To test the extension, I ran some evaluations on the Willow Garage ICRA 11 Challenge dataset and obtained excellent results (about 99% precision and recall). The training dataset is composed of 35 objects and the test set of 40 sequences totalling 435 object instances. A 3D recognition pipeline based on SIFT (keypoints projected to 3D) obtains about 70% on this dataset (even though the objects are textured most of the time). Combining SIFT with SHOT and merging the hypotheses gets about 84%, and the most recent paper on this dataset (Tang et al., ICRA 2012) obtains about 90% recall at 99% precision. If you are not familiar with the dataset, here are some screenshots with the respective recognition and pose estimation results overlaid.
The color extension to OUR-CVFH and the hypotheses verification stage are not yet in PCL, but I hope to commit them as soon as possible, probably after the ICRA deadline and before ECCV. You can find the Willow ICRA challenge test dataset in PCD format at http://svn.pointclouds.org/data/ICRA_willow_challenge.
I am back from “holidays”, conferences, etc. and today I started dealing with some of the concerns I pointed out in the last email, mainly the memory footprint required for training. The easiest way to deal with that is to do bagging on each tree, so that the training samples used by a tree are loaded before its training starts and discarded once that tree is trained. I implemented this by adding an abstract DataProvider class to the random forest implementation, which is specialized depending on the problem. Then, when a tree is trained and a data provider is available, the tree requests training data from the provider, trains and discards the samples.
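As a rough illustration of the per-tree data provider idea (this is not the PCL code: the class and function names below are hypothetical, and a one-threshold decision stump stands in for real tree training), a minimal Python sketch might look like this:

```python
import random
from abc import ABC, abstractmethod


class DataProvider(ABC):
    """Supplies the training samples for one tree on demand."""

    @abstractmethod
    def get_samples(self, tree_index):
        """Return a list of (feature, label) samples for the given tree."""


class RandomSubsetProvider(DataProvider):
    """Hypothetical provider: draws a random bag of sample ids and loads them."""

    def __init__(self, sample_ids, bag_size, load_fn, seed=0):
        self.sample_ids = list(sample_ids)
        self.bag_size = bag_size
        self.load_fn = load_fn  # e.g. reads one sample from disk
        self.rng = random.Random(seed)

    def get_samples(self, tree_index):
        bag = self.rng.sample(self.sample_ids, self.bag_size)
        return [self.load_fn(i) for i in bag]


def train_stump(samples):
    """Stand-in for tree training: pick the threshold with the fewest errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in samples:
        errors = sum((x >= threshold) != label for x, label in samples)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def train_forest(provider, n_trees):
    forest = []
    for t in range(n_trees):
        samples = provider.get_samples(t)  # loaded only for this tree
        forest.append(train_stump(samples))
        del samples  # released before the next tree is trained
    return forest
```

The point of the design is that only one bag of samples is alive at any time, so peak memory is bounded by the bag size rather than by the size of the whole training set.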
I also realized that most of the training data I have for faces contains a lot of NaNs, except for the parts containing the face itself and other parts of the body (which are usually located in the center of the image). So, to further reduce the data in memory, the specialization of the data provider crops the Kinect frames, discarding regions with only NaN values.
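One simple way to do such a crop, assuming the frame is cropped to the bounding box of its valid (non-NaN) depth values, is sketched below. The function name and the list-of-rows frame representation are assumptions made for illustration, not the actual implementation.

```python
import math


def crop_valid_region(depth):
    """Crop a 2D depth frame (list of rows) to the bounding box of non-NaN values.

    Returns (cropped_frame, (row_offset, col_offset)), or (None, None) when
    the frame contains no valid measurement at all.
    """
    rows = [r for r, row in enumerate(depth)
            if any(not math.isnan(v) for v in row)]
    if not rows:
        return None, None
    cols = [c for c in range(len(depth[0]))
            if any(not math.isnan(row[c]) for row in depth)]
    r0, r1 = rows[0], rows[-1]
    c0, c1 = cols[0], cols[-1]
    cropped = [row[c0:c1 + 1] for row in depth[r0:r1 + 1]]
    return cropped, (r0, c0)
```

The returned offset lets patch coordinates in the cropped frame be mapped back to the original image if needed.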
With these two simple tricks, I am able to train each tree in the forest with 2000 random training samples (10 positive and 10 negative patches are extracted from each sample) requiring only 3GB of RAM. If more training data is needed or the training samples become bigger, one might use a similar trick to design an out-of-core implementation where the data is requested not at tree level but at node level and only indices are kept in memory.
I also found some silly bugs and now I am retraining... let’s see what comes out.
I have continued working on the face detection method and added the pose estimation part, including the clustering step mentioned in my last post. See the video for some results from our implementation (it is a bit slow at the beginning due to the video recording software, then it gets better).
I fixed several bugs lately and, even though the results are starting to look pretty good, I am not yet completely satisfied. First, I was facing some problems during training regarding what to do with patches where the features are invalid (division by zero). I ended up using a tree with three branches, which worked better, although I am not yet sure which classification measure should be used in that case (working on that). The other open items are the use of normal features, which can be computed very fast on organized data with newer PCL versions, and a modification to the way the random forest is trained. Right now it requires all training data to be available in memory (one integral image for each training frame, or even four of them if normals are used). This ends up taking a lot of RAM and restricts the amount of training data that can be used to train the forest.
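To make the three-branch idea concrete, here is a small sketch. It assumes a feature defined as the difference between the mean depths of two rectangles, which becomes invalid when a rectangle has no valid pixels. That feature definition and both function names are assumptions for illustration only.

```python
def mean_depth_difference(sum_a, count_a, sum_b, count_b):
    """Difference of mean depths of two rectangles; None if either is empty."""
    if count_a == 0 or count_b == 0:
        return None  # would be a division by zero
    return sum_a / count_a - sum_b / count_b


def route(feature, threshold):
    """Three-way node test: invalid features get a branch of their own."""
    if feature is None:
        return "invalid"
    return "left" if feature < threshold else "right"
```

With a dedicated branch, patches with invalid features are neither dropped nor forced into one side of a binary split.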
Hi again!
It has been some time since my last post. I was on vacation for some days, then sick, and afterwards busy catching up after the inactivity. Anyway, I have resumed work on head detection + pose estimation, reimplementing the approach from Fanelli at ETH. I implemented the regression part of the approach so that the trees provide information about the head location and orientation, and made some improvements to the previous code. I used the purity criterion to activate regression, which seemed the most straightforward option.
The red spheres show the predicted head location after filtering out sliding windows that reach leaves with high variance and are therefore not accurate. As you can see, there are several red spheres at non-head locations. Nevertheless, the approach relies on a final bottom-up clustering to isolate the different heads in the image. The size of the clusters makes it possible to threshold head detections and eventually remove outliers.
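The clustering and size-thresholding step can be sketched as follows. This is a deliberately simplified greedy clustering, not the method from the original approach, and the function names, radius and minimum vote count are illustrative assumptions.

```python
import math


def cluster_votes(votes, radius):
    """Greedy clustering: a vote joins the first cluster whose centroid is within radius."""
    clusters = []  # each cluster is a list of 3D points
    for v in votes:
        for members in clusters:
            centroid = [sum(axis) / len(members) for axis in zip(*members)]
            if math.dist(v, centroid) <= radius:
                members.append(v)
                break
        else:
            clusters.append([v])  # no cluster was close enough
    return clusters


def detect_heads(votes, radius, min_votes):
    """Keep only clusters with enough votes and return their centroids."""
    heads = []
    for members in cluster_votes(votes, radius):
        if len(members) >= min_votes:
            heads.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return heads
```

Isolated votes at non-head locations end up in small clusters and are dropped by the `min_votes` threshold.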
I hope to commit a working (and complete) version quite soon together with lots of other stuff regarding object recognition.
I recently received stereo data and disparity maps to work with for this project, so I wrote a tool to convert the disparity maps to PCD files. The provided disparity data has been smoothed somewhat, which I think might be problematic for our application. For this reason, I also produced disparities using OpenCV’s semi-global block matching algorithm, which produces quite different results. You can see an example here:
Above is the left image of the input scene. Note the car in the foreground, the curb, and the more distant car on the left of the image.
Above is a top-down view of a point cloud generated by OpenCV’s semi-global block matching. The cars and curb are visible, though there is quite a bit of noise.
Above is an image using the provided disparities, which included some smoothing. The curb is no longer visible, and there is also an odd “ridge” in the ground plane starting at the front of the car. I think this will be problematic for ground plane segmentation. Both approaches seem to have advantages and disadvantages, so I’ll keep both sets of PCDs around for testing. Now that I have PCD files to work with, I’m looking forward to using them with my segmentation approach. Prior to using stereo data, I developed segmentation for use on Kinect data. I think the main challenge in applying this approach to stereo data will be dealing with the reduced point density and greatly increased noise. I’ll post more on this next time.
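For readers curious about the disparity-to-PCD conversion mentioned in this post, the sketch below shows the standard pinhole relation for a rectified stereo pair, Z = f * B / d, followed by an ASCII PCD writer. It is not the actual tool. The function names, the camera parameters (focal length in pixels, baseline in meters, principal point) and the minimum-disparity cutoff are all assumptions.

```python
def disparity_to_points(disparity, focal_px, baseline_m, cx, cy, min_disparity=0.5):
    """Back-project a disparity map (list of rows, in pixels) to XYZ points."""
    points = []
    for v, row in enumerate(disparity):
        for u, d in enumerate(row):
            if not d >= min_disparity:  # also skips NaN disparities
                continue
            z = focal_px * baseline_m / d  # depth from disparity
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            points.append((x, y, z))
    return points


def to_pcd_ascii(points):
    """Serialize XYZ points as an unorganized ASCII PCD (v0.7) string."""
    n = len(points)
    header = [
        "# .PCD v0.7 - Point Cloud Data file format",
        "VERSION 0.7",
        "FIELDS x y z",
        "SIZE 4 4 4",
        "TYPE F F F",
        "COUNT 1 1 1",
        f"WIDTH {n}",
        "HEIGHT 1",
        "VIEWPOINT 0 0 0 1 0 0 0",
        f"POINTS {n}",
        "DATA ascii",
    ]
    body = [f"{x:.6f} {y:.6f} {z:.6f}" for x, y, z in points]
    return "\n".join(header + body) + "\n"
```

Because depth is inversely proportional to disparity, small disparity errors on distant points translate into large depth errors, which is one source of the noise visible in the top-down views.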
Hi again!
Because of ICRA and the preparations for the conference I was quite inactive over the last couple of weeks. This week I had to finish some school projects and yesterday I resumed work on face detection. Because the BoW approach using SHOT was “slow”, I decided to try decision forests with features extracted directly from the depth image (similar to http://www.vision.ee.ethz.ch/~gfanelli/head_pose/head_forest.html). Although it is not yet finished, I am already able to generate face responses over the depth map and the results look like this:
Basically, the map accumulates at each pixel the number of sliding windows that include that pixel and have a probability of being a face higher than 0.9. This runs in real time thanks to the use of integral images for evaluating the features. For the machine learning part, I am using the decision forest implementation available in the ML module of PCL.
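The two ingredients described above can be sketched in a few lines of Python: an integral image that gives any rectangle sum in constant time, and a response map that counts confident windows per pixel. The `face_probability` callback stands in for the forest and is hypothetical, as are the function names and the window and stride parameters.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0.0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, r0, c0, r1, c1):
    """Sum over rows r0..r1-1 and columns c0..c1-1 using four lookups."""
    return ii[r1][c1] - ii[r0][c1] - ii[r1][c0] + ii[r0][c0]


def face_response_map(height, width, window, stride, face_probability, min_prob=0.9):
    """Count, for every pixel, the confident face windows that contain it."""
    response = [[0] * width for _ in range(height)]
    for r in range(0, height - window + 1, stride):
        for c in range(0, width - window + 1, stride):
            if face_probability(r, c) > min_prob:
                for rr in range(r, r + window):
                    for cc in range(c, c + window):
                        response[rr][cc] += 1
    return response
```

Rectangle features such as mean depth over a patch reduce to a few `box_sum` calls, which is what keeps the per-window cost low.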
Here are some results regarding 3D face detection:
To speed things up I switched to normal estimation based on integral images (20fps). I found and fixed some bugs in the OMP version of SHOT (the SHOT description was parallelized with OMP but the reference frame computation was not) and now I am getting almost 1fps for the whole recognition pipeline. I was not convinced by the results with RNN, so I decided to give k-means a try (no conclusion yet regarding this).
The code is far from perfect (there are a few hacks here and there) but the results are starting to look decent. Regarding further speed-ups, we should consider moving to the GPU; however, porting SHOT might not be the lightest task. The next steps would be stabilizing the code and making some commits, and then we will see whether we go for the GPU or move forward to pose detection and hypotheses verification.
Yesterday and today, I have been working on face detection and went ahead and implemented a Bag of Words approach. I will briefly summarize the steps I followed:
Right now, I am not doing any post-processing and am just visualizing the first 5 candidates (the sliding windows with the highest similarity). You can see some results in the following images. Red points indicate computed features (that are not found in the visualized candidates) and green spheres indicate those that vote for a face within the candidates list. The next step will be some optimizations to speed up detection. It currently takes between 3 and 6s depending on the amount of data to be processed (which depends on ignoring points far away from the camera and on downsampling).
This week I have been working on a publication, which took up most of my time. However, we found some time to chat with Federico about face detection. We decided to try, next week, a 3D feature classification approach based on a bag of words model. Such an approach should be able to deal gracefully with intraclass variations and deliver regions of interest with a high probability of containing faces, on which we can focus the more expensive recognition and pose estimation stage.
Hi again, first I need to correct myself regarding my last post, where I claimed that this week I would be working on face detection. The experiments and tests I was doing on CVFH ended up taking most of my time, but the good news is that I am getting good results and the changes increased the descriptiveness and robustness of the descriptor.
Mainly, a unique and repeatable coordinate frame is built at each CVFH smooth cluster of an object (I also slightly modified how the clusters are computed), enabling a spatial description of the object with respect to this coordinate frame. The other good news is that this coordinate frame is also repeatable under roll rotations and can thus replace the camera roll histogram, which in some situations was not accurate or discriminative enough, yielding several roll hypotheses that need to be further post-processed and inevitably slow down the recognition.
These are some results using the first 10 nearest neighbours, pose refinement with ICP and hypotheses verification using the greedy approach. The recognition time varies between 500ms and 1s per object, where approximately 70% of the time is spent on pose refinement. The training set contains 16 objects.
Here are a couple of scenes that skip the pose refinement stage. They show that the pose obtained by aligning the reference frame is accurate enough for the hypotheses verification to select a good hypothesis. In this case, the recognition time varies between 100ms and 300ms per object.
I am pretty enthusiastic about the modifications and believe that with some GPU optimizations (mainly regarding nearest neighbour searches for ICP and hypotheses verification) a real-time (or at least nearly real-time) implementation would be possible.
Regarding the local pipeline, I implemented a new training data source for registered views obtained with a depth device. In this case, the local pipeline can be used as usual without needing 3D meshes of the objects to recognize. The input is represented as PCD files (segmented views of the object) together with a transformation matrix that aligns each view to a common coordinate frame. This makes it easy to train objects in our environment (Kinect + calibration pattern) and allows the use of RGB/texture cues (if available from the sensor) that were not available with 3D meshes. The next image shows an example of a quick experiment where four objects were scanned from different viewpoints using a Kinect and placed into a scene with some clutter in order to be recognized.
The red points represent the overlaid model after being recognized using SHOT, geometric correspondence grouping, SVD, ICP and Papazov’s verification. The downside of not having a 3D mesh is that the results do not look so pretty :) Notice that such an input could also be used to train the global pipeline. Anyway, I will be doing a “massive” commit later next week with all these modifications. GPU optimizations will be postponed for a while, but help is welcome after the commits.
This last week I have continued working on the recognition framework, focusing on the global pipeline. The global pipelines require segmentation to hypothesize about objects in the scene. Each object is then encoded using a global feature (the ones currently available in PCL are VFH, CVFH, ESF, ...) and matched against a training set whose objects (their partial views) have been encoded using the same feature. The candidates obtained from the matching stage are post-processed with the Camera Roll Histogram (CRH) to obtain a full 6DOF pose. Finally, the pose can be refined and the best candidate selected by means of a hypotheses verification stage. I will also integrate Alex’s work on real-time segmentation and Euclidean clustering into the global pipeline (see http://www.pointclouds.org/news/new-object-segmentation-algorithms.html).
In summary, I committed the following things to PCL:
These are some results using CVFH, CRH, ICP and the greedy hypotheses verification:
I have also been playing a bit with CVFH to solve some mirror invariances and, in general, increase the descriptive power of the descriptor. The main challenge so far has been finding a semi-global unique and repeatable reference frame. I hope to finish this extension at the beginning of next week and be able to clean up the global pipeline so I can commit it. Regarding the main topic of the sprint, we will try some fast face detectors based on depth to efficiently retrieve regions of interest with a high probability of containing faces. Another interesting approach that we will definitely try can be found here: http://www.vision.ee.ethz.ch/~gfanelli/head_pose/head_forest.html
Hi again, I integrated the generation of training data for faces into the recognition framework and used the standard recognition pipeline based on SHOT features, Geometric Consistency grouping + RANSAC, SVD to estimate the 6DOF pose and the hypotheses verification from Papazov. The results are pretty cool and encouraging...
That’s me in the first image (should go to the hairdresser...) and the next image is Hannes, a colleague from our lab.
The CAD model used in this case was obtained from http://face.turbosquid.com/, which contains some free 3D meshes of faces. Observe that, despite the training geometry being slightly different from that of the recognized subjects, the model is aligned quite well to the actual face. Notice also the amount of noise in the first image.
I am having interesting conversations with Radu and Federico about how to proceed, so I will post a new entry soon with a concrete roadmap.