# Code Sprints

Q: What is a code sprint?

A code sprint is an accelerated research and development program, where we pair talented developers with senior researchers and engineers for 3-6 months of extended programming work. The model is inspired by the Google Summer Of Code initiative, and is meant to create open source solutions for interesting 2D/3D computer perception problems. Each code sprint is financially supported by a different institution (see below).

Q: What does this blog represent?

The Point Cloud Library (PCL) developer blog represents a great way to keep track of daily updates in PCL sponsored code sprints. Our developers are writing about their experiences and progress updates while working on exciting PCL projects.

## Active Sprints

The following shows the current list of ongoing code sprints. Click on any of the logos below to go to the respective sprint blog page.

## Completed Sprints

The list of code sprints which have been completed is:

We would like to thank our financial sponsors for their generous support. For details on how to start a new code sprint or contribute to PCL, please visit http://www.pointclouds.org/about and our organization’s web site at http://www.openperception.org/get-involved/.

Rigid and Non-Rigid Transformation ( Second phase )
Thursday, July 17, 2014
• Introduction

In the previous phase, it was presented how to obtain a statistical model from a set of face-meshes. The next step in our project is to “match” the mean face of the database, with the face obtain from a 3D Camera scan, like the one in the picture below:

The matching is done by applying alternatively the following methods.

• Rigid Transformation

This method is very similar to the Iterative Closest Point Cloud algorithm, because the goal is to estimate a rotation matrix and a translation vector that would move the average face to an optimal position, near the face of the kinect. Basically, it is required to minimize the error and this is done by calculating the Jacobian matrix of this system.

Of course this process is applied iteratively, and below are presented a few stages of positioning of the model over the scan:

The correspondences between the points is determined by comparing the orientation of the vertices from the model with the closest points from the scan. Here you can see an example with the correspondences. Note that at the edges of the scan, the corresponces tend to form pyramids, because a lot of points from the model have the same closest neighbour.

• Non-Rigid Transformation

Once the model is roughly aligned, we need to modify the shape of the model to match the face from the scan. For this we make use of the eigenvectors computed in the previous phase and we calculate a Jacobian matrix for the system: , where is the matrix of eigenvectors, is the current form of the model and is the vector of basis coefficients that need to be determined.

However, there is on more constraint to be applied and that is to minimize the sum , where is the eigenvalue of the corresponding eigenvector. Therefore, to the Jacobian matrix of this transformation, we need to add a diagonal matrix with on the diagonal.

• Results

As mentioned above, these functions are applied alternatively for a few number of times, and the following results were obtained:

The above picture was obtained after one iteration and the following one after 10:

Implementation of the shape generator
Tuesday, July 08, 2014

The work on the lccp implementation is mainly done and the pull request is awaiting its approvel. Thanks a lot for the help of the pcl community (especially Sergey and Victor). Thanks to your comments the lccp algorithm is now in a much better shape. Talking about shapes: The next milestone in this GSOC project is the implementation of a shape generator which can be used to create various labeled scenes which can be used to create unit tests or benchmarks for part and/or object segmentation algorithms. I have written a big part of the code already. Now the question is: To which module of pcl should this generator go? I think about putting it into the geometry module. Any comment on this is more than welcome! The next image shows an assembled animal-like object which has been generated from simple geometric shapes (mainly geons).

2. Implementation of local-based approach for stereo matching
Saturday, June 28, 2014

INTRODUCTION

In this post, I will briefly describe the current state of the stereo module and the new features added.

Currently, the stereo module encompass two matching local-based algorithms: 1. Block-based algorithm, which is programed using the Box-Filtering algorithm proposed in [McDonnell81]. 2. Adaptive Cost 2-pass Scanline Optimization, presented in [Wang06]. Both methods use the Sum of Absolute Differences (SAD) as the dissimilarity measure.

As mentioned in the previous blog, the first objective of the present project is to implement the local-based approach proposed in [Min1], for dense correspondence estimation in a pair of grayscale rectified images with an efficient cost aggregation step. Additionally, the cost aggregation step in based on the method presented in [Yoon06], where the weighting function uses a similarity measure based on the color and spatial distances.

DETAILS

In order to do so, a new class CompactRepresentationStereoMatching was created in the stereo module. This class inherits from class GrayStereoMatching, which in turns inherits from class StereoMatching, since some pre and post-processing methods are re-implemented. The new class has five member functions with public access: setRadius, set FilterRadius and setNumDispCandidates, setGammaS, setGammaC, which set three data members of type int (radius, filter_radius and num_disp_candidates) and two of type double (gamma_c and gamma_s) with private access, as well as implementing the virtual method compute_impl.

radius corresponds to the radius of the cost aggregation window, with default value equal to 5.

filter_radius corresponds to the radius of the box filter used for the computation of the likelihood function. The default value is 5.

num_disp_candidates is the number of the subset of the disparity hypotheses used for the cost aggregation. The default value is 60.

gamma_c is the spatial bandwidth used for cost aggregation based on adaptive weights. The default value is 15.

gamma_s is the color bandwidth used for cost aggregation based on adaptive weights. The default value is 25.

Similarly to the previous methods, the current class is based on the SAD matching function, and it estimates the per-pixel cost efficiently using the Box-Filtering algorithm.

To test the algorithm, the Middlebury stereo benchmark (http://vision.middlebury.edu/stereo/) dataset is going to be used.

 [McDonnell81] McDonnell, M. J. “Box-filtering techniques”. Computer Graphics and Image Processing 17.1, 65-70, 1981.
 [Wang06] Wang, Liang, et al. “High-quality real-time stereo using adaptive cost aggregation and dynamic programming.” 3D Data Processing, Visualization, and Transmission, Third International Symposium on. IEEE, 2006.
 [Yoon06] K.-J. Yoon and I.-S. Kweon. “Locally Adaptive Support-Weight Approach for Visual Correspondence Search”. In Proceedings of Conference on Computer Vision and Pattern Recognition (CVPR), 924–931, 2005.
RSD Feature computation and analysis
Thursday, June 26, 2014

RSD Feature is a local feature histogram that describes the surface local to a query point. There is pcl implementation for this that is available in the features folder. With the help of my mentor I understood the algorithm by which this feature is obtained. To very if this is working perfectly we took a real object whose radius is known and generated the RSD computation on the entire point cloud of the object. This gives RSD Feature Histogram for all the points in the pointcloud. We can also get the min and max radius of the local surface patch around each point in the pointcloud. I generated various combination of parameters to know how the radius computed varies. Below is the object used which has a radius of 3.5cm which is 0.035m

Below are some of the params chosen and their corrresponding effect on the min and max radius in the local surface patch of each point. For Normal Radius search = 0.03 Max_radius = 0.7 (maximum radius after which everything is plane) RSD_radius search = 0.03

For Normal Radius search = 0.03

For Normal Radius search = 0.02

RSD_radius search = 0.03 - This is found to be good way for generating histograms

I tried to do MLS smoothing on the point cloud data and then compute the RSD feature which makes the normal computation better and resulting in consistency over all the points on the object surface.

For Normal Radius search = 0.03

For Normal Radius search = 0.03

For Normal Radius search = 0.02

RSD_radius search = 0.03 - This is found to be good way for generating histograms

Now I tested out how the actual feature looks like at a point on the sphere to check if it matches with the histogram in the paper. The same is compared between raw point cloud from the kinect and MLS smoothened point cloud. Below is the result of the same.

It was really hard to fix the previous image that it can show the histograms with values and good resolution. So below is the snapshot of the spherical and cylinderical surfaces.

Cylinderical Surface:

Spherical Surface:

Next post will have the details of how GRSD results are and how they differentiate the characteristics of two surfaces. GRSD code from the author will be integrated into the PCL code base. We also plan to categorize the pipeline into modules that fit into the PCL code base as features, surface and segmentation sections. These information will be posted in the next post.

Modifications of the gpu/people module and first steps for data collection
Wednesday, June 25, 2014

As the first weeks of GSOC are over I am going to summarize my progress so far. The goal of my project is to apply machine learning techniques on collected skeletal data for activity recognition. Most research in this area has been using Nite or Kinect SDK for skeleton tracking. As PCL already has a pose detector available, we want to try using it to collect skeletal information, which however, requires some modifications of the gpu/people module, which was the major focus of my work until now.

First steps

Before the actual start of coding it was important to research the existing methods for action recognition. We decided to implement a novel approach published by Ferda Ofli, Rizwan Chaudhry, Gregorij Kurillo, Reneé Vidal, Ruzena Bajcsy - “Sequence of the Most Informative Joints (SMIJ): A new representation for human skeletal action recognition”. In the presented paper, skeleton joints are ranked based on their variance and the sequence of the resulting rankings is used as the input feature vector for classification.

The next step was to make a draft of activities that we are planning to recognize. We decided to stick with the set of actions proposed by the IAS-Lab of University of Padua, which includes: check watch, cross arms, kick, get up, pick up, point, punch, scratch head, sit down, stand, throw from bottom up, throw over head, turn around, walk and wave.

Trying out the pcl/gpu/people module

Obtaining skeleton data with gpu/people module was not as easy as it seemed to be from the first sight. After the challenge of compiling the source with GPU enabled and making it run with Openni, the detector worked with following configuration:

• Ubuntu 14.04 LTS
• CUDA-toolkit 5.5.22
• Openni 1.5.4
• Avin2 Sensor Kinect driver v 5.1.2

The tree files necessary for the detection were not provided in trunk and should be downloaded here: https://github.com/PointCloudLibrary/data/tree/master/people/results .

The pose detector runs on the RGB and Depth data obtained from an RGB-D sensor (we use Microsoft Kinect) and produces colour labelling of each pixel, depicting the corresponding body part. After testing the detector following observations should be mentioned:

• The people detector worked very well in certain positions, the best case includes frontal orientation, near-range with no walls or ground visible.
• Despite the picture in the tutorial , the current implementation does not provide positions of the skeletal joints, which was also an issue for discussion . Consequently, obtaining the joint position is one of the challenges of this project.
• The program has problems with large surfaces, mostly the walls and the floor, which are often labelled as the human. This issue also occurs in the official demo video . As out project requires full body visibility, it is necessary to fix this problem (especially as it comes to the floor).
• In unusual positions, especially while turning, the body segmentation is usually correct, but the labelling often fails.

Trying out the pcl/gpu/people pose detector with one tree (left), three trees (middle), problems with large surfaces (right)

Extending the pose detector to estimate the positions of the joints

At the beginning it was necessary to browse through the code in order to understand how the program works. The detection happens by calling the process() function of the PeopleDetector class. In the end of the function, the variable sorted2 (with type BlobMatrix) contains the data of each body-part-label.

There are 26 labels (currently 24 used) all together: Lfoot, Lleg, Lknee, Lthigh,Rfoot, Rleg, Rknee, Rthigh, Rhips, Lhips, Neck, Rarm, Relbow, Rforearm, Rhand, Larm, Lelbow, Lforearm, Lhand, FaceLB, FaceRB, FaceLT, FaceRT, Rchest, Lchest, Rshoulder, Lshoulder.

The labels Rshoulder and Lshoulder exist but are currently not implemented in the detection.

The PeopleDetector class was extended with an array containing the coordinates of the joints and finding a way to calculate those coordinates was a major challange. The first idea was to simply use the mean values of the corresponding blobs. In spite of the simplicity of this approach, the results were satisfying. The second idea was to use the buildTree() function, which estimates the optimal blob-tree starting from the neck and then recursively browse through the child-blobs and use their mean values. The buildTree() function uses the “ideal” lengths of the limbs to estimate the optimal blobs (those optimal values are defined here). I also want to thank Koen Buys for giving me tips on the calculation.

As we are interested in the position of the limbs, using the mean values of the blobs is not always appropriate. For example, we are more interested in the upper border of the hip, which is connected to the torso, instead of the hip centroid. Besides this, we are also interested in the shoulder position, which was not implemented. The elbow label was also a special case as it usually has very small area and is often not detected. Consequently, I made some additional modifications to estimate those positions, which are described below.

Shoulders:

• Basic idea: Use the highest part of the chest blob as the shoulder position
• The point cloud of the left/right chest blob is extracted.
• The maximum Y-value of this point cloud is calculated
• All the points of the chest-blob that have Y-value close to to the maximum (chosen threshold: 10 cm) are taken, their 3D mean value is calculated and used as the shoulder position

Elbows:

• If an elbow-blob already exists, nothing is done: the mean value is used.
• Otherwise: The elbow is the point of the arm (upper arm) blob, which has the largest distance from the shoulder

Hips:

• The mean of the “lowest points” (in a certain threshold) of the hip-blob (not the mean of the whole blob). This modification was done due to the fact that the blob itself is covering the whole torso.

In general the quality of the joint position depends directly on the quality of labelling. As no tracking is implemented yet, the movement of the joints is not continuous.

Skeleton visualization: “good” example with “checking watch” action (left), labelling fails when the floor is visible (right)

Using Ground Plane People Detector for body segmentation

As mentioned before, the original People Pose Detector has some problems with large surfaces, especially the floor. We tried to solve this problem by combining the original People Pose Detector with Ground Plane People Detector (implemented by my mentor, Matteo Munaro), to segment the body cluster before the actual labelling.

In the resulting application, at first the three points of the ground plane are selected, after which the Ground Plane People Detector removes the ground plane and estimates the point cloud belonging to the human. The points of the cluster are then transformed to the depth image, setting all other depth pixels to very high values.

Some additional corrections were added to improve the segmentation results (depth filtering, extending the legs, as too many ground floor points are removed). Additionally, the RGB and Depth calibration (8 pixel shift) is done as proposed by Lingzhu Xiang .

Using the Ground Plane People Detector improves the performance significantly if the full body visibility is required as it completely solves the large-surfaces-problem.

It should also be considered, what should be done if the Ground Plane People Detector does not detect the human (meaning that none of the detected clusters had confidence over the defined threshold). In this case we use the segmentation from the last frame, in which the person was detected.

Pose detector with full body visibility without (left) and with (right) segmentation.

Examples of activities: cross arms, check watch, kick, point, turn, wave

Storing the data

I am currently working on completing the framework for data collection. Storing skeletal information (3-D position of the joints) in TXT files is already implemented. People Pose Detector already includes optional storage of depth and RGB-data as PNG images. However, we decided to store the RGB and Depth images with more efficient method using the lzf-format (thanks to Lingzhu Xiang for the tip). Another idea I am working right now is to run the people pose detector offline on the stored images to use the full speed.

Statistical Face Model ( First phase )
Sunday, June 22, 2014
• Introduction

The goal of this project is to implement a program that will modify the expressions of several scanned faces according to the facial expressions captured by a RGBD camera.

The first step is to create a statistical model based on a training database of faces. The training set used so far was the one provided by the FaceWarehouse project and it consisted of 3D meshes stored in .obj files. For further information, please consult the following link: http://gaps-zju.org/facewarehouse/

• Aproach

For each face in the training set, a column vector S was created and it contained the coordinates for every vertice of the mesh. Afterwards, the avearage vector and the covariance matrix were calculated. Normally, the covariance matrix would be calculated as , however one should note that this matrix is 34530 by 34530 and in order to compute the statistical model, the most significant eigenvectors are required. To speed up the calculations, a matrix was formed by joining the vectors and the eigenvectors for were calculated. It is important to note that the size of is determined by the number of faces and that the eigenvectors of the covariance matrix can be obtained by left multiplying to the eigenvectors of . Once the eigenvectors are calculated, the statistical model is obtained according to the formula: , where is the weight of an eigenvector, determined by multiplying a random number in the range [-2,2] with the corresponding eigenvalue. The final results of this phase are presented below. The average face is:

And the model is:

As you can see, the model obtained is a bit flattened compared to the mean face, that is because in the training set the majority of the faces are a bit rounded, however this project needs a model to take into consideration several types of faces, and this is why we need to consider the covariance of the samples in the database.

• Feature Steps

• For this model, only the vertices of the faces were used, however the texture coordinates also need to be taken into consideration. Unfortunately, the database does not provide any information about the colors as of yet. Once the data is available the model needs to be adapted for this feature
• Once the statistical model is fully configured, a 3D registration algorithm must be applied to project the facial expression of a testing sample to the model.
• References

T. Vetter and V. Blanz, A Morphable Model For The Synthesis Of 3D Faces, Max-Planck-Institut, Tubingen, Germany

1. Reduction of computational redundancy in cost aggregation in stereo matching.
Saturday, June 21, 2014

INTRODUCTION

A stereo image pair can be used to estimate the depth of a scene. To do so, it is necessary to perform pixel matching and find the correspondences in both images. Different methods for stereo correspondence have been proposed and they are classified in two classes:

• Correlation-based algorithms: Produce a dense set of correspondences.
• Feature-based algorithms: Produce a sparse set of correspondences.

Additionally, correlation-based algorithms are usually classified in two main groups, local (window-based) or global algorithms. However, some methods do not fit into any group, and are classified in between them.

The current work is based on correlation-based algorithms, more espefically local and window based-methods, intended for applications where a dense and fast output is required.

The input of the algorithm are two calibrated images, i.e. the camera geometry is known. The images are also rectified in order to limit the correspondence to a 1D search.

CORRESPONDENCE ESTIMATION

The general methodology for stereo vision local approaches can be summarized as follows. An energy cost is computed for every pixel p by using the reference and d-shifted right images:

(1)

Then, the aggregated cost is computed by an adaptive sum of the per-pixel cost:

(2)

Finally, a Winner-Takes-All method is used to find the best of all the disparity hypothesis:

(3)

This whole process is complex and time consuming since it is repeated for every hypothesis d. A representation of the conventional approaches can be observed in next figure [Min1].

Min et al. [Min1] introduced a new methodology to reduce the complexity, by finding a compact representation of the per-pixel likelihood, assuming that low values do not provide really informative support. In this case, only a pre-defined number of disparity candidates per pixel are selected to perform the cost aggregation step. The subset of disparity hypotheses correspond to the local maxima points in the profile of the likelihood function, previously pre-filtered to reduce the noise, as shown in the following example:

The disparity hypotheses estimation and cost aggregation processes proposed by Min et al. are depicted in the next figure, where Sc is the subset of disparity hypothesis with size Dc:

 [Min1] Min, D., Lu, J., & Do, M. N. “A revisit to cost aggregation in stereo matching: How far can we reduce its computational redundancy?.” In IEEE International Conference on Computer Vision (ICCV), 2011 (pp. 1567-1574).
2. Fitting superquadrics (a.k.a. The Horror of Box Constrained Non-Linear Optimization)
Friday, June 20, 2014

In this post we will see some initial results for fitting superquadrics to full pointclouds. Let’s quickly start with the math so you have a good idea of how the code works.

I will remind you again the superellipsoid equation (yes, I will keep bringing up the equation again and again so it stays forever on your brain):

(1)

We will call the expression to the left . A 3D point() will belong to the canonical superquadric defined by the parameters if . To have a general superquadric, we must consider the translation and rotation terms, hence the general equation (1) has the following form:

(2)

where is the translation of the superellipsoid center with respect to a given global frame and are the column vectors of the rotation matrix of the superellipsoid (again, with respect to some defined global axes). In fact, to be completely rigurous, we should express as a function of both the point being evaluated and the superellipsoid parameters being used: where

In order to find the superellipsoid that best fit a given full pointcloud (composed by 3D points with , we need to minimize the error between each point and equation (2). In its most basic form, we could try to minimize this equation:

Some wise people suggested a couple of modifications to the basic version above and came up with this:

the factor makes sure that the superellipsoid obtained is the smallest possible. The additional exponent improves the convergence time.

As of now, I will not go into more details on the math behind since I am running out of time to write this entry. However, there are a few things that you should remember of this post:

• Our goal is to solve a Non-Linear Square problem.
• We have 11 parameters to find
• Our parameters are bounded, which means that they have upper and lower limits. This constraints the type of algorithms we can use to optimize our solution.
• The more dense the pointcloud is, the more factors will be considered in the equation above.

As of now, we have implemented the method described in [Duncan13] to fit segmented, full pointclouds to superellipsoids. In this post we will present some initial results obtained for 4 test cases (sphere, box, cylinder and oval shape). We tested them in 3 scenarios:

• Canonical superellipsoids with no noise.
• General superellipsoids (rotation and translation added) no noise.
• General superellipsoids with noise (up to 5% of the length of the smallest principal axis).

Case 0: Parameters used
Parameter Sphere Ellipsoid Cylinder Box
a 0.25 0.04 0.05 0.025
b 0.25 0.08 0.05 0.08
c 0.25 0.06 0.1 0.18
e1 1 0.75 0.25 0.1
e2 1 0.75 1.0 0.1

The pointclouds are shown in the following figure and are also available in my repository .

The code used for the results shown in the rest of this post can be found here .

For the first test case we generated full sampled pointclouds by using the sampling code we presented in our previous post. To initialize the maximizer, we used the pointcloud’s bounding box information for the superellipsoid dimensions and global transform. For all cases we used an initial value of 0.5 for and 1.0 for (these values are in the middle of the allowed range).

Results for the fitting are shown in the following table. It can be seen that the fitting works pretty well, which is kind of expected since this is the most basic case.

Case 0: Results
Parameter Sphere Ellipsoid Cylinder Box
a 0.247 0.039 0.0499 0.025
b 0.247 0.079 0.049 0.079
c 0.247 0.059 0.099 0.179
e1 0.99 0.753 0.271 0.10
e2 0.99 0.753 0.97 0.13

For case 1, we modified the test pointclouds by applying a transformation to them. Details of the transformations for each case are shown below (the parameters remain constant so we omit to repeat them).

Case 1: Parameters
Parameter Sphere Ellipsoid Cylinder Box
x 0.5 -0.6 -0.4 -0.1
y 0.8 0.2 0.7 0.3
z 0.0 0.0 0.3 0.5
roll 0.0 0.2 0.6 0.0
pitch 0.0 0.5 0.9 0.0
yaw 0.3 0.3 0.8 0.8

The results are shown below. We observed that the parameters that remain constant keep approximately the same fitted values as in Case 0, so we won’t repeat them in the next table.

Case 1: Results
Parameter Sphere Ellipsoid Cylinder Box
x 0.5 -0.6 -0.4 -0.1
y 0.8 0.199 0.69 0.3
z 0.0 0.0 0.29 0.49
roll 0.0 0.199 -0.117 0.0
pitch 0.0 0.49 1.02 0.0
yaw 0.0 0.29 -0.055 -2.34

We can observe that the translation values are well fitted, while the same is not the case for the rotation values. In the next post we should discuss some ideas to fix that.

Finally, we added noise to the pointclouds. The values used are shown in the following table (meaning that a uniform disturbance between was randomly applied to each point in the pointcloud.

Case 2: Parameters
Parameter Sphere Ellipsoid Cylinder Box
0.01 0.002 0.0025 0.0015
Percentage 4% 5% 5% 6%

The pointclouds are shown in the following figure:

The final parameters are shown below:

Case 2: Results
Parameter Sphere Ellipsoid Cylinder Box
a 0.24 0.039 0.049 0.025
b 0.245 0.079 0.049 0.08
c 0.249 0.0601 0.102 0.19
e1 1 0.76 0.35 0.33
e2 0.91 0.71 0.92 0.1
x 0.49 -0.59 -0.39 -0.1
y 0.8 0.19 0.70 0.3
z 0.0 0.0 0.30 0.49
roll 0.0 0.19 0.95 0.0
pitch 0.0 0.50 0.47 0.0
yaw -2.8 0.30 1.34 -2.3

I should probably put a parallel table with all the original values so you can compare visually more easily. In any case, couple of observations:

• Rotation final values are the ones with the biggest errors.
• Noise levels that exceed 5% are not acceptable (the fitting values and vary significantly, altering the shape of the object. The other parameters keep being reasonably accurate.
• We are not using any loss function to alleviate the effect of the outliers. Here that did not prove particularly important, but when we do experiments with data that is not synthetically obtained (real pointclouds of objects) it will probably matter.
• A crucial requirement for a good fit is the initialization of the Z axis of the object (revolution axis).
 [Duncan13] Duncan, Kester, et al. “Multi-scale superquadric fitting for efficient shape and pose recovery of unknown objects.” Robotics and Automation (ICRA), 2013 IEEE International Conference on. IEEE, 2013.
Friday, June 20, 2014

In this post I will talk about superquadric uniform sampling. You might be wondering why you should care about sampling. Well, there are 2 reasons:

• Synthetic data: To test our superquadric fitting algorithm (next section) we will start using as a baseline canonical superquadrics with some added noise. Once a fitting algorithm works with these complete pointclouds, we will be able to proceed with more complex shapes (that are not necessarily superellipsoids but close enough to attempt to fith them).
• Debugging purposes: Visualization of our results.
• Beautiful math: The math of this part is simple, yet elegant. If you want to know the details, peek the code (linked below) or read the paper mentioned later in this entry.

From last time, you might remember that the Superellipsoids can be expressed with the explicit equations:

Now, let’s say that we want to generate pointclouds of different superellipsoids. Also by now let’s assume we only care about canonical superellipsoids (no translation and no rotation). A first idea would probably be to just sample the values of and to generate the 3D samples. Let’s see what happens if we use this simple approach:

You might notice that the results for the sphere and the oval shape are reasonably well distributed; however the results for the cylinder and the box are far from even. In general, the naive approach of uniformly sampling and would work well for high values of and (where high means closer to 1 than to 0). For lower values (such as for the box and cylinder cases where ) the performance of naively sampling the angles would degrade.

The following figure shows a more reasonable sampling:

The samples above were obtained by performing a simple technique proposed by Pilu and Fisher [PiluFisher95]. These authors noticed that in order to obtain an uniform sampling, the 3D distance between samples had to be constant; however, uniform sampling distance does not correlate with uniform angle steps.

An implementation of the algorithm proposed by Pilu and Fisher can be found in my PCL fork while the source example sampleSQ.cpp generates a pointcloud for a superellipsoid defined by user input parameters.

 [PiluFisher95] Pilu, Maurizio, and Robert B. Fisher. “Equal-distance sampling of superellipse models.” DAI RESEARCH PAPER (1995).
Dataset collection and curiosities in Kinect calibration
Friday, June 20, 2014

Difficult poses for the current PCL people detector

This project aims to improve people detection capability in PCL. To validate the improvement we need measurements and data as proof. Therefore the first step of the project is to collect and establish datasets of relevant human poses. PCL’s people detector assumes that human will be in upright poses instead of arbitrary poses, and the multiple stages of detection including segmentation clustering and classification in its implementation rely on such assumption, making it susceptible to varied human poses in real world. Thus I intend to collect various representative poses that deviate from upright poses. As is discussed with my mentor Matteo, those include:

• Crouching
• Hands up
• Sitting
• Dancing/running
• Lying on sofa or floor

These poses are not equal. Crouching, dancing, and running pose challenges to the image based human classifier which currently assume a sample window of a full-body upright pose. Sitting likely involves occlusion of some part of body and increases difficulty in segmentation. Hands up would introduce errors in segmentation or even false segmentation in the current head based subclustering as it determines each subcluster by its highest local maxima which could erroneously be the hands. While lying on sofa or floor would completely defeat the current approach of segmentation because the person is no longer an independent object to be segmented. So these are the types of data we will collect and evaluate upon.

Methods to collect data

There are different ways to collect point cloud data, but as is suggested in OpenNI and PCL Recording Codes, the most efficient capturing method in PCL is pcl_openni_image (or the only method that works because other ones in PCL either totally fail or explode in I/O) which grabs RGBD frames in compact pclzf format (samples). You use pcl_openni_image '#1' to test Kinect configuration and pcl_openni_image '#1' -visualize 0 to save batch frames. This produces compact data stream in 400KB per frame, or 12MB/second, with point clouds reconstructed afterwards when being used. You can visualize the data with pcl_image_grabber_viewer -pclzf -dir captured_frames/ -fps 30, or programmatically replace pcl::OpenNIGrabber with pcl::ImageGrabber like the image grabber viewer. The other way to collect data is to save raw frames and camera information in ROS bags and play back the image frames to reconstruct the point cloud on the fly and feed depth-registered point cloud topic to PCL via pcl_ros.

Problems with depth registration

pclzf format seems fine, until it is actually reconstructed to point clouds with a large bias in depth registration:

Depth registration is the first step of data collections. PCL’s people detector makes use of image based people classifier, which relies on correctly color registered point cloud. Somehow there are some peculiarities in how we perform the depth registration and calibration prior to that. The image above shows that color of the background is mapped onto the points that belong to the person.

My guess of this was that this might be caused by wrong calibration, or loss of calibration during the capturing and reconstructing process. So after redoing the intrinsic calibration of RGB camera and IR camera and extrinsic calibration as described in openni_launch tutorials and this Kinect calibration tutorial, I tried the other method with ROS openni_launch depth_registration:=true publish_tf:=true/false. The problem with depth registration persists, and OpenNI seems to ignore my calibration parameters or frame transforms no matter what and only use its built-in factory default parameters:

It turns out PCL does not perform any internal calibration at all and relies on OpenNI providing correct depth registration, and there is no way of updating OpenNI’s calibration parameters. Yes you can calibrate the Kinect all you want but there is no way to make use of the result in existing code base. This thread about what is happening behind OpenNI depth registration with pointers inside is a good read.

We can still do the whole depth registration process manually, as in pointcloud_utils.cpp, but unfortunately the data captured by pcl_openni_image is already depth registered somehow with wrong extrinsic calibration. To avoid spending too much time on perfecting data calibration, I decided to make a minimal fix to extrinsic error in the depth registration in pclzf data by shifting the RGB layer along X axis, in pseudo-code like this:

for (x = 0; x < width; x++)
for (y = 0; y < height; y++)
cloud(x, y).rgb = cloud(x + 8, y).rgb;

The result looks mostly acceptable with some residual “borders” around and can be improved later on:

Note that green box in the right of the person.

Datasets and first results

I collected datasets of various poses standing, crouching, sitting, and lying on sofa. One standing dataset looks like this (selected thumbnails):

There is also this result demonstrating the failure mode of the current head based subclustering method:

These extra erroneous segments (green boxes) are caused by the hands up pose with which the current clustering method would recognize the hands as heads. So this is where I will improve upon.

The following video are the first results on all current collected datasets. Green boxes are segmented clusters and red boxes are people detections. The minimum height is set to 0.1 so you will see lots of green boxes, which is for evaluating the current performance of the segmentation.

Some major observations for the results:

• Segmentation is very good and almost never misses the true person clusters.
• Hands up pose likely generates a bunch of false positive clusters, as explained above.
• Segmentation totally fails in the sofa dataset.
• The classifier can have some room of improvement.

So this is about everything for this time. Next time I will write about dataset annotation and performance improvement of some first techniques that I will implement.

Thursday, June 19, 2014

Superquadrics are a family of geometric shapes that can represent a wide array of diverse primitives using a small number of parameters. As an example, look at the figure below: The first row depicts 5 common household objects. The second row shows the superquadrics that more closely resemble them.

Superquadrics were initially introduced in the computer graphics community by Alan Barr [Barr81], but they were later adopted by the robotics community as an effective modelling tool to approximate objects shape. In general, superquadrics include superellipsoid and supertoroids, but for most practical uses, we care for only superellipsoids. These can be expressed with the following formula:

(1)

As it can be seen, superellipsoids in their canonical form can be expressed by 5 parameters:

• : Scaling factors along principal axes
• : Shape factor of the superellipsoid cross section in a plane orthogonal to XY containing the axis Z.
• : Shape factor of the superellipsoid cross section in a plane parallel to XY.

If a general transformation is considered, then the total number of parameters required to define a superellipsoid is 11 (the 6 additional being the rotation and translation degrees of freedom )

Expression (1) shows the implicit equation of the superellipsoids. Its parametric solution can be expressed as:

with and . In our next post we will learn how to generate pointclouds for superellipsoids (which is not as simple as just sampling and ! Stay tuned :).

 [Barr81] Barr, Alan H. “Superquadrics and angle-preserving transformations.” IEEE Computer graphics and Applications 1.1 (1981): 11-23.
First post - LCCP algorithm implementation
Monday, June 16, 2014

Hello everybody, this is my first blog post to this project. The last weeks I have been busy implementing the LCCP algorithm in PCL. The algorithm can be used to split a point cloud into regions which are isolated by concave boundaries to all other regions. It turns out that this is a highly valuable segmentation as it often retrieves (bottom-up!) nameable parts like handles, heads and so on. Especially robotic applications may find this useful. Together with Jeremie I will introduce it at this year’s CVPR 2014 conference (S. C. Stein, M. Schoeler, J. Papon, F. Woergoetter: Object Partitioning using Local Convexity). We will have a poster next Tuesday afternoon. So if you are around, you are more than welcome to visit us there. In the meantime I hope I get the pull request through, so that everybody interested can play around with the algorithm. It will be located in the segmentation module. There is also an example pcl_example_lccp_segmentation. To get you interested the following image shows an example of the segmentation. As you can see all parts can be easily named.

That’s it for the first post. Hope to see some of you at CVPR. Stay tuned for more to come.

Richtsfeld’s code and results (First post)
Monday, June 16, 2014
• Introduction

Goal of this project is to integrate cluttered scene segmentation methods into pcl::segmentation. This sprint involves implementation of modules from two publications below.

• Z.-C. Marton, F. Balint-Benczedi, O. M. Mozos, N. Blodow, A. Kanezaki, L. C. Goron, D. Pangercic, and M. Beetz: Part-based geometric categorization and object reconstruction in cluttered table-top scenes; Journal of Intelligent and Robotic Systems, January 2014
• A.-Richtsfeld, T. Mörwald, J. Prankl, M. Zillich and M. Vincze: Learning of Perceptual Grouping for Object Segmentation on RGB-D Data; Journal of Visual Communication and Image Representation (JVCI), Special Issue on Visual Understanding and Applications with RGB-D Cameras, July 2013

These papers have made their code available and has pcl implementation to some extent already. But we will aim to make the modules interoperable in our implementation.

• Richtsfeld’s code and results

As a first step, Richtsfeld’s code was analyzed. Below is the highlevel picture of the structure of his code base along with comments on their functionality.

His code was tested on some scenes that were grabbed by me using Kinect. Below are the snapshots of the same.

His code works on organized pointcloud of type PointXYZRGB.

Code is yet to be tested for quantitative results with our annotated dataset.

• Brainstorming

These are some of the discussions I had with my mentor Zoltan and the summary is listed below.

• Zoltan’s work classifies the relations based on features computed on group of segments as 1-8 elements in a group.

• Richtsfeld’s work classifies the relations based on segment pairs.

• Richtsfeld’s work is limited to organized pointcloud and cannot handle a cloud that is fused out of many pointclouds of a scene say through registration.

• Richtsfeld’s work computes features which are inspired from Gestalt’s principles. There are someother features that are worth testing. These are the features used for structure discovery in a pointcloud data. Features are as below and more details on them are available in the publication - Collet, Alvaro, Siddhartha S. Srinivasa, and Martial Hebert. “Structure discovery in multi-modal data: a region-based approach.” Robotics and Automation (ICRA), 2011 IEEE International Conference on. IEEE, 2011.

• Additional features to test are:

• Shape Model
• Self Continuity
• Contour compactness
• Pair continuity
• Verticality
• Concavity
• Projection
• Alignment
• Surface compatibility
Results on the Castro dataset.
Monday, March 31, 2014

We’ve implemented almost everything, what was planned, and now we want to present our results.

First of all please watch this video with results on the whole Castro dataset. Road is marked with red, left and right road borders are indicated by green and blue lines respectively.

The main problem is the presence of holes in the road surface. This is caused by the holes in the input disparity maps. We decided not to inpaint them, because we have no information about scene in those points. But we will compute the quality of the labeling only in points with known disparity. It allows to estimate the results of our method independently of the quality of the method for generating disparity maps.

Our method has a significant advantage in situations with one or both curbs are clearly visible (with corresponding sidewalks). You can compare the result of the previous sprint’s method (left) with the result of our method (right) on the same frame (which has two clearly visible sidewalks).

Next, I’m going to show you the numerical results. Precision is a ratio of right detected road’s points to all detected pixels, recall is a percent of detected road’s points. Only points with known disparity are taken into account.

Final Report
Friday, January 24, 2014

It is with pleasure to share the successful completion of this Toyota Code Sprint in this final blog post. In this project, homography estimation based on multi-modal, multi-descriptor correspondence sets has been explored, and inspired the introduction of the multi-descriptor voting approach (MDv). The proposed MDv approach achieved a consistent accuracy in the 0.0X range, a level of consistency that is better than those based on single-type state of the art descriptors including SIFT. In the process, a framework for analyzing and evaluating single and multi-descriptor performance has been developed, and employed to validate the robustness of MDv, as compared with homography estimations based on a single descriptor type, as well as those based on RANSAC registration of best-K multi-descriptor correspondence sets. The code and dataset for this project are hosted on https://github.com/mult-desc/md, with dependencies on both PCL 1.7 and OpenCV 2.4.6.

Follows is an in-depth report detailing the project’s accomplishments, as well as design and validation considerations: