2024 Conference on Robot Learning (CoRL 2024)
TL;DR: We propose an SE(3)-equivariant grasp pose generative model by constructing a framework to learn SE(3)-invariant conditional distributions with Continuous Normalizing Flows.
EquiGraspFlow significantly improves grasp success rates when objects are rotated, outperforming current methods. Moreover, our model consistently delivers strong performance across various object orientations.
Traditional methods for synthesizing 6-DoF grasp poses from 3D observations often rely on geometric heuristics, resulting in poor generalizability, limited grasp options, and higher failure rates. Recently, data-driven methods have been proposed that use generative models to learn the distribution of grasp poses and generate diverse candidate poses. The main drawback of these methods is that they fail to achieve SE(3)-equivariance, meaning that the generated grasp poses do not transform correctly with object rotations and translations. In this paper, we propose EquiGraspFlow, a flow-based SE(3)-equivariant 6-DoF grasp pose generative model that can learn complex conditional distributions on the SE(3) manifold while guaranteeing SE(3)-equivariance. Our model achieves this equivariance without relying on data augmentation, by using network architectures that guarantee it by construction. Extensive experiments show that EquiGraspFlow accurately learns grasp pose distributions, achieves SE(3)-equivariance, and significantly outperforms existing grasp pose generative models.
Recently, generative model-based approaches for 6-DoF grasping, such as 6-DOF GraspNet [1] and SE(3)-DiffusionFields [2], have been introduced. However, the primary flaw of these existing grasp pose generative models is that they do not produce consistent grasp poses for rotated objects, leading to significant failure in some cases. An ideal model should generate grasp poses that undergo the same rotation and translation as the object. Such models are considered SE(3)-equivariant.
In this paper, we propose an SE(3)-equivariant 6-DoF grasp pose generative model that produces consistent grasp poses for rotated and translated objects. Denoting an object's point cloud by \( \mathcal{P} \) and a grasp pose by \( T \in \mathrm{SE}(3) \), a grasp pose generative model is represented by \( p(T | \mathcal{P}) \). With \( T' \mathcal{P} \) denoting the 3D transformation of the points in the point cloud \( \mathcal{P} \) by a transformation \( T' \), the SE(3)-equivariance of the grasp pose generative model is formulated as follows: for a transformed point cloud, equivalently transformed grasp poses should have the same likelihood.
Therefore, the required condition for SE(3)-equivariant grasp pose generation is that the generative model should learn SE(3)-invariant conditional distributions, that is, for every \( T' \in \mathrm{SE}(3) \):
\[ p(T' T | T' \mathcal{P}) = p(T | \mathcal{P}). \]
Conditional Continuous Normalizing Flows (CNFs) model a target conditional distribution \( q(T | \mathcal{P}) \) by transforming a prior conditional distribution \( p_0(T | \mathcal{P}) \) using time-dependent conditional angular and linear velocity fields \( \omega_\theta(t, \mathcal{P}, T) \) and \( v_\phi(t, \mathcal{P}, T) \), where \( \theta \) and \( \phi \) are trainable parameters. Denoting the transformed distribution at time \( t \) by \( p_t(T | \mathcal{P}) \), we train \( \omega_\theta \) and \( v_\phi \) so that \( p_1(T | \mathcal{P}) \) closely approximates \( q(T | \mathcal{P}) \).
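The generation process of \( T_\tau = (R_\tau, x_\tau) \sim p_\tau(T | \mathcal{P}) \) is as follows: sample \( T_0 = (R_0, x_0) \sim p_0(T | \mathcal{P}) \), then integrate the following ODEs from \( t = 0 \) to \( t = \tau \):
\[ \dot{R}_t = [\omega_\theta(t, \mathcal{P}, T_t)] R_t, \qquad \dot{x}_t = v_\phi(t, \mathcal{P}, T_t), \]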
where \( [a] \) is an operation that maps a 3D vector \( a \) to a skew-symmetric matrix defined as \( [a]_{12} = -a_3, [a]_{13} = a_2, [a]_{23} = -a_1 \). The following figure depicts a flow constructed from the velocity fields and ODEs.
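As a rough illustration of how such a flow could be integrated numerically, the following NumPy sketch implements the \( [a] \) operator exactly as defined above and takes simple exponential-map Euler steps. It assumes the ODEs take the spatial-frame form \( \dot{R} = [\omega] R \), \( \dot{x} = v \); the function names (`hat`, `exp_so3`, `integrate`), the step count, and the constant toy velocity fields are illustrative assumptions, not the learned networks \( \omega_\theta \) and \( v_\phi \).

```python
import numpy as np

def hat(a):
    """Map a 3D vector a to the skew-symmetric matrix [a]."""
    a1, a2, a3 = a
    return np.array([[0.0, -a3,  a2],
                     [ a3, 0.0, -a1],
                     [-a2,  a1, 0.0]])

def exp_so3(w):
    """Matrix exponential of [w] via Rodrigues' formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)          # first-order approximation near zero
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def integrate(R0, x0, omega_fn, v_fn, tau=1.0, n_steps=100):
    """Integrate dR/dt = [omega] R and dx/dt = v from t = 0 to t = tau.

    omega_fn and v_fn are stand-ins for the velocity fields; here they take
    (t, R, x) and ignore the point cloud for brevity (an assumption).
    """
    R, x = R0.copy(), x0.copy()
    dt = tau / n_steps
    for k in range(n_steps):
        t = k * dt
        w = omega_fn(t, R, x)
        v = v_fn(t, R, x)
        R = exp_so3(dt * w) @ R            # left multiplication: spatial-frame convention
        x = x + dt * v
    return R, x

# Toy fields: constant rotation about z (quarter turn per unit time), constant drift.
omega_const = lambda t, R, x: np.array([0.0, 0.0, np.pi / 2.0])
v_const = lambda t, R, x: np.array([1.0, 2.0, 3.0])
R1, x1 = integrate(np.eye(3), np.zeros(3), omega_const, v_const)
```

Updating the rotation through the exponential map keeps \( R \) on SO(3) at every step, which a plain additive Euler update would not.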
We demonstrate that starting from an SE(3)-invariant prior conditional distribution, SE(3)-equivariant conditional velocity fields preserve the invariance of transformed conditional distributions over time.
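The claim can be checked numerically on a toy example. The sketch below uses hand-written velocity fields that are SE(3)-equivariant by construction (they are illustrative stand-ins, not the paper's networks) and assumes the spatial-frame ODE \( \dot{R} = [\omega] R \). Transforming the point cloud and the initial pose by the same \( T' = (R', b) \) should then yield a final pose transformed by \( T' \) as well.

```python
import numpy as np

def hat(a):
    """Skew-symmetric matrix [a] of a 3D vector."""
    return np.array([[0.0, -a[2], a[1]], [a[2], 0.0, -a[0]], [-a[1], a[0], 0.0]])

def exp_so3(w):
    """Rodrigues' formula for the matrix exponential of [w]."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3) + hat(w)
    K = hat(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def shape_vector(P):
    """A vector that rotates with the cloud and ignores translations (toy feature)."""
    Pc = P - P.mean(axis=0)
    return (np.sum(Pc**2, axis=1, keepdims=True) * Pc).mean(axis=0)

def omega_toy(t, P, R, x):
    # Cross products and columns of R rotate with the input, so this is equivariant.
    return np.cross(shape_vector(P), x - P.mean(axis=0)) + t * R[:, 0]

def v_toy(t, P, R, x):
    return shape_vector(P) - t * (x - P.mean(axis=0))

def flow(P, R0, x0, n_steps=50):
    """Integrate the toy flow from t = 0 to t = 1."""
    R, x, dt = R0.copy(), x0.copy(), 1.0 / n_steps
    for k in range(n_steps):
        w, v = omega_toy(k * dt, P, R, x), v_toy(k * dt, P, R, x)
        R, x = exp_so3(dt * w) @ R, x + dt * v
    return R, x

rng = np.random.default_rng(0)
P = rng.standard_normal((32, 3))
R0, x0 = exp_so3(np.array([0.3, -0.2, 0.5])), np.array([0.2, 0.1, -0.4])
Rp, b = exp_so3(np.array([0.4, 1.0, -0.7])), np.array([0.5, -1.0, 2.0])

R1, x1 = flow(P, R0, x0)                                  # original scene
R1_t, x1_t = flow(P @ Rp.T + b, Rp @ R0, Rp @ x0 + b)     # transformed scene
```

Because every integration step commutes with \( T' \), the two trajectories stay related by \( T' \) for all \( t \); combined with an invariant prior, this is what keeps each \( p_t \) invariant.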
We utilize a prior conditional distribution \( p_0(T | \mathcal{P}) = p_0(R) p_0(x | \mathcal{P}) \) where \( p_0(R) \) is uniform over SO(3) and \( p_0(x | \mathcal{P}) \) is Gaussian in \( \mathbb{R}^3 \) with its mean located at the center of the point cloud \( \mathcal{P} \). It is trivial to show that this prior conditional distribution is SE(3)-invariant. The SE(3)-equivariance of the conditional velocity fields is decomposed into the equivariances on \( \mathbb{R}^3 \) and SO(3). The \( \mathbb{R}^3 \)-equivariance is achieved by subtracting the point mean from \( \mathcal{P} \) and \( x \). The SO(3)-equivariance is achieved by adopting the Vector Neuron (VN) architectures [3], which are designed to be SO(3)-equivariant. The structure of the velocity fields is depicted in the following figure.
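The invariance of this prior can be sanity-checked with a few lines of NumPy. The sketch below assumes an isotropic Gaussian with a hypothetical standard deviation `sigma` (the covariance is not specified here) and uses a QR-based sampler for uniform rotations; all function names are illustrative.

```python
import numpy as np

def random_rotation(rng):
    """Draw a rotation uniformly from SO(3) via QR of a Gaussian matrix."""
    Q, U = np.linalg.qr(rng.standard_normal((3, 3)))
    Q = Q * np.sign(np.diag(U))       # sign fix so Q is uniform over O(3)
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]            # reflect one column to land in SO(3)
    return Q

def sample_prior(P, rng, sigma=1.0):
    """Sample (R_0, x_0): uniform rotation, Gaussian position around the cloud centre."""
    return random_rotation(rng), P.mean(axis=0) + sigma * rng.standard_normal(3)

def log_prior(R, x, P, sigma=1.0):
    """log p_0(R) + log p_0(x | P) for the factorised prior."""
    log_p_R = -np.log(8.0 * np.pi**2)  # constant in R (one common volume normalisation)
    d2 = np.sum((x - P.mean(axis=0))**2)
    log_p_x = -0.5 * d2 / sigma**2 - 1.5 * np.log(2.0 * np.pi * sigma**2)
    return log_p_R + log_p_x

rng = np.random.default_rng(0)
P = rng.standard_normal((64, 3))
R0, x0 = sample_prior(P, rng)

# Apply the same transformation T' = (Rp, b) to the pose and to the point cloud.
Rp, b = random_rotation(rng), rng.standard_normal(3)
lp = log_prior(R0, x0, P)
lp_t = log_prior(Rp @ R0, Rp @ x0 + b, P @ Rp.T + b)
```

The two log-densities agree because the rotation term is constant and the Gaussian term depends only on the distance between \( x \) and the cloud centre, which rigid transformations preserve.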
However, directly using the VN architectures is not straightforward, since they require lists of 3D vectors as input while time \( t \) is a scalar. Thus, we propose an equivariant lifting layer that converts any scalar variable into 3D equivariant vectors, so that the lifted time can be input into VNs while preserving equivariance.
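To make the difficulty concrete, the sketch below shows a VN-style linear layer, which mixes channels of a list of 3D vectors and therefore commutes with rotations, next to a deliberately naive way of injecting time by appending the fixed vector \( (t, t, t) \) as an extra channel. The naive variant is a hypothetical strawman for illustration only and is not the lifting layer proposed here.

```python
import numpy as np

def vn_linear(V, W):
    """VN-style linear layer: V is (C, 3), a list of C 3D vectors; W is (C_out, C)."""
    return W @ V

def rot_z(theta):
    """Rotation about the z-axis by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

rng = np.random.default_rng(0)
V = rng.standard_normal((4, 3))
W = rng.standard_normal((2, 4))
R = rot_z(0.7)

# Rotating the inputs and then applying the layer equals applying, then rotating.
equivariant = np.allclose(vn_linear(V @ R.T, W), vn_linear(V, W) @ R.T)

# Naive time injection (hypothetical): append (t, t, t) as a fifth "vector" channel.
t = 0.3
t_vec = np.full((1, 3), t)
W_aug = rng.standard_normal((2, 5))
out_rotated_input = vn_linear(np.vstack([V @ R.T, t_vec]), W_aug)
rotated_output = vn_linear(np.vstack([V, t_vec]), W_aug) @ R.T
naive_equivariant = np.allclose(out_rotated_input, rotated_output)
```

The naive channel does not rotate with the object, so the layer output no longer transforms consistently; a lifted time vector has to be built from quantities that do rotate with the input.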
The structure of the equivariant lifting layer is depicted in the above figure. The procedure of the lifting layer is as follows:
We utilize a dataset obtained from the Laptop, Mug, Bowl, and Pencil categories of the ACRONYM dataset [4]. For data augmentation of the training dataset, we consider two strategies: None denotes no augmentation, and SO(3)-aug denotes augmentation by random rotations in SO(3). The evaluation metrics we utilize are Earth Mover's Distance (EMD) and grasp success rate. The EMD measures the distance between the distributions of the generated and ground-truth grasp poses, defined by the minimum geodesic distance on the SE(3) manifold required to align the samples. The grasp success rate is assessed by determining whether the Franka Panda gripper successfully holds the object following the grasping action. Both metrics are first averaged across the rotations for each object, and then averaged across all objects.
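A sample-based EMD of this kind can be sketched as an optimal one-to-one matching between generated and ground-truth poses. The per-pair distance below (rotation angle plus Euclidean translation distance, unweighted) is an assumption, since the exact SE(3) metric is not spelled out here; equal-sized sample sets are also assumed, and `aggregate` simply mirrors the two-stage averaging described above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def rot_z(theta):
    """Rotation about the z-axis by angle theta."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def se3_distance(Ra, xa, Rb, xb):
    """Assumed pose distance: geodesic rotation angle + translation distance."""
    cos_angle = np.clip((np.trace(Ra.T @ Rb) - 1.0) / 2.0, -1.0, 1.0)
    return np.arccos(cos_angle) + np.linalg.norm(xa - xb)

def emd(generated, reference):
    """Mean matched distance under the minimum-cost one-to-one assignment."""
    cost = np.array([[se3_distance(Rg, xg, Rr, xr) for (Rr, xr) in reference]
                     for (Rg, xg) in generated])
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def aggregate(metric):
    """metric[i][j]: value for object i under rotation j; average rotations, then objects."""
    return np.asarray(metric, dtype=float).mean(axis=1).mean()

reference = [(rot_z(0.1 * i), np.array([float(i), 0.0, 0.0])) for i in range(4)]
shifted = [(R, x + np.array([0.0, 1.0, 0.0])) for (R, x) in reference]

emd_same = emd(reference[::-1], reference)    # same poses in a different order
emd_shift = emd(shifted, reference)           # every pose moved by one unit along y
overall = aggregate([[1.0, 3.0], [5.0, 7.0]])
```

Reordering the samples leaves the EMD at zero, while shifting every pose by one unit gives an EMD of one, which is the behaviour expected of a distance between sample sets.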
Zero standard deviation with respect to object rotations
Identical metric values
@inproceedings{lim2024equigraspflow,
title={EquiGraspFlow: SE(3)-Equivariant 6-DoF Grasp Pose Generative Flows},
author={Lim, Byeongdo and Kim, Jongmin and Kim, Jihwan and Lee, Yonghyeon and Park, Frank C},
booktitle={8th Annual Conference on Robot Learning},
year={2024}
}