Article

Physics-Based Self-Supervised Grasp Pose Detection

1
Department of Autonomous and Intelligent Systems, Tekniker—Basque Research and Technology Alliance (BRTA), Iñaki Goenaga 5, 20600 Eibar, Gipuzkoa, Spain
2
Robotics and Autonomous Systems Group (RSAIT), Department of Computer Science and Artificial Intelligence, University of the Basque Country (UPV/EHU), 20018 Donostia-San Sebastián, Gipuzkoa, Spain
*
Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Submission received: 31 October 2024 / Revised: 24 December 2024 / Accepted: 25 December 2024 / Published: 28 December 2024

Abstract

Current industrial robotic manipulators have made their lack of flexibility evident: the systems must know the piece and its position beforehand. To address this issue, contemporary approaches typically employ learning-based techniques, which rely on extensive amounts of data, and an extensive grasp dataset is an often-sought way to obtain them. This work introduces our Physics-Based Self-Supervised Grasp Pose Detection (PBSS-GPD) pipeline for model-based grasping point detection, which is useful for generating grasp pose datasets. Given a gripper-object pair, it samples grasping pose candidates using a modified version of GPD (implementing inner-grasps, CAD support…) and quantifies their quality using the MuJoCo physics engine and a grasp quality metric that takes into account the pose of the object over time. The system is optimized to run on CPU in headless-parallelized mode, with the option of running in a graphical interface or headless and storing videos of the process. The system has been validated by obtaining grasping poses for a subset of Egad! objects using the Franka Panda two-finger gripper, comparing them with state-of-the-art grasp generation pipelines and testing them in a real scenario. While our system achieves accuracy similar to that of a contemporary approach, 84% in the real-world validation, it has proven effective at generating well-centered grasps 18 times faster than the compared system.

1. Introduction

Manipulation tasks often require the manual definition of grasping poses. This process is very time-consuming, as all the poses must be validated with a real robot. Hence, current lines of research on this matter are focusing on automating grasping pose estimation. The correct recognition of these points is essential in any manipulation task, especially in industrial scenarios, where efficiency and grasp quality play an essential role, thus making grasping point identification a key area of research within the robotics community.
Current research follows two main approaches [1,2], depending on whether the morphology of the piece is available beforehand (model-based approaches) or not (model-free approaches). Each approach has its strengths and weaknesses. On the one hand, model-free approaches are attractive as they try to calculate grasping points regardless of the piece, sacrificing overall accuracy and gaining flexibility. On the other hand, model-based approaches can give grasping points specially tailored for each piece, gaining precision but sacrificing scalability and generalization and, of course, requiring the model of the object. The two approaches portray the dichotomy between accuracy and scalability.
Moreover, due to the emergence of deep neural models and artificial intelligence (AI) techniques, researchers in the field of robotics are using those learning-based approaches in the context of grasping point detection. However, this comes with inherent drawbacks, the most important being the amount of data needed to train grasping point identification models. Generating data in real systems is costly in terms of both time and money. Thus, a popular go-to approach is to obtain the data from simulation. This can be corroborated by the rise in the popularity of simulators such as Google DeepMind’s MuJoCo [3], NVidia’s Isaac Sim and Isaac Lab [4], Unity, with tools such as Unity Robotics Hub [5], PyBullet [6] and so on. For instance, Zhang et al. [7] developed a convolutional neural network (CNN) specially tailored to ease the process of grasping cables in a cluttered environment, called CG-CNN (Cable Grasping Convolutional Neural Network). In this case, they use the PyBullet simulator to generate numerous scenes, applying domain randomization, and thus collecting a diverse dataset. Their CG-CNN is then trained with the synthetic dataset, yielding a model that can predict grasp quality. They also explore two grasping order strategies, a random policy and an estimated quality-based policy, and report that the quality-based policy outperforms the random one. The authors of [8] propose an end-to-end system that predicts the poses, categories and scores directly, taking the pointcloud as input and without the need for sampling, as they deem the sampling process to be time-consuming. In addition, to generate the training data for multi-object grasp planning in cluttered scenes, they propose an approach based on the Ferrari–Canny metric and PyBullet. They report better results in clutter removal experiments than contemporary studies.
The main difference between data-driven approaches and our approach is that our approach is centered on generating data while the data-driven approaches use these data to generate an algorithm capable of solving the problem online.
Further studying the model-free approaches, the literature shows that the traditional strategy is to use visual information given by cameras [9]. These approaches are appealing due to the fact that the system is focused on generalizing to never-before-seen objects or variable environment conditions. Nonetheless, the cost of lacking prior model knowledge is that the system is limited to the information seen by the sensors at the moment of picking, which are susceptible to noise, thus sacrificing overall accuracy. In [10] the researchers develop a grasp detection module based on key points that achieves real-time estimation in both the CPU and GPU and report a grasping success of 96% with a real robot. To train and validate their system, they use the Cornell [11] and Jacquard [12] grasp datasets, both formed by grasps annotated using simulation. The work in [13] showcases the development of a Grasp Pose Domain Adaptation Network with the main task of detecting possible 6 degrees of freedom (DoF) grasping poses in cluttered workspaces. They carry out experiments using the publicly available GraspNet-1 Billion [14] and Acronym [15] grasping pose datasets. Adding to that, Acronym is based on simulation and offers the option of generating cluttered scenes.
With respect to the model-based approaches, ref. [16] proposes a self-supervised robot system for 6D object pose estimation. Regarding the grasping poses themselves, the authors take 100 samples from the grasps calculated in [17], and choose the grasps that are kinematically valid. In [18], a learning-based approach for rigid objects in cluttered scenarios is presented. The grasp and placement trials are carried out in a simulated environment, using V-Rep [19], and then transferred to the real world. In their approach, points are sampled from the surface of the object and then reduced in number using the unsupervised partitioning around medoids (PAM) technique. The researchers in [20] use the Computer-Aided Design (CAD) model of the object as an input to their system in order to sample a set of accessible grasps, i.e., grasps that do not collide with the environment, and to attach to each grasp a set of simulated visuotactile images, forming a grasp database. When the object is seen, the system samples grasping pose candidates and tries to match the candidates with the ones in the database, thus obtaining the fittest grasping pose. They measure the fitness of a grasp based on the simulated contact images, the probability of a grasp giving good visuotactile information and the length of the motion needed to grasp the object, as well as the number of regrasps needed.
A deep analysis of the state of the art allows us to argue that the trend leans towards model-free online grasping point calculation, as these are flexible alternatives that allow the system to adapt to never-seen pieces without the need for re-adjusting the system. A good number of these works propose using learning-based techniques to solve this challenge, while not focusing in detail on the grasp pose estimation within their approaches, as stated in [17]. As previously mentioned, learning-based approaches need a large amount of data. Our work proposes a self-supervised grasp pose estimation system that uses the MuJoCo simulator to generate grasping poses for a given CAD in Wavefront OBJ format. The generated grasping poses can either be used directly to pick the object or to train learning-based approaches. Some works are similar to ours, in the sense that they also cover the automatic generation of grasping poses for a given set of objects using CADs as input, namely Acronym [15] and MultiGripperGrasp [21]. To generate their database, the authors of Acronym use the meshes available from ShapeNetSem [22], sample grasping points using an antipodal sampling scheme and label them using the Nvidia FleX [23] simulator. Their evaluation procedure consists of grasping the object in a gravity-less environment and then shaking it. Their dataset also enables the creation of cluttered scenes. MultiGripperGrasp is also a dataset containing labeled grasping poses with different grippers and objects; it also lets the user add new grippers and objects, assuming they are well modeled. Their dataset contains 329 objects from the GoogleScannedObjects dataset [24] and 16 objects from the YCB dataset [25]. To estimate the grasps, they first use GraspIt [26] to sample candidates from the object and then use Isaac Sim [27] to evaluate each grasping candidate.
This evaluation consists of monitoring the time of fall-off for each object and assigning the time as a score. It is worth noting that in both previous approaches, a standard value has been used for all the objects’ physical parameters. In Acronym, the friction is set to 1.0 and the density to 150 kg/m³, while in MultiGripperGrasp, the static and dynamic frictions are both set to 0.5 and the density to 100 kg/m³.
This manuscript directly compares our system with MultiGripperGrasp. Furthermore, the following assertions underline the principal novelties of our article:
  • We present a fully CPU-based simulation tool to estimate the grasping poses for a given object, thus avoiding the need to use expensive GPUs.
  • Our system generates grasps around 18 times faster than the compared system.
  • Our system achieves similar grasp accuracy results as the compared method, correctly executing 84% of the grasps.
  • Compared to manually defined grasping poses, our system achieves higher similarity to those grasps than the compared approach.
This document is organized as follows: Section 2 describes the system architecture, emphasizing the communication between the different elements. Section 3 presents the grasp validation pipeline, explaining all the steps of the logic of execution. Section 4 presents the experimentation followed to validate and quantify the fitness of the system, while Section 5 showcases the results obtained from said experimentation process. Finally, Section 6 collects the conclusions and lessons learned from this work.

2. System Architecture

The work presented in this paper proposes a two-stage system: an initial candidate sampler, using a modified version of GPD [28], and a realistic simulation environment using MuJoCo, both integrated within the ROS2 ecosystem. This section details both the communication between the parts and the technical aspects of the implementation of each part, as portrayed in Figure 1.

2.1. GPD Integration

As stated before, GPD has been integrated into the architecture encapsulated as a ROS2 node, providing custom ROS2 services as the server, and has been dockerized. When the docker is started, it launches the node in charge of initiating the two main service servers:
  • Initial grasp sampler service: This service receives the object model (CAD), as well as the information about the gripper and the fingers, that is, the height, width and depth of the fingers, the distance between them and the grasp mode it allows, either inner grasps, outer grasps or both. With all this information, it produces the grasping point candidates, each including the 6D pose of the object with respect to the gripper. Originally, the GPD returned the pose of the gripper in the object’s coordinate system, but we inverted the pose to obtain the pose of the object in gripper coordinates in order to ease the validation process in the simulation. Finally, more information about the initial candidate estimation is presented later in this very section.
  • File generation service: The service receives as input a list of 6D grasping poses after being validated in the simulator, as well as the list of poses “discovered” in the validation process and outputs a confirmation message, finally generating an XML file containing the validated grasps. The term “discovered” grasping poses is further explored in Section 3.
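The pose inversion mentioned for the initial grasp sampler service can be illustrated with a minimal sketch. It assumes that poses are represented as 4x4 homogeneous transforms; the function name `invert_pose` and the example values are hypothetical and are not taken from the GPD code base.

```python
import numpy as np


def invert_pose(T):
    """Invert a 4x4 rigid transform using the transpose of its rotation block."""
    T = np.asarray(T, dtype=float)
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T          # inverse rotation
    T_inv[:3, 3] = -R.T @ t      # inverse translation
    return T_inv


# Hypothetical gripper pose in the object frame:
# rotated 90 degrees about z and shifted 0.1 m along x.
T_obj_gripper = np.array([
    [0.0, -1.0, 0.0, 0.1],
    [1.0,  0.0, 0.0, 0.0],
    [0.0,  0.0, 1.0, 0.0],
    [0.0,  0.0, 0.0, 1.0],
])

# Pose of the object expressed in gripper coordinates.
T_gripper_obj = invert_pose(T_obj_gripper)
```

Composing `T_gripper_obj` with `T_obj_gripper` gives the identity, which is a quick sanity check that the two poses describe the same relative placement.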
Moreover, the original GPD implementation has been enhanced to allow having CADs as input, to later convert them to a pointcloud format, as no camera is involved in the self-supervised grasp pose estimation process. Adding to that, GPD has also been strengthened to allow the estimation of inner grasps.
Regarding the initial candidate estimation, this enhanced GPD part iterates over an initially defined number of grasp candidates, also taking into account a defined number of orientations, and checks whether each grasp is geometrically viable, that is, whether the gripper fingers do not collide with the object. It also checks the number of points from the pointcloud that lie between the fingers, since a threshold defines the minimum number of points that must exist between the fingers for a grasp to be considered a good candidate. Figure 2 shows an example of the initial candidate estimation and sampling using GPD.
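A simplified sketch of this kind of geometric check is given below. It is not the GPD implementation: it assumes the pointcloud has already been expressed in the gripper frame, models the fingers as axis-aligned boxes, and uses hypothetical names (`is_viable`, `opening`, `min_points`) and dimensions.

```python
import numpy as np


def is_viable(points, opening, finger_width, finger_height, finger_depth, min_points):
    """Check one candidate. `points` is an (N, 3) cloud in the gripper frame.

    Assumed axes: x is the closing direction, y the finger height,
    z the approach direction (z = 0 at the finger base).
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Points inside the slab swept by the fingers along y and z.
    in_slab = (np.abs(y) <= finger_height / 2) & (z >= 0) & (z <= finger_depth)
    # Points in the free space between the two fingers.
    between = in_slab & (np.abs(x) < opening / 2)
    # Points inside the volume occupied by either finger (collision).
    in_fingers = in_slab & (np.abs(x) >= opening / 2) & (np.abs(x) <= opening / 2 + finger_width)
    return (not in_fingers.any()) and int(between.sum()) >= min_points


# Hypothetical cloud: 200 points clustered between the fingers.
rng = np.random.default_rng(0)
cloud = rng.uniform(-0.01, 0.01, size=(200, 3))
cloud[:, 2] += 0.02

viable = is_viable(cloud, opening=0.08, finger_width=0.01,
                   finger_height=0.02, finger_depth=0.05, min_points=50)
```

In a full sampler this check would be repeated for every candidate position and orientation, transforming the cloud into the corresponding gripper frame each time.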

2.2. Simulation Environment

As mentioned previously, the simulation environment is based on the popular MuJoCo physics engine (version 3.1.5). More precisely, its native Python bindings have been used. MuJoCo has been chosen as our go-to physics engine as it is open source, regularly updated, has an active community, needs neither a dedicated GPU nor a high-end machine to work and, most importantly, is focused on contact-rich scenarios, which are the basis of robotic manipulation.
The simulation environment has been merged into the system architecture also as a ROS2 node, acting as the ROS2 service client to the aforementioned initial grasp sampler and file generation services.
Digging deeper into the implementation, it must be noted that MuJoCo works with the convex hulls of the objects in order to calculate the collisions. This is also a popular approach in video game engines, for computational reasons. Thus, as we seek the maximum amount of physical realism in contacts, which the convex hull does not provide, we have used convex decomposition software to subdivide the original geometry into smaller convex ones (see Figure 3). This vastly increases the realism of the contacts at the expense of computational power since, in essence, subdividing the geometry creates more geometries, which in practice translates into more contacts. This is why it is also vital to find an equilibrium between computational power and realism in contacts.
Initially, we tested the objects using V-HACD [29], but the results were not what we were searching for, as V-HACD removed the concavities in the objects, losing realism as a consequence. Hence, we opted to use CoACD [30].
This software is tailored to preserve the concavities of the objects and offers a good amount of parameters that enable us to fully customize the convex subdivision operative. It offers the ability to limit the amount of convex geometries created by the process or to sacrifice time in order to enhance the convex geometries’ quality, which directly translates into giving more realism to the subdivided geometries. It must be noted that, depending on the parameters, the subdivision process can take from a few seconds to a couple of minutes. Notably, CoACD also offers native Python bindings, which greatly reduce the time needed to integrate the software into our architecture.
It is also worth noting that, as we previously pointed out, the convex subdivision process is translated into more geometries in the scene, and therefore, more contacts. A great amount of concurrent contacts can lead to instability in the simulation, sacrificing the realism we are seeking. We experimentally found the balance between simulation stability and physical contact realism, limiting the maximum amount of geometries CoACD generates for the objects at 150.
Scaling the simulation to validate new objects is straightforward. The simulation’s key configuration parameters, such as the object to use, are defined in a JSON file. Before starting the simulation, the system checks whether the object has been previously subdivided, performing the subdivision if necessary. When the object is successfully subdivided, the system creates an MJCF file (the XML file MuJoCo uses to model the scene) associated with said object and the newly subdivided meshes. Then, the system reads which gripper is going to be used to generate the grasps and includes said gripper in the object’s XML, as MuJoCo works with a single XML that describes the whole scene. Of course, to include the gripper in the scene’s MJCF, the gripper must be correctly modeled and its MJCF generated. Regarding pre-generated grippers, we tried the system with a Robotiq 2f85 and a Franka Panda Hand (see Figure 4), both publicly available at the MuJoCo menagerie [31], the collection of robots and end-effectors curated by Google DeepMind. However, a couple of modifications have been made to the original designs:
  • All tools have been oriented and translated in order to make their closed fingertips touch the zero of the world coordinates, as seen in Figure 4.
  • Each gripper features a MuJoCo body called “g_ref” defined as a child of the base of the gripper. This is done in order to have a reference point for the gripper with a standardized name.
  • The grippers have had their finger actuator renamed to “finger_actuator” and the control range remapped to the 0 to 255 range in order to control it via script without the need to adapt the simulation to each actuator.
  • Each tool now includes an additional joint and actuator (with a standardized name) that lifts the gripper vertically 10 cm. The modeling of the joint and actuator is based on the “lift” modeled in the “Hello Robot Stretch 3”, available in the menagerie as well.
Again on the topic of automatic MuJoCo scene generation, the system also checks whether the selected gripper–object pair has been previously used in order to avoid creating unnecessary XML files. Note that, contrary to other works in the literature [21], we only “load” one gripper–object pair per simulation. There are no multiple concurrent workstations, mainly due to the intention of limiting the number of concurrent contacts and the design choice of loading only one XML representing the scene, as the MuJoCo version used in this work, 3.1.5, does not natively allow loading additional elements into the scene once it has been loaded into memory. Thus, we iteratively validate all the poses on a single workstation.
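The scene generation and reuse logic described above could be sketched as follows, using only the Python standard library. This is an illustrative assumption rather than the actual implementation: the file naming scheme, the function names and the use of the MJCF `include` element to pull in the gripper model are all hypothetical choices.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def build_scene(object_name, part_files, gripper_mjcf):
    """Return an MJCF string with one free body made of the convex parts."""
    root = ET.Element("mujoco", model=f"{object_name}_scene")
    # Assumption: the gripper model is pulled in through an MJCF include.
    ET.SubElement(root, "include", file=gripper_mjcf)
    asset = ET.SubElement(root, "asset")
    worldbody = ET.SubElement(root, "worldbody")
    body = ET.SubElement(worldbody, "body", name=object_name)
    ET.SubElement(body, "freejoint")
    for i, part in enumerate(part_files):
        mesh_name = f"{object_name}_part_{i}"
        ET.SubElement(asset, "mesh", name=mesh_name, file=part)
        ET.SubElement(body, "geom", type="mesh", mesh=mesh_name)
    return ET.tostring(root, encoding="unicode")


def get_or_create_scene(out_dir, object_name, gripper_name, part_files, gripper_mjcf):
    """Write the scene XML only if this gripper-object pair has no file yet."""
    path = Path(out_dir) / f"{gripper_name}__{object_name}.xml"
    if not path.exists():
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(build_scene(object_name, part_files, gripper_mjcf))
    return path


# Hypothetical object with two convex parts and a hypothetical gripper file.
scene_xml = build_scene("A00", ["A00_0.obj", "A00_1.obj"], "panda_hand.xml")
```

Each convex part becomes its own mesh geom inside a single free body, so the number of geoms (and therefore potential contacts) grows with the number of parts produced by the subdivision.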
When the MuJoCo scene is created, the system loads it into the simulation and proceeds with the validation process using one of the three available modes. Even though the validation is deeply studied in Section 3, for the sake of clarity some information must be advanced; the validation process in the simulation is carried out by a culling phase that iterates over all the candidates and evaluates them. The simulation offers three modes to run the validation process:
  • Headless Single-Thread (HST) mode: The simulation runs fully on a single thread, going as fast as that thread allows thanks to being run in headless mode. This alternative also implements the option to record videos of the validation process, which is useful to debug the process. The video recording function renders and saves the frames for each iteration, thus consuming RAM and computational power, and overall reducing notably the speed of the validation process, but still achieving faster than real (FTR) times.
  • Headless Multi-Thread (HMT) mode: The simulation uses all the available threads in parallel. To achieve parallelization, the system divides the list of grasp candidates into as many groups as the threads the CPU has. Then, using Python’s multiprocessing library, each group is loaded into a thread. The information of each grasp candidate is stored in a queue and finally merged together into a single list. As this mode is used to make the validation process go as fast as possible, the parallelized culling phase does not support video generation. This mode is the go-to option for fast grasp validation, as it is the mode that achieves the fastest FTR time.
  • Real-Time Interactive mode: This is the only mode that runs in real time, and it does not implement the video-saving tool. As it runs in real time, it is not fit to validate a vast number of grasps. For reference, a validation of 20,000 grasp candidates in this mode could take up to approximately 16 h and 36 min. This mode should only be used for debugging purposes.
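The candidate partitioning used by the HMT mode can be sketched as below. This is a minimal illustration under stated assumptions: `validate_group` is a placeholder for the per-group MuJoCo validation, and `multiprocessing.Pool.map` is used here for brevity instead of the explicit queue described above.

```python
import multiprocessing as mp
import os


def split_candidates(candidates, n_groups):
    """Split the list into n_groups contiguous groups of nearly equal size."""
    n_groups = max(1, min(n_groups, len(candidates) or 1))
    size, extra = divmod(len(candidates), n_groups)
    groups, start = [], 0
    for i in range(n_groups):
        end = start + size + (1 if i < extra else 0)
        groups.append(candidates[start:end])
        start = end
    return groups


def validate_group(group):
    """Placeholder worker: a real worker would simulate each grasp in MuJoCo."""
    return [(candidate, 0.0) for candidate in group]


def validate_parallel(candidates, n_workers=None):
    """Validate all candidates in parallel and merge the results into one list."""
    n_workers = n_workers or os.cpu_count() or 1
    groups = split_candidates(candidates, n_workers)
    with mp.Pool(processes=len(groups)) as pool:
        results = pool.map(validate_group, groups)
    return [item for group in results for item in group]
```

When used from a script, `validate_parallel` should be called under an `if __name__ == "__main__":` guard so that worker processes can be spawned safely on every platform.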

3. Validation Method

This section describes the grasp validation method, following the execution sequence, and digging deep into the metric we propose to estimate the quality of the grasps.

3.1. The Validation

The validation pipeline’s sequence is straightforward. Assuming that the node with GPD embedded is up and the ROS2 services are accessible, the simulation node is launched. This node sends the object CAD to the sampler, as well as the information of the gripper, and the sampler returns the grasping candidates. The candidates are then validated using the preferred validation method, i.e., headless multi-thread, headless single-thread or real-time interactive. When the simulation is loaded, it starts with the culling phase. The system generates the gripper always in the vertical pose, and the piece in the candidate pose, “freezing” the piece in the position. Then, the fingers are closed (in the case of opening grasping candidates, the order is inverted; first, the fingers are closed, then the piece is set into position and finally the order to open the fingers is sent). After a defined number of steps, the piece is “freed” and the poses start being saved. The culling process has four different outcomes:
  • Falling: The piece falls. In this case, the iteration is terminated the moment the piece moves a certain distance away from the gripper, and the score assigned to this candidate is 0.
  • Bad grasp: The candidate does not fall, but the movement of the piece inside the gripper is high enough to yield a bad grasp quality score. In this case, the grasp is culled from the candidates.
  • Good grasp: The candidate was good enough not only not to fall, but also to achieve a better score than the threshold.
  • Discovery grasp: During the development of the system, we observed that some grasps started with a lot of movement but ended completely still and stable. For example, if we pick a rectangle from the side with a certain amount of inclination, the rectangle will rotate to be parallel to the gripper. The initial grasp was incorrect, but if we consider the final pose as a grasp, it is actually good. To search for these cases, we also calculate the score of the object taking into account only the poses from the second half of the grasping operation. We also define a stricter threshold for this score. If the grasp is good enough, we record the pose of the object with respect to the gripper in the last step and add it to the list of grasps that passed the culling phase.
An example of a correct grasp is depicted in Figure 5, where both attempts are correct, while Figure 6 presents an incorrect attempt, caused by the rotation of the piece. This underlines the importance of a movement-based metric, such as the one presented in this work.
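The four outcomes of the culling phase can be summarized in a short decision sketch. The threshold values and the order in which the good and discovery conditions are tested are assumptions made for illustration; the scores themselves would come from the grasp quality metric.

```python
from enum import Enum


class Outcome(Enum):
    FALLING = "falling"
    BAD = "bad"
    GOOD = "good"
    DISCOVERY = "discovery"


def classify(fell, full_score, late_score, threshold=0.8, strict_threshold=0.95):
    """Classify one candidate.

    fell:        True if the piece moved too far away from the gripper.
    full_score:  score computed over all recorded poses.
    late_score:  score computed over the second half of the poses only.
    The threshold defaults are placeholders, not the experimentally tuned values.
    """
    if fell:
        return Outcome.FALLING, 0.0
    if full_score >= threshold:
        return Outcome.GOOD, full_score
    if late_score >= strict_threshold:
        # The final pose of the object is what gets stored as a new grasp.
        return Outcome.DISCOVERY, late_score
    return Outcome.BAD, full_score
```

For instance, a candidate with a low overall score of 0.4 but a second-half score of 0.97 would be reported as a discovery grasp under these placeholder thresholds.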

3.2. The Metric

The literature shows the lack of a standardized metric to measure the quality of a grasp. Our proposed method offers a continuous value to portray the quality of a grasp by analyzing the movement of the piece once it has been grasped, using the pose of the object during the grasping process. Thus, this approach is applicable to all state-of-the-art simulators and offers fine-grained information on the quality of the grasp, covering both the translation and the rotation of the piece.
The main objective of the metric is to give a continuous value that measures the rotation and translation of the piece during the grasping operation, not limited to a comparison of the initial and final poses, as this does not take into account all the movement the piece makes during the grasp. With the latter approach, if the piece moves during the grasp but ends in a pose similar to the initial one, the result would be a good score, far from reality.
The metric first estimates the mean translation ($t_m$) using the Euclidean distance of all the consecutive poses. Then, the translation score ($\mathrm{Score}_t$) is calculated from the mean translation by obtaining the inverse of the value, normalized by applying min-max normalization between certain thresholds. These threshold parameters ($\mathit{pos\_th}$ for the translation score and $\mathit{rot\_th}$ for the rotation score) have been obtained experimentally. The conditions that define the translational score can be observed in Equation (1).
$$\mathrm{Score}_t = \begin{cases} 0 & \text{if } t_m > \mathit{pos\_th} \\ 1 - \dfrac{t_m}{\mathit{pos\_th}} & \text{otherwise} \end{cases} \tag{1}$$
Similarly, the metric obtains the rotation score ($\mathrm{Score}_r$) by measuring the mean distance of all the consecutive quaternions ($r_m$). Then, the value is min-max normalized obtaining the inverse of the min-max normalization as well (see Equation (2)).
$$\mathrm{Score}_r = \begin{cases} 0 & \text{if } r_m > \mathit{rot\_th} \\ 1 - \dfrac{r_m}{\mathit{rot\_th}} & \text{otherwise} \end{cases} \tag{2}$$
Finally, we estimate the score of the grasp by calculating the weighted sum between the translation score, $\mathrm{Score}_t$, and the rotation score, $\mathrm{Score}_r$. The weights given to the rotation ($W_r$) and translation scores ($W_t$), can be modified in order to give more importance to the rotation or translation, depending on the context. The reason why we take the inverse of the min-max normalizations is that the lower the movement, the higher the score should be. The score does not measure how much the objects move, but how much the object does not move.
Thus, to obtain each pose, we access the pose information MuJoCo provides, calculate the pose with respect to the gripper, and store it every 50 simulation steps. This metric can be implemented in other simulation environments given that the information on the pieces’ pose is accessible. It must be noted that the thresholds that measure the upper bounds of the maximum rotation and translation, that is, $\mathit{pos\_th}$ and $\mathit{rot\_th}$, could be modified in order to allow more or less motion.
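A minimal sketch of the metric is given below for illustration. Two points are assumptions that the text does not fix: the distance between consecutive quaternions is taken as the rotation angle between them, and the weights default to an equal split. The function names are hypothetical.

```python
import numpy as np


def inverse_minmax(value, upper):
    """Score in [0, 1]: 0 above the threshold, 1 - value / threshold otherwise."""
    return 0.0 if value > upper else 1.0 - value / upper


def grasp_score(positions, quaternions, pos_th, rot_th, w_t=0.5, w_r=0.5):
    """Weighted sum of the translation and rotation scores.

    positions:   (N, 3) object positions relative to the gripper, N >= 2.
    quaternions: (N, 4) object orientations relative to the gripper.
    """
    positions = np.asarray(positions, dtype=float)
    quaternions = np.asarray(quaternions, dtype=float)
    quaternions = quaternions / np.linalg.norm(quaternions, axis=1, keepdims=True)

    # Mean Euclidean distance between consecutive positions (t_m).
    t_m = np.linalg.norm(np.diff(positions, axis=0), axis=1).mean()

    # Mean angle between consecutive orientations (r_m); assumed distance.
    dots = np.abs(np.sum(quaternions[:-1] * quaternions[1:], axis=1))
    r_m = (2.0 * np.arccos(np.clip(dots, 0.0, 1.0))).mean()

    return w_t * inverse_minmax(t_m, pos_th) + w_r * inverse_minmax(r_m, rot_th)


# Hypothetical trajectory: the piece slides 2 mm per sample without rotating.
identity = [[1.0, 0.0, 0.0, 0.0]] * 3
sliding = [[0.0, 0.0, 0.0], [0.002, 0.0, 0.0], [0.004, 0.0, 0.0]]
score = grasp_score(sliding, identity, pos_th=0.004, rot_th=0.1)
```

With these placeholder thresholds the sliding trajectory gets half of the translation score and the full rotation score, so a piece that never moves would be the only one to reach the maximum score of 1.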

4. Experimentation

The main scope of this work has been the development and evaluation of a self-supervised model-based grasp pose estimator based on the MuJoCo physics engine. To assess the adequacy of the approach, it must be validated both in terms of the time employed to obtain the grasps and the quality of the grasps themselves. To this end, we compared and studied the grasps and their distribution over the objects, analyzed the time-grasp relation and finally validated the grasps in a real robotic system. Most importantly, we compared our approach to the most similar state-of-the-art approach, that is, MultiGripperGrasp [21].
This section presents both the resources and the experiments employed to carry out the fitness assessment process. Thus, it has been divided into four parts: Section 4.1 gives a brief overview of MultiGripperGrasp, the system we are comparing with. Section 4.2 shows the objects used to estimate the grasping points. Section 4.3 describes the modeling of the experiments. Section 4.4 showcases the hardware used, both regarding the robotic system and the computers.

4.1. MultiGripperGrasp

MultiGripperGrasp, from now on MGG, also features a two-stage system: a sampling algorithm and a validation process in simulation. The sampling step is carried out in GraspIt! [26] and the initial candidate evaluation requires IsaacSim [27]. Notably, the user must manually bridge the output of GraspIt! as input to IsaacSim to be validated, since these two components do not communicate with each other. Note that objects need to be modeled twice, once for GraspIt! and a second time for the IsaacSim input (they provide a tool to prepare the objects for GraspIt!). To model the objects used in the experimentation, we followed the approach proposed in the paper [21], assuming a uniform density of 100 kg/m³ and friction of 0.5 for all objects. Most importantly, IsaacSim requires a high-end computer to be able to run. For instance, the minimum requirements to run IsaacSim, presented on the Nvidia Omniverse documentation website, are a quad-core Intel Core i7 (7th generation) processor, 32 GB of RAM, 50 GB of SSD storage, a GeForce RTX 3070 with 8 GB of VRAM and Ubuntu 20.04/22.04 or Windows 10/11 (quoting the Nvidia documentation, “More RAM and VRAM is recommended for advanced usage of IsaacSim.”). Regarding GraspIt!, MGG provides a docker image containing GraspIt! on a Python2 environment and on Ubuntu 18.04.

4.2. The Objects

Egad! [32] is a dataset consisting of over 2000 objects procedurally generated using evolutionary algorithms. The objective of this dataset is to offer a publicly available, open-source benchmarking tool to train and evaluate robotic systems, both in grasping tasks and in perception tasks. Most notably, this diverse dataset categorizes the objects depending on the complexity of their shape and the difficulty of grasping them. This categorization is portrayed in the names of the objects; the names are coded with one letter and a number, where the letter refers to the grasp difficulty and the number to the shape complexity. Its authors compared the dataset to other very popular choices within the robotics community, the YCB [25] dataset and the Dex-Net 2.0 [33] dataset, finding that their proposed dataset covers more geometric shapes and has more objects. Most importantly, they make their object models public, so they can be downloaded and even printed to test them in a real robotic system, while YCB and Dex-Net rely on the availability of real objects, some of which are nearly impossible to find in non-US countries. Finally, they also offer a tool to scale the objects of the dataset so that they fit in the grippers.
For our experimentation, we found Egad! to be a great object dataset, as we could test the real robotic system with the objects, and we also had a guide that defined the graspability of the objects. Using these graspability and complexity metrics, we selected a subset of 10 objects, trying to cover the range of complexity and graspability as well as possible. Our subset is formed by objects from their training and test sets. From the training set, we chose A00, L01, N07, G12, J22, R24 and V05, and from the test set we took A6, D3 and G6. Note that in order to print them, we had to create a hollow version of each object by opening holes of ⌀10 mm (see Figure 7). The wall thickness is set to 4 mm. The printing material is Nylon 12, which offers a robust surface and enough strength not to be deformed by the grasps.
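As an illustration of the naming scheme described above, the following minimal sketch splits an Egad! object name into its grasp-difficulty letter and shape-complexity number. The function name `parse_egad_name` and the regular expression are our own assumptions for illustration; they are not part of the Egad! tooling.

```python
import re

# Subset used in the experiments (names copied from the text).
TRAIN_OBJECTS = ["A00", "L01", "N07", "G12", "J22", "R24", "V05"]
TEST_OBJECTS = ["A6", "D3", "G6"]

# Assumed pattern: one capital letter followed by one or more digits.
_NAME_PATTERN = re.compile(r"^([A-Z])(\d+)$")


def parse_egad_name(name):
    """Return (grasp-difficulty letter, shape-complexity number) for a name."""
    match = _NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected Egad! object name: {name!r}")
    return match.group(1), int(match.group(2))


if __name__ == "__main__":
    for obj in TRAIN_OBJECTS + TEST_OBJECTS:
        letter, number = parse_egad_name(obj)
        print(f"{obj}: difficulty code {letter}, complexity code {number}")
```

Parsing the names this way makes it straightforward to check how well a chosen subset spans both coding axes.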

4.3. Modeling the Experiments

It is important to note that the grasp estimation has been conducted using a Franka Panda Hand. This is mainly because, as of the date of this work, MGG does not include the Robotiq 2f85 gripper in its system. For that reason, we carried out the experiments using the Franka Panda Hand, which is available in Isaac Sim, GraspIt! and MuJoCo.
To carry out the experiments, we launched both our system and MGG, obtained the grasping points for the same Egad! objects using the hollowed STLs, and randomly sampled 10 grasps for each object. We modified the output so that it could be used as input to our application that carries out the picking operations, and analyzed the grasps both in terms of grasp accuracy on the real robot and mathematically, to determine qualities such as centering and grasp sparsity. To conduct this analysis, the following experiments are proposed:
  • Validation on the real system: The most important part is to evaluate the quality of the grasps based on the capability of the robot to pick the objects from that point. To validate the grasps we used the Robotiq 2f85 two-finger gripper, as no Franka Panda Hand was available. Our system did not require any kind of adjustment for this validation, as we estimated the grasps with respect to the gripper’s closed fingertips. To validate MGG correctly, however, we took into account the difference between the dimensions of the Robotiq 2f85 and the Franka Panda Hand and modified the grasping points accordingly. This transformation is required because the validation application expects the grasping point with respect to the tip of the fingers, while MGG returns the pose with respect to the gripper base. The grasping points have not been further modified. Then, we programmed a script on the UR10 that, given an object, finds that object in the scene and picks it from the defined grasping point. As validating every proposed grasping point for each object is not viable, we randomly sampled 10 grasps for each object and carried out this validation for each of them. Notably, for MGG we only sampled from the grasps that were held for 3 s or more, that is, their theoretically best grasps. This adds up to 200 grasps in total, a hundred from each approach. As both systems estimated the grasping poses for isolated objects, if the robot could not find a collision-free trajectory to the object because it would collide with the table, we helped the robot by elevating the object and thus isolating it from the environment. The experiment itself is straightforward: the robot goes to the defined grasping pose and lifts the object, holding it in place for 3 s. We then evaluate whether the piece has been correctly picked, marking as successful only the grasps in which the object stays in the gripper during the whole 3 s period.
  • Comparison with manually defined grasps: We also compare the grasps obtained from both methods with a ground truth in the form of manually defined grasps, to quantify how far the systems’ outputs are from it. We argue that grasps defined manually by operators tend to be near-optimal for each object, as they are validated on the spot. In addition, the manually defined grasps cover the entire graspable area of the object. Thus, comparing the obtained grasps with the ground truth also indicates the sparsity of the grasps proposed by each system.
  • Grasp centering: We measure whether the grasps are actually centered on the piece (see Section 5.3), as in some applications it is critical that the grasp is centered on the object, so as not to hit or move objects next to the one being picked.
  • Time per grasp: We analyze the system in order to measure how much time is needed to obtain a valid grasp.
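The sampling and scoring steps of the real-robot validation can be sketched as follows. This is a minimal illustration under stated assumptions: the `sim_hold_s` field, the function names and the data layout are hypothetical, and only the constants (10 grasps per object, a 3 s hold) come from the text.

```python
import random

HOLD_TIME_S = 3.0       # hold duration used as the success criterion
GRASPS_PER_OBJECT = 10  # grasps sampled per object


def sample_grasps(candidates, k=GRASPS_PER_OBJECT, min_hold_s=None, seed=None):
    """Randomly pick k grasps from a list of candidate dicts.

    If min_hold_s is given, only candidates whose (hypothetical)
    'sim_hold_s' field reaches that value are eligible, mirroring the
    filter applied to the MGG grasps.
    """
    rng = random.Random(seed)
    pool = candidates
    if min_hold_s is not None:
        pool = [g for g in candidates if g["sim_hold_s"] >= min_hold_s]
    if len(pool) < k:
        raise ValueError("not enough candidates to sample from")
    return rng.sample(pool, k)


def success_rate(held_seconds, required_s=HOLD_TIME_S):
    """Fraction of real-robot trials held for the full required time."""
    if not held_seconds:
        raise ValueError("no trials recorded")
    successes = sum(1 for t in held_seconds if t >= required_s)
    return successes / len(held_seconds)
```

For example, `sample_grasps(mgg_candidates, min_hold_s=3.0)` would draw from the filtered pool, while calling it without `min_hold_s` draws from all candidates.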

4.4. Hardware

The experiments have been performed on two different computers. Computer one has 16 GB of RAM, an Intel Core i7-10700 CPU 2.90 GHz × 16, no dedicated GPU and Ubuntu 22.04. This PC has been used to run our whole system and to run MGG’s sampling part (the GraspIt! docker).
MGG’s initial candidate validation using IsaacSim has been carried out on a computer with 64 GB of RAM, an Intel Core i7-12700KF × 20, and an Nvidia RTX 4070 12 GB with Ubuntu 22.04.
The validation in the real system has been executed using a UR10 arm with a Robotiq 2f85 parallel jaw gripper, combined with a stationary Photoneo XL camera set above the workspace to carry out the object pose estimation. The setup with the aforementioned elements can be seen in Figure 8.

5. Results

This section is divided into three main parts. Section 5.1 gathers the results obtained during the validation in the real robot, while Section 5.2 contains the comparison with the ground truth grasps. Section 5.3 analyzes the centering of the grasps.

5.1. Real Robot Validation

During the validation process, we observed that due to the off-centering of certain grasps, some pieces moved notably when the pads were closing, thus affecting the grasping points themselves. Hence, we labeled those grasps as failed grasps, as the object was not picked where the system intended. Figure 9 shows a graph containing the grasps that were successful, in the sense that the piece was lifted from the table and did not fall during the grasp. It can be seen that both systems performed the same regarding accuracy and that both had quite consistent grasp accuracy for all objects, excluding G6 in our system, which performed especially poorly and lowered the overall mean. The validation process of that object showed that one of the best ways to pick the object was from the outer perimeter, a recurrent grasp in MGG, thus achieving a higher score. Furthermore, our system also performs worse in A6, although we achieve better results in J22 and L01, but the difference is not as notable as in G6.
Most notably, during the validation process, we also observed that MGG returned two grasps in which the gripper collided with the objects themselves (A6 and V05). These are the only two such instances in the sampled population.
Finally, regarding the object positioning, we wanted to evaluate the accuracy of the systems when estimating grasps, and both systems calculate the grasps with the object in the air. Therefore, to avoid potential collisions of the robot with the table, we placed the pieces on a small platform. This accuracy does not portray the ability of the systems to generate grasps that are fit to take the object from the table; it measures the quality of the grasps themselves instead. This is also why there may seem to be no correlation between the theoretical complexity of each piece and the grasp accuracy of the systems. The analysis of the complexity of the pieces should be carried out when trying to grasp the object from a table without any help.

5.2. Similarity with Real World Grasps

We manually defined a set of grasps that cover the graspable area of the object for a random subset of objects, to estimate whether the grasps returned by each system are similar to the poses an expert would define, that is, high-quality grasps that are scattered across the object. Then, for each of these manually defined grasps, we searched among the successful candidates returned by the system (i.e., the candidates that resulted in a successful grasp in the real robot experiment) for the most similar grasp, and stored the similarity between that grasp and the manually defined grasp. We then calculated the mean similarity for each object, obtaining the results shown in Figure 10.
Overall, the results are slightly better for our system. More precisely, the number of grasps correctly estimated by each system plays a crucial role in this statistic, as a higher number of correct grasps increases the probability of having a grasp similar to a manually proposed one. Taking that into account, our system achieves an overall mean similarity of 0.808807, while MGG achieves a mean of 0.738469. This suggests that our system returns grasps that are scattered across the object and that are similar to the manually defined ones.
The similarity is calculated using the following formula:
$$S = W_t \left(1 - \frac{t_d}{max\_euc\_dist}\right) + W_r \left(1 - r_d\right)$$
where
  • $S$: The similarity between two grasps.
  • $W_t$: The weight given to the displacement or translation similarity. In this case, it is 0.7.
  • $W_r$: The weight assigned to the rotational similarity. In this case, it is set to 0.3.
  • $t_d$: The translation distance between two poses, calculated using the Euclidean distance.
  • $max\_euc\_dist$: The maximum Euclidean distance two poses can have. In this case, it is 8 cm (the maximum size an object can have).
  • $r_d$: The rotational distance between two poses. It compares two quaternions (the comparison between two quaternions is described at https://math.stackexchange.com/questions/90081/quaternion-distance/90098#90098 (accessed on 24 December 2024)), taking a value between 0 and 1, where 0 corresponds to the same quaternion and 1 to the opposite one.
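The similarity computation and the best-match averaging described in this section can be sketched as below. The weights and the 8 cm bound come from the text; the rotational distance $r_d = 1 - \langle q_1, q_2 \rangle^2$ is an assumption on our part (one quaternion metric that stays in the 0 to 1 range), as is clamping $t_d$ to the maximum distance. All function names are illustrative.

```python
import math

W_T = 0.7            # translation weight (from the text)
W_R = 0.3            # rotation weight (from the text)
MAX_EUC_DIST = 0.08  # 8 cm, expressed in metres


def rotation_distance(q1, q2):
    """Assumed metric for unit quaternions: 1 - <q1, q2>^2.

    It is 0 for equal rotations and 1 for rotations 180 degrees apart.
    """
    dot = sum(a * b for a, b in zip(q1, q2))
    return 1.0 - dot * dot


def similarity(pose_a, pose_b):
    """S = W_t * (1 - t_d / max_euc_dist) + W_r * (1 - r_d).

    Each pose is a pair (position in metres, unit quaternion).
    """
    (p_a, q_a), (p_b, q_b) = pose_a, pose_b
    # Clamping t_d is an assumption that keeps S non-negative.
    t_d = min(math.dist(p_a, p_b), MAX_EUC_DIST)
    r_d = rotation_distance(q_a, q_b)
    return W_T * (1.0 - t_d / MAX_EUC_DIST) + W_R * (1.0 - r_d)


def mean_best_similarity(manual_grasps, successful_grasps):
    """Average, over manual grasps, of the best similarity to any successful grasp."""
    best = [max(similarity(m, g) for g in successful_grasps)
            for m in manual_grasps]
    return sum(best) / len(best)
```

With these weights, two grasps with identical orientation that are 4 cm apart score 0.65, and two grasps at the same position whose orientations differ by 180 degrees score 0.7.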

5.3. Grasp Centering

The estimation of the centering of the pieces can be used to measure the quality of the grasp, as it is undesirable for a grasp to drag the piece. This dragging can not only damage the piece but also modify the original grasping point. However, it must be taken into account that a centered grasp does not always translate into a successful grasp. Notably, it is also harder to use off-centered grasps to plan trajectories, as the gripper may collide with other nearby objects when trying to perform the grasp. Thus, we argue that, in most cases, a centered grasp is preferred to an off-centered grasp, even more so if the off-centering is due to miscalculations.
To measure the centering of a grasp, we created a bounding box mimicking the trajectory the Panda’s pads would follow, and calculated, for each side of the bounding box, the distance to the nearest point of the volume of the piece within this bounding box, as seen in Figure 11. Then, we calculated the absolute value of the difference between these two distances. This estimates the absolute value of the displacement of the piece: the closer the value is to 0, the more centered the piece is between the pads. We carried out this process for each object, using all the grasps employed to validate the object with the real robot; the results can be observed in Figure 12. From the mean off-centering of each object, we then calculated the overall mean to quantify how much off-centering both systems produce.
Our system produced a mean off-centering of 4.402 mm, while MGG produced a mean off-centering of 4.574 mm. Moreover, the grasp centering is very similar for each object, excluding some outliers such as J22, where MGG performed worse, or G6 and L01, where our system performed badly. To shed some light on the off-centering, Figure 13 compares an off-centered grasp with a centered grasp.
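The off-centering measure described above can be approximated with the following sketch. It assumes the object is available as sampled surface points expressed in a gripper-aligned frame whose x axis is the closing direction of the pads; the frame convention, the point-based approximation of the volume and the function name are assumptions for illustration, not the implementation used in the experiments.

```python
def off_centering_mm(points_mm, pad_left_mm, pad_right_mm,
                     sweep_min_mm, sweep_max_mm):
    """Return |gap_left - gap_right| along the closing axis, in millimetres.

    points_mm: iterable of (x, y, z) object surface samples, where x is
        assumed to be the closing axis of the pads.
    pad_left_mm, pad_right_mm: x positions of the two sides of the box.
    sweep_min_mm, sweep_max_mm: (y, z) corners of the pad cross-section.
    """
    (y_lo, z_lo), (y_hi, z_hi) = sweep_min_mm, sweep_max_mm
    # Keep only the part of the object inside the volume swept by the pads.
    xs = [x for x, y, z in points_mm
          if pad_left_mm <= x <= pad_right_mm
          and y_lo <= y <= y_hi and z_lo <= z <= z_hi]
    if not xs:
        raise ValueError("no object points inside the pad sweep volume")
    gap_left = min(xs) - pad_left_mm    # left side to nearest object point
    gap_right = pad_right_mm - max(xs)  # right side to nearest object point
    return abs(gap_left - gap_right)
```

A value of 0 means both pads would touch the object at the same time, while a larger value means one pad reaches the object first and may drag it.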

Time per Grasp

The two systems estimate the grasps with different design and implementation approaches. MGG starts the estimation process for all objects in parallel (GraspIt! conducts the estimation at the same time for all objects, while their IsaacSim algorithm validates the objects sequentially using multi-stations), whereas our approach processes the objects sequentially. Furthermore, MGG requires setting a minimum number of GraspIt! “plans” for the system to reach (and it will evaluate plans until that number is reached), while our system only requires setting an initial number of samples and orientations to estimate.
Taking that into account, we calculated the time required to estimate a valid grasp. In the case of MGG, we took the time spent from the start of the grasp sampling to the end of the validation in Isaac Sim and divided it by the number of grasps that achieved a 3 s grasp in their simulation. For PBSS-GPD, we divided the time from the start of the sampling process to the end of the MuJoCo simulation validation by the number of grasps that the simulation considered fit, i.e., the grasps that surpassed the quality threshold.
Table 1 gathers the number of grasps of each system as well as the time employed to obtain those grasps. Checking the time per grasp, we can see that our system obtains the grasps nearly 18 times faster (17.7350), most probably due to the use of GPD instead of GraspIt!.
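The time-per-grasp metric and the resulting speed-up reduce to two divisions, sketched below with illustrative function names. As a rough check, plugging in the rounded per-grasp times reported in Section 6 (2.637 s and 0.149 s) gives a ratio of about 17.7; the small difference from the 17.7350 quoted above presumably comes from rounding of the timings, which is an assumption on our part.

```python
def time_per_valid_grasp(total_seconds, valid_grasps):
    """Wall-clock time from sampling start to validation end, per valid grasp."""
    if valid_grasps <= 0:
        raise ValueError("at least one valid grasp is required")
    return total_seconds / valid_grasps


def speedup(baseline_s_per_grasp, candidate_s_per_grasp):
    """How many times faster the candidate system yields a valid grasp."""
    if candidate_s_per_grasp <= 0:
        raise ValueError("time per grasp must be positive")
    return baseline_s_per_grasp / candidate_s_per_grasp


if __name__ == "__main__":
    # Rounded per-grasp times from Section 6 (MGG, then PBSS-GPD).
    print(f"speed-up: {speedup(2.637, 0.149):.2f}x")
```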

6. Conclusions and Further Work

This work presents PBSS-GPD, a fully automatic system that, given a CAD model, generates grasping poses for the object. The system has not only proven its fitness in terms of transferability to the real world, but it has also been compared to a contemporary approach in terms of accuracy, grasp centering, similarity with real-world grasps and time per grasp, showing results ranging from similar to slightly better in the first three and an important advancement in the last. Moreover, the system runs completely on the CPU. A video of the whole system can be seen at the following link: https://youtu.be/BxEZOLP8hso.
For the sake of clarity, we summarize the key elements compared between both systems:
  • Accuracy: Both systems excelled in terms of accuracy, achieving the same number of correct real-world grasps. Of the 100 samples obtained from each system, 84 were correct in both cases, giving an accuracy of 84%.
  • Time per grasp: The time required for each system to obtain a valid grasp has been measured. While MGG needs 2.637 s per grasp, our system obtains a valid grasp in 0.149 s, approximately 18 times faster. This directly translates into more data generated in a given amount of time, a desirable feature for training learning-based algorithms, as the amount of data they need is vast.
  • Grasp centering: Both systems estimate fairly centered grasp poses, an aspect critical to grasping objects without moving them in the workspace. Our approach achieved a mean off-centering of 4.402 mm, while MGG generated grasps with a mean off-centering of 4.574 mm. Thus, our method does not show a significant improvement in this aspect compared to MGG.
  • Similarity with the ground truth: The grasps generated by both systems show good similarity with the ground truth for the random subset of objects. In terms of strict similarity values, our system achieves a score of 0.8088 while MGG achieves a score of 0.7385. Thus, we can infer that our system generates grasps more akin to the ones defined in the real world, and hence covers the object better.
  • System requirements: Our system is considerably less restrictive in terms of minimum system requirements, mostly because it runs completely on the CPU.
  • Usability: MGG isolates the grasp candidate sampler from the simulation-based grasp validator. This makes the whole validation pipeline sub-optimal from the point of view of usability [34], as the user must collect, modify and export the data from the sampler in the docker to the validator outside of the docker. In our system, communication via ROS2 services has proven useful for connecting the two processes, even though our sampler is embedded in a docker.
  • Gripper support and scalability: MGG provides more grippers out of the box, as well as support for dexterous hand grasp pose generation. Both systems offer the ability to scale the simulation by manually adding new grippers and objects. Although the grippers must be manually modeled for both systems, our system scales better in terms of objects, as the only thing necessary is the OBJ file. Our approach generates the necessary subdivision, physical properties and MuJoCo files automatically, while MGG requires the user to model the objects in Isaac Sim.
  • Applicability: Our system obtains the grasping poses for any given object, and thus it can be applied to a broad spectrum of scenarios, from industrial scenarios to human–robot collaboration scenarios. Non-expert users can easily use the system, as it only requires selecting the piece; this directly outputs an XML file containing the grasping poses. Moreover, the usability can be further enhanced by developing a user interface to simplify the piece selection process.
Regarding the next steps, the work presented in this manuscript could be improved by exploring different approaches to enhance the initial GPD candidate quality using the information provided by the CAD model. Regarding the simulation environment, we plan to carry out a parameter optimization process using evolutionary algorithms with the objective of reducing the “sim2real” gap, as the literature [35] has shown the viability of this approach. In addition, we will explore parameter optimization techniques to improve the accuracy of the score calculation metric. In terms of grasp quality, the experimentation has shown penetration between the fingertips and the geometries in the simulation. Work remains to solve this issue, as doing so could improve the quality of the simulation and result in better grasps. The grasps have also shown variability in terms of centering. A deeper analysis of the off-centering of the grasps and of its causes is still pending, as reducing it might increase the accuracy of the system. Moreover, we plan to extend our system to a greater number of grippers and tools, covering two-finger parallel grippers, three-finger parallel grippers, suction tools and magnetic tools; to extend the validation to a greater number of objects, focusing on industrial use cases; and to include and optimize the MuJoCo parameters for a greater number of object materials. Finally, as our system estimates the grasping points for isolated objects, to address grasping in cluttered environments we plan to develop a data-driven reinforcement learning approach using grasping data obtained from our system. Data-driven approaches have proven to be an effective way of enhancing the flexibility of systems [36].

Author Contributions

Conceptualization, A.I., I.M. and A.A.; methodology, A.I., J.A.R., I.M. and A.A.; software, A.I. and J.A.R.; validation, A.I. and J.A.R.; formal analysis, J.A.R.; investigation, A.I. and J.A.R.; resources, A.I., I.M. and E.L.; data curation, A.I., J.A.R. and E.L.; writing—original draft preparation, J.A.R.; writing—review and editing, A.I. and E.L.; visualization, A.I., J.A.R. and E.L.; supervision, I.M., A.A., A.I. and E.L.; project administration, I.M.; funding acquisition, A.I., I.M. and A.A. All authors have read and agreed to the published version of the manuscript.

Funding

This work has been partially funded by HARTU project that has received funding from the European Union’s research and innovation programme Horizon Europe under the grant agreement No. 101092100, the project ADAPTA under programme Transmisiones 2023, funded by “State Research Agency (AEI) of Spain”, contract number PLEC2023-010218, and HELDU project funded by the “Basque Government—Department of Economic Development, Sustainability and Environment”—ELKARTEK 2023 Program (KK-2023/00055).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Grasping pose availability as well as any other data inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

GPD: Grasp Pose Detection.
PBSS-GPD: Physics-Based Self-Supervised Grasp Pose Detection.
CAD: Computer-Aided Design.
AI: Artificial Intelligence.
CNN: Convolutional Neural Network.
CG-CNN: Cable Grasping Convolutional Neural Network.
DoF: Degrees of Freedom.
PAM: Partitioning Around Medoids.
HST: Headless Single-Thread.
FTR: Faster Than Real.
HMT: Headless Multi-Thread.
VRAM: Video Random Access Memory.
RAM: Random Access Memory.
CPU: Central Processing Unit.
AEI: Agencia Estatal de Investigación (State Research Agency).

References

  1. Kleeberger, K.; Bormann, R.; Kraus, W.; Huber, M.F. A survey on learning-based robotic grasping. Curr. Robot. Rep. 2020, 1, 239–249.
  2. Xie, Z.; Liang, X.; Roberto, C. Learning-based robotic grasping: A review. Front. Robot. AI 2023, 10, 1038658.
  3. Todorov, E.; Erez, T.; Tassa, Y. MuJoCo: A physics engine for model-based control. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vilamoura-Algarve, Portugal, 7–12 October 2012; pp. 5026–5033.
  4. Mittal, M.; Yu, C.; Yu, Q.; Liu, J.; Rudin, N.; Hoeller, D.; Yuan, J.L.; Singh, R.; Guo, Y.; Mazhar, H.; et al. Orbit: A Unified Simulation Framework for Interactive Robot Learning Environments. IEEE Robot. Autom. Lett. 2023, 8, 3740–3747.
  5. Robotics, U. Unity Robotics Hub. 2022. Available online: https://github.com/Unity-Technologies/Unity-Robotics-Hub (accessed on 24 December 2024).
  6. Coumans, E.; Bai, Y. PyBullet, a Python Module for Physics Simulation for Games, Robotics and Machine Learning. 2016–2021. Available online: http://pybullet.org (accessed on 24 December 2024).
  7. Zhang, L.; Bai, K.; Li, Q.; Chen, Z.; Zhang, J. A Collision-Aware Cable Grasping Method in Cluttered Environment. In Proceedings of the 2024 IEEE International Conference on Robotics and Automation (ICRA), Yokohama, Japan, 13–17 May 2024; pp. 2126–2132.
  8. Ni, P.; Zhang, W.; Zhu, X.; Cao, Q. Pointnet++ grasping: Learning an end-to-end spatial grasp generation algorithm from sparse point clouds. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3619–3625.
  9. Tian, H.; Song, K.; Li, S.; Ma, S.; Xu, J.; Yan, Y. Data-driven robotic visual grasping detection for unknown objects: A problem-oriented review. Expert Syst. Appl. 2023, 211, 118624.
  10. Zhai, D.H.; Yu, S.; Xia, Y. FANet: Fast and accurate robotic grasp detection based on keypoints. IEEE Trans. Autom. Sci. Eng. 2023, 21, 2974–2986.
  11. Lenz, I.; Lee, H.; Saxena, A. Deep learning for detecting robotic grasps. Int. J. Robot. Res. 2015, 34, 705–724.
  12. Depierre, A.; Dellandréa, E.; Chen, L. Jacquard: A large scale dataset for robotic grasp detection. In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Madrid, Spain, 1–5 October 2018; pp. 3511–3516.
  13. Zheng, L.; Ma, W.; Cai, Y.; Lu, T.; Wang, S. GPDAN: Grasp pose domain adaptation network for sim-to-real 6-DoF object grasping. IEEE Robot. Autom. Lett. 2023, 8, 4585–4592.
  14. Fang, H.S.; Wang, C.; Gou, M.; Lu, C. Graspnet-1billion: A large-scale benchmark for general object grasping. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 11444–11453.
  15. Eppner, C.; Mousavian, A.; Fox, D. ACRONYM: A Large-Scale Grasp Dataset Based on Simulation. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 6222–6227.
  16. Deng, X.; Xiang, Y.; Mousavian, A.; Eppner, C.; Bretl, T.; Fox, D. Self-supervised 6d object pose estimation for robot manipulation. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 3665–3671.
  17. Eppner, C.; Mousavian, A.; Fox, D. A billion ways to grasp: An evaluation of grasp sampling schemes on a dense, physics-based grasp data set. In Proceedings of the International Symposium of Robotics Research; Springer: Hanoi, Vietnam, 2019; pp. 890–905.
  18. Kleeberger, K.; Völk, M.; Moosmann, M.; Thiessenhusen, E.; Roth, F.; Bormann, R.; Huber, M.F. Transferring experience from simulation to the real world for precise pick-and-place tasks in highly cluttered scenes. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 25–29 October 2020; pp. 9681–9688.
  19. Rohmer, E.; Singh, S.P.; Freese, M. V-REP: A versatile and scalable robot simulation framework. In Proceedings of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, Tokyo, Japan, 3–7 November 2013; pp. 1321–1326.
  20. Bauza, M.; Bronars, A.; Hou, Y.; Taylor, I.; Chavan-Dafle, N.; Rodriguez, A. SimPLE, a visuotactile method learned in simulation to precisely pick, localize, regrasp, and place objects. Sci. Robot. 2024, 9, eadi8808.
  21. Casas, L.F.; Khargonkar, N.; Prabhakaran, B.; Xiang, Y. MultiGripperGrasp: A Dataset for Robotic Grasping from Parallel Jaw Grippers to Dexterous Hands. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Abu Dhabi, United Arab Emirates, 14–18 October 2024.
  22. Savva, M.; Chang, A.X.; Hanrahan, P. Semantically-enriched 3D models for common-sense knowledge. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Boston, MA, USA, 7–12 June 2015; pp. 24–31.
  23. Macklin, M.; Müller, M.; Chentanez, N.; Kim, T.Y. Unified particle physics for real-time applications. ACM Trans. Graph. (TOG) 2014, 33, 1–12.
  24. Downs, L.; Francis, A.; Koenig, N.; Kinman, B.; Hickman, R.; Reymann, K.; McHugh, T.B.; Vanhoucke, V. Google scanned objects: A high-quality dataset of 3d scanned household items. In Proceedings of the 2022 International Conference on Robotics and Automation (ICRA), Philadelphia, PA, USA, 23–27 May 2022; pp. 2553–2560.
  25. Calli, B.; Singh, A.; Bruce, J.; Walsman, A.; Konolige, K.; Srinivasa, S.; Abbeel, P.; Dollar, A.M. Yale-CMU-Berkeley dataset for robotic manipulation research. Int. J. Robot. Res. 2017, 36, 261–268.
  26. Miller, A.T.; Allen, P.K. Graspit! a versatile simulator for robotic grasping. IEEE Robot. Autom. Mag. 2004, 11, 110–122.
  27. NVidia Isaac Sim. Available online: https://developer.nvidia.com/isaac/sim (accessed on 24 December 2024).
  28. Ten Pas, A.; Gualtieri, M.; Saenko, K.; Platt, R. Grasp pose detection in point clouds. Int. J. Robot. Res. 2017, 36, 1455–1473.
  29. V-HACD. Available online: https://github.com/kmammou/v-hacd (accessed on 24 December 2024).
  30. Wei, X.; Liu, M.; Ling, Z.; Su, H. Approximate convex decomposition for 3d meshes with collision-aware concavity and tree search. ACM Trans. Graph. (TOG) 2022, 41, 1–18.
  31. MuJoCo Menagerie. Available online: https://github.com/google-deepmind/mujoco_menagerie (accessed on 24 December 2024).
  32. Morrison, D.; Corke, P.; Leitner, J. Egad! an evolved grasping analysis dataset for diversity and reproducibility in robotic manipulation. IEEE Robot. Autom. Lett. 2020, 5, 4368–4375.
  33. Mahler, J.; Liang, J.; Niyaz, S.; Laskey, M.; Doan, R.; Liu, X.; Ojea, J.A.; Goldberg, K. Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics. arXiv 2017, arXiv:1703.09312.
  34. Nielsen, J. Usability Engineering; Morgan Kaufmann: Burlington, MA, USA, 1994.
  35. Collins, J.; Brown, R.; Leitner, J.; Howard, D. Traversing the Reality Gap via Simulator Tuning. In Proceedings of the Australasian Conference on Robotics and Automation (ACRA 2021), Online, 6–8 December 2021; Australian Robotics and Automation Association (ARAA): Brisbane, Australia, 2021; pp. 1–10.
  36. Zhao, J.; Wang, Z.; Lv, Y.; Na, J.; Liu, C.; Zhao, Z. Data-Driven Learning for H∞ Control of Adaptive Cruise Control Systems. IEEE Trans. Veh. Technol. 2024, 73, 18348–18362.
Figure 1. Diagram showcasing the system architecture. The execution order is from left to right. The system takes as an entry point the CAD of the piece, as well as the gripper’s characteristics. Then, it obtains the initial grasp candidates and builds the MuJoCo scene, formed by the gripper and object. Finally, given the list of candidates and the scene, it iterates over each grasp pose, evaluating them. The output is the list of good grasps, in an XML.
Figure 2. Example of the output of GPD. Here, the pointcloud of the piece as well as the representation of the gripper for each candidate is presented. GPD evaluates the geometrical viability of the grasps from a variety of orientations. (a) Output of GPD from one perspective. (b) Output of GPD from other perspective.
Figure 3. Convex hull vs. convex decomposition using CoACD. Sub-figure (a) shows that using the convex hull is not a viable way to estimate realistic contacts. On the contrary, decomposing the geometry using CoACD as seen in Sub-figure (b) allows the system to perform more realistic contacts. (a) Convex hull of the original non-convex geometry. (b) Convex hull of the decomposed object using CoACD.
Figure 4. The models obtained from the MuJoCo menagerie and modified to integrate them into our validation process. The orientation of the gripper is shown as the Z axis going from the fingertips to the base, and the X axis parallel to the horizontal plane of the gripper. (a) Robotiq 2f85 two finger gripper. (b) Franka Panda Hand parallel jaw gripper.
Figure 5. Two examples of Egad! pieces being grasped. Each picture shows the gripper, the piece and overlaid information: the candidate ID, the score decomposition and a color code that helps debug whether the grasp was correct. (a) Egad! object D3 being correctly picked from grasping candidate number 13. (b) Egad! object A6 being picked from grasping candidate number 2.
Figure 6. This figure showcases three instances of the same grasp attempt, candidate 1. The leftmost image, sub-figure (a) shows the piece at the starting moments of the grasp. The image in the center, sub-figure (b), shows the same attempt some moments later, where it can be seen how the piece rotates. Sub-figure (c) shows the last moments of the grasp attempt, where the piece has further rotated. This is directly taken into account by the score, where the rotation score in this case is 0. (a) Egad! object D3 being grasped. (b) Egad! object D3 rotating in the grasp. (c) Egad! object D3 has rotated.
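One way to turn the rotation seen in this sequence into a score is to measure the angle between the object's orientation at the start and at the end of the attempt. The sketch below is a hypothetical formulation, not the metric used by the authors: it assumes unit quaternions in (w, x, y, z) order already expressed in the gripper frame, and a made-up tolerance `max_angle` beyond which the score drops to 0.

```python
import math
from typing import Sequence


def quaternion_angle(q_start: Sequence[float], q_end: Sequence[float]) -> float:
    """Smallest rotation angle (radians) between two unit quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_start, q_end)))
    # Clamp to guard against rounding slightly above 1.0.
    return 2.0 * math.acos(min(1.0, dot))


def rotation_score(
    q_start: Sequence[float],
    q_end: Sequence[float],
    max_angle: float = math.radians(45.0),  # hypothetical tolerance
) -> float:
    """1.0 when the object did not rotate, falling linearly to 0.0 at max_angle."""
    angle = quaternion_angle(q_start, q_end)
    return max(0.0, 1.0 - angle / max_angle)


identity = (1.0, 0.0, 0.0, 0.0)
# 90 degree rotation about the Z axis.
quarter_turn_z = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4))

steady = rotation_score(identity, identity)
rotated = rotation_score(identity, quarter_turn_z)
```

With these assumptions, a piece that keeps its orientation scores 1, while a piece that turns by 90 degrees in the grasp, as in sub-figure (c), scores 0.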
Figure 7. Egad! objects chosen in our subset, presented in no particular order in sub-figure (a). The subset covers the graspability and geometric complexity spectrum. Sub-figure (b) shows the ⌀10 holes made in the objects in order to hollow them. The walls are 4 mm thick. (a) The chosen Egad! subset, in no particular order. (b) N07 with hole visible.
Figure 8. Setup of the real robot system, featuring a UR10, a Robotiq 2f85 gripper and a Photoneo XL used to locate the pieces.
Figure 9. Accuracy of the methods. The graph reports the percentage of successful grasps, not counting those in which the piece moved substantially during the grasp. Both approaches achieve 84% accuracy.
Figure 10. Object-wise mean grasp similarity, which quantifies how similar the grasps estimated by each system are to a manually defined ground truth, under the assumption that the manually defined grasps have the highest quality. For every object in the subset, our system produces grasps more similar to the manually defined ones than those provided by MGG.
Figure 11. Approach used to measure the centering of the grasp. The nearest point of the volume of the piece within the green bounding box to the red bubbles is calculated. There is one red bubble on each side of the bounding box (marked with the purple arrow). Thus, we can estimate if the grasp is off-centered, meaning that a side would collide with the pad earlier than the other.
Figure 12. Box plots showing the mean, median and deviation statistics about the off-centering of each piece for each system. The mean off-centering is calculated by taking the off-centering of each grasp for that piece and calculating the mean. The off-centering is the absolute value obtained from the difference between the distance from the left pad to the nearest point of the object, and the distance from the right pad to the nearest point of the object.
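The off-centering definition in this caption reduces to a one-line formula. The following sketch implements it as stated; the function names and the sample distances (in metres) are illustrative assumptions, not values from the experiments.

```python
from statistics import mean
from typing import Iterable, Tuple


def off_centering(d_left: float, d_right: float) -> float:
    """Absolute difference between the pad-to-object distances of both pads."""
    return abs(d_left - d_right)


def mean_off_centering(pad_distances: Iterable[Tuple[float, float]]) -> float:
    """Mean off-centering over all grasps of one piece.

    Each item is (distance from left pad, distance from right pad) to the
    nearest point of the object for one grasp.
    """
    return mean(off_centering(left, right) for left, right in pad_distances)


# Made-up distances for three grasps of the same piece.
sample_grasps = [(0.010, 0.010), (0.002, 0.018), (0.012, 0.008)]
piece_mean = mean_off_centering(sample_grasps)
```

A perfectly centered grasp yields 0, and the value grows as the object sits closer to one pad than to the other.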
Figure 13. A direct comparison between a centered grasp in (a) vs. an off-centered grasp in (b). The closer the object is to one of the pads, the higher the off-centering score will be. (a) A centered grasp for D3. (b) An off-centered grasp for D3.
Table 1. Number of fit grasps, total time used to obtain them and time per grasp for each system. Our system performs almost 18 times faster.
System | Number of Grasps | Time (s) | Time per Grasp (s/grasp)
MGG | 8712 | 22,970 | 2.6372
Ours | 10,180 | 1514 | 0.1487
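The speed-up quoted in the caption can be checked directly from the grasp counts and total times in the table. The short sketch below redoes that arithmetic; the variable and function names are illustrative, and recomputed values may differ in the last digits from the rounded figures reported in the table.

```python
def time_per_grasp(total_seconds: float, n_grasps: int) -> float:
    """Average time needed to obtain one fit grasp."""
    return total_seconds / n_grasps


mgg_per_grasp = time_per_grasp(22970, 8712)    # MGG row of Table 1
ours_per_grasp = time_per_grasp(1514, 10180)   # "Ours" row of Table 1
speedup = mgg_per_grasp / ours_per_grasp

print(f"MGG:  {mgg_per_grasp:.4f} s/grasp")
print(f"Ours: {ours_per_grasp:.4f} s/grasp")
print(f"Speed-up: {speedup:.1f}x")
```

The ratio comes out between 17 and 18, consistent with the "almost 18 times faster" statement.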
Ruiz, J.A.; Iriondo, A.; Lazkano, E.; Ansuategi, A.; Maurtua, I. Physics-Based Self-Supervised Grasp Pose Detection. Machines 2025, 13, 12. https://doi.org/10.3390/machines13010012