A Modern Take on Visual Relationship Reasoning for Grasp Planning

1 Politecnico di Torino
<name>.<surname>@polito.it
2 Istituto Italiano di Tecnologia
<name>.<surname>@iit.it


Interacting with real-world cluttered scenes poses several challenges to robotic agents, which need to understand complex spatial dependencies among the observed objects to determine optimal pick sequences or efficient object retrieval strategies. Existing solutions typically manage simplified scenarios and focus on predicting pairwise object relationships following an initial object detection phase, but often overlook the global context or struggle with handling redundant and missing object relations. In this work, we present a modern take on visual relational reasoning for grasp planning. We introduce D3GD, a novel testbed that includes bin-picking scenes with up to 35 objects from 97 distinct categories. Additionally, we propose D3G, a new end-to-end transformer-based dependency graph generation model that simultaneously detects objects and produces an adjacency matrix representing their spatial relationships. Recognizing the limitations of standard metrics, we employ the Average Precision of Relationships for the first time to evaluate model performance, conducting an extensive experimental benchmark. The obtained results establish our approach as the new state-of-the-art for this task, laying the foundation for future research in robotic manipulation.
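The Average Precision of Relationships ranks predicted relations by confidence and matches them against the ground truth, so that both false positives and missed relations are penalized. The sketch below shows one way such a metric can be computed; the matching rule (both boxes must pass an IoU threshold, with direction preserved) and all function names are our own illustrative assumptions, not the paper's reference implementation.

# Minimal sketch of an Average Precision of Relationships computation.
# The matching rule and all names are illustrative assumptions, not the
# reference implementation used in the paper.

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def relationship_ap(preds, gts, iou_thr=0.5):
    """preds: list of (score, subj_box, obj_box); gts: list of (subj_box, obj_box).
    A prediction matches an unused ground-truth relation when both boxes
    overlap their counterparts with IoU >= iou_thr (direction included)."""
    preds = sorted(preds, key=lambda p: -p[0])       # rank by confidence
    matched = [False] * len(gts)
    tp, precisions = 0, []
    for rank, (_, sb, ob) in enumerate(preds, start=1):
        for j, (gsb, gob) in enumerate(gts):
            if not matched[j] and iou(sb, gsb) >= iou_thr and iou(ob, gob) >= iou_thr:
                matched[j] = True
                tp += 1
                precisions.append(tp / rank)         # precision at each recall step
                break
    return sum(precisions) / max(len(gts), 1)        # area under the P-R curve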

D3GD Testbed


The D3GD testbed introduces a novel difficulty-based rating for the relationship reasoning task. With our testbed we push the task complexity to new levels by providing three difficulty levels and one synth-to-real track. We measure difficulty by the number of objects and by the depth and breadth of the clutter. We base our testbed on the photorealistic MetaGraspNetV2 dataset.
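As a rough illustration of how a difficulty rating can be derived from scene annotations, the sketch below buckets scenes by object count and by the depth (longest support chain) and breadth (most children under a single object) of the dependency graph. All thresholds and names here are hypothetical and chosen only for the example; they are not the testbed's actual criteria.

# Hypothetical sketch: bucketing scenes into difficulty levels from the
# dependency graph. Assumes an acyclic graph; thresholds are illustrative.

from collections import defaultdict

def graph_depth(children, roots):
    """Longest support chain (clutter 'depth') via DFS from the top objects."""
    def dfs(node):
        return 1 + max((dfs(c) for c in children[node]), default=0)
    return max((dfs(r) for r in roots), default=0)

def difficulty(relations, num_objects):
    """relations: list of (top, bottom) pairs, i.e. top rests on bottom."""
    children = defaultdict(list)
    has_parent = set()
    for top, bottom in relations:
        children[top].append(bottom)
        has_parent.add(bottom)
    roots = [o for o in range(num_objects) if o not in has_parent]
    depth = graph_depth(children, roots)                      # clutter depth
    breadth = max((len(v) for v in children.values()), default=0)  # clutter breadth
    # Illustrative thresholds only: real ones would be tuned on the dataset.
    if num_objects <= 10 and depth <= 2:
        return "easy"
    if num_objects <= 20 and depth <= 4 and breadth <= 3:
        return "medium"
    return "hard"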

D3G Model

Model Overview. We leverage the feature extraction power of modern transformer-based detectors and the relational reasoning power of graph transformers to build an end-to-end relationship detection model that outperforms previous methods and produces reliable graphs even in highly cluttered environments. First, we extract object queries with a DETR-like detector; then, by pairwise combination of the queries, we generate the initial edge features. We refine these edge features with our own dense variant of graph transformer layers. Finally, we classify each edge to predict whether a relationship exists and, if so, in which direction. A sketch of this pipeline is given below.
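The PyTorch sketch below outlines this pipeline under our own assumptions: module sizes are arbitrary and a standard transformer encoder stands in for the dense graph transformer layers, so it illustrates the data flow rather than the released implementation.

import torch
import torch.nn as nn

class D3GSketch(nn.Module):
    """Illustrative pipeline: object queries -> pairwise edge features ->
    dense edge refinement -> 3-way edge classification (none / i->j / j->i)."""

    def __init__(self, d_model=256):
        super().__init__()
        self.edge_proj = nn.Linear(2 * d_model, d_model)    # pairwise fusion
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.edge_refiner = nn.TransformerEncoder(layer, num_layers=3)  # stand-in
        self.edge_cls = nn.Linear(d_model, 3)               # no rel / i->j / j->i

    def forward(self, queries):
        # queries: (B, N, D) object queries from a DETR-like detector
        B, N, D = queries.shape
        # Pairwise combination: concatenate every ordered (i, j) query pair
        qi = queries.unsqueeze(2).expand(B, N, N, D)
        qj = queries.unsqueeze(1).expand(B, N, N, D)
        edges = self.edge_proj(torch.cat([qi, qj], dim=-1))  # (B, N, N, D)
        # Dense refinement: all N*N edge tokens attend to each other
        edges = self.edge_refiner(edges.reshape(B, N * N, D))
        # Per-edge logits over {no relation, i over j, j over i}
        return self.edge_cls(edges).reshape(B, N, N, 3)

# Usage with random stand-in queries (a real detector would supply them):
model = D3GSketch()
logits = model(torch.randn(2, 25, 256))    # (2, 25, 25, 3) adjacency logits

At inference, taking the argmax over the three logits of each ordered pair yields the adjacency matrix of the dependency graph described in the abstract.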
We improve upon competitors thanks to the better feature extraction of transformer-based detection models: two-stage methods that rely on ROI pooling tend to fail when the predicted bounding box contains intruding objects.
Our model performs well even with a large number of objects and highly complex relationship graphs.
Still, several challenges remain. Occluded objects are sometimes hard to detect, leading to missed relationships, and highly ambiguous relations between barely touching objects are difficult to classify.

BibTeX

@article{rabino2024relationshipreasoning,
  title={A Modern Take on Visual Relationship Reasoning for Grasp Planning},
  author={Paolo Rabino and Tatiana Tommasi},
  journal={arXiv preprint arXiv:2409.02035},
  year={2024}
}