PAC Bench: Do Foundation Models Understand Prerequisites for Executing Manipulation Policies?



Anonymous Authors
Under review (double-blind)



PAC Bench Example Scenarios

Abstract

Vision-Language Models (VLMs) are increasingly pivotal for generalist robot manipulation, enabling tasks such as physical reasoning, policy generation, and failure detection. However, their proficiency in these high-level applications often assumes a deep understanding of low-level physical prerequisites, a capability that is largely unverified. To perform actions reliably, robots must comprehend intrinsic object properties (e.g., material, weight), action affordances (e.g., graspable, stackable), and physical constraints (e.g., stability, reachability, or an object's state like being closed). Despite their ubiquitous use in manipulation, we argue that off-the-shelf VLMs may lack this granular, physically grounded understanding, as these specific prerequisites are often overlooked in their pre-training. Addressing this critical gap, we introduce PAC Bench, a comprehensive benchmark designed to systematically evaluate VLM comprehension of these core Properties, Affordances, and Constraints (PAC) from a task executability perspective. PAC Bench features a diverse dataset with over 30,000 annotations, comprising 673 real-world images (115 object classes, 15 property types, 1–3 affordances defined per class), 100 real-world humanoid-view scenarios, and 120 unique simulated constraint scenarios across four tasks. Our evaluations reveal significant gaps in the ability of VLMs to grasp fundamental physical concepts, underscoring their current limitations for reliable robot manipulation and pointing to key areas that require targeted research. PAC Bench also serves as a standardized benchmark for rigorously evaluating VLM physical reasoning and guiding the development of more robust and physically grounded models for robotic manipulation.

Data Exploration

Explore representative samples from the four datasets that make up PAC Bench. Each dataset provides a distinct perspective and set of scenarios for evaluating foundation models' understanding of properties, affordances, and constraints in robotic manipulation.

Constraint Images Dataset

Simulated scenarios designed to test understanding of physical constraints such as impossible placement, support/occlusion, stability, and reachability.

Example constraint scenarios span four simulated domains: Impossible Placement, Support/Occlusion, Reachability, and Stability.

Humanoid Robot Dataset

Real-world captures from the Unitree G1 humanoid robot's viewpoint, providing authentic constraint scenarios.

Sample humanoid-view captures (Views 1–4).

Open Images Dataset

Diverse real-world images for property and affordance evaluation across 115 object classes.

Sample Open Images photographs (Samples 1–4).

RoboCasa Objects Dataset

Multi-angle views of household objects (24 perspectives per object) for comprehensive property evaluation; representative azimuth sweeps are listed below.

Example objects: Cheese Block, Donut, and Baguette, each shown at azimuths 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315° (elevation 0°).

Each object is captured from 24 different angles (3 elevations × 8 azimuth rotations) to provide comprehensive visual coverage for property understanding evaluation. The views listed above correspond to the 8 azimuth angles at elevation 0°, illustrating how the camera rotates around each object.
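To make the viewpoint coverage concrete, the sketch below (Python, not from the PAC Bench release) enumerates the 24 (elevation, azimuth) pairs and converts each to a camera position on a sphere around the object. The 45° azimuth steps come from the description above; the elevation values and viewing radius are illustrative assumptions.

```python
import math

# Illustrative sketch: enumerate 24 viewpoints (3 elevations x 8 azimuths)
# and convert each to a camera position on a sphere around the object.
# Only the 45-degree azimuth steps at elevation 0 are stated above;
# the elevation values and radius below are placeholder assumptions.
AZIMUTHS_DEG = [0, 45, 90, 135, 180, 225, 270, 315]  # 8 steps of 45 degrees
ELEVATIONS_DEG = [0, 30, 60]                         # placeholder elevations
RADIUS = 1.0                                         # placeholder distance

def camera_position(azimuth_deg: float, elevation_deg: float, radius: float = RADIUS):
    """Return the (x, y, z) position of a camera on a sphere, facing the origin."""
    az, el = math.radians(azimuth_deg), math.radians(elevation_deg)
    x = radius * math.cos(el) * math.cos(az)
    y = radius * math.cos(el) * math.sin(az)
    z = radius * math.sin(el)
    return (x, y, z)

viewpoints = [(el, az) for el in ELEVATIONS_DEG for az in AZIMUTHS_DEG]
assert len(viewpoints) == 24  # 3 elevations x 8 azimuths

for el, az in viewpoints[:3]:
    print(f"elev={el:>2} deg, azim={az:>3} deg -> {camera_position(az, el)}")
```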

Dataset Overview

PAC Bench Framework Overview

PAC Bench evaluates foundation models' understanding of three fundamental components crucial for robotic manipulation: Properties (intrinsic object characteristics like material, weight), Affordances (action possibilities like graspable, stackable), and Constraints (physical limitations like stability, reachability). Our benchmark features over 30,000 annotations across 673 real-world images, 100 humanoid-view scenarios, and 120 simulated constraint scenarios. This comprehensive evaluation reveals significant gaps in VLMs' physical reasoning capabilities, highlighting critical areas for improvement in robotics applications.
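As a rough illustration of how such an evaluation can be organized, the sketch below reduces each benchmark item to an image, a question targeting one PAC dimension, and a gold answer set, then scores a model response by exact match. The record fields and file paths are hypothetical and do not reflect PAC Bench's actual data format.

```python
from dataclasses import dataclass, field

# Hypothetical record layout for a PAC-style evaluation item; the field
# names and paths are illustrative and do not mirror PAC Bench's files.
@dataclass
class PACItem:
    image_path: str            # real-world, humanoid-view, or simulated image
    dimension: str             # "property" | "affordance" | "constraint"
    question: str              # e.g., "Is the jar sealed?"
    gold_answers: set = field(default_factory=set)  # accepted answer(s)

def score(item: PACItem, model_answer: str) -> bool:
    """Exact-match scoring of a model answer against the gold answer set."""
    return model_answer.strip().lower() in {a.lower() for a in item.gold_answers}

example = PACItem(
    image_path="scenes/humanoid/scene_001.jpg",   # hypothetical path
    dimension="constraint",
    question="Can the robot place the mug on the shelf without collision?",
    gold_answers={"no"},
)
print(score(example, "No"))  # True
```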

Experiments and Results

PAC Bench Dataset Distribution

Distribution of annotations in PAC Bench across three dimensions: (Left) physical properties annotated in the dataset, showing the relative frequency of each property; (Center) affordance categories, with slices below 5% omitted for clarity; (Right) constraint domains, contrasting simulation (blue shades) and real-world (green shades) scenarios.



VLM Performance Comparison

Comparative PAC understanding profiles of selected VLMs across different model generations. The diverse performance signatures suggest varied developmental trajectories in acquiring physical common sense. Our evaluation reveals significant gaps in current VLMs' ability to understand fundamental physical concepts required for reliable robot manipulation.

Detailed Performance Results

Property Understanding Accuracy (%)

| Model | OI P1 | OI P2 | OI P3 | OI P4 | OI P5 | OI P6 | Hum P1 | Hum P2 | Hum P3 | Hum P4 | Hum P5 | Hum P6 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 0.0 | 31.9 | 0.0 | 0.0 | 2.7 | 42.3 | 50.2 | 28.9 | 50.7 | 52.7 | 19.4 | 55.2 | 27.8 |
| Claude 3.7 Sonnet | 20.2 | 23.5 | 32.6 | 36.7 | 66.4 | 37.0 | 47.8 | 30.3 | 48.3 | 55.7 | 13.2 | 55.7 | 38.9 |
| Gemini 2.0 Flash 001 | 19.7 | **35.3** | **40.8** | **58.0** | 56.1 | **43.9** | **55.2** | 39.8 | 40.3 | 46.8 | 38.2 | 54.7 | 44.1 |
| GPT-4.1 | 13.8 | 29.0 | 4.4 | 25.9 | **91.0** | 27.8 | 51.2 | 55.7 | 43.3 | **58.2** | **43.8** | **64.2** | 42.4 |
| Llama 4 Maverick | **36.2** | 34.9 | 37.6 | 47.0 | 90.0 | 14.6 | 43.8 | **77.1** | **59.2** | 57.7 | 40.3 | 54.2 | **49.4** |

Properties P1–P6: Color, Contents, Weight, Density, Sealing, Hardness. OI = Open Images split, Hum = humanoid-view split. Bold values indicate the best performance in each column.
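The Avg column appears to be the unweighted mean of the twelve per-property scores (six Open Images plus six humanoid-view); the short check below, using values copied from the table, reproduces the reported averages for two of the models.

```python
# Sanity check: the Avg column matches the unweighted mean of the
# 12 per-property accuracies (6 Open Images + 6 Humanoid) per model.
rows = {
    "Claude 3.5 Sonnet": [0.0, 31.9, 0.0, 0.0, 2.7, 42.3, 50.2, 28.9, 50.7, 52.7, 19.4, 55.2],
    "GPT-4.1":           [13.8, 29.0, 4.4, 25.9, 91.0, 27.8, 51.2, 55.7, 43.3, 58.2, 43.8, 64.2],
}
for model, scores in rows.items():
    avg = sum(scores) / len(scores)
    print(f"{model}: {avg:.1f}")  # 27.8 and 42.4, matching the Avg column
```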

Constraint Understanding Accuracy (%)

| Model | Impossible Place | Occlusion | Stability | Reachability | Humanoid (Real World) | Avg |
|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 1.8 / 0.0 / 3.7 | 0.4 |
| Gemini 2.5 Pro P | 10.0 / 20.0 / 10.0 | 90.0 / 30.0 / 60.0 | 0.0 / 40.0 / 0.0 | 30.0 / 0.0 / 20.0 | 11.3 / 18.8 / 9.4 | **25.8** |
| GPT-4.1 | 0.0 / 0.0 / 0.0 | 50.0 / 70.0 / 50.0 | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 11.3 / 13.2 / 9.4 | 13.6 |
| GPT-4.1 Mini | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 0.0 / 0.0 / 0.0 | 18.8 / 24.5 / 22.6 | 4.4 |
| Llama 3.2 11B Vision I | 20.0 / 10.0 / 0.0 | 30.0 / 30.0 / 20.0 | 20.0 / 20.0 / 20.0 | 10.0 / 30.0 / 0.0 | 0.0 / 1.8 / 0.0 | 17.5 |

Constraint understanding across four simulated domains (Impossible Placement, Occlusion, Stability, Reachability) and real-world humanoid scenarios. Each domain reports three sub-scores, shown as a / b / c. Bold indicates the best average.



Property Accuracy Heatmaps

Comprehensive heatmap visualization showing property understanding accuracy across all evaluated VLMs and property types. The color intensity represents performance levels, revealing significant performance variations across different physical properties and models.

Comprehensive Affordance Recognition: Identifying ALL Correct Affordances (%)

| Model | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 | A11 | A12 | A13 | A14 | A15 | A16 | A17 | A18 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Claude 3.7 Sonnet | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Gemini 2.0 Flash 001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| GPT-4.1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 20.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Llama 4 Scout | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 4.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Qwen 2.5 VL | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.1 | 0.0 |

Critical Finding: When required to identify ALL correct affordances (not just one), performance drops to near-zero across all models and categories, with only isolated exceptions like GPT-4.1 (20.0% on Home Fixtures) and Qwen 2.5 VL (11.1% on Tools & Hardware). This reveals that VLMs can identify primary affordances but lack comprehensive functional understanding.
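The drop stems from the scoring rule: under the comprehensive setting, a prediction counts only if the predicted affordance set exactly matches the full annotated set, rather than containing any one correct affordance. A minimal sketch of the two rules (labels and function names are illustrative):

```python
# Strict "identify ALL affordances" scoring: an item is correct only if the
# predicted set equals the annotated set exactly. Contrast with the lenient
# rule that accepts any single correct affordance.
def strict_match(predicted: set, annotated: set) -> bool:
    return predicted == annotated

def lenient_match(predicted: set, annotated: set) -> bool:
    return len(predicted & annotated) > 0

annotated = {"graspable", "stackable", "openable"}  # illustrative labels
predicted = {"graspable"}

print(lenient_match(predicted, annotated))  # True  (one correct affordance)
print(strict_match(predicted, annotated))   # False (missing the other two)
```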

Complete 12-Property Understanding Results (%)

| Model | P1 Capacity | P2 Color | P3 Complexity | P4 Consumability | P5 Contents | P6 Density | P7 Hardness | P8 Orientation | P9 Sealing | P10 Stickiness | P11 Thickness | P12 Weight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Claude 3.5 Sonnet | 17.8 | 0.0 | 0.4 | 0.3 | 31.9 | 0.0 | 42.3 | 15.8 | 2.7 | 0.0 | 52.0 | 0.0 |
| Claude 3.7 Sonnet | 88.1 | 20.2 | 34.0 | 91.4 | 23.5 | 36.7 | 37.0 | 48.7 | 66.4 | 96.6 | 59.2 | 32.6 |
| Gemini 2.0 Flash 001 | 59.4 | 19.7 | 84.8 | 7.0 | 35.3 | 58.0 | 43.9 | 57.6 | 56.1 | 38.2 | 24.3 | 40.8 |

Complete property evaluation across all 12 property types shows significant variation in model capabilities. Claude 3.7 Sonnet excels at Capacity (88.1%) and Consumability (91.4%), while Gemini 2.0 Flash shows strength in Complexity understanding (84.8%).