HOIDiNi:
Human-Object Interaction through
Diffusion Noise Optimization

We present HOIDiNi, a text-driven diffusion framework for synthesizing realistic and plausible human-object interaction (HOI). HOI generation is extremely challenging because it demands strict contact accuracy alongside a diverse motion manifold. While current literature trades off between realism and physical correctness, HOIDiNi optimizes directly in the noise space of a pretrained diffusion model using Diffusion Noise Optimization (DNO), achieving both. This is made feasible by our observation that the problem can be separated into two phases: an object-centric phase, which primarily makes discrete choices of hand-object contact locations, and a human-centric phase, which refines the full-body motion to realize this blueprint. This structured approach allows for precise hand-object contact without compromising motion naturalness. Quantitative, qualitative, and subjective evaluations on the GRAB dataset alone clearly indicate that HOIDiNi outperforms prior works and baselines in contact accuracy, physical validity, and overall quality. Our results demonstrate the ability to generate complex, controllable interactions, including grasping, placing, and full-body coordination, driven solely by textual prompts.

Text-driven Object Interactions

The Method

HOIDiNi is a text-driven diffusion framework that satisfies the tight constraints of HOI while remaining on the manifold of realistic human motion. We address this challenge using an optimization strategy that, by design, preserves the learned motion distribution: Diffusion Noise Optimization (DNO), a test-time sampling method that traverses the noise space of a pretrained diffusion model to steer generation toward desired losses. Originally applied to control free-form motion synthesis, DNO proves to be a natural fit for HOI when carefully adapted to the structure and demands of the task.
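To make the idea of optimizing in noise space concrete, here is a minimal toy sketch. It is not our implementation: `toy_sampler` is a hypothetical, hand-written stand-in for a frozen, deterministic diffusion sampler (a simple cumulative sum), and `STEP`, `contact_loss`, and `optimize_noise` are names invented for this illustration. The point it shows is only that the generator stays fixed while gradient descent updates the input noise until the output satisfies a loss.

```python
import random

STEP = 0.1  # toy assumption: how much each noise entry moves the trajectory


def toy_sampler(noise):
    """Hypothetical stand-in for a frozen, deterministic diffusion sampler.

    Maps a noise vector to a 1D trajectory by cumulative summation.
    """
    traj, pos = [], 0.0
    for z in noise:
        pos += STEP * z
        traj.append(pos)
    return traj


def contact_loss(traj, frame, target):
    """Squared distance between the trajectory and a target at one frame."""
    return (traj[frame] - target) ** 2


def loss_grad(noise, frame, target):
    """Analytic gradient of contact_loss with respect to the noise vector."""
    residual = toy_sampler(noise)[frame] - target
    # Frame `frame` depends only on noise entries 0..frame.
    return [2.0 * residual * STEP if i <= frame else 0.0
            for i in range(len(noise))]


def optimize_noise(noise, frame, target, lr=0.5, iters=200):
    """Plain gradient descent on the noise; the sampler itself never changes."""
    noise = list(noise)
    for _ in range(iters):
        grad = loss_grad(noise, frame, target)
        noise = [z - lr * g for z, g in zip(noise, grad)]
    return noise


random.seed(0)
init_noise = [random.gauss(0.0, 1.0) for _ in range(30)]
opt_noise = optimize_noise(init_noise, frame=29, target=1.0)

initial_loss = contact_loss(toy_sampler(init_noise), 29, 1.0)
final_loss = contact_loss(toy_sampler(opt_noise), 29, 1.0)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.2e}")
```

In a real setting the sampler would be a pretrained motion diffusion model and the gradient would come from automatic differentiation through its sampling chain rather than from a closed-form expression.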

We begin by training a diffusion model, CPHOI, to learn the joint distribution of full-body human motion and object trajectories, enabling coordinated interaction within a unified generative space. A key insight is that accurate HOI depends on identifying semantically meaningful contact pairs between the palm surface and the input object’s surface. Unlike prior methods that rely on heuristics, CPHOI dynamically predicts these contacts for each frame in addition to full-body, fingers, and object trajectories, allowing precise, frame-consistent interaction that adapts to object shape and motion, resulting in more stable and realistic behaviors.

As it turns out, applying DNO over this joint discrete/continuous space of Contact-Pairs, Human, and Object motions is challenging, with many local discontinuities that destabilize convergence. We observe that the complexity of HOI optimization can be separated into two optimization phases. The first, Object-Centric phase considers only the motion of the object and its contacts with the hands, forming a reliable structural blueprint for the ensuing full-body motion. This blueprint then guides the second, Human-Centric phase, which completes the full-body motion, refines finger articulation for precise grasping, and generates natural body posture that semantically supports the object's behavior and dynamics.
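The two-phase split can be illustrated with a small toy sketch under strong simplifying assumptions. Everything here is hypothetical and invented for illustration: motions are 1D, `toy_sampler` is a cumulative-sum stand-in for a frozen sampler, and the contact frames are fixed by hand, whereas in the method the contacts are predicted during the first phase. The sketch only shows the ordering: the object trajectory is optimized first and then frozen, and the hand is optimized afterwards to meet it at the contact frames.

```python
import random

STEP = 0.1  # toy assumption: how much each noise entry moves the trajectory


def toy_sampler(noise):
    """Hypothetical stand-in for a frozen, deterministic sampler (cumulative sum)."""
    traj, pos = [], 0.0
    for z in noise:
        pos += STEP * z
        traj.append(pos)
    return traj


def descend(noise, grad_fn, lr, iters):
    """Plain gradient descent on a noise vector."""
    noise = list(noise)
    for _ in range(iters):
        grad = grad_fn(noise)
        noise = [z - lr * g for z, g in zip(noise, grad)]
    return noise


def object_grad(noise, goal):
    """Gradient of (final object position - goal)^2 w.r.t. the object noise."""
    residual = toy_sampler(noise)[-1] - goal
    return [2.0 * residual * STEP for _ in noise]


def hand_loss(noise, obj_traj, contact_frames):
    """Sum of squared hand-object distances over the contact frames."""
    hand = toy_sampler(noise)
    return sum((hand[t] - obj_traj[t]) ** 2 for t in contact_frames)


def hand_grad(noise, obj_traj, contact_frames):
    """Analytic gradient of hand_loss w.r.t. the hand noise."""
    hand = toy_sampler(noise)
    grad = [0.0] * len(noise)
    for t in contact_frames:
        residual = hand[t] - obj_traj[t]
        for i in range(t + 1):  # frame t depends on noise entries 0..t
            grad[i] += 2.0 * residual * STEP
    return grad


random.seed(0)
T, goal = 20, 1.0
contact_frames = range(5, T)  # assumed by hand for this toy example

# Phase 1 (object-centric): move the object to its goal, then freeze it.
obj_init = [random.gauss(0.0, 1.0) for _ in range(T)]
obj_noise = descend(obj_init, lambda z: object_grad(z, goal), lr=0.5, iters=300)
obj_traj = toy_sampler(obj_noise)

# Phase 2 (human-centric): fit the hand to the frozen object at contact frames.
hand_init = [random.gauss(0.0, 1.0) for _ in range(T)]
hand_noise = descend(hand_init,
                     lambda z: hand_grad(z, obj_traj, contact_frames),
                     lr=0.25, iters=500)

loss_before = hand_loss(hand_init, obj_traj, contact_frames)
loss_after = hand_loss(hand_noise, obj_traj, contact_frames)
print(f"hand-object contact loss: {loss_before:.4f} -> {loss_after:.4f}")
```

Freezing the object before touching the hand keeps each phase a simpler, smoother problem than optimizing both jointly, which is the motivation described above.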

Two-phase Motion Generation

The first phase generates the object motion and the contact-point pairs, which lay out the blueprint for generating the full motion. In the second phase, human motion is generated according to these predetermined contacts, shown here for passing a ball from one hand to the other.

Results

HOIDiNi generates precise interactions with millimeter-level accuracy, successfully handling delicate tasks like manipulating balls of different sizes or a pyramid.

Note how HOIDiNi yields physically plausible results that also mimic task-specific, human-like nuances.

Comparisons

Among the baselines, the competing optimization scheme employed by IMOS [Ghosh et al. 2023] pulls generation toward poor contacts and unrealistic motions. Applying our losses to a classifier guidance baseline does not satisfy the constraints closely enough, and using the popular nearest-neighbor heuristic for contact pairs instead of our predictions fails to find semantically correct contacts. In contrast, HOIDiNi demonstrates pleasing and physically plausible results.
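For readers unfamiliar with the nearest-neighbor heuristic mentioned above, the following is a minimal sketch of what such a baseline typically looks like. The function names and the tiny point sets are hypothetical and chosen only for illustration. Each hand point is paired with whichever object point is geometrically closest, so the pairing knows nothing about which part of the object is meant to be grasped.

```python
import math


def nearest_neighbor_pairs(hand_points, object_points):
    """Pair every hand point with its closest object point (purely geometric)."""
    pairs = []
    for h_idx, h in enumerate(hand_points):
        o_idx = min(range(len(object_points)),
                    key=lambda j: math.dist(h, object_points[j]))
        pairs.append((h_idx, o_idx))
    return pairs


def contact_loss(hand_points, object_points, pairs):
    """Mean squared distance between paired hand and object points."""
    return sum(math.dist(hand_points[h], object_points[o]) ** 2
               for h, o in pairs) / len(pairs)


# Hypothetical toy data: two hand points and three object surface points.
hand_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
object_points = [(0.0, 0.1, 0.0), (0.9, 0.0, 0.0), (5.0, 5.0, 5.0)]

pairs = nearest_neighbor_pairs(hand_points, object_points)
loss = contact_loss(hand_points, object_points, pairs)
print(pairs, round(loss, 4))
```

Because the pairing is recomputed from geometry alone, it can switch targets from frame to frame and can attach the hand to whatever surface happens to be nearby, which is consistent with the failure to find semantically correct contacts noted in the comparison.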