TL;DR

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in
text-to-image generation across multiple objects and diverse categories.

ORIGEN Teaser

Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has primarily focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise—requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.

Method

Orientation Grounding Reward

To formulate the Orientation Grounding problem as a reward maximization problem, we define the Orientation Grounding Reward for an image \(\mathbf{I}\) and target 3D orientations \(\phi_i\), each represented as a distribution \(\Pi(\phi_i)\), using negative KL divergence as follows:
\[ \mathcal{R}(\mathbf{I}) = - \frac{1}{N} \sum_{i=1}^{N} D_{\text{KL}}\Big(\mathcal{D}\big(\mathrm{Crop}(\mathbf{I}, w_i)\big) \,\Big\|\, \Pi(\phi_i)\Big). \]
Here, \(\mathcal{D}\) is the orientation estimation model (Orient-Anything), and \(\mathrm{Crop}(\mathbf{I}, w_i)\) extracts a centered image of the object grounded by \(w_i\) using GroundingDINO, an open-set object detection model. This reward function inherently supports multi-object orientation grounding by averaging the per-object rewards over the \(N\) objects.
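To make the reward concrete, below is a minimal PyTorch sketch. The wrappers `detect_box` (standing in for GroundingDINO) and `orientation_dist` (standing in for Orient-Anything) are hypothetical, not functions from the official codebases, and the predicted and target orientation distributions are assumed to share the same discretized azimuth bins.

```python
import torch

def orientation_grounding_reward(image, phrases, target_dists,
                                 detect_box, orientation_dist):
    """Negative mean KL divergence between predicted and target orientation
    distributions, averaged over the N grounded objects.

    detect_box(image, phrase) -> (x0, y0, x1, y1) box for the phrase (hypothetical)
    orientation_dist(crop)    -> predicted distribution over azimuth bins (hypothetical)
    target_dists[i]           -> target distribution Pi(phi_i) over the same bins
    """
    rewards = []
    for phrase, target in zip(phrases, target_dists):
        x0, y0, x1, y1 = detect_box(image, phrase)   # open-set detection
        crop = image[..., y0:y1, x0:x1]              # centered object crop
        pred = orientation_dist(crop)                # estimated orientation dist.
        # KL(pred || target); clamping avoids log(0) on empty bins
        kl = (pred * (pred.clamp_min(1e-12).log()
                      - target.clamp_min(1e-12).log())).sum()
        rewards.append(-kl)
    return torch.stack(rewards).mean()               # average over N objects
```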

Reward-Adaptive Time-Rescaled Langevin SDE

We introduce Reward-Guided Langevin Dynamics to efficiently sample a latent representation \(\mathbf{x}\) from the optimal reward-aligned distribution. Unlike traditional gradient ascent, which may get stuck in local optima, this approach incorporates stochasticity, leading to the following simple discretized update rule:
\[ \mathbf{x}_{i+1} = \sqrt{1-\gamma}\, (\mathbf{x}_i + \gamma\eta \nabla \hat{\mathcal{R}}(\mathbf{x}_i)) + \sqrt{\gamma} \epsilon_{i}. \] Note that implementing this requires only a single additional line of code: adding Gaussian noise \(\epsilon_i \sim \mathcal{N}(0, \mathbf{I})\) to the latent representation.
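As a sketch of how little this changes a gradient-ascent loop, here is one update step in PyTorch; `reward_grad` stands in for \(\nabla \hat{\mathcal{R}}\) (obtained by backpropagating the reward through the one-step generator), and the hyperparameter values are illustrative rather than the paper's settings.

```python
import torch

def langevin_step(x, reward_grad, gamma=0.05, eta=1.0):
    """One reward-guided Langevin update on the latent x.

    Plain gradient ascent would be:  x = x + gamma * eta * reward_grad(x)
    Langevin dynamics adds a contraction and one line of Gaussian noise.
    """
    grad = reward_grad(x)                          # grad of reward w.r.t. latent
    x = (1 - gamma) ** 0.5 * (x + gamma * eta * grad)
    x = x + gamma ** 0.5 * torch.randn_like(x)     # the single additional line
    return x
```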

To further enhance convergence speed and performance, we introduce Reward-Adaptive Time Rescaling, which modifies the Langevin process with a monitor function \(\mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}))\) that dynamically adjusts the state-dependent step size \(\gamma(\mathbf{x})\) based on the current reward:
\[ \mathbf{x}_{i+1} =\sqrt{1-\gamma(\mathbf{x}_i)}\Bigl(\mathbf{x}_i + \gamma(\mathbf{x}_i)\eta\,\nabla \hat{\mathcal{R}}(\mathbf{x}_i)\Bigr) + \frac{1}{2}\gamma(\mathbf{x}_i)\nabla\log \mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}_i)) + \sqrt{\gamma(\mathbf{x}_i)}\,\epsilon_i. \]
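A hedged PyTorch sketch of this update follows. The exact form of the monitor function \(\mathcal{G}\) and how \(\gamma(\mathbf{x})\) is derived from it are design choices of the paper that we do not reproduce here; the sketch simply assumes \(\gamma(\mathbf{x}) = \gamma_0\, \mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}))\), takes `monitor` as a user-supplied callable, and obtains \(\nabla \log \mathcal{G}\) via autograd.

```python
import torch

def adaptive_langevin_step(x, reward_fn, monitor, gamma0=0.05, eta=1.0):
    """One reward-adaptive, time-rescaled Langevin update on the latent x.

    reward_fn(x) -> scalar reward R(x), differentiable w.r.t. x
    monitor(r)   -> scalar G(r) > 0 rescaling the step size by the reward
    Assumes (an illustrative choice) gamma(x) = gamma0 * G(R(x)) in (0, 1).
    """
    x = x.detach().requires_grad_(True)
    r = reward_fn(x)
    g = monitor(r)
    # gradients of R(x) and of log G(R(x)) w.r.t. x via autograd
    grad_r, = torch.autograd.grad(r, x, retain_graph=True)
    grad_log_g, = torch.autograd.grad(g.log(), x)
    gamma = (gamma0 * g).detach()                  # reward-adaptive step size
    x_new = (1 - gamma).sqrt() * (x + gamma * eta * grad_r)
    x_new = x_new + 0.5 * gamma * grad_log_g
    x_new = x_new + gamma.sqrt() * torch.randn_like(x)
    return x_new.detach()
```

When \(\mathcal{G}\) is constant, \(\nabla \log \mathcal{G} = 0\) and this reduces to the plain Langevin update above.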

Experimental Results

Experimental Setup

We introduce three benchmarks based on the MS-COCO dataset, each consisting of diverse text prompts, images, and ground-truth orientations.
1. MS-COCO-Single: single-object scenario; 1,000 prompts with various azimuths.
2. MS-COCO-NView: single-object scenario; 252 prompts covering four views (front, left, back, right).
3. MS-COCO-Multi: multi-object scenario; 371 prompts with various azimuths.

Quantitative Results

We measure orientation grounding accuracy using two metrics:
1) Absolute Error: the absolute azimuth-angle error between the predicted and ground-truth object orientations.
2) Acc.@22.5°: the angular accuracy within a tolerance of ±22.5°.
For evaluation, we use Orient-Anything to predict the 3D orientation from the generated images.
For text-image alignment, we use three metrics: CLIP Score, VQA-Score, and PickScore.
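For concreteness, here is a small NumPy sketch of the two orientation metrics, under the assumption that azimuths are measured in degrees and the error is the shortest angular distance (wrapping around at 360°):

```python
import numpy as np

def azimuth_metrics(pred_deg, gt_deg, tol=22.5):
    """Mean absolute azimuth error (with 360-degree wraparound) and Acc.@tol.

    pred_deg, gt_deg: arrays of predicted / ground-truth azimuths in degrees.
    """
    pred = np.asarray(pred_deg, dtype=float)
    gt = np.asarray(gt_deg, dtype=float)
    # shortest angular distance, e.g. |350 - 10| -> 20, not 340
    err = np.abs((pred - gt + 180.0) % 360.0 - 180.0)
    return err.mean(), (err <= tol).mean()

# e.g. azimuth_metrics([350, 90], [10, 100]) -> (15.0, 1.0)
```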

MS-COCO-Single

C3DW (Cheng et al.) is trained on synthetic data to learn orientation-conditioned image generation; it therefore has limited generalizability to real-world images, and its outputs lack realism. Zero-1-to-3 (Liu et al.) is likewise trained on single-object images without backgrounds, requiring an additional background-composition step that may introduce unnatural artifacts. Existing guided-generation methods (ReNO, FreeDoM) also achieve suboptimal results compared to ORIGEN.

MS-COCO-NView

ORIGEN outperforms all baseline models in orientation alignment. While FLUX-Schnell achieves the highest alignment among vanilla T2I models, ORIGEN surpasses it by a factor of more than 2.5 in the 3-view setting (82.4% vs. 31.2%) and more than 2 in the 4-view setting (86.6% vs. 42.4%). This highlights the limitations of vanilla T2I models, which struggle with precise orientation control due to the ambiguity of textual descriptions. Unlike these models, ORIGEN consistently generates images that accurately align with the specified orientations.

MS-COCO-Multi

Our approach generalizes seamlessly to multiple objects by simply averaging the orientation grounding reward across objects.

User Study

User study results

Participants were shown the input prompt alongside images generated by three models (Zero-1-to-3, C3DW, and ORIGEN) and asked to select the image that best reflected both the input prompt and the grounding orientation. ORIGEN was preferred by 58.18% of participants, outperforming the baseline models.

Additional Qualitative Results

MS-COCO-Single

General scenes

MS-COCO-NView

General scenes

MS-COCO-Multi

General scenes

BibTeX

@article{min2025origen,
      title   = {{ORIGEN}: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation},
      author  = {Min, Yunhong and Choi, Daehyeon and Yeo, Kyeongmin and Lee, Jihyun and Sung, Minhyuk},
      journal = {arXiv preprint},
      year    = {2025},
}