TL;DR

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in
text-to-image generation across multiple objects and diverse categories.

ORIGEN Teaser

Abstract

We introduce ORIGEN, the first zero-shot method for 3D orientation grounding in text-to-image generation across multiple objects and diverse categories. While previous work on spatial grounding in image generation has primarily focused on 2D positioning, it lacks control over 3D orientation. To address this, we propose a reward-guided sampling approach using a pretrained discriminative model for 3D orientation estimation and a one-step text-to-image generative flow model. While gradient-ascent-based optimization is a natural choice for reward-based guidance, it struggles to maintain image realism. Instead, we adopt a sampling-based approach using Langevin dynamics, which extends gradient ascent by simply injecting random noise—requiring just a single additional line of code. Additionally, we introduce adaptive time rescaling based on the reward function to accelerate convergence. Our experiments show that ORIGEN outperforms both training-based and test-time guidance methods across quantitative metrics and user studies.

Method

Orientation Grounding Reward

To formulate the Orientation Grounding problem as a reward maximization problem, we define the Orientation Grounding Reward for an image \(\mathbf{I}\) and target 3D orientations \(\phi_i\), each represented as a distribution \(\Pi(\phi_i)\), using negative KL divergence as follows:
\[ \mathcal{R}(\mathbf{I}) = - \frac{1}{N} \sum_{i=1}^{N} D_{\text{KL}}\Big(\mathcal{D}\big(\mathrm{Crop}(\mathbf{I}, w_i)\big) \,\Big\|\, \Pi(\phi_i)\Big). \]
Here, \(\mathcal{D}\) is the orientation estimation model (Orient-Anything), and \(\mathrm{Crop}(\mathbf{I}, w_i)\) extracts a centered image of the object grounded by \(w_i\) using GroundingDINO, an open-set object detection model. This reward function inherently supports multi-object orientation grounding by averaging the per-object rewards over the \(N\) objects.
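To make the reward concrete, below is a minimal PyTorch sketch. The wrappers `detect_box` (standing in for GroundingDINO) and `orientation_dist` (standing in for Orient-Anything) are hypothetical, not functions from the official codebases, and the predicted and target orientation distributions are assumed to share the same discretized azimuth bins.

```python
import torch

def orientation_grounding_reward(image, phrases, target_dists,
                                 detect_box, orientation_dist):
    """Negative mean KL divergence between predicted and target orientation
    distributions, averaged over the N grounded objects.

    detect_box(image, phrase) -> (x0, y0, x1, y1) box for the phrase (hypothetical)
    orientation_dist(crop)    -> predicted distribution over azimuth bins (hypothetical)
    target_dists[i]           -> target distribution Pi(phi_i) over the same bins
    """
    rewards = []
    for phrase, target in zip(phrases, target_dists):
        x0, y0, x1, y1 = detect_box(image, phrase)   # open-set detection
        crop = image[..., y0:y1, x0:x1]              # centered object crop
        pred = orientation_dist(crop)                # estimated orientation dist.
        # KL(pred || target); clamping avoids log(0) on empty bins
        kl = (pred * (pred.clamp_min(1e-12).log()
                      - target.clamp_min(1e-12).log())).sum()
        rewards.append(-kl)
    return torch.stack(rewards).mean()               # average over N objects
```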

Reward-Adaptive Time-Rescaled Langevin SDE

We introduce Reward-Guided Langevin Dynamics to efficiently sample a latent representation \(\mathbf{x}\) from the optimal reward-aligned distribution. Unlike traditional gradient ascent, which may get stuck in local optima, this approach incorporates stochasticity, leading to the following simple discretized update rule:
\[ \mathbf{x}_{i+1} = \sqrt{1-\gamma}\, (\mathbf{x}_i + \gamma\eta \nabla \hat{\mathcal{R}}(\mathbf{x}_i)) + \sqrt{\gamma} \epsilon_{i}. \] Note that implementing this requires only a single additional line of code: adding Gaussian noise \(\epsilon_i \sim \mathcal{N}(0, \mathbf{I})\) to the latent representation.
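As a sketch of how little this changes a gradient-ascent loop, here is one update step in PyTorch; `reward_grad` stands in for \(\nabla \hat{\mathcal{R}}\) (obtained by backpropagating the reward through the one-step generator), and the hyperparameter values are illustrative rather than the paper's settings.

```python
import torch

def langevin_step(x, reward_grad, gamma=0.05, eta=1.0):
    """One reward-guided Langevin update on the latent x.

    Plain gradient ascent would be:  x = x + gamma * eta * reward_grad(x)
    Langevin dynamics adds a contraction and one line of Gaussian noise.
    """
    grad = reward_grad(x)                          # grad of reward w.r.t. latent
    x = (1 - gamma) ** 0.5 * (x + gamma * eta * grad)
    x = x + gamma ** 0.5 * torch.randn_like(x)     # the single additional line
    return x
```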

To further enhance convergence speed and performance, we introduce Reward-Adaptive Time Rescaling, which modifies the Langevin process with a monitor function \(\mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}))\) that dynamically adjusts the state-dependent step size \(\gamma(\mathbf{x})\) based on the current reward:
\[ \mathbf{x}_{i+1} =\sqrt{1-\gamma(\mathbf{x}_i)}\Bigl(\mathbf{x}_i + \gamma(\mathbf{x}_i)\eta\,\nabla \hat{\mathcal{R}}(\mathbf{x}_i)\Bigr) + \frac{1}{2}\gamma(\mathbf{x}_i)\nabla\log \mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}_i)) + \sqrt{\gamma(\mathbf{x}_i)}\,\epsilon_i. \]
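A hedged PyTorch sketch of this update follows. The exact form of the monitor function \(\mathcal{G}\) and how \(\gamma(\mathbf{x})\) is derived from it are design choices of the paper that we do not reproduce here; the sketch simply assumes \(\gamma(\mathbf{x}) = \gamma_0\, \mathcal{G}(\hat{\mathcal{R}}(\mathbf{x}))\), takes `monitor` as a user-supplied callable, and obtains \(\nabla \log \mathcal{G}\) via autograd.

```python
import torch

def adaptive_langevin_step(x, reward_fn, monitor, gamma0=0.05, eta=1.0):
    """One reward-adaptive, time-rescaled Langevin update on the latent x.

    reward_fn(x) -> scalar reward R(x), differentiable w.r.t. x
    monitor(r)   -> scalar G(r) > 0 rescaling the step size by the reward
    Assumes (an illustrative choice) gamma(x) = gamma0 * G(R(x)) in (0, 1).
    """
    x = x.detach().requires_grad_(True)
    r = reward_fn(x)
    g = monitor(r)
    # gradients of R(x) and of log G(R(x)) w.r.t. x via autograd
    grad_r, = torch.autograd.grad(r, x, retain_graph=True)
    grad_log_g, = torch.autograd.grad(g.log(), x)
    gamma = (gamma0 * g).detach()                  # reward-adaptive step size
    x_new = (1 - gamma).sqrt() * (x + gamma * eta * grad_r)
    x_new = x_new + 0.5 * gamma * grad_log_g
    x_new = x_new + gamma.sqrt() * torch.randn_like(x)
    return x_new.detach()
```

When \(\mathcal{G}\) is constant, \(\nabla \log \mathcal{G} = 0\) and this reduces to the plain Langevin update above.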

Experimental Results

Experimental Setup

We introduce three benchmarks based on the MS-COCO dataset, each consisting of diverse text prompts, images, and ground-truth orientations.
1. MS-COCO-Single: single-object scenario; 1,000 prompts with various azimuths.
2. MS-COCO-NView: single-object scenario; 252 prompts covering four views (front, left, back, right).
3. MS-COCO-Multi: multi-object scenario; 371 prompts with various azimuths.

Quantitative Results

We measure orientation grounding accuracy using two metrics:
1) Absolute Error: the absolute azimuth-angle error between the predicted and ground-truth object orientations.
2) Acc.@22.5°: the angular accuracy within a tolerance of ±22.5°.
For evaluation, we use Orient-Anything to predict the 3D orientation from the generated images.
For text-image alignment, we use three metrics: CLIP Score, VQA-Score, and PickScore.
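For concreteness, here is a small NumPy sketch of the two orientation metrics, under the assumption that azimuths are measured in degrees and the error is the shortest angular distance (wrapping around at 360°):

```python
import numpy as np

def azimuth_metrics(pred_deg, gt_deg, tol=22.5):
    """Mean absolute azimuth error (with 360-degree wraparound) and Acc.@tol.

    pred_deg, gt_deg: arrays of predicted / ground-truth azimuths in degrees.
    """
    pred = np.asarray(pred_deg, dtype=float)
    gt = np.asarray(gt_deg, dtype=float)
    # shortest angular distance, e.g. |350 - 10| -> 20, not 340
    err = np.abs((pred - gt + 180.0) % 360.0 - 180.0)
    return err.mean(), (err <= tol).mean()

# e.g. azimuth_metrics([350, 90], [10, 100]) -> (15.0, 1.0)
```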

MS-COCO-Single

C3DW (Cheng et al.) is trained on synthetic data to learn orientation-conditioned image generation; it therefore has limited generalizability to real-world images, and its outputs lack realism. Zero-1-to-3 (Liu et al.) is likewise trained on single-object images without backgrounds, requiring an additional background-composition step that may introduce unnatural artifacts. Existing guided-generation methods (ReNO, FreeDoM) also achieve suboptimal results compared to ORIGEN.

MS-COCO-NView

ORIGEN outperforms all baseline models in orientation alignment. While FLUX-Schnell achieves the highest alignment among vanilla T2I models, ORIGEN surpasses it by a factor of more than 2.5 in the 3-view setting (82.4% vs. 31.2%) and more than 2 in the 4-view setting (86.6% vs. 42.4%). This highlights the limitations of vanilla T2I models, which struggle with precise orientation control due to the ambiguity of textual descriptions. Unlike these models, ORIGEN consistently generates images that accurately align with the specified orientations.

MS-COCO-Multi

Our approach generalizes seamlessly to multiple objects by simply averaging the orientation grounding reward across objects.

User Study

User study results

Participants were shown the input prompt alongside images generated by three models (Zero-1-to-3, C3DW, and ORIGEN) and asked to select the image that best reflected both the input prompt and the grounding orientation. ORIGEN was preferred by 58.18% of participants, outperforming the baseline models.

Additional Qualitative Results

MS-COCO-Single

General scenes

MS-COCO-NView

General scenes

MS-COCO-Multi

General scenes

BibTeX

@article{min2025origen,
      title   = {{ORIGEN}: Zero-Shot 3D Orientation Grounding in Text-to-Image Generation},
      author  = {Min, Yunhong and Choi, Daehyeon and Yeo, Kyeongmin and Lee, Jihyun and Sung, Minhyuk},
      journal = {arXiv preprint},
      year    = {2025},
}