AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation

AAAI 2026

Jiayin Zhu¹

Linlin Yang²

Yicong Li¹

Angela Yao¹

¹National University of Singapore ²Communication University of China


Abstract

Optimization-based text-to-3D methods distill guidance from 2D generative models via Score Distillation Sampling (SDS), but implicitly treat this guidance as static. This work shows that ignoring source dynamics yields inconsistent optimization trajectories that suppress or merge semantic cues, leading to "semantic over-smoothing" artifacts. We therefore reformulate text-to-3D optimization as mapping a dynamically evolving source distribution to a fixed target distribution. We cast the problem into a dual-conditioned latent space, conditioned on both the text prompt and the intermediate rendered image. In this joint setup, we observe that the image condition naturally anchors the current source distribution. Building on this insight, we introduce AnchorDS, an improved score distillation mechanism that uses image conditions to provide state-anchored guidance and stabilize generation. We further penalize erroneous source estimates and design a lightweight filtering strategy and a fine-tuning strategy that refine the anchor with negligible overhead. AnchorDS produces finer-grained detail, more natural colors, and stronger semantic consistency, particularly for complex prompts, while maintaining efficiency. Extensive experiments show that our method surpasses previous methods in both quality and efficiency.
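For readers unfamiliar with SDS, its gradient is commonly written in the literature as shown below. This formula is not stated on this page. The symbols are assumptions that follow common usage rather than this paper's notation: θ are the 3D parameters, g is the differentiable renderer, ε_φ is the pretrained 2D denoiser, y is the text prompt, and w(t) is a timestep weighting.

```latex
\nabla_\theta \mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right],
\qquad
x = g(\theta), \quad x_t = \alpha_t\, x + \sigma_t\, \epsilon .
```

In this form the subtracted term is the sampled noise ε, which is drawn independently of the current render x. One possible reading of the abstract is that AnchorDS replaces this state-independent term with an estimate conditioned on the rendered image; the exact objective is defined in the paper, not here.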

3D Consistency Comparison with Existing Methods

A DSLR photo of a chow chow puppy

SDS-Bridge

VSD (SD 1.5)

VSD (SD 2.1)

Ours

While the other methods exhibit a severe Janus problem (multiple heads), our method maintains coherent geometry and texture across viewpoints.

Visualization of Our Text-to-GS Results

A red barn in a green field

A vibrant orange pumpkin sitting on a hay bale

A green cactus in a clay pot

A gold glittery carnival mask

A bright red fire hydrant

A pair of shiny black patent leather shoes

Visualization of Our Text-to-NeRF Results

A wooden rocking chair on a porch

A bald eagle carved out of wood

A colorful parrot on a jungle tree

A faux-fur leopard print hat

A long woolen scarf, striped red and black

A pair of white sneakers on a black mat

Our method produces high-quality 3D objects with both Gaussian Splatting and NeRF representations, giving consistent results across the two while maintaining semantic fidelity to the input text prompts.

Method Overview

Our AnchorDS method anchors dynamic sources in the score distillation process to ensure semantically consistent text-to-3D generation, effectively addressing multi-face artifacts while maintaining high-quality 3D representations.
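The idea of anchoring the source can be illustrated with a toy Gaussian example. This is a minimal sketch under strong assumptions, not the AnchorDS implementation. The "render" is just a 2D point `theta`, the target distribution is a Gaussian around `target_mean`, and the denoiser is the closed-form optimal noise predictor for a Gaussian. There is no text or image conditioning; the hypothetical "anchor" is simply a source Gaussian centred on the current state. All function names and parameter values are illustrative.

```python
import numpy as np

def gaussian_eps(x_t, mean, std, alpha, sigma):
    """Optimal noise prediction when clean samples follow N(mean, std^2 I)
    and the noisy sample is x_t = alpha * x + sigma * eps."""
    return sigma * (x_t - alpha * mean) / (alpha**2 * std**2 + sigma**2)

def guidance(theta, target_mean, rng, anchored, std=0.5, alpha=0.8, sigma=0.6):
    """Return one guidance direction for the toy 'render' x = theta."""
    eps = rng.standard_normal(theta.shape)
    x_t = alpha * theta + sigma * eps
    eps_target = gaussian_eps(x_t, target_mean, std, alpha, sigma)
    if anchored:
        # Hypothetical anchor: a source distribution centred on the current state.
        eps_source = gaussian_eps(x_t, theta, std, alpha, sigma)
    else:
        # SDS-style: subtract the sampled noise, which ignores the current state.
        eps_source = eps
    return eps_target - eps_source

def optimize(anchored, steps=200, lr=0.5, seed=0):
    """Plain gradient descent on theta using the chosen guidance direction."""
    rng = np.random.default_rng(seed)
    theta = np.array([2.0, -1.0])
    target_mean = np.array([0.0, 1.0])
    for _ in range(steps):
        theta = theta - lr * guidance(theta, target_mean, rng, anchored)
    return theta, target_mean

if __name__ == "__main__":
    for anchored in (False, True):
        theta, target = optimize(anchored)
        print("anchored" if anchored else "sds-style", theta, "target", target)
```

In this toy setting the anchored direction reduces to a multiple of `theta - target_mean`, so it is the same for every noise sample, whereas the SDS-style direction keeps a residual noise term. The example only shows why tying the source estimate to the current state can stabilize the guidance; it says nothing about semantic over-smoothing or the Janus problem.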

Citation

If you use this work or find it helpful, please consider citing:

@inproceedings{zhu2026AnchorDS,
         author={Zhu, Jiayin and Yang, Linlin and Li, Yicong and Yao, Angela},
         title={AnchorDS: Anchoring Dynamic Sources for Semantically Consistent Text-to-3D Generation},
         volume={40},
         booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
         year={2026}
}