We present InstructHumans, a novel framework for instruction-driven 3D human texture editing. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models. This work shows that naively applying such scores is harmful to editing, because they destroy consistency with the source avatar. Instead, we propose an alternate SDS for Editing (SDS-E) that selectively incorporates sub-terms of SDS across diffusion timesteps. We further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp, high-fidelity detailing. InstructHumans significantly outperforms existing 3D editing methods, remaining consistent with the initial avatar while staying faithful to the textual instructions.
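For context, a minimal sketch of the standard SDS gradient, in DreamFusion-style notation where x = g(θ) is a rendered image, ε_φ is the diffusion model's noise prediction, and w(t) is a timestep weighting. The decomposition shown is one common way to split the residual into sub-terms; it is illustrative and not necessarily the exact split used by SDS-E:

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[\, w(t)\,\big(\epsilon_\phi(x_t; y, t) - \epsilon\big)\,\tfrac{\partial x}{\partial \theta} \,\right],
\qquad
\epsilon_\phi(x_t; y, t) - \epsilon = \underbrace{\epsilon_\phi(x_t; y, t) - \epsilon_\phi(x_t; t)}_{\text{conditional guidance}} + \underbrace{\epsilon_\phi(x_t; t) - \epsilon}_{\text{denoising residual}}.

SDS-E's idea, per the abstract above, is to keep or drop such sub-terms depending on the diffusion timestep t, rather than applying the full residual uniformly.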
InstructHumans produces high-fidelity editing results that align with the editing instructions while faithfully preserving the details of the original avatars.
InstructHumans optimizes an avatar's texture given text instructions. Images rendered through a conditional NeRF are edited by InstructPix2Pix. SDS-E distills the editing gradients and updates the texture latent codes anchored on the parametric human mesh. This process is enhanced by gradient-aware viewpoint sampling and a smoothness regularizer. The edited avatar is easily drivable by changing the pose parameters.
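Below is a minimal sketch of this optimization loop under stated assumptions. The helper names (render_view, sds_e_gradient, laplacian_smoothness) are hypothetical stand-ins, not the authors' API: a real implementation would wrap the conditional NeRF renderer and InstructPix2Pix here, and the smoothness term assumes a simple Laplacian-style penalty over latent codes of neighboring mesh anchors.

import torch

# Hypothetical stand-ins (placeholders, not the paper's code):
def render_view(latents, view):
    # A real renderer would produce a view-dependent image differentiable
    # w.r.t. the texture latent codes; this stub just keeps the graph alive.
    return latents.mean() * torch.ones(3, 64, 64)

def sds_e_gradient(image, instruction):
    # Stands in for the SDS-E gradient distilled from InstructPix2Pix.
    return 0.01 * torch.randn_like(image)

def laplacian_smoothness(latents, neighbors):
    # Penalize each latent code's deviation from the mean of its mesh neighbors.
    return ((latents - latents[neighbors].mean(dim=1)) ** 2).mean()

def optimize_texture(latents, neighbors, views, instruction,
                     steps=200, lr=1e-2, lam=0.1):
    opt = torch.optim.Adam([latents], lr=lr)
    scores = torch.ones(len(views))  # running per-view gradient magnitudes
    for _ in range(steps):
        # Gradient-aware viewpoint sampling: favor views with larger recent updates.
        i = torch.multinomial(scores / scores.sum(), 1).item()
        img = render_view(latents, views[i])
        grad = sds_e_gradient(img, instruction)
        # Surrogate loss whose gradient w.r.t. img equals the distilled gradient,
        # plus the spatial smoothness regularizer on the latent codes.
        loss = (img * grad.detach()).sum() + lam * laplacian_smoothness(latents, neighbors)
        opt.zero_grad()
        loss.backward()
        opt.step()
        scores[i] = 0.9 * scores[i] + 0.1 * grad.norm().item()
    return latents

# Example usage with toy shapes: 100 latent codes of dimension 8,
# 6 mesh neighbors per anchor, and 10 candidate viewpoints.
latents = torch.randn(100, 8, requires_grad=True)
neighbors = torch.randint(0, 100, (100, 6))
optimize_texture(latents, neighbors, views=list(range(10)),
                 instruction="Turn him into a clown")

Injecting the distilled gradient through the surrogate dot-product loss is the standard SDS trick for avoiding backpropagation through the diffusion model itself; because the pose comes from the parametric mesh while edits live in the texture latents, re-posing the edited avatar needs no re-optimization.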
If you use this work or find it helpful, please consider citing:
@article{zhu2024InstructHumans,
  author  = {Zhu, Jiayin and Yang, Linlin and Yao, Angela},
  title   = {InstructHumans: Editing Animated 3D Human Textures with Instructions},
  journal = {arXiv preprint arXiv:2404.04037},
  year    = {2024}
}