Controllable face generation poses critical challenges in generative modeling due to the intricate balance required between semantic controllability and photorealism. While existing approaches struggle with disentangling semantic controls from generation pipelines, we revisit the architectural potential of Diffusion Transformers (DiTs) through the lens of expert specialization. This paper introduces Face-MoGLE, a novel framework featuring: (1) Semantic-decoupled latent modeling through mask-conditioned space factorization, enabling precise attribute manipulation; (2) A mixture of global and local experts that captures holistic structure and region-level semantics for fine-grained controllability; (3) A dynamic gating network producing time-dependent coefficients that evolve with diffusion steps and spatial locations. Face-MoGLE provides a powerful and flexible solution for high-quality, controllable face generation, with strong potential in generative modeling and security applications. Extensive experiments demonstrate its effectiveness in multimodal and monomodal face generation settings and its robust zero-shot generalization capability.
The Face-MoGLE framework extracts region-specific features from a semantic mask using a shared-weight VAE encoder, routing them to global and local experts. Their outputs are fused through a dynamic gating network, enabling high-fidelity generation with fine semantic alignment.
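The fusion step above — dynamic, per-location gating coefficients blending one global expert with several local experts — can be sketched as follows. This is an illustrative assumption-laden sketch in NumPy, not the authors' implementation: shapes, the timestep embedding, and the linear gate are all hypothetical placeholders for the actual DiT-based experts.

```python
import numpy as np

# Hypothetical sketch (not the released Face-MoGLE code) of fusing one
# global expert and several local experts with a gating network whose
# coefficients depend on the diffusion timestep and spatial location.
rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gate(latent, t, W, b):
    """Per-location mixing coefficients conditioned on timestep t
    (via a scalar embedding) and spatial position (via the latent)."""
    t_emb = np.full(latent.shape[:-1] + (1,), t)          # broadcast t to every location
    feats = np.concatenate([latent, t_emb], axis=-1)      # (H, W, C+1)
    return softmax(feats @ W + b, axis=-1)                # (H, W, n_experts)

H, W_, C = 8, 8, 16
n_local = 3                   # e.g. one local expert per semantic mask region
n_experts = 1 + n_local       # one global expert plus the local experts

latent = rng.standard_normal((H, W_, C))
# Stand-ins for expert outputs; in the paper the global expert models
# holistic structure and local experts model mask-factorized regions.
expert_out = rng.standard_normal((n_experts, H, W_, C))

W_g = rng.standard_normal((C + 1, n_experts)) * 0.1       # toy gate weights
b_g = np.zeros(n_experts)

coeffs = gate(latent, t=0.5, W=W_g, b=b_g)                # (H, W, n_experts)
fused = np.einsum("hwe,ehwc->hwc", coeffs, expert_out)    # weighted sum of experts

print(fused.shape)                        # (8, 8, 16)
print(bool(np.allclose(coeffs.sum(-1), 1.0)))  # True: weights sum to 1
```

Because the gate is evaluated at every diffusion step with a fresh `t`, the mixture can shift from coarse global structure early in sampling toward region-level detail later, which is the behavior the dynamic gating network is designed to provide.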
@misc{zou2025mixturegloballocalexperts,
  title         = {Mixture of Global and Local Experts with Diffusion Transformer for Controllable Face Generation},
  author        = {Xuechao Zou and Shun Zhang and Xing Fu and Yue Li and Kai Li and Yushe Cao and Congyan Lang and Pin Tao and Junliang Xing},
  year          = {2025},
  eprint        = {2509.00428},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2509.00428}
}
This webpage was originally made by Matan Kleiner with the help of Hila Manor. The code for the original template can be found here.
Icons are taken from Font Awesome or from Academicons.