SeFi-Image

A Text-to-Image Foundation Model with Semantic-First Diffusion

SeFi-Team

semantic-first denoising · high-fidelity texture latent · 1B / 2B / 5B variants

Highlights

Three things SeFi-Image is built to show.

01

SFD at foundation-model scale

SeFi-Image scales Semantic-First Diffusion (SFD) from smaller class-conditional settings to high-resolution text-to-image generation, with 1B, 2B, and 5B model variants.

02

Reconstruction-generation trade-off

The semantic stream gives texture generation a cleaner structural anchor, allowing stronger reconstruction fidelity without making diffusion training harder.

03

Competitive results with less compute

The 5B model was trained in 125K A800 GPU hours. Among the models compared on this page, it ranks first on GenEval, LongTextBench, CVTG-2K, and OneIG-EN, and stays close to the leaders on DPG-Bench and OneIG-ZH.

Method

Semantic latents lead texture denoising.

Semantic-First Diffusion separates high-level structure from texture details and lets the semantic stream run ahead by a small timestep offset. Texture generation then follows the semantic trajectory, rather than both signals being denoised in lockstep.

[Figure: Semantic-First Diffusion stages, with semantic latents denoising before texture latents. Semantic and texture streams follow separate denoising trajectories.]

Stage I: semantic initialization

Semantic latents denoise first, giving the model an early structural anchor.

Stage II: asynchronous generation

Semantic and texture streams denoise together, with semantics leading textures by a temporal offset.

Stage III: texture completion

Texture latents finish the last refinement stage before the final image is decoded.
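
The three stages above can be sketched as a pair of offset timestep schedules. This is a minimal illustration under stated assumptions, not the SeFi-Image implementation: the function name `sfd_schedule`, the linear shared clock, the convention that t = 1.0 is pure noise and t = 0.0 is clean, and the example offset of 0.2 are all hypothetical choices made here for clarity.

```python
from typing import List, Tuple


def sfd_schedule(num_steps: int, offset: float) -> List[Tuple[float, float, str]]:
    """Return (t_sem, t_tex, stage) per step; t = 1.0 is pure noise, 0.0 is clean.

    Assumed parametrization (not taken from the paper): a shared clock s runs
    from 0 to 1 + offset. The semantic stream uses t_sem = 1 - s and the
    texture stream uses t_tex = 1 + offset - s, both clamped to [0, 1].
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    if not 0.0 < offset < 1.0:
        raise ValueError("offset must lie strictly between 0 and 1")

    total = 1.0 + offset
    schedule = []
    for i in range(num_steps + 1):
        s = total * i / num_steps
        t_sem = min(1.0, max(0.0, 1.0 - s))
        t_tex = min(1.0, max(0.0, total - s))
        if s < offset:
            stage = "I"    # only semantics denoise; texture is still pure noise
        elif s < 1.0:
            stage = "II"   # both streams denoise; semantics lead by `offset`
        else:
            stage = "III"  # semantics are clean; texture finishes refinement
        schedule.append((t_sem, t_tex, stage))
    return schedule


if __name__ == "__main__":
    for t_sem, t_tex, stage in sfd_schedule(num_steps=10, offset=0.2):
        print(f"stage {stage:<3}  t_sem={t_sem:.2f}  t_tex={t_tex:.2f}")
```

In this sketch the semantic noise level never exceeds the texture noise level, and during Stage II the gap between the two equals the chosen offset. A real sampler would pass both timesteps to the denoiser at each step; how the model consumes them is outside what this page specifies.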

Benchmarks

Performance across the main evaluation axes.

The main benchmarks cover prompt following, long-text rendering, visual text generation, and instruction alignment.


GenEval

Object-focused prompt following and compositional reasoning.

SeFi-Image-5B: 0.88
Qwen-Image: 0.87
FLUX.2-Klein-9B: 0.85
Z-Image: 0.84

LongTextBench

Long text rendering across English and Chinese prompts.

SeFi-Image-5B: 0.978
JoyAI-Image: 0.963
Qwen-Image-2512: 0.960
Qwen-Image: 0.945
Z-Image: 0.936

DPG-Bench

Detailed prompt following over objects, relations, attributes, and global composition.

Qwen-Image: 88.32
Z-Image: 88.14
JoyAI-Image: 88.05
SeFi-Image-5B: 87.27

CVTG-2K

Character-level visual text generation, reported here as word accuracy.

SeFi-Image-5B: 0.895
JoyAI-Image: 0.874
Z-Image: 0.867
Qwen-Image: 0.829

OneIG-EN

Omni-dimensional English instruction generation across alignment, text, reasoning, style, and diversity.

SeFi-Image-5B: 0.5606
Z-Image: 0.5460
JoyAI-Image: 0.5420
Qwen-Image: 0.5390

OneIG-ZH

Chinese instruction generation over alignment, text, reasoning, style, and diversity.

Qwen-Image: 0.5480
SeFi-Image-5B: 0.5379
Z-Image: 0.5350
JoyAI-Image: 0.5210

Showcases

Generation across five visual domains.

Qualitative examples cover the main visual categories in the paper: natural scenes, text-rich layouts, character images, stylized generation, and portraits.

Scene Composition

Wide natural scenes, weather, cities, animals, and free-aspect landscape framing.

Scene composition example: mountain ridge.
Scene composition example: rainy harbor.
Scene composition example: desert reflection.
Scene composition example: alpine lake.
Scene composition example: city rooftop.
Scene composition example: wild horse.
Scene composition example: forest valley.
Scene composition example: starry coast.
Scene composition example: frozen mountain.

Text-Rich Generation

Posters, signs, labels, maps, menus, and bilingual layouts with readable text.

Text-rich generation example: museum plaque.
Text-rich generation example: farm stall.
Text-rich generation example: nocturne cover.
Text-rich generation example: plum rain tea.
Text-rich generation example: mountain after rain.
Text-rich generation example: Paris clay open.
Text-rich generation example: trust the pace.
Text-rich generation example: sky orchard map.
Text-rich generation example: read for the morning poster.
Text-rich generation example: tonkotsu ramen.
Text-rich generation example: turbo room rally.
Text-rich generation example: garden house sign.
Text-rich generation example: illustrated book cover.
Text-rich generation example: sprout keeper character sheet.
Text-rich generation example: lavender after dark sign.

Anime Characters

Character-focused generation across close-ups, full-body layouts, fantasy scenes, and environmental shots.

Anime character example with purple hair and a cat.
Anime character example with mechanical armor.
Anime character example in a flower field.
Anime character example with yellow birds.
Anime character example with cosmic flame effects.
Anime scene example in a snowy village.
Anime character example in a white dress by a lake.
Anime character example in a blue maid outfit.
Anime character example with a star accessory.
Anime character example with wave-like composition.

Style Diversity

Illustration, ink painting, plush toy, sticker sheet, sketch, and graphic-design styles.

Style diversity example: pink plush elephant.
Style diversity example: cloud palace illustration.
Style diversity example: neon car sketch.
Style diversity example: kimono illustration.
Style diversity example: ink dragon.
Style diversity example: blue forest illustration.
Style diversity example: ink swordsman.
Style diversity example: rain mushroom illustration.
Style diversity example: plush rabbit character.
Style diversity example: sticker sheet.

Portraits

Close-up and environmental portraits with varied lighting, pose, material, and composition.

Portrait example: candle-lit profile.
Portrait example: bow and soft light.
Portrait example: clay-like material.
Portrait example: forest headphones.
Portrait example: red headscarf.
Portrait example: white braid.
Portrait example: blue dress.
Portrait example: standing fashion pose.

Citation

Cite SeFi-Image

Project citation for the SeFi-Image arXiv preprint.

@misc{sefiteam2026sefiimagetexttoimagefoundationmodel,
  title = {SeFi-Image: A Text-to-Image Foundation Model with Semantic-First Diffusion},
  author = {SeFi-Team},
  year = {2026},
  eprint = {2606.22568},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2606.22568}
}