SeFi-Image

Highlights

Three things SeFi-Image is built to show.

01

SFD at foundation-model scale

SeFi-Image scales Semantic-First Diffusion (SFD) from smaller class-conditional settings to high-resolution text-to-image generation, with 1B, 2B, and 5B model variants.

02

Reconstruction-generation trade-off

The semantic stream gives texture generation a cleaner structural anchor, allowing stronger reconstruction fidelity without making diffusion training harder.

03

Competitive results with less compute

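The 5B model is trained with 125K A800 GPU hours. Among the models compared in the Benchmarks section, it scores highest on GenEval, LongTextBench, CVTG-2K, and OneIG-EN, and trails the top listed score by about one point on DPG-Bench (87.27 vs. 88.32) and by about 0.01 on OneIG-ZH (0.5379 vs. 0.5480).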

Method

Semantic latents lead texture denoising.

Semantic-First Diffusion separates high-level structure from texture details and lets the semantic stream move ahead by a small timestep offset. Texture generation then follows the semantic trajectory instead of denoising both signals in lockstep.

Figure: Semantic-First Diffusion stages, with semantic latents denoising before texture latents. Semantic and texture streams follow separate denoising trajectories.

Stage I: semantic initialization

Semantic latents denoise first, giving the model an early structural anchor.

Stage II: asynchronous generation

Semantic and texture streams denoise together, with semantics leading textures by a temporal offset.

Stage III: texture completion

Texture latents finish the last refinement stage before the final image is decoded.
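The three stages above can be sketched as a simple two-stream timestep schedule. This is a minimal illustration and not the SeFi-Image implementation: the function name `semantic_first_schedule`, the integer `offset_steps` parameter, the linear spacing, and the convention that t = 1.0 is pure noise and t = 0.0 is a clean latent are all assumptions made for the example.

```python
from typing import List, NamedTuple


class Step(NamedTuple):
    index: int    # position in the joint schedule
    t_sem: float  # semantic noise level at the start of this step
    t_tex: float  # texture noise level at the start of this step
    stage: str    # "I", "II", or "III"


def semantic_first_schedule(num_steps: int, offset_steps: int) -> List[Step]:
    """Hypothetical schedule: the semantic stream leads texture by offset_steps.

    Each stream takes num_steps linear steps from t=1.0 (noise) to t=0.0 (clean),
    so the joint schedule has num_steps + offset_steps entries.
    """
    if not 0 < offset_steps < num_steps:
        raise ValueError("need 0 < offset_steps < num_steps")

    schedule = []
    for i in range(num_steps + offset_steps):
        sem_done = min(i, num_steps)         # semantic steps already taken
        tex_done = max(i - offset_steps, 0)  # texture starts offset_steps later
        if i < offset_steps:
            stage = "I"    # semantic initialization: only semantics denoise
        elif i < num_steps:
            stage = "II"   # asynchronous generation: both streams denoise
        else:
            stage = "III"  # texture completion: only textures denoise
        schedule.append(
            Step(i, 1.0 - sem_done / num_steps, 1.0 - tex_done / num_steps, stage)
        )
    return schedule


if __name__ == "__main__":
    for step in semantic_first_schedule(num_steps=10, offset_steps=2):
        print(f"{step.index:2d}  stage {step.stage:<3}  "
              f"t_sem={step.t_sem:.1f}  t_tex={step.t_tex:.1f}")
```

In this sketch the semantic noise level never exceeds the texture noise level, and during Stage II the gap between them stays fixed at `offset_steps / num_steps`, which mirrors the "small timestep offset" described above.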

Benchmarks

Performance across the main evaluation axes.

The main benchmarks cover prompt following, long-text rendering, visual text generation, and instruction alignment.

GenEval

Object-focused prompt following and compositional reasoning.

Model              Score
SeFi-Image-5B      0.88
Qwen-Image         0.87
FLUX.2-Klein-9B    0.85
Z-Image            0.84

LongTextBench

Long text rendering across English and Chinese prompts.

Model              Score
SeFi-Image-5B      0.978
JoyAI-Image        0.963
Qwen-Image-2512    0.960
Qwen-Image         0.945
Z-Image            0.936

DPG-Bench

Detailed prompt following over objects, relations, attributes, and global composition.

Model              Score
Qwen-Image         88.32
Z-Image            88.14
JoyAI-Image        88.05
SeFi-Image-5B      87.27

CVTG-2K

Character-level visual text generation, reported here as word accuracy.

Model              Word accuracy
SeFi-Image-5B      0.895
JoyAI-Image        0.874
Z-Image            0.867
Qwen-Image         0.829

OneIG-EN

Omni-dimensional English instruction generation across alignment, text, reasoning, style, and diversity.

Model              Score
SeFi-Image-5B      0.5606
Z-Image            0.5460
JoyAI-Image        0.5420
Qwen-Image         0.5390

OneIG-ZH

Chinese instruction generation over alignment, text, reasoning, style, and diversity.

Model              Score
Qwen-Image         0.5480
SeFi-Image-5B      0.5379
Z-Image            0.5350
JoyAI-Image        0.5210

Showcases

Generation across five visual domains.

Qualitative examples cover the main visual categories in the paper: natural scenes, text-rich layouts, character images, stylized generation, and portraits.

Scene Composition

Wide natural scenes, weather, cities, animals, and free-aspect landscape framing.

Scene composition example: rainy harbor.

Scene composition example: desert reflection.

Scene composition example: city rooftop.

Scene composition example: forest valley.

Scene composition example: starry coast.

Scene composition example: frozen mountain.

Text-Rich Generation

Posters, signs, labels, maps, menus, and bilingual layouts with readable text.

Text-rich generation example: farm stall.

Text-rich generation example: nocturne cover.

Text-rich generation example: plum rain tea.

Text-rich generation example: mountain after rain.

Text-rich generation example: Paris clay open.

Text-rich generation example: trust the pace.

Text-rich generation example: sky orchard map.

Text-rich generation example: read for the morning poster.

Text-rich generation example: tonkotsu ramen.

Text-rich generation example: turbo room rally.

Text-rich generation example: garden house sign.

Text-rich generation example: illustrated book cover.

Text-rich generation example: sprout keeper character sheet.

Text-rich generation example: lavender after dark sign.

Anime Characters

Character-focused generation across close-ups, full-body layouts, fantasy scenes, and environmental shots.

Anime character example with purple hair and a cat.

Anime character example with mechanical armor.

Anime character example in a flower field.

Anime character example with yellow birds.

Anime character example with cosmic flame effects.

Anime character example in a white dress by a lake.

Anime character example in a blue maid outfit.

Anime character example with a star accessory.

Anime character example with wave-like composition.

Style Diversity

Illustration, ink painting, plush toy, sticker sheet, sketch, and graphic-design styles.

Style diversity example: cloud palace illustration.

Style diversity example: neon car sketch.

Style diversity example: kimono illustration.

Style diversity example: blue forest illustration.

Style diversity example: rain mushroom illustration.

Style diversity example: plush rabbit character.

Portraits

Close-up and environmental portraits with varied lighting, pose, material, and composition.

Portrait example: standing fashion pose.

Citation

Cite SeFi-Image

Project citation for the SeFi-Image arXiv preprint.

@misc{sefiteam2026sefiimagetexttoimagefoundationmodel,
  title = {SeFi-Image: A Text-to-Image Foundation Model with Semantic-First Diffusion},
  author = {SeFi-Team},
  year = {2026},
  eprint = {2606.22568},
  archivePrefix = {arXiv},
  primaryClass = {cs.CV},
  url = {https://arxiv.org/abs/2606.22568}
}