ACE++: Context-Aware Image Editing via Instruction

February 28, 2025

Intro

이번 포스트는 2025년 1월에 발표된 Technical Report ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling을 소개하려고 합니다.

해당 Technical Report는 알리바바의 Vision Intelligence Lab에서 발표한 논문으로, 이미지 편집을 보다 효율적으로 처리하기 위해 기존의 ACE 모델의 개선된 버전인 ACE++ 라는 모델을 발표하였고 오픈소스로 공개하였습니다.

이번 포스트에서는 해당 논문의 아이디어와 이 기술을 어떻게 활용할 수 있을지에 대해 소개해드리려고 합니다.

Abstract

ACE++는 다양한 이미지 생성과 편집 작업을 수행하는 지시문(Instruction) 기반 확산 모델입니다.

FLUX.1-Fill-dev가 제안한 인페인팅(inpainting) 입력 방식을 확장해, ACE에서 사용하던 LCU(Long-context Condition Unit)를 개선하여 모든 편집·생성 작업에 적용 가능하도록 만들었습니다.

또한, 기존 text-to-image diffusion 모델(ex: FLUX.1-dev)을 빠르게 활용하기 위해 2단계 학습 과정을 제안하였습니다.

1단계 (Pretraining)
- 모델이 처음부터 모든 작업을 학습하면 시간이 오래 걸리고 비용이 많이 듭니다.
- 따라서 먼저 기존의 강력한 text-to-image 모델(ex: FLUX.1-dev)을 기반으로 간단한 작업부터 학습합니다.
- 논문에서 ‘0-참조(0-ref) 태스크’라는 개념이 나오는데, 기존 이미지 없이 단순히 텍스트만 보고 이미지를 생성하는 기본 작업을 말합니다.
- 예를 들어, ‘바닷가에서 서핑하는 사람’이라는 텍스트만 주어지면, 기본 모델이 이를 기반으로 이미지를 생성하도록 합니다.
2단계 (Finetuning)
- 이제 기본적인 이미지 생성 능력이 확보되었으므로, ‘특정 스타일 유지’ 또는 ‘부분 수정’ 같은 더 복잡한 지시문 을 이해하도록 모델을 세분화합니다.
- 이 단계에서는 N-참조(N-ref) 태스크가 추가됩니다. 즉, 특정 스타일의 샘플(reference), 편집하고자 하는 이미지, 편집할 영역의 마스크 등이 포함될 수 있습니다.
- 예를 들어, 기존에 주어진 인물 사진을 기반으로 “같은 사람이 정장을 입고 서 있는 모습을 생성”하는 등의 작업을 수행합니다.

Background

Background

확산 모델 (Diffusion Models)의 비약적 발전

이미지 생성 분야에서 확산 모델은 큰 진전을 이루었습니다. Stable Diffusion, FLUX 등 오픈소스 T2I 모델이 등장하면서 고품질/고해상도 이미지 생성이 쉬워졌으며 지속적으로 새로운 연구 및 응용 서비스가 활발히 개발되고 있습니다.

text-to-image (T2I) 모델의 대중화

텍스트로부터 이미지를 생성하는 기술은 이제 예술, 광고, 교육, 소셜 미디어 등 다양한 분야에서 폭넓게 활용되고 있습니다. 하지만 단순 텍스트 지시만으로는 부분 편집 (inpainting), 특정 스타일 유지, 여러 이미지 참조 합성 등과 같은 복잡한 작업을 수행하는 데 한계가 있습니다.

Problem Statement

복합적 이미지 편집 요구 증가
- 실제 서비스 환경에서는 단순 이미지 생성을 넘어선 복잡한 편집 작업이 필요합니다.
- 특정 영역 수정, 동일 인물/오브젝트의 일관성 유지 등 세밀한 제어가 요구됩니다.
- 예를 들어 이미지 속 인물의 의상 변경, 특정 로고 합성, 동일 캐릭터로 새로운 장면 생성 등의 작업이 필요합니다.
모델 운영의 비효율성
- 각각의 편집/생성 기능마다 별도의 모델이나 플러그인이 필요한 현재 구조는 비효율적입니다.
- 인물 보정, 배경 수정, 스타일 변환 등 여러 모델을 개별적으로 운영하면서 리소스와 유지 비용이 증가합니다.
통합된 범용 모델의 부재
- 현재 대부분의 모델들은 inpainting이나 style transfer와 같은 특정 작업에만 특화되어 있습니다.
- 다양한 편집/생성 작업을 하나의 파이프라인에서 처리할 수 있는 범용 모델이 부족한 상황입니다.

이러한 문제들을 해결하기 위해 ACE++는 기존 text-to-image 모델의 강점을 활용하면서 편집용 마스크, 참조 이미지 등 다양한 입력을 자연스럽게 처리하고 단일 모델로 여러 이미지 편집/생성 작업을 통합적으로 수행할 수 있도록 설계되었습니다.

즉, ACE++는 편집 도구만 준비되어 있다면 하나의 모델로 다양한 workflow를 수행할 수 있게 해주는 모델이라고 이해할 수 있습니다.

METHOD

LCU++: 개선된 입력 형식

기존 ACE의 LCU(Long-Context Condition Unit)는 여러 조건(이미지, 마스크, 노이즈 등)을 “토큰 시퀀스”로 붙여 모델에 입력했습니다.
ACE++에서는 이를 개선하여, “이미지·마스크·노이즈”를 채널(Channel)로 붙이는 방식(LCU++)을 채택했습니다.
- 예를 들어, 편집할 이미지와 그 마스크, 그리고 확산 과정을 위한 노이즈 정보를 하나의 “3채널(혹은 그 이상) 이미지”처럼 취급하는 식입니다.
- 이렇게 하면, 기존 text-to-image 모델이 사용하던 “이미지를 토큰화”하는 방식과 자연스럽게 연결되고, 추가된 부분을 따로 크게 수정하지 않아도 모델이 조건부 입력을 처리할 수 있습니다.

모델 구조

Transformer 기반
- 이미지 생성은 기본적으로 Latent Diffusion을 사용하는 확산 프로세스 형태이며, Transformer 블록으로 확장된 형태입니다.
- 텍스트 입력(지시문)은 텍스트 임베딩으로 변환되고,
- LCU++ 입력(이미지·마스크·노이즈)을 채널 결합 형태로 모델 받아들여, 이를 토큰(token) 시퀀스로 변환합니다.
하나의 Attention 흐름에 모든 조건이 함께 들어감
- 본래 text-to-image 모델은 텍스트 임베딩과 노이즈(이미지를 복원할 때 필요한 잠재 표현)만 처리했습니다.
- ACE++는 여기에 참조 이미지와 마스크 정보를 추가적으로 넣어도, Transformer가 모두 한 번에 주목(attention)하여 해석하도록 설계되어 있습니다.
- 이를 통해 “어떤 영역을 어떻게 수정할지” 또는 “어떤 이미지를 참고할지” 같은 정보를 통합적으로 고려하게 됩니다.

3.출력(이미지 생성) 과정

Diffusion(확산) 단계에서 노이즈를 점진적으로 제거(역으로 복원)하며 최종 이미지를 만들어냅니다.
이때, 모델이 텍스트(명령어)와 참조 이미지, 마스크 등을 모두 반영해, 원하는 결과물(편집된 이미지나 완전 새 이미지)을 생성하게 됩니다.

동작 방식(훈련부터 추론까지)

훈련(Training)
- 두 단계로 진행되는 학습 방식을 채택합니다. (1단계) 우선 text-to-image 모델로부터 “기본적인 이미지 생성 능력”을 빠르게 빌려오고(간단한 0-참조 태스크 위주), (2단계) 참조 이미지나 마스크가 필요한 복잡한 작업(N-참조 태스크)을 추가해서 모델이 “편집, 스타일 전환, 부분 수정” 등을 학습하도록 합니다.
- 이 과정에서 모델은 “주어진 참조 이미지를 그대로 재현”하는 능력과 “목표 이미지를 새로 생성”하는 능력을 동시에 익혀, 맥락(Context)을 인지하는 법을 배웁니다.
추론(Inference)
- 학습이 끝나면, 유저가 텍스트 지시문(예: “이 컵에 로고를 박아줘”)과 함께 참조 이미지, 마스크 등을 주면, 모델이 이를 한 번에 받아들여 자연스럽게 해당 부분만 수정하거나, 완전히 새로운 장면을 생성합니다.
- “인물 일관성”이 필요한 경우(같은 얼굴 유지 등), 모델이 단계별로 노이즈를 제거하면서 참조 이미지를 바탕으로 얼굴 특성을 유지합니다.

Use Cases

ACE++은 아래와 같이 총 5가지 태스크를 수행할 수 있습니다.

Subject-Driven Generation

하나의 특정 ‘대상(Subject)’(예: 캐릭터, 마스코트, 브랜드 로고 등)을 중심으로, 새로운 이미지나 장면을 생성하는 태스크입니다.
활용예시
- 캐릭터 IP 확장: 게임·애니메이션에 등장하는 캐릭터를 여러 장소나 상황에서 재현해 굿즈(피규어, 포스터 등)를 디자인할 때.
- 브랜드 마케팅: 로고나 마스코트를 다양한 상황(행사·이벤트·상품)에 배치해 광고용 이미지를 빠르게 대량 생성할 때.

Portrait-Consistency Generation

특정 인물(인물사진, 배우, 모델 등)을 다른 상황(옷차림, 배경, 포즈 등)으로 배치해도, 얼굴 특징이나 인물 정체성을 유지하도록 이미지를 생성하는 기능입니다.
활용 예시
- 영화/드라마 후속편 기획: 동일 배우가 다른 시대적 배경이나 세계관에서 어떤 모습을 할지 시각화해보는 작업.
- 팬아트/굿즈 디자인: 유명 연예인·아이돌을 다양한 콘셉트로 그려서 팬 상품을 만드는 경우.
- 게임 캐릭터 커스터마이징: 플레이어 아바타(실제 인물 기반) 유지하면서 무기·의상·배경만 바꿔보는 식의 시나리오.

Flexible instruction

입력 이미지의 특징을 이해하고 프롬프트 명령에 맞게 편집하능 기능입니다.
활용 예시 • 동작 수정: 사진에서 취한 포즈를 다르게 변경 • 색상 변경: 빨간색 옷을 입은 사람의 옷을 주황색으로 변경

Local Editing

text instruction을 기반으로 이미 존재하는 이미지에서 특정 ‘영역’(마스크 처리된 부위)만 선택적으로 수정·보정·추가·삭제하는 기능입니다.
활용 예시
- 부분 편집: 이미지의 특정 부분에 효과를 추가하거나 변경

Local Reference Editing

기존 이미지에서 특정 부분을 편집하되, 다른 참조 이미지의 일부 속성이나 특징(예: 색상·패턴·로고·의상 등)을 가져와 해당 영역에 적용하는 기능입니다.
활용 예시
- 의상·패션 디자인: “새로 나온 패션 브랜드 로고를 기존 티셔츠 디자인에 붙여보기” 같은 시나리오에서 매우 유용.
- 광고: 특정 배경의 원하는 위치에 광고 대상이 되는 제품을 자연스럽게 추가하여 광고 이미지를 빠르게 대량 생성할 때.

Conclusion

지금까지 ACE++ 모델의 아이디어와 이 기술을 어떻게 활용할 수 있을지에 대해 알아보았습니다. 해당 모델은 기존 LORA 모델과 동일하게 사용할 수 있어 기존의 Flux Fill and Redux 워크플로우를 사용하던 분들을 바로 해당 모델을 테스트해 볼 수 있습니다.

개인적으로 테스트해본 결과 아직 수정이 완벽이 되지는 않지만 하나의 모델로 여러 태스크를 수행할 수 있다는 점에서 잠재성이 뛰어나다고 생각합니다. ACE++의 결과물이 서비스에 사용할 수준으로 컨트롤이 된다면 혁신적인 Instruction 기반 에디터를 만들어 볼 수 있을 것 같습니다.

keep going

Project Page: https://ali-vilab.github.io/ACE_plus_page
Code: https://github.com/ali-vilab/ACE_plus
Download: https://huggingface.co/ali-vilab/ACE_Plus/tree/main
ComfyUI usecase: https://www.youtube.com/watch?v=raETNJBkazA&t=1s

Intro

This post introduces the Technical Report ACE++: Instruction-Based Image Creation and Editing via Context-Aware Content Filling, published in January 2025.

This Technical Report was released by Alibaba’s Vision Intelligence Lab as an improved version of the existing ACE model called ACE++, designed for more efficient image editing. The source code has also been made publicly available.

In this post, I will introduce the main ideas from this paper and discuss how this technology can be leveraged.

Abstract

ACE++ is an instruction-based diffusion model that performs a wide range of image generation and editing tasks.

By extending the inpainting input method proposed by FLUX.1-Fill-dev, ACE++ improves upon the LCU (Long-context Condition Unit) used in ACE so that it can be applied to all editing and creation tasks.

In addition, to quickly utilize existing text-to-image diffusion models (e.g., FLUX.1-dev), a two-stage training process is proposed:

1st Stage (Pretraining)
- Training the model from scratch on all tasks takes a long time and is expensive.
- Therefore, the model first learns simpler tasks based on an existing powerful text-to-image model (e.g., FLUX.1-dev).
- The paper introduces the concept of a “0-ref task,” which refers to a basic task of generating images by only looking at text, without any reference image.
- For example, if given the text “a person surfing at the beach,” the base model generates an image based on that text alone.
2nd Stage (Finetuning)
- Once the model has acquired fundamental image generation capabilities, it is then further specialized to understand more complex instructions such as “maintaining a specific style” or “partial modifications.”
- At this stage, N-ref tasks are added. That is, it may include a reference image with a specific style, the image to be edited, and a mask for the region to be edited.
- For example, based on a provided portrait, one might perform a task like “generate the same person standing in a suit.”

Background

Background

Breakthroughs in Diffusion Models

Diffusion models have made remarkable progress in image generation. With open-source T2I models such as Stable Diffusion and FLUX, high-quality/high-resolution image generation has become more accessible, and new research and applications continue to emerge.

Popularization of Text-to-Image (T2I) Models

The technology of generating images from text is now widely used in various fields, including art, advertising, education, and social media. However, using only simple text prompts can be limiting for complex tasks like partial inpainting, maintaining a specific style, or combining multiple reference images.

Problem Statement

Increased demand for complex image editing
- In real-world service environments, more complex editing tasks are needed beyond simple image generation.
- Detailed control is required, such as modifying specific regions or maintaining the consistency of the same character/object.
- For instance, changing an individual’s outfit in an image, compositing a specific logo, or creating new scenes with the same character.
Inefficiency in model operations
- The current setup, where each editing/generation feature requires a separate model or plugin, is inefficient.
- Operating multiple models—for face retouching, background editing, style transformation, etc.—increases resource and maintenance costs.
Lack of a unified, general-purpose model
- Most current models are specialized for specific tasks like inpainting or style transfer.
- There is a shortage of general-purpose models that can handle diverse editing/generation tasks within a single pipeline.

To address these issues, ACE++ leverages the strengths of existing text-to-image models while naturally handling various inputs such as editing masks and reference images. It is designed to integrate multiple image editing/generation tasks into a single model.

In other words, with ACE++, as long as you have the right editing tools, a single model can be used for a wide range of workflows.

METHOD

LCU++: Enhanced Input Format

The original ACE model’s LCU (Long-context Condition Unit) fed various conditions (image, mask, noise, etc.) as a sequence of tokens to the model.
In ACE++, this is improved by combining “image, mask, noise” as channels, referred to as LCU++.
- For example, the editable image, its mask, and the noise information used during the diffusion process are treated as if they were a single “3-channel (or more) image.”
- This approach connects smoothly with the way the original text-to-image model tokenizes images, allowing the model to handle conditional inputs without major modifications.

Model Architecture

Transformer-based
- Image generation uses a latent diffusion process, extended with Transformer blocks.
- Text inputs (instructions) are converted into text embeddings,
- While the LCU++ inputs (image, mask, noise) are combined into channels and then converted into token sequences.
All conditions in a single attention flow
- Originally, text-to-image models only processed text embeddings and noise (the latent representation to be recovered when reconstructing an image).
- ACE++ adds reference images and mask information, all processed at once by the Transformer so it can attend to everything together.
- This allows the model to simultaneously consider “which region to edit,” “how to edit it,” or “which images to reference.”
Output (Image Generation) Process
- During the diffusion process, noise is gradually removed (in reverse) to produce the final image.
- The model incorporates the text (instruction), reference images, masks, etc. to generate the desired outcome (either an edited image or an entirely new one).

How It Works (Training to Inference)

Training
- Training proceeds in two stages:
  1. (Stage 1) Borrow “fundamental image generation ability” quickly from a text-to-image model (focusing on simpler 0-ref tasks),
  2. (Stage 2) Add tasks that involve reference images or masks (N-ref tasks) so the model learns “editing, style conversion, partial modification,” etc.
- Through this process, the model learns both “to faithfully reproduce a given reference image” and “to generate a new target image,” acquiring the ability to interpret context.
Inference
- Once trained, the user provides a text instruction (e.g., “Add a logo to this cup”) along with reference images, masks, etc., and the model handles them all at once, naturally editing only the specified regions or generating an entirely new scene.
- When “subject consistency” is required (e.g., maintaining the same face), the model gradually removes noise while preserving the facial characteristics from the reference image.

Use Cases

ACE++ can perform the following five tasks:

Subject-Driven Generation

This task generates new images or scenes centered on a specific “subject” (e.g., a character, mascot, brand logo).
Use cases
- Character IP extension: For creating merchandise (figures, posters, etc.) by depicting game/anime characters in various places or situations.
- Brand marketing: Quickly generating large volumes of advertising images by placing a logo or mascot in various settings (events, products, etc.).

Portrait-Consistency Generation

Generates images in different situations (outfits, backgrounds, poses, etc.) for a specific individual (photo of a person, actor, model), while preserving facial features or identity.
Use cases
- Film/TV sequel planning: Visualizing the same actor in a different time period or setting.
- Fan art/merch design: Depicting famous celebrities or idols in various concepts for fan merchandise.
- Game character customization: Keeping a player’s avatar (based on a real person) consistent while changing weapons, outfits, or backgrounds.

Flexible instruction

The model interprets the features of the input image and edits it according to the prompt instructions.
Use cases
- Motion change: Changing the pose of a person in a photo.
- Color change: Changing the color of someone’s clothes from red to orange.

Local Editing

Based on text instructions, selectively modify, correct, add, or delete specific “regions” (mask-defined areas) within an existing image.
Use cases
- Partial editing: Changing or adding effects to a specific region in the image.

Local Reference Editing

Edit a specific part of an existing image while applying certain attributes or features (e.g., color, pattern, logo, outfit) from another reference image to that region.
Use cases
- Apparel/fashion design: Very useful for scenarios like “trying out a newly released fashion brand logo on an existing T-shirt design.”
- Advertising: Quickly generating large volumes of advertisement images by naturally adding the target product to a specific region of the background.

Conclusion

We have looked at the ideas behind the ACE++ model and how we can use this technology. It can be used in the same way as existing LORA models, so those who have used the Flux Fill and Redux workflow can easily test this model.

From my personal testing, the edits are not yet perfect, but the potential is remarkable in that a single model can perform multiple tasks. If ACE++’s outputs can be reliably controlled for production environments, we might see groundbreaking instruction-based editors emerge.

keep going

Share on

Twitter Facebook LinkedIn

EUNPYO HONG

ACE++: Context-Aware Image Editing via Instruction

Intro

Abstract

Background

METHOD

Use Cases

Conclusion

Intro

Abstract

Background

METHOD

Use Cases

Conclusion

Share on

Leave a comment

You may also enjoy

How to Remove Objects Optimally

Enhancing CLIP Model Performance through Registers-Gated Fine-tuning

Deep dive into AWS CloudFront

Deploy ComfyUI as API on RunPod Serverless