Building a Stable Diffusion Workflow: Notes on the Pitfalls I Hit

2025-08-26

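AI/Machine Learning, LoRA, Python, Stable Diffusion, Hands-on

Disclosure: parts of this post were generated with AI assistance, then edited, reviewed, and supplemented with personal experience by a human.

Update note: this post was last updated on 2025-08-26.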

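I have been tinkering with Stable Diffusion for more than two years, going from a single local GPU to a multi-GPU cluster and from casual generation to an industrialized workflow, and I have hit more pitfalls than I can count. This post records the whole path from installation to production deployment.

Environment Setup

Choosing Hardware

I started with a GTX 1060 6G, which struggled even at 512x512. Upgrading to an RTX 3090 24G made a world of difference.

GPU	VRAM	512x512 speed	1024x1024 speed	LoRA training
GTX 1060 6G	6GB	30s/it	OOM	Not supported
RTX 3060 12G	12GB	8s/it	25s/it	Slow
RTX 3090 24G	24GB	3s/it	8s/it	Supported
RTX 4090 24G	24GB	1.5s/it	4s/it	Very fast
A100 40G	40GB	2s/it	5s/it	Extremely fast

Hard-won lesson: VRAM matters more than speed. 12 GB is the floor; you need 24 GB to work comfortably.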

Installing WebUI

# Clone the repository
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui

# Create a conda environment
conda create -n sd python=3.10
conda activate sd

# Install dependencies
pip install -r requirements.txt

# Launch
./webui.sh --xformers --autolaunch

Pitfall 1: xformers fails to install

ERROR: Could not find a version that satisfies the requirement xformers

Fix:

# Option 1: use a prebuilt package
pip install xformers --pre

# Option 2: build from source (slow but reliable)
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .

# Option 3: skip xformers and use --opt-sdp-attention instead
./webui.sh --opt-sdp-attention

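Pitfall 2: CUDA out of memory when loading a model

Even with 24 GB of VRAM, loading SDXL failed. It turned out that VRAM was already being held by other things on the system.

# Reset the GPU to clear VRAM (only works while no process is using the GPU)
sudo nvidia-smi --gpu-reset

# Or reboot and start SD before anything else

# Tuned launch flags
./webui.sh --xformers --medvram-sdxl --no-half-vae

Launch flag	Effect	When to use
--xformers	Faster attention computation	Recommended
--medvram	Medium-VRAM optimization	8-12 GB
--lowvram	Low-VRAM optimization	4-8 GB
--medvram-sdxl	Medium-VRAM optimization for SDXL	12-16 GB
--no-half-vae	Runs the VAE in FP32	Avoids black images
--opt-sdp-attention	Alternative to xformers	When xformers won't install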
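
To make the table easier to apply, here is a minimal sketch that turns it into a lookup. The function name `suggest_flags` and the exact boundaries (for example, treating exactly 8 GB as "medium") are my own assumptions, and adding `--no-half-vae` for SDXL just mirrors the launch command above rather than any official rule.

```python
def suggest_flags(vram_gb: float, sdxl: bool = False, xformers_ok: bool = True) -> list[str]:
    """Pick WebUI launch flags from the VRAM ranges in the table (hypothetical helper)."""
    # Attention optimization: xformers if it installed, otherwise the SDP fallback.
    flags = ["--xformers" if xformers_ok else "--opt-sdp-attention"]

    if vram_gb < 8:
        flags.append("--lowvram")          # 4-8 GB row
    elif vram_gb <= 12 and not sdxl:
        flags.append("--medvram")          # 8-12 GB row
    elif sdxl and vram_gb <= 16:
        flags.append("--medvram-sdxl")     # SDXL row (assumed to also cover 8-12 GB)

    if sdxl:
        # Assumption: keep the VAE in FP32 for SDXL, as in the launch command above.
        flags.append("--no-half-vae")
    return flags


print(" ".join(suggest_flags(12)))
print(" ".join(suggest_flags(16, sdxl=True)))
```

Treat the output as a starting point only; the right flags still depend on the model and resolution you actually run.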

Docker Deployment

I deploy with Docker in production.

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git python3 python3-pip wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Clone WebUI
RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git .

# Install dependencies
RUN pip install -r requirements.txt
RUN pip install xformers

# Expose the port
EXPOSE 7860

# Launch (--api enables the /sdapi/v1 endpoints used later in this post)
CMD ["python3", "launch.py", "--xformers", "--listen", "--api", "--port", "7860"]

# Build and run
docker build -t sd-webui .
docker run -d --gpus all -p 7860:7860 \
  -v /data/models:/app/models \
  -v /data/outputs:/app/outputs \
  sd-webui

Pitfall 3: GPU not available inside Docker

RuntimeError: No CUDA GPUs are available

Fix:

# Install nvidia-docker2
# Note: nvidia-docker2 has since been superseded by the NVIDIA Container Toolkit
# (package nvidia-container-toolkit); on newer distros, check NVIDIA's current install guide.
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Pass --gpus all at run time
docker run --gpus all ...

Model Management

Downloading and Organizing Models

Model files keep getting bigger, and keeping them organized becomes a problem of its own.

models/
├── Stable-diffusion/          # Base models
│   ├── v1-5-pruned-emaonly.safetensors
│   ├── sd_xl_base_1.0.safetensors
│   └── realisticVisionV51_v51VAE.safetensors
├── VAE/                       # VAE models
│   ├── vae-ft-mse-840000-ema-pruned.safetensors
│   └── sdxl_vae.safetensors
├── Lora/                      # LoRA models
│   ├── detail_slider.safetensors
│   └── add_detail.safetensors
├── ControlNet/                # ControlNet models
│   ├── control_v11p_sd15_canny.pth
│   ├── control_v11p_sd15_openpose.pth
│   └── control_v11f1p_sd15_depth.pth
└── embeddings/                # Textual Inversion
    └── badhandv4.pt

Pitfall 4: mixing up model formats

.ckpt vs .safetensors:
- .ckpt: PyTorch pickle format; may contain malicious code
- .safetensors: a safe format that holds only tensor data

Recommendation: always use .safetensors. It is safe and loads fast.
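
A quick way to act on this is to scan the model directory for pickle-based files. The sketch below uses only the standard library; the helper name `find_pickle_models` and the suffix list are my own choices, not part of WebUI. Note that in the tree above it would also flag the `.pth` ControlNet files and the `.pt` embedding, which is a reminder to only take those from sources you trust.

```python
from pathlib import Path

# Suffixes that usually mean a pickle-based PyTorch file (assumed list).
PICKLE_SUFFIXES = {".ckpt", ".pt", ".pth"}


def find_pickle_models(models_dir: str) -> list[Path]:
    """Return every file under models_dir whose suffix suggests a pickle format."""
    root = Path(models_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in PICKLE_SUFFIXES
    )


# Usage: list files worth replacing with a .safetensors version.
for path in find_pickle_models("models"):
    print(path)
```

If the `models` directory does not exist, the loop simply prints nothing.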

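Faster Model Switching

When you switch models often, every switch costs tens of seconds of loading.

# Switch models through the API
import requests

def switch_model(model_name):
    """Switch the active checkpoint through the API."""
    url = "http://localhost:7860/sdapi/v1/options"
    payload = {
        "sd_model_checkpoint": model_name
    }
    response = requests.post(url, json=payload)
    response.raise_for_status()  # fail loudly if the switch was rejected
    return response.json()

# Switch
switch_model("realisticVisionV51_v51VAE.safetensors")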

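Optimization: run multiple instances, each pinned to a single model.

# docker-compose.yml
# Note: SD_MODEL is a custom variable. The image's start command has to forward it
# to WebUI (e.g. via --ckpt), since the Dockerfile above does not read it.
version: '3'
services:
  sd-anime:
    image: sd-webui
    ports:
      - "7861:7860"
    volumes:
      - ./models:/app/models
    environment:
      - SD_MODEL=anime_model.safetensors

  sd-realistic:
    image: sd-webui
    ports:
      - "7862:7860"
    volumes:
      - ./models:/app/models
    environment:
      - SD_MODEL=realistic_model.safetensors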

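LoRA Training

Environment Preparation

I train LoRAs with kohya_ss, which works better than the training built into WebUI.

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

# Windows
setup.bat

# Linux
./setup.sh

# Start the GUI
./gui.sh --listen 127.0.0.1 --server_port 7860

Pitfall 5: running out of VRAM during training

Training an SDXL LoRA on 12 GB of VRAM hit OOM right away.

# Option 1: use 8-bit Adam
pip install bitsandbytes

# In the training config, select:
# Optimizer: AdamW8bit

# Option 2: lower the resolution
# Going from 1024 to 768 cuts VRAM use roughly in half

# Option 3: gradient checkpointing
# Training runs at about half speed, VRAM use drops by about 60%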
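
The "roughly half" figure for option 2 follows from the pixel count. Here is a small worked calculation, assuming that the memory for activations scales about linearly with the number of pixels (model weights and optimizer state do not shrink, so the real saving is somewhat smaller). The function name is mine.

```python
def pixel_ratio(new_side: int, old_side: int) -> float:
    """Ratio of pixel counts between two square resolutions."""
    return (new_side / old_side) ** 2


ratio = pixel_ratio(768, 1024)
print(f"768x768 has {ratio:.2%} of the pixels of 1024x1024")
```

So 768x768 carries a bit more than half the pixels of 1024x1024, which matches the observed saving reasonably well.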

Preparing the Dataset

dataset/
├── 100_mystyle person/       # 100 is the repeat count, mystyle is the trigger word
│   ├── 001.jpg
│   ├── 001.txt              # Caption file
│   ├── 002.jpg
│   └── 002.txt
└── 20_regularization/        # Regularization images
    ├── reg_001.jpg
    └── reg_001.txt

Caption format:

# 001.txt
1girl, solo, long hair, blue eyes, mystyle, looking at viewer, simple background

Pitfall 6: the trigger word mixed in with ordinary tags

# Wrong: the trigger word is buried in the middle
1girl, mystyle, solo, long hair...

# Right: put the trigger word first, or handle it separately
mystyle, 1girl, solo, long hair...
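
Fixing this by hand across hundreds of caption files is tedious, so here is a minimal sketch that rewrites them in place. Both function names are hypothetical, and the behavior is an assumption on my part: the trigger word is moved to the front, and added if a caption lacks it. Back up the dataset before running something like this.

```python
from pathlib import Path


def move_trigger_first(caption: str, trigger: str) -> str:
    """Return the caption with the trigger word as the first tag."""
    tags = [t.strip() for t in caption.split(",") if t.strip()]
    rest = [t for t in tags if t != trigger]
    return ", ".join([trigger] + rest)


def fix_caption_files(folder: str, trigger: str) -> int:
    """Rewrite every .txt caption in folder; return how many files changed."""
    changed = 0
    for path in sorted(Path(folder).glob("*.txt")):
        old = path.read_text(encoding="utf-8").strip()
        new = move_trigger_first(old, trigger)
        if new != old:
            path.write_text(new, encoding="utf-8")
            changed += 1
    return changed


print(move_trigger_first("1girl, mystyle, solo, long hair", "mystyle"))
```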

Tuning Training Parameters

{
  "pretrained_model_name_or_path": "sd_xl_base_1.0.safetensors",
  "train_data_dir": "./dataset",
  "resolution": "1024,1024",
  "train_batch_size": 2,
  "num_train_epochs": 10,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine_with_restarts",
  "lr_warmup_steps": 100,
  "optimizer_type": "AdamW8bit",
  "network_dim": 32,
  "network_alpha": 16,
  "save_every_n_epochs": 2,
  "mixed_precision": "fp16",
  "gradient_checkpointing": true,
  "max_grad_norm": 1.0,
  "seed": 42
}

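Parameter	Meaning	Rule of thumb
network_dim	Network dimension (rank)	16-128, typically 32
network_alpha	Scaling factor	Usually dim/2
learning_rate	Learning rate	1e-4 to 1e-3
train_batch_size	Batch size	As large as VRAM allows
num_train_epochs	Training epochs	10-20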
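
These numbers interact with the repeat count in the dataset folder name, so it helps to estimate the total step count before starting a run. The sketch below assumes the kohya-style convention that each image counts `repeats` times per epoch, and it leaves regularization folders out of the count; the helper names are mine.

```python
import math


def parse_repeats(folder_name: str) -> tuple[int, str]:
    """Split '100_mystyle person' into (100, 'mystyle person')."""
    repeats, _, name = folder_name.partition("_")
    return int(repeats), name


def total_steps(folders: dict[str, int], batch_size: int, epochs: int) -> int:
    """folders maps a dataset folder name to its number of images."""
    samples_per_epoch = sum(
        parse_repeats(name)[0] * n_images for name, n_images in folders.items()
    )
    return math.ceil(samples_per_epoch / batch_size) * epochs


# The two-image example folder with the config values above.
steps = total_steps({"100_mystyle person": 2}, batch_size=2, epochs=10)
print(steps)
```

Comparing this total against `lr_warmup_steps` is a cheap sanity check that warmup is only a small fraction of the run.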

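Pitfall 7: network_dim too large leads to overfitting

With dim=128, the trained LoRA generated poorly for any pose that was not in the training set.

Fix:

{
  "network_dim": 32,
  "network_alpha": 16,
  "num_train_epochs": 15,
  "lr_scheduler": "cosine_with_restarts",
  "lr_scheduler_num_cycles": 3
}

Drop dim to 32, train for more epochs, and use the cosine-with-restarts scheduler.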
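
Two small calculations make the effect of dim and alpha more concrete. This is a sketch under the usual LoRA formulation, where the low-rank update is scaled by alpha/dim and each adapted linear layer adds a down matrix (in x dim) and an up matrix (dim x out). The 1024-wide layer is an illustrative size, not a measured one.

```python
def lora_scale(network_alpha: float, network_dim: int) -> float:
    """Multiplier applied to the low-rank update."""
    return network_alpha / network_dim


def lora_params(in_features: int, out_features: int, network_dim: int) -> int:
    """Trainable parameters added to one linear layer: down (in x dim) + up (dim x out)."""
    return network_dim * (in_features + out_features)


print(lora_scale(16, 32))              # scale for dim=32, alpha=16
print(lora_params(1024, 1024, 32))     # parameters per layer at dim=32
print(lora_params(1024, 1024, 128))    # parameters per layer at dim=128
```

At dim=128 each layer has four times the trainable parameters of dim=32, which fits the overfitting I saw on a small dataset.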

Evaluating Training Results

# Generate test images with the trained LoRA
import base64

import requests

def test_lora(prompt, lora_name, weight=0.8):
    url = "http://localhost:7860/sdapi/v1/txt2img"

    payload = {
        "prompt": f"{prompt} <lora:{lora_name}:{weight}>",
        "negative_prompt": "low quality, blurry, bad anatomy",
        "width": 1024,
        "height": 1024,
        "steps": 30,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
        "seed": -1
    }

    response = requests.post(url, json=payload)
    response.raise_for_status()
    result = response.json()

    # Save the image; the weight goes into the filename so runs don't overwrite each other
    img_data = base64.b64decode(result['images'][0])
    with open(f'test_{lora_name}_{weight}.png', 'wb') as f:
        f.write(img_data)

# Try different weights
test_lora("1girl, solo, standing", "mystyle", weight=0.6)
test_lora("1girl, solo, standing", "mystyle", weight=0.8)
test_lora("1girl, solo, standing", "mystyle", weight=1.0)

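Evaluation dimensions:

Weight	Style strength	Flexibility	Use case
0.3-0.5	Weak	High	Subtle influence
0.6-0.8	Medium	Medium	Balanced
0.9-1.2	Strong	Low	Strong style
1.0+	Too strong	Very low	May break down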
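
When comparing weights, it helps to hold the seed fixed so that the weight is the only thing that changes between images. A minimal sketch of building such a sweep follows; the function name and the seed value 1234 are arbitrary choices of mine, and the dictionaries only carry the fields that differ from a normal txt2img payload.

```python
def sweep_payloads(prompt: str, lora_name: str, weights: list[float], seed: int = 1234) -> list[dict]:
    """Build one partial txt2img payload per weight, all sharing the same seed."""
    return [
        {"prompt": f"{prompt} <lora:{lora_name}:{w}>", "seed": seed}
        for w in weights
    ]


for payload in sweep_payloads("1girl, solo, standing", "mystyle", [0.4, 0.7, 1.0]):
    print(payload)
```

Merge each entry into the full payload (size, steps, sampler) before posting it to the API.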

Using ControlNet

Installation and Configuration

# Install from the Extensions tab in WebUI
# Extensions -> Install from URL -> https://github.com/Mikubill/sd-webui-controlnet

# Download the models into models/ControlNet/
# Recommended models:
# - control_v11p_sd15_canny.pth        # Edge detection
# - control_v11p_sd15_openpose.pth     # Pose
# - control_v11f1p_sd15_depth.pth      # Depth
# - control_v11p_sd15_lineart.pth      # Line art
# - control_v11p_sd15_softedge.pth     # Soft edge

API Calls

import requests
import base64

def generate_with_controlnet(prompt, control_image, control_type="canny"):
    """Generate an image with ControlNet."""

    # Encode the control image
    with open(control_image, "rb") as f:
        control_b64 = base64.b64encode(f.read()).decode()

    # ControlNet settings per control type
    controlnet_configs = {
        "canny": {
            "module": "canny",
            "model": "control_v11p_sd15_canny [d14c016b]",
            "weight": 1.0,
            "processor_res": 512
        },
        "openpose": {
            "module": "openpose_full",
            "model": "control_v11p_sd15_openpose [cab727d4]",
            "weight": 1.0,
            "processor_res": 512
        },
        "depth": {
            "module": "depth_midas",
            "model": "control_v11f1p_sd15_depth [cfd03158]",
            "weight": 0.8,
            "processor_res": 512
        }
    }

    config = controlnet_configs[control_type]

    url = "http://localhost:7860/sdapi/v1/txt2img"
    payload = {
        "prompt": prompt,
        "negative_prompt": "low quality, blurry",
        "width": 512,
        "height": 512,
        "steps": 25,
        "cfg_scale": 7,
        "alwayson_scripts": {
            "ControlNet": {
                "args": [
                    {
                        "input_image": control_b64,
                        "module": config["module"],
                        "model": config["model"],
                        "weight": config["weight"],
                        "resize_mode": "Crop and Resize",
                        "lowvram": False,
                        "processor_res": config["processor_res"],
                        "threshold_a": 100,
                        "threshold_b": 200,
                        "guidance_start": 0,
                        "guidance_end": 1,
                        "control_mode": "Balanced",
                        "pixel_perfect": True
                    }
                ]
            }
        }
    }

    response = requests.post(url, json=payload)
    return response.json()

# Usage
result = generate_with_controlnet(
    "1girl, anime style, colorful",
    "pose_reference.jpg",
    control_type="openpose"
)

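Pitfall 8: ControlNet weight fighting with CFG

When the ControlNet weight is set too high and CFG is also too high, the results come out stiff.

Fix:

ControlNet type	Recommended weight	Recommended CFG
Canny	0.8-1.0	5-7
OpenPose	0.8-1.0	5-7
Depth	0.6-0.8	7-9
LineArt	0.8-1.0	5-7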
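
If requests come from other people or other services, a small pre-flight check against this table can catch stiff-looking settings before they reach the GPU. This is only a sketch: the `RECOMMENDED` dictionary copies the ranges from the table, and `check_settings` is a hypothetical helper, not part of the ControlNet extension.

```python
# Recommended (low, high) ranges per control type, copied from the table.
RECOMMENDED = {
    "canny": {"weight": (0.8, 1.0), "cfg": (5, 7)},
    "openpose": {"weight": (0.8, 1.0), "cfg": (5, 7)},
    "depth": {"weight": (0.6, 0.8), "cfg": (7, 9)},
    "lineart": {"weight": (0.8, 1.0), "cfg": (5, 7)},
}


def check_settings(control_type: str, weight: float, cfg: float) -> list[str]:
    """Return a warning for each value outside its recommended range."""
    rec = RECOMMENDED[control_type]
    warnings = []
    for name, value in (("weight", weight), ("cfg", cfg)):
        low, high = rec[name]
        if not low <= value <= high:
            warnings.append(f"{name}={value} outside recommended {low}-{high}")
    return warnings


print(check_settings("depth", 1.0, 12))
```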

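Combining Multiple ControlNets

# Use OpenPose + Canny together
# (pose_b64 and canny_b64 are base64-encoded control images, prepared as in the previous example)
payload = {
    "prompt": "1girl, anime style",
    "alwayson_scripts": {
        "ControlNet": {
            "args": [
                {  # First unit: OpenPose controls the pose
                    "input_image": pose_b64,
                    "module": "openpose_full",
                    "model": "control_v11p_sd15_openpose",
                    "weight": 1.0,
                },
                {  # Second unit: Canny controls the outline
                    "input_image": canny_b64,
                    "module": "canny",
                    "model": "control_v11p_sd15_canny",
                    "weight": 0.6,  # Lower weight to avoid conflicts
                }
            ]
        }
    }
}

Pitfall 9: VRAM blows up with multiple ControlNets

Running two ControlNets at once doubled VRAM use outright.

Fix:

# Tuned launch flags
./webui.sh --xformers --medvram --no-half-vae

# Or generate at a lower resolution and upscale afterwards
# Generate at 512x512 first, then upscale to 1024x1024 with Hires.fix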

Batch Generation Optimization

Async Queue

Production needs batch generation; producing images one at a time is not an option.

import asyncio
import base64
import threading  # used by the load balancer in the next section
from typing import List, Dict

import aiohttp

class SDQueue:
    """Async generation queue for Stable Diffusion."""

    def __init__(self, base_url="http://localhost:7860", max_concurrent=2):
        self.base_url = base_url
        self.max_concurrent = max_concurrent
        # The semaphore caps how many requests are in flight at once
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.session = None

    async def __aenter__(self):
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.session.close()

    async def generate_single(self, prompt: str, **kwargs) -> Dict:
        """Generate a single image."""
        async with self.semaphore:
            url = f"{self.base_url}/sdapi/v1/txt2img"
            payload = {
                "prompt": prompt,
                "negative_prompt": kwargs.get("negative", ""),
                "width": kwargs.get("width", 512),
                "height": kwargs.get("height", 512),
                "steps": kwargs.get("steps", 25),
                "cfg_scale": kwargs.get("cfg", 7),
                "sampler_name": kwargs.get("sampler", "DPM++ 2M Karras"),
                "seed": kwargs.get("seed", -1),
                "batch_size": kwargs.get("batch_size", 1),
                "n_iter": kwargs.get("n_iter", 1),
            }

            async with self.session.post(url, json=payload) as resp:
                return await resp.json()

    async def generate_batch(self, prompts: List[str], **kwargs) -> List[Dict]:
        """Generate a batch; results come back in the same order as prompts."""
        tasks = [self.generate_single(p, **kwargs) for p in prompts]
        return await asyncio.gather(*tasks)

# Usage
async def main():
    prompts = [
        "1girl, anime style, red hair, blue eyes",
        "1boy, anime style, black hair, green eyes",
        "1girl, anime style, blonde hair, purple eyes",
        "1boy, anime style, white hair, red eyes",
    ]

    # Named sd_queue so it does not shadow the standard-library queue module
    async with SDQueue(max_concurrent=2) as sd_queue:
        results = await sd_queue.generate_batch(prompts, width=512, height=512)

        for i, result in enumerate(results):
            img_data = base64.b64decode(result['images'][0])
            with open(f'output_{i}.png', 'wb') as f:
                f.write(img_data)

asyncio.run(main())
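
To see what the semaphore actually buys you without a running WebUI, here is a self-contained sketch that swaps the HTTP call for `asyncio.sleep` and records the peak number of tasks in flight. The names `run_limited` and `fake_generate` are made up for the demonstration.

```python
import asyncio


async def run_limited(n_tasks: int, max_concurrent: int) -> int:
    """Run n_tasks fake jobs under a semaphore and return the peak concurrency seen."""
    semaphore = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def fake_generate(i: int) -> int:
        nonlocal running, peak
        async with semaphore:
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0.01)  # stands in for the txt2img HTTP request
            running -= 1
            return i

    await asyncio.gather(*(fake_generate(i) for i in range(n_tasks)))
    return peak


peak = asyncio.run(run_limited(n_tasks=8, max_concurrent=2))
print(f"peak concurrency: {peak}")
```

Eight jobs are submitted at once, but the peak never exceeds the limit passed to the semaphore, which is exactly the property that keeps the GPU from being flooded.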

Multi-Instance Load Balancing

A single GPU is too slow, so run several in parallel.

import random
import threading
from typing import Dict, List

class SDLoadBalancer:
    """Load balancer across multiple SD instances."""

    def __init__(self, instances: List[str]):
        """
        instances: ["http://localhost:7861", "http://localhost:7862", ...]
        """
        self.instances = instances
        self.current = 0
        self.lock = threading.Lock()

    def get_instance(self) -> str:
        """Pick an instance round-robin."""
        with self.lock:
            instance = self.instances[self.current]
            self.current = (self.current + 1) % len(self.instances)
            return instance

    def get_instance_random(self) -> str:
        """Pick an instance at random."""
        return random.choice(self.instances)

    def get_instance_least_busy(self, busy_counts: Dict[str, int]) -> str:
        """Pick the instance with the fewest in-flight jobs."""
        return min(self.instances, key=lambda x: busy_counts.get(x, 0))

# Usage
lb = SDLoadBalancer([
    "http://sd-1:7860",
    "http://sd-2:7860",
    "http://sd-3:7860",
])

instance_url = lb.get_instance()
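
The least-busy strategy needs someone to maintain the busy counts. One possible way, sketched below, is a small tracker that bumps a counter when a job starts and decrements it when the job ends, even if it fails. `BusyTracker` is a hypothetical helper, not part of the class above.

```python
import threading
from contextlib import contextmanager


class BusyTracker:
    """Track in-flight jobs per instance and hand out the least busy one."""

    def __init__(self, instances):
        self._counts = {url: 0 for url in instances}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        # Pick the instance with the fewest running jobs (ties go to the first listed).
        with self._lock:
            url = min(self._counts, key=self._counts.get)
            self._counts[url] += 1
        try:
            yield url
        finally:
            # Always release the slot, even if the job raised.
            with self._lock:
                self._counts[url] -= 1

    def snapshot(self):
        with self._lock:
            return dict(self._counts)


tracker = BusyTracker(["http://sd-1:7860", "http://sd-2:7860"])
with tracker.acquire() as url:
    print(url, tracker.snapshot())
```

Inside the `with` block you would send the request to `url`; the counts from `snapshot()` can also be passed to `get_instance_least_busy`.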

Generation Parameter Templates

# Predefined templates for commonly used parameters
GENERATION_PRESETS = {
    "anime_portrait": {
        "prompt_prefix": "masterpiece, best quality, 1girl, solo, ",
        "prompt_suffix": ", looking at viewer, simple background",
        "negative": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry",
        "width": 512,
        "height": 768,
        "steps": 28,
        "cfg": 7,
        "sampler": "DPM++ 2M Karras",
    },
    "realistic_photo": {
        "prompt_prefix": "masterpiece, best quality, realistic, photo, ",
        "prompt_suffix": ", detailed skin, professional lighting",
        "negative": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, anime, cartoon, 3d",
        "width": 512,
        "height": 768,
        "steps": 30,
        "cfg": 7,
        "sampler": "DPM++ 2M Karras",
    },
    "icon_design": {
        "prompt_prefix": "app icon, flat design, ",
        "prompt_suffix": ", clean, minimal, vector style",
        "negative": "lowres, blurry, text, watermark, realistic, photo, 3d render",
        "width": 512,
        "height": 512,
        "steps": 25,
        "cfg": 8,
        "sampler": "Euler a",
    }
}

def apply_preset(base_prompt: str, preset_name: str) -> Dict:
    """Apply a preset to a base prompt."""
    preset = GENERATION_PRESETS[preset_name]
    return {
        "prompt": f"{preset['prompt_prefix']}{base_prompt}{preset['prompt_suffix']}",
        "negative_prompt": preset["negative"],
        "width": preset["width"],
        "height": preset["height"],
        "steps": preset["steps"],
        "cfg_scale": preset["cfg"],
        "sampler_name": preset["sampler"],
    }

# Usage
params = apply_preset("cute girl with cat ears", "anime_portrait")

Production Deployment

Wrapping the API as a Service

import json
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="SD Generation Service")

class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = ""
    width: int = 512
    height: int = 512
    steps: int = 25
    cfg_scale: float = 7.0
    seed: int = -1
    lora: Optional[str] = None
    lora_weight: float = 0.8
    controlnet_image: Optional[str] = None  # base64
    controlnet_type: Optional[str] = None

class GenerateResponse(BaseModel):
    image: str  # base64
    seed: int
    info: dict

SD_API_URL = "http://localhost:7860/sdapi/v1"

# Plain `def` endpoints: FastAPI runs them in a thread pool, so the blocking
# `requests` calls do not stall the event loop.
@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest):
    """Text-to-image endpoint."""
    try:
        # Build the prompt
        prompt = req.prompt
        if req.lora:
            prompt = f"{prompt} <lora:{req.lora}:{req.lora_weight}>"

        payload = {
            "prompt": prompt,
            "negative_prompt": req.negative_prompt,
            "width": req.width,
            "height": req.height,
            "steps": req.steps,
            "cfg_scale": req.cfg_scale,
            "seed": req.seed,
            "sampler_name": "DPM++ 2M Karras",
        }

        # Add ControlNet
        # Assumes the model follows the control_v11p_sd15_<type> naming;
        # depth (control_v11f1p_sd15_depth) does not and needs a lookup table.
        if req.controlnet_image and req.controlnet_type:
            payload["alwayson_scripts"] = {
                "ControlNet": {
                    "args": [{
                        "input_image": req.controlnet_image,
                        "module": req.controlnet_type,
                        "model": f"control_v11p_sd15_{req.controlnet_type}",
                        "weight": 1.0,
                    }]
                }
            }

        # Call the SD API
        resp = requests.post(f"{SD_API_URL}/txt2img", json=payload, timeout=120)
        resp.raise_for_status()
        result = resp.json()

        # WebUI returns `info` as a JSON-encoded string; the actual seed is inside it
        info = json.loads(result.get("info") or "{}")

        return GenerateResponse(
            image=result["images"][0],
            seed=info.get("seed", -1),
            info=info,
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/models")
def list_models():
    """List the available models."""
    resp = requests.get(f"{SD_API_URL}/sd-models")
    return resp.json()

@app.get("/loras")
def list_loras():
    """List the available LoRAs."""
    resp = requests.get(f"{SD_API_URL}/loras")
    return resp.json()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Caching

import hashlib
import json
import redis
from functools import wraps

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def cache_result(ttl=3600):
    """Cache generation results in Redis."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build the cache key from the function name and its arguments
            cache_key = hashlib.md5(
                f"{func.__name__}:{str(args)}:{str(kwargs)}".encode()
            ).hexdigest()

            # Look up the cache
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Run the generation
            result = await func(*args, **kwargs)

            # Store the result (it must be JSON-serializable)
            redis_client.setex(cache_key, ttl, json.dumps(result))

            return result
        return wrapper
    return decorator

# Usage
@cache_result(ttl=86400)  # Cache for one day
async def generate_with_cache(prompt, **kwargs):
    # Actual generation logic goes here
    pass
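
Two details of the cache key are worth a closer look: `str(kwargs)` depends on argument order, and a request with `seed=-1` asks for a random image, so serving it from the cache would return the same picture every time. Below is a minimal sketch of a key function that handles both. `make_cache_key` is a hypothetical helper, and the choice of SHA-256 over canonical JSON is mine.

```python
import hashlib
import json
from typing import Optional


def make_cache_key(params: dict) -> Optional[str]:
    """Return a stable key for a generation request, or None if it should not be cached."""
    # A random seed means the caller expects a fresh image each time.
    if params.get("seed", -1) == -1:
        return None
    # sort_keys makes the key independent of the order the parameters were given in.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


a = make_cache_key({"prompt": "cat", "seed": 42, "steps": 25})
b = make_cache_key({"steps": 25, "seed": 42, "prompt": "cat"})
print(a == b)
```

With a key like this, the decorator would skip both the lookup and the store whenever the key is `None`.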

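Summary

After two years of tinkering with Stable Diffusion, these are my core takeaways:

- VRAM is the hard threshold: 12 GB is workable, 24 GB is comfortable, and industrial-scale use takes 40 GB or more
- Keep model management disciplined: safetensors format, a fixed directory layout, version control
- LoRA training comes down to tuning: dim, learning rate, and data quality, and none of the three can be skipped
- ControlNet is a killer tool, but its weights need tuning and multi-ControlNet setups need VRAM savings
- Batch generation has to be asynchronous: it takes a queue, multiple instances, and caching to carry a production load

Where I hit the most pitfalls:

- xformers would not install, and generation ran 3x slower
- Wrong model format: the model either would not load or was not safe
- LoRA overfitting: nothing worked outside the training poses
- ControlNet weight too high: stiff results
- Batch generation without a concurrency limit: VRAM OOM

Building an SD workflow is a process of continuous optimization, and there is no one-time fix. Only by adjusting to your own business needs will you arrive at the pipeline that suits you best.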
