Notes on Local LLM Deployment Pitfalls: Hands-On Experience with ChatGLM, Vicuna, and InternLM

2024-12-15

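Last year I spent almost six months tinkering with local deployment of large models and hit plenty of pitfalls, so I am writing them down for reference. This post mainly covers deploying and optimizing ChatGLM-6B, Vicuna, and InternLM.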

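Hardware and Software Environment

The biggest headache in deploying large models is the hardware barrier. Here are the configurations I tested at the time:

Model	VRAM required	Recommended configuration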
ChatGLM-6B	12GB+	RTX 3060 12GB
Vicuna-7B	14GB+	RTX 3080 16GB
InternLM-7B	16GB+	RTX 3090 24GB
Vicuna-13B	28GB+	A100 40GB
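
The VRAM figures above are roughly what you get by multiplying the parameter count by the bytes per weight. Here is a minimal sketch of that arithmetic, assuming FP16 weights (2 bytes each) and nominal parameter counts. It ignores activations and the KV cache, which is one reason the table's numbers carry a "+". The function name `weight_vram_gb` is my own, not from any library.

```python
def weight_vram_gb(params_billion: float, bits_per_weight: int = 16) -> float:
    """Approximate VRAM needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Nominal sizes taken from the model names in the table
for name, size in [("ChatGLM-6B", 6), ("Vicuna-7B", 7), ("InternLM-7B", 7), ("Vicuna-13B", 13)]:
    print(f"{name}: ~{weight_vram_gb(size):.0f} GB for FP16 weights")
```

Whatever is left between this estimate and the table entry is headroom for the runtime, activations, and context.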

Managing the software environment with Conda saves a lot of hassle:

conda create -n llm python=3.10 -y
conda activate llm
pip3 install --upgrade pip

# Make sure the PyTorch build matches your CUDA version
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

# Verify that CUDA is available
python -c "import torch; print(torch.cuda.is_available())"
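
When the one-liner above prints False, it helps to know why. The following is a small sketch, not an official diagnostic, that reports the installed PyTorch version, the CUDA version the wheel was built for (expected to read 11.8 for the cu118 index used above), and the first visible GPU. The helper name `cuda_report` is hypothetical.

```python
def cuda_report() -> dict:
    """Collect basic PyTorch/CUDA facts without crashing when PyTorch is missing."""
    try:
        import torch
    except ImportError:
        return {"torch_installed": False}

    report = {
        "torch_installed": True,
        "torch_version": torch.__version__,
        "built_for_cuda": torch.version.cuda,  # None for CPU-only wheels
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        props = torch.cuda.get_device_properties(0)
        report["gpu_name"] = props.name
        report["gpu_vram_gb"] = round(props.total_memory / 1e9, 1)
    return report


for key, value in cuda_report().items():
    print(f"{key}: {value}")
```

If `built_for_cuda` is None, a CPU-only wheel was installed and the pip command needs to be rerun with the right index URL.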

ChatGLM-6B Deployment

ChatGLM was one of the earliest open-source large models released in China, and it performs well in Chinese.

cd F:\GIT_AI
git clone https://github.com/THUDM/ChatGLM2-6B
cd ChatGLM2-6B

# The Tsinghua mirror is reasonably fast
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install streamlit streamlit_chat chardet -i https://pypi.tuna.tsinghua.edu.cn/simple

Launch the web UI:

streamlit run web_demo2.py

For API deployment you also need to install fastapi and uvicorn:

pip install fastapi uvicorn
python api.py

Test the API:

curl -X POST "http://127.0.0.1:8000" \
     -H 'Content-Type: application/json' \
     -d '{"prompt": "你好", "history": []}'
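
The same request can be sent from Python with only the standard library. This is a sketch that assumes the server is listening at the address used in the curl command and replies with JSON. The helper names `build_chat_request` and `chat` are mine.

```python
import json
import urllib.request
from typing import Optional


def build_chat_request(prompt: str, history: list,
                       url: str = "http://127.0.0.1:8000") -> urllib.request.Request:
    """Build the same POST request as the curl command, without sending it."""
    body = json.dumps({"prompt": prompt, "history": history}, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt: str, history: Optional[list] = None) -> dict:
    """Send one prompt to the running API server and return the decoded JSON reply."""
    request = build_chat_request(prompt, history or [])
    with urllib.request.urlopen(request, timeout=60) as reply:
        return json.loads(reply.read().decode("utf-8"))
```

With api.py running, `chat("你好")` returns the decoded reply as a dict.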

Vicuna Deployment

Vicuna is fine-tuned from LLaMA and is stronger in English.

conda create -n FastChat python=3.10 -y
conda activate FastChat

pip3 install --upgrade pip
pip uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

git clone https://github.com/lm-sys/FastChat.git
cd FastChat
pip3 install -e .

Command-line test:

# The 7B version needs 14GB+ of VRAM
python -m fastchat.serve.cli --model-path lmsys/vicuna-7b-v1.3

# The 13B version needs 28GB+ of VRAM
python -m fastchat.serve.cli --model-path lmsys/vicuna-13b-v1.3

Deploying the web UI takes three terminals:

Terminal 1 - controller:

python -m fastchat.serve.controller

Terminal 2 - model worker:

python -m fastchat.serve.model_worker --model-path lmsys/vicuna-7b-v1.3

Terminal 3 - web server:

python -m fastchat.serve.gradio_web_server

Once you see "Uvicorn running on …", the startup succeeded; open http://localhost:7860.
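
Instead of watching the logs, you can poll the port until it accepts connections. This is a minimal standard-library sketch. The helper `wait_for_port` is my own and is not part of FastChat, and it only proves that something is listening, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 120.0, interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Nothing is listening yet; wait and retry
            time.sleep(interval_s)
    return False
```

For the web server above, `wait_for_port("localhost", 7860)` returns True as soon as the page can be opened.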

InternLM Deployment and VRAM Optimization

InternLM is the open-source release of 书生·浦语 (Shusheng Puyu), and its benchmark numbers look good.

conda create -n lmdeploy python=3.10 -y
conda activate lmdeploy

pip uninstall torch
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

git clone https://github.com/InternLM/InternLM.git
cd InternLM
pip install streamlit==1.24.0 transformers==4.30.2

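If VRAM is insufficient (<16GB), web_demo.py needs to be changed:

# The original code casts to bfloat16 and used too much VRAM for me
# model = AutoModelForCausalLM.from_pretrained("...").to(torch.bfloat16).cuda()

# Switch to FP16, which takes half the VRAM of FP32 (2 bytes per weight, the same width as bfloat16)
model = AutoModelForCausalLM.from_pretrained("...").half().cuda()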

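Command-line usage example:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm-7b", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("internlm/internlm-7b", trust_remote_code=True).cuda()
model = model.eval()

# Prompt: "Coming to the beautiful outdoors, we find"
inputs = tokenizer(["来到美丽的大自然，我们发现"], return_tensors="pt")
# Move the input tensors to the GPU so they sit on the same device as the model
inputs = {k: v.cuda() for k, v in inputs.items()}
output = model.generate(
    **inputs,
    max_length=128,
    do_sample=True,  # top_p and temperature only take effect when sampling is enabled
    top_p=0.8,
    temperature=0.8
)
print(tokenizer.decode(output[0]))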

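Quantization: the VRAM Lifesaver

When VRAM runs out, quantization is the way to go. GPTQ and GGML are the two mainstream options:

GPTQ quantization:

One-shot weight quantization down to 3-4 bits
Even a 175B model can run on a single GPU
About 3.25x faster inference on an A100

GGML format:

Pure C implementation with cross-platform support
Can run on the CPU
Well suited to edge-device deployment

In my own tests, 4-bit quantization costs little in quality while saving a lot of VRAM.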
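
To make "4-bit" concrete, here is a toy sketch of the simplest possible scheme: symmetric round-to-nearest with a single scale. This is not GPTQ or GGML, both of which use more elaborate methods such as per-group scales and error compensation. It only shows where the precision loss comes from. The function names and sample weights are made up for illustration.

```python
def quantize_4bit(weights):
    """Symmetric round-to-nearest quantization of a list of floats to 4-bit integers.

    Returns (codes, scale); each code is an integer in [-7, 7].
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.42, -1.30, 0.07, 0.88, -0.51, 1.12]  # made-up sample values
codes, scale = quantize_4bit(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("codes:", codes)
print("max absolute error:", round(max_error, 4))
```

Each weight is stored as one of 15 integer levels plus a shared scale, so the per-weight error is bounded by half a step (`scale / 2`).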

Pitfalls I Hit

CUDA errors

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED

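Fix: reduce the batch size, or switch to FP16 precision.

Out of VRAM

Method	VRAM saved (vs. FP32)	Quality impact
FP16	50%	Slight drop
8-bit	75%	Moderate drop
4-bit	87.5%	Noticeable drop
CPU offload	100%	Much slower

Encoding errors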

UnicodeDecodeError 'gbk' codec can't decode

This is common on Windows; switching to UTF-8 encoding fixes it:

open('./file.txt', encoding="utf-8")
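
The error appears because `open()` without an `encoding` argument falls back to the platform's locale encoding, which is GBK on a Chinese-locale Windows install. The sketch below uses a temporary file and sample text of my own. It shows that passing the encoding explicitly round-trips cleanly, while decoding the same bytes as GBK does not.

```python
import os
import tempfile

text = "你好，世界"  # "Hello, world" in Chinese

# Write and read back with an explicit encoding so the platform default never matters
path = os.path.join(tempfile.mkdtemp(), "file.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_tripped = f.read()

# Decoding the same UTF-8 bytes as GBK either raises or silently yields garbled text
try:
    as_gbk = text.encode("utf-8").decode("gbk")
except UnicodeDecodeError:
    as_gbk = None

print(round_tripped == text)
print(as_gbk == text)
```

The first line printed is True and the second is False, whichever way the GBK decode fails.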

Model Comparison and Selection

According to OpenCompass evaluation data (for reference only):

Model	Parameters	Overall score (Chinese)
InternLM-Chat-7B	7B	57.5
ChatGPT	N/A	55.5
ChatGLM2-6B	6B	49.5
Vicuna-33B	33B	40.1

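My own impressions:

Chinese conversation: ChatGLM-6B really is well optimized for Chinese
English tasks: Vicuna is stronger
Academic research: InternLM is open-source friendly and well documented
Low-resource devices: a quantized build is a must

API Wrapper Example

I wrapped a simple service with FastAPI: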

from fastapi import FastAPI, Request
from transformers import AutoTokenizer, AutoModel
import uvicorn, json, datetime
import torch

app = FastAPI()

@app.post("/")
async def chat(request: Request):
    json_post = await request.json()
    prompt = json_post.get('prompt')
    history = json_post.get('history', [])

    response, history = model.chat(
        tokenizer,
        prompt,
        history=history,
        max_length=2048,
        top_p=0.7,
        temperature=0.95
    )

    return {
        "response": response,
        "history": history,
        "status": 200,
        "time": datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    }

if __name__ == '__main__':
    tokenizer = AutoTokenizer.from_pretrained("chatglm2-6b", trust_remote_code=True)
    model = AutoModel.from_pretrained("chatglm2-6b", trust_remote_code=True).cuda()
    model.eval()
    uvicorn.run(app, host='0.0.0.0', port=8000)
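
The service is stateless: the client has to send the `history` it got back with each new prompt. Here is a sketch of a client that does that bookkeeping, assuming the reply shape returned by the handler above. `ChatSession` and `post_json` are hypothetical names of mine, and the transport is injectable so the class can be tried without a running server.

```python
import json
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict) -> dict:
    """POST a JSON payload and decode the JSON reply (standard library only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload, ensure_ascii=False).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as reply:
        return json.loads(reply.read().decode("utf-8"))


class ChatSession:
    """Keeps the conversation history between calls to the service."""

    def __init__(self, url: str = "http://127.0.0.1:8000",
                 transport: Callable[[str, dict], dict] = post_json):
        self.url = url
        self.transport = transport
        self.history: list = []

    def send(self, prompt: str) -> str:
        reply = self.transport(self.url, {"prompt": prompt, "history": self.history})
        self.history = reply["history"]  # the server returns the updated history
        return reply["response"]
```

Typical use is `session = ChatSession()` followed by repeated `session.send(...)` calls; each call carries the earlier turns along.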

To make startup easier on Windows, I wrote a bat script:

powershell.exe -ExecutionPolicy ByPass -NoExit -Command "& conda activate llm; cd F:\GIT_AI\; streamlit run web_demo.py"

Hardware Recommendations

Based on actual testing:

RTX 3090 24GB: runs ChatGLM-6B, Vicuna-7B, and InternLM-7B (quantized)
A100 40GB: runs Vicuna-33B and LLaMA-65B (quantized)
2×A100 80GB: runs 65B models at original precision
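
A quick way to sanity-check pairings like these is to ask which weight precision fits a given VRAM budget. The sketch below makes two assumptions of my own: a flat 20% overhead on top of the weights, and VRAM pooled across cards (2×80GB treated as 160GB). Real usage depends on context length and the serving framework.

```python
def fits(params_billion: float, bits: int, vram_gb: float, overhead: float = 0.2) -> bool:
    """True if the weights plus a fractional overhead fit into vram_gb (1 GB = 1e9 bytes)."""
    weights_gb = params_billion * bits / 8
    return weights_gb * (1 + overhead) <= vram_gb


def widest_precision(params_billion: float, vram_gb: float):
    """Return the largest bit width among 16, 8 and 4 that fits, or None if none does."""
    for bits in (16, 8, 4):
        if fits(params_billion, bits, vram_gb):
            return bits
    return None


for label, params, vram in [
    ("7B on RTX 3090 24GB", 7, 24),
    ("33B on A100 40GB", 33, 40),
    ("65B on A100 40GB", 65, 40),
    ("65B on 2x A100 80GB", 65, 160),
]:
    print(label, "->", widest_precision(params, vram), "bit")
```

Under these assumptions a 65B model needs 4-bit weights on a single 40GB card but fits at 16-bit across two 80GB cards, which matches the list above.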

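On the architecture side, Vicuna uses a causal decoder-only structure, while ChatGLM uses a prefix decoder structure (a mix of bidirectional and unidirectional attention). Each has its own strengths.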
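
The difference between the two structures comes down to the attention mask. The following is a pure-Python illustration of the two masking patterns, not code from either model. A 1 means the row position may attend to the column position.

```python
def causal_mask(seq_len: int):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


def prefix_mask(seq_len: int, prefix_len: int):
    """Prefix tokens see the whole prefix; later tokens see the prefix plus earlier tokens."""
    return [[j < prefix_len or j <= i for j in range(seq_len)] for i in range(seq_len)]


def show(mask):
    """Print a mask as rows of 1s and 0s."""
    for row in mask:
        print(" ".join("1" if allowed else "0" for allowed in row))


print("causal:")
show(causal_mask(4))
print("prefix (first 2 tokens are the prompt):")
show(prefix_mask(4, 2))
```

In the prefix case the first two rows are identical, because the prompt tokens attend to each other in both directions. After the prompt, both masks are lower-triangular.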

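The biggest takeaway from all this tinkering: there is never enough VRAM, and quantization matters a lot. Downloading model weights over networks in China is another pitfall, so set up a mirror source ahead of time.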
