
Deploying the DeepSeek R1 Distilled Model Series on the M1000

1. Preparation

Download and install torch_musa

Download links

Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksums:
b2d11eefd6593f7e9d5d7d53c92f8687 torch-2.2.0-cp310-cp310-linux_aarch64.whl
384cc6082805b702ea7eb1479e1cd661 torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl
5f34663bbc796baef0cf4712a4e5ffe5 torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl
c93deb6ea89a55dafdab70036414925f torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl
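
Before installing, it is worth verifying the downloads against the MD5 values above. A minimal sketch in Python (assumes the wheels sit in the current directory; the expected checksums are copied from the file list):

import hashlib

# Expected MD5 checksums, copied from the file list above
expected = {
    "torch-2.2.0-cp310-cp310-linux_aarch64.whl": "b2d11eefd6593f7e9d5d7d53c92f8687",
    "torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl": "384cc6082805b702ea7eb1479e1cd661",
    "torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl": "5f34663bbc796baef0cf4712a4e5ffe5",
    "torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl": "c93deb6ea89a55dafdab70036414925f",
}

for name, md5 in expected.items():
    h = hashlib.md5()
    # Hash in 1 MiB chunks so large wheels are not read into memory at once
    with open(name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(name, "OK" if h.hexdigest() == md5 else f"MISMATCH: {h.hexdigest()}")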

Installation script

pip install torch-2.2.0-cp310-cp310-linux_aarch64.whl
pip install torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl
pip install torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl
pip install torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl
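
After installing, a quick smoke test can confirm the MUSA backend is visible. This assumes torch_musa registers the torch.musa namespace on import, as described in its public documentation:

import torch
import torch_musa  # importing registers the MUSA device backend

print(torch.musa.is_available())  # expect True on an M1000 with working drivers
print(torch.musa.device_count())  # number of visible MUSA devices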

Download and install the LLM inference engine vllm + mtt

Download links

mtt Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksum:
04ff2c3c5b88d8eb5a371d0f711588e5 mttransformer-20240402.dev65+g273eb81-py3-none-any.whl

vllm Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksum:
66a369736c741aee840e53d14a8bcd50 vllm-0.4.2+musa0314.g4582adc-cp310-cp310-linux_aarch64.whl

Installation script

# If they are already installed, uninstall them first
pip uninstall mttransformer
pip uninstall vllm

# Install
pip install mttransformer-20240402.dev65+g273eb81-py3-none-any.whl
pip install vllm-0.4.2+musa0314.g4582adc-cp310-cp310-linux_aarch64.whl
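
A simple import check confirms that both packages installed cleanly (the mttransformer module name is assumed from the wheel filename above; vllm exposes __version__):

import vllm
import mttransformer  # module name assumed from the wheel filename above

print(vllm.__version__)  # expect something like 0.4.2+musa0314...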

2. Download the model

Download the MTT-format model

git clone https://modelscope.cn/models/hiyangdong/DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT

Links to supported MTT models

Some models are not supported on-device, mainly due to insufficient memory. These currently include DeepSeek-R1-Distill-Qwen-14B/32B (fp16), DeepSeek-R1-Distill-Llama-70B (fp16), and Qwen2.5-14B/32B/72B (fp16).

[Optional] Download the open-source Qwen2.5-7B model

Download the Qwen2.5-7B model from Hugging Face and run it; Qwen2.5-7B is used as the example here. You can also swap in DeepSeek-R1-Distill-Qwen-1.5B/7B/14B.

GPTQ-Int4 models are currently supported; you can download the following model for adaptation:

git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
# If the download is slow, use modelscope (hosted in China) instead
git clone https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4

[Optional] Convert the model

# Reference: https://docs.mthreads.com/mtt/mtt-doc-online/quick_start
python -m mttransformer.convert_weight \
    --in_file Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --saved_dir Qwen2.5-7B-Instruct-GPTQ-Int4-MTT

in_file is the model downloaded from Hugging Face.

saved_dir is the directory where the MTT-converted model is saved.

model_type currently supports llama, mistral, chatglm2, baichuan, qwen, qwen2, and yayi.

Qwen1.5 and Qwen2 share the same architecture, so both use model_type qwen2; the original Qwen still uses qwen.

Both baichuan and baichuan2 use model_type baichuan.

arch only supports mp_22, i.e. the M1000.
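
Since model_type has to match the source model's architecture, one way to double-check it is to read the architectures field from the downloaded model's config.json. The mapping below is illustrative only, built from the notes above; the converter may infer this itself:

import json

# Illustrative mapping from HF architecture names to mtt model_type values,
# based on the notes above (not an official table from the converter).
ARCH_TO_MODEL_TYPE = {
    "Qwen2ForCausalLM": "qwen2",        # Qwen1.5 / Qwen2 / Qwen2.5
    "QWenLMHeadModel": "qwen",          # original Qwen
    "LlamaForCausalLM": "llama",
    "BaichuanForCausalLM": "baichuan",  # baichuan / baichuan2
}

with open("Qwen2.5-7B-Instruct-GPTQ-Int4/config.json") as f:
    cfg = json.load(f)

arch = cfg["architectures"][0]
print(arch, "->", ARCH_TO_MODEL_TYPE.get(arch, "unknown"))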

3. Deploy and test the local LLM service

Start the vllm model server

Reference: https://docs.mthreads.com/mtt/mtt-doc-online/start_api_server

python -m vllm.entrypoints.openai.api_server \
    --served-model-name DeepSeek-R1-7B --model DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT \
    --gpu-memory-utilization 0.4 --max-model-len 2048 --device musa \
    --port 8000 --tensor-parallel-size 1 -pp 1 --block-size 64 --max-num-seqs 1 \
    --trust-remote-code --disable-log-stats --disable-log-requests --swap-space=0
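
Model loading can take a while; before testing, you can poll the OpenAI-compatible endpoint until it answers. A small sketch using requests (the /v1/models route is the same one queried with curl below):

import time
import requests

# Poll the vllm server until the OpenAI-compatible API starts answering.
url = "http://localhost:8000/v1/models"
for _ in range(60):
    try:
        resp = requests.get(url, timeout=2)
        if resp.ok:
            print("server ready:", [m["id"] for m in resp.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass  # server not up yet; retry
    time.sleep(2)
else:
    print("server did not become ready in time")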

Test the model API

Get the model list

curl http://localhost:8000/v1/models

# Output
{"object":"list","data":[{"id":"DeepSeek-R1-7B","object":"model","created":1732267605,"owned_by":"vllm","root":"DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT","parent":null,"permission":[{"id":"modelperm-c163862182504fbda3f2591a1a6670e7","object":"model_permission","created":1732267605,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Call the model API

Note that you need to change the model name to match your deployment.

# Chinese Q&A
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-R1-7B",
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repetition_penalty": 1.05,
        "messages": [{"role": "user", "content": "介绍一下北京"}]
    }'

# Output
{"id":"cmpl-4e44a394c2ce4627a73a35fcc919d8db","object":"chat.completion","created":1732268219,"model":"DeepSeek-R1-7B","choices":[{"index":0,"message":{"role":"assistant","content":"北京,中国的首都,位于中国北部的平原上,是全国的政治、文化和交通中心。北京的地势平坦,主要由山地和低平原构成。它周围被长城环绕,是中国著名的文化名城之一。\n\n北京的历史悠久,可追溯到公元前2300年前。自秦汉时期起,北京就是中国的重要政治中心。在隋唐至清朝,北京曾被称为“大都”或“燕京”,是当时中国最大的城市之一。直到明朝,北京一直被称为“北京”。\n\n北京不仅有着悠久的历史和浓厚的文化底蕴,同时也是中国的工业和商业中心之一。北京拥有丰富的自然景观和文化遗迹,如故宫、长城、颐和园、天坛等,都是中国著名的旅游景点。\n\n北京也是国际交往的重要城市,也是中国乃至亚洲的政治、经济、文化中心之一。北京拥有众多的高校和科研机构,是中国的人才和科研中心之一。北京还是中国重要的交通枢纽之一,拥有多个国际机场和高速铁路网络。北京还拥有丰富的教育资源和科研机构,吸引了大量的人才和科研力量聚集。北京是中国乃至亚洲的科技、教育和文化中心之一,也是中国现代化和国际化的重要象征。"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":276,"completion_tokens":245}}

# English Q&A
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-R1-7B",
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repetition_penalty": 1.05,
        "messages": [{"role": "user", "content": "Hello! What is your name?"}]
    }'

# Output
{"id":"cmpl-1c39630a5d204549af9d9055a01f10b1","object":"chat.completion","created":1732268198,"model":"DeepSeek-R1-7B","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I'm Qwen, created by Alibaba Cloud. I'm here to help you with any questions you might have. Is there anything specific you'd like to know or discuss?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":36,"total_tokens":74,"completion_tokens":38}}

Calling from Python

Streaming output

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "北京有哪些名胜古迹?"
    }],
    model=model,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        'top_k': 20,
        # Penalize repetition; vllm does not load this from the model's
        # generation config by default, so pass it explicitly. Reference:
        # https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct/file/view/master?fileName=generation_config.json&status=1#L9
        'repetition_penalty': 1.05,
    },
    max_tokens=512,
    stream=True,  # enable streaming output
)

# Handle the streaming response
print("Chat response (streaming):")
for chunk in chat_completion:
    if chunk.choices:
        delta = chunk.choices[0].delta
        content = delta.content
        if content:
            print(content, end='', flush=True)
print("\n - Chat response (end) -\n")

Non-streaming output

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "北京有哪些名胜古迹?"
    }],
    model=model,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        'top_k': 20,
        'repetition_penalty': 1.05,
    },
    max_tokens=512,
    stream=False,
)

print("Chat completion results:")
print(chat_completion)

4. Benchmark model performance (perf)

Configure perf_config.json

A reference configuration:

[
    {
        "model_name": "Qwen2.5-7B",
        "path": "Qwen2.5-7B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    },
    {
        "model_name": "Qwen2.5-3B",
        "path": "Qwen2.5-3B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    },
    {
        "model_name": "Qwen2.5-1.5B",
        "path": "Qwen2.5-1.5B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    }
]
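
Each entry sweeps one model: batchs lists the batch sizes, prefill_token_lens the prompt lengths, and decode_token_lens the generation lengths to measure. To benchmark the DeepSeek model deployed above instead, a small sketch that writes such a config programmatically (field names taken from the reference config):

import json

# One entry per model; field names follow the reference config above.
config = [
    {
        "model_name": "DeepSeek-R1-7B",
        "path": "DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT",
        "batchs": [1],                                 # batch sizes to sweep
        "prefill_token_lens": [256, 512, 1024, 2048],  # prompt lengths
        "decode_token_lens": [64, 128],                # generated-token lengths
    }
]

with open("perf_config.json", "w") as f:
    json.dump(config, f, indent=2)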

Run the perf test

python -m mttransformer.perf_test perf_config.json

Open WebUI interface

You can also pair this with Open WebUI to create a general-purpose user interface with an OpenAI-style chat experience. For configuration details, see "Building a Personal RAG Inference Service with a Moore Threads GPU".

5. Q&A

  1. Converting the 7B model gets Killed at the end

Memory is insufficient; increase swap to more than 16 GB:

# Disable all active swap areas first
swapoff -a
# Create the file that will back the swap area
dd if=/dev/zero of=/var/swapfile bs=1G count=16
# Format it as swap
mkswap /var/swapfile
# Enable the swap file
swapon /var/swapfile
# Check current memory and swap usage
free -m
# Drop caches
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
  2. Out of memory when starting the model

When you run into out-of-memory errors, try the following:

Lower --gpu-memory-utilization, e.g. from a higher value such as 0.5 down to 0.4 or 0.3.

Reduce --max-model-len: if the model does not need especially long context, lower it from a larger default (e.g. 4096) to a smaller value (e.g. 2048 or 1024).

  3. Severe repetition in model output

vllm currently does not automatically load the generation parameters bundled with the model's config, so add 'repetition_penalty': 1.05 on the client side, as in the extra_body examples above.