
Deploying the DeepSeek R1 Distilled Model Series on the M1000

1. Preparation

Download and install torch_musa

Download links

Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksums:
b2d11eefd6593f7e9d5d7d53c92f8687 torch-2.2.0-cp310-cp310-linux_aarch64.whl
384cc6082805b702ea7eb1479e1cd661 torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl
5f34663bbc796baef0cf4712a4e5ffe5 torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl
c93deb6ea89a55dafdab70036414925f torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl
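
Before installing, it is worth verifying the downloads against the MD5 values above. A minimal sketch in Python (assumes the wheels sit in the current directory; the expected checksums are copied from the file list):

import hashlib

# Expected MD5 checksums, copied from the file list above
expected = {
    "torch-2.2.0-cp310-cp310-linux_aarch64.whl": "b2d11eefd6593f7e9d5d7d53c92f8687",
    "torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl": "384cc6082805b702ea7eb1479e1cd661",
    "torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl": "5f34663bbc796baef0cf4712a4e5ffe5",
    "torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl": "c93deb6ea89a55dafdab70036414925f",
}

for name, md5 in expected.items():
    h = hashlib.md5()
    # Hash in 1 MiB chunks so large wheels are not read into memory at once
    with open(name, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    print(name, "OK" if h.hexdigest() == md5 else f"MISMATCH: {h.hexdigest()}")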

Installation script

pip install torch-2.2.0-cp310-cp310-linux_aarch64.whl
pip install torch_musa-1.3.2-cp310-cp310-linux_aarch64.whl
pip install torchaudio-2.2.2+cefdb36-cp310-cp310-linux_aarch64.whl
pip install torchvision-0.17.2+c1d70fe-cp310-cp310-linux_aarch64.whl
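
After installing, a quick smoke test can confirm the MUSA backend is visible. This assumes torch_musa registers the torch.musa namespace on import, as described in its public documentation:

import torch
import torch_musa  # importing registers the MUSA device backend

print(torch.musa.is_available())  # expect True on an M1000 with working drivers
print(torch.musa.device_count())  # number of visible MUSA devices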

Download and install the LLM inference engine vllm + mtt

Download links

mtt Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksum:
04ff2c3c5b88d8eb5a371d0f711588e5 mttransformer-20240402.dev65+g273eb81-py3-none-any.whl

vllm Baidu Netdisk link: https://pan.baidu.com/s/1x0r0AvJ4TkPQP1wSvYr5Xg?pwd=cprw

File list and MD5 checksum:
66a369736c741aee840e53d14a8bcd50 vllm-0.4.2+musa0314.g4582adc-cp310-cp310-linux_aarch64.whl

Installation script

# If they are already installed, uninstall them first
pip uninstall mttransformer
pip uninstall vllm

# Install
pip install mttransformer-20240402.dev65+g273eb81-py3-none-any.whl
pip install vllm-0.4.2+musa0314.g4582adc-cp310-cp310-linux_aarch64.whl
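
A simple import check confirms that both packages installed cleanly (the mttransformer module name is assumed from the wheel filename above; vllm exposes __version__):

import vllm
import mttransformer  # module name assumed from the wheel filename above

print(vllm.__version__)  # expect something like 0.4.2+musa0314...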

2. Download the model

Download the MTT-format model

git clone https://modelscope.cn/models/hiyangdong/DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT

Links to supported MTT models

Some models are not supported on-device, mainly due to insufficient memory. These currently include DeepSeek-R1-Distill-Qwen-14B/32B (fp16), DeepSeek-R1-Distill-Llama-70B (fp16), and Qwen2.5-14B/32B/72B (fp16).

[Optional] Download the open-source Qwen2.5-7B model

Download the Qwen2.5-7B model from Hugging Face and run it; Qwen2.5-7B is used as the example here. You can also swap in DeepSeek-R1-Distill-Qwen-1.5B/7B/14B.

GPTQ-Int4 models are currently supported; you can download the following model for adaptation:

git clone https://huggingface.co/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4
# If the download is slow, use modelscope (hosted in China) instead
git clone https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4

[Optional] Convert the model

# Reference: https://docs.mthreads.com/mtt/mtt-doc-online/quick_start
python -m mttransformer.convert_weight \
    --in_file Qwen2.5-7B-Instruct-GPTQ-Int4 \
    --saved_dir Qwen2.5-7B-Instruct-GPTQ-Int4-MTT

in_file is the model downloaded from Hugging Face.

saved_dir is the directory where the MTT-converted model is saved.

model_type currently supports llama, mistral, chatglm2, baichuan, qwen, qwen2, and yayi.

Qwen1.5 and Qwen2 share the same architecture, so both use model_type qwen2; the original Qwen still uses qwen.

Both baichuan and baichuan2 use model_type baichuan.

arch only supports mp_22, i.e. the M1000.
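
Since model_type has to match the source model's architecture, one way to double-check it is to read the architectures field from the downloaded model's config.json. The mapping below is illustrative only, built from the notes above; the converter may infer this itself:

import json

# Illustrative mapping from HF architecture names to mtt model_type values,
# based on the notes above (not an official table from the converter).
ARCH_TO_MODEL_TYPE = {
    "Qwen2ForCausalLM": "qwen2",        # Qwen1.5 / Qwen2 / Qwen2.5
    "QWenLMHeadModel": "qwen",          # original Qwen
    "LlamaForCausalLM": "llama",
    "BaichuanForCausalLM": "baichuan",  # baichuan / baichuan2
}

with open("Qwen2.5-7B-Instruct-GPTQ-Int4/config.json") as f:
    cfg = json.load(f)

arch = cfg["architectures"][0]
print(arch, "->", ARCH_TO_MODEL_TYPE.get(arch, "unknown"))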

3. Deploy and test the local LLM service

Start the vllm model server

Reference: https://docs.mthreads.com/mtt/mtt-doc-online/start_api_server

python -m vllm.entrypoints.openai.api_server \
    --served-model-name DeepSeek-R1-7B --model DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT \
    --gpu-memory-utilization 0.4 --max-model-len 2048 --device musa \
    --port 8000 --tensor-parallel-size 1 -pp 1 --block-size 64 --max-num-seqs 1 \
    --trust-remote-code --disable-log-stats --disable-log-requests --swap-space=0
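
Model loading can take a while; before testing, you can poll the OpenAI-compatible endpoint until it answers. A small sketch using requests (the /v1/models route is the same one queried with curl below):

import time
import requests

# Poll the vllm server until the OpenAI-compatible API starts answering.
url = "http://localhost:8000/v1/models"
for _ in range(60):
    try:
        resp = requests.get(url, timeout=2)
        if resp.ok:
            print("server ready:", [m["id"] for m in resp.json()["data"]])
            break
    except requests.exceptions.ConnectionError:
        pass  # server not up yet; retry
    time.sleep(2)
else:
    print("server did not become ready in time")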

Test the model API

Get the model list

curl http://localhost:8000/v1/models

# Output
{"object":"list","data":[{"id":"DeepSeek-R1-7B","object":"model","created":1732267605,"owned_by":"vllm","root":"DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT","parent":null,"permission":[{"id":"modelperm-c163862182504fbda3f2591a1a6670e7","object":"model_permission","created":1732267605,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}

Call the model API

Note that you need to change the model name to match your deployment.

# Chinese Q&A
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-R1-7B",
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repetition_penalty": 1.05,
        "messages": [{"role": "user", "content": "介绍一下北京"}]
    }'

# Output
{"id":"cmpl-4e44a394c2ce4627a73a35fcc919d8db","object":"chat.completion","created":1732268219,"model":"DeepSeek-R1-7B","choices":[{"index":0,"message":{"role":"assistant","content":"北京,中国的首都,位于中国北部的平原上,是全国的政治、文化和交通中心。北京的地势平坦,主要由山地和低平原构成。它周围被长城环绕,是中国著名的文化名城之一。\n\n北京的历史悠久,可追溯到公元前2300年前。自秦汉时期起,北京就是中国的重要政治中心。在隋唐至清朝,北京曾被称为“大都”或“燕京”,是当时中国最大的城市之一。直到明朝,北京一直被称为“北京”。\n\n北京不仅有着悠久的历史和浓厚的文化底蕴,同时也是中国的工业和商业中心之一。北京拥有丰富的自然景观和文化遗迹,如故宫、长城、颐和园、天坛等,都是中国著名的旅游景点。\n\n北京也是国际交往的重要城市,也是中国乃至亚洲的政治、经济、文化中心之一。北京拥有众多的高校和科研机构,是中国的人才和科研中心之一。北京还是中国重要的交通枢纽之一,拥有多个国际机场和高速铁路网络。北京还拥有丰富的教育资源和科研机构,吸引了大量的人才和科研力量聚集。北京是中国乃至亚洲的科技、教育和文化中心之一,也是中国现代化和国际化的重要象征。"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":31,"total_tokens":276,"completion_tokens":245}}

# English Q&A
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "DeepSeek-R1-7B",
        "temperature": 0.7,
        "top_p": 0.8,
        "top_k": 20,
        "repetition_penalty": 1.05,
        "messages": [{"role": "user", "content": "Hello! What is your name?"}]
    }'

# Output
{"id":"cmpl-1c39630a5d204549af9d9055a01f10b1","object":"chat.completion","created":1732268198,"model":"DeepSeek-R1-7B","choices":[{"index":0,"message":{"role":"assistant","content":"Hello! I'm Qwen, created by Alibaba Cloud. I'm here to help you with any questions you might have. Is there anything specific you'd like to know or discuss?"},"logprobs":null,"finish_reason":"stop","stop_reason":null}],"usage":{"prompt_tokens":36,"total_tokens":74,"completion_tokens":38}}

Calling from Python

Streaming output

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "北京有哪些名胜古迹?"
    }],
    model=model,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        'top_k': 20,
        # Penalize repetition; vllm does not load this from the model's
        # generation config by default, so pass it explicitly. Reference:
        # https://modelscope.cn/models/Qwen/Qwen2.5-7B-Instruct/file/view/master?fileName=generation_config.json&status=1#L9
        'repetition_penalty': 1.05,
    },
    max_tokens=512,
    stream=True,  # enable streaming output
)

# Handle the streaming response
print("Chat response (streaming):")
for chunk in chat_completion:
    if chunk.choices:
        delta = chunk.choices[0].delta
        content = delta.content
        if content:
            print(content, end='', flush=True)
print("\n - Chat response (end) -\n")

Non-streaming output

from openai import OpenAI

# Modify OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"

client = OpenAI(
    # defaults to os.environ.get("OPENAI_API_KEY")
    api_key=openai_api_key,
    base_url=openai_api_base,
)

models = client.models.list()
model = models.data[0].id

chat_completion = client.chat.completions.create(
    messages=[{
        "role": "system",
        "content": "You are a helpful assistant."
    }, {
        "role": "user",
        "content": "北京有哪些名胜古迹?"
    }],
    model=model,
    temperature=0.7,
    top_p=0.8,
    extra_body={
        'top_k': 20,
        'repetition_penalty': 1.05,
    },
    max_tokens=512,
    stream=False,
)

print("Chat completion results:")
print(chat_completion)

4. Benchmark model performance (perf)

Configure perf_config.json

A reference configuration:

[
    {
        "model_name": "Qwen2.5-7B",
        "path": "Qwen2.5-7B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    },
    {
        "model_name": "Qwen2.5-3B",
        "path": "Qwen2.5-3B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    },
    {
        "model_name": "Qwen2.5-1.5B",
        "path": "Qwen2.5-1.5B-Instruct-GPTQ-Int4-MTT",
        "batchs": [1],
        "prefill_token_lens": [256, 512, 1024, 2048],
        "decode_token_lens": [64, 128]
    }
]
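
Each entry sweeps one model: batchs lists the batch sizes, prefill_token_lens the prompt lengths, and decode_token_lens the generation lengths to measure. To benchmark the DeepSeek model deployed above instead, a small sketch that writes such a config programmatically (field names taken from the reference config):

import json

# One entry per model; field names follow the reference config above.
config = [
    {
        "model_name": "DeepSeek-R1-7B",
        "path": "DeepSeek-R1-Distill-Qwen-7B-GPTQ-Int4-MTT",
        "batchs": [1],                                 # batch sizes to sweep
        "prefill_token_lens": [256, 512, 1024, 2048],  # prompt lengths
        "decode_token_lens": [64, 128],                # generated-token lengths
    }
]

with open("perf_config.json", "w") as f:
    json.dump(config, f, indent=2)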

Run the perf test

python -m mttransformer.perf_test perf_config.json

Open WebUI interface

You can also pair this with Open WebUI to create a general-purpose user interface with an OpenAI-style chat experience. For configuration details, see "Building a Personal RAG Inference Service with a Moore Threads GPU".

5. Q&A

  1. Converting the 7B model gets Killed at the end

Memory is insufficient; increase swap to more than 16 GB:

# Disable all active swap areas first
swapoff -a
# Create the file that will back the swap area
dd if=/dev/zero of=/var/swapfile bs=1G count=16
# Format it as swap
mkswap /var/swapfile
# Enable the swap file
swapon /var/swapfile
# Check current memory and swap usage
free -m
# Drop caches
sudo sh -c "echo 3 > /proc/sys/vm/drop_caches"
  2. Out of memory when starting the model

When you run into out-of-memory errors, try the following:

Lower --gpu-memory-utilization, e.g. from a higher value such as 0.5 down to 0.4 or 0.3.

Reduce --max-model-len: if the model does not need especially long context, lower it from a larger default (e.g. 4096) to a smaller value (e.g. 2048 or 1024).

  3. Severe repetition in model output

vllm currently does not automatically load the generation parameters bundled with the model's config, so add 'repetition_penalty': 1.05 on the client side, as in the extra_body examples above.