OS: CentOS 7
CPU: Intel® Xeon® CPU E5-2680 v4 @ 2.40GHz, 14 cores / 28 threads
Memory: 48 GB DDR3
make --version
GNU Make 4.3
gcc --version
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
g++ --version
g++ (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
Wait a while for the build to finish.
Check the resulting binaries:
ls -lh
-rwxr-xr-x. 1 root root 1.6M Feb 23 07:54 main
-rwxr-xr-x. 1 root root 2.6M Feb 23 07:55 server
.....
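Download the model from:
https://hf-mirror.com/Qwen/Qwen1.5-72B-Chat-GGUF
The q4_k_m file comes split into two parts:
qwen1_5-72b-chat-q4_k_m.gguf.a
qwen1_5-72b-chat-q4_k_m.gguf.b
Merge them into a single file:
cat qwen1_5-72b-chat-q4_k_m.gguf.* > qwen1_5-72b-chat-q4_k_m.gguf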
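For machines without cat, the merge step can be sketched in Python. This is only a rough equivalent under my own assumptions: the helper name merge_parts and the 16 MB chunk size are made up for illustration, and the parts are assumed to sort correctly by file name (.a before .b).

```python
import shutil
from pathlib import Path


def merge_parts(part_paths, output_path, chunk_size=16 * 1024 * 1024):
    """Concatenate split files into one, in sorted name order (like `cat x.* > x`).

    merge_parts is a hypothetical helper, not part of llama.cpp.
    Returns the total number of bytes in the parts.
    """
    parts = sorted(Path(p) for p in part_paths)
    with open(output_path, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                # Copy in chunks so a multi-gigabyte part never has to fit in RAM.
                shutil.copyfileobj(src, out, chunk_size)
    return sum(p.stat().st_size for p in parts)


if __name__ == "__main__":
    base = Path("qwen1_5-72b-chat-q4_k_m.gguf")
    # Picks up qwen1_5-72b-chat-q4_k_m.gguf.a, .b, ... in the current directory.
    parts = list(base.parent.glob(base.name + ".*"))
    if parts:
        total = merge_parts(parts, base)
        print(f"merged {len(parts)} parts, {total} bytes")
```

Comparing the size of the merged file with the sum of the part sizes is a cheap sanity check before loading the model.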
./server -m /models/Qwen1.5-72B-Chat-GGUF/qwen1_5-72b-chat-q4_k_m.gguf --host 192.168.31.222 -c 1024 -t 26
My machine's IP is 192.168.31.222; replace it with your own.
Or run it interactively in the terminal:
./main -m /models/Qwen1.5-72B-Chat-GGUF/qwen1_5-72b-chat-q4_k_m.gguf -n 512 --color -i -cml -f prompts/chat-with-qwen.txt
Method 1: open the web UI in a browser
http://192.168.31.222:8080/
Method 2: call the HTTP API with curl
curl --request POST \
--url http://192.168.31.222:8080/completion \
--header "Content-Type: application/json" \
--data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'
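The same request can be sent from Python using only the standard library. A minimal sketch: the function names build_completion_request and complete are mine, and the 600 second timeout is an assumption chosen because generation on this machine takes several minutes.

```python
import json
import urllib.request


def build_completion_request(base_url, prompt, n_predict=128):
    """Build the same POST request that the curl command sends.

    Hypothetical helper; only the URL path and JSON fields come from the curl call.
    """
    payload = {"prompt": prompt, "n_predict": n_predict}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/completion",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, prompt, n_predict=128, timeout=600):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_completion_request(base_url, prompt, n_predict)
    # Generous timeout (an assumption): CPU-only inference of a 72B model is slow.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Usage would look like complete("http://192.168.31.222:8080", "Building a website can be done in 10 simple steps:"). As far as I can tell the generated text is in the "content" field of the returned JSON, but check the response from your own server version.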
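CPU utilization sits at around 2600%, with about 42 GB of memory in use. With a stronger CPU I would guess it could still reach something like 4x the speed.
Speed is 0.6 tokens/s. That is still very slow, but good enough for testing; it is a 72B model, after all. Still looking into it.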
print_timings: prompt eval time = 4839.81 ms / 13 tokens ( 372.29 ms per token, 2.69 tokens per second)
print_timings: eval time = 214075.61 ms / 128 runs ( 1672.47 ms per token, 0.60 tokens per second)
print_timings: total time = 218915.43 ms
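The tokens-per-second figures can be recomputed from the raw millisecond and token counts in the print_timings lines. A small sketch; the regular expression is my own guess at the log format based on the three lines above and may need adjusting for other llama.cpp versions.

```python
import re

# Matches lines such as "print_timings: eval time = 214075.61 ms / 128 runs (...)".
# The pattern is an assumption derived from the log lines shown above.
TIMING_RE = re.compile(
    r"print_timings:\s*(?P<name>[a-z ]+?)\s+time\s*=\s*(?P<ms>[\d.]+)\s*ms"
    r"\s*/\s*(?P<count>\d+)\s+(?:tokens|runs)"
)


def tokens_per_second(log_line):
    """Return (phase name, tokens per second) for one timing line, or None."""
    match = TIMING_RE.search(log_line)
    if match is None:
        return None  # e.g. the "total time" line, which has no token count
    seconds = float(match.group("ms")) / 1000.0
    return match.group("name"), int(match.group("count")) / seconds


if __name__ == "__main__":
    log = [
        "print_timings: prompt eval time = 4839.81 ms / 13 tokens",
        "print_timings: eval time = 214075.61 ms / 128 runs",
        "print_timings: total time = 218915.43 ms",
    ]
    for line in log:
        result = tokens_per_second(line)
        if result is not None:
            name, tps = result
            print(f"{name}: {tps:.2f} tokens/s")
    # prints:
    # prompt eval: 2.69 tokens/s
    # eval: 0.60 tokens/s
```

That is 13 / 4.83981 s for the prompt and 128 / 214.07561 s for generation, which matches the numbers the server reports.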