A new private, offline chat experience! LlamaGPT: a blazing-fast, secure chatbot powered by Llama 2, with Code Llama support!

    A self-hosted, offline, ChatGPT-like chatbot, powered by Llama 2. 100% private, with no data leaving your device.

    Demo

    https://github.com/getumbrel/llama-gpt/assets/10330103/5d1a76b8-ed03-4a51-90bd-12ebfaf1e6cd

1. Supported models

    Currently, LlamaGPT supports the following models. Support for running custom models is on the roadmap.

Model name                                | Model size | Model download size | Memory required
    Nous Hermes Llama 2 7B Chat (GGML q4_0)   | 7B         | 3.79GB              | 6.29GB
    Nous Hermes Llama 2 13B Chat (GGML q4_0)  | 13B        | 7.32GB              | 9.82GB
    Nous Hermes Llama 2 70B Chat (GGML q4_0)  | 70B        | 38.87GB             | 41.37GB
    Code Llama 7B Chat (GGUF Q4_K_M)          | 7B         | 4.24GB              | 6.74GB
    Code Llama 13B Chat (GGUF Q4_K_M)         | 13B        | 8.06GB              | 10.56GB
    Phind Code Llama 34B Chat (GGUF Q4_K_M)   | 34B        | 20.22GB             | 22.72GB

1.1 Installing LlamaGPT on umbrelOS

Running LlamaGPT on an umbrelOS home server takes just one click: simply install it from the Umbrel App Store.

1.2 Installing LlamaGPT on an M1/M2 Mac

Make sure you have Docker and Xcode installed.

    Then, clone this repo and cd into it:

    git clone https://github.com/getumbrel/llama-gpt.git
    cd llama-gpt
    
    Run LlamaGPT with the following command:

    ./run-mac.sh --model 7b
    
    You can access LlamaGPT at http://localhost:3000.

    To run 13B or 70B chat models, replace 7b with 13b or 70b respectively.
    To run 7B, 13B or 34B Code Llama models, replace 7b with code-7b, code-13b or code-34b respectively.
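
    For instance, to switch to the 13B chat model or the 34B Code Llama model:

    ./run-mac.sh --model 13b        # 13B chat model
    ./run-mac.sh --model code-34b   # 34B Code Llama model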

To stop LlamaGPT, press Ctrl + C in Terminal.

1.3 Installing with Docker

    You can run LlamaGPT on any x86 or arm64 system. Make sure you have Docker installed.

    Then, clone this repo and cd into it:

    git clone https://github.com/getumbrel/llama-gpt.git
    cd llama-gpt
    
    Run LlamaGPT with the following command:

    ./run.sh --model 7b
    
Or, if you have an Nvidia GPU, you can run LlamaGPT with CUDA support using the --with-cuda flag:

    ./run.sh --model 7b --with-cuda
    
    You can access LlamaGPT at http://localhost:3000.

    To run 13B or 70B chat models, replace 7b with 13b or 70b respectively.
    To run Code Llama 7B, 13B or 34B models, replace 7b with code-7b, code-13b or code-34b respectively.
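
    For instance, to run the 13B chat model with CUDA acceleration on an Nvidia GPU:

    ./run.sh --model 13b --with-cuda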

To stop LlamaGPT, press Ctrl + C in Terminal.

    Note: On the first run, it may take a while for the model to be downloaded to the /models directory. You may also see lots of output like this for a few minutes, which is normal:

    llama-gpt-llama-gpt-ui-1       | [INFO  wait] Host [llama-gpt-api-13b:8000] not yet available...
    
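    If you want to watch the download and startup progress yourself, you can tail the container logs. A minimal sketch, assuming the stack is managed with Docker Compose (as the container names above suggest) and that models are saved to ./models inside the repo:

    docker compose logs -f   # follow the logs of all LlamaGPT containers
    ls -lh models/           # check how much of the model has downloaded so far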

    After the model has been automatically downloaded and loaded, and the API server is running, you’ll see an output like:

    llama-gpt-ui_1   | ready - started server on 0.0.0.0:3000, url: http://localhost:3000
    
    You can then access LlamaGPT at http://localhost:3000.


1.4 Installing on Kubernetes

    First, make sure you have a running Kubernetes cluster and kubectl is configured to interact with it.

    Then, clone this repo and cd into it.

To deploy to Kubernetes, first create a namespace:

    kubectl create ns llama
    
Then apply the manifests under the /deploy/kubernetes directory:

    kubectl apply -k deploy/kubernetes/. -n llama
    
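    Before exposing anything, you can verify that the pods came up. This assumes the manifests place their workloads in the llama namespace created above:

    kubectl get pods -n llama   # wait until all pods report Running/Ready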

Expose the service however you normally would.
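
    For quick local access without an Ingress or LoadBalancer, port-forwarding is one option. The service name llama-gpt-ui below is an assumption; run kubectl get svc -n llama to see the actual names:

    # Forward the UI to localhost:3000 (service name is an assumption)
    kubectl port-forward -n llama svc/llama-gpt-ui 3000:3000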

2. OpenAI-compatible API

    Thanks to llama-cpp-python, a drop-in replacement for OpenAI API is available at http://localhost:3001. Open http://localhost:3001/docs to see the API documentation.
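
    For example, a chat completion can be requested with curl against the standard OpenAI chat endpoint. A minimal sketch; the exact set of accepted payload fields depends on your llama-cpp-python version:

    curl http://localhost:3001/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "messages": [
          {"role": "user", "content": "How does the universe expand?"}
        ],
        "temperature": 0
      }'

    Existing OpenAI client libraries can also be pointed at this server by overriding their base URL to http://localhost:3001/v1.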

3. Benchmarks

We’ve tested LlamaGPT models on the following hardware with the default system prompt and the user prompt “How does the universe expand?” at temperature 0, to guarantee deterministic results. Generation speed is averaged over the first 10 generations.

    Feel free to add your own benchmarks to this table by opening a pull request.

3.1 Nous Hermes Llama 2 7B Chat (GGML q4_0)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 54 tokens/sec
    GCP c2-standard-16 vCPU (64 GB RAM)  | 16.7 tokens/sec
    Ryzen 5700G 4.4GHz 4c (16 GB RAM)    | 11.50 tokens/sec
    GCP c2-standard-4 vCPU (16 GB RAM)   | 4.3 tokens/sec
    Umbrel Home (16GB RAM)               | 2.7 tokens/sec
    Raspberry Pi 4 (8GB RAM)             | 0.9 tokens/sec

3.2 Nous Hermes Llama 2 13B Chat (GGML q4_0)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 20 tokens/sec
    GCP c2-standard-16 vCPU (64 GB RAM)  | 8.6 tokens/sec
    GCP c2-standard-4 vCPU (16 GB RAM)   | 2.2 tokens/sec
    Umbrel Home (16GB RAM)               | 1.5 tokens/sec

3.3 Nous Hermes Llama 2 70B Chat (GGML q4_0)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 4.8 tokens/sec
    GCP e2-standard-16 vCPU (64 GB RAM)  | 1.75 tokens/sec
    GCP c2-standard-16 vCPU (64 GB RAM)  | 1.62 tokens/sec

3.4 Code Llama 7B Chat (GGUF Q4_K_M)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 41 tokens/sec

3.5 Code Llama 13B Chat (GGUF Q4_K_M)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 25 tokens/sec

3.6 Phind Code Llama 34B Chat (GGUF Q4_K_M)

Device                               | Generation speed
    M1 Max MacBook Pro (64GB RAM)        | 10.26 tokens/sec

• Original article: https://blog.csdn.net/sinat_39620217/article/details/133762478