A Beginner's Guide to AI Inference | Triton Inference Server Fundamentals: The Framework

Run in a terminal:
docker pull nvcr.io/nvidia/tritonserver:22.05-py3
or
sudo docker pull nvcr.io/nvidia/tritonserver:22.05-py3
A successful pull ends with log output similar to:
7e9edccda8bc: Pull complete
a77d121c6271: Pull complete
074e6c40e814: Pull complete
Digest: sha256:1ddc4632dda74e3307e0251d4d7b013a5a2567988865a9fd583008c0acac6ac7
Status: Downloaded newer image for nvcr.io/nvidia/tritonserver:22.05-py3
nvcr.io/nvidia/tritonserver:22.05-py3
mkdir -p /home/triton/model_repository/<your model name>/<version>
For this quick test, run:
mkdir -p /home/triton/model_repository/fc_model_pt/1
mkdir -p /home/triton/model_repository/fc_model_onnx/1
The directory /home/triton/model_repository is the model repository; every model lives somewhere under it.
When the container is launched, model_repository is mounted into the tritonserver Docker container.
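The export script below writes model.pt and model.onnx into the version directories created above; together with the config.pbtxt files added in later steps, the repository should end up looking roughly like this:

/home/triton/model_repository/
├── fc_model_pt/
│   ├── config.pbtxt
│   └── 1/
│       └── model.pt
└── fc_model_onnx/
    ├── config.pbtxt
    └── 1/
        └── model.onnx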
import torch
import torch.nn as nn


class SimpleModel(nn.Module):
    def __init__(self):
        super(SimpleModel, self).__init__()
        self.embedding = nn.Embedding(100, 8)
        self.fc = nn.Linear(8, 4)
        self.fc_list = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

    def forward(self, input_ids):
        word_emb = self.embedding(input_ids)
        output1 = self.fc(word_emb)
        output2 = self.fc_list(word_emb)
        return output1, output2


if __name__ == "__main__":
    pt_path = "/home/triton/model_repository/fc_model_pt/1/model.pt"
    onnx_path = "/home/triton/model_repository/fc_model_onnx/1/model.onnx"
    model = SimpleModel()
    ipt = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)

    # Export a TorchScript model via tracing
    script_model = torch.jit.trace(model, ipt, strict=True)
    torch.jit.save(script_model, pt_path)

    # Export the same model to ONNX for the onnxruntime backend
    torch.onnx.export(model, ipt, onnx_path,
                      input_names=['input'],
                      output_names=['output1', 'output2'])
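As an optional sanity check (a minimal sketch, assuming onnxruntime is installed in your Python environment), you can reload both exported files and confirm they produce matching outputs before serving them:

import numpy as np
import onnxruntime as ort
import torch

pt_path = "/home/triton/model_repository/fc_model_pt/1/model.pt"
onnx_path = "/home/triton/model_repository/fc_model_onnx/1/model.onnx"
ipt = torch.tensor([[1, 2, 3], [4, 5, 6]], dtype=torch.long)

# Reload the TorchScript export and run it
ts_model = torch.jit.load(pt_path)
ts_out1, ts_out2 = ts_model(ipt)

# Run the ONNX export with onnxruntime (CPU is enough for this check)
sess = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
onnx_out1, onnx_out2 = sess.run(None, {"input": ipt.numpy()})

# Both exports come from the same weights, so the outputs should agree
print(np.allclose(ts_out1.detach().numpy(), onnx_out1, atol=1e-5))
print(np.allclose(ts_out2.detach().numpy(), onnx_out2, atol=1e-5))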
# Optional but recommended: restart Docker first
sudo systemctl restart docker
# Start the Triton server
sudo docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/triton/model_repository:/models nvcr.io/nvidia/tritonserver:22.05-py3 tritonserver --model-repository=/models --strict-model-config=false
Note: strict-model-config=false in the command above tells Triton to generate each model's configuration automatically.
If you are already familiar with the model configuration format, you can write the config files yourself first and then start the Triton server with strict-model-config set to true.
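Once the container is up, you can verify over HTTP that the server is ready and see which models were loaded, for example with the requests library:

import requests

# Server-level readiness: HTTP 200 means the server can accept requests
print(requests.get("http://localhost:8000/v2/health/ready").status_code)

# The repository index lists every model with its version and state
index = requests.post("http://localhost:8000/v2/repository/index").json()
for entry in index:
    print(entry)  # e.g. {"name": "fc_model_pt", "version": "1", "state": "READY"}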
Next, we write the config parameters for the torch version.
Create a config.pbtxt file under /home/triton/model_repository/fc_model_pt with the following content:
name: "fc_model_pt" # 模型名,也是目录名
platform: "pytorch_libtorch" # 模型对应的平台,本次使用的是torch,不同格式的对应的平台可以在官方文档找到
max_batch_size : 64 # 一次送入模型的最大bsz,防止oom
input [
{
name: "input__0" # 输入名字,对于torch来说名字于代码的名字不需要对应,但必须是<name>__<index>的形式,注意是2个下划线,写错就报错
data_type: TYPE_INT64 # 类型,torch.long对应的就是int64,不同语言的tensor类型与triton类型的对应关系可以在官方文档找到
dims: [ -1 ] # -1 代表是可变维度,虽然输入是二维的,但是默认第一个是bsz,所以只需要写后面的维度就行(无法理解的操作,如果是[-1,-1]调用模型就报错)
}
]
output [
{
name: "output__0" # 命名规范同输入
data_type: TYPE_FP32
dims: [ -1, -1, 4 ]
},
{
name: "output__1"
data_type: TYPE_FP32
dims: [ -1, -1, 8 ]
}
]
If the server starts successfully, the startup log lists the loaded models with status READY.
Because this configuration was generated automatically, we neither know the parameter values Triton chose nor can we adjust them flexibly.
Ideally, we should create a config.pbtxt file for every model.
Use the following command to fetch the configuration Triton generated for a model:
curl localhost:8000/v2/models/<your model name>/config
# For our quick test:
curl localhost:8000/v2/models/fc_model_onnx/config
which returns:
{
  "name": "fc_model_onnx",
  "platform": "onnxruntime_onnx",
  "backend": "onnxruntime",
  "version_policy": {
    "latest": {
      "num_versions": 1
    }
  },
  "max_batch_size": 0,
  "input": [{
    "name": "input",
    "data_type": "TYPE_INT64",
    "format": "FORMAT_NONE",
    "dims": [2, 3],
    "is_shape_tensor": false,
    "allow_ragged_batch": false,
    "optional": false
  }],
  "output": [{
    "name": "output2",
    "data_type": "TYPE_FP32",
    "dims": [2, 3, 8],
    "label_filename": "",
    "is_shape_tensor": false
  }, {
    "name": "output1",
    "data_type": "TYPE_FP32",
    "dims": [2, 3, 4],
    "label_filename": "",
    "is_shape_tensor": false
  }],
  "batch_input": [],
  "batch_output": [],
  "optimization": {
    "priority": "PRIORITY_DEFAULT",
    "input_pinned_memory": {
      "enable": true
    },
    "output_pinned_memory": {
      "enable": true
    },
    "gather_kernel_buffer_threshold": 0,
    "eager_batching": false
  },
  "instance_group": [{
    "name": "fc_model_onnx",
    "kind": "KIND_GPU",
    "count": 1,
    "gpus": [0],
    "secondary_devices": [],
    "profile": [],
    "passive": false,
    "host_policy": ""
  }],
  "default_model_filename": "model.onnx",
  "cc_model_filenames": {},
  "metric_tags": {},
  "parameters": {},
  "model_warmup": []
}
Using this JSON output as a reference, create a config.pbtxt file under /home/triton/model_repository/fc_model_onnx, just as for the torch version, and edit it as follows (see model_configuration in the official docs for the format requirements):
name: "fc_model_onnx"
platform: "onnxruntime_onnx"
max_batch_size : 0
input [
{
name: "input"
data_type: TYPE_INT64
dims: [2, 3]
}
]
output [
{
name: "output1"
data_type: TYPE_FP32
dims: [2, 3, 4]
},
{
name: "output2"
data_type: TYPE_FP32
dims: [2, 3, 8]
}
]
Note that max_batch_size is 0 in the onnx config, which disables Triton-managed batching, so its dims describe the full tensor shape, batch dimension included.
Now that we have explicit config files, restart Triton.
Note: once config files are in place, use the command below for every subsequent start.
# Set strict-model-config=true so Triton strictly follows the config files
sudo docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v /home/triton/model_repository:/models nvcr.io/nvidia/tritonserver:22.05-py3 tritonserver --model-repository=/models --strict-model-config=true
# If the ports are still occupied, restart Docker first
sudo systemctl restart docker
Torch version (HTTP)
import requests

request_data = {
    "inputs": [{
        "name": "input__0",
        "shape": [2, 3],
        "datatype": "INT64",
        "data": [[1, 2, 3], [4, 5, 6]]
    }],
    "outputs": [{"name": "output__0"}, {"name": "output__1"}]
}
res = requests.post(url="http://localhost:8000/v2/models/fc_model_pt/versions/1/infer", json=request_data).json()
print(res)
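The response follows the KServe v2 inference protocol: each entry of res["outputs"] carries a name, datatype, shape and a flattened data list, so it can be reshaped back into an array, e.g.:

import numpy as np

for out in res["outputs"]:
    arr = np.array(out["data"]).reshape(out["shape"])
    print(out["name"], arr.shape, arr.dtype)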
or, equivalently, with the official tritonclient HTTP client:
import numpy as np
import tritonclient.http as httpclient

triton_client = httpclient.InferenceServerClient(url="localhost:8000", verbose=False)
model_name = "fc_model_pt"
inputs = [
    httpclient.InferInput('input__0', [2, 3], "INT64")
]
outputs = [
    httpclient.InferRequestedOutput('output__0'),
    httpclient.InferRequestedOutput('output__1')
]
# the numpy dtype must match the declared INT64 datatype
inputs[0].set_data_from_numpy(np.random.randint(0, high=1000, size=(2, 3), dtype=np.int64))
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
print(results)
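print(results) only shows the InferResult wrapper object; to get the actual tensors back, call as_numpy with the output names (the gRPC client below works the same way):

out1 = results.as_numpy('output__0')  # expected shape (2, 3, 4)
out2 = results.as_numpy('output__1')  # expected shape (2, 3, 8)
print(out1.shape, out2.shape)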
ONNX version (HTTP)
import requests

request_data = {
    "inputs": [{
        "name": "input",
        "shape": [2, 3],
        "datatype": "INT64",
        "data": [[1, 2, 3], [4, 5, 6]]
    }],
    "outputs": [{"name": "output1"}, {"name": "output2"}]
}
res = requests.post(url="http://localhost:8000/v2/models/fc_model_onnx/versions/1/infer", json=request_data).json()
print(res)
Torch version (gRPC)
import numpy as np
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=False)
model_name = "fc_model_pt"
inputs = [
    grpcclient.InferInput('input__0', [2, 3], "INT64")
]
outputs = [
    grpcclient.InferRequestedOutput('output__0'),
    grpcclient.InferRequestedOutput('output__1')
]
inputs[0].set_data_from_numpy(np.random.randint(0, high=1000, size=(2, 3), dtype=np.int64))
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
print(results)
ONNX version (gRPC)
import numpy as np
import tritonclient.grpc as grpcclient

triton_client = grpcclient.InferenceServerClient(url="localhost:8001", verbose=False)
model_name = "fc_model_onnx"
inputs = [
    grpcclient.InferInput('input', [2, 3], "INT64")
]
outputs = [
    grpcclient.InferRequestedOutput('output1'),
    grpcclient.InferRequestedOutput('output2')
]
inputs[0].set_data_from_numpy(np.random.randint(0, high=1000, size=(2, 3), dtype=np.int64))
results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
print(results)
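Finally, the container also exposes Prometheus-style metrics on port 8002 (mapped in the docker run command above); a quick way to look at them from Python, assuming requests is available:

import requests

# /metrics returns a plain-text dump of request counts, latencies, GPU stats, etc.
metrics = requests.get("http://localhost:8002/metrics").text
print(metrics[:500])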