A vector is a quantity with both magnitude and direction. It can be thought of simply as an ordered list of numbers, like a one-row, multi-column matrix, e.g. [2,0,1,9,0,6,3,0]. Each row represents one data item, and each column represents one attribute of that item.
A feature vector is a vector that captures the important features of an object. A familiar example is RGB (red-green-blue) color: any color can be produced by mixing red (R), green (G), and blue (B) in some proportion, so a color can be described by the feature vector: color = [red, green, blue].
Vector retrieval means finding the K vectors in a vector database that are closest to a target vector. The distance between two vectors is usually measured with metrics such as Euclidean distance or cosine distance, which serve as a proxy for how similar the two vectors are.
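As a rough sketch of the two distance metrics just mentioned (plain Python, no external libraries; the RGB vectors are made-up sample data in the spirit of the color example above):

```python
import math

def euclidean_distance(a, b):
    # Straight-line distance between two vectors: smaller means more similar
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 means the vectors point in the same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

red = [255, 0, 0]
orange = [255, 165, 0]
blue = [0, 0, 255]

# orange is closer to red than blue is, under both metrics
print(euclidean_distance(red, orange) < euclidean_distance(red, blue))
print(cosine_distance(red, orange) < cosine_distance(red, blue))
```

Both comparisons print True here, matching the intuition that orange is more "red-like" than blue.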
Milvus official site: https://milvus.io/
GitHub repo: https://github.com/milvus-io/milvus — currently 2.6k forks and 23.7k stars.
Milvus was created in 2019 with a single goal: to store, index, and manage the massive numbers of embedding vectors generated by deep neural networks and other machine learning (ML) models. As a database purpose-built for vector queries, it can index vectors at the trillion scale. Unlike existing relational databases, which mainly handle structured data following a predefined schema, Milvus was designed from the bottom up to handle embedding vectors converted from unstructured data.
Milvus is an open-source vector database that supports inserting, deleting, and updating TB-scale vector data as well as near-real-time queries, and it is highly flexible, reliable, and fast. It integrates widely used vector index libraries such as Faiss, NMSLIB, and Annoy, and supports data partitioning and sharding, data persistence, incremental ingestion, hybrid scalar/vector queries, time travel, and more. It heavily optimizes vector-retrieval performance to cover virtually any vector-search scenario, and it exposes a simple, intuitive API that lets you pick a different index type for each use case. In addition, Milvus can filter on scalar data, which further improves recall and makes search more flexible.
Milvus adopts a shared-storage architecture in which storage and compute are fully separated and compute nodes scale out horizontally. Architecturally, Milvus separates the data plane from the control plane and consists of four layers: the access layer, the coordinator service, the worker nodes, and the storage layer. The layers are mutually independent, each scaling and failing over on its own.
With the growth of the internet, unstructured data has become ever more common: emails, papers, IoT sensor data, Facebook photos, protein structures, and so on. To let computers understand and process it, embedding techniques convert this data into vectors, which Milvus stores and indexes. Milvus can then analyze the correlation between two vectors by computing their similarity distance: if two embedding vectors are very similar, their original data sources are very similar as well.
This section installs Milvus standalone with Docker Compose. Before installing, check the hardware and software requirements: at least 2 vCPUs and 8 GB of initial memory, otherwise the installation may fail.
See the official docs: https://milvus.io/docs/install_standalone-docker.md
First install wget:
yum install wget
Download the compose file, then install with docker-compose:
wget https://github.com/milvus-io/milvus/releases/download/v2.3.2/milvus-standalone-docker-compose.yml
If you cannot reach GitHub, download the file in a browser instead and upload it to the VM with the rz command:
yum install lrzsz -y
Milvus will later generate a lot of data in the current directory, so it is best to create a new directory first and upload the docker-compose.yml file there:
mkdir milvus
cd milvus
rz
yum install docker
First find docker-compose on GitHub: https://github.com/docker/compose/releases
Check the system information: uname -s prints the kernel name and uname -m the machine architecture (uname -a prints everything at once); the download URL below substitutes these two values.
uname -a
Download the install file:
sudo curl -L "https://github.com/docker/compose/releases/download/v2.23.0/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
Grant execute permission:
sudo chmod +x /usr/local/bin/docker-compose
Create a symlink:
sudo ln -s /usr/local/bin/docker-compose /usr/bin/docker-compose
Test: if a version number is printed, the installation succeeded:
docker-compose --version
Alternatively, install docker-compose through pip. First install pip:
yum -y install epel-release
yum -y install python-pip
Upgrade pip:
pip install --upgrade pip
Install docker-compose:
pip install docker-compose
Verify the installation:
docker-compose --version
Running docker-compose without the Docker daemon running will fail, so start Docker first and enable it at boot:
systemctl start docker
systemctl enable docker
Run the containers in the background with docker-compose up -d. The default compose file name is docker-compose.yml; the -f flag selects a different file:
sudo docker-compose -f milvus-standalone-docker-compose.yml up -d
Check the container status with the command below; in my case every container failed to start:
docker-compose ps -a
Inspect the error log of a failing container:
sudo docker-compose logs etcd | grep error
The reported error is that a directory cannot be accessed: open /etcd: permission denied. This happens because the SELinux security module on CentOS 7 blocks the access. To fix it, run
setenforce 0
to disable SELinux temporarily, or start the containers with --privileged=true, or add privileged: true to each service in the compose file (Milvus standalone starts three services, so it has to be added in three places). Note: if you changed the data directory, remove the previous containers first:
docker rm milvus-standalone
docker rm milvus-minio
docker rm milvus-etcd
Then restart the containers (I run them in the foreground here, without the -d flag, so the logs are easy to watch):
sudo docker-compose -f milvus-standalone-docker-compose.yml up
If you connect from Python on Windows to Milvus inside the VM, you only need a Python environment on Windows (mine is Anaconda with Python 3.11.4):
pip3 install pymilvus==2.3.2
To run Python inside the VM instead, set up a Python environment there as follows.
First, install Anaconda: https://repo.anaconda.com/archive/
wget https://repo.anaconda.com/archive/Anaconda3-2023.09-0-Linux-x86_64.sh
Since wget failed for me, I downloaded the file on Windows and transferred it to the CentOS 7 VM with rz.
After downloading the .sh installer, make it executable:
chmod +x Anaconda3-2023.09-0-Linux-x86_64.sh
Then run the installer:
bash Anaconda3-2023.09-0-Linux-x86_64.sh
# Follow the prompt and press Enter
Please, press ENTER to continue
>>> ENTER
# Press Enter a couple more times to page through the license
# Type yes to accept the license terms
Do you accept the license terms? [yes|no][no]
>>> yes
# After a few more yes answers, the installation completes
Once installed, Anaconda writes itself into your environment variables (the path varies with the user name). Refresh them with
source ~/.bashrc
and then check the conda version with
conda --version
At the time of writing the latest conda is 23.7.4, bundling Python 3.11.5 (CentOS 7's default Python is 2.7.5; the 3.11.5 comes from conda's base environment).
Milvus requires Python 3.7+. With the environment ready, install the pymilvus dependency.
See the official docs: https://milvus.io/docs/install-pymilvus.md
conda install numpy
pip3 install pymilvus==2.3.2
Note: pymilvus cannot be installed with conda by default, but pip3 installs it directly.
Verify the setup with the following command (no error means success):
python3 -c "from pymilvus import Collection"
You can also run the same import from the interactive Python shell.
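A slightly fuller environment check than the one-liner above can also be handy; this sketch only verifies that the client library is importable, and needs no running Milvus server (it assumes pymilvus exposes a `__version__` attribute, which recent releases do):

```python
# Check that the pymilvus client library is importable; no Milvus server is needed
import importlib.util

spec = importlib.util.find_spec("pymilvus")
if spec is None:
    print("pymilvus is not installed - run: pip3 install pymilvus==2.3.2")
else:
    import pymilvus
    print("pymilvus version:", pymilvus.__version__)
```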
Milvus listens on port 19530:
docker port milvus-standalone
docker port milvus-standalone 19530/tcp
I use the open-source VirtualBox with NAT networking, so I add a port-forwarding rule for 19530. For the guest IP, run
ip a
in the VM and substitute your VM's address.
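To confirm the forwarded port is actually reachable before running any client code, a small check like this can help (plain Python; the host/port values are just the ones used in this setup):

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# With the VirtualBox forwarding rule in place and Milvus running,
# this is expected to print True
print(port_open("127.0.0.1", 19530))
```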
Milvus provides a Python demo on GitHub at https://github.com/milvus-io/pymilvus/tree/master/examples; the test file is hello_milvus.py.
Since reaching GitHub may require a VPN, the code is reproduced below.
# hello_milvus.py demonstrates the basic operations of PyMilvus, a Python SDK of Milvus.
# 1. connect to Milvus
# 2. create collection
# 3. insert data
# 4. create index
# 5. search, query, and hybrid search on entities
# 6. delete entities by PK
# 7. drop collection
import time
import numpy as np
from pymilvus import (
    connections,
    utility,
    FieldSchema, CollectionSchema, DataType,
    Collection,
)
fmt = "\n=== {:30} ===\n"
search_latency_fmt = "search latency = {:.4f}s"
num_entities, dim = 3000, 8
#################################################################################
# 1. connect to Milvus
# Add a new connection alias `default` for Milvus server in `localhost:19530`
# Actually the "default" alias is built into PyMilvus.
# If the address of Milvus is the same as `localhost:19530`, you can omit all
# parameters and call the method as: `connections.connect()`.
#
# Note: the `using` parameter of the following methods is default to "default".
print(fmt.format("start connecting to Milvus"))
connections.connect("default", host="localhost", port="19530")
has = utility.has_collection("hello_milvus")
print(f"Does collection hello_milvus exist in Milvus: {has}")
#################################################################################
# 2. create collection
# We're going to create a collection with 3 fields.
# +-+------------+------------+------------------+------------------------------+
# | | field name | field type | other attributes | field description |
# +-+------------+------------+------------------+------------------------------+
# |1| "pk" | VarChar | is_primary=True | "primary field" |
# | | | | auto_id=False | |
# +-+------------+------------+------------------+------------------------------+
# |2| "random" | Double | | "a double field" |
# +-+------------+------------+------------------+------------------------------+
# |3|"embeddings"| FloatVector| dim=8 | "float vector with dim 8" |
# +-+------------+------------+------------------+------------------------------+
fields = [
    FieldSchema(name="pk", dtype=DataType.VARCHAR, is_primary=True, auto_id=False, max_length=100),
    FieldSchema(name="random", dtype=DataType.DOUBLE),
    FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=dim)
]
schema = CollectionSchema(fields, "hello_milvus is the simplest demo to introduce the APIs")
print(fmt.format("Create collection `hello_milvus`"))
hello_milvus = Collection("hello_milvus", schema, consistency_level="Strong")
################################################################################
# 3. insert data
# We are going to insert 3000 rows of data into `hello_milvus`
# Data to be inserted must be organized in fields.
#
# The insert() method returns:
# - either automatically generated primary keys by Milvus if auto_id=True in the schema;
# - or the existing primary key field from the entities if auto_id=False in the schema.
print(fmt.format("Start inserting entities"))
rng = np.random.default_rng(seed=19530)
entities = [
    # provide the pk field because `auto_id` is set to False
    [str(i) for i in range(num_entities)],
    rng.random(num_entities).tolist(),  # field random, only supports list
    rng.random((num_entities, dim)),    # field embeddings, supports numpy.ndarray and list
]
insert_result = hello_milvus.insert(entities)
hello_milvus.flush()
print(f"Number of entities in Milvus: {hello_milvus.num_entities}") # check the num_entities
################################################################################
# 4. create index
# We are going to create an IVF_FLAT index for hello_milvus collection.
# create_index() can only be applied to `FloatVector` and `BinaryVector` fields.
print(fmt.format("Start Creating index IVF_FLAT"))
index = {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
}
hello_milvus.create_index("embeddings", index)
################################################################################
# 5. search, query, and hybrid search
# After data were inserted into Milvus and indexed, you can perform:
# - search based on vector similarity
# - query based on scalar filtering(boolean, int, etc.)
# - hybrid search based on vector similarity and scalar filtering.
#
# Before conducting a search or a query, you need to load the data in `hello_milvus` into memory.
print(fmt.format("Start loading"))
hello_milvus.load()
# -----------------------------------------------------------------------------
# search based on vector similarity
print(fmt.format("Start searching based on vector similarity"))
vectors_to_search = entities[-1][-2:]
search_params = {
    "metric_type": "L2",
    "params": {"nprobe": 10},
}
start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, output_fields=["random"])
end_time = time.time()
for hits in result:
    for hit in hits:
        print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))
# -----------------------------------------------------------------------------
# query based on scalar filtering(boolean, int, etc.)
print(fmt.format("Start querying with `random > 0.5`"))
start_time = time.time()
result = hello_milvus.query(expr="random > 0.5", output_fields=["random", "embeddings"])
end_time = time.time()
print(f"query result:\n-{result[0]}")
print(search_latency_fmt.format(end_time - start_time))
# -----------------------------------------------------------------------------
# pagination
r1 = hello_milvus.query(expr="random > 0.5", limit=4, output_fields=["random"])
r2 = hello_milvus.query(expr="random > 0.5", offset=1, limit=3, output_fields=["random"])
print(f"query pagination(limit=4):\n\t{r1}")
print(f"query pagination(offset=1, limit=3):\n\t{r2}")
# -----------------------------------------------------------------------------
# hybrid search
print(fmt.format("Start hybrid searching with `random > 0.5`"))
start_time = time.time()
result = hello_milvus.search(vectors_to_search, "embeddings", search_params, limit=3, expr="random > 0.5", output_fields=["random"])
end_time = time.time()
for hits in result:
    for hit in hits:
        print(f"hit: {hit}, random field: {hit.entity.get('random')}")
print(search_latency_fmt.format(end_time - start_time))
###############################################################################
# 6. delete entities by PK
# You can delete entities by their PK values using boolean expressions.
ids = insert_result.primary_keys
expr = f'pk in ["{ids[0]}" , "{ids[1]}"]'
print(fmt.format(f"Start deleting with expr `{expr}`"))
result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query before delete by expr=`{expr}` -> result: \n-{result[0]}\n-{result[1]}\n")
hello_milvus.delete(expr)
result = hello_milvus.query(expr=expr, output_fields=["random", "embeddings"])
print(f"query after delete by expr=`{expr}` -> result: {result}\n")
###############################################################################
# 7. drop collection
# Finally, drop the hello_milvus collection
print(fmt.format("Drop collection `hello_milvus`"))
utility.drop_collection("hello_milvus")
After saving the script, run
docker ps -a
to make sure every Milvus container in the VM is up, then execute
python hello_milvus.py
or run it from an editor (mine is Spyder, bundled with Anaconda). The full output:
=== start connecting to Milvus ===
Does collection hello_milvus exist in Milvus: False
=== Create collection `hello_milvus` ===
=== Start inserting entities ===
Number of entities in Milvus: 3000
=== Start Creating index IVF_FLAT ===
=== Start loading ===
=== Start searching based on vector similarity ===
hit: id: 2998, distance: 0.0, entity: {'random': 0.9728033590489911}, random field: 0.9728033590489911
hit: id: 1262, distance: 0.08883658051490784, entity: {'random': 0.2978858685751561}, random field: 0.2978858685751561
hit: id: 1265, distance: 0.09590047597885132, entity: {'random': 0.3042039939240304}, random field: 0.3042039939240304
hit: id: 2999, distance: 0.0, entity: {'random': 0.02316334456872482}, random field: 0.02316334456872482
hit: id: 1580, distance: 0.05628091096878052, entity: {'random': 0.3855988746044062}, random field: 0.3855988746044062
hit: id: 2377, distance: 0.08096685260534286, entity: {'random': 0.8745922204004368}, random field: 0.8745922204004368
search latency = 0.4576s
=== Start querying with `random > 0.5` ===
query result:
-{'embeddings': [0.20963514, 0.39746657, 0.12019053, 0.6947492, 0.9535575, 0.5454552, 0.82360446, 0.21096309], 'pk': '0', 'random': 0.6378742006852851}
search latency = 0.5080s
query pagination(limit=4):
[{'random': 0.6378742006852851, 'pk': '0'}, {'random': 0.5763523024650556, 'pk': '100'}, {'random': 0.9425935891639464, 'pk': '1000'}, {'random': 0.7893211256191387, 'pk': '1001'}]
query pagination(offset=1, limit=3):
[{'random': 0.5763523024650556, 'pk': '100'}, {'random': 0.9425935891639464, 'pk': '1000'}, {'random': 0.7893211256191387, 'pk': '1001'}]
=== Start hybrid searching with `random > 0.5` ===
hit: id: 2998, distance: 0.0, entity: {'random': 0.9728033590489911}, random field: 0.9728033590489911
hit: id: 747, distance: 0.14606499671936035, entity: {'random': 0.5648774800635661}, random field: 0.5648774800635661
hit: id: 2527, distance: 0.1530652642250061, entity: {'random': 0.8928974315571507}, random field: 0.8928974315571507
hit: id: 2377, distance: 0.08096685260534286, entity: {'random': 0.8745922204004368}, random field: 0.8745922204004368
hit: id: 2034, distance: 0.20354536175727844, entity: {'random': 0.5526117606328499}, random field: 0.5526117606328499
hit: id: 958, distance: 0.21908017992973328, entity: {'random': 0.6647383716417955}, random field: 0.6647383716417955
search latency = 0.1996s
=== Start deleting with expr `pk in ["0" , "1"]` ===
query before delete by expr=`pk in ["0" , "1"]` -> result:
-{'random': 0.6378742006852851, 'embeddings': [0.20963514, 0.39746657, 0.12019053, 0.6947492, 0.9535575, 0.5454552, 0.82360446, 0.21096309], 'pk': '0'}
-{'random': 0.43925103574669633, 'embeddings': [0.52323616, 0.8035404, 0.77824664, 0.80369574, 0.4914803, 0.8265614, 0.6145269, 0.80234545], 'pk': '1'}
query after delete by expr=`pk in ["0" , "1"]` -> result: []
=== Drop collection `hello_milvus` ===
Stop Milvus with
sudo docker-compose down
After stopping it, the data the Milvus containers mounted on the host can be deleted with:
sudo rm -rf volumes
Attu, the web GUI for Milvus, has install instructions on GitHub: https://github.com/zilliztech/attu/blob/main/doc/zh-CN/attu_install-docker.md
The docker run command is:
docker run -p 8000:3000 -e HOST_URL=http://{ your machine IP }:8000 -e MILVUS_URL={your machine IP}:19530 zilliz/attu:latest
Note: 127.0.0.1 can be used directly as the machine IP here:
docker run -p 8000:3000 -e HOST_URL=http://127.0.0.1:8000 -e MILVUS_URL=127.0.0.1:19530 zilliz/attu:latest
Starting it failed:
$ node
node[9]: ../src/node_platform.cc:61:std::unique_ptr node::WorkerThreadsTaskRunner::DelayedTaskScheduler::Start(): Assertion `(0) == (uv_thread_create(t.get(), start_thread, this))' failed.
1: 0xb57f90 node::Abort() [node]
2: 0xb5800e [node]
3: 0xbc915e [node]
4: 0xbc9230 node::NodePlatform::NodePlatform(int, v8::TracingController*, v8::PageAllocator*) [node]
5: 0xb1b3d1 node::InitializeOncePerProcess(int, char**, node::InitializationSettingsFlags, node::ProcessFlags::Flags) [node]
6: 0xb1bc89 node::Start(int, char**) [node]
7: 0x7f2ca389fd90 [/lib/x86_64-linux-gnu/libc.so.6]
8: 0x7f2ca389fe40 __libc_start_main [/lib/x86_64-linux-gnu/libc.so.6]
9: 0xa93f0e _start [node]
Aborted (core dumped)
Checking with
docker version
shows that the Docker installed from CentOS 7's yum repos is too old. Reinstall Docker:
sudo yum remove docker \
docker-client \
docker-client-latest \
docker-common \
docker-latest \
docker-latest-logrotate \
docker-logrotate \
docker-engine
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
sudo systemctl start docker
sudo systemctl enable docker
docker version
Remove the old containers one by one with docker rm <container ID>
(list the IDs with
docker ps -a
). Then start milvus-standalone and Attu (adding a -d flag to the command from the docs so they run in the background):
docker-compose -f milvus-standalone-docker-compose.yml up -d
docker run -d -p 8000:3000 -e HOST_URL=http://127.0.0.1:8000 -e MILVUS_URL=127.0.0.1:19530 zilliz/attu:latest
Then add a port-forwarding rule for 8000 in VirtualBox's NAT settings.
Open a browser on Windows and go to http://localhost:8000
Clicking Connect fails with: Error: 14 UNAVAILABLE: No connection established
List all container IDs, stop the broken Attu container, then remove it:
docker ps -a
docker stop
docker rm
Changing the Milvus IP from 127.0.0.1 to either the host's own IP (enp0s3) or Docker's bridge IP (docker0) both work. Either
ip a
or
ifconfig
lists the IPs of all local network interfaces.
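A quick way to guess a usable host IP from Python is the common UDP-socket trick below (not Milvus-specific; the 8.8.8.8 address is only used for route selection, no packets are sent, and the sketch falls back to 127.0.0.1 when no route exists):

```python
import socket

def primary_ip():
    """Best-effort guess of this machine's primary IPv4 address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket only selects a route; nothing is transmitted
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]
    except OSError:
        return "127.0.0.1"
    finally:
        s.close()

print(primary_ip())  # substitute this value into MILVUS_URL
```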
Restart the Attu container with the corrected IP:
docker run -d -p 8000:3000 -e HOST_URL=http://127.0.0.1:8000 -e MILVUS_URL=172.17.0.1:19530 zilliz/attu:latest
Then open http://localhost:8000 in the external browser, connect, and log in. This time the connection succeeds.
Alternatively, add Attu to the docker-compose.yml file so everything starts together:
  attu:
    container_name: attu
    image: zilliz/attu:v2.2.6
    environment:
      MILVUS_URL: milvus-standalone:19530
    ports:
      - "8000:3000"
    depends_on:
      - "standalone"
Reference: https://blog.csdn.net/sinat_39620217/article/details/131847096