• Milvus and similarity search


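    Workflow

    The Milvus workflow is: create a collection -> create a partition -> create an index (needed if you want to search) -> insert data -> search.
    The examples here use Python with Milvus 2.3.x.
    First install the library: python3 -m pip install pymilvus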

    Connect

    from pymilvus import connections
    connections.connect(
      alias="default",
      user='username',
      password='password',
      host='localhost',
      port='19530'
    )
    
    
    connections.list_connections()
    connections.get_connection_addr('default')
    
    connections.disconnect("default")
    

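    (Screenshot: 2.png)
    The screenshot above shows the pymilvus source code; as it shows, alias is just the key of a dictionary mapping.

    (Screenshot: 3.png)
    The source code also shows two other ways to connect:

    1. Add the parameter MILVUS_URI=milvus://: to a .env file; after that, a bare connections.connect() is enough to connect.
    2. After one successful connection, the connection settings are kept in memory, so the next time you only need to call connections.connect(). The stored settings can be deleted with connections.remove_connection.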
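
    To make the role of alias concrete, here is a toy model of a connection registry in plain Python. The class ToyConnections and its attributes are hypothetical names made up for this sketch; it is not the pymilvus implementation, only an illustration of the idea that an alias is a dictionary key and that stored settings can be reused or removed.

```python
class ToyConnections:
    """Toy registry mapping an alias to stored connection settings (illustration only)."""

    def __init__(self):
        self._configs = {}        # alias -> settings remembered from earlier connects
        self._connected = set()   # aliases that are currently "connected"

    def connect(self, alias="default", **settings):
        if settings:
            # New settings were given: remember them under this alias.
            self._configs[alias] = dict(settings)
        elif alias not in self._configs:
            raise KeyError(f"no stored settings for alias {alias!r}")
        self._connected.add(alias)
        return self._configs[alias]

    def disconnect(self, alias):
        # Disconnecting keeps the stored settings so they can be reused.
        self._connected.discard(alias)

    def remove_connection(self, alias):
        # Removing also forgets the stored settings.
        self.disconnect(alias)
        self._configs.pop(alias, None)

    def list_connections(self):
        return sorted(self._configs)


registry = ToyConnections()
registry.connect(alias="default", host="localhost", port="19530")
registry.disconnect("default")

# Reconnect without arguments: the settings stored under "default" are reused.
reused = registry.connect("default")
print(reused)

registry.remove_connection("default")
print(registry.list_connections())
```

    In this sketch, disconnecting keeps the settings and only remove_connection forgets them, which mirrors the behaviour described in point 2 above.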

    Database

    from pymilvus import connections, db
    
    connections.connect(host="127.0.0.1", port=19530)
    
    db.create_database("book")
    
    db.using_database("book") # switch to the "book" database
    db.list_database()
    db.drop_database("book")
    

    Collection

    As in some non-relational databases (such as MongoDB), a Collection plays the role of a table.

    # collection
    from pymilvus import Collection, CollectionSchema, FieldSchema, DataType, utility
    
    ## Define each field's name, type, and other attributes up front; a primary key field is required
    book_id = FieldSchema(
      name="book_id",
      dtype=DataType.INT64,
      is_primary=True,
    )
    book_name = FieldSchema(
      name="book_name",
      dtype=DataType.VARCHAR,
      max_length=200,
      # The default value will be used if this field is left empty during data inserts or upserts.
      # The data type of `default_value` must be the same as that specified in `dtype`.
      default_value="Unknown"
    )
    word_count = FieldSchema(
      name="word_count",
      dtype=DataType.INT64,
      # The default value will be used if this field is left empty during data inserts or upserts.
      # The data type of `default_value` must be the same as that specified in `dtype`.
      default_value=9999
    )
    book_intro = FieldSchema(
      name="book_intro",
      dtype=DataType.FLOAT_VECTOR,
      dim=2
    )
    # dim=2 is the dimension of the vector
    
    schema = CollectionSchema(
      fields=[book_id, book_name, word_count, book_intro],
      description="Test book search",
      enable_dynamic_field=True
    )
    
    
    collection_name = "book"
    
    collection = Collection(
        name=collection_name,
        schema=schema,
        using='default',
        shards_num=2
        )
    
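    utility.rename_collection("book", "lights4")  # rename "book" to "lights4"
    utility.has_collection("lights4")             # True if the collection exists
    utility.list_collections()
    utility.rename_collection("lights4", "book")  # rename back; the following sections use "book"
    # utility.drop_collection("book")
    
    collection = Collection("book")  # get the existing collection
    # load() brings the collection into memory for search; it expects an index on the
    # vector field (see Index), and replica_number=2 needs at least two query nodes
    collection.load(replica_number=2)
    # release the collection to reduce memory usage
    collection.release()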
    

    Partition

    # Create a Partition
    
    collection = Collection("book")      # Get an existing collection.
    collection.create_partition("novel")
    

    Index

    A Milvus index determines the algorithm used for search; an index must be created before you can search.

    # Index
    index_params = {
      "metric_type":"L2",
      "index_type":"IVF_FLAT",
      "params":{"nlist":1024}
    }
    
    collection.create_index(
      field_name="book_intro", 
      index_params=index_params
    )
    
    ## metric_type is the similarity metric; the available options are:
    ## For floating point vectors:
    ## L2 (Euclidean distance)
    ## IP (Inner product)
    ## COSINE (Cosine similarity)
    ## For binary vectors:
    ## JACCARD (Jaccard distance)
    ## HAMMING (Hamming distance)
    utility.index_building_progress("")  # the argument is the collection name
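
    As a rough illustration of what the metric types listed above measure, here are the textbook definitions in plain Python. The function names are made up for this sketch, and the values Milvus reports are not guaranteed to match these exactly (for example, an engine may use a differently scaled distance internally).

```python
import math


def l2_distance(a, b):
    """Euclidean distance: smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def inner_product(a, b):
    """Inner product: larger means more similar."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Inner product of the two vectors divided by the product of their norms."""
    norm = math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    return inner_product(a, b) / norm


def hamming_distance(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def jaccard_distance(a, b):
    """1 - |intersection| / |union| of the set bits of two binary vectors."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return 1 - both / either if either else 0.0


print(l2_distance([3.0, 4.0], [0.0, 0.0]))
print(inner_product([1.0, 2.0], [3.0, 4.0]))
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))
print(jaccard_distance([1, 0, 1, 1], [1, 1, 0, 1]))
```

    Note the direction of each metric: for L2, HAMMING and JACCARD a smaller value means a closer match, while for IP and COSINE a larger value does.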
    

    Data

    Data can come from a DataFrame or from any other source, as long as the column names match the fields of the collection.

    import pandas as pd
    import numpy as np
    
    insert_data = pd.read_csv("")  # path to a CSV file whose columns match the schema fields
    mr = collection.insert(insert_data)  # returns a MutationResult
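
    Because matching column names is the key requirement, it can help to check the rows before inserting them. The sketch below is plain Python and does not call Milvus; REQUIRED_FIELDS, VECTOR_FIELD, VECTOR_DIM and check_rows are hypothetical names chosen for this example, and they assume the book schema defined earlier (book_name and word_count have defaults, so they are treated as optional here).

```python
# Fields from the "book" schema above that have no default value.
REQUIRED_FIELDS = {"book_id", "book_intro"}
VECTOR_FIELD = "book_intro"
VECTOR_DIM = 2  # must equal the dim given in the FieldSchema


def check_rows(rows):
    """Return a list of problems; an empty list means every row looks insertable."""
    problems = []
    for i, row in enumerate(rows):
        missing = REQUIRED_FIELDS - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
        vector = row.get(VECTOR_FIELD)
        if vector is not None and len(vector) != VECTOR_DIM:
            problems.append(
                f"row {i}: {VECTOR_FIELD} has {len(vector)} values, expected {VECTOR_DIM}"
            )
    return problems


rows = [
    {"book_id": 1, "book_name": "A", "word_count": 100, "book_intro": [0.1, 0.2]},
    {"book_id": 2, "book_name": "B", "book_intro": [0.3]},  # vector is too short
    {"book_name": "C", "book_intro": [0.5, 0.6]},           # primary key is missing
]

for problem in check_rows(rows):
    print(problem)
```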
    

    Search

    # search (the collection must be loaded into memory first)
    collection.load()
    search_params = {
        "metric_type": "L2", 
        "offset": 5, 
        "ignore_growing": False, 
        "params": {"nprobe": 10}
    }
    
    results = collection.search(
        data=[[0.1, 0.2]], 
        anns_field="book_intro", 
        # the sum of `offset` in `param` and `limit` 
        # should be less than 16384.
        param=search_params,
        limit=10,
        expr=None,
        # list the fields you want returned with each hit
        output_fields=['book_name'],
        consistency_level="Strong"
    )
    
    # get the IDs of all returned hits
    results[0].ids
    
    # get the distances to the query vector from all returned hits
    results[0].distances
    
    # get the value of an output field specified in the search request.
    hit = results[0][0]
    hit.entity.get('book_name')
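
    To show how limit and offset shape the result, here is a brute-force nearest-neighbour search in plain Python. It is only a sketch of the semantics, not how Milvus searches (an index such as IVF_FLAT avoids scanning every vector); brute_force_search and MAX_WINDOW are names invented for this example, with the 16384 bound taken from the comment in the search call above.

```python
import math

MAX_WINDOW = 16384  # offset + limit should stay below this value


def brute_force_search(query, vectors, limit, offset=0):
    """Return (ids, distances) of the nearest vectors by L2 distance, skipping `offset` hits."""
    if offset + limit >= MAX_WINDOW:
        raise ValueError("offset + limit should be less than 16384")
    # Score every stored vector, then sort from nearest to farthest.
    scored = sorted((math.dist(query, vec), pk) for pk, vec in vectors.items())
    window = scored[offset:offset + limit]
    ids = [pk for _, pk in window]
    distances = [dist for dist, _ in window]
    return ids, distances


# Primary key -> 2-dimensional vector, matching dim=2 in the schema.
vectors = {1: [0.0, 0.0], 2: [1.0, 0.0], 3: [0.0, 2.0], 4: [3.0, 4.0]}

ids, distances = brute_force_search([0.0, 0.0], vectors, limit=2, offset=1)
print(ids)
print(distances)
```

    With offset=1 the nearest hit is skipped, so the call returns the second and third nearest vectors, which is the behaviour used for paging through results.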
    

    The full code is on my GitHub. I hope this helps!

  • Original article: https://blog.csdn.net/majiayu000/article/details/133815205