• [Document] VectorStoreToDocument Development


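    This document component is used to retrieve documents.

    Step 1: Define the component object. The component can return one of two types: document or text.
    Step 2: Obtain the required input, namely the vector store. Here I use an in-memory vector store (the component uses it to fetch the documents and run the search).
    Step 3: When processing the result, handle the two return types separately.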

    from typing import Any, Optional

    from langchain.vectorstores.base import VectorStore


    class VectorStoreToDocument:
        def __init__(self, param_dict: Optional[dict[str, Any]] = None) -> None:
            param_dict = param_dict or {}  # tolerate a missing parameter dict
            vectorStore: VectorStore = param_dict.get("vectorStore")
            # Fall back to a default threshold of 75 when minScore is missing or empty.
            minScore = param_dict.get("minScore")
            if minScore is None or len(str(minScore)) <= 0:
                minimumScore = 75.0
            else:
                minimumScore = float(minScore)

            query: str = param_dict.get("question", "")
            outputs: Optional[dict] = param_dict.get("outputs")
            # Return type: "document" or "text" (the default).
            self.__output = outputs["output"] if outputs is not None and len(outputs) > 0 else "text"
            self.__vectorStore = vectorStore
            self.__minimumScore = minimumScore
            self.__query = query

        def source(self):
            # Each result is a (Document, score) tuple.
            docs = self.__vectorStore.similarity_search_with_score(self.__query)
            # Keep only results whose score exceeds the threshold (given as a percentage).
            # Note: whether a higher score means "more similar" depends on the vector store.
            threshold = self.__minimumScore / 100
            matched = [doc for doc, score in docs if threshold < score]
            if self.__output.lower() == "document":
                return matched
            # Text output: concatenate the page contents, one per line.
            return "".join(doc.page_content + "\n" for doc in matched)
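
The filtering step can be tried without a real vector store. The following is a minimal sketch, not the author's code: `FakeDocument`, `FakeVectorStore`, and `filter_results` are hypothetical names introduced only for illustration, and the scores are made up. It mirrors the logic of `source()`: a result is kept when its score is greater than the threshold divided by 100, and the result is returned either as a list of documents or as concatenated text.

```python
from dataclasses import dataclass


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str


class FakeVectorStore:
    """Hypothetical stand-in that returns fixed (document, score) pairs."""

    def __init__(self, scored_docs):
        self._scored_docs = scored_docs

    def similarity_search_with_score(self, query):
        # A real store would embed the query; this stub ignores it.
        return list(self._scored_docs)


def filter_results(store, query, min_score=75, output="text"):
    """Mirror source(): keep results whose score is greater than min_score / 100."""
    threshold = float(min_score) / 100
    kept = [
        doc
        for doc, score in store.similarity_search_with_score(query)
        if threshold < score
    ]
    if output.lower() == "document":
        return kept
    # Text output: one page_content per line.
    return "".join(doc.page_content + "\n" for doc in kept)


# Made-up scores, chosen only to show the effect of the threshold.
store = FakeVectorStore([
    (FakeDocument("alpha"), 0.9),
    (FakeDocument("beta"), 0.5),
    (FakeDocument("gamma"), 0.2),
])

print(filter_results(store, "anything", min_score=30, output="text"))
print(filter_results(store, "anything", min_score=75, output="document"))
```

With a threshold of 30 the first two documents pass; with 75 only the first does. Note that some vector stores return a distance rather than a similarity, in which case a lower score means a closer match and the comparison would need to be reversed.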
    

    Usage:

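    import chromadb
    from langchain.document_loaders import TextLoader
    from langchain.embeddings import OpenAIEmbeddings
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.vectorstores import Chroma

    # Assumption: the original snippet does not show how `chromadb_client` and
    # `embeddings` are created. An in-memory Chroma client and OpenAIEmbeddings
    # are used here as stand-ins; substitute your own.
    chromadb_client = chromadb.Client()
    embeddings = OpenAIEmbeddings()

    # Load the document, split it into chunks, embed each chunk and load it into the vector store.
    raw_documents = TextLoader('D:/Workspace/pythonProjectSpacework/state_of_the_union.txt').load()
    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
    documents = text_splitter.split_documents(raw_documents)
    vectorstore = Chroma.from_documents(client=chromadb_client, documents=documents, embedding=embeddings)
    retriever = vectorstore.as_retriever()
    minimumScore = 30
    # The keys must match the names read in VectorStoreToDocument.__init__:
    # "minScore", "question", and "outputs" (a dict holding "output").
    param_dict = {
        "vectorStore": vectorstore,
        "minScore": minimumScore,
        "outputs": {"output": "text"},
        "question": "president said",
    }
    # In the original project the component is imported from its own module:
    # from mth.main.flow_modules.document.MthVectorStoreToDocument import MthVectorStoreToDocument
    # Here the VectorStoreToDocument class defined above is called directly.

    text = VectorStoreToDocument(param_dict=param_dict).source()
    print(text)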
    

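    The document information retrieved by this component is later passed to the prompt when promptTemplate is used, and the prompt then hands the document content together with the question to the large language model. The incoming values therefore have to be resolved before they enter promptTemplate.
    The promptTemplate value:
    "promptValues": "{
        "context": "{{vectoreStoreToDocument_0.data.instance}}"
    }"
    The code for the processing logic: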

    import json

    valueJson = json.loads(value)  # the prompt component's values passed in
    for valKey in valueJson:
        val = valueJson[valKey]
        if val.startswith("{{") and val.endswith("}}"):
            # Strip the interpolation markers, then split on "." into a list of path segments.
            valReplace = val.replace("{{", "").replace("}}", "").split(".")
            # Look up the vectoreStoreToDocument node in this flow.
            node = [x for x in allNodes if x["id"] == valReplace[0]]
            if len(node) == 0:
                continue
            kk = node[0]
            # Walk the remaining path segments to reach the node's value, i.e. the
            # output produced by the instantiated vectoreStoreToDocument component.
            for i in range(1, len(valReplace)):
                kk = kk[valReplace[i]]
            param_dict[valKey] = kk
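
To make the lookup concrete, here is a self-contained sketch of the same placeholder resolution run against sample data. The function name `resolve_prompt_values` and the contents of `all_nodes` are assumptions made up for illustration; only the node id `vectoreStoreToDocument_0` and the `data.instance` path come from the example above.

```python
import json


def resolve_prompt_values(value, all_nodes):
    """Resolve "{{nodeId.path.to.value}}" placeholders against a list of flow nodes."""
    resolved = {}
    for key, val in json.loads(value).items():
        # Only placeholder strings are resolved; other values are skipped,
        # as in the loop above.
        if not (isinstance(val, str) and val.startswith("{{") and val.endswith("}}")):
            continue
        node_id, *path = val[2:-2].split(".")
        node = next((n for n in all_nodes if n["id"] == node_id), None)
        if node is None:
            continue  # unknown node id: skip it
        current = node
        for segment in path:
            current = current[segment]  # descend one level per path segment
        resolved[key] = current
    return resolved


# Hypothetical flow state: one node holding the text returned by source().
all_nodes = [
    {"id": "vectoreStoreToDocument_0", "data": {"instance": "retrieved text\n"}},
]
value = json.dumps({
    "context": "{{vectoreStoreToDocument_0.data.instance}}",
    "other": "{{unknownNode_0.data.instance}}",
    "literal": "plain string",
})

print(resolve_prompt_values(value, all_nodes))
```

Only `context` is resolved here: the second placeholder points at a node that does not exist, and the third value is not a placeholder at all.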
    
  • Original article: https://blog.csdn.net/weixin_44236424/article/details/133081658