• Elasticsearch: Chat with multiple PDFs | LangChain Python app tutorial (free LLMs and embeddings)


    In this blog post, you will learn how to create a LangChain application that chats with multiple PDF files, using the ChatGPT API and Huggingface language models.

    As shown above, on the far left we ingest the PDF files, concatenate their text, and split it into chunks. We run the chunks through a Huggingface model to produce embeddings, which we write to and persist in an Elasticsearch vector database. At query time, we vectorize the question through LangChain and run a vector search with Elasticsearch. Finally, we pass the retrieved context to a large language model, which answers the question. Our final interface looks like this:

    As shown above, the app can answer questions about our documents.
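
    Conceptually, the whole pipeline boils down to a handful of calls. Here is a minimal sketch of the flow; the lib_* helper modules used below are built step by step in the rest of this post, and the placeholder chunks stand in for real PDF text:

    import lib_embeddings, lib_indexer, lib_llm, lib_vectordb

    hf = lib_embeddings.setup_embeddings()                 # Huggingface sentence embeddings
    db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")  # Elasticsearch as the vector store
    llm_chain = lib_llm.make_the_llm()                     # local flan-t5-large LLM chain

    # Ingest: text chunks from the PDFs -> embeddings -> Elasticsearch index
    chunks = ["chunk one ...", "chunk two ..."]            # produced by the text splitter shown later
    lib_indexer.loadPdfChunks(chunks, url, hf, db, "pdf_docs")

    # Ask: vector search for the closest chunk, then answer with the LLM
    docs = db.similarity_search("my question")
    answer = llm_chain.run(context=docs[0].page_content, question="my question")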

    All of the source code can be downloaded from GitHub - liu-xiao-guo/ask-multiple-pdfs: A Langchain app that allows you to chat with multiple PDFs.

    Installation

    If you have not yet installed your own Elasticsearch and Kibana, please refer to the installation guides:

    When installing, we follow the installation guide for Elastic Stack 8.x. By default, access to the Elasticsearch cluster is secured with HTTPS.

    After installation, we can find the certificate file http_ca.crt in the following location under the Elasticsearch installation directory:

    $ pwd
    /Users/liuxg/elastic/elasticsearch-8.10.0/config/certs
    $ ls
    http.p12       http_ca.crt    transport.p12

    We need to copy this certificate into the root directory of our project:

    $ tree -L 3
    .
    ├── app.py
    ├── docs
    │   └── PDF-LangChain.jpg
    ├── htmlTemplates.py
    ├── http_ca.crt
    ├── lib_embeddings.py
    ├── lib_indexer.py
    ├── lib_llm.py
    ├── lib_vectordb.py
    ├── myapp.py
    ├── pdf_files
    │   ├── sample1.pdf
    │   └── sample2.pdf
    ├── readme.md
    ├── requirements.txt
    └── simple.cfg

    As shown above, we copy http_ca.crt into the application's root directory. We put two PDF files for testing in pdf_files; you can use your own PDF files instead. We configure simple.cfg as follows:

    ES_SERVER: "localhost"
    ES_PASSWORD: "vXDWYtL*my3vnKY9zCfL"
    ES_FINGERPRINT: "e2c1512f617f432ddf242075d3af5177b28f6497fecaaa0eea11429369bb7b00"

    Above, we need to configure ES_SERVER, the address of the Elasticsearch cluster. ES_PASSWORD is the password of the Elasticsearch superuser elastic. We can find ES_FINGERPRINT in the output shown when Elasticsearch starts for the first time:

    You can also get the fingerprint from Kibana's configuration file config/kibana.yml:
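
    Before wiring up the app, it is worth verifying that these credentials and the certificate actually work. A minimal check with the elasticsearch Python client, reusing the same simple.cfg values (a small sketch, not part of the project code), might look like this:

    from config import Config
    from elasticsearch import Elasticsearch

    # Read the same values the app uses
    with open('simple.cfg') as f:
        cfg = Config(f)

    es = Elasticsearch(f"https://{cfg['ES_SERVER']}:9200",
                       basic_auth=("elastic", cfg['ES_PASSWORD']),
                       ca_certs="./http_ca.crt")   # the certificate copied above
    print(es.info())   # prints the cluster name and version if the configuration is correct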

    In the project directory we can also see a file called .env.example. We can rename it to .env with the following command:

    mv .env.example .env

    In .env, we enter the token obtained from the huggingface.co website:

    $ cat .env
    OPENAI_API_KEY=your_openai_key
    HUGGINGFACEHUB_API_TOKEN=your_huggingface_key

    In this example we will use Huggingface for testing. If you want to use OpenAI instead, you need to configure its key as well. A Huggingface developer token can be obtained from the huggingface.co website.
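
    For reference, load_dotenv() loads these variables into the process environment, where the libraries pick them up. A small sketch of what happens (not part of the project code):

    import os
    from dotenv import load_dotenv

    load_dotenv()   # reads .env in the current directory into os.environ
    hf_token = os.environ.get("HUGGINGFACEHUB_API_TOKEN")
    print("Huggingface token configured:", hf_token is not None)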

    Running the project

    Before running the project, you need to set up a virtual environment and install the dependencies:

    python3 -m venv env
    source env/bin/activate
    python3 -m pip install --upgrade pip
    pip install -r requirements.txt

    Creating the interface

    We build the application's interface with streamlit, which is very simple to do. We can see the following code in myapp.py:

    myapp.py

    import streamlit as st
    from dotenv import load_dotenv
    from PyPDF2 import PdfReader
    from htmlTemplates import css, bot_template, user_template

    def get_pdf_texts(pdf_docs):
        text = ""
        for pdf in pdf_docs:
            pdf_reader = PdfReader(pdf)
            for page in pdf_reader.pages:
                text += page.extract_text()
        return text

    def main():
        load_dotenv()
        st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
        st.write(css, unsafe_allow_html=True)

        st.header("Chat with multiple PDFs :books:")
        user_question = st.text_input("Ask a question about your documents")
        if user_question:
            pass

        st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
        st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

        # Add a side bar
        with st.sidebar:
            st.subheader("Your documents")
            pdf_docs = st.file_uploader(
                "Upload your PDFs here and click on Process", accept_multiple_files=True)
            print(pdf_docs)

            if st.button("Process"):
                with st.spinner("Processing"):
                    # Get the pdf text
                    raw_text = get_pdf_texts(pdf_docs)
                    st.write(raw_text)

    if __name__ == "__main__":
        main()

    In the code above, I created a sidebar for selecting the PDF files, and clicking the Process button displays the extracted PDF text. We can run the app with the following command:

    (venv) $ streamlit run myapp.py

      You can now view your Streamlit app in your browser.

      Local URL: http://localhost:8502
      Network URL: http://198.18.1.13:8502

    After running the command above, we can open the app in a browser:

    We click Browse files and select the PDF files:

    After clicking Process, we can see:

    Above, for ease of inspection, I used st.write to print the result directly to the browser page. Next we need to split this long text into chunks: each chunk must stay within the maximum input size the model allows.
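
    The CharacterTextSplitter used in the final code below does exactly this: it cuts the text on newlines into pieces of at most 1000 characters, with a 200-character overlap so that context is not lost at chunk boundaries. A quick standalone check of this behavior, using made-up text, might look like this:

    from langchain.text_splitter import CharacterTextSplitter

    text = "\n".join(f"This is sentence number {i} of a long document." for i in range(200))

    text_splitter = CharacterTextSplitter(
        separator="\n",
        chunk_size=1000,    # maximum characters per chunk
        chunk_overlap=200,  # characters shared between neighboring chunks
        length_function=len
    )
    chunks = text_splitter.split_text(text)
    print(len(chunks), len(chunks[0]))  # number of chunks and size of the first one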

    So far I have only sketched the construction of the UI. The final, complete myapp.py looks like this:

    myapp.py

    import streamlit as st
    from dotenv import load_dotenv
    from PyPDF2 import PdfReader
    from langchain.text_splitter import CharacterTextSplitter
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.embeddings import OpenAIEmbeddings
    from htmlTemplates import css, bot_template, user_template

    import lib_indexer
    import lib_llm
    import lib_embeddings
    import lib_vectordb

    index_name = "pdf_docs"

    def get_pdf_text(pdf):
        text = ""
        pdf_reader = PdfReader(pdf)
        for page in pdf_reader.pages:
            text += page.extract_text()
        return text

    def get_pdf_texts(pdf_docs):
        text = ""
        for pdf in pdf_docs:
            pdf_reader = PdfReader(pdf)
            for page in pdf_reader.pages:
                text += page.extract_text()
        return text

    def get_text_chunks(text):
        text_splitter = CharacterTextSplitter(
            separator="\n",
            chunk_size=1000,
            chunk_overlap=200,
            length_function=len
        )
        chunks = text_splitter.split_text(text)
        # chunks = text_splitter.split_documents(text)
        return chunks

    def get_text_chunks1(text):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
        chunks = text_splitter.split_text(text)
        return chunks

    def handle_userinput(db, llm_chain_informed, user_question):
        similar_docs = db.similarity_search(user_question)
        print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

        ## 4. Ask Local LLM context informed prompt
        # print(">> 4. Asking The Book ... and its response is: ")
        informed_context = similar_docs[0].page_content
        response = llm_chain_informed.run(context=informed_context, question=user_question)

        st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
        st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)

    def main():
        ## Huggingface embedding setup
        hf = lib_embeddings.setup_embeddings()

        ## Elasticsearch as a vector db
        db, url = lib_vectordb.setup_vectordb(hf, index_name)

        ## Set up the conversational LLM
        llm_chain_informed = lib_llm.make_the_llm()

        load_dotenv()
        st.set_page_config(page_title="Chat with multiple PDFs", page_icon=":books:")
        st.write(css, unsafe_allow_html=True)

        st.header("Chat with multiple PDFs :books:")
        user_question = st.text_input("Ask a question about your documents")
        if user_question:
            handle_userinput(db, llm_chain_informed, user_question)

        st.write(user_template.replace("{{MSG}}", "Hello, human").replace("{{MSG1}}", " "), unsafe_allow_html=True)
        st.write(bot_template.replace("{{MSG}}", "Hello, robot").replace("{{MSG1}}", " "), unsafe_allow_html=True)

        # Add a side bar
        with st.sidebar:
            st.subheader("Your documents")
            pdf_docs = st.file_uploader(
                "Upload your PDFs here and click on Process", accept_multiple_files=True)
            print(pdf_docs)

            if st.button("Process"):
                with st.spinner("Processing"):
                    # Get the pdf text
                    # raw_text = get_pdf_text(pdf_docs[0])
                    raw_text = get_pdf_texts(pdf_docs)
                    # st.write(raw_text)
                    print(raw_text)

                    # Get the text chunks
                    text_chunks = get_text_chunks(raw_text)
                    # st.write(text_chunks)

                    # Create vector store
                    lib_indexer.loadPdfChunks(text_chunks, url, hf, db, index_name)

    if __name__ == "__main__":
        main()

    Creating the embedding model

    lib_embeddings.py

    ## for embeddings
    from langchain.embeddings import HuggingFaceEmbeddings

    def setup_embeddings():
        # Huggingface embedding setup
        print(">> Prep. Huggingface embedding setup")
        model_name = "sentence-transformers/all-mpnet-base-v2"
        return HuggingFaceEmbeddings(model_name=model_name)
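
    As a quick sanity check (a small usage sketch, not part of the project code), the returned object can embed text directly; all-mpnet-base-v2 produces 768-dimensional vectors:

    import lib_embeddings

    hf = lib_embeddings.setup_embeddings()
    vector = hf.embed_query("What is this document about?")
    print(len(vector))   # 768 for sentence-transformers/all-mpnet-base-v2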

    Creating the vector store

    lib_vectordb.py

    import os
    from config import Config

    ## for vector store
    from langchain.vectorstores import ElasticVectorSearch

    def setup_vectordb(hf, index_name):
        # Elasticsearch URL setup
        print(">> Prep. Elasticsearch config setup")

        with open('simple.cfg') as f:
            cfg = Config(f)

        endpoint = cfg['ES_SERVER']
        username = "elastic"
        password = cfg['ES_PASSWORD']

        ssl_verify = {
            "verify_certs": True,
            "basic_auth": (username, password),
            "ca_certs": "./http_ca.crt",
        }

        url = f"https://{username}:{password}@{endpoint}:9200"

        return ElasticVectorSearch(embedding=hf,
                                   elasticsearch_url=url,
                                   index_name=index_name,
                                   ssl_verify=ssl_verify), url
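
    Once configured, the returned store can be queried directly. For instance (a small usage sketch, assuming documents have already been indexed):

    import lib_embeddings, lib_vectordb

    hf = lib_embeddings.setup_embeddings()
    db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")

    # Returns the chunks closest to the query in embedding space
    similar_docs = db.similarity_search("What is this document about?")
    print(similar_docs[0].page_content)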

    Creating the offline LLM with a prompt template that takes context and question variables

    lib_llm.py

    ## for conversation LLM
    from langchain import PromptTemplate, HuggingFaceHub, LLMChain
    from langchain.llms import HuggingFacePipeline
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, AutoModelForSeq2SeqLM

    def make_the_llm():
        # Get Offline flan-t5-large ready to go, in CPU mode
        print(">> Prep. Get Offline flan-t5-large ready to go, in CPU mode")

        model_id = 'google/flan-t5-large'  # go for a smaller model if you don't have the VRAM
        tokenizer = AutoTokenizer.from_pretrained(model_id)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_id)  # load_in_8bit=True, device_map='auto'

        pipe = pipeline(
            "text2text-generation",
            model=model,
            tokenizer=tokenizer,
            max_length=100
        )

        local_llm = HuggingFacePipeline(pipeline=pipe)

        # template_informed = """
        # I know the following: {context}
        # Question: {question}
        # Answer: """

        template_informed = """
        I know: {context}
        when asked: {question}
        my response is: """

        prompt_informed = PromptTemplate(template=template_informed, input_variables=["context", "question"])

        return LLMChain(prompt=prompt_informed, llm=local_llm)
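
    A quick standalone test of the chain might look like this (a small sketch with made-up context):

    import lib_llm

    llm_chain = lib_llm.make_the_llm()
    response = llm_chain.run(
        context="I will send a car to meet you from the half past four arrival at Harrogate Station.",
        question="What will be sent to meet you?"
    )
    print(response)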

    Writing the PDF files as vector representations

    Below is my chunking and vector-store code. It needs the assembled Elasticsearch URL, the Huggingface embedding model, the vector database object, and the name of the target index in Elasticsearch.

    lib_indexer.py

    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain.document_loaders import TextLoader

    ## for vector store
    from langchain.vectorstores import ElasticVectorSearch
    from elasticsearch import Elasticsearch
    from config import Config

    with open('simple.cfg') as f:
        cfg = Config(f)

    fingerprint = cfg['ES_FINGERPRINT']
    endpoint = cfg['ES_SERVER']
    username = "elastic"
    password = cfg['ES_PASSWORD']

    ssl_verify = {
        "verify_certs": True,
        "basic_auth": (username, password),
        "ca_certs": "./http_ca.crt"
    }

    url = f"https://{username}:{password}@{endpoint}:9200"

    def parse_book(filepath):
        loader = TextLoader(filepath)
        documents = loader.load()
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=384, chunk_overlap=0)
        docs = text_splitter.split_documents(documents)
        return docs

    def parse_triplets(filepath):
        docs = parse_book(filepath)
        result = []
        for i in range(len(docs) - 2):
            concat_str = docs[i].page_content + " " + docs[i+1].page_content + " " + docs[i+2].page_content
            result.append(concat_str)
        return result

    # db.from_texts(docs, embedding=hf, elasticsearch_url=url, index_name=index_name)

    ## load book utility
    ## params
    ##   filepath: where to get the book txt ... should be utf-8
    ##   url: the full Elasticsearch url with username password and port embedded
    ##   hf: hugging face transformer for sentences
    ##   db: the VectorStore Langchain object ready to go with embedding thing already set up
    ##   index_name: name of index to use in ES
    ##
    ## will check if the index_name exists already in ES url before attempting split and load
    def loadBookTriplets(filepath, url, hf, db, index_name):
        with open('simple.cfg') as f:
            cfg = Config(f)

        fingerprint = cfg['ES_FINGERPRINT']
        es = Elasticsearch([url],
                           basic_auth=("elastic", cfg['ES_PASSWORD']),
                           ssl_assert_fingerprint=fingerprint,
                           http_compress=True)

        ## Parse the book if necessary
        if not es.indices.exists(index=index_name):
            print(f'\tThe index: {index_name} does not exist')
            print(">> 1. Chunk up the Source document")
            results = parse_triplets(filepath)
            print(">> 2. Index the chunks into Elasticsearch")
            # parse_triplets() returns plain strings, so index them with from_texts
            elastic_vector_search = ElasticVectorSearch.from_texts(results,
                                                                   embedding=hf,
                                                                   elasticsearch_url=url,
                                                                   index_name=index_name,
                                                                   ssl_verify=ssl_verify)
        else:
            print("\tLooks like the pdfs are already loaded, let's move on")

    def loadBookBig(filepath, url, hf, db, index_name):
        es = Elasticsearch([url],
                           basic_auth=("elastic", cfg['ES_PASSWORD']),
                           ssl_assert_fingerprint=fingerprint,
                           http_compress=True)

        ## Parse the book if necessary
        if not es.indices.exists(index=index_name):
            print(f'\tThe index: {index_name} does not exist')
            print(">> 1. Chunk up the Source document")
            docs = parse_book(filepath)
            # print(docs)
            print(">> 2. Index the chunks into Elasticsearch")
            elastic_vector_search = ElasticVectorSearch.from_documents(docs,
                                                                       embedding=hf,
                                                                       elasticsearch_url=url,
                                                                       index_name=index_name,
                                                                       ssl_verify=ssl_verify)
        else:
            print("\tLooks like the pdfs are already loaded, let's move on")

    def loadPdfChunks(chunks, url, hf, db, index_name):
        es = Elasticsearch([url],
                           basic_auth=("elastic", cfg['ES_PASSWORD']),
                           ssl_assert_fingerprint=fingerprint,
                           http_compress=True)

        ## Index the chunks if necessary
        if not es.indices.exists(index=index_name):
            print(f'\tThe index: {index_name} does not exist')
            print(">> 2. Index the chunks into Elasticsearch")
            print("url: ", url)
            print("index_name", index_name)
            elastic_vector_search = db.from_texts(chunks,
                                                  embedding=hf,
                                                  elasticsearch_url=url,
                                                  index_name=index_name,
                                                  ssl_verify=ssl_verify)
        else:
            print("\tLooks like the pdfs are already loaded, let's move on")
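
    Tying it together, myapp.py calls loadPdfChunks with the text chunks plus the objects created at startup, roughly like this (a usage sketch with placeholder chunks):

    import lib_embeddings, lib_vectordb, lib_indexer

    hf = lib_embeddings.setup_embeddings()
    db, url = lib_vectordb.setup_vectordb(hf, "pdf_docs")

    text_chunks = ["first chunk of PDF text ...", "second chunk ..."]   # normally from get_text_chunks()
    lib_indexer.loadPdfChunks(text_chunks, url, hf, db, "pdf_docs")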

    Asking questions

    We use streamlit's text input to ask questions:

    user_question = st.text_input("Ask a question about your documents")
    if user_question:
        handle_userinput(db, llm_chain_informed, user_question)

    When we press the ENTER key, the code above calls handle_userinput(db, llm_chain_informed, user_question):

    def handle_userinput(db, llm_chain_informed, user_question):
        similar_docs = db.similarity_search(user_question)
        print(f'The most relevant passage: \n\t{similar_docs[0].page_content}')

        ## 4. Ask Local LLM context informed prompt
        # print(">> 4. Asking The Book ... and its response is: ")
        informed_context = similar_docs[0].page_content
        response = llm_chain_informed.run(context=informed_context, question=user_question)

        st.write(user_template.replace("{{MSG}}", user_question).replace("{{MSG1}}", " "), unsafe_allow_html=True)
        st.write(bot_template.replace("{{MSG}}", response).replace("{{MSG1}}", similar_docs[0].page_content), unsafe_allow_html=True)

    First it runs a similarity search against the vector store; note that only the top-ranked passage is passed to the LLM as context, which then generates the answer we want.

    Results

    We run the code with this command:

    streamlit run myapp.py

    In the browser, we select the two PDF files in pdf_files:

    Above, we type in the question we want to ask:

    The question above is:

    what do I make all the same and put a cup next to him on the desk?

    We ask another question:

    The question above is:

    when should you come? I will send a car to meet you from the half past four arrival at Harrogate Station.

    The question above is:

    what will I send to meet you from the half past four arrival at Harrogate Station?

    Try out other questions as many times as you like. Happy journey :)

    Using ChatGPT works in basically the same way: you just need to use a ChatGPT model and its corresponding key. I won't go into the details here.
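
    For example, swapping the local model for ChatGPT could look roughly like this (a hedged sketch assuming langchain's ChatOpenAI wrapper and an OPENAI_API_KEY in .env; not part of the project code):

    from langchain import PromptTemplate, LLMChain
    from langchain.chat_models import ChatOpenAI

    def make_the_chatgpt_llm():
        # Reads OPENAI_API_KEY from the environment (.env)
        llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

        template_informed = """
        I know: {context}
        when asked: {question}
        my response is: """

        prompt_informed = PromptTemplate(template=template_informed,
                                         input_variables=["context", "question"])
        return LLMChain(prompt=prompt_informed, llm=llm)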

  • Original article: https://blog.csdn.net/UbuntuTouch/article/details/133270431