• 【A complete chat-with-documents RAG demo based on LangChain + Streamlit】


    A locally deployed document Q&A web demo

    • Supports pdf
    • Supports txt
    • Supports doc/docx
    • Supports source-document citation

    Your likes and bookmarks keep me motivated to share quality content~

    Without further ado, here is what it looks like:

    (screenshot of the demo UI)

    Preparation

    • First, create a new environment (optional)
    conda create -n chatwithdocs python=3.11
    conda activate chatwithdocs
    
    • Create a requirements.txt file
    streamlit
    python-docx
    PyPDF2
    faiss-gpu  # use faiss-cpu instead if you don't have a CUDA GPU
    langchain
    langchain-core
    langchain-community
    langchain-openai
    
    • Then install the packages
    pip install -r requirements.txt -U
    
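    Before moving on, you can sanity-check that everything the demo imports is actually available. This is a minimal, optional check script (`missing_packages` is a hypothetical helper, not part of any of the libraries above); note that some pip package names differ from the module names you import:

    ```python
    import importlib.util

    def missing_packages(mods):
        """Return the module names that are not importable in this environment."""
        return [m for m in mods if importlib.util.find_spec(m) is None]

    # top-level modules used by app.py (pip name python-docx -> module docx, etc.)
    required = ["streamlit", "docx", "PyPDF2", "faiss", "langchain", "langchain_openai"]
    print(missing_packages(required))  # an empty list means you are good to go
    ```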

    Code

    Create an app.py file and paste the code below into it.
    Note: replace the api-key and base-url with your own.

    import streamlit as st
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.vectorstores import FAISS
    from langchain_openai import ChatOpenAI
    from langchain_openai import OpenAIEmbeddings
    from langchain_core.documents import Document
    from langchain.chains import ConversationalRetrievalChain
    import docx
    from PyPDF2 import PdfReader
    
    import os
    os.environ['OPENAI_API_KEY'] = 'xxx'
    # os.environ['OPENAI_BASE_URL'] = 'xxx'  # depending on your setup
    
    st.set_page_config(page_title="Chat with Documents", page_icon=":robot:", layout="wide")
    
    st.markdown(
        """
    # """,
        unsafe_allow_html=True,
    )
    
    bot_template = """
    
    {{MSG}}
    """
    user_template = """
    {{MSG}}
    """
    
    
    def get_pdf_text(pdf_docs):
        """Read the uploaded files into Documents, keeping a per-page /
        per-paragraph source string for citation later."""
        docs = []
        for document in pdf_docs:
            if document.type == "application/pdf":
                pdf_reader = PdfReader(document)
                for idx, page in enumerate(pdf_reader.pages):
                    docs.append(
                        Document(
                            page_content=page.extract_text(),
                            metadata={"source": f"{document.name} on page {idx}"},
                        )
                    )
            elif (
                document.type
                == "application/vnd.openxmlformats-officedocument.wordprocessingml.document"
            ):
                doc = docx.Document(document)
                for idx, paragraph in enumerate(doc.paragraphs):
                    docs.append(
                        Document(
                            page_content=paragraph.text,
                            metadata={"source": f"{document.name} in paragraph {idx}"},
                        )
                    )
            elif document.type == "text/plain":
                text = document.getvalue().decode("utf-8")
                docs.append(Document(page_content=text, metadata={"source": document.name}))
        return docs
    
    
    def get_text_chunks(docs):
        text_splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=0)
        return text_splitter.split_documents(docs)
    
    
    def get_vectorstore(docs_chunks):
        embeddings = OpenAIEmbeddings()
        return FAISS.from_documents(docs_chunks, embedding=embeddings)
    
    
    def get_conversation_chain(vectorstore):
        llm = ChatOpenAI()
        return ConversationalRetrievalChain.from_llm(
            llm=llm,
            retriever=vectorstore.as_retriever(),
            return_source_documents=True,
        )
    
    
    def handle_userinput_pdf(user_question):
        chat_history = st.session_state.chat_history
        # ConversationalRetrievalChain expects chat_history as (human, ai) pairs,
        # so rebuild pairs from the flat [("user", q), ("assistant", a), ...] list
        pairs = [
            (chat_history[i][1], chat_history[i + 1][1])
            for i in range(0, len(chat_history) - 1, 2)
        ]
        response = st.session_state.conversation(
            {"question": user_question, "chat_history": pairs}
        )
        st.session_state.chat_history.append(("user", user_question))
        st.session_state.chat_history.append(("assistant", response["answer"]))
    
        st.write(
            user_template.replace("{{MSG}}", user_question),
            unsafe_allow_html=True,
        )
        # deduplicate the source names of the retrieved chunks
        sources = response["source_documents"]
        source_names = set(i.metadata["source"] for i in sources)
        src = "\n\n".join(source_names)
        src = f"\n\n> source : {src}"
        message = st.session_state.chat_history[-1]
        st.write(bot_template.replace("{{MSG}}", message[1] + src), unsafe_allow_html=True)
    
    
    def show_history():
        chat_history = st.session_state.chat_history
        for i, message in enumerate(chat_history):
            if i % 2 == 0:
                st.write(user_template.replace("{{MSG}}", message[1]), unsafe_allow_html=True)
            else:
                st.write(bot_template.replace("{{MSG}}", message[1]), unsafe_allow_html=True)
    
    
    def main():
        st.header("Chat with Documents")
    
        # initialize session state
        if "conversation" not in st.session_state:
            st.session_state.conversation = None
        if "chat_history" not in st.session_state:
            st.session_state.chat_history = []
    
        with st.sidebar:
            st.title("Document management")
            pdf_docs = st.file_uploader(
                "Choose files",
                type=["pdf", "txt", "doc", "docx"],
                accept_multiple_files=True,
            )
            if st.button(
                "Process documents",
                on_click=lambda: setattr(st.session_state, "last_action", "pdf"),
                use_container_width=True,
            ):
                if pdf_docs:
                    with st.spinner("Processing"):
                        docs = get_pdf_text(pdf_docs)
                        docs_chunks = get_text_chunks(docs)
                        vectorstore = get_vectorstore(docs_chunks)
                        st.session_state.conversation = get_conversation_chain(vectorstore)
                else:
                    st.warning("Remember to upload a file first~~")
    
            def clear_history():
                st.session_state.chat_history = []
    
            if st.session_state.chat_history:
                st.button("Clear chat", on_click=clear_history, use_container_width=True)
    
        with st.container():
            user_question = st.chat_input("Ask me something~")
            with st.container(height=400):
                show_history()
                if user_question:
                    if st.session_state.conversation is not None:
                        handle_userinput_pdf(user_question)
                    else:
                        st.warning("Remember to upload a file first~~")
    
    
    if __name__ == "__main__":
        main()
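    The chunking step above uses RecursiveCharacterTextSplitter with chunk_size=512 and no overlap. Conceptually it behaves like a fixed-size character splitter, except that it prefers to break on paragraph, line, and word boundaries before falling back to hard cuts. A stdlib-only sketch of the naive fixed-size version (`chunk_text` is a hypothetical helper for illustration, not part of LangChain):

    ```python
    def chunk_text(text, chunk_size=512, overlap=0):
        """Naive character chunking: slide a chunk_size window with the given
        overlap. RecursiveCharacterTextSplitter does roughly this, but tries
        "\n\n", "\n", " " split points first so chunks end on natural boundaries."""
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]

    print(chunk_text("abcdefghij", chunk_size=4, overlap=2))
    ```

    Overlap is worth experimenting with: with chunk_overlap=0 a sentence can be cut in half at a chunk boundary, and neither half may retrieve well on its own.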
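    The source-citation feature works by deduplicating the metadata["source"] strings of the retrieved chunks and appending them to the answer, which is what handle_userinput_pdf does. The same logic in isolation (`Doc` is a stand-in for langchain_core's Document; the demo uses an unordered set, sorted here so the output is stable):

    ```python
    from dataclasses import dataclass, field

    @dataclass
    class Doc:  # stand-in for langchain_core.documents.Document
        page_content: str
        metadata: dict = field(default_factory=dict)

    def source_footer(source_documents):
        """Deduplicate source names and format them as the '> source :' footer."""
        names = sorted({d.metadata["source"] for d in source_documents})
        return "\n\n> source : " + "\n\n".join(names)

    docs = [
        Doc("...", {"source": "report.pdf on page 0"}),
        Doc("...", {"source": "report.pdf on page 0"}),  # duplicate source collapses
        Doc("...", {"source": "notes.txt"}),
    ]
    print(source_footer(docs))
    ```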

    Launch

    • This opens automatically in the browser
    streamlit run app.py
    
  • Original post: https://blog.csdn.net/qq_39749966/article/details/136665751