Hi everyone, I'm 同学小张. I keep studying advanced C++ and hands-on applications of large AI models, and I share what I learn along the way. Likes and follows are welcome; let's learn and improve together.
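In the previous article we walked through a simple example that covered installing LlamaIndex and its basic usage, and we built a simple RAG question-answering system with it. Today we begin a more systematic study, starting with the Load part of LlamaIndex, which is responsible for connecting to file data.

0. File-type loader: SimpleDirectoryReader

We used this reader at the very start of the code in the previous article:

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

# Load data from the given path with SimpleDirectoryReader
# (raw string, so the backslashes in the Windows path are not read as escapes)
documents = SimpleDirectoryReader(r"D:\GitHub\LEARN_LLM\LlamaIndex\data").load_data()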
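This is the easiest folder loader to use in LlamaIndex. It reads every file under the folder path you pass in and handles many formats, including Markdown, PDF, Word, PowerPoint, images, audio, and video.

Here are the reader types it integrates:

Reference: https:///l/readers/llama-index-readers-file?from=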
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import (
    DocxReader,
    HWPReader,
    PDFReader,
    EpubReader,
    FlatReader,
    HTMLTagReader,
    ImageCaptionReader,
    ImageReader,
    ImageVisionLLMReader,
    IPYNBReader,
    MarkdownReader,
    MboxReader,
    PptxReader,
    PandasCSVReader,
    VideoAudioReader,
    UnstructuredReader,
    PyMuPDFReader,
    ImageTabularChartReader,
    XMLReader,
    PagedCSVReader,
    CSVReader,
    RTFReader,
)

# PDF Reader with `SimpleDirectoryReader`
parser = PDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Docx Reader example
parser = DocxReader()
file_extractor = {".docx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# HWP Reader example
parser = HWPReader()
file_extractor = {".hwp": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Epub Reader example
parser = EpubReader()
file_extractor = {".epub": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Flat Reader example
parser = FlatReader()
file_extractor = {".txt": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# HTML Tag Reader example
parser = HTMLTagReader()
file_extractor = {".html": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Image Reader example
parser = ImageReader()
file_extractor = {
    ".jpg": parser,
    ".jpeg": parser,
    ".png": parser,
}  # Add other image formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# IPYNB Reader example
parser = IPYNBReader()
file_extractor = {".ipynb": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Markdown Reader example
parser = MarkdownReader()
file_extractor = {".md": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Mbox Reader example
parser = MboxReader()
file_extractor = {".mbox": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Pptx Reader example
parser = PptxReader()
file_extractor = {".pptx": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Pandas CSV Reader example
parser = PandasCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# PyMuPDF Reader example
parser = PyMuPDFReader()
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# XML Reader example
parser = XMLReader()
file_extractor = {".xml": parser}
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# Paged CSV Reader example
parser = PagedCSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()

# CSV Reader example
parser = CSVReader()
file_extractor = {".csv": parser}  # Add other CSV formats as needed
documents = SimpleDirectoryReader(
    "./data", file_extractor=file_extractor
).load_data()
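Every example above follows the same pattern: a dictionary maps a file extension to a reader, and the directory loader picks the reader by extension. To make that idea concrete, here is a minimal standard-library sketch of extension-based dispatch. It is not LlamaIndex's actual implementation; the names load_directory, read_plain_text and read_markdown are made up for illustration, and the "readers" are plain functions that return text.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List


# Hypothetical stand-ins for reader objects: each takes a path and returns text.
def read_plain_text(path: Path) -> str:
    return path.read_text(encoding="utf-8")


def read_markdown(path: Path) -> str:
    # Toy behavior: strip leading '#' heading markers from every line.
    lines = path.read_text(encoding="utf-8").splitlines()
    return "\n".join(line.lstrip("#").strip() for line in lines)


def load_directory(
    directory: Path,
    file_extractor: Dict[str, Callable[[Path], str]],
    default_reader: Callable[[Path], str] = read_plain_text,
) -> List[dict]:
    """Read every file in `directory`, choosing a reader by file suffix."""
    documents = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        # Fall back to the default reader when no extractor is registered.
        reader = file_extractor.get(path.suffix.lower(), default_reader)
        documents.append({"file_name": path.name, "text": reader(path)})
    return documents


def demo() -> List[dict]:
    # Build a throwaway folder with two files and load it.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "a.txt").write_text("plain text", encoding="utf-8")
        (root / "b.md").write_text("# Title\nbody", encoding="utf-8")
        return load_directory(root, {".md": read_markdown})


if __name__ == "__main__":
    for doc in demo():
        print(doc)
```

In this sketch the .md file goes through the registered Markdown function, while the .txt file falls back to the default reader, which mirrors how an unregistered extension would be handled by some default in a real loader.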
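1. Loaders in LlamaHub

Besides reading files as shown above, real-world data comes from many other places, such as GitHub, web pages, and databases. Loaders for these sources are implemented in LlamaHub and can be used as needed. The figure below shows the loader list on LlamaHub:

[Figure: LlamaHub loader list]

1.1 How to use them

First install the corresponding package. For example, to use DatabaseReader:

pip install llama-index-readers-database

# Alternatively, adding the following line before use should probably also work:
# from llama_index.core import download_loader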
Otherwise you get an error. After that, the reader can be used normally:

import os

# Optional, per the note above; not used directly in the code below
from llama_index.core import download_loader
from llama_index.readers.database import DatabaseReader

# Connection settings are read from environment variables
reader = DatabaseReader(
    scheme=os.getenv("DB_SCHEME"),
    host=os.getenv("DB_HOST"),
    port=os.getenv("DB_PORT"),
    user=os.getenv("DB_USER"),
    password=os.getenv("DB_PASS"),
    dbname=os.getenv("DB_NAME"),
)

query = "SELECT * FROM users"
documents = reader.load_data(query=query)
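To get a feel for what a database loader does conceptually, here is a small sketch that uses only the standard-library sqlite3 module. It assumes a loader runs the query and turns each result row into one text document; the helper name rows_to_documents and the "column: value" formatting are my own choices for illustration, and the real DatabaseReader may format rows differently.

```python
import sqlite3
from typing import List


def rows_to_documents(conn: sqlite3.Connection, query: str) -> List[str]:
    """Run `query` and render each row as a 'column: value' text document."""
    cursor = conn.execute(query)
    columns = [desc[0] for desc in cursor.description]
    return [
        ", ".join(f"{col}: {val}" for col, val in zip(columns, row))
        for row in cursor.fetchall()
    ]


def demo() -> List[str]:
    # An in-memory database stands in for a real server (assumption for the demo).
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
        conn.executemany(
            "INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)]
        )
        conn.commit()
        return rows_to_documents(conn, "SELECT * FROM users ORDER BY id")
    finally:
        conn.close()


if __name__ == "__main__":
    for text in demo():
        print(text)
```

Each string produced here plays the role of one document's text, which could then be wrapped in whatever document structure the indexing step expects.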
2. Building a Document directly from text

Plain text can be converted directly into the Document structure that LlamaIndex needs:

from llama_index.core import Document

doc = Document(text="text")
3. Transforming document content

After the data is loaded, the next step is to process and transform it. These transformations include chunking, extracting metadata, and embedding each chunk, so that the data can be retrieved for the large model.

3.1 The simple one-step approach

The simplest way to do the transformation is the from_documents method we used in the previous article.

from llama_index.core import VectorStoreIndex

vector_index = VectorStoreIndex.from_documents(documents)
query_engine = vector_index.as_query_engine()
The from_documents() method accepts a list of Document objects and parses and splits them automatically.

3.2 Custom transformations

Sometimes we need to control the chunking and other transformation logic ourselves. There are two ways to do this:

(1) Pass a custom splitter through the transformations parameter of from_documents.

from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# per-index
index = VectorStoreIndex.from_documents(
    documents, transformations=[text_splitter]
)
(2) Use the global settings to set the default splitter.

from llama_index.core import Settings
from llama_index.core.node_parser import SentenceSplitter

text_splitter = SentenceSplitter(chunk_size=512, chunk_overlap=10)

# global
Settings.text_splitter = text_splitter

# the index now picks up the global splitter
index = VectorStoreIndex.from_documents(documents)
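Both approaches configure the same two knobs, chunk_size and chunk_overlap. The sketch below shows what those parameters mean, using a deliberately simplified splitter: it counts whitespace-separated words and ignores sentence boundaries, whereas SentenceSplitter works on tokens and tries to keep sentences intact. The function name split_with_overlap is hypothetical and is not part of LlamaIndex.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `chunk_overlap` words, so context at a chunk
    boundary is not lost entirely.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = "one two three four five six seven"
    for chunk in split_with_overlap(sample, chunk_size=4, chunk_overlap=1):
        print(chunk)
```

With chunk_size=4 and chunk_overlap=1, the word "four" appears at the end of the first chunk and again at the start of the second. A larger overlap keeps more shared context between neighboring chunks at the cost of producing more chunks.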
4. Summary

This article introduced the loaders in LlamaIndex and how to use them. LlamaIndex provides built-in file loaders and also supports the many other loader types available in LlamaHub.

At the end, we briefly covered how LlamaIndex turns the loaded document data into an index.