%%capture
!pip install openai llama_hub llama_index pypdf accelerate sentence_transformers -q -U
%%capture
%%bash
wget -O state_of_ai_2023.zip https://github.com/harpreetsahota204/langchain-zoomcamp/raw/main/State%20of%20AI%20Report%202023%20-%20ONLINE.pdf.zip
unzip state_of_ai_2023.zip
from pathlib import Path
import json

from llama_hub.file.pdf.base import PDFReader
from llama_index.response.notebook_utils import display_source_node
from llama_index.retrievers import RecursiveRetriever
from llama_index.query_engine import RetrieverQueryEngine
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.llms import OpenAI
from llama_index.node_parser import SimpleNodeParser
from llama_index.schema import IndexNode
from llama_index import Document
Introduction
The much-anticipated “State of AI 2023 Report” has finally graced us with its presence.
Clocking in at a hefty 163 pages, it’s an exhaustive tome teeming with insights on the ever-evolving landscape of Artificial Intelligence. However, the sheer volume of the report can be overwhelming for many.
That’s what piqued my curiosity: could I harness the power of LlamaIndex and DeciLM-6B-Instruct to extract meaningful answers from this mammoth document through Retrieval Augmented Generation?
As I embarked on this journey, it wasn’t just about simplifying the report; it was also a golden opportunity for me to get acquainted with LlamaIndex. My goal? To demystify it for our vibrant community.
The process was exhilarating, and what you’re about to read is not just a digest of the report, but also a crash course in LlamaIndex, leveraging open-source embedding models and LLMs.
Dive in and join me on this enlightening adventure!
🦙 How to use LlamaIndex
The basic usage pattern for LlamaIndex is a 5-step process that takes you from your raw, unstructured data to LLM-generated content based on that data:
- Load documents
- Parse Documents into Nodes
- Build an Index
- Query the index
- Parse the response
📑 Load Documents
The first step is to load your data in the form of Document objects.
For this purpose, LlamaIndex has several data loaders, which will help you load Documents via the load_data method.
No matter which data loader you use, the nice thing about LlamaIndex is that they all follow the same basic pattern: choose the appropriate loader, instantiate the loader, and load documents.
❓What is a Document?
A Document is a container that holds data from various sources, such as a PDF, an API output, or data retrieved from a database.
It can be created manually or automatically through the data loaders.
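For illustration, here's a minimal sketch of creating a Document by hand; the text and metadata values are made up for this example:

from llama_index import Document

# A hypothetical, hand-built Document (not loaded from the report)
manual_doc = Document(
    text="FlashAttention makes attention memory linear in sequence length.",
    metadata={"page_label": "1", "section": "Research"},
)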
loader = PDFReader()
docs0 = loader.load_data(file=Path("/content/State of AI Report 2023 - ONLINE.pdf"))
The PDF has been converted to a list of 163 elements, where each element of that list is a Document object.
print(f" docs is a {type(docs0)}, of length {len(docs0)}, where each element is a {type(docs0[0])} object")
docs is a <class 'list'>, of length 163, where each element is a <class 'llama_index.schema.Document'> object
🤌🏽 Data quality matters
Take a look at the State of AI Report 2023 and you can see that our PDF is actually a slide deck.
Each slide has a topic or theme it’s addressing, and the ideas/concepts/points on that slide are grouped as bullet points, mostly separated by \n, \n●, or \n-. This will be important to keep in mind when you split the text.
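As a quick sanity check (a minimal sketch using a made-up slide string), you can see how that pattern carves a slide into bullets:

import re

sample = "Slide topic\n● First point\n- Second point\nThird point"
print(re.split(r"\n●|\n-|\n", sample))
# ['Slide topic', ' First point', ' Second point', 'Third point']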
🏷️ Metadata
You can inspect the content of the document and see that it has a lot of metadata associated with it.
You should add some metadata to each Document based on the page_label.
The report is organized into sections, so you can add metadata to indicate which section each of the pages (which are now Document objects) belongs to.
You can look at the report and verify that:
- Pages 1 through 10 are the Introduction
- Pages 11 through 68 are related to Research
- Pages 69 through 120 are related to Politics
- Pages 121 through 137 are related to Safety
- Pages 138 to the end are related to Predictions
docs0[94]
Document(id_='7ab058ff-eefb-43e5-ace2-3da46aa3b1b0', embedding=None, metadata={'page_label': '95', 'file_name': 'State of AI Report 2023 - ONLINE.pdf'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='fdbfaad00d3059b0afe90a354f3eca708bef811b943b7c513adc71ceb5f92145', text=" In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring \nDALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content \nlicensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and \nassociated metadata for model training. Furthermore, Shutterstock will offer its customers indemnification for AI \nimage creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in \nstark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its \nongoing lawsuit against Stability AI for copyright infringement filed in Feb 2023. \nstateof.ai 2023 #stateofai | 95 Introduction | Research | Industry | Politics | Safety | Predictions \n2022 Prediction: A major user generated content site negotiates a commercial \nsettlement with a start-up producing AI models (e.g. OpenAI) for training on their corpus \nvs.", start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
💽 Dirty data
You’ll notice that each slide has the following text:
- a footer that reads stateof.ai 2023
- a header that reads Introduction | Research | Industry | Politics | Safety | Predictions
- in the top right of each page, the pattern #stateofai | n, where n is the slide number
- icons in the top right corner that indicate input and output modalities for the model, for example 📝 → 📝
The icons are important, but the rest of that text is unnecessary. We don’t need to waste time embedding it, storing it in our vector database, or letting it find its way into our context.
docs0[94].get_content()
In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring \nDALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content \nlicensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and \nassociated metadata for model training. Furthermore, Shutterstock will offer its customers indemnification for AI \nimage creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in \nstark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its \nongoing lawsuit against Stability AI for copyright infringement filed in Feb 2023. \nstateof.ai 2023 #stateofai | 95 Introduction | Research | Industry | Politics | Safety | Predictions \n2022 Prediction: A major user generated content site negotiates a commercial \nsettlement with a start-up producing AI models (e.g. Op
🧼 Clean text and add metadata
The assign_section function categorizes each document into predefined sections (e.g., Introduction, Research, Politics) based on its page number, which is extracted from the metadata.
Additionally, the text content of each document undergoes a cleaning process using the clean_slide_text function.
When applied to a list of documents, the code ensures that each document is both categorized by section and has its text content refined.
import re


def clean_slide_text(text: str) -> str:
    """
    Cleans the provided slide text by removing specific patterns and extra whitespace.

    Parameters:
    - text (str): The raw text from a slide.

    Returns:
    - str: The cleaned text.

    Example:
    >>> clean_slide_text("LINGO-1 is Wayve’s vision-language-action model ... stateof.ai 2023 #stateofai | 43 \nLeveraging LLMs for autonomous driving'")
    'LINGO-1 is Wayve’s vision-language-action model ... Leveraging LLMs for autonomous driving'
    """
    # Remove the footer text
    text = text.replace("stateof.ai 2023", "")
    # Remove the header text
    text = text.replace("Introduction | Research | Industry | Politics | Safety | Predictions", "")
    # Remove the pattern "#stateofai | n"
    text = re.sub(r"#stateofai(\s*\|\s*\d+)?", "", text)
    # Replace multiple consecutive spaces with a single space
    text = re.sub(r" +", " ", text)
    # Remove any leading or trailing whitespace
    text = text.strip()
    return text


def assign_section(document):
    """
    Assigns a section to the document based on its page number.

    The function updates the 'metadata' attribute of the document with a key
    'section' that has a value corresponding to the section the page number falls into.

    Sections:
    - Page 1 through 10: Introduction
    - Page 11 through 68: Research
    - Page 69 through 120: Politics
    - Page 121 through 137: Safety
    - Pages 138 and beyond: Predictions

    Args:
    - document (Document): The Document object to be updated.

    Returns:
    None. The function updates the Document object in-place.
    """
    page_number = int(document.metadata['page_label'])

    if 1 <= page_number <= 10:
        document.metadata['section'] = 'Introduction'
    elif 11 <= page_number <= 68:
        document.metadata['section'] = 'Research'
    elif 69 <= page_number <= 120:
        document.metadata['section'] = 'Politics'
    elif 121 <= page_number <= 137:
        document.metadata['section'] = 'Safety'
    else:
        document.metadata['section'] = 'Predictions'


# Iterate through each Document object in docs0
for doc in docs0:
    # Update the metadata using assign_section
    assign_section(doc)
    # Metadata keys that are excluded from text for the embed model.
    doc.excluded_embed_metadata_keys = ['file_name']
    # Apply clean_slide_text to the text attribute
    doc.text = clean_slide_text(doc.text)
You can review the metadata and cleaned text to confirm:
docs0[94].metadata
{'page_label': '95', 'file_name': 'State of AI Report 2023 - ONLINE.pdf', 'section': 'Politics'}
docs0[94].get_content()
In Oct 2022, Shutterstock - a leading stock multimedia provider - announced it will work with OpenAI to bring \nDALL·E-powered content onto the platform. Then in July 2023, the two companies signed a 6-year content \nlicensing agreement that would give OpenAI access to Shutterstock's image, video and music libraries and \nassociated metadata for model training. Furthermore, Shutterstock will offer its customers indemnification for AI \nimage creation. The company also entered into a content license with Meta for GenAI. This pro-GenAI stance is in \nstark contrast to Shutterstock’s competitor, Getty Images, which is profoundly against GenAI as evidenced by its \nongoing lawsuit against Stability AI for copyright infringement filed in Feb 2023. \n \n2022 Prediction: A major user generated content site negotiates a commercial \nsettlement with a start-up producing AI models (e.g. OpenAI) for training on their corpus \nvs.
🗳️ Next, you can do one of two things:
- Convert the Document objects into Node objects before sending them to the index
- Send the entire Document objects directly to the index
The choice between sending the entire Document object to the index or converting the Document into Node objects before indexing depends on your specific use case and the structure of your data.
- Sending the entire Document object to the index: This approach is suitable for maintaining the entire document as a single unit. This might be useful when your documents are relatively short or when the context between different parts of the document is important.
- Converting the Document into Node objects before indexing: This approach is practical when your documents are long and you want to break them down into smaller chunks (or nodes) before indexing. This can be beneficial when you want to retrieve specific parts of a document rather than the entire document.
📰 Since the State of AI Report 2023 is a 163-page PDF, it makes sense to first convert the Document objects to Node objects.
Which raises the question…
What are Node objects in LlamaIndex?
A Node object in LlamaIndex represents a “chunk” or a portion of a source Document.
This could be a text chunk, an image, or another type of data. Similar to Documents, Nodes also contain metadata and relationship information with other nodes.
Nodes are considered first-class citizens in LlamaIndex. This means you can define Nodes and all their attributes directly.
Alternatively, you can also “parse” source Documents into Nodes using the NodeParser classes. By default, every Node derived from a Document will inherit the same metadata from that Document. For example, a “file_name” field in the Document is propagated to every Node.
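As a minimal sketch of that inheritance (using the default SimpleNodeParser settings rather than the custom splitters configured later), you could parse the report and inspect one node’s metadata:

default_parser = SimpleNodeParser.from_defaults()
default_nodes = default_parser.get_nodes_from_documents(docs0)

# Each node carries over the metadata of the Document it came from,
# e.g. the page_label, file_name, and the section assigned earlier.
print(default_nodes[0].metadata)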
🕵🏻 A bit of exploratory data analysis
Identifying the best chunk size for a RAG system involves both intuition and empirical evidence.
When building a RAG system, it’s essential to invest time in evaluating and adjusting the chunk_size for optimal results.
- A smaller chunk_size (e.g., 128) provides more granular chunks. However, there’s a risk that essential information might not be among the top retrieved chunks.
- A larger chunk_size (e.g., 512) is likely to encompass all necessary information within the top chunks.
- As chunk_size increases, more information is directed into the LLM to generate an answer. This can ensure a comprehensive context but might slow down the system.
Taking a moment to understand how words are distributed across sections, slides, and bullet points can help you select a better-informed chunk_size when you need to split your text.
import re

# Define the pattern for bullet points and newlines
split_pattern = r"\n●|\n-|\n"

# Initialize lists to store the word counts of all chunks and entire texts across all documents
chunk_word_counts = []
entire_text_word_counts = []

# Initialize a dictionary to store word counts and slide counts by section
section_data = {}

# Iterate through each Document object in your list of documents
for doc in docs0:
    # Split the document's text into chunks based on the pattern
    chunks = re.split(split_pattern, doc.text)

    # Calculate the number of words in each chunk and store it
    chunk_word_counts.extend([len(chunk.split()) for chunk in chunks])

    # Calculate the number of words in the entire text and store it
    entire_word_count = len(doc.text.split())
    entire_text_word_counts.append(entire_word_count)

    # Update the word count and slide count for the section in the dictionary
    section = doc.metadata['section']
    if section in section_data:
        section_data[section]['word_count'] += entire_word_count
        section_data[section]['slide_count'] += 1
    else:
        section_data[section] = {'word_count': entire_word_count, 'slide_count': 1}

# Calculate the total word count across all sections
total_word_count = sum(data['word_count'] for data in section_data.values())

# Calculate the number of sections
num_sections = len(section_data)

# Calculate the average word count across all sections
average_word_count_across_sections = total_word_count / num_sections

# Calculate summary statistics for chunks
average_chunk_word_count = sum(chunk_word_counts) / len(chunk_word_counts)
max_chunk_word_count = max(chunk_word_counts)

# Calculate average word count for entire texts
average_entire_text_word_count = sum(entire_text_word_counts) / len(entire_text_word_counts)

print(f"Average word count for a slide: {average_entire_text_word_count}")
print(f"Average word count per bullet point: {average_chunk_word_count}")
print(f"Longest bullet point: {max_chunk_word_count}")
print(f"Average word count in a section: {average_word_count_across_sections:.2f}")
Average word count for a slide: 127.04907975460122
Average word count per bullet point: 10.796663190823775
Longest bullet point: 33
Average word count in a section: 4141.80
📦 Parse some nodes
Now that we have an understanding of what Nodes are and the role of a Node Parser, let’s start parsing these Nodes!
If you’ve ever seen a slideshow before, you know that bullet points usually go on slides.
And each bullet usually has some relationship to the 3-5 bullets that come before or after it.
Our Chunking Strategy
You’ll make use of a strategy that uses smaller child chunks that refer to bigger parent chunks.
To do this, you’ll first use a SimpleNodeParser with a SentenceSplitter to create “base nodes”, which are larger chunks of text.
Then, you’ll create child nodes using a SentenceWindowNodeParser, which produces nodes that represent bullet points from the slide deck along with metadata that references a “window” of a few bullets on either side.
✅ SimpleNodeParser
SimpleNodeParser converts documents into a list of nodes. It offers flexibility in how the document is parsed, allowing for customization in terms of chunk size, overlap, and inclusion of metadata.
The SimpleNodeParser with a SentenceSplitter is used when you want to break down your documents into chunks of a specific size, with each chunk being a node.
This is particularly useful when dealing with large documents that need to be divided into manageable pieces for processing.
✏️ SentenceSplitter
The SentenceSplitter is a type of text splitter that breaks the text down into sentences.
This is useful when you want to maintain the integrity of individual sentences within each chunk.
Looking at the State of AI 2023 Report, you’ll see that ideas/concepts/points are grouped as bullet points, mostly separated by \n, \n●, or \n-, making r"\n●|\n-|\n" a good choice for the paragraph_separator of the sentence_splitter.
The SentenceSplitter class is designed to split text with a preference for complete sentences. Let’s break down its behavior:
Splitting Mechanism:
- Primary Splitting:
  - The text is first split by the paragraph separator (\n\n\n by default).
  - If that doesn’t result in multiple splits, it tries to split the text using a sentence tokenizer (by default, this uses the nltk.sent_tokenize function).
- Secondary Splitting:
  - If the primary splitting doesn’t yield multiple splits, the text is then split by a regex pattern ("[^,.;。?!]+[,.;。?!]?" by default). This pattern tries to split the text at punctuation marks like commas, periods, and semicolons.
  - If that doesn’t work, it tries to split by a default separator (a space, " ").
  - Lastly, if all else fails, it splits the text character by character.
from llama_index.text_splitter import SentenceSplitter
bullet_splitter = SentenceSplitter(paragraph_separator=r"\n●|\n-|\n", chunk_size=250)

slides_parser = SimpleNodeParser.from_defaults(
    text_splitter=bullet_splitter,
    include_prev_next_rel=True,
    include_metadata=True
)

slides_nodes = slides_parser.get_nodes_from_documents(docs0)
slides_nodes[42]
TextNode(id_='09a280f9-5cf5-4c0f-ad66-f4e60d200a4f', embedding=None, metadata={'page_label': '10', 'file_name': 'State of AI Report 2023 - ONLINE.pdf', 'section': 'Introduction'}, excluded_embed_metadata_keys=['file_name'], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='5332b5ea-1353-4dd6-a9be-b9bf0e8586fa', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'page_label': '10', 'file_name': 'State of AI Report 2023 - ONLINE.pdf', 'section': 'Introduction'}, hash='8422b5813301649b72c0d6d36cc080cd1d3562a5989afd3d57825c2feb30163d'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='5373df85-b25a-4614-b18b-d1648283960d', node_type=<ObjectType.TEXT: '1'>, metadata={'page_label': '10', 'file_name': 'State of AI Report 2023 - ONLINE.pdf', 'section': 'Introduction'}, hash='718605586a22be21923ab626976c8cc2788459558c6d57ee5307d9becf7cfe75'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='869eec4e-d491-46a6-9233-a478021eeecb', node_type=<ObjectType.TEXT: '1'>, metadata={'page_label': '10', 'file_name': 'State of AI Report 2023 - ONLINE.pdf', 'section': 'Introduction'}, hash='596b10dffc166eaad3b140ef640f3aab64f44ecea123c5ebba02977b16262c61')}, hash='0d9079b5a0e2838ab2a53549d1d487ce2b7833ff93316606f161f2b29524ec7d', text='Generative audio tools emerge that attract over 100,000 developers by September 2023. YES Both ElevenLabs and Resemble.ai claim over 1 million users each since launch. \nGAFAM invests >$1B into an AGI or open source AI company (e.g. OpenAI). YES Microsoft invested a further $10B into OpenAI in Jan. 2023. \nReality bites for semiconductor startups in the face of NVIDIA’s dominance and a high \nprofile start-up is shut down or acquired for <50% of its most recent valuation. NO There have been markdowns, but no major shutdowns or depressed acquisitions. \nA proposal to regulate AGI Labs like Biosafety Labs (BSL) gets backing from an elected \nUK, US or EU politician. NO Calls for regulation have significantly heightened, but no backing for BSL yet.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
slides_nodes[42].text
Generative audio tools emerge that attract over 100,000 developers by September 2023. YES Both ElevenLabs and Resemble.ai claim over 1 million users each since launch. \nGAFAM invests >$1B into an AGI or open source AI company (e.g. OpenAI). YES Microsoft invested a further $10B into OpenAI in Jan. 2023. \nReality bites for semiconductor startups in the face of NVIDIA’s dominance and a high \nprofile start-up is shut down or acquired for <50% of its most recent valuation. NO There have been markdowns, but no major shutdowns or depressed acquisitions. \nA proposal to regulate AGI Labs like Biosafety Labs (BSL) gets backing from an elected \nUK, US or EU politician. NO Calls for regulation have significantly heightened, but no backing for BSL yet.
len(slides_nodes[42].text)
748
🪟 SentenceWindowNodeParser
The SentenceWindowNodeParser class is designed to parse documents into nodes (sentences) and capture a window of surrounding sentences for each node.
This can be useful for context-aware text processing, where understanding the surrounding context of a sentence can provide valuable insights.
- Node: Represents a unit of text, in this case, a sentence.
- Window: A range of sentences surrounding a particular sentence. For example, if the window size is 3, and the current sentence is the 5th sentence, the window will capture sentences 2 to 8.
- Metadata: Additional information associated with a node, such as the window of surrounding sentences.
Here’s what happens during instantiation with the provided arguments:
When you create an instance of the SentenceWindowNodeParser using the from_defaults method with the custom_sentence_splitter (which splits text on the "\n●", "\n-", or "\n" delimiters) and the specified parameters (window_size=3, include_prev_next_rel=True, include_metadata=True), you’re setting up a parser to process documents as follows:
- Each document’s text will be divided into sentences using the custom splitter.
- For each sentence, a node is generated.
- This node will contain metadata capturing the surrounding 3 sentences on each side.
- Additionally, each node will reference its preceding and succeeding sentences.
- Calling get_nodes_from_documents with a list of documents will return a list of these nodes, each representing a sentence, enriched with the specified metadata and relationships.
from llama_index.node_parser import SentenceWindowNodeParser
from typing import List
import re


def custom_sentence_splitter(text: str) -> List[str]:
    return re.split(r'\n●|\n-|\n', text)


bullet_node_parser = SentenceWindowNodeParser.from_defaults(
    sentence_splitter=custom_sentence_splitter,
    window_size=3,
    include_prev_next_rel=True,
    include_metadata=True
)
⚙️ Processing Nodes
The code below processes a list of base nodes (slides_nodes).
For each base node, it generates sub-nodes using the SentenceWindowNodeParser (with custom settings). Then, it converts the base nodes and their corresponding sub-nodes into IndexNode instances.
The final list of IndexNode instances is stored in all_nodes.
What is an IndexNode in LlamaIndex?
An IndexNode is a node object used in LlamaIndex.
It represents chunks of the original documents that are stored in an Index. The Index is a data structure that allows for quick retrieval of relevant context for a user query, which is fundamental for retrieval-augmented generation (RAG) use cases.
At its core, the IndexNode inherits properties from a TextNode, meaning it primarily represents textual content.
However, the distinguishing feature of an IndexNode is its index_id attribute. This index_id acts as a unique identifier or reference to another object, allowing the node to point or link to other entities within the system.
This referencing capability adds a layer of connectivity and relational information on top of the textual content.
For example, in the context of recursive retrieval and node references, smaller chunks (represented as IndexNode objects) can point to bigger parent chunks. Smaller chunks are retrieved during query time, but references to bigger chunks are followed.
This allows for more context for synthesis.
Conceptually, you can think of an IndexNode as a bridge or link node.
While it holds its own textual content (inherited from TextNode), it also serves as a pointer or reference to another entity in the system, as indicated by its index_id.
This dual nature allows for more complex and interconnected data structures, where nodes can represent both content and relationships to other objects.
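As a rough sketch (with made-up values), you can construct an IndexNode directly; the code that follows instead derives them from existing nodes with IndexNode.from_text_node:

# Hypothetical example: a small child chunk that points to a larger parent chunk
child_node = IndexNode(
    text="FlashAttention makes attention linear in sequence length.",
    index_id="parent-node-id-123",  # made-up reference to the parent node it links to
)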
sub_node_parsers = [bullet_node_parser]

all_nodes = []

for base_node in slides_nodes:
    for parser in sub_node_parsers:
        sub_nodes = parser.get_nodes_from_documents([base_node])
        sub_inodes = [
            IndexNode.from_text_node(sn, base_node.node_id) for sn in sub_nodes
        ]
        all_nodes.extend(sub_inodes)

    # also add the original node to the list
    original_node = IndexNode.from_text_node(base_node, base_node.node_id)
    all_nodes.append(original_node)
all_nodes_dict = {n.node_id: n for n in all_nodes}
Embedding model and LLM
For this example, you’ll use the BGE embedder from HuggingFace as the embedding model.
You’ll also use DeciLM-6B-Instruct as the LLM.
DeciLM-6B-Instruct is an advanced large language model derived from the foundational DeciLM 6B. The base model, DeciLM 6B, boasts 5.7 billion parameters and serves as a cornerstone in the realm of LLMs. Recognizing the need for specialized instruction-following capabilities, DeciLM 6B was meticulously fine-tuned to create DeciLM 6B-Instruct. This refined model achieves a throughput that’s 15 times higher than its competitor, Llama 2 7B, all while upholding exceptional quality in its outputs.
A standout feature of DeciLM is its architectural innovation. Unlike conventional LLMs, DeciLM incorporates a unique implementation of variable Grouped-Query Attention (GQA). This innovation marks DeciLM as the pioneering LLM where transformer layers are not mere structural duplicates of one another. The genesis of this distinctive architecture can be attributed to Deci’s proprietary Neural Architecture Search engine, AutoNAC, which played a pivotal role in shaping the model’s structure.
In terms of training, DeciLM 6B was rigorously trained using a subset of the SlimPajamas dataset. This extensive dataset served as a robust foundation for the model’s learning. To further enhance its capabilities, the model underwent a specialized fine-tuning process known as LoRA, resulting in the creation of DeciLM 6B-Instruct.
When benchmarked against other models in its category, both DeciLM 6B and DeciLM 6B-Instruct consistently emerge as top contenders. Their performance is especially noteworthy when considering models in the 7 billion parameter category, showcasing their prowess and setting new standards in the LLM domain.
DeciLM-6B-Instruct epitomizes the fusion of innovative architecture, rigorous training, and top-tier performance, making it a formidable player in the world of efficient AI modeling.
from llama_index.embeddings import resolve_embed_model

embed_model = resolve_embed_model("local:BAAI/bge-large-en-v1.5")
# Define a new prompt template
template = """Below is context that has been retrieved. Your task is to synthesize \
the query, which is delimited by triple backticks, and write a response that appropriately answers the query based on the retrieved context.

### Query:
```{query_str}```

### Response:
Begin!
"""
%%capture
from llama_index.llms import HuggingFaceLLM
from llama_index.prompts import PromptTemplate

llm = HuggingFaceLLM(
    model_name="Deci/DeciLM-6b-instruct",
    tokenizer_name="Deci/DeciLM-6b-instruct",
    query_wrapper_prompt=PromptTemplate("<|system|>\n</s>\n<|user|>\n{query_str}</s>\n<|assistant|>\n"),
    # query_wrapper_prompt=PromptTemplate(template),
    context_window=4096,
    max_new_tokens=512,
    model_kwargs={'trust_remote_code': True},
    generate_kwargs={"temperature": 0.0},
    device_map="auto",
)
💁🏽♂️ Service Context
The ServiceContext in LlamaIndex is a utility container that bundles commonly used resources during the indexing and querying stages of a LlamaIndex pipeline or application.
It can be used to set both global and local configurations at specific parts of the pipeline.
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
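If you’d rather not pass service_context into every index and query engine explicitly, you can (optionally) register it as the global default. A minimal sketch:

from llama_index import set_global_service_context

# Every subsequent index/query engine picks up this LLM and embedding model
set_global_service_context(service_context)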
↗️ VectorStoreIndex in LlamaIndex
A VectorStoreIndex in LlamaIndex is a type of index that uses vector representations of text for efficient retrieval of relevant context.
It is built on top of a VectorStore, which is a data structure that stores vectors and allows for quick nearest-neighbor search.
The VectorStoreIndex takes in IndexNode objects, which represent chunks of the original documents.
It uses an embedding model (specified in the ServiceContext) to convert the text content of these nodes into vector representations. These vectors are then stored in the VectorStore.
During query time, the VectorStoreIndex can quickly retrieve the most relevant nodes for a given query.
It does this by converting the query into a vector using the same embedding model, and then performing a nearest-neighbor search in the VectorStore.
vector_index_chunk = VectorStoreIndex(
    all_nodes, service_context=service_context
)
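Embedding every node takes a while, so it can be worth persisting the index to disk once it’s built. A minimal sketch (the ./saved_index directory name is just an example):

# Persist the index so you don't have to re-embed the nodes next time
vector_index_chunk.storage_context.persist(persist_dir="./saved_index")

# Later, reload it from disk
from llama_index import StorageContext, load_index_from_storage

storage_context = StorageContext.from_defaults(persist_dir="./saved_index")
vector_index_chunk = load_index_from_storage(
    storage_context, service_context=service_context
)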
The as_retriever method of a VectorStoreIndex in LlamaIndex is used to create a retriever object from the index.
A retriever is a component that is responsible for fetching relevant context from the index given a user query.
When you call as_retriever on a VectorStoreIndex, it returns a VectorStoreRetriever object.
This retriever uses the vector representations stored in the VectorStoreIndex to perform an efficient nearest-neighbor search and retrieve the most relevant IndexNode objects for a given query.
Below, this is configured to fetch the 2 most similar chunks.
vector_retriever_chunk = vector_index_chunk.as_retriever(similarity_top_k=2)
🔁 RecursiveRetriever in LlamaIndex
The RecursiveRetriever is designed to recursively explore links from nodes to other retrievers or query engines.
This means that when the retriever fetches nodes, if any of those nodes point to another retriever or query engine, the RecursiveRetriever will follow that link and query the linked retriever or engine as well.
If any of the retrieved nodes are IndexNodes, the retriever will explore the linked retriever or query engine associated with those IndexNodes and initiate a query on that linked entity.
The RecursiveRetriever is designed to handle complex retrieval tasks, especially when data is spread across different retrievers or query engines. It follows links, retrieves data from linked sources, and can combine results from multiple sources into a single coherent response.
Here’s a brief explanation of the arguments:
- root_id: The root ID of the query graph; in this case, you pass "vector".
- retriever_dict: A dictionary mapping IDs to retrievers.
- node_dict: A dictionary mapping IDs to nodes.
retriever_chunk = RecursiveRetriever(
    "vector",
    retriever_dict={"vector": vector_retriever_chunk},
    node_dict=all_nodes_dict,
    verbose=True,
)
The retrieve method accepts a query, which can be either a simple string or a more structured QueryBundle object.
If given a string, the method converts it into a QueryBundle. It then calls an internal method to fetch a list of nodes based on this query.
Each node in the list is paired with a score, indicating its relevance or confidence in relation to the query.
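For example, here’s a minimal sketch of passing a QueryBundle instead of a plain string (the behaviour is the same as passing the string directly):

from llama_index import QueryBundle

nodes = retriever_chunk.retrieve(QueryBundle("What is FlashAttention?"))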
display_source_node accepts a NodeWithScore object, which consists of a node and its associated score. It displays the node ID, its similarity score, and a truncated version of its content.
When displaying the content of a source node, the text is truncated (or shortened) to the specified source_length.
This helps in ensuring that the displayed text remains concise and doesn’t overwhelm the notebook’s display area, especially when the original text is very long.
nodes = retriever_chunk.retrieve(
    "What is FlashAttention?"
)

for node in nodes:
    display_source_node(node, source_length=1000)
Retrieving with query id None: What is FlashAttention?
Retrieved node with id, entering: 443fcfd9-0e1e-40e1-a9aa-2d0c531ae36c
Retrieving with query id 443fcfd9-0e1e-40e1-a9aa-2d0c531ae36c: What is FlashAttention?

Node ID: 443fcfd9-0e1e-40e1-a9aa-2d0c531ae36c
Similarity: 0.7095023791690348
Text: ●FlashAttention introduces a significant memory saving by making attention linear instead of quadratic in sequence length. FlashAttention-2 further improves computing the attention matrix by having fewer non-matmul FLOPS, better parallelism and better work partitioning. The result is a 2.8x training speedup of GPT-style models. ●Reducing the number of bits in the parameters reduces both the memory footprint and the latency of LLMs. The case for 4-bit precision: k-bit Inference Scaling Laws shows across a variety of LLMs that 4-bit quantisation is universally optimal for maximizing zero-shot accuracy and reducing the number of bits used. ●Speculative decoding enables decoding multiple tokens in parallel through multiple model heads rather than forward passes, speeding up inference by 2-3X for certain models. ●SWARM Parallelism is a training algorithm designed for poorly connected and unreliable devices.
🧑🏽‍💻 RetrieverQueryEngine in LlamaIndex
A RetrieverQueryEngine in LlamaIndex is a type of query engine that uses a retriever to fetch relevant context from an index given a user query.
It is designed to work with retrievers, such as the VectorStoreRetriever created from a VectorStoreIndex.
The RetrieverQueryEngine takes a retriever and a response synthesizer as inputs. The retriever is responsible for fetching relevant IndexNode objects from the index, while the response synthesizer is used to generate a natural language response based on the retrieved nodes and the user query.
Response mode
For this example, you’ll use the "compact" response mode.
Compact combines text chunks into larger consolidated chunks that more fully utilize the available context window, then refines the answer across them.
Refer to the docs for a full description of all the response modes.
query_engine_chunk = RetrieverQueryEngine.from_args(
    retriever_chunk,
    service_context=service_context,
    verbose=True,
    response_mode="compact"
)
Now, you can query the State of AI 2023 Report!
response = query_engine_chunk.query(
    "Who are the authors of this report?"
)

str(response)
Retrieving with query id None: Who are the authors of this report?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: f9f977fb-047c-4afd-afc0-aafbc1036137
Retrieving with query id f9f977fb-047c-4afd-afc0-aafbc1036137: Who are the authors of this report?
Retrieved node with id, entering: e5addaec-e204-48c4-8892-02abe46fc697
Retrieving with query id e5addaec-e204-48c4-8892-02abe46fc697: Who are the authors of this report?

Nathan Benaich and the team at Air Street Capital.
response = query_engine_chunk.query(
    "What is new about FlashAttention?"
)

str(response)
Retrieving with query id None: What is new about FlashAttention?
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:362: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.0` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: 443fcfd9-0e1e-40e1-a9aa-2d0c531ae36c
Retrieving with query id 443fcfd9-0e1e-40e1-a9aa-2d0c531ae36c: What is new about FlashAttention?
Retrieved node with id, entering: a764c3d7-d887-4789-8b71-6ff947ba66ca
Retrieving with query id a764c3d7-d887-4789-8b71-6ff947ba66ca: What is new about FlashAttention?

FlashAttention introduces a significant memory saving by making attention linear instead of quadratic in sequence length. FlashAttention-2 further improves computing the attention matrix by having fewer non-matmul FLOPS, better parallelism and better work partitioning. The result is a 2.8x training speedup of GPT-style models.
response = query_engine_chunk.query(
    "Does the report mention anything about inference and latency concerns?"
)

str(response)
Retrieving with query id None: Does the report mention anything about inference and latency concerns?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: ee90f521-b7f8-4c16-bccf-5fe5db985766
Retrieving with query id ee90f521-b7f8-4c16-bccf-5fe5db985766: Does the report mention anything about inference and latency concerns?
Retrieved node with id, entering: e7ef0e97-ad6c-4060-9210-486213072099
Retrieving with query id e7ef0e97-ad6c-4060-9210-486213072099: Does the report mention anything about inference and latency concerns?

Yes, the report mentions that AlphaZero has been used to reach superhuman levels in chess, Go, and shogi, or even to improve chip design. It also mentions that AlphaDev reformulates code optimization as an RL problem. The discovered algorithms for sort3, sort4, and sort5, led to improvements of ~1.7% for sequences larger than 250K. These were open-sourced in the ubiquitous LLVM library.
response = query_engine_chunk.query(
    "What does the report say about text to image models?"
)

str(response)
Retrieving with query id None: What does the report say about text to image models?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: ae4e725d-2874-4c6d-8d5f-b4fe3bd9081f
Retrieving with query id ae4e725d-2874-4c6d-8d5f-b4fe3bd9081f: What does the report say about text to image models?
Retrieved node with id, entering: dff79d14-69a2-4c4d-8953-ce3609ff922f
Retrieving with query id dff79d14-69a2-4c4d-8953-ce3609ff922f: What does the report say about text to image models?

The report discusses the advancements in text-to-image models, such as NeRFs, and their applications in various fields, including 3D modeling, text-to-3D synthesis, and image editing. NeRFs have improved in speed and quality, and have enabled GenAI to model 3D geometry. The report also highlights the use of NeRFs in various applications, such as DreamFusion and Score Jacobian Chaining, which use pretrained 2D text-to-image diffusion models to perform text-to-3D synthesis. RealFusion and SKED are methods that edit an entire NeRF scene rather than a region or generate from scratch. Instruct-Nerf2Nerf edits an entire NeRF scene rather than a region or generating from scratch.
response = query_engine_chunk.query(
    "Summarize the research section of the report"
)

str(response)
Retrieving with query id None: Summarize the research section of the report
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: 085efe41-df16-410f-8c3c-92887ee57312
Retrieving with query id 085efe41-df16-410f-8c3c-92887ee57312: Summarize the research section of the report
Retrieved node with id, entering: 673b020c-39b4-4fd9-b7d9-e9333a7fef53
Retrieving with query id 673b020c-39b4-4fd9-b7d9-e9333a7fef53: Summarize the research section of the report

The research section of the report discusses the various methods and techniques used by artificial intelligence (AI) systems to improve their performance and capabilities. These methods include chain of thought prompting, tree of thought, graph of thought, and auto-chain of thought. The report also highlights the importance of quality prompts in enhancing AI performance.
response = query_engine_chunk.query(
    "What does the report say about the importance of quality prompts?"
)

str(response)
Retrieving with query id None: What does the report say about the importance of quality prompts?
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.
Retrieved node with id, entering: 673b020c-39b4-4fd9-b7d9-e9333a7fef53
Retrieving with query id 673b020c-39b4-4fd9-b7d9-e9333a7fef53: What does the report say about the importance of quality prompts?
Retrieved node with id, entering: 620ed33d-3f44-402a-b37d-34ecb4ceb8ec
Retrieving with query id 620ed33d-3f44-402a-b37d-34ecb4ceb8ec: What does the report say about the importance of quality prompts?

The report highlights the importance of quality prompts in improving task performance and enhancing the capabilities of LLMs. It emphasizes the role of prompting techniques such as Chain of Thought prompting (CoT), Tree of Thought (ToT), and Graph of Thought (GoT) in enhancing the reasoning capabilities of LLMs. These prompting techniques help the LLM generate intermediate reasoning steps, which can lead to better task performance and improved safety.
Wrapping Up Our LlamaIndex Journey
And there we have it – a deep dive into the vast ocean of LlamaIndex, surfacing with a treasure trove of knowledge.
We began by understanding the fundamental concepts of documents and nodes and swiftly transitioned into the nuances of attaching metadata to these nodes. The journey further led us to the intricate world of text data, where we explored the art of cleaning for RAG and delved into some insightful exploratory data analysis.
But, what’s knowledge without application? We ventured into establishing parent-child relationships using the dynamic duo of SimpleNodeParsers and SentenceWindowNodeParsers.
Our exploration didn’t stop there.
We unlocked the mysteries of RecursiveRetrievers, ensuring we had a robust mechanism for information retrieval. And, as the grand finale, we tapped into the power of generation with the context we so meticulously retrieved.
While this blog serves as a crash course, remember, the world of LlamaIndex is vast and ever-evolving. I hope this exploration fuels your curiosity further and equips you with the tools to harness AI for more such exciting endeavors.
Next Step: Overcoming LLM Deployment Challenges
Having mastered the use of LlamaIndex and DeciLM for building applications, you’re now at a pivotal point: deployment. This stage introduces the real-world challenges of inference cost, latency and throughput in LLMs.
The complex computations required by LLMs can result in high latency, adversely affecting the user experience, particularly in real-time applications. Additionally, a crucial challenge is managing low throughput, which leads to slower response times and difficulties in processing multiple user requests simultaneously. This often requires the adoption of more expensive, high-performance hardware to enhance throughput, further increasing operational costs. Therefore, the need to invest in such hardware adds to the inherent computational expenses of deploying these models.
Deci’s Infery-LLM offers a robust solution to these challenges. This Inference SDK significantly enhances LLM performance, achieving up to fivefold throughput increases while maintaining accuracy. Crucially, it optimizes the use of computational resources, enabling the deployment of larger models on more economical GPUs and thus reducing overall operational costs.
When combined with Deci’s open-source models like DeciCoder or DeciLM 6B, Infery-LLM’s efficiency is further amplified. These models, optimized for performance, pair seamlessly with the SDK, enhancing its ability to minimize latency and reduce costs.
Below is a chart that demonstrates the throughput acceleration on NVIDIA A10 GPUs using DeciLM 6B with Infery-LLM, compared to the standard performance of both DeciLM 6B and Llama 2, as well as Llama 2 utilized with vLLM, an open-source library for LLM inference and serving. This comparison highlights the feasibility of migrating from more powerful NVIDIA A100 GPUs to the A10 models, showcasing efficient performance on the less resource-intensive hardware.
In summary, Infery-LLM is vital for addressing the challenges of latency, throughput, and cost in LLM deployment, proving indispensable for developers and organizations leveraging these sophisticated AI models.
Discover the power of Infery-LLM for yourself; click below for a live demo and witness its groundbreaking impact.