```python
%%capture
!pip install -U -q "transformers==4.31.0" "datasets==2.13.0" "peft==0.4.0" "accelerate==0.21.0" "bitsandbytes==0.41.0" "trl==0.4.7" "safetensors>=0.3.1"
```
Introduction
Large Language Models (LLMs) excel in general language understanding.
Models like GPT-3, PaLM, Llama 2, and DeciLM 6B have shown outstanding abilities in various natural language tasks. However, when you need to perform specialized tasks, you might observe a difference between the model’s training objectives and your specific task requirements. In such cases, you must customize the model to meet your needs.
One way to do this is through supervised fine-tuning.
It allows you to adapt LLMs to specific tasks by using labelled data. This process involves loading a pre-trained model, fine-tuning it with a task-specific dataset, and adjusting it to fit your task requirements. The end result is a powerful model that is tailored to your particular task.
Instruction Tuning is a subset of supervised fine-tuning that has become popular for its ability to refine the capabilities and controllability of LLMs.
While a base model focuses on next-word prediction, that objective is only loosely connected to the user’s desire for the model to follow specific instructions. Instruction tuning bridges this gap by training LLMs on datasets that pair instructions with their desired outputs.
Instruction Tuning
Instruction tuning improves the capabilities and controllability of LLMs by fine-tuning a base model using pairs of instructions and their corresponding outputs.
This approach helps to align LLMs more closely with human instructions, making them more controllable, predictable, and adaptable without requiring extensive retraining.
The instruction tuning process follows a clear pipeline: a base model is trained on instruction-output pairs.
The result is a fine-tuned model that better understands and responds to human instructions, interprets human language more accurately, and produces more reliable outputs.
At a high level, it’s a two-step process:
- Instruction Dataset Construction: Gather instructions from existing datasets or generate them using LLMs.
- Instruction Tuning: Fine-tune a base model with the assembled instruction dataset, ensuring it adheres more closely to human directives.
Two Primary Methods for Constructing Instruction Datasets
Employing these methods ensures diverse instruction sets in the datasets. Consequently, models trained on them excel in understanding and executing various tasks.
1. Data Integration from Annotated Natural Language Datasets
- Extracts (instruction, output) pairs from existing annotated natural language datasets.
- Transforms text-label pairs into (instruction, output) pairs using templates (see the sketch after this list).
2. Generating Outputs with LLMs
- Uses LLMs such as GPT-3.5-Turbo and GPT-4 to quickly generate outputs for specific instructions, eliminating manual collection.
- Instructions are sourced in two ways:
- Direct manual collection.
- Expansion from a select set of seed instructions with LLM assistance.
- LLMs then generate the corresponding outputs.
- For multi-turn conversational datasets, LLMs engage in self-play, simulating roles like user or AI assistant to craft conversations.
- Make sure you’re using any generated output from any LLM in accordance with the license that governs it.
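To make the first method concrete, here’s a minimal sketch of turning an annotated (text, label) pair into an (instruction, output) pair with a template. The sentiment-classification example and template wording are hypothetical and aren’t drawn from any of the datasets below.

```python
# Hypothetical example: convert a (text, label) pair from a sentiment-classification
# dataset into the (instruction, input, output) format used for instruction tuning.
def text_label_to_instruction(text: str, label: str) -> dict:
    return {
        "instruction": "Classify the sentiment of the following review as positive or negative.",
        "input": text,
        "output": label,
    }

print(text_label_to_instruction("The battery lasts all day and the screen is gorgeous.", "positive"))
```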
Open Instruct V1 – A dataset for having LLMs follow instructions.
For this tutorial, we’ll use Open Instruct V1, an amalgamation of various datasets that are cleaned and then collated into a single format for training. Open Instruct V1 includes:
| Dataset | Number of Samples |
|---|---|
| Alpaca | 51,759 |
| Self Instruct | 82,599 |
| GPT-4 Instruct | 18,194 |
| Code Alpaca | 18,019 |
| Dolly | 15,015 |
| Synthetic | 33,143 |
| Roleplay | 3,146 |
| asss | 448 |
| instruction-dataset | 327 |
The downloaded dataset files have a size of 316 MB. After auto-conversion, the Parquet files are approximately 165 MB in size.
The dataset comprises a total of 498,813 rows.
There’s a cool space on HuggingFace where you can interactively explore the dataset. A great collection of open-source datasets for training instruction-following LLMs is awesome-instruction-dataset on GitHub.
```python
%%capture
from datasets import load_dataset

open_instruct_dataset = load_dataset("hakurei/open-instruct-v1", split="train")

# filter dataset to rows where the entire context length is less than or equal to 4096,
# which is the size of the DeciLM-6B context window
dataset = open_instruct_dataset.filter(
    lambda example: (len(example["input"]) + len(example["output"]) + len(example["instruction"])) <= 4096
)
```
open_instruct_dataset.to_pandas()[:10]
| | input | output | instruction |
|---|---|---|---|
| 0 | | 1. Eat a balanced diet and make sure to includ… | Give three tips for staying healthy. |
| 1 | | The three primary colors are red, blue, and ye… | What are the three primary colors? |
| 2 | | An atom is made up of a nucleus, which contain… | Describe the structure of an atom. |
| 3 | | There are a number of ways to reduce air pollu… | How can we reduce air pollution? |
| 4 | | I had to make a difficult decision when I was … | Pretend you are a project manager of a constru… |
| 5 | Twitter, Instagram, Telegram | Telegram | Identify the odd one out. |
| 6 | 4/16 | The fraction 4/16 is equivalent to 1/4 because… | Explain why the following fraction is equivale… |
| 7 | | John was at a crossroads in his life. He had j… | Write a short story in third person narration … |
| 8 | He finnished his meal and left the resturant | He finished his meal and left the restaurant. | Evaluate this sentence for spelling and gramma… |
| 9 | | Julius Caesar was assassinated by a group of u… | How did Julius Caesar die? |
We’re randomly selecting only 5,000 examples. You can change this to a larger number if you want; more samples means better performance.
```python
import random

total_data_points = len(dataset)
sample_size = 5_000

random_indices = random.sample(range(total_data_points), sample_size)
subset = dataset.select(random_indices)
```
len(subset)
5000
Formatting Dataset for Instruction Tuning
Cleaning and Formatting Domain-Specific Text
Before fine-tuning, it’s essential to clean and format the text specific to the domain you’re interested in.
This involves removing any irrelevant or redundant information, ensuring the text is coherent, and structuring it in a way that’s suitable for the LLM.
Generating Synthetic Instruction-Based Fine-tuning Datasets
After the dataset is cleaned and formatted, the next step is to generate synthetic instruction-based datasets for the desired domain.
This involves creating datasets where each entry consists of an instruction and the corresponding desired output.
Structuring the Data
The structure of the data is crucial for the desired behavior of the LLM. For instance, if you want to improve the LLM’s question-answering capabilities, you can use question and answer pairs.
If you only have long text files available, they can be chunked or annotated to create a structured format.
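As a rough illustration (the chunk size and overlap below are arbitrary choices for the sketch, not values used later in this tutorial), long documents can be split into overlapping chunks so that each piece fits the model’s context window:

```python
# Illustrative sketch: split a long text into overlapping character-level chunks.
def chunk_text(text: str, chunk_size: int = 4096, overlap: int = 256) -> list[str]:
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

long_document = "some very long domain-specific text " * 500  # placeholder text
print(len(chunk_text(long_document)))  # number of chunks produced
```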
Alpaca instruction format
There are a few different ways you can format your instruction prompt. In this tutorial we’ll use the Alpaca format, which is:
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Input:
{input}

### Response:
```
If you wanted, you could try to use the template that the authors of the Llama 2 paper used:
```
<s>[INST] <<SYS>>
System prompt
<</SYS>>

User prompt [/INST] Model answer </s>
```
The function `format_row_as_instruction_prompt` formats an example into the Alpaca instruction prompt for fine-tuning. It takes an example with ‘instruction’, ‘input’, and ‘output’ fields and turns it into a structured prompt that can be used for instruction-based fine-tuning of a language model.
Primer Prompts
- `primer_prompt_with_input`: A general instruction that sets the context for the model, indicating that there’s an instruction paired with an input, and the model needs to generate a response.
- `primer_prompt_no_input`: A general instruction that only mentions the instruction, without any paired input. The function falls back to this variant when the example has no input.
Instruction Template
The `instruction_template` provides a clear format for the instruction from the example. It starts with a general directive to the model, followed by the specific instruction from the example.
Input Template
The `input_template` formats the input from the example, clearly labeling it as “Input”.
Response Template
The `response_template` formats the output from the example, labeling it as “Response”.
Return Statement
The function returns a formatted string that combines the primer prompt (with input), the instruction, the input, and the response. This structured format is designed to guide the model during fine-tuning, ensuring it understands the context and the expected output format.
```python
def format_row_as_instruction_prompt(example):
    # Check if 'input' key exists and has content
    has_input = example.get('input', None) is not None

    # Define the prompts based on the presence of input
    if has_input:
        primer_prompt = ("Below is an instruction that describes a task, paired with an input "
                         "that provides further context. Write a response that appropriately completes the request.")
        input_template = f"### Input: \n{example['input']}\n\n"
    else:
        primer_prompt = ("Below is an instruction that describes a task. "
                         "Write a response that appropriately completes the request.")
        input_template = ""

    instruction_template = f"### Instruction: \n{example['instruction']}\n\n"

    # Check if 'output' key exists
    if example.get('output', None):
        response_template = f"### Response: \n{example['output']}\n\n"
    else:
        response_template = ""

    return f"{primer_prompt}\n\n{instruction_template}{input_template}{response_template}"


# Test with an example dictionary
test_example = {
    'instruction': "Open the door.",
    'input': "The door is locked.",
    'output': "Use the key to unlock and then open the door."
}

print(format_row_as_instruction_prompt(test_example))
```
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Open the door.

### Input:
The door is locked.

### Response:
Use the key to unlock and then open the door.
```
Let’s just confirm the format is as we expect:
print(format_row_as_instruction_prompt(open_instruct_dataset[5]))
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Identify the odd one out.

### Input:
Twitter, Instagram, Telegram

### Response:
Telegram
```
print(format_row_as_instruction_prompt(open_instruct_dataset[0]))
```
Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Give three tips for staying healthy.

### Input:


### Response:
1. Eat a balanced diet and make sure to include plenty of fruits and vegetables. 2. Exercise regularly to keep your body active and strong. 3. Get enough sleep and maintain a consistent sleep schedule.
```
Training Preparation
BitsandBytes
The BitsandBytes library is integrated with Hugging Face’s Transformers library to simplify the process of model quantization.
Quantization is a technique used to reduce the precision of numerical values in a model. Instead of using high-precision data types like 32-bit floating-point numbers, quantization represents values using lower-precision data types, such as 8-bit integers. This process significantly reduces memory usage and can speed up model execution while maintaining acceptable accuracy.
The integration of Hugging Face’s Transformers library with the BitsandBytes library makes this technique more accessible and user-friendly.
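To build intuition for what quantization does, here’s a toy absmax int8 example. It’s purely illustrative; it is not the NF4 scheme that bitsandbytes applies below.

```python
import torch

# Toy absmax quantization: map fp32 weights to int8 and back.
weights = torch.randn(4, 4)                        # pretend these are fp32 model weights
scale = weights.abs().max() / 127                  # the largest magnitude maps to 127
quantized = torch.round(weights / scale).to(torch.int8)
dequantized = quantized.to(torch.float32) * scale  # approximate reconstruction

print((weights - dequantized).abs().max())         # small quantization error
```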
BitsAndBytesConfig
The `BitsAndBytesConfig` configures the quantization process for a model, specifying that it should use 4-bit quantization with the NF4 data type and compute using the torch.bfloat16 data type. The nested quantization technique is also enabled for enhanced memory efficiency.

- `load_in_4bit=True`: This argument indicates that the model should be loaded in 4-bit quantization. By doing so, memory usage can be reduced by approximately fourfold.
- `bnb_4bit_use_double_quant=True`: This argument enables the nested quantization technique, which offers even greater memory efficiency without compromising performance.
- `bnb_4bit_quant_type="nf4"`: The NF4 data type is designed for weights initialized using a normal distribution. By specifying this type, the model uses the NF4 data type for quantization.
- `bnb_4bit_compute_dtype=torch.bfloat16`: This argument allows you to modify the data type used during computation. Setting it to `torch.bfloat16` can result in speed improvements in specific scenarios.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "Deci/DeciLM-6b"

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    use_cache=False,
    device_map="auto",
    trust_remote_code=True
)

model.config.pretraining_tp = 1

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
```
PEFT
Fine-tuning a model for each specific task by adapting all of its parameters can be impractical due to the large number of parameters in modern models.
In such scenarios, PEFT proposes techniques that only train a smaller subset of parameters or use low-rank adaptation (LoRA) methods to minimize the number of trainable parameters. This way, the process becomes more efficient and practical.
PEFT is a library created by HuggingFace that focuses on efficiently adjusting pre-trained language models for various downstream applications without fine-tuning all of the model’s parameters.
The primary goal of PEFT is to address the increasing computational and storage costs associated with fine-tuning large-scale PLMs by fine-tuning only a small number of additional model parameters, significantly reducing these costs. Despite reducing the number of fine-tuned parameters, recent PEFT techniques have achieved performance comparable to full fine-tuning. PEFT provides tools and methods that make fine-tuning large language models more efficient and accessible, especially on consumer hardware.
It balances computational efficiency and model performance, making it a valuable tool for researchers and practitioners working with large-scale PLMs.
LoRA
LoRA’s approach to fine-tuning uses low-rank decomposition to represent weight updates with two smaller matrices.
This reduces the number of trainable parameters, making fine-tuning more efficient. The original pre-trained weights remain frozen, allowing multiple lightweight and portable LoRA models to be built on top of them for different tasks.
LoRA is compatible with many other parameter-efficient methods and you can stack methods.
Fine-tuned models using LoRA perform comparably to fully fine-tuned models, and LoRA doesn’t add any inference latency. It can be applied to any subset of weight matrices in a neural network. Transformer models typically apply LoRA to attention blocks only.
The number of trainable parameters in a LoRA model depends on the size of the low-rank update matrices, which is determined mainly by the rank `r` and the shape of the original weight matrix.
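As a back-of-the-envelope illustration (the 4096 × 4096 matrix shape is assumed purely for the arithmetic; `r=64` matches the LoRA config we use below), here’s how the parameter count works out for a single adapted weight matrix:

```python
# LoRA replaces a full d x k weight update with two factors, B (d x r) and A (r x k),
# so the trainable parameters per adapted matrix drop from d * k to r * (d + k).
d, k, r = 4096, 4096, 64

full_update_params = d * k        # full fine-tuning of one weight matrix: 16,777,216
lora_params = r * (d + k)         # LoRA update matrices: 524,288

print(lora_params / full_update_params)  # ~0.031, roughly 3% of the original
```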
```python
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model

peft_config = LoraConfig(
    lora_alpha=16,
    lora_dropout=0.1,
    r=64,
    bias="none",
    task_type="CAUSAL_LM"
)

model = prepare_model_for_kbit_training(model)
```
LoraConfig
This configuration is used to store the settings for a LoRA model, which is designed for fine-tuning a model with LoRA (Low Rank Adapters). The provided parameters are essential for determining how the LoRA layers behave during the fine-tuning process.
- `lora_alpha` (int): This parameter represents the scaling factor for the weight matrices in LoRA, which is adjusted by alpha to control the magnitude of the combined output from the base model and the low-rank adaptation.
- `r` (int): This represents the LoRA rank of the update matrices, expressed as an int. A lower rank results in smaller update matrices with fewer trainable parameters.
- `bias` (str): This parameter specifies the bias type for LoRA. It can take values such as ‘none’, ‘all’, or ‘lora_only’. If set to ‘all’ or ‘lora_only’, the corresponding biases will be updated during training. This means that even when disabling the adapters, the model might not produce the same output as the base model would have without adaptation. In the given code, the bias is set to “none”, meaning no bias is used.
ℹ️ Note: The weight matrix is multiplied by `lora_alpha/r`, and a higher `lora_alpha` value assigns more weight to the LoRA activations.

For better performance, the HuggingFace docs recommend setting `bias` to `"none"` first, then `"lora_only"`, before trying `"all"`.
- `lora_dropout` (float): This parameter indicates the dropout probability for LoRA layers. Dropout is a regularization technique where randomly selected neurons are ignored during training, helping to prevent overfitting. In this configuration, the dropout rate is set to 0.1, or 10%.
- `task_type` (str): This parameter indicates the type of task for which the model is being fine-tuned. In the provided code, it’s set to `CAUSAL_LM`, which stands for causal language modeling.
prepare_model_for_kbit_training
This method outlines the protocol for preparing a model prior to training. The steps include casting the layernorm layers to fp32, enabling gradients for the output embedding layer, and upcasting the LM head to fp32.
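If you want to verify this yourself (a quick, optional check, assuming the `model` object loaded above), you can tally parameter dtypes after the call and confirm that the quantized weights sit alongside higher-precision layers:

```python
from collections import Counter

# Count parameter dtypes: quantized 4-bit weights are stored in a packed integer format,
# while layernorms and the LM head are kept in higher precision for training stability.
dtype_counts = Counter(str(p.dtype) for p in model.parameters())
print(dtype_counts)
```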
Initialize Training Arguments
- `num_train_epochs` (float, defaults to 3.0): Total number of training epochs to perform (if not an integer, will perform the decimal-part percentage of the last epoch before stopping training).
- `per_device_train_batch_size`: The batch size per device for training.
- `gradient_accumulation_steps`: Number of update steps to accumulate the gradients for, before performing a backward/update pass.
- `gradient_checkpointing`: If True, use gradient checkpointing to save memory at the expense of a slower backward pass.
- `optim`: The optimizer to use. You can choose from:
  - `adamw_hf`
  - `adamw_torch`
  - `adamw_torch_fused`
  - `adamw_torch_xla`
  - `adamw_apex_fused`
  - `adafactor`
  - `adamw_anyprecision`
  - `sgd`
  - `adagrad`
  - `adamw_bnb_8bit`
  - `adamw_8bit` (just an alias for `adamw_bnb_8bit`)
  - `lion_8bit`
  - `lion_32bit`
  - `paged_adamw_32bit`
  - `paged_adamw_8bit`
  - `paged_lion_32bit`
  - `paged_lion_8bit`
  - `rmsprop`
- `save_steps`: Number of update steps between two checkpoint saves if `save_strategy="steps"`. Should be an integer or a float in range `[0,1)`. If smaller than 1, will be interpreted as a ratio of total training steps.
- `bf16`: Whether to use bf16 16-bit (mixed) precision training instead of 32-bit training.
- `max_grad_norm` (float, optional, defaults to 1.0): Maximum gradient norm (for gradient clipping).
- `warmup_ratio` (float, optional, defaults to 0.0): Ratio of total training steps used for a linear warmup from 0 to `learning_rate`.
- `lr_scheduler_type` (defaults to `linear`): The scheduler type to use. You can choose from the following:
  - `linear`
  - `cosine`
  - `cosine_with_restarts`
  - `polynomial`
  - `constant`
  - `constant_with_warmup`
  - `inverse_sqrt`
  - `reduce_lr_on_plateau`
```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="decilm6b_open_instruct",
    # just for demo purposes
    num_train_epochs=1,
    # trying to max out resources on colab
    per_device_train_batch_size=4,
    gradient_accumulation_steps=10,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    logging_steps=25,
    save_strategy="steps",
    save_steps=100,
    learning_rate=3e-5,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="linear",
    disable_tqdm=False
)

model = get_peft_model(model, peft_config)
```
What is the TRL library?
The TRL (Transformers Reinforcement Learning) library is a tool that makes the Reinforcement Learning (RL) step in fine-tuning Large Language Models (LLMs) more straightforward and more flexible. It lets users fine-tune their language models using RL with their custom datasets and training setups. The library supports the Deep RL algorithm called PPO, which can be run distributed or on a single device.
TRL also uses the accelerate
feature from the HuggingFace ecosystem, which allows users to scale up their experiments and achieve better results.
What problems does it solve?
The TRL library solves the challenges associated with fine-tuning Language Models (LLMs) using Reinforcement Learning (RL), mainly when dealing with memory constraints.
Training an LLM with Reinforcement Learning with Human Feedback (RLHF) involves several steps, such as fine-tuning a pre-trained LLM on a specific domain, collecting a human-annotated dataset to train a reward model, and further fine-tuning the LLM using the reward model and dataset with RL. TRL simplifies the RL step in this process, making it more accessible and efficient.
TRL also offers solutions to fit the setup on a single GPU, even with an increased model size. Integrating TRL with Parameter-Efficient Fine-Tuning (PEFT) allows for fine-tuning large LLMs using RLHF at a reasonable cost.
PEFT supports creating and fine-tuning adapter layers on LLMs, enabling fine-tuning with significantly reduced GPU memory requirements. By integrating TRL and PEFT, users can fine-tune massive models on a 24GB consumer GPU, which typically requires 40GB in bfloat16
.
What is the SFTTrainer?
The SFTTrainer
(Supervised Fine-tuning Trainer) is a class provided by the TRL (Transformers Reinforcement Learning) library.
It facilitates supervised fine-tuning, a crucial step in RLHF (Reinforcement Learning with Human Feedback). The SFTTrainer provides an easy-to-use API to create and train SFT models with just a few lines of code on a given dataset.
When initializing the SFTTrainer
class, you pass the following:
- base model to be trained
- the training dataset
- PEFT configurations
- and the method for converting the training data into a “prompt”
What do the following parameters mean?
- `model`: This can be a pre-trained model, a PyTorch module, or a string representing the model name. It specifies the model to be trained.
- `train_dataset`: This is the Dataset used for training.
- `peft_config`: This is the PEFT (Parameter-Efficient Fine-Tuning) library configuration. It allows users to train adapters and share them on the Hub.
- `max_seq_length`: It defines the maximum sequence length for the ConstantLengthDataset and for automatically creating the Dataset. The default value is 512.
- `tokenizer`: This is the tokenizer used for training. If not specified, the tokenizer associated with the model will be used.
- `packing`: Indicates whether multiple short examples should be packed into the same input sequence to increase training efficiency. This is done using the ConstantLengthDataset utility class.
- `formatting_func`: This function is used for formatting the text before tokenization. It’s commonly used for instruction fine-tuning where datasets might have separate columns for prompts and responses.
```python
from trl import SFTTrainer

max_seq_len = 4096

trainer = SFTTrainer(
    model=model,
    train_dataset=subset,
    peft_config=peft_config,
    max_seq_length=max_seq_len,
    tokenizer=tokenizer,
    packing=True,
    formatting_func=format_row_as_instruction_prompt,
    args=args,
)
```
```
/usr/local/lib/python3.10/dist-packages/trl/trainer/utils.py:246: UserWarning: The passed formatting_func has more than one argument. Usually that function should have a single argument `example` which corresponds to the dictonnary returned by each element of the dataset. Make sure you know what you are doing.
  warnings.warn(
```
Training Execution
Once initialized and provided with a dataset, the trainer.train()
method is called to start the training process.
This method internally manages the training loop, including forward and backward passes, optimization steps, and logging.
trainer.train()
Save
trainer.save_model()
Test Model
Now that we’ve instruct-tuned our model – let’s see how it performs!
```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

instruction_tuned_model = AutoPeftModelForCausalLM.from_pretrained(
    "decilm6b_open_instruct",
    low_cpu_mem_usage=True,
    torch_dtype=torch.float16,
    load_in_4bit=True,
    trust_remote_code=True,
    local_files_only=True,
)

tokenizer = AutoTokenizer.from_pretrained(args.output_dir)
```
```python
from random import randrange

sample = dataset[5]
```
print(sample)
{'input': 'Twitter, Instagram, Telegram', 'output': 'Telegram', 'instruction': 'Identify the odd one out.'}
First, let’s see how our model handles this instruction task!
```python
instruction_template = f"### Instruction:\n{sample['instruction']}"
input_template = f"### Input:\n{sample['input']}"
generate_response_template = "### Response:"
true_response_template = f"{sample['output']}"

partial_prompt_list = [instruction_template]

if sample["input"]:
    partial_prompt_list.append(input_template)

partial_prompt_list.append(generate_response_template)

generate_prompt = "\n\n".join(partial_prompt_list)

print(f"Prompt:\n{generate_prompt}\n --------")
```
```
Prompt:
### Instruction:
Identify the odd one out.

### Input:
Twitter, Instagram, Telegram

### Response:
 --------
```
```python
input_ids = tokenizer(generate_prompt, return_tensors="pt", truncation=True).input_ids.cuda()

outputs = instruction_tuned_model.generate(
    input_ids=input_ids,
    max_new_tokens=250,
    do_sample=True,
    temperature=0.5,
    early_stopping=True
)

print(f"Generated Response:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(generate_prompt):]}\n -----")
print(f"Actual Response:\n{true_response_template}")
```
````
Generated Response:

```
Telegram
```

### Explanation:
Telegram is a messaging app that is very popular in Asia.

### Note:
You can select multiple options.
 -----
Actual Response:
Telegram
````
Not bad! It provides a decent explanation – and gets the answer “correct”!
Now let’s see how the base model would’ve performed on the same prompt:
model_id = "Deci/DeciLM-6b" base_decilm_6b = AutoModelForCausalLM.from_pretrained( model_id, low_cpu_mem_usage=True, torch_dtype=torch.float16, load_in_4bit=True, trust_remote_code=True ) base_decilm_6b_tokenizer = AutoTokenizer.from_pretrained(model_id)
```python
instruction_template = f"### Instruction:\n{sample['instruction']}"
input_template = f"### Input:\n{sample['input']}"
generate_response_template = "### Response:"
true_response_template = f"{sample['output']}"

partial_prompt_list = [instruction_template]

if sample["input"]:
    partial_prompt_list.append(input_template)

partial_prompt_list.append(generate_response_template)

generate_prompt = "\n\n".join(partial_prompt_list)

print(f"Prompt:\n{generate_prompt}\n --------")
```
```
Prompt:
### Instruction:
Identify the odd one out.

### Input:
Twitter, Instagram, Telegram

### Response:
 --------
```
```python
input_ids = base_decilm_6b_tokenizer(generate_prompt, return_tensors="pt", truncation=True).input_ids.cuda()

outputs = base_decilm_6b.generate(
    input_ids=input_ids,
    max_new_tokens=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.9
)

print(f"Generated Response:\n{base_decilm_6b_tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(generate_prompt):]}\n -----")
print(f"Actual Response:\n{true_response_template}")
```
```
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
Generated Response:
Instagram

### Explanation:
We are using the "most frequent word" heuristic, which finds the most frequent word in the corpus.
 -----
Actual Response:
Telegram
```
Export and Share
You’ve completed the challenging part of the task, and now it’s time to share your work with the community!
Like our friends HuggingFace, Deci believes in promoting open sharing of knowledge and resources to make artificial intelligence more accessible to everyone.
I urge you to consider sharing your model with the community to help others save time and resources.
And feel free to come and hangout in our Discord community to talk about what you’ve done!
```python
from huggingface_hub import notebook_login

notebook_login()
```
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful
```python
model = AutoPeftModelForCausalLM.from_pretrained(
    args.output_dir,
    low_cpu_mem_usage=True
)

merged_model = model.merge_and_unload()
```
```python
HF_USERNAME = "harpreetsahota"
HF_REPO_NAME = "DeciLM-6B-hf-open-instruct-v1-blog-post"

merged_model.push_to_hub(f"{HF_USERNAME}/{HF_REPO_NAME}")
tokenizer.push_to_hub(f"{HF_USERNAME}/{HF_REPO_NAME}")
```
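Once the push completes, the merged model can be loaded back from the Hub like any other causal LM. Here’s a quick sketch, assuming the repository above is public or you’re authenticated:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = f"{HF_USERNAME}/{HF_REPO_NAME}"

# DeciLM uses a custom architecture, so trust_remote_code is still required.
shared_model = AutoModelForCausalLM.from_pretrained(repo_id, trust_remote_code=True)
shared_tokenizer = AutoTokenizer.from_pretrained(repo_id)
```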
Next Step: Overcoming LLM Deployment Challenges
While the use of LoRA streamlines the fine-tuning of large language models, challenges remain in managing inference latency, throughput, and cost.
The complex computations required by LLMs can result in high latency, adversely affecting the user experience, particularly in real-time applications. Additionally, a crucial challenge is managing low throughput, which leads to slower response times and difficulties in processing multiple user requests simultaneously. This often requires more expensive, high-performance hardware to enhance throughput, increasing operational costs. Therefore, the need to invest in such hardware adds to the inherent computational expenses of deploying these models.
Deci’s Infery-LLM addresses these issues effectively. This Inference SDK boosts LLM performance, offering up to five times higher throughput while maintaining accuracy. Significantly, it optimizes computational resource use, allowing for the deployment of larger models on cost-effective GPUs, which lowers operational costs.
When combined with Deci’s open-source models like DeciCoder or DeciLM 6B, Infery-LLM’s efficiency is further amplified. These models, optimized for performance, pair seamlessly with the SDK, enhancing its ability to minimize latency, boost throughput, and reduce costs.
Below is a chart that demonstrates the throughput acceleration on NVIDIA A10 GPUs using DeciLM 6B with Infery-LLM, compared to the standard performance of both DeciLM 6B and Llama 2, as well as Llama 2 utilized with vLLM, an open-source library for LLM inference and serving. This comparison highlights the feasibility of migrating from more powerful NVIDIA A100 GPUs to the A10 models, showcasing efficient performance on the less resource-intensive hardware.
In conclusion, Infery-LLM is crucial in tackling latency, throughput, and cost challenges in LLM deployment, proving to be an invaluable tool for developers and organizations using these advanced AI models.
Experience the capabilities of Infery-LLM firsthand; click below for a live demo and explore its transformative potential.