Introduction
2023 has proven to be a pivotal year for the open-source landscape, marked by the rise of large language models (LLMs) that hold their own against formidable proprietary counterparts like GPT-3 and GPT-3.5. To help you navigate this dynamic terrain, we dove into the ocean of open source possibilities to curate a select list of the most intriguing and influential models making waves in 2023. Here are the models we have chosen to spotlight for their unique contributions to the landscape:
- LLaMA
- LLaMA 2
- Alpaca
- Vicuna
- Guanaco
- RedPajama
- Falcon
- FLAN-T5
- Stable Beluga (formerly ‘FreeWilly’)
- MPT
For every model on the list, we offer an in-depth exploration covering key details such as its architectural design, the data it was trained on, its training process, licensing, and notable characteristics. Our findings are summarized in this table:
| Model Family Name | Created By | Sizes | Focus | Foundation or Fine-Tuned | License | What’s interesting | Architectural Notes |
|---|---|---|---|---|---|---|---|
| LLaMA | Meta | 7B, 13B, 33B, 65B | Varied | Foundation | Non-commercial | The basis for a vast number of fine-tuned model variants | Uses SwiGLU instead of ReLU activation |
| LLaMA 2 | Meta with Microsoft | 7B, 13B, 70B | Chat | Foundation | Commercial | Balances safety vs. helpfulness better than most other models, including OpenAI’s | Uses SwiGLU instead of ReLU activation and RoPE (over traditional embeddings) |
| Alpaca | Stanford’s CRFM | 7B | Instruction following | Fine-tuned LLaMA 7B | Non-commercial | Trained on text-davinci-003 examples | – |
| Vicuna | LMSYS | 7B, 13B | Chat | Fine-tuned LLaMA 13B | Non-commercial | Trained on human-generated conversations collected from ShareGPT.com | – |
| Guanaco | KBlueLeaf | 7B | Instruction following | Parameter-efficient fine-tuned LLaMA 7B | Non-commercial | Fine-tuned using QLoRA | – |
| RedPajama | Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research | 3B, 7B | Chat, Instruction following | Foundation | Commercial | Trained on the fully open RedPajama dataset following the LLaMA training recipe | Based on the Pythia architecture with slight modifications |
| Falcon | Technology Innovation Institute of UAE | 7B, 40B | Varied | Foundation | Commercial | Efficient training process, using a 2D parallelism strategy combined with ZeRO optimization | Employs FlashAttention and multi-query attention |
| FLAN-T5 | Google | 80M, 250M, 780M, 3B, 11B | Varied | Foundation | Commercial | Trained on 473 datasets, 146 task categories, and 1,836 tasks | Based on the T5 encoder-decoder architecture |
| Stable Beluga 2 (formerly FreeWilly) | Stability AI | 70B | Varied | Fine-tuned LLaMA 2 70B | Non-commercial | Training based on a modified Orca approach, which produced high-quality examples | – |
| MPT | MosaicML | 7B, 30B | Chat, Instruction following, Story writing | Foundation | Commercial | MPT-7B-StoryWriter-65k+ can generate text as long as 84k tokens | Employs FlashAttention |
Whether you’re a deep learning engineer seeking the next tool for your project, a team lead scouting the latest AI trends to stay competitive, or a company executive looking for a bird’s-eye view of the open-source AI landscape, this comprehensive guide is tailored for you.
Read on to discover what’s transforming the open-source landscape in 2023. Explore the potential of these models, understand their diverse applications, and gain the insights needed to make informed decisions in this rapidly evolving field.
LLaMA
The LLaMA model kickstarted an explosion of open-source large language models!
LLaMA is not just a single model; it is a collection of Large Language Models that vary in size, ranging from 7 billion to 65 billion parameters. The available sizes include 6.7B, 13.0B, 32.5B, and 65.2B parameters, each excelling at different tasks, with larger models generally performing better on more complex tasks.
Developed by Meta AI, LLaMA is based on transformer architecture, the standard architecture for language modelling since 2018. It shares similarities with GPT-3 but also incorporates some minor architectural differences. Instead of ReLU activation functions, LLaMA uses SwiGLU activation functions, rotary positional embeddings are used instead of absolute positional embeddings, and root-mean-squared layer normalization replaces standard layer normalization.
The models stand out as they are trained on diverse domains and have been designed to be open-sourced. Researchers can utilize them for various applications, including translation, question answering, text generation, and more. LLaMA’s versatility allows it to be fine-tuned for numerous tasks, making it an ideal foundation model for various AI projects.
The training data for LLaMA is extensive, with the models trained on 1.4 trillion tokens from publicly available data sources. These sources include webpages scraped by CommonCrawl, open-source repositories from GitHub, Wikipedia in multiple languages, public domain books from Project Gutenberg, and questions and answers from Stack Exchange websites. The models’ developers focused on scaling performance by increasing the volume of training data rather than solely increasing the number of parameters.
As for licensing, Meta released LLaMA’s model weights to the research community under a noncommercial license.
To train the LLaMA models, the developers utilized the AdamW optimizer with a cosine learning rate schedule. The final learning rate is 10% of the maximal learning rate. Additionally, the models use a weight decay of 0.1 and gradient clipping of 1.0. The learning rate and batch size are tailored to each model’s size, further optimizing their performance during training.
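That recipe maps onto a standard PyTorch training setup. The sketch below is purely illustrative: the peak learning rate, warmup length, and step count are placeholder assumptions rather than values tied to a specific LLaMA size, but it shows how AdamW, a cosine schedule that decays to 10% of the peak rate, a weight decay of 0.1, and gradient clipping of 1.0 fit together.

```python
import math
import torch

def build_optimizer_and_schedule(model, max_steps, peak_lr=3e-4, warmup_steps=2000):
    """Illustrative LLaMA-style setup: AdamW + cosine decay to 10% of the peak LR."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=peak_lr,
        betas=(0.9, 0.95),
        weight_decay=0.1,    # weight decay of 0.1
    )

    def lr_lambda(step):
        if step < warmup_steps:                      # linear warmup
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return 0.1 + 0.9 * cosine                    # decays from 1.0 down to 0.1 of peak

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Inside the training loop, gradients are clipped to 1.0 before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
```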
LLaMA 2
LLaMA 2 is Meta’s second iteration of the LLaMA model, designed with dialogue use cases in mind. It has undergone extensive fine-tuning to make it comparable to models like ChatGPT, delivering impressive performance across a variety of tasks. LLaMA 2 models come in three sizes: 7 billion, 13 billion, and 70 billion parameters.
LLaMA 2 introduces significant advancements and improvements over its predecessor, Llama 1. It is trained on a new mix of publicly available data, with a pretraining corpus that is 40% larger. The context length of the model is doubled, and it utilizes a grouped-query attention mechanism.
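Grouped-query attention shrinks the key/value cache by letting several query heads share a single key/value head. The toy sketch below is for illustration only; the head counts and dimensions are arbitrary choices, not LLaMA 2's actual configuration, and the causal mask is omitted for brevity.

```python
import torch

def grouped_query_attention(q, k, v):
    """Toy grouped-query attention (no causal mask, illustration only).

    Shapes: q    -> (batch, n_query_heads, seq, head_dim)
            k, v -> (batch, n_kv_heads, seq, head_dim), with n_kv_heads < n_query_heads
    """
    group_size = q.shape[1] // k.shape[1]       # query heads per shared K/V head
    k = k.repeat_interleave(group_size, dim=1)  # expand K/V heads to match the queries
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return scores.softmax(dim=-1) @ v

# Made-up sizes: 32 query heads share 8 key/value heads (4 queries per K/V head).
q = torch.randn(1, 32, 128, 64)
k = torch.randn(1, 8, 128, 64)
v = torch.randn(1, 8, 128, 64)
print(grouped_query_attention(q, k, v).shape)  # torch.Size([1, 32, 128, 64])
```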
Llama 2-Chat is a fine-tuned version optimized for chat-based interactions. LLaMA 2 and Llama 2-Chat have been developed to ensure their output is helpful and safe for human consumption. These auto-regressive models generate text based on input, ideal for assistant-like chat and various natural language generation tasks.
The most significant innovation of the LLaMA 2 models is in their ability to balance safety vs. helpfulness better than most other models, including ChatGPT, on human evaluation benchmarks.
LLaMA 2 is licensed for researchers and commercial entities, adhering to the principles of openness. This enables a wide range of users to leverage the capabilities of LLaMA 2 for various purposes, be it for research, commercial applications, or specialized projects.
The training data for LLaMA 2 is extensive, comprising 2 trillion tokens from publicly available sources. The fine-tuning data includes publicly available instruction datasets and over one million new human-annotated examples. Notably, neither the pretraining nor the fine-tuning datasets include Meta user data, ensuring user privacy and data security.
LLaMA 2 builds on the Llama 1 architecture with modifications to enhance performance, and is trained with the AdamW optimizer on a standard transformer architecture. It uses the same tokenizer as Llama 1, a byte-pair encoding (BPE) algorithm with a vocabulary size of 32k tokens.
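For readers who want to inspect that tokenizer, the Hugging Face transformers library exposes it directly. The snippet assumes access to the gated meta-llama/Llama-2-7b-hf repository, i.e. that you have accepted Meta's license and authenticated with the Hub.

```python
from transformers import AutoTokenizer

# Assumes access to the gated Llama 2 repository has been granted.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

print(tokenizer.vocab_size)  # 32000
print(tokenizer.tokenize("Open-source LLMs in 2023"))
print(tokenizer("Open-source LLMs in 2023")["input_ids"])
```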
The development of Llama 2-Chat took place in two stages. Initially, LLaMA 2 was trained using publicly available online data, and an initial version of Llama 2-Chat was created through supervised fine-tuning. In the second stage, Llama 2-Chat was refined using Reinforcement Learning from Human Feedback (RLHF), which involves rejection sampling and proximal policy optimization (PPO) to enhance its performance in dialogue-based applications.
Overall, LLaMA 2 represents a significant advancement in language modelling, offering transparency, accessibility, and performance improvements that are likely to be widely embraced by the research and commercial communities.
Alpaca
Developed by researchers at Stanford University’s Center for Research on Foundation Models (CRFM), Alpaca is a language model that excels at following instructions. It is fine-tuned from Meta’s LLaMA 7B model and was trained on 52,000 instruction-following demonstrations generated in the self-instruct style, using OpenAI’s text-davinci-003 as a reference. Despite exhibiting behaviour similar to text-davinci-003, Alpaca is remarkably small, and it is inexpensive and accessible to replicate.
The standout feature of the Alpaca model is its robust instruction-following capabilities. With its fine-tuned design and training on many instruction-following demonstrations, it offers a reliable and effective option for tasks requiring precise adherence to instructions.
The model is intended primarily for academic research. However, it is not ready for general use due to inadequate safety measures. It is not available for commercial use because the instruction data used for training Alpaca is based on OpenAI’s text-davinci-003, which has terms of use prohibiting the development of models that compete with OpenAI.
Hugging Face’s training framework was employed for fine-tuning Alpaca, taking advantage of Fully Sharded Data Parallel and mixed precision training. Fine-tuning the 7B LLaMA model took approximately 3 hours using 8 80GB A100s, showcasing its efficiency and potential for rapid development.
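The snippet below is a simplified, hypothetical version of that setup rather than the exact command from the Alpaca repository; it illustrates how FSDP and mixed precision are typically switched on through Hugging Face's TrainingArguments (the hyperparameter values are illustrative assumptions).

```python
from transformers import TrainingArguments

# Simplified, illustrative settings; the real Alpaca run is launched with
# torchrun across 8 GPUs using the project's own train.py script.
training_args = TrainingArguments(
    output_dir="alpaca-7b-finetune",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    bf16=True,                                            # mixed precision training
    fsdp="full_shard auto_wrap",                          # Fully Sharded Data Parallel
    fsdp_transformer_layer_cls_to_wrap="LlamaDecoderLayer",
    logging_steps=10,
)
```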
Overall, Alpaca is a specialized and highly effective language model for tasks requiring accurate and precise instruction. Its development and usage are geared towards research and academic exploration, focusing on maintaining safety and compliance with licensing restrictions.
Vicuna
The Vicuna family of large language models, developed by LMSYS, is renowned for its capability to generate human-like text. These models excel in understanding and providing responses based on user prompts, making them highly useful for various applications such as chatbots and content generation.
Vicuna comes in two sizes: Vicuna-7B and Vicuna-13B. Initial assessments using GPT-4 as a reference indicate that Vicuna-13B has achieved over 90% quality compared to OpenAI ChatGPT and Google Bard. Moreover, it has demonstrated superior performance in over 90% of cases compared to other models like LLaMA and Stanford Alpaca.
One of the significant aspects of the Vicuna model is its reliance on human-generated data. This sets it apart as one of the first open-source large language models trained with such data, generating coherent and creative text. Vicuna represents an improved version of the Alpaca model, based on the transformer architecture but fine-tuned on a dataset of human-generated conversations.
The primary use of Vicuna is intended for research purposes, particularly for researchers and hobbyists in natural language processing, machine learning, and artificial intelligence. Vicuna is meant solely for non-commercial use: users must adhere to the rules set by LLaMA for using the model, respect OpenAI’s terms for using the data it generates, and comply with ShareGPT’s privacy rules.
The Vicuna models were built on the foundation of Meta’s LLaMA models and fine-tuned using approximately 70,000 user-shared conversations collected from ShareGPT.com via public APIs. HTML was converted back to markdown to ensure data quality, and inappropriate or low-quality samples were filtered out. Lengthy conversations were also divided into smaller segments to fit within the model’s maximum context length of 2,048 tokens.
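A rough sketch of that preprocessing might look like the following. The helper names, the markdownify dependency, and the openly available hf-internal-testing/llama-tokenizer checkpoint are assumptions for illustration, not the actual FastChat cleaning scripts.

```python
from markdownify import markdownify  # assumed choice for HTML -> markdown conversion
from transformers import AutoTokenizer

MAX_LEN = 2048  # Vicuna's maximum context length

tokenizer = AutoTokenizer.from_pretrained("hf-internal-testing/llama-tokenizer")

def clean_conversation(conversation):
    """Convert each HTML message back to markdown."""
    return [{"role": m["role"], "text": markdownify(m["html"])} for m in conversation]

def split_conversation(conversation, max_len=MAX_LEN):
    """Split a long conversation into segments that fit the context window."""
    segments, current, current_len = [], [], 0
    for message in conversation:
        n_tokens = len(tokenizer(message["text"])["input_ids"])
        if current and current_len + n_tokens > max_len:
            segments.append(current)
            current, current_len = [], 0
        current.append(message)
        current_len += n_tokens
    if current:
        segments.append(current)
    return segments
```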
In the training process, Vicuna was built upon Stanford’s Alpaca model with several key improvements:
- Multi-turn Conversations: The training loss was adjusted to account for multi-turn conversations, computing the fine-tuning loss only on the chatbot’s responses so the model learns to handle complex, multi-turn dialogues (see the loss-masking sketch after this list).
- Memory Optimizations: The maximum context length was expanded from 512 (as in Alpaca) to 2048, enabling Vicuna to understand longer contexts. Gradient checkpointing and Flash Attention are utilized for memory optimizations to manage the increased GPU memory requirements.
- Cost Reduction via Spot Instance: To mitigate the significant training expenses resulting from the larger dataset and increased sequence length, SkyPilot-managed spot instances were employed. These instances are cheaper and come with auto-recovery for preemptions and auto zone switch, significantly reducing the cost of training.
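The multi-turn adjustment mentioned above boils down to masking out user turns when computing the loss. The sketch below is a simplified illustration of that idea, using the Hugging Face-style convention of setting labels to -100 so those positions are ignored by the cross-entropy loss; the `segments` structure is an assumed preprocessing output.

```python
IGNORE_INDEX = -100  # positions with this label are skipped by PyTorch's cross-entropy loss

def build_labels(input_ids, segments):
    """Mask out user turns so loss is computed only on assistant responses.

    `segments` is a list of (start, end, role) spans over `input_ids`
    (an assumed preprocessing output, for illustration).
    """
    labels = list(input_ids)
    for start, end, role in segments:
        if role != "assistant":
            for i in range(start, end):
                labels[i] = IGNORE_INDEX
    return labels

# Example: tokens 0-9 are a user turn, 10-17 the assistant reply.
input_ids = list(range(18))
labels = build_labels(input_ids, [(0, 10, "user"), (10, 18, "assistant")])
print(labels[:12])  # the first ten entries are -100; loss starts at the assistant span
```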
With its human-like text generation capabilities, openness, and versatility, Vicuna represents a valuable addition to the landscape of large language models, promising exciting possibilities for research and natural language processing applications.
Guanaco
Guanaco is an advanced language model family based on Meta’s LLaMA models, specifically designed to excel in following instructions and performing well in multilingual environments. Built upon the foundation of LLaMA 7B, Guanaco is the result of significant improvements and fine-tuning using the innovative QLoRA (Quantized Low-Rank Adapters) method. This method allows large language models to be fine-tuned on a single GPU.
The Guanaco family includes variants ranging from 7 billion to 65 billion parameters. According to the researchers, the largest Guanaco model achieves 99.3% of ChatGPT’s performance, showcasing its exceptional benchmark results.
Guanaco’s training used the QLoRA method, which quantizes the model to 4-bit precision and adds low-rank adapter (LoRA) weights, significantly reducing memory requirements while maintaining high performance. This approach allows the largest 65-billion-parameter Guanaco model to be fine-tuned on a single GPU with less than 48 gigabytes of memory, down from the more than 780 gigabytes otherwise required, without compromising performance.
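In practice, QLoRA-style fine-tuning is commonly set up with the bitsandbytes and peft libraries. The snippet below is a sketch of that pattern; the base checkpoint and LoRA hyperparameters are illustrative assumptions, not Guanaco's published configuration.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

base_model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",      # illustrative base checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)

# Small low-rank adapters are the only trainable weights.
lora_config = LoraConfig(
    r=64, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # a tiny fraction of the 7B total
```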
One of Guanaco’s distinctive features is its adaptability for extended conversations. It can continue answering questions or discussing topics upon user request, making it highly suitable for chatbot applications. The model also supports Visual Question Answering (VQA), enabling it to interpret and respond to text and visual input queries.
Guanaco expands upon Alpaca’s initial 52,000 training examples by incorporating over 534,530 additional entries spanning various languages and a wide range of linguistic and grammatical tasks. This extensive training contributes to its ability to perform multilingual and multimodal tasks effectively.
The Guanaco model, however, is not licensed for commercial use. Its primary intended usage is for academic research and non-commercial applications. Nevertheless, its versatility and robust performance make it a valuable tool for various natural language processing tasks.
Overall, Guanaco’s combination of efficient fine-tuning, multilingual capabilities, and adaptable conversational skills makes it a significant advancement in the landscape of language models, with potential applications in chatbots, content generation, and private models for mobile hardware.
RedPajama
RedPajama is a collaborative project involving Together, Ontocord.ai, ETH DS3Lab, Stanford CRFM, and Hazy Research, with the mission to create a set of leading, fully open-source language models. The project’s primary objective is to bridge the quality gap between open and closed models, as many powerful foundation models are currently locked behind commercial APIs, limiting research, customization, and usage with sensitive data.
The RedPajama project consists of three key components:
RedPajama Dataset: The RedPajama dataset is an impressive 1.2 trillion token fully-open dataset created following the recipe described in the LLaMA paper. This vast dataset comprises seven data slices from diverse sources, including CommonCrawl, C4, GitHub, arXiv, Books, Wikipedia, and StackExchange. Each data slice undergoes meticulous pre-processing and filtering, ensuring data quality and token count alignment with the numbers reported by Meta AI in the LLaMA paper.
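The dataset is published on the Hugging Face Hub, so individual slices can be streamed without downloading the full 1.2 trillion tokens. The snippet below assumes the togethercomputer/RedPajama-Data-1T repository and its per-slice configurations.

```python
from datasets import load_dataset

# Stream the arXiv slice of RedPajama rather than downloading it in full.
arxiv = load_dataset(
    "togethercomputer/RedPajama-Data-1T",
    "arxiv",
    split="train",
    streaming=True,
    trust_remote_code=True,  # the dataset ships its own loading script
)

for i, example in enumerate(arxiv):
    print(example["text"][:200])
    if i == 2:
        break
```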
RedPajama Base Models: The 3 billion parameter and 7 billion parameter base models form the foundation of the RedPajama models. They were developed based on the Pythia architecture and designed to excel in different tasks. Two notable variations are RedPajama-INCITE-Chat-3B-v1 and RedPajama-INCITE-Instruct-3B-v1, both featuring 3 billion parameters. The RedPajama-INCITE-Chat-3B-v1 model is optimized for conversational AI tasks, adept at generating human-like text in a conversational context. On the other hand, the RedPajama-INCITE-Instruct-3B-v1 model is designed to follow instructions effectively, making it ideal for understanding and executing complex instructions.
RedPajama Instruction Tuning Data and Models: This component focuses on fine-tuning the base models to excel in specific tasks. The project offers variations of the RedPajama-INCITE-Base models, each with distinct characteristics and applications. For example, the RedPajama-INCITE-Chat models are fine-tuned using Dolly 2.0 and Open Assistant data. In contrast, the RedPajama-INCITE-Instruct models are designed for few-shot prompts, eliminating any dataset that overlaps with the HELM benchmark.
The RedPajama models and dataset were released under the permissive Apache 2.0 license, allowing for use in both research and commercial applications.
Falcon
The Falcon family of models, developed by the Technology Innovation Institute, comprises a series of large language models. They were optimized to be effective across various applications, including text generation, summarization, and chatbot functionality.
The Falcon family consists of two base models, Falcon-40B and Falcon-7B, each tailored to specific requirements and use cases. The Falcon-40B model has 40 billion parameters and was trained on the extensive RefinedWeb dataset, a corpus of high-quality, filtered, and deduplicated web data. The Falcon-7B model is a smaller variant with 7 billion parameters, trained on 1,500 billion tokens of RefinedWeb data supplemented with curated corpora to enhance its capabilities.
As causal decoder-only models, the Falcon models predict the next token in a sequence based on the preceding tokens, making them particularly suitable for text generation tasks, including summarization and chatbot functionality. Their architecture builds on the foundation of the GPT-3 model, with several adjustments for better optimization and enhanced performance; for instance, they employ FlashAttention and multi-query attention mechanisms.
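Because the Falcon checkpoints are published on the Hugging Face Hub, they can be used as drop-in causal language models. The short example below assumes the tiiuae/falcon-7b checkpoint and enough GPU memory to load it in bfloat16.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "tiiuae/falcon-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

inputs = tokenizer("The RefinedWeb dataset is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_k=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```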
The versatility and effectiveness of the Falcon models make them suitable for a broad spectrum of applications. They can be employed for research on large language models, and they serve as a strong foundation for further specialization and fine-tuning to cater to specific use cases like summarization, text generation, and chatbot functionality.
The Falcon models are available under the Apache 2.0 license, enabling commercial use without royalties or restrictions.
Falcon-40B was trained on 1,000 billion tokens of RefinedWeb data enhanced with curated corpora, using 384 A100 40GB GPUs over roughly two months. The Falcon-7B model was trained on 1,500 billion tokens of RefinedWeb data on a similar 384-GPU setup in about two weeks, using a 2D parallelism strategy (PP=2, DP=192) combined with ZeRO optimization. This efficient training process results in models that outperform comparable open-source models while using significantly less training compute.
FLAN-T5
FLAN-T5, developed by Google, is an instruction-fine-tuned version of the T5 model. The family includes several variations with different numbers of parameters:
- Flan-T5 small (80M)
- Flan-T5 base (250M)
- Flan-T5 large (780M)
- Flan-T5 XL (3B)
- Flan-T5 XXL (11B)
The architecture of FLAN-T5 is based on the T5 encoder-decoder architecture, in which both the encoder and the decoder are stacks of transformer layers whose self-attention and feed-forward blocks process text in parallel; the depth and width of these stacks scale with model size.
FLAN-T5 performs well on tasks such as Multi-Task Language Understanding and Cross-Lingual Question Answering. It is a powerful tool for various business applications, including text generation, common sense reasoning, question answering, sentiment classification, translation, pronoun resolution, and more. Additionally, it is a valuable resource for research on zero-shot NLP tasks and in-context few-shot learning NLP tasks, like reasoning and question answering. Furthermore, it contributes to advancing fairness and safety research and understanding the limitations of current large language models.
Google open-sourced FLAN-T5 under the Apache license at the end of 2022. This enables developers and researchers to utilize and build upon this powerful language model for various applications.
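Trying one of its zero-shot tasks takes only a few lines with the transformers library; the example below uses the openly available google/flan-t5-base checkpoint.

```python
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-base")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-base")

# Zero-shot instruction: no task-specific fine-tuning required.
prompt = "Translate English to German: How old are you?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```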
During its training, Flan-T5 was exposed to a large corpus of text data in a two-stage process: pre-training and instruction fine-tuning. The pre-training stage follows the T5 recipe, in which the model learns to reconstruct corrupted spans of text from the surrounding context. In the instruction fine-tuning phase, FLAN-T5’s capabilities were refined through specific instructions to enhance its performance on various tasks and languages.
The fine-tuning data for FLAN-T5 is extensive, comprising 473 datasets, 146 task categories, and 1,836 tasks. The fine-tuning process draws on four task mixtures: Muffin, T0-SF, NIV2, and CoT. These mixtures include tasks such as dialog data, program synthesis, arithmetic reasoning, multi-hop reasoning, natural language inference, and more.
The Flan-T5 models are not limited to specific tasks or languages, offering researchers and developers a powerful tool for pushing the boundaries of natural language understanding and generation.
Stable Beluga (Formerly Free Willy)
The Stable Beluga project by Stability AI and its CarperAI lab gave rise to two models, Stable Beluga 1 and Stable Beluga 2, as part of their commitment to providing open access to LLMs. These models were built upon Meta’s Llama models and fine-tuned using new synthetically-generated datasets in standard Alpaca format. The project aimed to bridge the quality gap between open and closed models, allowing researchers and developers to explore and customize these models for diverse natural language processing tasks.
Stable Beluga 1 and Stable Beluga 2 leverage the LLaMA 65B and LLaMA 2 70B foundation models, respectively. Both models have performed well across various benchmarks. Stable Beluga 2 has even outperformed Llama 2 in certain benchmarks.
The Stable Beluga LLMs excel in solving complex problems in specialized fields such as law and mathematics, with a keen focus on delicate linguistic details. They have proven their worth by providing insightful responses to intricate questions and reasoning tasks, making them valuable assets for researchers and specialists in these domains.
As a research experiment, the Stable Beluga models come under a non-commercial license, emphasizing their dedication to promoting open research and accessibility in the AI community. This license ensures that the models are freely available for academic and non-commercial purposes, encouraging collaboration and innovation in natural language processing.
The training process for the Stable Beluga models is based on the Orca approach, similar to Microsoft’s progressive learning methodology. However, the datasets used in the Stable Beluga project differ from the original Orca paper. The team employed Enrico Shippole’s datasets, which included COT Submix Original, NIV2 Submix Original, FLAN 2021 Submix Original, and T0 Submix Original, to prompt language models. The resulting dataset contained 600,000 high-quality examples, about 10% of the size of the original Orca dataset. Using a carefully filtered dataset and removing evaluation benchmarks, the Stable Beluga models were fine-tuned to achieve their exceptional performance.
MPT
The MPT models, developed by MosaicML, are a series of transformer-based language models. Designed for commercial use, these models are open-source and built on a GPT-style, decoder-only architecture, aiming to be more efficient and flexible for various natural language processing tasks.
The MPT family consists of several variations, with MPT-7B and MPT-7B-StoryWriter being two prominent models. The MPT-7B Base is a decoder-only transformer model with 6.7 billion parameters, trained on a large corpus of 1 trillion tokens of text and code curated by MosaicML’s data team. The base model utilizes FlashAttention, and for handling long context lengths, it leverages ALiBi.
The license for MPT-7B is Apache-2.0. However, it is important to note that the base model is not intended for deployment without fine-tuning, and further guardrails and user consent are recommended for human-facing interactions.
MPT-7B-StoryWriter-65k+ is a variant of MPT-7B specifically tailored for reading and writing stories with extremely long context lengths. It’s the result of fine-tuning MPT-7B on a filtered fiction subset of the books3 dataset with a context length of 65k tokens. MPT-7B-StoryWriter-65k+ can generate content reaching as long as 84k tokens on a single node of A100-80GB GPUs. Like MPT-7B, its license is also Apache-2.0.
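Working with those long contexts means overriding the model's configured maximum sequence length at load time. The sketch below follows the pattern shown on the MPT model cards; the max_seq_len value and the GPT-NeoX tokenizer are assumptions taken from that documentation rather than the only valid choices.

```python
import transformers

model_id = "mosaicml/mpt-7b-storywriter"

# MPT ships custom modeling code, so trust_remote_code is required.
config = transformers.AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 83968  # extend the context window beyond the 65k training length

model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    config=config,
    trust_remote_code=True,
)
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
```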
MPT-7B-Chat is designed to serve as a chatbot-like model for dialogue generation, fine-tuned on multiple datasets, including ShareGPT-Vicuna, HC3, Alpaca, Helpful and Harmless, and Evol-Instruct datasets. Its license is CC-By-NC-SA-4.0, limiting its usage to non-commercial purposes.
MPT-7B-Instruct is a model tailored for short-form instruction following, created by fine-tuning MPT-7B on a dataset released by MosaicML, derived from Databricks Dolly-15k and Anthropic’s Helpful and Harmless datasets. Its license is CC-By-SA-3.0.
The training process of MPT-7B utilized 8 A100-80GB GPUs with sharded data parallelism via Fully Sharded Data Parallelism (FSDP) and the LION optimizer, with gradient checkpointing employed to optimize memory usage. The model has 6.7 billion parameters arranged in 32 transformer layers, each with a hidden size of 4096 and 32 attention heads, and a vocabulary of 50,432 tokens; the base model uses a 2,048-token sequence length, which the StoryWriter variant extends to 65,536 tokens via ALiBi.
Overall, the MPT models are a valuable addition to natural language processing. They focus on efficiency, flexibility, and impressive performance in handling long context lengths, making them suitable for diverse language-related tasks and applications.
Wrapping it up
The models we’ve explored in this blog post, from Vicuna and Guanaco to RedPajama, Falcon, and MPT, exemplify the tremendous progress in natural language processing. Each model brings unique strengths and applications, catering to various tasks, from chatbot functionality to text generation, summarization, and beyond.
The advancements LLMs have made so far are only the beginning. The AI community’s relentless pursuit of ever-better benchmark results promises even more remarkable innovations in the weeks, months, and years to come.
We’re excited to see what happens the rest of the year and are looking forward to writing Part 2 of this blog.
See you then!