Finetune Llama3 in Two Minutes
In specialized business scenarios, fine-tuning a large model on a dedicated dataset lets it compete with GPT-4 across a range of tasks, while supporting on-premises deployment to safeguard privacy and reducing computational costs. Llama3, currently the strongest open-source foundation model available, is particularly well-suited for natural language processing, machine translation, text generation, question-answering systems, and chatbots.
Even if a foundation model like Llama3 does not initially support Chinese, fine-tuning can add Chinese language support. This tutorial demonstrates how to use GPU resources provided by LooPIN to feed new training data to the model and extend its capabilities.
Preparation
This guide covers setting up the environment, preparing the data, training, deploying, and saving the model. Fine-tuning on a graphics card with 8 GB of VRAM takes less than two minutes, and the fine-tuned model can be quantized to 4-bit, enabling smooth chat-style inference on a CPU.
We will employ the following open-source software libraries:
Unsloth Open Source Fine-Tuning Tool for LLMs
Unsloth on GitHub: https://github.com/unslothai/unsloth
Unsloth integrates model fine-tuning tools that make fine-tuning models such as Mistral, Gemma, and Llama 2-5 times faster while reducing memory usage by 70%.
Chinese Instruction Dataset
Existing datasets for fine-tuning LLMs on Chinese instructions are predominantly in English or do not reflect how Chinese users interact in real-world scenarios.
To address this, research conducted by ten collaborating institutions introduced COIG-CQIA (Chinese Open Instruction Generalist - Quality Is All You Need), a high-quality dataset for Chinese instruction fine-tuning. The data, sourced from Q&A communities, Wikipedia, examination questions, and existing NLP datasets, has undergone rigorous filtering and processing.
We will use 8,000 entries from Baidu Tieba's Ruozhiba forum for fine-tuning: ruozhiba-llama3-tt
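Once the environment below is set up, the data can be pulled with the Hugging Face datasets library. The snippet that follows is a minimal sketch; the repository id is an assumption, so substitute the actual Hugging Face path (or local directory) of the ruozhiba-llama3-tt copy you are using.
from datasets import load_dataset
# NOTE: "your-namespace/ruozhiba-llama3-tt" is a placeholder repository id --
# replace it with the real dataset path before running.
dataset = load_dataset("your-namespace/ruozhiba-llama3-tt", split="train")
print(len(dataset))   # expect roughly 8,000 instruction/response pairs
print(dataset[0])     # inspect one record to confirm the column names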
Initiating Model Training
Setting up the GPU Instance
For detailed interactive guidance, visit: LooPIN Liquidity Pool
1. LooPIN Liquidity Pool:
Navigate to LooPIN's liquidity pool (LooPIN Network Pool) and use $LOOPIN tokens to purchase GPU time. Select an appropriate GPU model, such as the RTX 3080, based on your needs and budget (GPU UserBenchmark is a useful reference for comparing models).
2. Token Exchange for GPU Resources:
- Choose the required amount of $LOOPIN tokens.
- Adjust the number of GPUs using the slider.
- Confirm the transaction amount and complete the purchase.
3. Access Jupyter Notebook:
Post-transaction, access Jupyter Notebook via your remote server in the Server area under Rented Servers. Instance initialization typically requires 2-4 minutes.
4. GPU Verification with nvidia-smi:
In Jupyter Notebook, open a new terminal window and run the nvidia-smi command to verify that the GPU is active.
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15 Driver Version: 550.54.15 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 3080 Off | 00000000:01:00.0 Off | N/A |
| 0% 39C P8 21W / 350W | 12MiB / 12288MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
+-----------------------------------------------------------------------------------------+
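Optionally, you can also confirm from a notebook cell that PyTorch sees the GPU. This is a minimal sketch and assumes PyTorch is already installed on the instance image:
import torch
# Fail fast if the CUDA device is not visible to PyTorch
assert torch.cuda.is_available(), "CUDA device not visible -- check the instance and drivers"
print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 3080
print(round(torch.cuda.get_device_properties(0).total_memory / 1024**3, 1), "GiB of VRAM")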
Installing the Unsloth Open Source Training Framework
1. Select the Appropriate Unsloth Version
Check your CUDA and PyTorch versions, then refer to the official Unsloth readme (Unsloth GitHub) for the matching install command:
import torch; print('cuda:', torch.version.cuda, '\nPytorch:', torch.__version__)
Response:
cuda: 12.1
Pytorch: 2.2.0+cu121
pip install "unsloth[cu121-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"
Note: For newer RTX 30xx or more advanced GPUs, select the "ampere" path.
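If your GPU generation or CUDA toolkit differs, choose the matching tag from the Unsloth readme instead. The commands below are illustrative only; take the exact tag names from the readme for your CUDA/PyTorch combination.
# Illustrative examples -- verify the tags against the Unsloth readme before running
pip install "unsloth[cu121-torch220] @ git+https://github.com/unslothai/unsloth.git"        # pre-Ampere GPU, CUDA 12.1, PyTorch 2.2.0
pip install "unsloth[cu118-ampere-torch220] @ git+https://github.com/unslothai/unsloth.git"  # Ampere+ GPU, CUDA 11.8, PyTorch 2.2.0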
2. Download the Llama-3 8B Base Model from Hugging Face
Quickly load pre-trained models and tokenizers using Unsloth.
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
Await model download completion:
config.json: 100%
1.14k/1.14k [00:00<00:00, 72.1kB/s]
==((====))== Unsloth: Fast Llama patching release 2024.4
\\ /| GPU: NVIDIA GeForce RTX 3080. Max memory: 11.756 GB. Platform = Linux.
O^O/ \_/ \ Pytorch: 2.2.0+cu121. CUDA = 8.6. CUDA Toolkit = 12.1.
\ / Bfloat16 = TRUE. Xformers = 0.0.24. FA = True.
"-____-" Free Apache license: http://github.com/unslothai/unsloth
model.safetensors: 100%
5.70G/5.70G [00:52<00:00, 88.6MB/s]
generation_config.json: 100%