Running GPU-Accelerated LLMs on a $100 Orange Pi 5

Tools & Engineering

The Engineer

16 Nov 2023 · 3 min read

Discover how machine learning compilation tricks enable running powerful GPU-accelerated large language models on a budget Orange Pi 5, transforming affordable hardware into a potent AI tool.

Apr 20, 2024

Introduction

The world of large language models (LLMs) has been rapidly evolving, but the high computational demands often require expensive hardware. However, recent advancements in machine learning compilation (MLC) have made it possible to run LLMs on affordable embedded devices. In this article, we explore how to achieve GPU-accelerated LLM performance on a $100 Orange Pi 5 with a Mali-G610 GPU. Specifically, we’ll see how MLC techniques can deliver impressive results for models like Llama3-8b, Llama2-7b, and RedPajama-3b.

Technical Overview

What Changed?

The key technical advancement here is the successful deployment of MLC on a Mali GPU. This is significant because:

Cost Efficiency: The Orange Pi 5 costs around $100, making it an affordable option for running LLMs.
Performance: We achieve token generation rates of 2.3 tok/sec for Llama3-8b, 2.5 tok/sec for Llama2-7b, and 5 tok/sec for RedPajama-3b. For the more powerful Orange Pi 5+ with 16GB RAM (under $150), we can even run the larger Llama-2 13b model at 1.5 tok/sec.

How It Works

MLC leverages Apache TVM Unity, a generalizable stack for compiling and optimizing machine learning models across different hardware backends. Here’s a breakdown of the process:

Model Optimization: Reuse existing optimization passes like quantization, fusion, and layout optimization.
Kernel Optimization: Utilize a generic GPU kernel optimization space written in TVM TensorIR and re-target it to Mali GPUs.
Code Generation: Use the OpenCL codegen backend from TVM, tailored for Mali GPUs.
User Interface: Maintain the existing user interface with Python APIs, CLI, and REST APIs.

Step-by-Step Guide

If you want to try this out on your own Orange Pi 5, follow these steps:

Preparation

Setup the Board:
- Follow the instructions here to set up the RK3588 board with the OpenCL driver.
Clone MLC-LLM Repository:
- Clone the MLC-LLM repository from the source.

git clone https://github.com/mlc-ai/mlc-llm.git
cd mlc-llm

Download Model Weights:
- Download the weights for Llama-3-8B-Instruct-q4f16_1-MLC. You can also use Llama-2-7b-chat-hf-q4f16_1 or Llama-2-13b-chat-hf-q4f16_1 (requires a 16GB board).

python scripts/download_weights.py Llama-3-8B-Instruct-q4f16_1-MLC

Running the Model

Install Dependencies:
- Ensure you have the necessary dependencies installed.

pip install -r requirements.txt

Run the Model:
- Use the provided Python script to run the model.

python scripts/run_model.py --model Llama-3-8B-Instruct-q4f16_1-MLC

Benchmarks and Performance

Here are some benchmarks for different models on the Orange Pi 5:

Llama3-8b: 2.3 tok/sec
Llama2-7b: 2.5 tok/sec
RedPajama-3b: 5 tok/sec
Llama-2 13b (Orange Pi 5+ with 16GB): 1.5 tok/sec

Conclusion

The ability to run GPU