(Image: AI text generation in 4-bit mode, running on an RTX 4070 Ti and a Core i9-12900K)

How to Run a ChatGPT Alternative on Your Local PC

ChatGPT can give some impressive results, and also sometimes some very poor advice. But while it’s free to talk with ChatGPT in theory, you often end up with messages about the system being at capacity, or hitting your maximum number of chats for the day, along with a prompt to subscribe to ChatGPT Plus. Also, all of your queries take place on OpenAI’s servers, which means you need an internet connection and that OpenAI can see what you’re doing.

Fortunately, there are ways to run a ChatGPT-like LLM (large language model) on your local PC, using the power of your GPU. The oobabooga text generation webui might be just what you’re after, so we ran some tests to find out what it could — and couldn’t! — do, which means we also have some benchmarks.

Getting the webui running wasn’t quite as simple as we had hoped, in part due to how fast everything is moving within the LLM space. There are the basic instructions in the readme, the one-click installers, and then multiple guides for how to build and run the LLaMa 4-bit models. We encountered varying degrees of success and failure, but with some help from Nvidia and others, we finally got things working. And then the repository was updated and our instructions broke, but a workaround/fix was posted today. Again, it’s moving fast!

It’s like running Linux and only Linux, and then wondering how to play the latest games. Sometimes you can get it working, other times you’re presented with error messages and compiler warnings that you have no idea how to solve. We’ll provide our version of instructions below for those who want to give this a shot on their own PCs. You may also find some helpful people in the LMSys Discord, who were good about answering some of our questions.

(Image credit: Tom's Hardware)

It might seem obvious, but let’s also just get this out of the way: You’ll need a GPU with a lot of memory, and probably a lot of system memory as well, should you want to run a large language model on your own hardware — it’s right there in the name. A lot of the work to get things running on a single GPU (or a CPU) has focused on reducing the memory requirements.

Using the base models with 16-bit data, for example, the best you can do with an RTX 4090, RTX 3090 Ti, RTX 3090, or Titan RTX — cards that all have 24GB of VRAM — is to run the model with seven billion parameters (LLaMa-7b). That’s a start, but very few home users are likely to have such a graphics card, and it runs quite poorly. Thankfully, there are other options.

Loading the model with 8-bit precision cuts the VRAM requirements in half, meaning you could run LLaMa-7b with many of the best graphics cards — anything with at least 10GB VRAM could potentially suffice. Even better, loading the model with 4-bit precision halves the VRAM requirements yet again, allowing LLaMa-13b to work on 10GB of VRAM. (You’ll also need a decent amount of system memory, 32GB or more most likely — that’s what we used, at least.)
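The back-of-the-envelope math here is straightforward: the VRAM needed just for the weights is roughly the parameter count times the bytes per weight. This sketch assumes weights only — actual usage is higher once you add activations, the attention cache, and framework overhead — but it shows why halving the precision halves the footprint:

```python
# Rough lower-bound VRAM estimate: parameters * bits-per-weight / 8.
# Real-world usage is higher (activations, KV cache, framework overhead).
GIB = 1024 ** 3

def weight_vram_gib(n_params: float, bits: int) -> float:
    """GiB of memory for the model weights alone at a given precision."""
    return n_params * bits / 8 / GIB

for name, n in [("LLaMa-7b", 7e9), ("LLaMa-13b", 13e9), ("LLaMa-30b", 30e9)]:
    fp16, int8, int4 = (weight_vram_gib(n, b) for b in (16, 8, 4))
    print(f"{name}: 16-bit {fp16:.1f} GiB, 8-bit {int8:.1f} GiB, 4-bit {int4:.1f} GiB")
```

That puts LLaMa-13b at roughly 6 GiB of weights in 4-bit mode, which is why it can squeeze onto a 10GB card with room left over for everything else.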

Getting the models isn’t too difficult, at least, but they can be very large. LLaMa-13b, for example, consists of a 36.3 GiB download for the main data, and then another 6.5 GiB for the pre-quantized 4-bit model. Do you have a graphics card with 24GB of VRAM and 64GB of system memory? Then the 30 billion parameter model is only a 75.7 GiB download, and another 15.7 GiB for the 4-bit stuff. There’s even a 65 billion parameter model, in case you have an Nvidia A100 40GB PCIe card handy, along with 128GB of system memory (well, 128GB of memory plus swap space). Hopefully the people downloading these models don’t have a data cap on their internet connection.
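Those 4-bit file sizes line up with the same arithmetic, plus a bit of overhead for the quantization metadata. As a sketch (assuming, hypothetically, that each group of 128 weights stores two extra 16-bit values for its scale and zero-point — actual quantization formats and group sizes vary):

```python
# Estimate a 4-bit quantized model's file size: 0.5 bytes per weight,
# plus assumed per-group metadata (scale + zero-point as fp16 per
# group of 128 weights). Real formats differ in the details.
GIB = 1024 ** 3

def quantized_gib(n_params: float, bits: int = 4, group_size: int = 128) -> float:
    weight_bytes = n_params * bits / 8
    metadata_bytes = (n_params / group_size) * 2 * 2  # 2 fp16 values per group
    return (weight_bytes + metadata_bytes) / GIB

for name, n in [("LLaMa-13b", 13e9), ("LLaMa-30b", 30e9), ("LLaMa-65b", 65e9)]:
    print(f"{name} 4-bit: ~{quantized_gib(n):.1f} GiB")
```

The estimates land close to the quoted downloads (6.5 GiB and 15.7 GiB), and the 65b model at roughly 32 GiB of weights explains why it wants an A100-class card.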

Testing Text Generation Web UI Performance
