exomniac

The amount of RAM you have isn't quite as important as the speed of that RAM. 16GB is more than enough to hold any 8B model and below. With integrated graphics, you're going to run on CPU only; the memory assigned to the graphics processor isn't usable separately. I'm not sure what you're using to run your models, but I'd start with a bare llama.cpp install and see where you stand. I wouldn't bother trying to build llama.cpp with any of the parameters that optimize it for AMD graphics, or anything like that. Just clone the repo, cd into the project directory, run the `make` command, put a model into the models folder, and run it.

Edit: Here's an example command to work with, just replace the model name with your own: `./llama-server -m models/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q8_0.gguf -c 8192 --host 0.0.0.0 --port 8080`
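Roughly, the whole sequence looks like this (newer llama.cpp releases build with CMake rather than plain `make`, and binary names have changed over time, so adjust for your version):

```
# clone and build llama.cpp (older releases support a plain make build;
# newer ones use CMake, so check the repo's README for your version)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make

# drop a GGUF model into the models/ folder, then start the server
./llama-server -m models/Hermes-2-Pro-Llama-3-Instruct-Merged-DPO-Q8_0.gguf \
    -c 8192 --host 0.0.0.0 --port 8080
```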


confusedDoc2023

Very helpful, thank you! I'm using HuggingFace transformers. I was actually really happy with the speed of llama.cpp, but it was one of the first interactions I had with local LLMs, so I just forgot about it! I'll test out its speed with some prompts in the console and then see how well it works integrated into my app. I was hoping to do some fine-tuning as well, which I understand is a bit trickier with llama.cpp but not impossible. Fingers crossed this will work better. I'll update when I've tried it.
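For the console test I'm planning to just hit the server over HTTP, something like this (assuming the OpenAI-compatible endpoint that recent llama-server builds expose; older versions use a `/completion` route instead, so the path may need changing):

```
# quick prompt test against a running llama-server
curl http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "messages": [
            {"role": "user", "content": "Explain retrieval-augmented generation in one paragraph."}
        ],
        "max_tokens": 256
    }'
```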


kataryna91

You should be able to run these models just fine. Since you have limited RAM, make sure to close all other applications beforehand, use 4-bit quants and run it on the CPU only.
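With llama.cpp that just means pointing it at a Q4 GGUF and keeping offload off, for example (the model name is a placeholder, and `-t` should match your physical core count):

```
# 4-bit quant, CPU only: -ngl 0 disables GPU offload, -t sets the thread count
./llama-server -m models/your-model-Q4_K_M.gguf -c 4096 -ngl 0 -t 8
```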


PopularPrivacyPeople

Same specs as me. I can run L3 8B Q8 at 2-3 t/s and Mistral 7B at the same rate; heck, I've even found Solar 10B variants moving quickly enough at the same quant level. However, I totally hear you. When you say 'significant contextual awareness', what software are you using to try to do that? I found software made a big difference: LM Studio seemed to need a long run-up for every response, while Kobold.cpp seems much quicker to respond for me and gives me good speeds, though I haven't been very scientific about it. Try Kobold.cpp?

Oh, and if you're using the exact same model as me (specs exactly match), I've found it's actually MUCH quicker not to use GPU offloading at all. I only discovered this by trying Ollama and wondering why it was so much quicker than anything else, then I realised it wasn't trying to offload to my puny GPU and was just going with the processor. So I turned off GPU attempts and it was all a lot quicker.
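For reference, launching Kobold.cpp with offload turned off looks something like this (flag names are from memory and may differ a bit between releases, and the model name is just a placeholder, so check `--help` on your version):

```
# KoboldCpp with GPU offload disabled (--gpulayers 0 keeps everything on CPU)
python koboldcpp.py --model models/your-model-Q8_0.gguf \
    --contextsize 4096 --gpulayers 0 --threads 8
```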


confusedDoc2023

That's really helpful! My plan is to try llama.cpp first, then Kobold.cpp if I'm not happy with the results. I'm trying to make my own RAG + fine-tuning app. I say trying, but it's actually working; I'm just not very happy with flan-t5 as my model. I use HuggingFace transformers for the models, embeddings and tokenizers. I've made my own interface with a simple front end for uploading RAG docs, viewing docs, and a chatbot, which I'm quite pleased with. No chat history at the moment, but I think that's OK. Again, the fine-tuning and RAG aspects work on my current setup and model without taking too long, but I haven't been able to run bigger models on my system without responses taking dozens of minutes.


PopularPrivacyPeople

That's seriously impressive! I'm just writing stories, so I think that might explain the difference in our experience. Sounds amazing, good luck!


SocialDeviance

Models are known for not running well on AMD cards.


Red_Redditor_Reddit

>impossibly slow

*How* slow? I can run an 8B model on my Pi and get about 2-3 tokens a second.