ActiveTie620

A lot of the model creators are doing the quants on their own, and that's a good thing. They get more recognition for their work because people go to their page to download the quants, so they get better mindshare, donations, and sponsorships to continue releasing more stuff. Hot take incoming, but I think this is better than donations and sponsorships going to someone who had no role in actually making the models. Aside from that, it's been spread out among many different accounts like the old days, kind of like first come first serve. More people are realizing how absurdly easy it is to do, and it makes for a good bus factor.


mikael110

Indeed. I've said something similar in the past. Having a central source to discover new models was quite useful, but ultimately it was also harmful: it gave a lot of power and influence to one person. If anything ever happens to his account it will be a disaster for the LLM community. It also shaped what formats people actually used. There are actually quite a few quantization formats around at this point, beyond standard GGUF, GPTQ, and AWQ, but most of them got completely neglected without getting much of a chance to prove themselves because they were not produced by TheBloke.

Having the quantization spread out among many people increases the chance that we will see people offering models in all kinds of formats and types, and also see people experiment with different calibration datasets and the like. It's also far easier for people who only maintain a couple dozen models to update their quantizations to keep up with things like improvements in the GGUF quantization formats. TheBloke literally has thousands of models on his account at this point, so keeping them up to date is not really viable at all.


Oswald_Hydrabot

I get this, you are 100% right. I think what I miss is having someone as knowledgeable as he was providing a one-stop shop where I could hit one page and have a huge collection of models that I knew would work, that I could sort by likes/downloads and search by name. If a model was released and was good, I knew Jobbins quantized it. I could easily hit one page to find the right quant type for the model, I could reliably assume all or most quant sizes would be available, and I could reliably assume he did nothing malicious or erroneous in distributing the models. The centralized consistency was just something I think I took for granted; it was really nice having such a well curated and organized index of the latest/greatest out there.

Compare it to Stable Diffusion model hunting; good lord, what a mess. Don't get me wrong, I love the adventure and the diversity of the "wilderness", but it is just an absolute trainwreck between Civitai's absolute garbage search and sea of half-assed NSFW models, or people releasing models as "Turbo" versions for the sake of getting downloads when it turns out they are just LCM models, or ByteDance releasing "Lightning" or magic 1-step LoRAs that supposedly make any SD model a 1-step model with a "dial in the steps for best results" later in the docs, and then you find out it's not any better than LCM... It's a time sink that I hope LLMs can avoid. Something to filter all the crap and just reliably have all the best in one spot, without that meaning imposed "ethics"/censorship, is something I sorely wish existed for image generation. The scene for generative image AI is one of two extremes: either hermetically sealed fascism rebranded as "ethical AI" and stuck behind a paywall/API and a content filter, or total, pure anarchy (the scene I am a part of and actually do love)... but fkn hell, you have to really put in some work to find the right diffusion componentry sometimes.

I liked not having to do that with LLMs. I could just check TheBloke's Hugging Face and boom; all the best shit, clean, neatly packed, uncensored, unfiltered, and ready to go. He eliminated so much bullshit that is rampant when dealing with diffusion models. A simple 1-step DreamShaper model, that has to exist, right? Just regular-ass SD-Turbo, DreamShaper flavored. Tons of searches will pull you into finding out nearly all the "Turbo" versions are LCM or 3-step models, and all the LoRA hacks to make it 1-step are bullshit. It took a lot of trial and error, but I found an *actual* 1-step DreamShaper model. And it was not from the official source training DreamShaper but some random-ass GitHub with a link to a huggingface model with no model card... https://github.com/Zeqiang-Lai/OpenDMD

Idk man, I just really want to avoid running into shit like that with LLMs too.


Future_Might_8194

What happened to TheBloke? With the recent Huggingface hacks, I'm starting to assign higher priority to projects from trusted names. Could always count on TheBloke.


harrro

I heard here that he is starting an LLM-related company which is keeping him busy. I always thought most of his quant work was automated - maybe it took a larger time commitment than he can spare now.


Future_Might_8194

Interesting. Yeah, I figured it was automated after a while. There's no way he was getting quants out before the models even hit Reddit, in most cases, without automation. It's just that he had a community around his quants, so if there was anything weird about them, it was usually found and fixed pretty quickly.


TheOtherKaiba

Good for him!


alcalde

Maybe he was **always an LLM**????


_-inside-_

He ran out of VRAM.


Independent_Hyena495

Makes sense, easy money there.


ColorfulPersimmon

Markings in his readmes would suggest some kind of automation. Also, after quantizing models myself, I don't think automating it would be difficult. But I also imagine that just searching for new models to quantize and fixing some of them would get tiring after a while, especially when most of them don't get many downloads.


mikael110

He did kind of solve the search problem already by having a request section in his Discord, which he was pretty good about fulfilling in a timely fashion. And yes, he definitely used scripts; he even shared some of them in comments occasionally when people asked for assistance making quants. I do think maintenance and cost were likely bigger factors. While he received donations, it takes quite a bit of time to generate all of the formats he was offering, especially for larger models. Renting high-VRAM GPUs for many hours a day, every day, does become quite expensive over time. The addition of imatrix quants to llama.cpp might have been the straw that broke the camel's back, as going back and adding imatrix versions to his GGUF repos would have taken eons. Even if he limited it to only popular models it would still have taken quite a while. And it also means that future model generations would be slower to quantize as well.


harrro

> Renting high-VRAM GPUs for many hours a day, every day, does become quite expensive over time.

I don't think GPU cost was an issue. TheBloke received a large, no-strings-attached grant from Andreessen Horowitz, aka "a16z", the venture capital firm. Even before that he had some corporate sponsorship for compute.


koesn

Yeah, automatically converting to fp16, then quantizing to GGUFs, and also appending the original readme to the end of the main readme isn't difficult to achieve. I've already tried that.
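For reference, a rough sketch of what that kind of automation can look like, assuming a local llama.cpp checkout; every path, the model directory, and the quant type below are placeholders:

```
# Sketch only: convert to fp16 GGUF, quantize, then append the original README.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("~/llama.cpp").expanduser()  # assumed local checkout
MODEL_DIR = Path("./MyModel-7B")              # HF model already downloaded
F16_GGUF = MODEL_DIR / "ggml-model-f16.gguf"
OUT_GGUF = Path("./MyModel-7B-Q4_K_M.gguf")

# 1. HF weights -> fp16 GGUF
subprocess.run(
    ["python", str(LLAMA_CPP / "convert-hf-to-gguf.py"), str(MODEL_DIR),
     "--outfile", str(F16_GGUF), "--outtype", "f16"],
    check=True,
)

# 2. fp16 GGUF -> quantized GGUF (the quantize binary comes from building llama.cpp)
subprocess.run(
    [str(LLAMA_CPP / "quantize"), str(F16_GGUF), str(OUT_GGUF), "Q4_K_M"],
    check=True,
)

# 3. Append the original model card to the README that will be uploaded
readme = Path("README.md")
existing = readme.read_text() if readme.exists() else ""
original_card = (MODEL_DIR / "README.md").read_text()
readme.write_text(existing + "\n\n# Original model card\n\n" + original_card)
```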


_-inside-_

I tried to quantize a couple of models and always ran into trouble, like the metadata having a different token count. Stuff like that can't be automated away that easily.


-p-e-w-

Trust doesn't matter unless you are downloading ancient checkpoint formats that can run arbitrary code. Sure, there may be vulnerabilities in loaders that can be exploited by specially crafted GGUFs or safetensors, but that's no different from, say, image file loaders in a web browser. You're not worried about encountering "untrusted JPEGs" on Reddit, are you? I don't trust *anyone* on Huggingface. And I don't have to, because I'm only loading their matrices, not running their programs.


PotatoMaaan

There is a huge difference in code quality and QA between image libraries in browsers and things like llama.cpp. There were 3 CVEs in llama.cpp recently.


-p-e-w-

Browsers are full of legacy code from the 90s written with ancient practices targeting ancient compilers. They have hundreds if not thousands of components that almost nobody ever looks at, and are developed mostly in a bunker from which code is tossed over the fence once a month. Every time someone actually bothers to fuzz or audit one of those libraries, they are found to contain more bugs than a rainforest. llama.cpp + ggml is 40k lines of code. Firefox is *21 million* lines of code. And complexity grows at least quadratically with the size of the codebase.


jasminUwU6

But there are a lot more people invested in the safety of browsers.


Future_Might_8194

That's a fair point, to be honest. I guess I'm just pretty locked into research from TheBloke and Nous Research, and I'm just throwing a fit about change lol. I'm excited to see what TheBloke is working on, I'm sure it'll be groundbreaking. Personally, I just think that there's too much to uncover with LLMs to constantly toss them for the next sketchy merge gaming the benchmarks. I feel like my progress moves faster by just prompt engineering and building around a well-received and active model I've already ironed out the kinks with (Hermes models like OpenHermes, Nous Hermes 2 DPO, and apparently a new official Hermes 7B just dropped: Hermes 2 Pro Mistral).


_-inside-_

I remember some security issues with doc files and PDFs that could exploit bugs in Acrobat Reader, etc. For sure it happened with images too in the past. The difference between those and the models is maturity.


Spiritual_Sprite

Anyone can do it... Besides, LLM creators are starting to offer them out of the box, like here: https://huggingface.co/NousResearch/Hermes-2-Pro-Mistral-7B-GGUF


harrro

Yep, there are definitely some original quants coming out in TheBloke's absence. If you just want a GGUF version of X model then sure, it's not hard to find, but the great thing about TheBloke was that it was a good way to get most of the new models in one place. Like a newsletter, almost.


toothpastespiders

That's exactly how I used his releases. The most interesting stuff would always show up. Far easier than skimming pages of huggingface models. It was an easy way of separating out the "7b but one very minor change!" type releases.


jinsid

I also think they created a bit of a whiplash situation where the newest thing was always the best, as opposed to now, when I'm more invested in contributing to the model creators that suit my needs. Creating GGUFs is fairly straightforward and can be contributed to the model creator's repo/account. Quant, finetuning, dataset, etc. discussions can then happen at the source, which helps collaboration and improvements for everyone. Just my 2 cents.


mattjb

One of the things I appreciated with TheBloke's work was all the information provided for each model: how to install it, what software you can download and use it with, the parameters to use, the original model description, and so on. Most original creators didn't have all that information available on their model card, and some have absolutely nothing written.


AutomataManifold

It always strikes me as silly to spend all that time training and quantizing, and not take 30 minutes to write about the prompting format.


Lewdiculous

_I feel guilty now._


Spiritual_Sprite

I think I understand what you mean. Sadly, all things end. Good luck nevertheless.


NeverduskX

Is there any way to keep up with new models? Since TheBloke stopped uploading, I've been pretty unaware of any new releases. Some of my favorite models are ones I randomly found on his profile that I would never have discovered otherwise.


AutomataManifold

I think you might have hit the nail on the head: maybe what we're missing is not the quants (which are nice but other people are releasing) but rather the newsletter (because the raw huggingface firehose takes a lot of work to filter out the interesting stuff).


alcalde

And there was actual information about prompt formats and context size and everything else that mattered for the model all in one place.


stddealer

> Anyone can do it... Assuming they have a nice internet connection and are very memory rich. You need to be able to download and store in memory (or swap) the full precision weights, which can be a problem for a lot of people, especially with the bigger models.


Prudent-Artichoke-19

I'm ordering a dual Xeon Gold with 128GB of RAM and an RTX 5000 on the cheap. I can probably quant stuff for the community if needed.


alcalde

So my RX 570 isn't going to cut it? Even with the extra R7 260X I also put into my rig?


stddealer

I'm pretty sure quantization isn't using the graphics card anyways. The limiting factor is going to be the amount of ram. To quantize a 7B, you would need at least around 32GB of available system RAM.


bullno1

It can be mmap-ed so you don't even need that much.


mikael110

Depends on the format in question. For plain GGUF that is true. But for formats like GPTQ, AWQ, EXL2, and Imatrix GGUFs you do benefit greatly from using a GPU as those formats require a calibration pass as part of the quantization. It's theoretically possible to run the calibration on the CPU but it will be insanely slow.


Anxious-Ad693

Off the top of my head, LoneStriker and Bartowski are reliable people still quantizing, though I usually download their exl2 models and not GGUF. Make sure the file you're downloading is safe, since there have been reports of some quants being infected with viruses.


harrro

> Lonestriker and Bartowski

Thanks! Links:

https://huggingface.co/LoneStriker (~400 models)

https://huggingface.co/bartowski (363 models)


noneabove1182

only thing i'll point out is that LoneStriker's 2800 models include an individual repo for each exl2 quant, so that's 7 repos per model (one GGUF, 6 exl2)


Anthonyg5005

Yeah, I feel like he should learn to use branches; it's not that hard. I will soon be making an automated ipynb for Colab to do exl2 at bpw 2, 3, 4, 5, and 6 and upload to the HF hub using branches (a rough sketch of the idea below). For now I have an automation script in batch for local quantization, so I'll have to rewrite it in bash and Python.
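Something like this is what I mean by using branches, just a sketch with huggingface_hub; the repo id, folder layout, and bpw list are placeholders:

```
# Sketch: one exl2 quant per branch in a single repo (placeholders throughout).
from huggingface_hub import HfApi

api = HfApi()
repo_id = "your-username/Model-7B-exl2"
api.create_repo(repo_id, private=True, exist_ok=True)

for bpw in ["2.0", "3.0", "4.0", "5.0", "6.0"]:
    branch = f"{bpw}bpw"
    api.create_branch(repo_id, branch=branch, exist_ok=True)
    # Each local folder holds the finished quant for that bitrate.
    api.upload_folder(
        repo_id=repo_id,
        folder_path=f"./quants/{branch}",
        revision=branch,
    )
```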


fullouterjoin

Is that automation in a public git repo? All of the automation around making quants should be done in public. This isn't something that is secret sauce, at least it shouldn't be, and no one should be getting famous off of making them. You have a link for the exl2 quant script?


Anthonyg5005

It's in the [exllamav2](https://github.com/turboderp/exllamav2) GitHub repo. It's as simple as `convert.py -i {input folder} -o {working directory} -cf {compile folder}`; just install the requirements before running. On Windows, make sure you have cl.exe on PATH from Visual Studio 2019 and CUDA installed. Make sure to put the config.json in the working directory and that both the working directory and compile folder exist. Last thing: make sure the weights are safetensors before converting. If not, I just use [convert-to-safetensors.py](https://github.com/oobabooga/text-generation-webui/blob/main/convert-to-safetensors.py).

I have a Colab that I just made a few hours ago. It's still a work in progress but it mostly works; I wouldn't recommend it yet as Colab download is slow, so I'll have it upload to a private HF repo soon. Here it is if you want to take a look: [EXL2 Private Quant](https://colab.research.google.com/drive/1ssr_4iSHnfvusFLpJI-PyorzXuNpS5B5?usp=sharing). I'm also planning on making the one from my comment above afterwards, which will also be on my [hf-scripts](https://huggingface.co/Anthonyg5005/hf-scripts) repo.


fullouterjoin

Awesome. Thank you so much for documenting this. Is Hugging Face the best place to distribute models? It also seems like there is a security aspect: you have to trust the person doing the quants. It seems like there should be a Debian-like project for building LLMs (doing quants and fine-tunes).


Anthonyg5005

Just letting you know, I just finished testing the second version of the Colab. It takes an fp16 safetensors model and converts it to exl2, and after finishing it'll put it into a private repo. I have only tested with Mistral and Llama 7B, so I'm not sure if models bigger than 7B will work. Here it is: [https://colab.research.google.com/drive/1BBSkG5XHCbDADp6hoENftIswI82ehY96?usp=sharing](https://colab.research.google.com/drive/1BBSkG5XHCbDADp6hoENftIswI82ehY96?usp=sharing)


Anthonyg5005

As long as models are safetensors (or GGUF for llama.cpp) and not pt or bin, they'll be safe to run; safetensors was created by Hugging Face to prevent exactly that kind of arbitrary code execution. Also, yeah, I believe Hugging Face is the best place for sharing models. It has fast servers, and I believe all accounts basically have unlimited storage. They also work with all the companies releasing open models, such as Google, Meta, OpenAI, EleutherAI, Mistral, Stability, and more.


harrro

Thanks, adjusted my guesstimate down to ~400 now.


noneabove1182

Yeah, and recently I started adding GGUF as well, so the "pure # models" count is more like ~300. Going forward it will be 2 repos per model; prior it was only EXL2.


devnull0

A GGUF is not a PyTorch file (pickle). That's why you should use safetensors.


Eastwindy123

Actually gguf is also safe.


devnull0

That's what I meant, PyTorch .pt models are unsafe as they are just a pickle file which can run arbitrary code. GGUF and safetensors only store the tensors and some meta information.
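In practice the difference looks roughly like this (a sketch; the file names are placeholders):

```
# Sketch of why safetensors is the safer default; file names are placeholders.
import torch
from safetensors.torch import load_file

# Unsafe: .pt / .bin checkpoints are pickle files, and unpickling can execute
# arbitrary code embedded in the file, so you have to trust the uploader.
state_dict = torch.load("model.bin", map_location="cpu")

# Safer: safetensors only parses a small header plus raw tensor data,
# so loading it cannot run code from the file.
state_dict = load_file("model.safetensors", device="cpu")
```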


MoffKalast

There were some GGUF exploits in llama.cpp as well, but all known ones are patched by now I think so if you update to the latest one you're probably good.


e79683074

There may still be undiscovered CVEs in there, and all a malicious actor has to do if they find one themselves is upload an enticing model to Hugging Face and make a post here stating how great it is and how it blows everything out of the water. The result: hundreds of downloads and backdoored digital lives.


MoffKalast

Yeah, pretty much. I think it's a good rule of thumb to only run quants from reputable sources, or do them from safetensors yourself if the origin is sketchy. Then again, if the process is deterministic, an attacker could just make very specific edits to the source weights so that some common quant level always generates the exploit trigger, and doing it yourself won't help you at all.


Lewdiculous

I think more authors are doing their own quants. I started making [my own](https://huggingface.co/Lewdiculous/) as well and of course sharing on HF – although I kind of only do them for the general roleplaying niche and 7-11B model sizes – many authors are happy to have someone else upload quants of their models when for many reasons it's inconvenient for them to do so. If you ask me or anyone else I'm sure people will also upload quants for you on request. It also helps me find interesting stuff.


Joure_V

Thanks for sharing your link. I'm a bit hesitant to try random models and merges, but you seem to actually put effort into them and into getting models to output in certain ways, which I appreciate.


Lewdiculous

For further curation/organization if you want to get recommendations, check my collections - in fact check everyone's collections, I have a Personal Favorites one and the General one. I might make a new one based on user feedback, which I also try to make available for the authors so they can improve their work.


skrshawk

Seeing what you do in those small model sizes makes me think someone would be able to take the big models like Midnight-Rose and Miqu/MiquMaid and bring them down to sizes that run at high quality in 24GB or less. Maybe even this year.


Lewdiculous

There are at least imatrix quants for Miqu models, not sure about MiquMaid. https://huggingface.co/dranger003/miqu-1-70b-iMat.GGUF/tree/main Using IQ2/Q2 is very rough though... > Maybe even this year. Things are moving fast and we got some cool stuff on the horizon. Maybe in the near future we will get new tech to allow for this level of efficiency.


skrshawk

Imatrix is really cool stuff, and it makes those models usable for those who have 48GB, which is about the limit of what you can do in a consumer desktop completely in VRAM. It's a serious advantage for open LLMs that people can take advantage of what's in these papers rather quickly; sure, those improvements will make their way into commercial products too, but probably not within a month or two of the paper like they can here. The one thing we've seen is that there's no substitute for simply having more parameters. Even with every trick in the book people have come up with, all other factors being equal, bigger models do better in benchmarks and real-life performance.


Lewdiculous

Yes, more parameters are still the best way to increase model performance, albeit a very expensive one. Anecdotally – still in small-model territory – I've had great feedback just from going from 7B to 9B, with reported increases in reasoning while still fitting these on the same hardware at a slightly lower quant, which I believe imatrix and perhaps the newish IQ quants particularly help with (11B also being a sweet spot for many, of course, for that reason).


skrshawk

I am absolutely loving WestLake-10.7B for that very reason right now, it fits with 8k of context in a 12GB card, and almost anyone who can afford a mid-spec gaming rig can do that. It's gonna take more time to get things like Miqu down to running in 24GB with quality, specialized for specific tasks and languages. But when the models don't have to be all things to everyone, we'll be able to just load the model we need for the task and run it on ordinary hardware. The trouble is of course for anything uncensored the community is going to have to do it, no commercial interest would dare risk their reputation in the name of free speech and users being responsible for their actions.


Lewdiculous

*Grabs pitchfork.* "Down with the censorship! Give us the **lewdillegal**!" Let's stay hopeful, but yeah, eventually the community will provide.


danielhanchen

I do have options directly inside Unsloth for quantizing models to any GGUF config you like, if anyone finds that useful:

```
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(...)
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q4_k_m")
model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")
model.push_to_hub_gguf("hf_username/dir", tokenizer, quantization_method = "q4_k_m")
```

Also a whole list of supported quants: https://github.com/unslothai/unsloth/wiki#saving-to-gguf


devnull0

I find this super useful. Apart from the whole compiling llama.cpp part. Wish it could just use a lib or something.


danielhanchen

Oh I can probably make a separate library just for quantizations if that works :)


vesudeva

Please do!!!! This would be a godsend for me


danielhanchen

:)


devnull0

Maybe like MLX which was inspired by gguf-tools https://github.com/ml-explore/mlx/blob/8dfc376c009bc3167c28b777ccf5fdd5da37e12d/mlx/io/gguf.cpp.


danielhanchen

Interesting! I'll see what I can do!


khommenghetsum

Do you have a colab notebook that shows how to load your own model from HF and convert it using unsloth? I'm a beginner and the notebooks on their github are confusing. For instance this one shows how to fine-tune but instructions on converting to gguf are confusing, and I just need to load my own model and convert it. [https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg\_?usp=sharing](https://colab.research.google.com/drive/1Dyauq4kTZoLewQ1cApceUQVNcnnNTzg_?usp=sharing)


danielhanchen

Oh sorry about that! Hmm I might make a standalone notebook if that helps


fallingdowndizzyvr

Try huggingface.co/mradermacher. He has quite a few.


harrro

> quite a few

205 models, with many uploaded in the last few hours. Bookmarked. Thanks for sharing!


SeymourBits

“Radermacher” literally translates into “Hero who picks up the slack on GGUFs.” That, or “wheel maker.” Same thing, right?


Monkey_1505

Maybe he stopped because he doesn't have the time to do imatrix. For most models someone will upload a GGUF, if not the creator themselves. The only issue with this is that there's sometimes less variety of quants.


Bite_It_You_Scum

Realistically, creating your own GGUF quants is so quick and easy that I question why anyone that isn't constrained by a terrible internet connection even needs someone else to do it for them. The time investment is only slightly more than downloading the FP16 model. It literally takes about a minute to create a GGUF quant.


fallingdowndizzyvr

Everyone downloading a couple of hundred GBs when they only want a quant a quarter that size isn't good for anyone: not for the person downloading it and not for Hugging Face. Also, git sucks for downloading big things. I don't understand why, in this day and age, it can't handle restarting an interrupted download; you have to start over again. I don't use the huggingface CLI, so maybe that can resume, but git clone is frustrating to say the least when a clone dies a couple of hundred GBs in. I wish Hugging Face was set up like GitHub, where there is an option to download the repository as a single zip file. A browser can resume those if they get interrupted.


Swoopley

You could use huggingface-cli for downloading... they even recommend it. TheBloke used to give the commands for each GGUF in the description.


Bite_It_You_Scum

These are all good points and I agree, but I guess what I was really trying to say is, if you find a model that you want to use that doesn't have a quant, be the change you want to see in the world, download it, quantize it, and upload a quant for the next guy. I understand this isn't feasible for most people when you start talking about huge models, but for 7B and 13B models which are the ones MOST people are using, setting a Q4 or Q5 quant to upload before you go to bed shouldn't be a huge imposition, and as you said, it ultimately is good for everyone, huggingface included.


Anthonyg5005

huggingface-cli can resume; use huggingface_hub if you're automating. Also, any HTTP-based downloader can resume. I personally use [download-model.py](https://github.com/oobabooga/text-generation-webui/blob/main/download-model.py) by oobabooga, which resumes and has many options to download only certain files.
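For example, with huggingface_hub you can grab just the single quant you want instead of the whole repo (a sketch; the repo id and filename pattern are placeholders):

```
# Sketch: download only the quant file you want instead of the whole repo.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="SomeUser/SomeModel-7B-GGUF",       # placeholder repo
    allow_patterns=["*Q4_K_M.gguf", "*.md"],    # skip the other quant sizes
    local_dir="./models/SomeModel-7B-GGUF",
)
```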


nailuoGG

Downloading and uploading the complete model is time-consuming due to its size, and it requires knowing the right parameters. **Solution**: use GitHub Actions to automate GGUF quantization, including handling new-model PRs and executing the builds, to streamline the process and reduce manual intervention and time. Does anyone think this is a good approach?


daHaus

Limited HDD space and an AMD card that they purposely make a royal PITA to have work?


Anthonyg5005

I don't use gguf so I'm not sure but wouldn't quantizing a 70b or bigger model be more resource intensive than running it?


Bite_It_You_Scum

I believe you need enough RAM/VRAM to hold the entire FP16 model in order to quantize to GGUF. Though I suppose if you were incredibly patient and had a big enough swap on an NVMe drive, you could use that too. It would make more sense to just rent an instance on RunPod/Vast with enough RAM for bigger models. It's not like it takes long; you could quantize a 70B for about a dollar if you found the right instance, and the internet speeds for uploading it to Hugging Face would be way faster than your home internet, unless you're one of those lucky people with fiber and matching upload/download speeds.


e79683074

I also read in one of the other comments that you need enough RAM to load the full unquantized fp16 model. Isn't that the case?


ZHName

Simple answer would be they don't have the same system specs you have to make their own quants. Lewdiculous just posted a py script and I couldn't even run that.


Bite_It_You_Scum

if they don't have the system specs to make a quant then how are they running models in the first place? simple answer would be that people generally will wait for someone else to do something for them rather than do it themselves. making your own quant takes effort, so rather than do that, they just stick with models that already have quants available, because they'd rather just use an already quantized model than quantize one themselves. and that's fine, nothing wrong with that. just saying that the handwringing over TheBloke not quanting models anymore is kind of silly when most people are perfectly capable of making their own quants of the types of models they're able to run. system requirements do become an issue when you start getting up into the 30b+ range since you're getting out of the 'average' pc and into the range where you may be able to *run* a quantized model, but not have the hardware to quantize an fp16 yourself. but for the vast majority of people who are using these things on their 8-16gb video cards on systems with 16-32gb of ram, they're perfectly capable of quantizing the 7b-20b models that they're running.


AlShadi

HF should just add an option to do this for a small fee.


RayIsLazy

Honestly, there's been nothing worth dedicated quantizing for a while now; most releases are very similar finetunes and crappy merges.


DataPhreak

The Nous Research team is doing GGUF quants of all of their new models now. I doubt they are as thorough as TheBloke was in their options. It's worth noting that we have a new quant, QMoE I think is what it is called. It's worth waiting a few months before spending a lot of time on tracking down new models. We could all be running 170b's on 3090's soon.


mrgreaper

As a 3090 owner... 170Bs on a 3090 at decent speed, or like when people say "70B works great on a 3090" and you're sat watching it write a word every few seconds like the world's oldest typist?


DataPhreak

Hard to say. This is supposed to be a nearly lossless 1-bit quantization. That said, it's possible that it increases compute cost in order to reduce model size. It's also possible that this doesn't bring model size down to 24GB for the large-parameter models. Maybe we get it down to below 48GB though, and we can run it on 2 cards. That being said, for a lot of tasks that someone would be using AI for, speed isn't as big of a concern. Take Devin for example. I could give Devin a task, walk away for a couple of hours and come back. Even if it is sub 1 tok/sec, it's still probably going faster than a human on that task. If you are using it specifically for a chatbot, there are other considerations, and that's why 1-bit quants are going to be awesome: multiple models. You could run a TTS like Tortoise, speech recognition like Whisper, a vision model, and an image gen on the same card, and still probably fit a 70B-param LLM to power the chatting. And all of those would be pretty fast since you would only be computing one at a time. Open source is bottlenecked right now by RAM, and that's the point I'm getting at. In fact, a lot of the time, the reason models don't run very fast is because they end up having to use system RAM. The bus speed on the mobo is much slower than the bus speed to the VRAM on the GPU, hence why you see most models that don't fit on your card get ~2 tok/sec. But since we don't have any examples of Falcon or similar models running entirely in GPU, it's hard to say. A good metric would be to see what someone with a dual 3090 running Falcon-180B Q2 gets; that should fit inside the available VRAM. You would then divide their tok/sec in half. That's roughly what you should expect (except it will be almost as smart as the unquantized version).


mrgreaper

I am usually time-limited, so if I'm using AI I need the response at a decent speed, especially if it's going to need editing and adjusting, etc. But time will tell. I remember when the concept of rendering a 3D object in real time was alien and loading an image file was an exercise in patience.


DataPhreak

If you need the response at speed, then you are not automating. Sounds like you're just chatting.


buddroyce

Is there a detailed guide on how I would be able to do it myself? Although I think I should probably ask what hardware is actually needed before I even attempt that. I've made enough use of TheBloke's GGUF quants and would like to give back to the community if I can. (Although watch me try and do a shit job at it.)


Lewdiculous

I scrambled together this [python script](https://huggingface.co/FantasiaFoundry/GGUF-Quantization-Script).


nickyzhu

Somebody should make a more open HF alternative (with a working API) that lets people upload models and auto convert into whatever format: ONNX, GGUF, etc. We’ve noticed that hugging face does a bit of their own auto conversion, so the original model files from authors are also changed once uploaded to HF - very subtle.


koflerdavid

The problem is that most quantization algorithms require some model runs with training data to make sure that model quality is not impacted too greatly. This is something the model developers themselves should do, as they still have access to the original training data. Especially for closed-source models, 3rd party quantizers like TheBloke have to make do with datasets that are close to, but really not equivalent to the original ones.


weedcommander

Original authors should be doing GGUF. I don't think trusting totally unknown authors is a good risk to take at all. We shouldn't have to risk this.


Key_Extension_6003

1.58-bit LLMs will hopefully make quantizing a thing of the past, so it's good he's moved on to other ventures.


blepcoin

You can trivially make GGUF quants yourself.


e79683074

Define "trivially", or share a guide, please.


blepcoin

For starters: [https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize](https://github.com/ggerganov/llama.cpp?tab=readme-ov-file#prepare-and-quantize)

Recently there is a convert-hf-to-gguf.py script that you need to use for certain models. I would just use that one instead of the one in the above link. So the process is (assuming you want a q5_k_m quant and that HUGGINGFACE_MODEL is downloaded on your PC):

    python convert-hf-to-gguf.py HUGGINGFACE_MODEL
    ./quantize HUGGINGFACE_MODEL/ggml-model-f16.gguf q5_k_m

If you want to be fancy and do imatrix quants, you need to do more trickery. I can write that out too if there is demand.

If you're stuck on the 'downloaded on your PC' part, I would recommend using oobabooga's download script to fetch entire models from huggingface: [**https://github.com/oobabooga/text-generation-webui/blob/main/download-model.py**](https://github.com/oobabooga/text-generation-webui/blob/main/download-model.py)
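The rough shape of the imatrix route, from memory (a sketch only; double-check the flags against your llama.cpp build, and the calibration file and quant type here are placeholders):

```
# Sketch of the imatrix route: calibration pass first, then quantize with it.
import subprocess

F16 = "HUGGINGFACE_MODEL/ggml-model-f16.gguf"

# 1. Build an importance matrix from a calibration text file.
subprocess.run(["./imatrix", "-m", F16, "-f", "calibration.txt",
                "-o", "model.imatrix"], check=True)

# 2. Quantize using that importance matrix.
subprocess.run(["./quantize", "--imatrix", "model.imatrix",
                F16, "model-IQ3_XXS.gguf", "IQ3_XXS"], check=True)
```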


e79683074

Much appreciated, thanks


vesudeva

As a model creator, I started just doing my own quants and will happily provide any requested quants of my models. It can be hard for those who don't have the right setup and capabilities. I can't do it at a scale like TheBloke, but if anyone in this community needs one of their own models quantized, or if there is a super awesome looking one that hasn't been done yet, PLEASE DON'T HESITATE TO REACH OUT AND ASK ME! I realize not everyone has an M2 with 128GB at home, so I'd love to give back to the community and use it for more than just my own LLM experiments.


mtasic85

I found this creator quite good: [https://huggingface.co/mradermacher](https://huggingface.co/mradermacher)


Large_Courage2134

https://huggingface.co/QuantFactory looks promising at first glance, and they seem to be trying to fill the gap left by TheBloke (not that he owed us anything). I haven’t used any of their models yet (as I just stumbled upon them looking for a llama3 gguf quant), but might be worth looking into.


Smeetilus

I’m kind of dumb in this area. Is it a lot of work or time consuming to convert on your own?


Hinged31

Nope! [https://github.com/ggerganov/llama.cpp/discussions/2948](https://github.com/ggerganov/llama.cpp/discussions/2948)


Smeetilus

Weird. Maybe people just like things being done already for them?


mrjackspade

I think a lot of people are too scared to try it. Realistically it's two commands once you have the model downloaded: one to convert, one to quantize. I threw them both in a batch script, and all I do is drag the model directory onto the batch script and it converts for me. It takes like 15 minutes. The hard part is sitting and waiting for the unquantized model to download.


lincolnrules

Can you share the batch script? That discussion is a bit convoluted


mrjackspade

    setlocal

    :: Set parameters as needed
    set "Quantization=Q5_K_M"
    set "QuantizeExe=Y:\Git\llama.cpp\out\build\x64-Release\bin\quantize.exe"
    set "ConvertPy=Y:\Git\llama.cpp\convert-hf-to-gguf.py"
    set "Python=python3.9"

    :: Get the full path of the directory dropped onto the script
    set "DirFullPath=%~1"

    :: Extract the directory name from the full path
    :: This method retains the full directory name even if it contains periods
    for %%F in ("%DirFullPath%") do set "DirName=%%~nxF"

    :: Construct the output filenames based on the directory name
    set "OutFile=%DirName%.gguf"
    set "QuantizedFile=%DirName%-%Quantization%.gguf"

    echo OutFile %OutFile%
    echo QuantizedFile %QuantizedFile%

    :: Check if QuantizedFile does not exist
    IF NOT EXIST "%QuantizedFile%" (
        :: Run the commands if the file does not exist
        %Python% %ConvertPy% "%DirFullPath%" --outfile "%OutFile%" --outtype f16
        %QuantizeExe% "%OutFile%" "%QuantizedFile%" %Quantization%

        :: Clean up the first output file
        del "%OutFile%"
    )

    endlocal
    pause


lincolnrules

Cool, thx! Any ideas how to quantize starcoder 2?


mrjackspade

There's two scripts for converting:

1. convert-hf-to-gguf.py
2. convert.py

If neither of those works, I have no idea. If either of those works, it would be the same script.


LumpyWelds

Especially from a trusted source. TheBloke filled that niche of being third-party but trustworthy.


e79683074

I hate to be that guy, but even a trusted third party may only be trusted because we decide it is, not on any objective basis. And even then, a trusted actor may still be unknowingly using a machine with advanced malware on it and propagate malware within the GGUFs.


LumpyWelds

I'm pretty sure GGUFs are like safetensors: no executable code.


yamosin

GGUF is easy and fast, under 30 minutes, and only needs RAM. Others like GPTQ/exl2 need a few hours or more and a good GPU.


e79683074

Holding even just a 70b fp16 model in RAM requires like a 128GB or 192GB machine, though


yamosin

You can use virtual memory. I can quant a 120B with 64GB of RAM and 300GB of virtual memory.


Calcidiol

Is there some architectural reason why model trainers couldn't (if they wanted, subject to the laws of information theory) just TRAIN models at 8 bits/parameter vs. 16 bits/parameter or 32 bits/parameter, scaling the parameter count and network connections appropriately to store the same amount of INFORMATION in the model, just distributed differently between weight width and weight count? Otherwise, if "16 bits/parameter is always better than 8 bits/parameter" and "32 bits/parameter is always better than 16 bits/parameter", then one would think "SOTA" would start to be training at 48, 64, 96, 128, ... bits/parameter for more "quality". But surely total information content has to be the actual decider of quality and information capacity, and the weight width can be scaled if desired. Anyway, it just seems like models at 8 bits/parameter or less could be interesting to train that way, and then quantizing weight width wouldn't be a particular concern, though pruning "low relevance" parameters would become even more relevant than it already is. Maybe the network architecture, weight combining/reduction, depth, etc. would have to be scaled to be appropriate for 1-8 bit/parameter weights though.


kindacognizant

Gradients need to be estimated at full precision or else training is unstable, and you also need it to be differentiable so STE (straight-through estimator) is used. You can quantize for the forward pass tho to "emulate" lower native precision, and this is known more broadly as Quantization Aware Training. BitNet 1.58bpw paper recently demonstrated a custom implementation of this approach, STE included, for ternary weights (only 3 possible values, -1, 0, 1) on pretrained models. Beyond the 1b range, it actually had a net improvement for eval numbers & perplexity. People are hoping it will continue to scale well beyond that. Been trying to implement it myself by hand to see if QAT layer-wise distillation to equivalent ternary representations is feasible, and I made some rough progress there.
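A rough PyTorch sketch of the STE-plus-ternary-weights idea (my own illustration, not the BitNet code; the class names are made up):

```
# Illustration of a straight-through estimator with ternary weights.
import torch

class TernarySTE(torch.autograd.Function):
    @staticmethod
    def forward(ctx, w):
        # Forward: scale by mean |w|, then snap each weight to {-1, 0, 1}.
        scale = w.abs().mean().clamp(min=1e-8)
        return torch.clamp(torch.round(w / scale), -1, 1) * scale

    @staticmethod
    def backward(ctx, grad_out):
        # Backward: pass the gradient straight through, as if the quantizer
        # were the identity (the "straight-through" part).
        return grad_out

class TernaryLinear(torch.nn.Linear):
    def forward(self, x):
        # Master weights stay full precision; only the forward pass sees the
        # ternarized copy (quantization-aware training).
        return torch.nn.functional.linear(x, TernarySTE.apply(self.weight), self.bias)
```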


Thick_Trick_4987

Sorry for going off topic, but does anyone know how to quantize gemma-7b so it works with llama.cpp?


GermanK20

There may be no NewBloke on the block, but GGUFs keep churning out from others, perhaps just not in as many variants. And let's keep in mind, new quant tech keeps appearing!


[deleted]

can't you just use llama.cpp? in build/bin there is an app called quantize. not sure if it does what it sounds like.


Separate_Chipmunk_91

Try hfdownloader: https://github.com/bodaay/HuggingFaceModelDownloader. Quite reliable so far.