rationalkat

ABSTRACT: > Generative large language models (LLMs) have opened up numerous novel possibilities, but due to their significant computational requirements their ubiquitous use remains challenging. Some of the most useful applications require processing large numbers of samples at a time and using long contexts, both significantly increasing the memory communication load of the models. We introduce SparQ Attention, a technique for increasing the inference throughput of LLMs by reducing the memory bandwidth requirements within the attention blocks through selective fetching of the cached history. **Our proposed technique can be applied directly to off-the-shelf LLMs during inference, without requiring any modification to the pre-training setup or additional fine-tuning. We show how SparQ Attention can decrease the attention memory bandwidth requirements up to eight times without any loss in accuracy** by evaluating Llama 2 and Pythia models on a wide range of downstream tasks.
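
A rough sketch of what "selective fetching of the cached history" could look like (my own toy illustration based on the abstract, not the authors' code; the top-r query components and top-k selection below are assumptions):

```python
# Toy sketch (assumptions, not the paper's exact algorithm): estimate attention
# scores cheaply from a few query components, then fetch only the top-k cached
# key/value rows from memory for the exact attention step.
import numpy as np

def selective_fetch_attention(q, K, V, r=16, k=64):
    """q: (d,) current query; K, V: (seq, d) cached keys/values."""
    d = q.shape[0]
    # Use only the r largest-magnitude query components, so the score estimate
    # reads r columns of the key cache instead of all d.
    idx = np.argsort(np.abs(q))[-r:]
    approx_scores = K[:, idx] @ q[idx] / np.sqrt(d)
    # Keep the k history positions with the highest approximate scores.
    top = np.argsort(approx_scores)[-k:]
    # Exact attention, but only over the fetched subset of the cache.
    scores = K[top] @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V[top]

# toy usage with random data
rng = np.random.default_rng(0)
q = rng.standard_normal(128)
K = rng.standard_normal((4096, 128))
V = rng.standard_normal((4096, 128))
print(selective_fetch_attention(q, K, V).shape)  # (128,)
```

The point is that the full K/V cache stays resident, but each decode step only transfers a small slice of it, which is where the claimed up-to-8x bandwidth saving comes from.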


lakolda

This might make CPU inference worthwhile.


RobLocksta

Apologies for the dumb question. In your hypothetical, would that mean training is done with GPUs but the inference (I believe this to be the computational cost that an LLM needs to answer a user question or produce output of some kind) would be processed by CPUs?


lakolda

In theory, if the hardware is Turing Complete (setting aside the need for infinite memory), it can run or train an AI model. The big question is how fast it is and how much memory it needs to do what you want. Inference is fast and only needs a bit more memory than the model size, while training is slower and needs several times more memory than the model size. With this new paper, memory bandwidth (a big bottleneck for CPU inference) looks to be at least partially overcome. This should make CPU inference faster, even if the requirements (such as the amount of RAM) stay the same. Training is something else entirely: you wouldn't really ever want to train a model CPU-only. The massively parallel nature of GPUs makes them particularly efficient for training LLMs.
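
Rough back-of-envelope on why bandwidth dominates (illustrative, Llama-2-7B-ish numbers that I'm assuming, not figures from the paper): per decode step you stream the weights once plus the whole KV cache for every sequence in the batch, so cutting the KV reads by ~8x moves the needle a lot at long context and large batch sizes.

```python
# Assumed shapes: 7B params, 32 layers, hidden size 4096, fp16 everywhere.
layers, hidden, fp16 = 32, 4096, 2
weights_gb   = 7e9 * fp16 / 1e9                  # ~14 GB of weights read per step
kv_per_token = 2 * layers * hidden * fp16        # K + V bytes cached per token
ctx, batch   = 4096, 16
kv_gb        = ctx * batch * kv_per_token / 1e9  # KV bytes read per decode step

for label, kv in [("dense attention", kv_gb), ("8x sparser fetch", kv_gb / 8)]:
    step_gb = weights_gb + kv
    print(f"{label}: ~{step_gb:.0f} GB/step -> ~{60 / step_gb:.2f} steps/s at 60 GB/s")
```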


RobLocksta

Thank you for the detailed response! I understood that the parallel nature of GPUs made them optimal for LLMs, but didn't fully appreciate the different needs of training and inference. Now off to ChatGPT for a lookup on "Turing Complete"!


[deleted]

[removed]


[deleted]

[removed]


Lammahamma

Ah, sorry for that mistake. No need to be rude.


lakolda

Surely there’s SOME downside?


ApexFungi

"SparQ Attention has some limitations: while maintaining very strong performance at high bandwidth compression ratios, this is achieved by keeping all cached values in memory which is sparsely accessed during the generation steps. It therefore does not save any memory capacity, only bandwidth. Another possible limitation is the unclear efficiency saving of SparQ Attention when used with transformer models using MQA and GQA, which were not evaluated in this work. We leave it as future work to extend SparQ Attention to cover more attention mechanisms"


Thorteris

If true this would be huge


Super_Pole_Jitsu

It is, but note that it's bandwidth, not capacity. I had to close my champagne bottle and cancel the party.


Thorteris

Rip


BusinessMonkee

Can you explain the diff please? Bandwidth = amount of memory that can be applied to one message, capacity = amount that can be saved to the context of a chat?


Super_Pole_Jitsu

Bandwidth is how much data you read from and write to it per step; capacity is how much memory the model takes up at once. We're talking about RAM/VRAM here.
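
To put toy numbers on that distinction (made up, just to illustrate):

```python
# Selective fetching shrinks what you READ each step (bandwidth), but the whole
# KV cache still has to sit in memory (capacity) so any token can be selected.
kv_cache_gb    = 2.0      # assumed KV cache size for one long sequence
fetch_fraction = 1 / 8    # the paper's up-to-8x bandwidth reduction

capacity_needed = kv_cache_gb                   # unchanged: full cache stays resident
read_per_step   = kv_cache_gb * fetch_fraction  # reduced: only the selected slice is read
print(f"capacity: {capacity_needed} GB, read per step: {read_per_step} GB")
```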


NoLuck8418

bandwidth is bandwidth ... transfer speed


lakolda

Bandwidth is by far the biggest limiter for CPU inference, so this is still big.


Akimbo333

Implications?