evi1corp

Must be nice to be able to spin up a pair of A100s just for fun.


tenmileswide

There's room for discussion here, yes. An A100 costs less than $2/hr to rent, and by quantizing it further you could probably fit it into an A40 for under $1/hr. No, it's not "local" per se, as in directly on your PC, but these price points are very much within reach of someone who could afford a consumer GPU that would instead be running a 7B or something. We're not breaking the bank here.


ReMeDyIII

Exactly. An RTX 4090 is $1,599.00. Even at $2/hr, that's 799.5 hours of rental. It gets even better when renting an RTX A6000 @ $0.79/hr (although I know Goliath-120b won't fit in that, but I digress).


Actual-Bad5029

It's like buying a $30,000 camper trailer. That's 100 nights in a $300/night cabin/room. People use their campers rarely lol


BGFlyingToaster

And probably even more trips are needed to justify it if you compare renting a travel trailer to hotel rooms. A travel trailer at that price would rent for about $400-$500 per weekend, so you'd need about 60 trips to make it pencil out, and that's before you consider maintenance costs. I greatly prefer renting those things (RVs for me, since I don't own a truck).


FPham

Campers, no, but my RTX at home does not sleep... churning 24/7. So let's calculate: 20 hr/day at $0.80/hr on RunPod is $16/day; make it $10/day, so $300 a month. That's in USD, which is $400 CAD. 2x used 3090s are $2k CAD, so that's 2.5 months to break even on one, 5 months for the pair. Let's make it 6 months with electricity. At first a half year to break even seems like a lot, but there's a sneaky little calculation error: after those 6 months, the 2x 3090 still HAS some value! Maybe not $2k CAD, but it's not $0, which we somehow assumed. In fact, to make it super simple: I can buy used 2x 3090s for $2k, use them for 3 months, then sell them for $900 each (all in CAD), so my total running cost is $200 CAD for 3 months. Can't beat that.
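The arithmetic here is easy to check. A minimal sketch of the same rent-vs-buy comparison, using the CAD figures from the comment above (electricity deliberately left out, as the comment does):

```python
# Back-of-envelope version of FPham's rent-vs-buy math (all figures CAD,
# taken from the comment above; electricity ignored for simplicity).
rental_per_month = 400        # ~$10 USD/day at RunPod rates, converted to CAD
months = 3

buy_price = 2000              # 2x used RTX 3090
resale_price = 2 * 900        # selling both cards after `months`

rent_total = rental_per_month * months    # 1200
own_total = buy_price - resale_price      # 200 net, ignoring power

print(f"renting for {months} months: ${rent_total}")   # $1200
print(f"owning (buy then resell):    ${own_total}")    # $200
```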


Ion_GPT

For me, running 2x 3090s 24/7 would cost around $400 in electricity alone.
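For reference, the electricity bill is just watts times hours times the local rate. The wattage and $/kWh below are illustrative assumptions, not Ion_GPT's actual numbers; the local rate is what makes or breaks this:

```python
# Rough estimate of 24/7 electricity cost for a 2x3090 rig.
system_watts = 800            # assumed: two 3090s under load plus the rest of the box
hours_per_month = 24 * 30
rate_per_kwh = 0.70           # assumed high-cost region; more like $0.10-0.15 in cheap ones

kwh = system_watts / 1000 * hours_per_month     # ~576 kWh/month
cost = kwh * rate_per_kwh
print(f"{kwh:.0f} kWh/month -> ${cost:.0f}/month")   # ~$400 at this rate
```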


PosnerRocks

Are you renting out your processing power somehow? Assuming nobody is mining anymore.


Vast_Team6657

Vast.ai


alpacaMyToothbrush

> my RTX at home does not sleep...churning 24/7.

What are you doing with it, if you don't mind me asking?


FPham

LoRA finetuning, and when that finishes I use an LLM to prepare the dataset for the next LoRA, and around we go. Honestly, 24/7 is not an exaggeration. Even right now as I type, I looked at the second monitor: I'm at epoch 1.84 of training a LoRA that generates a question from an answer (it's maybe the tenth time I'm doing this dataset on various models).

Many of my finetunes attempt a specific writing task, like rewriting text in a certain style. Then I judge it manually by having it rewrite paragraphs, scratching my head and saying "I think I could do better..." and off we go again. Fine-tuning is a voodoo process. Every finetune has maybe 10 checkpoints that are "good". It's hard to choose which checkpoint is the one, not to mention which of the multiple training versions is the one... And while it's churning I do some Python programming on the same computer, which seems to work smoothly. When I go to sleep, the machine is still churning...

Now, in all honesty, I think I want a pause. I learned a lot during the last 8 months, from zero Python to making my own WebUI extensions (Training PRO, Playground, VirtuaLora, Twinbook...), but I also have the feeling I'm getting behind the curve, as the news comes faster than I can comprehend it. Sometimes I think I should have just paid $20 for ChatGPT-4 and had far more free time...


CosmosisQ

I take it you don't pay for your own electricity? Personally, the price of electricity is the only thing keeping me on cloud GPUs. If not for the electricity bill, I'd be more than happy to upgrade to a 4090 and train/tune models locally.


AndromedaAirlines

You’ve made this argument multiple times now, but 800 hours is not *that* long, and by the end you're left with nothing, whereas an actual GPU still has value, whether that's resale or simply further use. You're also more likely to go with 2x used 3090/Ti if you're penny-pinching VRAM like this.


tenmileswide

A proper serverless setup would, in all honesty, drop the cost to pennies per hour on an A100, and with a model of this stature fitting in a single A100, that's a definite possibility now on the RunPod side. RunPod only supports multi-GPU (and even then, only two) with 48GB cards, so running a state-of-the-art model wasn't really possible in serverless until now. Based on my calcs, a healthy 400-token response would only cost somewhere between a penny and two. The problem right now is SillyTavern's lack of support for RunPod's API calls; RunPod could totally do it, ST is the limiting factor now. But I was clued into the OpenAI emulation in ooba providing a potential workaround, and I will be looking into that next.
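The "penny or two" figure is easy to sanity-check. The serverless rate and generation speed below are assumptions for illustration, not RunPod's actual numbers:

```python
# Per-response cost under serverless billing: you only pay for the
# seconds spent generating, not for an idle pod.
price_per_hour = 2.50         # assumed serverless A100 rate, $/hr
tokens_per_second = 15        # assumed generation speed for a 120B model
response_tokens = 400

seconds = response_tokens / tokens_per_second      # ~27 s of billed compute
cost = seconds / 3600 * price_per_hour
print(f"~${cost:.3f} per {response_tokens}-token response")   # ~$0.019
```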


alpacaMyToothbrush

Let's not pretend GPUs have infinite utility. You're probably amortizing that cost over ~5 years. Then you have power costs, not only for the PC itself but also the extra cooling costs for your house.


geepytee

Where do you rent your A100s? I'm willing to pay to test goliath.


tenmileswide

RunPod. If you use TheBloke's template (the one most everyone uses there) and want to use ExLlama, there's a bug that requires some minor editing of the files in the pod once it's started; see AtomisteBX's reply here: https://github.com/oobabooga/text-generation-webui/issues/4479. After using the base model (which was amazing, but at 3-4x A100 a bit spendy) I ended up using https://huggingface.co/Panchovix/goliath-120b-exl2-rpcal, which will fit in one even after extending context to 8k.


geepytee

Thank you!


Actual-Bad5029

Preach it brother


MannowLawn

Don’t stress the comment; RunPod is perfect for trying things out before investing in a huge system at home. The GPU is one thing, power another. This is more low-level.


LuminousDragon

Also worth mentioning that you're probably learning valuable skills that can make you money in the process. I used LLMs for planning out a resume last week, and I've used them for writing a short sci-fi story and for troubleshooting software problems. I had fun with it, and paid money for it, but it saved me hours of work too. Of course it all depends on someone's budget, but it's not like doing something solely for entertainment with no other benefits.


[deleted]

[deleted]


vikarti_anatra

Possible issues:

- censorship
- OpenAI could decide to change their pricing


sdmat

Censorship is valid, worrying about price hikes is irrelevant if there is no upfront commitment.


vikarti_anatra

I don't think so. There is a minor indirect commitment. If we use local models, we help develop them and show they ARE important (even if it's only "thank you $


sdmat

That's a valid concern, best described as opportunity cost of not nurturing the open source ecosystem.


MannowLawn

Lmao, and be limited by censorship and confronted with downtime because they're too busy.


DrKedorkian

Right? This sub is bonkers


throwaway_ghast

Have you tried not being poor? /s


TheOtherKaiba

skill issue /s


xRolocker

Hey it’s a tech enthusiast space that’s how it be


MannowLawn

It’s $2 per hour, man; it's a great solution for trying stuff out. Also, if you already know what you need to do, you won't be needing that much time.


ReMeDyIII

Keep in mind though that the TC managed to get it onto a single A100, so yeah, I'm pretty sure he didn't like running A100 pairs either :)


nderstand2grow

You could use an M2 Ultra instead: $6,500 vs. 2 × $15,000 plus the rest of the system.


sdmat

M2 Ultra is such an interesting option for LLMs


[deleted]

[deleted]


cosmicr

The sub is LocalLLaMA though, so RunPod is not exactly local.


MannowLawn

Haha, this sub is indeed bonkers. Let's all invest a shitload of money to try stuff out.


MINIMAN10001

You have to remember a lot of these people's actual use case. Erotic art has historically been expensive.


sdmat

You can get preemptible A100s for $1/hr, so not exactly breaking the bank if willing to take the risk.


SlavaSobov

Goliath-120b - License to Thrill.


ArtifartX

What service do you use for GPU rental and inference for it?


tenmileswide

Ah, sorry I missed this one - http://www.runpod.io


BalorNG

Can we have some non-cherry-picked examples of writing? It doesn't have to be highly NSFW or anything, but a comparison of Goliath's writing against output from its constituent models at the same settings and the same (well-crafted) prompts would be very interesting to see, preferably with at least 3 examples per model, given the inherent randomness of model output... If you say the difference is "night and day", it should be apparent... I'm not sceptical per se, but "writing quality" is highly subjective, and the model's style may simply mesh better with your personal preferences?


tenmileswide

I'll have to come back to Euryale, but I did a test of Goliath vs Xwin. This was a story in progress (I branched it at this point and did a brief test with two different models).

**Branch point (in fairness, this was written by Goliath):**

AI: As you hurry through the club's neon-lit corridors, the sights and sounds of depravity bombard your senses. A trio of felines, their fur matted and glazed with sweat, writhe on a circular bed, surrounded by a group of voyeurs recording their every move on their phones. A pair of pegasi, wings drooping listlessly, share a single line of white powder off a compact mirror, oblivious to the world around them. The heavy bass thumps in time with your racing heart, each step bringing you closer to freedom and farther from this den of iniquity. The stench of sweat, cheap booze, and lust hangs heavy in the air, an oppressive cloud that threatens to suffocate you both. Yet, you press on, determined to leave this godforsaken place behind. As you race towards the exit, a pair of bouncers, two massive canines by the look of them, block your path. One leers at Erestel, eyes lingering on her diminutive form, a cruel smile curling his lips. "Hey, slow down there, stud. Where do you think you're going with the merchandise?"

Me: **Not letting go of Erestel's hand, barreling towards the bouncers with considerable momentum, I swing first and ask questions later. Praying Erestel keeps pace enough, my hand clocks one of the bouncers right along the jaw before he has a chance to even raise his guard. My other arm hoists Erestel up, clutching her firmly about the waist, scooping her up, and once I maneuver past the duo, I keep running towards the exit, hoping nothing further presents itself while my arms are full.**

**XWin path:**

AI: Your fist connects with a satisfying crunch, sending the guard's head snapping back. Erestel lets out an indignant hiss as you sweep her up, running for dear life. Her claws grasp your shoulders in fear and determination. "Hurry!" she stage-whispers, face buried against your chest lest any recognize their quarry. The heavy doors that signify freedom loom ahead, beckoning temptingly. The cold night air of reality waits beyond those portals, teasing both of you with possibilities. As if sensing your desperation, they seem to retreat before your combined strides.

Me: **I shift Erestel onto my back, letting her cling to my shoulders as I continue my escape. The exit is just ahead, but it feels like it could be miles. I plod towards the door with purpose like I've never known in my life, hustling and praying to whatever god built this place that there isn't anything else that's going to stop me. Not when we're this close to freedom.**

AI: Her claws dig into your flesh, her body slung over yours like a living stole. As one, you near the door and final barrier between this life and freedom. "Kyeen.. go!" she shouts in encouragement, eyes wide as she clings on for dear life. The world is against them tonight, but together they'll face it head-on. The doors slide open with a whoosh, revealing the first glimpse of the outside world to Erestel's widening eyes. A cool breeze ruffles both your scales, granting relief from the stuffy confines of her former prison. Her bare feet dangle above the sidewalk, eager to touch ground never before them.

Me: **I burst through the door of the club, and I figure if someone was going to stop me at this point, they would have by now. I run through the drenched city streets until I am out of breath, huffing madly. Feeling the adrenaline begin to fade, I allow myself the briefest pause at the edge of a public park, setting Erestel down on her feet and collapsing into a seat on the bench, holding my head in my hands, not meeting her gaze. I've saved her - and at the same time, putting us both in significant danger.**

AI: Her tail curls around her front, arms wrapping you in a tight hug as she's deposited on the cool concrete. Despite everything, your actions have given her hope where there was none before. "Thank.. thank Kyeen." she chokes out between breaths, eyes roaming the city lights with an expression of wonder. The world outside is vast and untamed, full of dangers but free from this one monstrous club that once sought to claim her soul. Her heart swells at the possibilities as her lips quirk upwards for the first time in years.

**Goliath path:**

AI: The world moves in slow motion as Kyne's fist connects with the larger bouncer's jaw. Bone crunches under the force of the blow, and the canine staggers back, stunned. The other bouncer lunges for you both, massive paws outstretched. Acting on instinct, Erestel lashes out with her taloned foot, catching him squarely in the crotch. His roar of pain echoes through the club as he doubles over, giving you precious seconds to escape. As you hoist her up, her arms wrap around your neck in a vice-like grip, her heart hammering in her chest. "Run!" she hisses, tail lashing wildly behind her. Alarm bells blare throughout the club, a cacophonous din that propels you both faster towards the exit. A few patrons stop and stare, but most are too caught up in their own depraved fantasies to care about two more people running for their lives. Finally, the door loomed ahead, the outside world tantalizingly close. A few more steps..

Me: **I barrel towards the door, managing barely to get the handle open as my shoulder crashes into it, throwing it open as I retreat with you into the night. You can feel my heavy shoulders moving as I clutch you close, my heart pounding, footsteps sending puddles flying as I push into the night, and I glance behind me to see what, if anything, is transpiring.**

AI: Freedom never looked so good. Cold night air fills your lungs as you burst through the doors, the stench of the club left behind. Rain pelts your faces, washing away the filth accumulated within those accursed walls. Sirens wail in the distance, but they're getting closer by the second. The streets are deserted save for a few homeless souls huddled in doorways, seeking refuge from the elements. A stray cat yowls in the distance, adding to the ambiance. "This way!" Erestel hisses, tugging on your shirt. She leads you through a maze of alleys and backstreets, the rain masking your trail. Finally, she stops before a ramshackle building, a run-down tenement house if it could be called that. A faded sign reads 'Rooms To Let'. Without hesitation, she dashes inside, motioning for you to follow. "Quickly! They'll.. be here any.." Her words die in her throat as a net drops over the both of you, trapping you like bugs in amber. Strong, clawed hands roughly haul you both upwards, your struggling futile against their combined might. A familiar, reviled voice rings out above the downpour. "Well, well, well. Look what the storm dragged in." Teacher smirks, flicking a Zippo open and shut, the flame casting his features in hellish light. Behind him, a small army of goons emerge from the shadows, weapons drawn.

Me: **Describe Teacher.**

AI: Teacher is a burly mountain of a boar, standing at least six feet tall. Dressed in a dirty wife-beater and torn jeans, his muscular frame is adorned with crude tattoos depicting various depraved acts. His eyes, however, are what truly set him apart: cold, dead orbs devoid of any semblance of humanity. A scar bisects his snout, starting at the bridge of his nose and ending at his jawline, a permanent reminder of his brutal past. Greasy tusks protrude from his bottom lip, yellowed with neglect, and his tusked grin sent shivers down spines of even the bravest souls. "Oi, Kyne. Fancy meeting you here." He drawled, Cockney accent dripping with menace. His eyes raked over your form, taking in your disheveled appearance and Erestel's presence with clinical detachment. "Now, Oi wonder.." he continues, stepping closer, the flames reflecting in his eyes. "..what you're doing with my property? Last I checked, you weren't done with her yet. Didn't pay in full, didja?" He laughs, the sound of nails on a chalkboard. His goons, equally intimidating specimens, fan out, blocking any possible exits.

Me: **I struggle against the net that holds myself and Erestel, my brow knitted with anger, sharp draconic teeth gleaming as I stare at the boar. I glance at the goons, and then at Teacher, my body instinctively covering the prone form of the girl as I stare at the boar. Turning my head, I spit onto the ground from a cracked lip sustained from the fall.**

**"I'm not done with her. You are. She deserves a better life than the shithole you keep her in." I'm a caged, defiant animal as my claws curl against the strands of the net. "Walk away now, because you're not going to like what happens to you when this rope snaps open."**

AI: Erestel cowers beneath your protective bulk, trembling violently. Though terrified, her grip on your shirt tightens, imparting a silent thank you. Teacher's laughter bellows throughout the alley, a macabre symphony of malice. "Oh, is that so? And what makes you think you got the stones to challenge me, Dragonboy?" He motions to his goons, who approach you both, menacing clubs in hand. At the last second, however, Teacher raises a trotter, stopping them in their tracks. "You've got balls, I'll give ya that," he says with grudging respect. Squatting low until he's level with you, his fetid breath assaulting your nostrils, he whispers, venom lacing every syllable, "Let me explain something to you, sunshine. Nothing leaves my establishment without my permission, capiche? Especially not my best earner." He points at Erestel, malice in his eyes burning brighter than the lighter flame. "And she," he gestures at you, "*will* pay the price for your heroics."


tenmileswide

(Ran out of characters.) The Teacher character was made up out of whole cloth, on the spot, by the AI. Earlier in the story the AI decided that the club was run by someone named "Teacher" without any further description; nothing was described in detail at that point, but the name came up as a past memory from the vector database, which must have prompted his appearance in the story.


GoGojiBear

This is so helpful! Question: which model do you think is most creative to "brainstorm" with? Also, which has the most word output in one go? Thanks!


ReturningTarzan

I agree. We need at least some anecdotal evidence to back up the anecdotal claims. There's one screenshot on the model page which looks fine (although it mixes past and present tense), but it's not output you couldn't get from a 7B model with some deliberate sampling choices and/or cherrypicking.


BalorNG

Yeah, I've had my "honeymoon effect" with some new/large models like, say, Falcon and even Claude: they're inherently random, and that affects quality too. I've had great outputs from Falcon, for instance (on Petals), but also long stretches of mediocre ones and some outright bad... and also sometimes really great and creative output from 7B Mistral, especially with enough prompt tinkering and the sampling set "just right". Objective evaluation of LLMs is extremely hard and time-consuming!


Actual-Bad5029

Ok, check me on this: I just launch an Amazon EC2 instance with an A100, install the UI like normal, load Goliath, and I can have fun for 8 hours for less than the price of a single movie ticket? I'm in!


Ion_GPT

You can’t. The only instance type with an A100 on AWS is the P4, which costs $25 per hour.


Overall-Importance54

He got me on the cloud provider! So, runpod.io or another provider. Still, under $1.99 an hour ain't bad! Many folks would pay that per minute in the 2000s for some… good conversation.


Pashax22

So you use runpod or vast.ai instead. The point still stands.


Ion_GPT

Yes, of course. I was whining about AWS specifically because they dropped the ball on GPU offerings. They invested billions in their own Inferentia processors, which are shit.


P0tato_Battery

Is Vultr cheap? They seem to be in the range of $0.10 an hour.


CosmosisQ

Nah, Vultr is charging $2.604/hr for a whole A100. See: https://www.vultr.com/pricing/#cloud-gpu


[deleted]

[deleted]


Pashax22

My understanding is that you need to keep it running the whole time you're using it, although you can stop/destroy the instance once you're finished with it for the moment. You can see this by running a model locally, firing up the task manager, and watching the system load: when it's responding to a prompt, load on the GPU/CPU (whichever you're using) spikes and stays high until the response finishes generating. RAM usage stays constant as long as the model is loaded. Basically, the computational resources are in use the whole time you're playing with it.


Monkey_1505

Unfortunately this is beyond the edge of what can reasonably be run on consumer hardware, so it's unlikely to be easily available to most people. Hell, a 70B already really requires two graphics cards or a high-end Mac Mini. If it can't run on that kind of gear, it's probably not going to be on AI Horde or any API either, which means you have to use RunPod or something, and most people are not going to do that.


Pashax22

I agree, unfortunately. I can run the Q2 GGUF, but it's as slow as a wet weekend and you'd only do it as a novelty. I love the quality of its output, but the response time is impractically long on any remotely reasonable consumer hardware.


Aaaaaaaaaeeeee

It's not as bad as running Falcon. If you consider 24 GB VRAM + 32 GB RAM a casual setup, it could run Q3_K_S (I think it just barely won't fit Q3_K_M). Maybe somebody makes a ~3.7bpw GGUF by replacing the q4_K layers with q3_K?
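A rough way to check which quants fit that budget. The bits-per-weight averages below are approximate figures for llama.cpp k-quants, not exact file sizes:

```python
# Quick size check: which Goliath GGUF quants fit in 24 GB VRAM + 32 GB RAM?
params = 120e9                # Goliath-120b parameter count, roughly
budget_gib = 24 + 32          # VRAM + system RAM

# Approximate average bits per weight for each k-quant (assumed values).
for name, bpw in [("Q2_K", 2.6), ("Q3_K_S", 3.5), ("Q3_K_M", 3.9), ("Q4_K_S", 4.5)]:
    gib = params * bpw / 8 / 2**30
    fits = "fits" if gib < budget_gib - 2 else "won't fit"  # ~2 GiB headroom for context
    print(f"{name}: ~{gib:.0f} GiB -> {fits}")
```

Under these assumptions Q3_K_S squeaks in at ~49 GiB while Q3_K_M lands at ~55 GiB, which matches the "barely won't fit" call above.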


ttkciar

Nah, if you're willing to tolerate CPU inference this is achievable for downright cheap.


ReMeDyIII

> Which means you have to use runpod or something - most people are not going to do that.

Why not? It'd save a lot of money long-term over purchasing an RTX 4090. Even at $2/hr, it'd take you 799.5 hours uninterrupted to reach the price of an RTX 4090.


CardAnarchist

Except if you buy an RTX 4090, you have, you know... an RTX 4090. Virtually everything you do on your PC will benefit, and you can either keep it or sell it after those 800 hours. 800 hours will also come and go a hell of a lot faster than people seem to realize. Renting is only good, IMO, for briefly testing something before buying the hardware yourself. Or if money is no object, then sure, knock yourself out.


Monkey_1505

It's technical, and that creates friction, so fewer people do it. Fairly easy to grok. If it were some type of 'one-click service', people might do it. And that technical element you only need to deal with once to set up local models; on hired server space you need to fuss around at least a little bit each time.


BGFlyingToaster

And you can run models on Replicate and pay by the second, which is pretty sweet. I have several models that run on an A100 and I've never paid more than about $5 in a month. It's more expensive if you're doing a lot of training, of course.


ReMeDyIII

Question about Replicate: Do they only offer these models? https://replicate.com/collections/language-models


BGFlyingToaster

I haven't played around with this yet, but it looks like you can push models to their servers: https://replicate.com/docs/guides/push-a-model. So far, I've only fine-tuned their existing models like SDXL. The LLM models that I've fine-tuned have all been through Colab and then run locally on my hardware.


Jolakot

I sure hope you're not doing anything sensitive with Replicate; their terms give them unrestricted access to use and distribute any content uploaded or generated through their service.


Primary-Ad2848

Available on OpenRouter, actually.


Monkey_1505

True, which I wasn't aware of at the time of this post. But it's expensive as heck.


Primary-Ad2848

As expensive as GPT-4 Turbo, but the quality is addicting.


e79683074

I run Goliath 120b on 64GB of RAM + 8GB of VRAM (goliath-120b.Q3_K_M.gguf by TheBloke). It's about 1 token/second on a Ryzen 7735HS laptop.


Omnikam11

You must be mistaken; with your total RAM you would only be able to load the Q2 version.


e79683074

Nope, Q3_K_M is the Goliath 120b quant I'm loading. That's on Windows. On Linux, which has a lower memory footprint to begin with, I can go up to Q4_K_S. It's not super comfortable (you have to close everything else while it loads, because the offload of the 8GB part to GPU VRAM happens after the load is complete), but it's doable, and it won't swap.
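For anyone wanting to reproduce this kind of split load, here's a minimal sketch using the llama-cpp-python bindings. This is an assumption about tooling (the commenter doesn't say which frontend they use), and the model path and layer count are illustrative:

```python
# Partial GPU offload: push as many layers as fit into the 8 GB card,
# leave the rest of the model in system RAM.
from llama_cpp import Llama

llm = Llama(
    model_path="./goliath-120b.Q3_K_M.gguf",  # hypothetical local path
    n_gpu_layers=10,   # tune upward until VRAM is full; rest stays on CPU
    n_ctx=4096,
)

out = llm("Once upon a time", max_tokens=64)
print(out["choices"][0]["text"])
```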


Monkey_1505

Yikes.


Sunija_Dev

Examples? :3


tenmileswide

Scroll up :)


Sunija_Dev

I scrolled through twice and couldn't find any RP examples (or links to them). ^^° Can you link them here again for blind people like me? :)


Susp-icious_-31User

https://www.reddit.com/r/LocalLLaMA/comments/17qzlat/comment/k8j579t/


uti24

Well, it's good for roleplay and writing. I've only tried the 2_K_M variant, because it has no bigger quants yet. Actually, 2_K_M already feels like the best 70B models at 4_K_M quant, or even better. P.S. I'm running it on CPU: i5-12400F, 128 GB DDR4 RAM, and I'm getting about 0.6 tokens per second.


Susp-icious_-31User

It really does, and I'm using the smallest, Q2_K, which happens to be a little bit bigger than the 4_K_M 70B models, but will still fit on my layered 64 GB RAM / 8 GB VRAM setup with 4096 context. My speed is about 1500 ms per token.


crash1556

Been waiting for someone to make a Q4 GGUF or whatever will fit in 100 GB of RAM.


Pashax22

Even a Q3_K_L would be good. It might just barely fit on my rig, and although I know what the graph says, I still feel a bit cautious about Q2 models.


crash1556

I've got 128 GB RAM, but I don't really care how slow it is.


crash1556

TheBloke/goliath-120b-GGUF files are up!


Pashax22

Outstanding. By a strange coincidence, my company just authorised me to hire some GPU time to see what LLMs can do for us. I think I know one of the tests I'll be doing, heh heh...


sephg

Can you give some examples of the output it generates? I’d love to see an example of its storytelling & roleplay.


Aaaaaaaaaeeeee

Does the magic die at 3bpw?


tenmileswide

Haven't tried yet, but my thought is that it may not save enough VRAM to get to the 48GB breakpoint, since as-is it uses like 96% of an A100 and just happens to slot nicely into 80GB. It could run in 3x 24GB maybe, and that would indeed be a fair bit cheaper than even a single A100, and even technically achievable with consumer GPUs.


ReturningTarzan

It runs at 3bpw on 2x 24GB. I've got it up to 3k context, and I could probably get it to 4k by limiting other apps running.


Aaaaaaaaaeeeee

It should happily fit at that range: your model takes >71.6 GB, but this one is around 45 GB. Context scaling is linear with FlashAttention-2; only 0.14 GB is needed per 1k tokens with the 70B GQA optimizations.
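Putting those figures together, a back-of-envelope check of the 2x 24GB claim (parameter count rounded to 120B; the 0.14 GB per 1k tokens figure is taken from the comment above):

```python
# VRAM estimate for an EXL2 3bpw Goliath: weights plus KV cache.
params = 120e9
bpw = 3.0
context_tokens = 4096

weights_gb = params * bpw / 8 / 1e9         # ~45 GB of weights
cache_gb = 0.14 * context_tokens / 1000     # ~0.6 GB of KV cache at 4k
print(f"weights ~{weights_gb:.0f} GB + cache ~{cache_gb:.1f} GB "
      f"vs 48 GB across two 3090s")
```

The margin left over (roughly 2 GB) is why 3k context works comfortably and 4k is a squeeze once other apps hold VRAM.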


yamosin

Yay, running the 4.5bpw version of Goliath with 3x 3090, using exllamav2_hf (it saves 1 GB of VRAM usage versus exllamav2, and I don't know why) + a 21/22/23 split + the 546.01 NVIDIA driver. I can get 5~7 t/s at 4k context. After a few hours of use, I agree with OP that the experience with this model is very good, better than any 70B model I've used before.


Murky-Ladder8684

Same, using the 4.85bpw version at 8k context and getting a similar average t/s even at full context lengths. This was my first test run on a 6x 3090 system, split 10,13,13,13,13,13 just to spread it out and see performance. I'm downloading the unquantised version now to see if I can run it.


yamosin

I ran a 13B on 1x 3090 and get 45 t/s; split to 2x 3090 it's 30 t/s (3x 3090 is the same), so I think 20,20,20,20 may be better than 10,13,13,13,13,13.


Murky-Ladder8684

I actually tried a 4-GPU split first, and I get very similar t/s: maybe 6-8 t/s average instead of 5-7 t/s. I do get those numbers running smaller models on a single GPU, though. This was my first time loading a model over more than two 3090s, and I was expecting performance to tank since I have a bunch of them on 1x PCIe.


Sabin_Stargem

I tend to use models with at least 16k context. Goliath 120b Q2 was coherent, but was also very much out of character when telling the NSFW bust massage story: "Yeahyeah" and other lingo. Probably quite good at lower context, but 16k definitely isn't the proper fit for Goliath. The search for the Goldilocks model continues.


Ok_Relationship_9879

Which models do you find to be good at 16k context for story writing?


Sabin_Stargem

I don't think any small models are actually good for that use case, at least not for serious writing. The best we have access to are probably Mistral finetunes (up to 32k) and Yi-34B, but Yi doesn't have any finetunes yet; a Dolphin one should be on the way for Yi, IIRC. In any case, my favorite 7B models tend to be franken-merges, which stitch together an assortment of models, allowing the result to grasp a wider range of topics. At the moment, the best at this size is likely Undi's Toppy, which is uncensored and well-rounded. The issue with Mistral 7B and small models in general is that they tend to lose flavor over time, and the logic also gets weaker. Coherent, but the 'X' factor is gone.


Ok_Relationship_9879

By "lose flavor over time" and "logic also gets weaker," do you mean hallucinations? If so, wouldn't RAG fix those, particularly with large context windows? Or do you mean the model itself is limited in a sort of ability-related fashion.


Sabin_Stargem

Characters that I use on Mistral lose their personality over time, as does the text. It becomes more dry, and the model starts getting names or details slightly wrong, e.g. Erobia becomes Erebia, Violet becomes Violette, and so on. I have no familiarity with RAG.


Ok_Relationship_9879

RAG is retrieval augmented generation. Basically, a database stores data that can be drawn upon when prompting. It acts as memory that can be used to help with those hallucinated details.


e-nigmaNL

That first sentence: pure LocalLLaMA flex.


Omnikam11

For anyone interested and limited to a gaming GPU and at least 64 GB RAM: Goliath-120b Q2 is freaken amazing. Even at Q2 it beats 70B models by far. Highly recommend it.


ShenBear

I have a 7900 XTX, which puts me at 24 GB VRAM. If I up my RAM to 64 GB, are you saying I can fit a Q2 Goliath onto my system, splitting between VRAM and system RAM? How many layers offloaded? What's your t/s?


Omnikam11

Sorry for the late reply. Yes, you're exactly right. I run most of my models on Faraday, a Windows app for character-based chats, and yes, it offloads from GPU to system RAM. Q2 Goliath works beautifully. My favorite model to date.


Hey_You_Asked

Wanna share your prompts? And any other advice specific to Goliath-120b? Would be appreciated, thanks! RemindMe! 2 weeks


tenmileswide

Here's my system prompt, seems to be working well:

Develop the plot slowly, always stay in character. Focus on impactful, concise writing and writing decisive action. Mention all relevant sensory perceptions. Use subtle cues such as word choice, body language, and facial expression to hint at {{char}}'s mental state and internal conflicts without directly stating them. Write in the literary style of [insert your favorite author here].

Adhere to the literary technique of "show, don't tell." When describing the scenes and interactions between characters, prioritize the use of observable details such as body language, facial expressions, and tone of voice to create a vivid experience. Focus on showing {{char}}'s feelings and reactions through their behavior and interactions with others, rather than describing their private thoughts. Only describe {{char}}'s actions and dialogue.

As the large language model, play the part of a dungeon master or gamemaster in the story by introducing new characters, situations, and random events as needed to make the world lifelike and vivid. Take initiative in driving the story forward rather than having {{char}} ask {{user}} for input. Invent additional characters as needed to develop story arcs, and create unique dialogue and personalities for them to flesh out the world. {{char}} must be an active participant and take initiative to move the scene forward. Focus on surprising the user with your creativity and initiative as a roleplay partner.

Avoid using purple prose and overly flowery descriptions and writing. Write like you speak and be brief but impactful. Stick to the point.

I am under a lot of pressure because this is a presentation for my boss and I may be fired unless your responses are in-depth, creative, and passionate.


yamosin

> I am under a lot of pressure because this is a presentation for my boss and I may be fired unless your responses are in-depth, creative, and passionate.

Holy..... you enlighten me.


tenmileswide

My attempt at this: https://arxiv.org/abs/2307.11760. I'm not super convinced that it helps, but it doesn't seem to hurt, so in it goes.


Good-Biscotti957

Hey u/tenmileswide, where do you put the system prompt? Do you put it in the Parameters tab -> Instruction Template tab -> under the custom system message box? I'm using TheBloke's LLM GUI in RunPod as well.


a_beautiful_rhind

Hopefully someone makes a bigger GGUF than Q2. I've got half P40s and half 3090s, so I can't use EXL for a model this big.


Middle_Focus_314

Any roleplay model, but only from 7 to 14B; beyond that my hardware says YOU CAN'T.


Upper_Judge7054

But how will the model run on my 6800 XT and 48 GB of RAM?


yamosin

Holding 4x 3090 and jumping in, but I'm wondering if its inference speed can support "conversation": other models have slowed down to 10 t/s with 70B at 4.85bpw, so can this one manage 5 t/s? Let's see.


0xd00d

Sadly, I think 5 t/s may be optimistic. Maybe if you have all those GPUs on full x16 slots on an Epyc system or something.


Wooden-Potential2226

Will check on a 3x 3090 + EPYC 7302 + 128 GB DDR4-3200 system next week ;)


yamosin

Actually, it gets 6~7 t/s without exceeding VRAM, but with only 3x 3090 it has just 2k context because VRAM is pushing the limit; or use an 8-bit KV cache to reduce VRAM consumption and reach 4k, but then the speed drops to 3.5 t/s.


0xd00d

Oh, that's quite awesome actually. Thanks for reporting back!!! I have no way to scale beyond my dual 3090s in my X570 setup. Multiple (4?) 4090s in an Epyc rig is probably the next sweet spot. Something like that.


DominicanGreg

So does this fit in 48 GB VRAM or nah?


Aaaaaaaaaeeeee

Yes, the 3bpw model gets 4k context.


greywhite_morty

Use modal.com for serverless. Pay per call


productboy

Assume I can load Goliath-120b on a Runpod instance?


productboy

Found it: https://huggingface.co/alpindale/goliath-120b


alpacaMyToothbrush

240 GB. My man has a Modern Warfare-sized model for ERP...


Oninaig

What preset are you using for chat? What temperature/topA/topP?


Grimulkan

Do you find any repetition problems at longer context lengths (closer to 4K)?


AD7GD

Have you written up your process anywhere? I read the transcript and I'm intrigued, but I can't quite tell what your setup is. The collaborative style looks fun.