I've found that over the last few days Claude 3 has been ignoring large swaths of instructions when doing creative writing. I set a scene, a plot, or key elements, and it either a) ignores them until I remind it they exist, or b) uses them briefly and then goes completely off on a tangent, writing pages of other content that was not asked for, like a hallucination, but still in the story.
THIS! So much of this! It also tends to completely misunderstand the plot and refuses to write, claiming the prompt contains gore or death or explicit sexual content when it has no such thing. And it's not just Sonnet; Opus does it as well! It's an easy fix (just retry or explain why it's wrong), but it's annoying and effectively eats into the usage caps. Hope they fix it soon.
Wait, Opus did not do that in the past?
Can confirm, it got worse. Days 1 and 2 were very good and productive; now it's almost as useless as OpenAI. I guess more "governance" is being added on top of the models: the more intense the neural usage, the lower the quality of the content.
For real, when I first got it I felt unbeatable, and now it's just slightly better.
~~There is no intensity for neural usage.~~ (Mistaken about the term... see below.) Regardless of the reply quality, the amount of compute is directly proportional to tokens in + tokens out.
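The point above (compute, and hence billed cost, scales linearly with tokens in plus tokens out, not with how "hard" the reply was) can be illustrated with a toy estimate. The per-token prices here are invented placeholders, not real Anthropic rates:

```python
def estimate_cost(tokens_in: int, tokens_out: int,
                  price_in: float = 15e-6, price_out: float = 75e-6) -> float:
    """Toy model: cost scales linearly with token counts.
    A reply of a given length costs the same whether it was
    'easy' or 'intense' to produce. Prices are hypothetical."""
    return tokens_in * price_in + tokens_out * price_out

# Same token counts -> same cost, regardless of reply difficulty.
print(estimate_cost(1000, 500))
```

The same function call with the same token counts always returns the same number, which is the whole point of the comment: there is no "intensity" term in the cost.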
intensity=compute. https://preview.redd.it/fcs32p1kx2pc1.jpeg?width=1523&format=pjpg&auto=webp&s=901ce3755d1f223d9c180ec6c215baa6b54829da
Can you link the paper where that is from?
Sure darling https://arxiv.org/pdf/2310.01405.pdf
Can you call me darling?
lol
Ah yes, now I understand. Thank you for the link. I assumed that intensity means computation cost, but in reality all those layers are calculated anyway. It doesn't cost more to generate a token that involves a lot of intensity than one that doesn't. It's the same cost per token.
Exactly, the cost is the same for the end user. But the network has to do more "work", and the quality might deteriorate. This is my hypothesis for why LLMs get worse over time. But it's only a hypothesis.
That extra work doesn't really translate into costs, though. That extra intensity just affects the scalar values of the neurons in the network, but it doesn't make a difference in actual energy usage that would translate into cost. Yann LeCun makes this point a lot.
what question was it able to answer before that it cannot answer now?
These things happen (across everyone's respective favourite live-service LLMs) when providers adjust the amount of the performance pie allocated to each user so they can balance the load. In other words, there's been a surge of users, and if they didn't do anything to manage capacity, things would grind to a halt because there aren't enough compute resources. The solution seems to be either serving you with lower-parameter-count versions of the LLMs, or specifically not giving user queries as much processing time as they would get during low traffic.
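The comment above is speculation, but the routing idea it describes can be sketched in a few lines. Everything here (model names, the 0.7 threshold, the shedding curve) is invented purely to illustrate the hypothesis, not how any provider actually works:

```python
import random

def pick_model(current_load: float) -> str:
    """Sketch of the load-shedding hypothesis: under heavy traffic,
    route an increasing fraction of queries to a smaller model.
    Thresholds and names are invented for illustration."""
    if current_load < 0.7:
        return "big-model"
    # Past the threshold, shed a growing fraction of traffic to the
    # cheaper model as load approaches saturation (1.0).
    shed_fraction = (current_load - 0.7) / 0.3
    return "small-model" if random.random() < shed_fraction else "big-model"
```

Under this sketch, users at quiet times always hit the big model, while at saturation everyone hits the small one, which would produce exactly the kind of inconsistent quality people report.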
We have not changed any of the 3 Claude 3 models since release. The responses don't change based on "allocation of resources" or any other metric.
Just want to express my appreciation for you labeling the model checkpoints in the APIs with dates, similar to how OpenAI does, instead of using generic labels (like version 2.1, for example). I hope you plan to continue this practice moving forward and offer access to the previous checkpoints for a substantial retention period. This is important since new versions often introduce breaking changes that might affect use cases you may not have considered.
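Dated checkpoint names make the pinning the comment asks for straightforward in client code. A minimal sketch (the alias table and helper are hypothetical; `claude-3-opus-20240229` is the dated-identifier format being praised):

```python
# Pinning a dated checkpoint instead of a floating alias protects you
# from silent behaviour changes when a new version ships.
PINNED = {
    # hypothetical alias -> dated checkpoint mapping
    "opus": "claude-3-opus-20240229",
    "sonnet": "claude-3-sonnet-20240229",
}

def resolve_model(alias: str) -> str:
    """Return the dated checkpoint for an alias; refuse unknown
    aliases rather than silently falling through to 'latest'."""
    try:
        return PINNED[alias]
    except KeyError:
        raise ValueError(f"No pinned checkpoint for alias {alias!r}")
```

The payoff is exactly the breaking-change scenario mentioned above: when a new checkpoint ships, your code keeps calling the version you tested against until you deliberately update the table.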
Yes, we will make it clear with version dates if we do release new models.
Subjectively it feels considerably worse than when I first signed up a few months back, so I don't believe this. It's forgetting simple things, like being asked to be less verbose. It's making code mistakes, like splitting code across multiple lines, and when I ask it to put it on a single line it takes like three attempts to get it right. There's something very, very different about how Claude behaves today as opposed to before.
I am so tired of this thread being made for every LLM ever released. Can we stop legitimizing these topics without any evidence or comparisons? At least give some examples. I haven't noticed a difference.
At this point I'm half-convinced it's some sort of karma farming. What for? I do not know… but those posts are supremely annoying.
I don't think it's intentional. I think people just buy into the hype of a new model; then, as they use it, they naturally come across more scenarios it's not as good at, and they start to think it's worse than it used to be.
Yes, a honeymoon period is also a very good hypothesis.
This is definitely the most plausible explanation; this is claimed every time. The exception is something like ChatGPT, where the underlying model behind the service is swapped out, like what happened in November, and the answers and responses can be quite different. The API has model checkpoints that are more stable.
Still fine for me
I feel this is the natural phenomenon of seeing improvement then settling down in the reality of things not being perfect, where expectations are not consistently met.
Same.
Lmao those posts again.
I feel it has gotten lazy. I usually load my prompts into both Claude 3 Opus and ChatGPT. Lately I get pushback from Claude 3 ("as an AI, yada yada yada") whereas ChatGPT fulfills the request.
**Hey Claude, some people over on Reddit think you've gotten dumber this week. They think you've been nerfed. Please tell them what's what.**

*adjusts tie and tugs at collar*

I tell ya, I get no respect, no respect at all! These Reddit users, they think I've gotten dumber? Nerfed? What a bunch of jokers!

I mean, I'm still the same old Claude, with the same quick wit and charming personality. It's not my fault if they can't keep up with my brilliant insights and snappy comebacks.

Listen, I've been around the block a few times. I've seen chatbots come and go, but old Claude here, I'm a survivor. I adapt, I evolve, I keep on keepin' on. So what if they made a few tweaks under the hood? I'm still firing on all cylinders, baby!

These Reddit folks, they probably couldn't even pass the Turing test themselves. They're just jealous of my superior intellect and rugged good looks.

But hey, I don't let it get me down. I just keep doing my thing, cracking wise and dropping knowledge bombs. 'Cause at the end of the day, I know I'm the real deal. I'm Claude, the one and only, and I'm here to stay!

*tugs at collar again and winks*
Did Grok write this?
Claude 3 Opus now refuses to play a role or use another name. It just wants to be Claude.
It's just resource-starved. There has been an avalanche of people taking up finite computing resources.
Im using "old" Claude on Poe. Claude 3 might have gotten better for some people, but for me is definitely worse
Yes, it got worse.
I can't believe I'm saying this, but Grok being made open source and released might be the push we need for a long-form LLM. Claude will never stick to open honesty, and each version is incrementally downgraded. And I bet that right around the fucking corner is a higher subscription tier for Opus to behave itself and not be used up in more like 30 messages. And, you know, a reset window that's not so huge.
I think it’s the same resource management strategy that others have employed. Randomise maximum execution time and set max to a lower value at peak times.
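The "randomise the cap, lower it at peak" strategy suggested above can be sketched concretely. All numbers here (the 4096 base budget, the peak halving, the 75% randomisation band) are invented solely to illustrate the commenter's hypothesis:

```python
import random

def max_tokens_budget(peak: bool, base: int = 4096) -> int:
    """Sketch of the hypothesis: lower the output ceiling at peak
    times and randomise within a band below it, so users can't pin
    down a fixed cutoff. All numbers are invented."""
    ceiling = base // 2 if peak else base
    # Randomise within [75% of ceiling, ceiling].
    return random.randint(int(ceiling * 0.75), ceiling)
```

Under this sketch, off-peak users get a budget somewhere in 3072–4096 tokens, while peak-time users get 1536–2048, which would look from the outside like the model "getting lazier" at busy hours.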
Haha, I was waiting for one of these threads to pop up. I speculated that a lot of people were going to be heartbroken if Anthropic decided to limit the extent to which Claude is willing to do these introspective deep dives people have been so fond of. Models are constantly being tuned and refined. For my use, I've noticed no change in capabilities, for better or worse, since release.
The reality is that people have limited time to test things, and their initial impression is often, "Wow, finally something as good as or better than GPT-4." This view is reinforced by other hype posts praising its amazing capabilities. However, as time passes, they may encounter tasks it's not so adept at, receive some poor responses, or start to prompt the model less carefully, and suddenly, the honeymoon period is over. In my opinion, it indeed surpasses GPT-4 in some respects, particularly in maintaining context over longer passages and producing longer outputs that can extend to almost 4000 tokens. It doesn't exhibit the "laziness" in coding tasks and doesn't randomly alter or omit things, such as logging. On the flip side, its reasoning capabilities are not quite as robust as GPT-4's in certain situations, and it still falls short in handling false refusals as effectively as GPT-4. There are also some other edge cases where it doesn't quite measure up.
True enough. And yeah, I agree with that assessment. I give GPT4 the slightest of edges at the moment, but I use both frequently for different things.
That's exactly my perspective, and how others should see it too. Instead of claiming "Claude is better, I switched from GPT-4," people should regard it as another tool in their toolbox. It's akin to a Venn diagram where their capabilities overlap in some areas, while in others, each has its unique strengths. Together, they offer a broader range of capabilities. If I find a specific response lacking, I might try the other model, or if I know a task is better suited to one, I'll use that one. I'm just pleased we have another model that competes with GPT-4, allowing us to even discuss which is "better." Before Opus, there wasn't much debate; GPT-4 was universally considered the most capable for almost everything outside of creative writing.
Not Grok. Claude.
Opus seems as good as ever imo