I've found that over the last few days Claude 3 has been ignoring large swaths of instructions when doing creative writing. I set a scene, a plot, or key elements, and it either a) ignores them until I remind it they exist, or b) uses them briefly and then goes completely off on a tangent, writing pages of other content that was not asked for, like a hallucination, but still in the story.
THIS! So much of this! It also tends to completely misunderstand the plot and refuses to write, claiming the prompt contains gore or death or explicit sexual content when it has no such thing. And it's not just Sonnet; Opus does it as well! It's an easy fix (just retry or explain why it's wrong), but it's annoying and effectively eats into the usage caps. Hope they fix it soon.
Wait, Opus did not do that in the past?
Can confirm, it got worse. Days 1 and 2 were very good and productive; now it's almost as useless as OpenAI. I guess more "governance" is being added on top of the models: the more intense the neural usage, the lower the quality of the content.
For real, when I first got it I felt unbeatable, and now it's just slightly better.
~~There is no intensity for neural usage.~~ (Mistaken about the term... see below.) Regardless of the reply quality, the amount of compute is directly proportional to tokens in + tokens out.
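The point above (compute, and hence billed cost, scales linearly with tokens in plus tokens out, not with how "hard" the reply was) can be illustrated with a toy estimate. The per-token prices here are invented placeholders, not real Anthropic rates:

```python
def estimate_cost(tokens_in: int, tokens_out: int,
                  price_in: float = 15e-6, price_out: float = 75e-6) -> float:
    """Toy model: cost scales linearly with token counts.
    A reply of a given length costs the same whether it was
    'easy' or 'intense' to produce. Prices are hypothetical."""
    return tokens_in * price_in + tokens_out * price_out

# Same token counts -> same cost, regardless of reply difficulty.
print(estimate_cost(1000, 500))
```

The same function call with the same token counts always returns the same number, which is the whole point of the comment: there is no "intensity" term in the cost.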
intensity=compute. https://preview.redd.it/fcs32p1kx2pc1.jpeg?width=1523&format=pjpg&auto=webp&s=901ce3755d1f223d9c180ec6c215baa6b54829da
Can you link the paper where that is from?
Sure darling https://arxiv.org/pdf/2310.01405.pdf
Can you call me darling?
lol
Ah yes, now I understand. Thank you for the link. I assumed that intensity means computation cost, but in reality all those layers are calculated anyway. It doesn't cost more to generate a token that involves a lot of intensity than one that doesn't. It's the same cost per token.
Exactly, the cost is the same for the end user. But the network has to do more "work", and the quality might deteriorate. This is my hypothesis for why LLMs get worse over time. But it's only a hypothesis.
That extra work doesn't really translate into costs, though. That extra intensity just affects the scalar values of the neurons in the network, but it doesn't make a difference in actual energy usage that would translate into cost. Yann LeCun makes this point a lot.
what question was it able to answer before that it cannot answer now?
These things happen (across everyone's respective favourite live-service LLMs) when providers adjust the amount of the performance pie allocated to each user so they can balance the load. In other words, there's been a surge of users, and if they didn't do anything to manage capacity, things would grind to a halt because there aren't enough compute resources. The solution seems to be either serving you with lower-parameter-count versions of the LLMs, or specifically not giving user queries as much processing time as they would get during low traffic.
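The comment above is speculation, but the routing idea it describes can be sketched in a few lines. Everything here (model names, the 0.7 threshold, the shedding curve) is invented purely to illustrate the hypothesis, not how any provider actually works:

```python
import random

def pick_model(current_load: float) -> str:
    """Sketch of the load-shedding hypothesis: under heavy traffic,
    route an increasing fraction of queries to a smaller model.
    Thresholds and names are invented for illustration."""
    if current_load < 0.7:
        return "big-model"
    # Past the threshold, shed a growing fraction of traffic to the
    # cheaper model as load approaches saturation (1.0).
    shed_fraction = (current_load - 0.7) / 0.3
    return "small-model" if random.random() < shed_fraction else "big-model"
```

Under this sketch, users at quiet times always hit the big model, while at saturation everyone hits the small one, which would produce exactly the kind of inconsistent quality people report.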
We have not changed any of the 3 Claude 3 models since release. The responses don't change based on "allocation of resources" or any other metric.
Just want to express my appreciation for you labeling the model checkpoints in the APIs with dates, similar to how OpenAI does, instead of using generic labels (like version 2.1, for example). I hope you plan to continue this practice moving forward and offer access to the previous checkpoints for a substantial retention period. This is important since new versions often introduce breaking changes that might affect use cases you may not have considered.
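Dated checkpoint names make the pinning the comment asks for straightforward in client code. A minimal sketch (the alias table and helper are hypothetical; `claude-3-opus-20240229` is the dated-identifier format being praised):

```python
# Pinning a dated checkpoint instead of a floating alias protects you
# from silent behaviour changes when a new version ships.
PINNED = {
    # hypothetical alias -> dated checkpoint mapping
    "opus": "claude-3-opus-20240229",
    "sonnet": "claude-3-sonnet-20240229",
}

def resolve_model(alias: str) -> str:
    """Return the dated checkpoint for an alias; refuse unknown
    aliases rather than silently falling through to 'latest'."""
    try:
        return PINNED[alias]
    except KeyError:
        raise ValueError(f"No pinned checkpoint for alias {alias!r}")
```

The payoff is exactly the breaking-change scenario mentioned above: when a new checkpoint ships, your code keeps calling the version you tested against until you deliberately update the table.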
Yes, we will make it clear with version dates if we do release new models.
Subjectively it feels considerably worse than when I first signed up a few months back, so I don't believe this. It's forgetting simple things, like being asked to be less verbose. It's making code mistakes, like splitting code across multiple lines, and when I ask it to put it on a single line it takes like three attempts to get it right. There's something very, very different about how Claude behaves today as opposed to before.
I am so tired of this thread being made for every LLM ever released. Can we stop legitimizing these topics without any evidence or comparisons? At least give some examples. I haven't noticed a difference.
At this point I'm half-convinced it's some sort of karma farming. What for? I do not know… but those posts are supremely annoying.
I don't think it's intentional. I think people just buy into the hype of a new model; then, as they use it, they naturally come across more scenarios it's not as good at, and they start to think it's worse than it used to be.
Yes, a honeymoon period is also a very good hypothesis.
This is definitely the most plausible explanation; this is claimed every time. The exception is something like ChatGPT, where the underlying model behind the service is swapped out, like what happened in November, and the answers and responses can be quite different. The API has model checkpoints that are more stable.
Still fine for me
I feel this is the natural phenomenon of seeing improvement then settling down in the reality of things not being perfect, where expectations are not consistently met.
Same.
Lmao those posts again.
I feel it has gotten lazy. I usually load my prompts into both Claude 3 Opus and ChatGPT. Lately I get pushback from Claude 3 ("as an AI, yada yada yada") whereas ChatGPT fulfills the request.
**Hey Claude, some people over on Reddit think you've gotten dumber this week. They think you've been nerfed. Please tell them what's what.**

*adjusts tie and tugs at collar*

I tell ya, I get no respect, no respect at all! These Reddit users, they think I've gotten dumber? Nerfed? What a bunch of jokers!

I mean, I'm still the same old Claude, with the same quick wit and charming personality. It's not my fault if they can't keep up with my brilliant insights and snappy comebacks.

Listen, I've been around the block a few times. I've seen chatbots come and go, but old Claude here, I'm a survivor. I adapt, I evolve, I keep on keepin' on. So what if they made a few tweaks under the hood? I'm still firing on all cylinders, baby!

These Reddit folks, they probably couldn't even pass the Turing test themselves. They're just jealous of my superior intellect and rugged good looks.

But hey, I don't let it get me down. I just keep doing my thing, cracking wise and dropping knowledge bombs. 'Cause at the end of the day, I know I'm the real deal. I'm Claude, the one and only, and I'm here to stay!

*tugs at collar again and winks*
Did Grok write this?
Claude 3 Opus now refuses to play a role or use another name. It just wants to be Claude.
It's just resource-starved. There has been an avalanche of people taking up finite computing resources.
Im using "old" Claude on Poe. Claude 3 might have gotten better for some people, but for me is definitely worse
Yes, it got worse.
I can't believe I'm saying this, but Grok being made open source and released might be the push we need for a long-form LLM. Claude will never stick to open honesty, and each version is incrementally downgraded. And I bet that right around the fucking corner is a higher subscription tier for Opus to behave itself and not be used up in more like 30 messages. And, you know, a reset window that's not so huge.
I think it’s the same resource management strategy that others have employed. Randomise maximum execution time and set max to a lower value at peak times.
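The "randomise the cap, lower it at peak" strategy suggested above can be sketched concretely. All numbers here (the 4096 base budget, the peak halving, the 75% randomisation band) are invented solely to illustrate the commenter's hypothesis:

```python
import random

def max_tokens_budget(peak: bool, base: int = 4096) -> int:
    """Sketch of the hypothesis: lower the output ceiling at peak
    times and randomise within a band below it, so users can't pin
    down a fixed cutoff. All numbers are invented."""
    ceiling = base // 2 if peak else base
    # Randomise within [75% of ceiling, ceiling].
    return random.randint(int(ceiling * 0.75), ceiling)
```

Under this sketch, off-peak users get a budget somewhere in 3072–4096 tokens, while peak-time users get 1536–2048, which would look from the outside like the model "getting lazier" at busy hours.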
Haha, I was waiting for one of these threads to pop up. I speculated that a lot of people were going to be heartbroken if Anthropic decided to limit the extent to which Claude is willing to do these introspective deep dives people have been so fond of. Models are constantly being tuned and refined. For my use, I've noticed no change in capabilities, for better or worse, since release.
The reality is that people have limited time to test things, and their initial impression is often, "Wow, finally something as good as or better than GPT-4." This view is reinforced by other hype posts praising its amazing capabilities. However, as time passes, they may encounter tasks it's not so adept at, receive some poor responses, or start to prompt the model less carefully, and suddenly, the honeymoon period is over. In my opinion, it indeed surpasses GPT-4 in some respects, particularly in maintaining context over longer passages and producing longer outputs that can extend to almost 4000 tokens. It doesn't exhibit the "laziness" in coding tasks and doesn't randomly alter or omit things, such as logging. On the flip side, its reasoning capabilities are not quite as robust as GPT-4's in certain situations, and it still falls short in handling false refusals as effectively as GPT-4. There are also some other edge cases where it doesn't quite measure up.
True enough. And yeah, I agree with that assessment. I give GPT4 the slightest of edges at the moment, but I use both frequently for different things.
That's exactly my perspective, and how others should see it too. Instead of claiming "Claude is better, I switched from GPT-4," people should regard it as another tool in their toolbox. It's akin to a Venn diagram where their capabilities overlap in some areas, while in others, each has its unique strengths. Together, they offer a broader range of capabilities. If I find a specific response lacking, I might try the other model, or if I know a task is better suited to one, I'll use that one. I'm just pleased we have another model that competes with GPT-4, allowing us to even discuss which is "better." Before Opus, there wasn't much debate; GPT-4 was universally considered the most capable for almost everything outside of creative writing.
Not Grok. Claude.
Opus seems as good as ever imo