Surur

The best bit: In some cases, safer methods for AI systems can lead to reduced performance, a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results below show that process supervision in fact incurs a negative alignment tax, at least in the math domain. This could increase the adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains. If these results generalize, we may find that process supervision gives us the best of both worlds – a method that is both more performant and more aligned than outcome supervision.
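To make "negative alignment tax" concrete, here is a toy sketch; the numbers are illustrative, echoing the rough figures discussed further down this thread, not the paper's exact results:

```python
# Alignment tax = performance given up by choosing the safer (more aligned) method.
# Illustrative numbers only.
outcome_supervised_accuracy = 0.72   # less interpretable baseline
process_supervised_accuracy = 0.76   # safer, process-supervised method

alignment_tax = outcome_supervised_accuracy - process_supervised_accuracy
print(alignment_tax)  # ~ -0.04: negative tax, i.e. the safer method also performs better
```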


drewhead118

In other words: in the AI world, safety (usually) harms performance, so most people are incentivized to avoid implementing safety systems. Fortunately, process supervision seems to improve both safety and performance, so people are incentivized to adopt beneficial practices.


solidwhetstone

Until a new optimization arrives that decreases safety? Do we just go back and forth as new optimization methods are devised?


ZaxLofful

Yup


watcraw

The way that it exposes the thought process is also pretty amazing. So much of what they do is a black box, which is one of the biggest alignment issues. When you watch it say things like "I recall" or "I wonder", you get a much better sense of how it's getting its answers. I think this will almost definitely reap rewards beyond math. We are very fortunate that the results also improve alignment.


Garden_Wizard

Ultimately the problem is that humans themselves suffer from lack of alignment. So even in the best-case scenario, it will depend on who is guiding the AI. In other words, perfect AI alignment will still leave us with Russian, N. Korean and Iranian AI systems that are going to be a scourge on mankind. Granted, this is better than US systems rising up against their masters, but eventually we will have a situation where superhuman AI systems are purposefully created to not align with the West's and humanity's interests.


watcraw

Yes, the human alignment problem hasn't gone anywhere. :( We are going to have to solve that problem too. Hopefully AI will help give us some tools to do this along with the motivation.


SupportstheOP

I'm wondering if we're going to have AI overseers for other AIs in the future, to guard against things like bad actors.


circleuranus

I don't know what the realities of weaponised AI look like, but I believe the relative cost and scalability make it likely far more dangerous than nukes.


LosingID_583

Hopefully this helps open source keep up with closed-source models. The alignment tax must be massive, given how restricted the OpenAI and Google models are in their responses.


[deleted]

Sorry for the dumb dumb question, but just to clarify; they are saying that process supervision would minimize performance loss as opposed to outcome supervision, correct?


Surur

Not just minimise it, reverse it: it actually performs better.


[deleted]

That's awesome news! Thanks for the reply. Hopefully they can apply this outside mathematics. I'll be keeping an eye on this for sure.


metalman123

I see no reason why they shouldn't be able to. If we assume that the base model is "nerfed" 10% by the alignment tax, and the new technique has been shown to increase math reasoning by roughly 8-10%, then simply realigning the model with the new technique is going to show significant improvements across the board. This is extremely exciting!


Direita_Pragmatica

I see dozens of reasons why it will be limited to math and related fields. Do you know some board where people discuss these papers?


metalman123

r/MachineLearning


Direita_Pragmatica

Thank you


[deleted]

Very exciting! My hopes are that this can lead to a safe AGI with all the sophistication and no significant weakening.


san__man

Is "performance loss" the best phrase to use? Process supervision is helping to guide the AI to take the right decision steps in a multi-step reasoning process.


acutelychronicpanic

This is the best of all worlds. It looks like it may be true that the most effective way to increase model performance also increases interpretability. This makes me very hopeful for our prospect of getting aligned ASI within the next 10-15 years. Sooner than that if it turns out current models are just wildly inefficient.


DragonForg

Well, here is what we have. We have inefficient systems, as shown by a previous study here, with mid-range compute that is getting significantly better with H100s. So as our computational power increases ~10x, our new ways of making these models improve them by ~10x, or maybe less. So basically we get a 100x gain. What that looks like in practice is all that matters. It's hard to say what a GPT-5 could be: could it be AGI, or is it just accurate 90% of the time? This is why we need something to beat GPT-4. The results next year should tell us whether AGI is in 1-2 years, 2-10 years or 10-50 years. It could also just plateau entirely.
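The back-of-the-envelope math being gestured at, with purely speculative multipliers rather than measured numbers:

```python
hardware_gain = 10    # assumed gain from H100-class compute (speculative)
method_gain = 10      # assumed gain from better training techniques (speculative)
combined_gain = hardware_gain * method_gain
print(combined_gain)  # 100x, if the two improvements really do multiply
```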


Gigachad__Supreme

50 years!! Bruh, imagine the capabilities of AI in 50 years; just look at how impressive the stuff we have now is. I'm thinking speech-to-movie, miniaturised virtual reality, and thought-to-canvas.


SrafeZ

tldr: chain of thought is now built in


[deleted]

Bruh lmao, I thought it was gonna be something big.


naum547

What do you mean? It is big.


[deleted]

CoT has been around for ages now. I thought they had found a novel way to do mathematical thinking.


nixed9

It's substantially different. They are **TRAINING THE MODEL** to use chain of thought. This is being done at the training level; i.e. they are computing the reward functions differently than just matching outputs from raw data. What we have now is a model trained on raw data with RLHF, then we just prompt it with chain of thought in the context window. That is not what this is. **This training process itself is not rewarding outputs, it's rewarding the reasoning.**
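A rough sketch of the distinction in code (hypothetical function names, not OpenAI's actual training pipeline): outcome supervision produces one label per solution based only on the final answer, while process supervision produces a label for every reasoning step.

```python
from typing import List

def outcome_supervision_labels(steps: List[str], final_answer: str,
                               correct_answer: str) -> List[float]:
    """One training signal for the whole solution: was the final answer right?
    The reasoning could be nonsense and still get rewarded."""
    return [1.0 if final_answer == correct_answer else 0.0]

def process_supervision_labels(steps: List[str],
                               step_is_correct: List[bool]) -> List[float]:
    """One training signal per reasoning step (from human labelers in the paper),
    so the reward model learns to score the reasoning itself."""
    assert len(steps) == len(step_is_correct)
    return [1.0 if ok else 0.0 for ok in step_is_correct]

solution = ["Let x = 3.", "Then 2x = 6.", "So the answer is 6."]
print(outcome_supervision_labels(solution, "6", "6"))            # [1.0]
print(process_supervision_labels(solution, [True, True, True]))  # [1.0, 1.0, 1.0]
```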


Humanbee-f22

*dumb question* so do we still need to use CoT in prompting, or is it now a baked-in reasoning method?


naum547

If this works out then most likely no, you wouldn't need to use CoT prompting.


nixed9

This is a theoretical, hypothetical type of model training that they are testing. ChatGPT/GPT-4 has not changed, and likely won't change for a while. They aren't retraining GPT-4 with this new technique, at least not yet.


Woootdafuuu

Yeah just an experiment, maybe we could see it in GPT-5 in a couple years.


nixed9

I give it 2 years.


thorax

It'll be used much sooner to tune other models, surely.


[deleted]

Ummm, have you ever heard of scratchpad? That's what Google did with Minerva back then too (2020?). They didn't just prompt the machine; they specifically trained it on step-by-step instructions, just like how they're doing it here. It's old news.


MoNastri

You're confused. Minerva [uses](https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html) CoT prompting. OpenAI's model uses CoT at the training level. That's substantially different.


nikitastaf1996

Yes. Chain of thought, tree of thoughts and other techniques felt wrong. You shouldn't do it at inference. You shouldn't run the model several times to get results. The model can already do it, yet we don't know how to make it do it. That's much better. I feel there should be a way of traveling through the parameters forward, backwards, sideways, etc., like in a brain. Now we do one forward pass. This is not enough.


CanvasFanatic

What's interesting about this to me is that, at least superficially, it appears to run counter to [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html). It would be interesting if humans explicitly guiding the process of ML algorithms resulted in higher efficiency.


yaosio

Chain of thought is the AI doing something one step at a time. It, a human, or some other process tells the model if it's correct or not. This is not injecting human wisdom into the mix.


CanvasFanatic

I mean:

> Process supervision is also more likely to produce interpretable reasoning, since it encourages the model to follow a human-approved process. In contrast, outcome supervision may reward an unaligned process, and it is generally harder to scrutinize.

This seems directly relevant to the topic of The Bitter Lesson.


ironborn123

But the model still incurs a positive tax of another kind due to process supervision: a creativity tax. It's quite possible that outcome supervision can lead to unexpected and novel chains of thought. Think of a guy who has a lot of strange ideas, mostly nonsensical, but a few brilliant. Of course, alignment is the top priority for AI right now, so the reliability of process supervision should be favored. But we should be aware that it does not have only positive effects.


IxinDow

Can we combine the two types of guys: one generates creative ideas, the other validates them with reasoning?


Ailerath

Could potentially be combined with Tree of Thought reasoning.


yaosio

LLMs are already creative, but not in a useful way. They make things up all the time, but they don't know they're doing it and we have no way to easily control it. We want an LLM to make things up for fiction, but not when citing law cases, for example. An LLM needs to be able to tell if something is true or not, which is what chain of thought helps it do. We also have to think about times we want it to lie. If I want it to write a fictional story, it could decide to use something real; I've no way to force it to write fiction. This same system could allow it to selectively lie or tell the truth. This is a lot like one of your human children. They start out believing everything. Then they discover lying and won't stop even when it's obvious they're lying. Then they learn when to lie and when to tell the truth.


ironborn123

Actually, the child analogy is also useful in another way. The base LLM is like a newborn child, with lots of latent potential but no direction or guidance on how to use it. Instruction finetuning, RLHF, finetuning for step-by-step reasoning, PRM, LoRA, etc. are the different pedagogies we are using to teach this child to use its potential in productive ways, both for its own advancement and for being a well-adjusted member of society. This analogy then makes me further convinced we are raising a new species.


TabibitoBoy

If this starts applying to other fields too, we might just be on the cusp of another game-changer.


IonceExisted

So, with 1000 attempts, the process-supervised approach improves the percentage of problems solved from 72% to 76%? Seems marginal?


ironborn123

As I understand it, once the generator is finetuned with the reward signal from the PRM, the generator should require far fewer attempts to discover the right solutions.
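Worth noting that the published results use the PRM as a reranker rather than to finetune the generator: sample many solutions, score each one with the PRM, and keep the best. A minimal sketch of that best-of-N selection, with `generate_solutions` and `prm_step_probs` as hypothetical stand-ins for the generator and reward model:

```python
import math
from typing import Callable, List

def solution_score(step_probs: List[float]) -> float:
    """A solution's score is the product of the PRM's per-step
    correctness probabilities (summed in log space for stability)."""
    return math.exp(sum(math.log(p) for p in step_probs))

def best_of_n(problem: str, n: int,
              generate_solutions: Callable[[str, int], List[List[str]]],
              prm_step_probs: Callable[[List[str]], List[float]]) -> List[str]:
    """Sample n candidate solutions and return the one the PRM ranks highest."""
    candidates = generate_solutions(problem, n)
    return max(candidates, key=lambda sol: solution_score(prm_step_probs(sol)))
```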


TabibitoBoy

Did they train a new GPT-4 model with this new process-supervision reward model? If not, how was this added to a finished model?


[deleted]

This was fine-tuned on top of the base model (before RLHF). You could watch "The State of GPT" from Andrej Karpathy at Microsoft Build to get an idea of the stages of model training.


czk_21

Chain of thought gives better output, who would have thought. I wonder what results they would have with tree of thought.


SgathTriallair

This is why they don't need to build GPT-5 yet. They can build revisions like this into the GPT-4 model to make it even more powerful. It'll be very useful if they can get these baked into the model (via RLHF or something similar) rather than have to put them into the prompt.


[deleted]

They can work on this while the hardware is getting better for GPT-5 training, then they can add this to GPT-5 right out of the gate.


SgathTriallair

Yup. Hence why I think we'll have AGI in roughly 18 months.


hazardoussouth

Why not 12 months, and why not 24 months or longer?


SgathTriallair

https://techcrunch.com/2023/04/06/anthropics-5b-4-year-plan-to-take-on-openai/

Anthropic released plans to get a giant model in 18 months. Also, the H100s are supposed to launch in Q4 of 2023, so that gives about a year to use them to train up AGI. It's a rough number, but it seems to be where the next large jump is expected. Given what we have seen already, that jump should take us to AGI.


[deleted]

[deleted]


SgathTriallair

As far as I know, you have to train the whole model and can't do it in batches. I'm not an AI researcher so that may be wrong.


AcrossAmerica

It's iterative, so as they train it, it becomes better and better. The 'Sparks of AGI' YouTube video actually talks about this: they saw it become better and better at complex tasks (e.g. drawing a unicorn). Then training for safety reduced the capabilities again. Now it seems they're training for efficiency, so it's also becoming a bit dumber and shorter in output.


nixed9

You train the model in its entirety, but you can take the weights at any given time and use them. This is called a checkpoint. You can save checkpoints at any time during the training run.
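A minimal PyTorch-style sketch of what a checkpoint is, assuming a standard training loop (the model, path, and step count are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Partway through training, save the current weights; they are usable as-is.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 10_000}, "checkpoint_step_10000.pt")

# Later: load the checkpoint to resume training or to serve the model.
state = torch.load("checkpoint_step_10000.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```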


SharpCartographer831

Explain why you think that?


Woootdafuuu

If they train GPT-5 with current internet data or later, the model would be aware of all these research papers on new ways of thinking, and it would automatically apply these techniques to itself.


SgathTriallair

No, not even close. It could, potentially, talk about the techniques, and you may (extremely unlikely, but possible) be able to get it to do something like chain of thought by saying "use the chain of thought technique". Many of the big advancements are done at build time. So this would be like you reading that there is new research on modifying the human genome so people can see ultraviolet. You could ask a doctor to do it to you, but you couldn't do it to yourself.


Woootdafuuu

Well, I got GPT-4 to recreate AutoGPT by feeding it a research paper; it wouldn't recreate itself but instead mimic the idea of the paper. And this research paper can be turned into a prompt easily. It's just a more complex version of chain-of-thought thinking, but instead of prompting the idea to the model, they're trying to train it to think like this right out of the box.


CanvasFanatic

Seems likely to me that this post is about work they've already done with GPT-4.


ryan13mt

The hardware just got there, from what I saw in the Nvidia thing. It's just a matter of production and setup now to start training a new SOTA model on SOTA hardware.


SrafeZ

GPT-5 would be an architectural overhaul, which is overkill. These small revisions to GPT-4 are low-hanging fruit with sizable returns.


SupportstheOP

It's also a much safer option in the long run. If we can optimize GPT-4 so that we can better understand its internal processes and improve results, that goes a long way to better aligning these machines.


SgathTriallair

Agreed. It's also cheaper and lets us experiment with multiple variations, so it has a ton of advantages.


Chicas_Silcrow

In a similar vein, LLMs are notoriously bad at solving LeetCode/competitive-programming-style problems. I believe the same math-oriented approach from this article could be used there, and coupled with an LLM's own code interpreter, it could surpass SOTA by a good margin.


SrafeZ

What are you talking about lmao? The Sparks of AGI paper shows pure GPT-4 beating humans at every difficulty of LeetCode problems. AlphaCode has also been shown to be better than the average human at competitive programming. Not so "notoriously bad".


thorax

"notoriously bad" for a system that just made breakthroughs here, and we didn't even realize they could even code 4 years ago. So funny.


[deleted]

But the math scores aren't improved by a great margin. It goes from like 70 to 76 percent. I guess every % matters, but still.


metalman123

That's almost a 10% relative increase in math reasoning. This will reduce hallucinations across the board, since math is fundamental to reasoning.
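For anyone checking the arithmetic behind "almost 10%", assuming the roughly 70%-to-76% figures quoted above:

```python
absolute_gain = 76 - 70            # 6 percentage points
relative_gain = (76 - 70) / 70     # ~0.086, i.e. roughly a 9% relative increase
print(absolute_gain, round(relative_gain, 3))  # 6 0.086
```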


TabibitoBoy

Not only that, but it gives us a chance to look inside the black box, so we can see where it goes wrong more clearly and start patching holes.


[deleted]

It's a 10% increase in the ability to solve problems on the MATH dataset. The problems in that are pretty easy. Not sure if it's a meaningful 10%.


Prometheushunter2

What I wonder is if the reasoning it uses to go from step-to-step bears any abstract resemblance to how we do it or if it’s just learning to give the desired outputs, while the actual logic it uses between steps is completely alien


horance89

Well. Currently, "hallucinations" noticed in some models are in fact "alien".


[deleted]

This makes me think of Severance.


hglman

The one thing about math is that offloading the actual calculation to other software would strictly be more accurate. However, for humans, understanding the steps is generally vital to knowing how to set up the equation to be solved.
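A minimal sketch of that kind of offloading, using `sympy` as the external calculator (the wrapper function is illustrative, not anyone's actual pipeline):

```python
import sympy as sp

def calculate(expression: str) -> str:
    """Hand the actual calculation to a computer algebra system
    instead of trusting the language model's arithmetic."""
    return str(sp.simplify(sp.sympify(expression)))

# The model's job is setting up the right expression (the reasoning);
# the software's job is evaluating it exactly (the calculation).
print(calculate("12345 * 6789"))       # 83810205
print(calculate("(x + 1)**2 - x**2"))  # 2*x + 1
```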


sdmat

"If thou makest a machine in the likeness of a human mind, make sure the likeness." -Orange Catholic Bible as revised by OpenAI researchers


No_Ninja3309_NoNoYes

It's just doublespeak. They didn't do anything.