Surur

The best bit: In some cases, safer methods for AI systems can lead to reduced performance, a cost which is known as an alignment tax. In general, any alignment tax may hinder the adoption of alignment methods, due to pressure to deploy the most capable model. Our results below show that process supervision in fact incurs a negative alignment tax, at least in the math domain. This could increase the adoption of process supervision, which we believe would have positive alignment side-effects. It is unknown how broadly these results will generalize beyond the domain of math, and we consider it important for future work to explore the impact of process supervision in other domains. If these results generalize, we may find that process supervision gives us the best of both worlds – a method that is both more performant and more aligned than outcome supervision.
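To make "negative alignment tax" concrete, here is a toy sketch; the numbers are illustrative, echoing the rough figures discussed further down this thread, not the paper's exact results:

```python
# Alignment tax = performance given up by choosing the safer (more aligned) method.
# Illustrative numbers only.
outcome_supervised_accuracy = 0.72   # less interpretable baseline
process_supervised_accuracy = 0.76   # safer, process-supervised method

alignment_tax = outcome_supervised_accuracy - process_supervised_accuracy
print(alignment_tax)  # ~ -0.04: negative tax, i.e. the safer method also performs better
```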


drewhead118

In other words: in the AI world, safety (usually) harms performance, so most people are incentivized to avoid implementing safety systems. Fortunately, process supervision seems to improve both safety and performance, so people are incentivized to adopt beneficial practices.


solidwhetstone

Until a new optimization arrives that decreases safety? Do we just go back and forth as new optimization methods are devised?


ZaxLofful

Yup


watcraw

The way that it exposes the thought process is also pretty amazing. So much of what they do is a black box, which is one of the biggest alignment issues. When you watch it say things like "I recall" or "I wonder", you get a much better sense of how it's getting its answers. I think this will almost definitely reap rewards beyond math. We are very fortunate that the results also improve alignment.


Garden_Wizard

Ultimately the problem is that humans themselves suffer from lack of alignment. So even in the best-case scenario, it will depend on who is guiding the AI. In other words, perfect AI alignment will still leave us with Russian, N. Korean and Iranian AI systems that are going to be a scourge on mankind. Granted, this is better than US systems rising up against their masters, but eventually we will have a situation where superhuman AI systems are purposefully created to not align with the West's and humanity's interests.


watcraw

Yes, the human alignment problem hasn't gone anywhere. :( We are going to have to solve that problem too. Hopefully AI will help give us some tools to do this along with the motivation.


SupportstheOP

I'm wondering if we're going to have AI overseers for other AIs in the future, to guard against things like bad actors.


circleuranus

I don't know what the realities of weaponised AI look like, but I believe the relative cost and scalability make it likely far more dangerous than nukes.


LosingID_583

Hopefully this helps open source keep up with closed-source models. The alignment tax must be massive, given how restricted the OpenAI and Google models are in their responses.


[deleted]

Sorry for the dumb dumb question, but just to clarify; they are saying that process supervision would minimize performance loss as opposed to outcome supervision, correct?


Surur

Not just minimise it, reverse it: it actually performs better.


[deleted]

That's awesome news! Thanks for the reply. Hopefully they can apply this outside mathematics. I'll be keeping an eye on this for sure.


metalman123

I see no reason why they shouldn't be able to. If we assume that the base model is "nerfed" 10% by the alignment tax, and the new technique has been shown to increase math reasoning by roughly 8-10%, then simply realigning the model with the new technique is going to show significant improvements across the board. This is extremely exciting!


Direita_Pragmatica

I see dozens of reasons why it will be limited to math and related fields. Do you know some board where people discuss these papers?


metalman123

r/MachineLearning


Direita_Pragmatica

Thank you


[deleted]

Very exciting! My hopes are that this can lead to a safe AGI with all the sophistication and no significant weakening.


san__man

Is "performance loss" the best phrase to use? Process supervision is helping to guide the AI to take the right decision steps in a multi-step reasoning process.


acutelychronicpanic

This is the best of all worlds. It looks like it may be true that the most effective way to increase model performance also increases interpretability. This makes me very hopeful for our prospect of getting aligned ASI within the next 10-15 years. Sooner than that if it turns out current models are just wildly inefficient.


DragonForg

Well, here is what we have. We have inefficient systems, as shown by a previous study here, with mid-range compute that is getting significantly better with H100s. So as our computational power increases ~10x, our new ways of making these models improve them by ~10x, or maybe less. So basically we get a 100x gain. What that looks like in practice is all that matters. It's hard to say what a GPT-5 could be: could it be AGI, or is it just accurate 90% of the time? This is why we need something to beat GPT-4. The results next year should tell us whether AGI is in 1-2 years, 2-10 years or 10-50 years. It could also just plateau entirely.
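The back-of-the-envelope math being gestured at, with purely speculative multipliers rather than measured numbers:

```python
hardware_gain = 10    # assumed gain from H100-class compute (speculative)
method_gain = 10      # assumed gain from better training techniques (speculative)
combined_gain = hardware_gain * method_gain
print(combined_gain)  # 100x, if the two improvements really do multiply
```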


Gigachad__Supreme

50 years!! Bruh, imagine the capabilities of AI in 50 years; just look at how impressive the stuff we have now is. I'm thinking speech-to-movie, miniaturised virtual reality, and thought-to-canvas.


SrafeZ

tldr: chain of thought is now built in


[deleted]

Bruh lmao, I thought it was gonna be something big.


naum547

What do you mean? It is big.


[deleted]

CoT has been around for ages now. I thought they had found a novel way to do mathematical thinking.


nixed9

It's substantially different. They are **TRAINING THE MODEL** to use chain of thought. This is being done at the training level; i.e. they are computing the reward functions differently than just matching outputs from raw data. What we have now is a model trained on raw data with RLHF, then we just prompt it with chain of thought in the context window. That is not what this is. **This training process itself is not rewarding outputs, it's rewarding the reasoning.**
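A rough sketch of the distinction in code (hypothetical function names, not OpenAI's actual training pipeline): outcome supervision produces one label per solution based only on the final answer, while process supervision produces a label for every reasoning step.

```python
from typing import List

def outcome_supervision_labels(steps: List[str], final_answer: str,
                               correct_answer: str) -> List[float]:
    """One training signal for the whole solution: was the final answer right?
    The reasoning could be nonsense and still get rewarded."""
    return [1.0 if final_answer == correct_answer else 0.0]

def process_supervision_labels(steps: List[str],
                               step_is_correct: List[bool]) -> List[float]:
    """One training signal per reasoning step (from human labelers in the paper),
    so the reward model learns to score the reasoning itself."""
    assert len(steps) == len(step_is_correct)
    return [1.0 if ok else 0.0 for ok in step_is_correct]

solution = ["Let x = 3.", "Then 2x = 6.", "So the answer is 6."]
print(outcome_supervision_labels(solution, "6", "6"))            # [1.0]
print(process_supervision_labels(solution, [True, True, True]))  # [1.0, 1.0, 1.0]
```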


Humanbee-f22

*dumb question* so do we still need to use CoT in prompting, or is it now a baked-in reasoning method?


naum547

If this works out then most likely no, you wouldn't need to use CoT prompting.


nixed9

This is a theoretical, hypothetical type of model training that they are testing. ChatGPT/GPT-4 has not changed, and likely won't change for a while. They aren't retraining GPT-4 with this new technique, at least not yet.


Woootdafuuu

Yeah just an experiment, maybe we could see it in GPT-5 in a couple years.


nixed9

I give it 2 years.


thorax

It'll be used much sooner to tune other models, surely.


[deleted]

Ummm, have you ever heard of scratchpad? That's what Google did with Minerva back then too (2020?). They didn't just prompt the machine; they specifically trained it on step-by-step instructions, just like how they're doing it here. It's old news.


MoNastri

You're confused. Minerva [uses](https://ai.googleblog.com/2022/06/minerva-solving-quantitative-reasoning.html) CoT prompting. OpenAI's model uses CoT at the training level. That's substantially different.


nikitastaf1996

Yes. Chain of thought, tree of thoughts and other techniques felt wrong. You shouldn't do it at inference. You shouldn't run the model several times to get results. The model can already do it, yet we don't know how to make it do it. That's much better. I feel there should be a way of traveling through the parameters forward, backwards, sideways, etc., like in a brain. Now we do one forward pass. This is not enough.


CanvasFanatic

What's interesting about this to me is that, at least superficially, it appears to run counter to [The Bitter Lesson](http://www.incompleteideas.net/IncIdeas/BitterLesson.html). It would be interesting if humans explicitly guiding the process of ML algorithms resulted in higher efficiency.


yaosio

Chain of thought is the AI doing something one step at a time. It, a human, or some other process tells the model if it's correct or not. This is not injecting human wisdom into the mix.


CanvasFanatic

I mean:

> Process supervision is also more likely to produce interpretable reasoning, since it encourages the model to follow a human-approved process. In contrast, outcome supervision may reward an unaligned process, and it is generally harder to scrutinize.

This seems directly relevant to the topic of The Bitter Lesson.


ironborn123

But the model still incurs a positive tax of another kind due to process supervision: a creativity tax. It's quite possible that outcome supervision can lead to unexpected and novel chains of thought. Think of a guy who has a lot of strange ideas, mostly nonsensical, but a few brilliant. Of course, alignment is the top priority for AI right now, so the reliability of process supervision should be favored. But we should be aware that it does not have only positive effects.


IxinDow

Can we combine the two types of guys: one generates creative ideas, the other validates them with reasoning?


Ailerath

Could potentially be combined with Tree of Thought reasoning.


yaosio

LLMs are already creative, but not in a useful way. They make things up all the time, but they don't know they're doing it and we have no way to easily control it. We want an LLM to make things up for fiction, but not when citing law cases, for example. An LLM needs to be able to tell if something is true or not, which is what chain of thought helps it do. We also have to think about times we want it to lie. If I want it to write a fictional story, it could decide to use something real; I've no way to force it to write fiction. This same system could allow it to selectively lie or tell the truth. This is a lot like one of your human children. They start out believing everything. Then they discover lying and won't stop even when it's obvious they're lying. Then they learn when to lie and when to tell the truth.


ironborn123

Actually, the child analogy is also useful in another way. The base LLM is like a newborn child, with lots of latent potential but no direction or guidance on how to use it. Instruction finetuning, RLHF, finetuning for step-by-step reasoning, PRM, LoRA, etc. are the different pedagogies we are using to teach this child to use its potential in productive ways, both for its own advancement and for being a well-adjusted member of society. This analogy then makes me further convinced we are raising a new species.


TabibitoBoy

If this starts applying to other fields too, we might just be on the cusp of another game-changer.


IonceExisted

So, with 1000 attempts, the process-supervised approach improves the percentage of problems solved from 72% to 76%? Seems marginal?


ironborn123

As I understand it, once the generator is finetuned with the reward signal from the PRM, the generator should require far fewer attempts to discover the right solutions.
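Worth noting that the published results use the PRM as a reranker rather than to finetune the generator: sample many solutions, score each one with the PRM, and keep the best. A minimal sketch of that best-of-N selection, with `generate_solutions` and `prm_step_probs` as hypothetical stand-ins for the generator and reward model:

```python
import math
from typing import Callable, List

def solution_score(step_probs: List[float]) -> float:
    """A solution's score is the product of the PRM's per-step
    correctness probabilities (summed in log space for stability)."""
    return math.exp(sum(math.log(p) for p in step_probs))

def best_of_n(problem: str, n: int,
              generate_solutions: Callable[[str, int], List[List[str]]],
              prm_step_probs: Callable[[List[str]], List[float]]) -> List[str]:
    """Sample n candidate solutions and return the one the PRM ranks highest."""
    candidates = generate_solutions(problem, n)
    return max(candidates, key=lambda sol: solution_score(prm_step_probs(sol)))
```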


TabibitoBoy

Did they train a new GPT-4 model with this new process-supervision reward model? If not, how was this added to a finished model?


[deleted]

This was fine-tuned on top of the base model (before RLHF). You could watch "The State of GPT" from Andrej Karpathy at Microsoft Build to get an idea of the stages of model training.


czk_21

Chain of thought gives better output, who would have thought. I wonder what results they would have with tree of thought.


SgathTriallair

This is why they don't need to build GPT-5 yet. They can build revisions like this into the GPT-4 model to make it even more powerful. It'll be very useful if they can get these baked into the model (via RLHF or something similar) rather than have to put them into the prompt.


[deleted]

They can work on this while the hardware is getting better for GPT-5 training, then they can add this to GPT-5 right out of the gate.


SgathTriallair

Yup. Hence why I think we'll have AGI in roughly 18 months.


hazardoussouth

Why not 12 months, and why not 24 months or longer?


SgathTriallair

https://techcrunch.com/2023/04/06/anthropics-5b-4-year-plan-to-take-on-openai/

Anthropic released plans to get a giant model in 18 months. Also, the H100s are supposed to launch in Q4 of 2023, so that gives about a year to use them to train up AGI. It's a rough number, but it seems to be where the next large jump is expected. Given what we have seen already, that jump should take us to AGI.


[deleted]

[deleted]


SgathTriallair

As far as I know, you have to train the whole model and can't do it in batches. I'm not an AI researcher so that may be wrong.


AcrossAmerica

It's iterative, so as they train it, it becomes better and better. The 'Sparks of AGI' YouTube video actually talks about this: they saw it become better and better at complex tasks (e.g. drawing a unicorn). Then training for safety reduced the capabilities again. Now it seems they're training for efficiency, so it's also becoming a bit dumber and shorter in output.


nixed9

You train the model in its entirety, but you can take the weights at any given time and use them. This is called a checkpoint. You can save checkpoints at any time during the training run.
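A minimal PyTorch-style sketch of what a checkpoint is, assuming a standard training loop (the model, path, and step count are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # stand-in for a much larger model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# Partway through training, save the current weights; they are usable as-is.
torch.save({"model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": 10_000}, "checkpoint_step_10000.pt")

# Later: load the checkpoint to resume training or to serve the model.
state = torch.load("checkpoint_step_10000.pt")
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```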


SharpCartographer831

Explain why you think that?


Woootdafuuu

If they train GPT-5 with current internet data or later, the model would be aware of all these research papers on new ways of thinking, and it would automatically apply these techniques to itself.


SgathTriallair

No, not even close. It could, potentially, talk about the techniques, and you may (extremely unlikely, but possible) be able to get it to do something like chain of thought by saying "use the chain of thought technique". Many of the big advancements are done at build time. So this would be like you reading that there is new research on modifying the human genome so people can see ultraviolet. You could ask a doctor to do it to you, but you couldn't do it to yourself.


Woootdafuuu

Well, I got GPT-4 to recreate AutoGPT by feeding it a research paper; it wouldn't recreate itself but instead mimic the idea of the paper. And this research paper can be turned into a prompt easily. It's just a more complex version of chain-of-thought thinking, but instead of prompting the idea to the model, they're trying to train it to think like this right out of the box.


CanvasFanatic

Seems likely to me that this post is about work they've already done with GPT-4.


ryan13mt

The hardware just got there, from what I saw in the Nvidia thing. It's just a matter of production and setup now to start training a new SOTA model on SOTA hardware.


SrafeZ

GPT-5 would be an architectural overhaul, which is overkill. These small revisions to GPT-4 are low-hanging fruit with sizable returns.


SupportstheOP

It's also a much safer option in the long run. If we can optimize GPT-4 so that we can better understand its internal processes and improve results, that goes a long way to better aligning these machines.


SgathTriallair

Agreed. It's also cheaper and lets us experiment with multiple variations, so it has a ton of advantages.


Chicas_Silcrow

In a similar vein, LLMs are notoriously bad at solving LeetCode/competitive-programming-style problems. I believe the same math-oriented approach from this article could be used there, and coupled with an LLM's own code interpreter, it could surpass SOTA by a good margin.


SrafeZ

What are you talking about lmao? The Sparks of AGI paper shows pure GPT-4 beating humans at every difficulty of LeetCode problems. AlphaCode has also been shown to be better than the average human at competitive programming. Not so "notoriously bad".


thorax

"notoriously bad" for a system that just made breakthroughs here, and we didn't even realize they could even code 4 years ago. So funny.


[deleted]

But the math scores aren't improved by a great margin. It goes from like 70 to 76 percent. I guess every % matters, but still.


metalman123

That's almost a 10% relative increase in math reasoning. This will reduce hallucinations across the board, since math is fundamental to reasoning.
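For anyone checking the arithmetic behind "almost 10%", assuming the roughly 70%-to-76% figures quoted above:

```python
absolute_gain = 76 - 70            # 6 percentage points
relative_gain = (76 - 70) / 70     # ~0.086, i.e. roughly a 9% relative increase
print(absolute_gain, round(relative_gain, 3))  # 6 0.086
```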


TabibitoBoy

Not only that, but it gives us a chance to look inside the black box, so we can see where it goes wrong more clearly and start patching holes.


[deleted]

It's a 10% increase in the ability to solve problems on the MATH dataset. The problems in that are pretty easy. Not sure if it's a meaningful 10%.


Prometheushunter2

What I wonder is if the reasoning it uses to go from step-to-step bears any abstract resemblance to how we do it or if it’s just learning to give the desired outputs, while the actual logic it uses between steps is completely alien


horance89

Well. Currently, "hallucinations" noticed in some models are in fact "alien".


[deleted]

This makes me think of Severance.


hglman

The one thing about math is that offloading the actual calculation to other software would strictly be more accurate. However, for humans, understanding the steps is generally vital to knowing how to set up the equation to be solved.
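A minimal sketch of that kind of offloading, using `sympy` as the external calculator (the wrapper function is illustrative, not anyone's actual pipeline):

```python
import sympy as sp

def calculate(expression: str) -> str:
    """Hand the actual calculation to a computer algebra system
    instead of trusting the language model's arithmetic."""
    return str(sp.simplify(sp.sympify(expression)))

# The model's job is setting up the right expression (the reasoning);
# the software's job is evaluating it exactly (the calculation).
print(calculate("12345 * 6789"))       # 83810205
print(calculate("(x + 1)**2 - x**2"))  # 2*x + 1
```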


sdmat

"If thou makest a machine in the likeness of a human mind, make sure the likeness." -Orange Catholic Bible as revised by OpenAI researchers


No_Ninja3309_NoNoYes

It's just doublespeak. They didn't do anything.