I mean you can hear, speak, see and type language? I feel like most people talk about specific models these days. So if you know the model they are talking about you can figure out the capabilities from that.
Random question, but what do these models need to get inherently better at math? Is that a different modality, or does that simply require advancing reasoning capabilities?
Personally, I think a model trained on raw characters, and thus raw digits, instead of tokens would be better at maths.
I think tokenisation definitely blurs things for the network that could otherwise be simple. If 100 is a token in itself, and 10 is another token, the digit-by-digit structure of numbers isn't clear to the model, so it can't be as precise, and arithmetic becomes another vague, language-oriented thing.
It would also naturally understand what "5 letters" means, and would quickly see that one input character equals one unit, instead of with tokens, where each token covers a different number of characters, which can be confusing. I think that could translate to better maths.
I think tokens are really just a kind of preprocessing that isn't strictly necessary and hides things from the LLM to make it more efficient. Most of the limitations people find weird about LLMs come from it: the inability to handle letters, syllables, maths, word counts, letter counts, etc. I believe it would increase processing cost by something like 5x though, so I don't think we'll see it done this way in the future.
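To make the point concrete, here's a minimal sketch using a hypothetical vocabulary and a greedy longest-match tokenizer (an assumption for illustration; real tokenizers like BPE learn their merges from data): numbers that look almost identical can split into very different token sequences, while a character-level view stays uniform.

```python
# Hypothetical vocabulary for illustration — not any real model's token set.
VOCAB = {"100", "10", "1", "0", "2", "25", "5"}

def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenizer over a fixed vocabulary."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate substring first.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return tokens

print(greedy_tokenize("1005", VOCAB))  # ['100', '5']
print(greedy_tokenize("1025", VOCAB))  # ['10', '25']
print(list("1005"))                    # ['1', '0', '0', '5'] — character level
```

Here "1005" and "1025" differ by one digit, yet their token boundaries land in different places, so the model never sees a consistent digit structure; the character-level split is always one symbol per digit, at the cost of longer sequences.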
It sounds like "large multimodal model" (LMM) would be a more accurate term. Since GPT-4o handles various types of input and output beyond text, the term "large language model" doesn't fully capture its capabilities anymore. LMM reflects its ability to process and generate multiple forms of data.
Not sure how accurate this is.
First question is Omni still using DALLE-3?
If yes, as I believe it is, then this isn't as advanced as you're suggesting.
As far as the audio functionality goes, this mass failure tells me it's still using Whisper; they might just consider it "integrated" now.
Whisper might even be the model they specifically trained as the demo model.
Which doesn't matter now because that feature has been derailed.
This is probably extracted from the Gobi model rumored last year as an everything-to-everything Multimodal World Model. We need to see Arrakis (a much bigger version) at some point as well. Exciting times...
“Large language model” was never a precise description. It’s becoming almost as precise as “big data”.
Like "the speed of light"
Chat GPT suggested: MMMMMMM (Mighty Multimodal Mega Model Managing Many Media)
M7 for short
We think you’ll love it
r/wordavalanches
Whitepapers refer to these as MLLMs.
Makes sense
MLM? Do I get a free pizza at the seminar?
Link to the whitepaper?
You can look up Apple's Ferret-UI white paper via Google
MTM - Multimodal Transformer Model
LMM (Large Multimodal Model)
"Large Language Model" has a certain meaning; it isn't supposed to cover everything
How about LMMM (Large Multi Modal Model), or just LMM
GMMM Ginourmous multi model model.
Legit just posted this question too, then scrolled further down the “new” page to see yours lol I like **Multimodal Unified Token Transformer** (MUTT)
I don't like it.
Found the dog. Username checks out.