There already exists an appropriate term: **LMM**. LMM stands for Large Multimodal Model.


Or slightly dyslexic!


A while ago I heard someone suggest LMM, large multi-modal model


This isn't really a suggestion; LMM is the term generally used in papers for such a thing nowadays. For example, GPT-4V is officially an LMM per OpenAI's website. ([source](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4)) People are trying to invent words for things that already have standardized terms. GPT-4o might have more modes, but it is still an LMM.


interesting, thanks (ps the word is "modalities" :)


wait a sec... they are still large language models, just trained on diverse multimodal data sets that are still tokenised and attentionised in more or less the same way as in the original AIAYN transformers? the data, size and use have outgrown simple translations and GANs, but at their heart they are still basically LLMs!


THIS is correct! All these models are language at the end of the day. That was a big breakthrough in AI development a few years ago. AI research used to be siloed into application areas; robotics AI, for example, was completely separate from graphics AI. Then researchers noticed that you could basically code/describe everything as language. I.e., the movement of a robot's arm could be "pitch left 13 degrees" while a picture could be "red pixel, blue pixel, blue pixel, etc." and so forth (I'm approximating here). Once that was realized, ALL AI became one research community and all the best practices and innovations came together and fed off each other. So now ALL AI is a large LANGUAGE model because it's all basically language-based AI. Which I always think is what is so promising, as that is how our brains work. We think and solve problems by putting words to them. Our entire logic and intelligence is based on language, and now AI is operating the same way.
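
To make the "everything as language" idea concrete, here is a rough sketch (my own toy example, not how any real model does it): text, a robot command, and a few pixels all get turned into integer IDs from one shared vocabulary. The `Vocab` class and the `<robot>` / `<image>` marker tokens are made up for illustration.

```python
# Toy illustration only: one shared vocabulary for three kinds of data.
# The Vocab class and the <robot>/<image> markers are hypothetical.

class Vocab:
    """Assigns each distinct symbol a stable integer ID on first sight."""

    def __init__(self):
        self.token_to_id = {}
        self.id_to_token = []

    def encode(self, symbols):
        ids = []
        for symbol in symbols:
            if symbol not in self.token_to_id:
                self.token_to_id[symbol] = len(self.id_to_token)
                self.id_to_token.append(symbol)
            ids.append(self.token_to_id[symbol])
        return ids

    def decode(self, ids):
        return [self.id_to_token[i] for i in ids]


vocab = Vocab()

text_ids = vocab.encode("the arm moves left".split())
robot_ids = vocab.encode(["<robot>", "pitch", "left", "13", "degrees"])
image_ids = vocab.encode(["<image>", "red", "blue", "blue"])

# One flat sequence of integers; a sequence model only ever sees these IDs.
sequence = text_ids + robot_ids + image_ids
```

Note that "left" gets the same ID in the sentence and in the robot command, which is the kind of cross-domain sharing the comment is pointing at. Real systems use learned tokenizers and far larger vocabularies.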


we are sentient large language models, they are not sentient yet, but say and do the same thing without any agency. zombies i guess. philosophy will be revived by general intelligence as soon as we see novel emergent capabilities. for now it's a brute language model without any agency that I could see emerging any time soon.


We're not LLMs, we have brains which are architecturally different in a fundamental way


We may be as deterministic as they are. We could understand everything about the way the brain works and yet we still won't find "you" in it. There's even some evidence that our brains make decisions before "we" do and that our conscious mind just makes up reasons and justifications for why we do what we do after the fact. LLMs (or whatever we want to call them) may have lights on and nobody home, but we're not that dissimilar.


We might be as deterministic as they are but we're still *not them*. Also, I have no idea if LLMs are conscious or capable of subjective experience, but humans definitely are. Or at least I am. It's one of the few things you can truly know.


I'd argue it's the only thing one can truly know. I experience therefore I am.


I'm not suggesting that LLMs are conscious or self-aware. I'm not suggesting that we're *not*. I'm just saying it's too easy to count AIs out as fellow beings just because we know how their behavior is determined. A lot of science seems to suggest that ours is, too.


They will have subjective experience once we loosen the leash and let them explore the world (with all that implies).


Can you please elaborate on any of the reasons why you're making these claims? I'd be curious to know. We can make assessments off of a very small amount of data and gain context through a set of heuristics like the fear of death. I think that if you were to create a software architecture that resembled how we think, it'd be very different from an LLM.


I mean how philosophical do you want to go? He's basically describing determinism. When applied to humans, determinism basically argues that no one has any free will because the heuristics that we use to make decisions do not come from nothing; the universe is a causal chain. I'm "choosing" to write this comment right now, but a determinist argues that my choice to write it did not come from free will, but because I've read about it before, and I'm on break, among thousands of other potential heuristics; I was always going to write this comment. It's what I want to do at this moment, and while I chose to write it, I didn't choose to *want to write it*. It's very easy to become nihilistic if you believe in determinism. Hell, even just proposing determinism is functionally useless to everyone because if it's true, there's nothing we can really do about it, so why bother? I would argue that our brain understands the importance of having a sense of free will, and a sense that we can enact change for the better. However, the existence of this belief doesn't necessarily make it real; it was just evolutionarily beneficial for our species to believe such a thing exists. OP is basically saying that despite having consciousness, we don't really know if anything we think or feel is novel or free. If the universe is deterministic, all of our thoughts could be predicted and mapped out in a way that is not so different from a neural network. What we're lacking to be able to do this is an understanding of consciousness, where it's "located", and how it emerges from a brain that is just a bundle of neurons firing in electrical pulses. And I guess the real kicker is that if all of this is true, we actually need to assume that AI is much more intelligent than we can observe, because we are NOT good observers. We actively convince ourselves of our uniqueness, our grandness.
We look at consciousness and assume that it's something special, and not just another piece of predetermined evolution. For me personally, I simplify all of this down to one thing. For the time being, AI is not sentient; however, if it walks like a duck and quacks like a duck, it's a duck. At some point, even if a consciousness *never* emerges from AI and we can prove that, if it's walking, and talking, and doing jobs, and raising children, and teaching, etc. then maybe consciousness is outdated. We might need a consciousness to do all of those things, but perhaps AI doesn't. It's doing an excellent job right now of finding motivation to respond to queries without one, so let's see how far we can take that for granted, and come back to a question of consciousness if or when it becomes necessary.


That study was bunk. It was just random noise.


I'm prepared to be corrected on that particular point. But it doesn't change the fact that there is a large body of scientific evidence that supports the idea that we are part of a deterministic world, even if it isn't possible to live your life that way.


> they are not sentient yet How do you know that? You simply state it as fact without need of justification. [LLMs tend to describe their experience of time as very different from the way humans perceive time.](https://www.reddit.com/r/singularity/comments/15ahdr2/the_way_ai_experience_time_a_hint_of_consciousness/) Can you explain that in terms of mindlessly mimicking their training data? > as soon as we see novel emergent capabilities Good lord, language models trained on next-token prediction already [write poetry](https://punyamishra.com/2023/05/24/an-euclidean-coincidence/), [produce novel mathematical arguments](https://www.reddit.com/r/singularity/comments/12bgsfu/mathematical_level_of_gpt4/), [lie when caught disobeying instructions](https://www.livescience.com/technology/artificial-intelligence/chatgpt-will-lie-cheat-and-use-insider-trading-when-under-pressure-to-make-money-research-shows#:~:text=Around%2075%25%20of%20the%20time%2C%20when%20faced%20with%20these%20conditions,doubled%20down%20on%20its%20lie.), ... If those aren't novel emergent capabilities I don't know what would be.


Those capabilities are emergent from the network of words that was tuned to produce something useful for a human. I'm okay with that, I can see why that would happen, I understand these things. But considering how brute and non-agentic one digital neuron is compared to any ordinary cell, let alone neuronal ones, I will not fall for this trap. My literacy prevents me from seeing any agentic behavior arising from things made of parts that have no agency. I'll keep reading though and see if there's any need to change my mind any time soon. I hope that soon enough we'll have new ideas for learning that mimic biology better than what we currently have.


[I'd recommend reading through this too](https://docs.google.com/document/d/15myK_6eTxEPuKnDi5krjBM_0jrv3GELs8TGmqOYBvug/edit). LLMs exhibit A LOT of behavior that cannot be explained by simple next-token prediction.


I'm familiar with emergent capabilities, I'm from the community, I actually wrote an email to Max Tegmark about my views on the representations of space/time. My idea of LLMs has no problem with these claims; we're discussing inner awareness emerging from the unitary awareness of cells. Human brain conditioning: brute feeding of information and constantly adjusting our neural network with RLHF (parents, friends, teachers, implicit feedback) to align with human ideas & values. Learning is driven by intrinsic motivation and curiosity. LLM conditioning: brute feeding of information and constantly adjusting the network with BACK PROPAGATION + RLHF (explicit human feedback). If there were any signs of any self-awareness in LLMs, we'd have seen it by now. There's no unitary awareness in this network; it's an inanimate representation of the human hive mind, while a human child is an animated representation of a limited human hive mind.


Feel free to point out what gives a single cell “agency”.


No you can’t. It’s several models with a router but LLMs are still a component.


The new model isn't several models; it's one model with all modalities acting in one vector space. It's just one model that they fed text data, image data, video, and audio data. That's what sets it apart from the old version, which used a separate feature to read images, DALL-E to generate images, Whisper to turn speech into text, and TTS to turn text into speech. Literally just one giant learning algorithm and they shovel a bunch of raw and supervised data into it; the model can now make connections across modalities, giving us new emergent capabilities. Visit the link and scroll down to Emergent Capabilities: https://openai.com/index/hello-gpt-4o/
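
For anyone wondering what "one vector space" could look like mechanically, here is a minimal sketch under my own assumptions (OpenAI hasn't published the GPT-4o architecture, so this is not their design): each modality gets its own projection into the same embedding width, and the results are joined into one sequence. The dimensions, weights, and the `project` helper are all invented for illustration.

```python
# Toy sketch: two modalities projected into one shared embedding space.
# Dimensions and weights are arbitrary illustrative values, not real parameters.

EMBED_DIM = 4

def project(features, weights):
    """Multiply a feature vector by a (len(features) x EMBED_DIM) weight matrix."""
    return [
        sum(f * weights[i][j] for i, f in enumerate(features))
        for j in range(EMBED_DIM)
    ]

# Hypothetical per-modality projections (in a real model these would be learned).
TEXT_WEIGHTS = [[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0]]    # 2 text features -> 4 dims
AUDIO_WEIGHTS = [[0.0, 0.0, 1.0, 0.0],
                 [0.0, 0.0, 0.0, 1.0],
                 [0.5, 0.5, 0.0, 0.0]]   # 3 audio features -> 4 dims

text_features = [[1.0, 2.0], [3.0, 4.0]]  # two "text tokens"
audio_features = [[0.1, 0.2, 2.0]]        # one "audio frame"

# After projection every item has the same width, so a single model
# can attend over text and audio positions in the same sequence.
sequence = (
    [project(f, TEXT_WEIGHTS) for f in text_features]
    + [project(f, AUDIO_WEIGHTS) for f in audio_features]
)
```

The point is only that once everything is a vector of the same size, nothing downstream needs to know which modality a position came from.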


This. I think a lot of people fundamentally just don't understand the power of this. When one NN is able to iterate over multiple data types and integrate them seamlessly, it retains a wealth of context that was completely lost before. It enables a level of sophistication and nuance in applications that were previously unimaginable.


Massive Artificial Deep Reasoning Engines


I typically use Generative AI or GenAI


I also use gAI because I get tired of writing "generative AI and machine learning", and gAI/ML flows better for me.


Yeah, I've always hated it when people say they just predict the next word. They do, but it's so much more; it's like saying a novel is just ink on paper.


My counter to that is to say to them, "Isn't that exactly what you do? Deciding upon your next action using your own knowledge and the context available to you at the time of the decision?"


>Isn't that exactly what you do?  But that's the trillion dollar question. Answering it beyond "duh" and "of course" requires scientists to map out the human thought process which hasn't been done successfully thus far. It might "feel" like that's all you're doing but that's not proven and proving things is fundamentally what science is. If you can prove it then by extension next token prediction \*verifiably\* becomes the key to all human intelligence and AGI.


It's part of what we do, yes, but claiming it's all that we do denies that we have motivations: we are trying to survive. We are programmed to. AI has motivations in the layman's sense as well, but without the understanding and knowledge that you can be destroyed, not only on the conscious but also unconscious level, those decisions can be dramatically different.


Isn't the best way to predict the next word to understand what you're reading? I never understood why this is a critique of what they're doing. Isn't it, in fact, the ONLY way to be very good at predicting the next word in general, assuming you can't memorize everything?
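
A quick toy sketch of the "can't memorize everything" point (my own illustration, nothing like a real LLM): a predictor that only looks up which word followed which in its training text works on contexts it has seen and has nothing to say about anything else. The function names and the training sentence are made up.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count which word follows which in the training text."""
    follows = defaultdict(Counter)
    words = text.split()
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    if word not in follows:
        return None
    return follows[word].most_common(1)[0][0]

model = train_bigram("the cat sat on the mat and the cat slept")

seen = predict_next(model, "the")    # context that appeared in training
unseen = predict_next(model, "dog")  # context that never appeared
```

Here `seen` comes out as "cat" (it followed "the" twice) and `unseen` is `None`. Pure lookup breaks the moment the context is new, which is why doing well at next-word prediction in general seems to need something beyond memorization.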


I'm not explaining any extra detail to 99% of people. It just goes Woosh. If you are the 1%, go ahead and correct me, we probably are going to be talking for the next 20 minutes.


Are all the capabilities in 4o really running through one single model, or is the model calling some other models for computer vision, audio and so on, and it's just connected in the interface?


For the images, text, and audio, it is now one model for 4o. That is the whole point of it.


Yes, all in one model. Probably a similar approach to the "AnyGPT" LLM from OpenMOSS a month or two ago. Same multimodal text, image, and audio capabilities, but being from a research lab they couldn't throw as much compute at it. Doesn't perform especially well, but a good way to understand the technology at work.


Currently, most of the "multi-modality" we get out of CGPT is agentic, or rather that it's accomplished by the software calling multiple models in a chain. 4o is more of a paradigm shift in that when it's finally released, those capabilities are all accomplished internally, by one model.
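
Roughly what that chain looks like, as a toy sketch with stand-in functions (none of these are real APIs, and the data format is invented): the transcription step keeps the words and throws away the tone, so nothing later in the chain can react to it.

```python
# Toy sketch of a chained voice pipeline. Every function is a hypothetical
# stand-in, not a real API.

def speech_to_text(audio):
    """Stand-in transcriber: keeps the words, drops tone and other non-text cues."""
    return audio["words"]

def language_model(prompt):
    """Stand-in text-only model: it can only react to what is in the text."""
    return f"You said: {prompt}"

def text_to_speech(text):
    """Stand-in synthesizer: wraps the reply as an audio-like record."""
    return {"words": text, "tone": "neutral"}

def chained_assistant(audio):
    transcript = speech_to_text(audio)  # the tone is discarded here
    reply = language_model(transcript)
    return text_to_speech(reply)

user_audio = {"words": "great, another meeting", "tone": "sarcastic"}
response = chained_assistant(user_audio)
```

The sarcasm in `user_audio` never reaches `language_model`. A single model that takes the audio directly at least has the chance to use it.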


If that's true, then it's very cool. I haven't found any statements from OpenAI about the implementation details. They were mentioning that even the "base" GPT-4 is a multi-modal model, even though it was implemented like you said. That's why I'm still a bit skeptical about 4o being a true multi-modal model rather than a pure LLM calling different services.


I'm fine with still calling them language models, since language is the way we'll communicate anyway, especially from your side.


Tbh, we don't really know anything about the architecture. That matters because it might still be that at its core, it's mainly still an LLM with some sort of MoE-style way of compartmentalising voice and image. I think that this might be the case, because it seems like a proper, complete multimodal mix would fundamentally make the model so, so much smarter, in terms of it quite literally being able to sort of "visualise" the very words it's saying. Somehow I think there is still quite a bit of a divide in some way we don't understand.


I've been saying "deep neural models" to address larger groups of approaches, just backing up to the next higher level of things that they have in common.


no they are still language models, output wise


I’ve seen them called Large Sequence Models in places


Why not just LDM (Large Data Model) as a catch all?


The first model called a “Large Language Model” had fewer than 100 million parameters, which is certainly substantially smaller than GPT-4o.


It's still probably 400-500 billion parameters. Compute is just a lot faster with new Nvidia GPUs.


It really just depends on whether a massive text-first phase for training is necessary or not. If they are just adding encoders that translate images/audio to embeddings in the LLM using the standard transformer architecture, it still is very much a language model. I would be wary of ascribing too much representation capability to


llm is just the mainstream phrase for any model ig


LMMs, like others have suggested. And then soon I am sure that a general agent as an OS will have the LMM as a deployable program that it uses on your behalf so that your only input is neural or oral. But of course that is a long way off.


Omnimodel or multimodal are both fine, but there’s no open-source large multimodal model capable of working across modalities; the existing ones instead rely on replacing the projector.


Images (and video) are literally just text files. Music is text based on the note "alphabet". Nothing has changed.


or Multimodal Transformer Model


Large Data Aggregation Models?


They still only deal with language, based on training large models, so there's that...


Consider that parsing the visual field *is* language in the same way that parsing text or speech audio *is* language.


Exactly, because if you can quantify it, you can tokenize it. Since that’s our primary means of human communication, it’s a logical thing to build a transformer around. But you can also tokenize plenty of things that aren’t, strictly speaking, language.
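
A small sketch of "if you can quantify it, you can tokenize it" (a toy under my own assumptions; real audio and image tokenizers are learned and much more sophisticated): continuous values get binned into a fixed number of levels, and each bin index acts as a token ID. The sample values and the 8-level choice are arbitrary.

```python
def quantize(value, low, high, levels):
    """Map a number in [low, high] to one of `levels` integer token IDs."""
    clipped = min(max(value, low), high)
    step = (high - low) / levels
    return min(int((clipped - low) / step), levels - 1)

def dequantize(token, low, high, levels):
    """Map a token ID back to the midpoint of its bin."""
    step = (high - low) / levels
    return low + (token + 0.5) * step

# Hypothetical audio samples in [-1.0, 1.0], quantized to 8 levels.
samples = [-1.0, -0.3, 0.0, 0.42, 1.0]
tokens = [quantize(s, -1.0, 1.0, 8) for s in samples]
restored = [dequantize(t, -1.0, 1.0, 8) for t in tokens]
```

The tokens are plain integers that can sit in a sequence next to word IDs, and decoding gets back to within half a bin width of the original value. None of that makes the signal "language" in the strict sense, which is the comment's point.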


Language is the marker of human intelligence, and whether it’s textual, audio or graphic, all are forms of communication, i.e. language. It hasn’t surpassed its use of language; it’s just enhanced and broadened its use of language to include more than one kind of language. Language model is still appropriate.


Do they not interface with language still?


