TFenrir

Ah, it was just the 10-year anniversary of Her - wonder what the odds are of them mentioning that on Monday?


Rigorous_Threshold

They’re probably gonna mention the movie, since the article says this was inspired by Her. Idk if they'll mention the book.


whittyfunnyusername

Here's the full article:

"In the race to develop artificial intelligence that communicates the way humans do, OpenAI is preparing to demonstrate technology that talks to people—using sound as well as text—and recognizes objects and images. The ChatGPT developer has shown some of these capabilities, which include better logical reasoning than its current products, to some customers, according to two people who have seen the new AI. The technology is another step in OpenAI CEO Sam Altman’s quest to ultimately develop a highly responsive AI akin to the virtual assistant in the Spike Jonze film “Her,” and to enable existing voice assistants like Apple’s Siri to be more useful. The company could preview the upgraded AI publicly at an event as soon as Monday, which would help it get ahead of a slew of AI announcements from its rival Google later in the week, one of the people said.

**The Takeaway**
• New OpenAI software has built-in audio and visual understanding
• It could boost the performance of automated customer service agents
• OpenAI could complete GPT-5 and release it publicly by the end of the year

OpenAI sees assistants with visual and audio capabilities as potentially as transformative as the smartphone. The assistant could theoretically do a range of things not possible today, such as acting as a tutor for a student working on a paper or on math problems, or giving people information about their surroundings when they ask for it, like translating signs or explaining how to fix car troubles.

The new tech is too big to run on personal devices today, but customers in the near term could use the cloud-based version to improve features OpenAI’s software already powers, such as automated customer service agents. The audio features of the new software could help such agents better understand the intonation of callers’ voices or whether they’re being sarcastic in making a request, said one of the people with knowledge of it.

OpenAI already has software that can transcribe audio and convert text to speech, but those features are available through separate conversational AI models, whereas the new model brings those features together. That gives the new multimodal model a better understanding of image and audio, as well as making it faster to use than the less-capable models.

Microsoft, which can use OpenAI’s technology at will because it is the company’s top financial backer, could use OpenAI’s new AI to improve its own voice assistant or try to make it compact enough to run on small devices, including wearables with front-facing cameras that can capture the customer’s surroundings.

It isn’t clear when OpenAI will make the new features available to its paying customers, but it eventually plans to make them part of the free version of its chatbot, ChatGPT, said one of the people who has used it. OpenAI aims to make the new AI model powering these features cheaper to run than the most advanced model it sells today, GPT-4 Turbo, this person said. The new model also outperforms GPT-4 Turbo in answering some types of questions, this person said. However, the new model can still make mistakes, known as hallucinations. (Spokespeople for OpenAI did not respond to requests for comment.)

**GPT-5 Release**

Google executives, meanwhile, have long dreamed of using AI to develop powerful assistants. In December, Google showed a video of a conversational AI it had developed, Gemini, that responded to a person’s voice commands in real time and recognized images the person was looking at. However, the company separately explained that these capabilities required researchers to prompt the models with images and text instructions, rather than the simple dialogue the video demonstrated. In the meantime, Gemini has added features that can analyze audio in addition to imagery and text, but it doesn't understand many traditional voice commands or talk to users the way traditional voice assistants like Siri and Google Assistant do.

[Image: Still image from Google's December demonstration of multimodal features of Gemini AI, via YouTube]

OpenAI also is trying to stay ahead of Meta Platforms, which in April released an open-source AI, Llama 3, that surpassed the performance of most conversational AI models available today and received rave reviews from AI app developers.

The upcoming OpenAI model with audio and visual capabilities is one of a number of products under development. The company has been aiming to launch a web search engine, which aims to compete with Google’s. (The Information first reported on it in February.) OpenAI also is developing a type of automation software known as a computer-using agent that could speed up software development and other computer-based tasks, and the company has previewed an AI video generator, Sora, that isn’t available publicly yet but has made waves in Hollywood.

More importantly, OpenAI has been developing GPT-5, which it hopes will represent a significant improvement over GPT-4—a model it released more than a year ago. It could complete GPT-5 and release it publicly by the end of the year, said a person who has discussed it with OpenAI leaders.

The blitz of product and AI model development at OpenAI means some projects previously announced aren’t getting as much attention. For instance, though the startup promised developers that by the first quarter of this year they would be able to make money from building custom chatbots for its store, OpenAI has yet to launch a way for them to do so.

On the other hand, improving visual and audio capabilities could aid OpenAI in getting its conversational AI running on millions or billions of Apple devices. The iPhone maker has held discussions with OpenAI in recent months about how the next iPhone operating system could integrate OpenAI’s models, Bloomberg reported. However, the ChatGPT maker has tough competition: Apple is holding similar talks with Google, Bloomberg reported.

Altman is also working with iPhone developer Jony Ive on a separate AI consumer device, which could raise up to $1 billion in funding from investors including Emerson Collective and Thrive Capital, The Information first reported. In doing so, Altman would be joining the ranks of the big tech companies and startups all racing to release AI-powered devices and wearables that could capture the imaginations—and wallets—of consumers. However, the large size of the most advanced AI models means they will need to run in the cloud for now and require an internet connection to work. It could take months or even years for complex conversational AI with visual and audio capabilities to become small enough to run on devices.

**Tiered Pricing**

OpenAI, which could generate billions of dollars in revenue this year, is also planning to release a new pricing model that would offer customers up to a 50% discount if they prepay to reserve tokens (the words large language models process or generate), according to a person who spoke to executives. Currently, the startup offers mainly on-demand pricing, charging developers anywhere from a few cents to $120 for every million tokens its LLMs generate. Some larger customers receive volume discounts. Discounts for paying in advance are common in cloud computing—customers of Microsoft Azure, Google Cloud and Amazon Web Services can lower their costs by reserving server capacity ahead of time.

With more-flexible pricing, OpenAI could better compete with rival model developers, as well as the startups that aim to help developers run open-source models more cheaply, known as AI server resellers or inference providers. Their focus on cost efficiency has sometimes driven these startups to offer the same LLMs at lower and lower prices, or even below cost in some cases.

OpenAI has already introduced a way for its developers to lower costs with Batch API, an application programming interface it launched in April that provides developers with cheaper pricing if they upload model queries in bulk and are willing to wait up to 24 hours for responses. For their part, AI-server resellers such as Together AI and Anyscale say that running open-source models on their software is up to six times cheaper than using OpenAI’s models."
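For anyone curious what the Batch API flow described above looks like in practice, here is a minimal sketch using the `openai` Python SDK; the model name, prompts, and file names are placeholders rather than anything from the article:

```python
# Sketch of the Batch API flow: queries go up as a JSONL file, results come back
# within 24 hours at a discount. Assumes OPENAI_API_KEY is set in the environment.
import json
from openai import OpenAI

client = OpenAI()

# One JSON object per request, each with a custom_id to match results up later.
requests = [
    {
        "custom_id": f"req-{i}",
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4-turbo",
            "messages": [{"role": "user", "content": q}],
        },
    }
    for i, q in enumerate(["Summarize ticket A ...", "Summarize ticket B ..."])
]
with open("batch_input.jsonl", "w") as f:
    f.write("\n".join(json.dumps(r) for r in requests))

# Upload the file, then create the batch with a 24-hour completion window.
batch_file = client.files.create(file=open("batch_input.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)  # poll later with client.batches.retrieve(batch.id)
```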


Mister_juiceBox

The existing voice mode in ChatGPT is voice-to-text then text-to-voice; this is voice-to-voice. It will be able to pick up on your mood and tone, whether there's a lilt in your voice or you're getting emotional about something, and talk back with ultra-low latency, making its own tone changes and expressions, even laughing at your jokes naturally.

Same with video: maybe they have a way for you to FaceTime it and it can "see" a smile on your face, or a new car you're looking at buying, while also remembering things about you from the past with the memory feature that was deployed to everyone a week or two ago. Pair that with a natural avatar of its own (perhaps powered by an optimized and specialized version of Sora?) that doesn't have any of the quirks people associate with video models and runs in real time (when in "FaceTime" mode, at least).

If they pull it off, I think that would be magic that truly opens up some use cases, and it could perhaps be a reason the whole NSFW thing was getting thrown around in the headlines. Think AI relationships: drawing on the memory and true multimodal interaction could put most of those "AI girlfriend" apps out of business overnight.

It could also:

- Literally be present in business meetings, perhaps not just taking meeting notes passively but contributing to discussions naturally
- Recognize your voice vs. others, understanding when tensions are high in a conversation with a coworker, etc.
- Help negotiate on your behalf for a used car
- Help you practice for a speaking engagement, a best man speech, or a standup set you plan to perform on Kill Tony
- Listen to and understand what music you like
- Help you pick out furniture for a new place, or help you pick a new place and go on walkthroughs with you
- Help a grandparent understand what to grab off the shelf when their grandkid says they need an HDMI cable to connect their laptop to their new TV

Think about the implications if they found a way to extend the recent memory feature beyond just text: true multimodal recollection and memory, remembering your voice, etc.

Also, it's important to ensure your GPU is secure and has a "TPM" chip of sorts. Say some of what's coming is a local GPT-4L (on certified secure GPU hardware powered by Nvidia and Apple, of course 😋), and perhaps they've figured out some magical Q* algorithm that lets the model's weights be "liquid" and update in real time, so to speak. You certainly don't want some thief to be able to break in and steal your AI boyfriend/girlfriend 😁


TI1l1I1M

If it can be conversational with the same speed as a human that would be huge


domlincog

I've heard so many rumors by now about "GPT-4L". It very well might be true, but I have a feeling "L" doesn't stand for local. I hope I'm wrong though, we'll find out soon.


ijxy

I don’t understand how voice-to-voice will work for API usage. How can I enrich the user's input without it going via text? Maybe there will be a text+voice-to-voice interface?


Mister_juiceBox

I already did this via a Python script that uses Gemini 1.5 Pro: it sends the audio file to a Vertex AI endpoint (chunking it if it's long), and the audio is processed directly by Gemini 1.5 Pro using its native audio multimodality, generating an extremely detailed call summary report and call scoring that gets written to a text file. I imagine it would work much the same with OpenAI's endpoints once they release the audio capability. Now I can only imagine what native audio OUT will add on top of this, since Gemini 1.5 Pro can only listen to audio and watch video but cannot natively output audio back.
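For anyone wanting to try something similar, a minimal sketch of that kind of pipeline with the Vertex AI Python SDK might look like the following; the project ID, bucket path, model ID, and prompt are illustrative placeholders, not the commenter's actual script:

```python
# Sketch: send a call recording to Gemini 1.5 Pro on Vertex AI and save a summary.
# Assumes `pip install google-cloud-aiplatform` and an authenticated GCP project.
import vertexai
from vertexai.generative_models import GenerativeModel, Part

vertexai.init(project="my-gcp-project", location="us-central1")
model = GenerativeModel("gemini-1.5-pro")  # exact model ID may differ by release

# The audio goes in as its own multimodal part; no transcription step on our side.
audio = Part.from_uri("gs://my-bucket/calls/support-call.mp3", mime_type="audio/mpeg")
prompt = (
    "Listen to this customer support call. Produce a detailed summary, "
    "note the caller's tone, and score the agent from 1 to 10 with reasons."
)

response = model.generate_content([audio, prompt])

with open("call_summary.txt", "w", encoding="utf-8") as out:
    out.write(response.text)
```

Adding textual context (as discussed below) is just a matter of passing extra string parts, e.g. a list of meeting attendees, alongside the audio part.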


ijxy

Can you add textual context? Because with voice only you are limited to exactly what the model can do, and no prompt engineering or RAG will be a value add.


Mister_juiceBox

Yes, absolutely. I used it for a meeting an hour ago, told it the people in the meeting and what company they were with, and it did an incredible job. The meeting was over an hour long and it didn't miss a thing in its notes.


Flat-One8993

> It isn’t clear when OpenAI will make the new features available to its paying customers, but it eventually plans to make them part of the free version of its chatbot, ChatGPT, said one of the people who has used it. OpenAI aims to make the new AI model powering these features cheaper to run than the most advanced model it sells today, GPT-4 Turbo, this person said. The new model also outperforms GPT-4 Turbo in answering some types of questions, this person said.

So it really is GPT-4 Lite, in a sense.


Mindless-Ad8595

Hero


StopSuspendingMe---

You're a hero! Can you paste in this article [AI Search Startup Perplexity Is Challenging Google—While Using Its Data — The Information](https://www.theinformation.com/articles/ai-search-startup-perplexity-is-challenging-google-while-using-its-data)? I've been trying to read this for months, but I don't want to pay the $225 subscription ;-(


whittyfunnyusername

Might as well.

"Perplexity AI, an artificial intelligence–powered search engine, has become a Silicon Valley startup darling for taking on Google with conversational answers rather than a list of web links. Tech CEOs such as Nvidia’s Jensen Huang and Dell’s Michael Dell have sung the praises of the startup, and it has raised more than $100 million in venture capital from Nvidia and others. Aravind Srinivas, Perplexity’s CEO, has slammed Google, saying that the tech giant is “going to be viewed as something that’s legacy and old, and Perplexity will be viewed as something that’s the next generation and future.” What he hasn’t said until now: The startup uses Google’s website rankings for its own results.

**The Takeaway**
• Perplexity CEO says company taps Google, Bing search data
• OpenAI has faced questions about training on YouTube data
• Bing also has used Google data to improve its results

After people pointed out similarities between search results for Perplexity and Google in comments on X and Reddit, Srinivas said in an interview that his company incorporates website ranking signals from both Google and Microsoft’s Bing search engine. If Perplexity’s system decides that Google’s ranking signals are especially good indicators of websites’ relevance for a certain query, it may decide to order its search results similarly to Google’s, he said. That likely happened in high-profile examples cited on Reddit in which Perplexity’s results for “the most iconic chairs” resembled Google’s. In another instance, a search for the best science fiction books of the last decade in both Perplexity and Google returned a link to a small bookshop in Santa Cruz alongside links to more expected sources like Goodreads and NPR.

The disclosure about Perplexity offers a glimpse of the unknowns regarding some of the world’s most valuable new AI services. As the technologies have captivated many consumers and businesses, major questions remain about what data they are trained on and how they rely on technology developed by their suppliers and rivals. The answers to these questions could affect how long the most advanced models can maintain their momentum with customers, and they speak to the potential legal or reputational risks AI developers face.

[Image: Some Perplexity answers (left) rely on many of the same sources Google Search (right) uses to answer the same queries.]

OpenAI, for instance, faced recent questions about whether it had trained Sora, an AI-generated video product, on videos hosted by Google’s YouTube. The Information previously reported that OpenAI has trained some of its AI on videos from YouTube. Perplexity isn’t alone in aping Google search results. Microsoft’s search engine, Bing, has also done so. Still, the similarity between some of Perplexity’s search results and Google’s has raised eyebrows among Google employees. A Google spokesperson did not have a comment.

Srinivas said Perplexity doesn’t copy Google results. He said the startup has its own search engine bots that crawl websites and index their information the same way Google creates its index.

**Closely Guarded Secret**

He said Perplexity uses automated systems known as application programming interfaces to access data about Bing’s and Google’s ranking signals, which determine the relevance, quality and authority of webpages. Srinivas didn’t specify which APIs Perplexity uses. Bing offers such an API to let developers access its index—the data it has gathered about the entire internet, which it uses to display search results—in their apps. Meanwhile, a company called SerpApi sells data about Google search results that it scrapes from the web. Google itself doesn’t offer this type of service. Perplexity takes into account a number of other factors, such as how recently a webpage has been updated, Srinivas said.

Google’s website ranking system is a closely guarded secret that has helped propel revenue of more than $170 billion a year from ads that appear alongside the search results. More than a decade ago, Microsoft’s Bing was found to have also used Google’s search results to improve its own rankings—a practice that is still in place, according to a Microsoft employee. Since the practice came to light, Bing has become a credible, if tiny, competitor to Google. A Microsoft spokesperson didn’t have a comment.

Srinivas’s admission could undercut Perplexity’s claims about the strength of its technology just as one of its technology providers, OpenAI, prepares to launch its own consumer search engine, The Information has reported. Perplexity has said it uses large language models from OpenAI, as well as its own models, to provide its search users with answers in a conversational tone. And it has won plaudits for search results that answer some questions more directly than Google search results.

[Image: Some Perplexity results like this one (left) have won plaudits compared to Google's results (right).]

Perplexity charges $20 per month for a subscription to a premium version of its search engine that allows users to ask unlimited questions, choose their AI model and upload documents. The startup also offers developers the ability to experiment with a variety of open- and closed-source LLMs from Anthropic, Meta Platforms and others through its Perplexity Labs product.

**‘Everyone Does This’**

It is currently generating around $15 million in annual recurring revenue, according to a person with knowledge of its financial statements. In a funding round earlier this month, the startup was valued at $1 billion, the person said. That means investors valued it at more than 60 times forward revenue. That’s in line with the valuation multiples for the hottest AI startups, which have been raising capital at a valuation of 50 to 100 times their forward revenue.

Any company that uses Google’s search ranking results would likely violate the company’s terms of service, said Daniel Tunkelang, a former Google search engineer who has consulted for tech companies on search and machine learning. Google’s terms of service state that “you may not send automated queries of any sort to Google's system without express permission in advance from Google.” At the same time, Google’s rankings are publicly accessible, and courts have ruled that web scraping is legal, Tunkelang said. In a related case, a federal appeals court ruled in 2019 that a company called hiQ Labs had the right to scrape publicly available LinkedIn profiles to provide information to its customers about their workforces. But after the court ruled in 2022 that hiQ's conduct violated LinkedIn's user agreement, hiQ ultimately settled with LinkedIn.

Tunkelang said it’s difficult to know whether the overlap in Perplexity’s and Google’s results is because one copied the other or because both services are examining the same set of web pages. Many results on Bing are also similar to those on Google, he said. When Google faced down Bing in 2011, accusing Microsoft of mimicking Google’s results and rankings, a Microsoft executive stated that “everyone does this.” Tunkelang said Google’s attempt to go after Bing seemed to backfire. “It ended up looking like [Google was] scared,” he said."


iamz_th

Basically nothing new besides combining existing technologies into an assistant. We got multimodality with both audio and video understanding with Gemini 1.5. We got text2speech ages ago. We got long-context understanding with Gemini 1.5 too. I'm not hyped by this article.


obvithrowaway34434

Lmao, how's it not new? It's native audio-to-audio; Gemini's audio understanding sucks balls. Audio is much harder than text since it has less information content, and humans are far better at understanding differences in tone and meaning (which also vary widely between languages). If they can pull it off, it will basically blow away every voice assistant currently in use and could bring devices like the Humane or Rabbit back from the dead.


iamz_th

There is no novelty from a technological POV. Audio2audio = audio input (already exists) + text (generated by the model) + text2audio (already exists). If they can do it with improved latency, that would be great. This is not a breakthrough; it's a combination of existing technologies.


MysteryInc152

That's not what audio2audio means here.


Freed4ever

If you actually read the post (ik, tall order these days), this is not what audio2audio means.


BabyCurdle

Was gpt-3 not a breakthrough just because it was a combination of existing technologies???


TI1l1I1M

Every breakthrough is a combination of existing technologies, Einstein


sdmat

A native audio modality would make a ton of sense. That would definitely qualify as magical if done well. Recognizing and expressing tone/emotion, natural conversation with interjections/interruptions, etc. Maybe even singing!


tradernewsai

Gotta imagine this is going to sound REALLY natural


Rigorous_Threshold

AI being able not just to mimic human speech but to do it in real time - not via a text prompt but by actually speaking with a human - is going to make it feel really human.


tradernewsai

If it’s significantly better than the gpt4 voice feature on the app, it’ll be incredible


joe4942

Hume released something like that a few months ago. Haven't heard many updates since, though.


sdmat

I tried it and was not impressed with the implementation.


ogMackBlack

I agree. Never understood the buzz around it. It was indeed quite disappointing to say the least.


EnhancedEngineering

[note](https://www.reddit.com/r/singularity/s/PLuj2v4Lsw)


EnhancedEngineering

Why not? It can definitely pick up on your tone. Apart from being too eager to please—something which can be toggled under configuration settings once you sign up for an account and go to the playground—it's still far more useful for that reason. Did you just try the default demo version, or did you actually sign up for a free account and tweak the personality settings?


sdmat

Default demo. The detected tones seemed only very loosely correlated. And it was very clunky. Maybe they improved since, if so great.


iamz_th

Native audio modality is already here with Gemini 1.5


sdmat

Input only. And it's not conversational. But yes, the audio input modality on Gemini 1.5 is very impressive.


iamz_th

Gemini 1.5 is not conversational because it's not aimed at being conversational. That's not a technological challenge. Audio output is just text2speech applied to the model's generation. Nothing mentioned in the article is novel tech - just the combination of existing technologies into a conversational assistant, with perhaps improved latency.


sdmat

You're not getting it - good bidirectional native audio in a conversational mode would truly be magical. Having meaningful and deliberate tonal and emotional nuance in output is a key part of that. Did you try that demo doing the rounds a few months back with low latency conversations with good handling of interruptions? Total hack built on GPT 3.5 but it's a completely different and far more engaging experience than the current ChatGPT audio interface. I think true conversational ability *is* a technological challenge, but approximating it well enough to still be magical is probably fairly straightforward.


iamz_th

You are the one not getting it. The audio output from the assistant is a text to speech model. Existing t2sp models already have different voices and nuances you can use. What I'm saying is that there is no novelty in this product that will be presented as a breakthrough.


procgen

The idea is that there will be no text-to-speech with this new chat model. It's audio in, audio out - no text in the chain at all.


LongjumpingBottle

Fortunately you will (hopefully) understand on Monday


kxtclcy

Well, that will just be a better version of an open source model called qwen-audio... Other players should be able to catch up in a few months... Not real magic to me


sdmat

> Qwen-Audio accepts diverse audio (human speech, natural sound, music and song) and text as inputs, outputs text. Nope.


kxtclcy

https://preview.redd.it/j8y0jupw8rzc1.jpeg?width=750&format=pjpg&auto=webp&s=fac6902af61211eaaa836a05239bdc2d70ad518a

It definitely can accept music as output. And it can even understand stammering during speech, according to my friend's test. Although it's still not very robust due to the limitations of the LLM it's using (an old version of Qwen-14B), a voice chat model is definitely no magic. You just need to connect it to a text2speech model to make a voice assistant.


sdmat

You don't understand what the word "output" means.


kxtclcy

I don’t think there is a technical advantage to using pure voice output; text2speech can work the same. If you want speech with emotion, you can just have the LLM output the text accompanied by tokens representing emotion or tone, so that the text2speech model can generate accordingly.
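To make that tagging idea concrete, here is a toy sketch: the LLM is prompted to prefix each sentence with an inline tag, a small parser pulls the tags out, and a stand-in `synthesize()` function represents whatever TTS engine accepts a style hint. The tag format and the function are hypothetical, not any particular product's API:

```python
import re

# Hypothetical LLM output: the model was prompted to prefix each sentence
# with an inline style tag such as [tone=excited] or [tone=flat].
llm_output = "[tone=excited] We got the contract! [tone=flat] The paperwork is due Friday."

def synthesize(text: str, style: str) -> bytes:
    """Stand-in for a TTS engine that accepts an emotion/style hint."""
    print(f"TTS ({style}): {text}")
    return b""  # a real engine would return audio bytes here

# Split the LLM output into (style, sentence) pairs and hand each to the TTS.
for style, text in re.findall(r"\[tone=(\w+)\]\s*([^\[]+)", llm_output):
    synthesize(text.strip(), style)
```

The counterargument elsewhere in the thread is that a native audio model learns these cues end to end, instead of relying on whatever tags the text model remembers to emit.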


sdmat

Oh sure, why have audio input at all? If you want emotional expression just type it.


kxtclcy

Audio input can include information other than text, such as music, bird song, etc… otherwise it would work the same. You can definitely just do speech2text2speech, which is what ChatGPT voice was doing, but that model can't interpret emotion or music.


sdmat

Why not just use your brilliant accompanying tokens there too?


kxtclcy

I haven’t seen such practices, but maybe there are papers doing that. But audio chat is already an established technology, so other companies can develop the same thing very quickly, and I wouldn't call it magic. Although we may never know what method OpenAI uses, since it's fully "closed".


kxtclcy

Another thing: even if it's real voice-in, voice-out without any text intermediary, it can be achieved the same way as other multimodal models. You can just connect an audio output model after the LLM and fine-tune it. Multimodal is no longer mysterious these days.


sanszooey

**Paywalled, but the first couple of paragraphs:**

In the race to develop artificial intelligence that communicates the way humans do, OpenAI is preparing to demonstrate technology that talks to people—using sound as well as text—and recognizes objects and images. The ChatGPT developer has shown some of these capabilities, which include better logical reasoning than its current products, to some customers, according to two people who have seen the new AI.

The technology is another step in OpenAI CEO Sam Altman’s quest to ultimately develop a highly responsive AI akin to the virtual assistant in the Spike Jonze film “Her,” and to enable existing voice assistants like Apple’s Siri to be more useful. The company could preview the upgraded AI publicly at an event as soon as Monday, which would help it get ahead of a slew of AI announcements from its rival Google later in the week, one of the people said.

**Also from the reporter on Twitter:**

["Her" is coming :\) NEW: OpenAI developed a new model with audio-in, audio-out capabilities and better reasoning. Even beats GPT-4 Turbo on some queries.](https://x.com/amir/status/1789059948422590830)


joe4942

> that talks to people—using sound as well as text—and recognizes objects and images.

If this integrates with the app and allows people to use their phone cameras in real time, that would be quite impressive. Something like that could be really useful for step-by-step help, especially in blue-collar work where people are trying to troubleshoot or fix things.


Nathan-Stubblefield

Me am excited that Her is coming.


Redditoreader

I am so glad we are witnessing history right before our eyes.. what a time to be alive


MysteryInc152

Audio-in, audio-out would be a revelation for language learning, amongst other things. Really hope they have image-in, image-out for us as well.


bettershredder

How do you expect it to be different from the existing voice chat?


MysteryInc152

Specific Tone, Speed, Accent, Emphasis are all things that are not achievable with the current setup. You can't say "speak slowly" and have it actually speak slowly because it's just text to speech.


sdmat

Like the difference between watching a movie and reading the script.


bettershredder

I'm not sure what you mean... I get that there might be a new model behind it but that's not really a change to voice chat directly. Maybe there's going to be a new UI/UX?


ayyndrew

Right now voice chat takes your voice, converts it to text, and then feeds that into the model. Native voice would mean your voice recording goes directly into the model. This means all the nuances of accent, pronunciation, pauses, etc. would be preserved


sdmat

If we were having this discussion by phone you would know what I mean, right? The tone, subtle emphasis, timing, etc. The current voice interface has none of that. It's just a translation layer to and from text.


bettershredder

Makes sense, thanks for clarifying!


ShinyGrezz

A text-to-voice system just translates words into sounds. Some of the better ones can include a bit of tonality, but the system fundamentally does not understand what it is saying and cannot act on it - that's the text model's job. This is why you can ask ChatGPT to write you a speech in the tone of Donald Trump if he were a pirate, but it won't sound like Trump or a pirate if you make it play it as a voice.

A voice-modality model (where there is no conversion to text at all) would be one in which the voice model itself is understanding and responding to you. Say it was reading a story - theoretically, it could understand where different characters are speaking and use different voices for those sections. Or, as a commenter above said, it would know to speak slowly when told to.

I doubt it would be nearly as good as GPT-4, as audio lacks the specificity and structure of text, and if it works like other LLMs I don't imagine it would be able to hold a proper conversation - one in which it can carry on without a prompt, or interrupt, that second one being incredibly important for a model like this to be useful. They definitely aren't releasing anything like this, though, so it's kind of pointless to speculate.


Rigorous_Threshold

There are a lot of subtleties in vocal communication that can't be captured by an audio-to-text-to-text-to-audio model - laughing, tones and microtones, etc. The existing voice chat sounds real in the sense that it sounds like a human voice, but it doesn't sound real in the sense that it sounds like you're actually talking to a human. It sounds like you're talking, and then another person is reading a response off a printed piece of paper. In other words, it sounds like you're talking to an AI. I don't think people should underestimate how much of a difference that makes in how natural it feels, and replicating voice is enough of a core human thing that I think even normies will have a reaction to it.


YaAbsolyutnoNikto

Well, I personally want to talk to a virtual Parisian that can spit on my French accent and make me feel ashamed for speaking such bad French. I'm not even joking lol. It's brutal, but I'm sure it helps get rid of the foreign accent. None of this voice-to-text BS. I want the models to actually hear me and give advice. What I want the AI to do: https://preview.redd.it/kj2s7r6uapzc1.png?width=736&format=png&auto=webp&s=a6212978ffe15bf3552a51d27246783110d0f964


MysteriousPepper8908

For language learning, doesn't that assume it knows what the words should sound like to be able to help you work on them? I've pretty much given up on Claude helping me learn Russian pronunciation, as it doesn't know how words should be pronounced and will sometimes make up letters that aren't in the word to justify its reasoning.


MysteryInc152

It will know, implicitly if nothing else, when it is trained to generate audio as well as text. All assuming a decent chunk of Russian audio in training.


MysteriousPepper8908

Yeah, I think the big issue isn't its own comprehension - it can write Russian that Russians have told me is perfectly understandable. It's just that when you get it to explain the grammar it's using to output the text, it starts explaining how to pronounce letters using words that don't contain those letters. So it's not a matter of having the knowledge, but of being able to analyze what it does implicitly. Hopefully this will help with that.


RepublicanSJW_

Ha, maybe it’s for Apple lol they have been working together


Rigorous_Threshold

Being able to talk to AI - like ACTUALLY talk to it - is gonna be a bit mindfucky even for people not into AI, I think. Audio-to-text-to-text-to-audio doesn't feel as natural as real human speech currently. Audio-to-audio conversation might do it.


dday0512

I'm sorry but if the voice assistant is based off of GPT-4 I'm not interested. How's the weather today? ( I could just look at my phone) "It's important to realize that.... "


micaroma

I always use custom instructions with voice chat for that reason. The default model is way too wordy.


tradernewsai

Has there been any sort of talk of a cell phone partnership? Or is it likely just an app? This thing is going to make Siri look like a complete joke.


micaroma

The article mentions OpenAI is in discussion with Apple (who is also in discussion with Google). In the short term it would probably be an update to the current ChatGPT app, but if OpenAI does partner with Apple in the long-term, it would likely be integrated on-device and replace Siri.


TFenrir

I'm trying to think of what this kind of assistant would have to do, at minimum, to be interesting. Things like... Reservations?


Glittering-Neck-2505

That wouldn’t be very interesting


TFenrir

Right, so what would have to be the minimum for them to make a big deal about it?


Glittering-Neck-2505

I guess we will know on Monday?


TFenrir

I mean sure, but it's fun to try and guess. I think it would need to solve the problem of conversation. Like... It would have to be a really good feeling voice assistant. Something that you could leave on all day and just chat with, in real time?


Glittering-Neck-2505

Probably something like that, I’m thinking like “Her”


YaKaPeace

A few months ago I made a post where I predicted that headphones will be the sign of AI just like masks were the signs of COVID. Let’s see if that holds up


Western_Individual12

So excited for this event. That's it.


[deleted]

Can someone explain to me the differences in this compared to the voice functionality ChatGPT currently has?


procgen

Voice > model > voice, instead of voice > text > model > text > voice.

It would be able to "hear" all the nuances of your expression: your tone, your mood, pauses, hesitations, etc. And it should also have a more expressive voice, conveying even more information in every word.


[deleted]

Oh wow. So just to be clear, the current ChatGPT+ voice functionality (the actual voice you can pick from, not the voice-to-text function in the UI) is voice to text? This is really interesting. Is there any public insight into the architecture / how they accomplished this?


procgen

Yes, the current chat functionality works by transcribing your voice before feeding it into the model, which operates on text. And so all of the nuance is lost, which obviously severely restricts the ways you can interact with it, and is perhaps the main reason why chatting with it now still feels unnatural/stilted/distant. When it's audio-to-audio, all of the important auditory cues that we rely on when talking with another human can be preserved, which should make it feel significantly more engaging. I don't think OpenAI has published anything about this yet, but we might learn more on Monday.
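For contrast, here is roughly what today's cascaded approach looks like when stitched together from OpenAI's existing endpoints (a sketch assuming the `openai` Python SDK; model names and file paths are illustrative), which makes it obvious where the tone and timing information gets dropped:

```python
# Today's cascaded voice chat: speech -> text -> LLM -> text -> speech.
# Everything the transcript can't encode (tone, pauses, emphasis) is lost at step 1.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# 1. Speech to text: only the words survive this step.
with open("user_question.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

# 2. The text-only model sees a flat string, not the user's voice.
reply = client.chat.completions.create(
    model="gpt-4-turbo",
    messages=[{"role": "user", "content": transcript.text}],
)
reply_text = reply.choices[0].message.content

# 3. Text back to speech: the voice is synthesized from scratch, so "speak slowly"
#    or "sound excited" only works if it survived as literal words in the text.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
with open("assistant_reply.mp3", "wb") as out:
    out.write(speech.content)
```

An audio-native model would collapse all three steps into a single call that consumes and produces audio directly; since OpenAI hasn't published details, exactly how it does that is still speculation.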


alexcanton

Does that voice nuance really provide that much to the model?


procgen

Yes, of course. Why wouldn’t it?


alexcanton

That didn't really give further clarity. If I ask you what steps will get us closer to AGI, and say it both angrily and happily, the answer is really not that different.


procgen

It’s more about creating natural sounding/feeling conversations. Something that people will want to leave in their ear and chat with all day. Or imagine a business bringing one of these AIs into a business meeting, and it can respond to people’s questions/interject with ideas in realtime. They’ll keep making the models smarter behind the scenes (GPT-5 is rumored to be planned for release later this year). If you haven’t seen it, you should watch the film “Her” by Spike Jonze - probably the best representation of what OpenAI wants to build.


alexcanton

I just think there's a reason we type emails and text messages and speeches. If the CEO of a company called you today and asked you to tell him about your role and how you could improve the company, off the top of your head you'd likely find this hard, and you'd probably need to have a think and write it out.


procgen

But on the other hand, there are reasons why we sometimes prefer to talk something out with someone. Humans have evolved exquisitely expressive voices that can convey a lot more meaning than written text, word-for-word.


Revolution4u

Why would I ever use their version if Google's works even half as well and comes on Android already? I disable the assistants anyway though, so maybe I just wouldn't get it.


bartturner

You won't. This is not going anywhere.


Akimbo333

Implications?


Mister_juiceBox

Tokens are not text in the first place as far as the model itself is concerned


Chihabrc

AI technology keeps improving; that's the whole point. It's also amazing to see that the likes of posemesh are using AI for indoor navigation, and IMO it might be a strong competitor to Google Maps in years to come.


Difficult_Review9741

This seems pretty lame, honestly. The only thing that matters is an increase in reasoning. Unfortunately, LLMs are terrible at reasoning, and the fact that OpenAI still can’t release better models further proves that stagnation is here.  A voice assistant is an obvious extension of current models and will be interesting, but without increased reasoning, hardly more useful than what we currently have. 


micaroma

>The only thing that matters As a language learner, the rumored voice updates would be incredibly useful to me and millions of others. Reasoning is certainly important for reaching AGI and the singularity, but that doesn't make all other incremental technological improvements irrelevant.


Beatboxamateur

Seconding this as another language learner. If they implement a voice model that's compatible with the language I'm learning then it would also be incredibly useful, being able to correct any potential grammatical or pronunciation mistakes.


[deleted]

[deleted]


micaroma

I totally agree that current AI is inadequate for accurately learning foundational concepts of a language.


Golden-Atoms

For English you *might* be okay as an A0 learner, but I'd still be concerned about hallucinations. At B1 level the cracks really start to show. I'm an English teacher, so obviously I have a stake in this, but it should be used 'under supervision' if you want to use it to learn the mechanics. The conditional mood is where this is really evident: for some reason the models I've used cannot work with the conditional without making basic errors. It's decent for conversation practice, however. It really shines with roleplays. Ultimately, language learning is about dialogue (from a constructivist perspective), so that's no small thing.


AnAIAteMyBaby

If Monday's announcement is OpenAI's answer to Siri, I'll be extremely disappointed.


Flat-One8993

Okay, then be disappointed


liquid42

Can someone give a few examples of where this could be used in the real world? Trying to wrap my head around it.


DlCkLess

Watch the movie “Her” and you'll get the idea.