Apple researchers have published a paper about a new AI model. According to the company, ReALM is a language model that can understand and successfully handle context of different kinds. With it, users can ask about something shown on the screen or running in the background, and the model can still understand the context and give a proper answer.
This is the third AI paper Apple has published in the past few months. These studies tease the upcoming AI features of iOS 18, macOS 15, and the rest of Apple’s operating systems. In the paper, Apple researchers write, “Reference resolution is an important problem, one that is essential to understand and successfully handle context of different kinds. (…) This paper demonstrates
how LLMs can be used to create an extremely effective system to resolve references of various types by showing how reference resolution can be converted into a language modeling problem, despite involving forms of entities like those on screen that are not traditionally conducive to being reduced to a text-only modality.”
One example is a user asking for nearby pharmacies. After a list is presented, something Siri can already do, the user could say, “Call the one on Rainbow Rd.,” “Call the bottom one,” or “Call this number (shown onscreen).” Siri can’t handle that second step today, but a model like ReALM could understand the context by analyzing on-device data and complete the query.
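The paper itself does not ship code, but the core idea it describes, serializing what is on screen into plain text so a text-only LLM can resolve a reference like “the bottom one,” can be sketched roughly as follows. The `OnscreenEntity` fields, the `build_prompt` helper, and the prompt wording are illustrative assumptions, not Apple’s actual format.

```python
# Rough sketch (not Apple's code): turning onscreen reference resolution
# into a text-only language modeling problem by serializing each visible
# entity into a numbered line of the prompt.

from dataclasses import dataclass


@dataclass
class OnscreenEntity:
    label: str  # the text shown for the entity, e.g. a business name and address
    kind: str   # a coarse type tag, e.g. "business" or "phone_number"


def build_prompt(user_request: str, entities: list[OnscreenEntity]) -> str:
    """Serialize the screen state into text so a text-only LLM can pick the referent."""
    lines = [f"{i}. [{e.kind}] {e.label}" for i, e in enumerate(entities)]
    return (
        "The screen currently shows, from top to bottom:\n"
        + "\n".join(lines)
        + f'\n\nUser request: "{user_request}"'
        + "\nReply with the number of the entity the user is referring to."
    )


if __name__ == "__main__":
    pharmacies = [
        OnscreenEntity("CVS Pharmacy, 120 Rainbow Rd.", "business"),
        OnscreenEntity("Walgreens, 45 Main St.", "business"),
        OnscreenEntity("Rite Aid, 7 Ocean Ave.", "business"),
    ]
    # "Call the bottom one" should resolve to the last entity in the list.
    print(build_prompt("Call the bottom one", pharmacies))
```

The point of the sketch is simply that once the screen is flattened into text like this, resolving the reference becomes an ordinary language modeling task.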
To do that, Apple researchers want ReALM to resolve references to the following kinds of entities (a small sketch follows the list):
- Onscreen Entities: These are entities that are currently displayed on a user’s screen.
- Conversational Entities: These are entities relevant to the conversation. They might come from a previous turn by the user (for example, when the user says “Call Mom”, the contact for Mom would be the relevant entity in question) or from the virtual assistant (for example, when the agent provides the user with a list of places or alarms to choose from).
- Background Entities: These are relevant entities that come from background processes and might not necessarily be a direct part of what the user sees on their screen or of their interaction with the virtual agent; for example, an alarm that starts ringing or music that is playing in the background.
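For readers who think in code, here is one hypothetical way those three categories could be represented before being fed into a prompt builder like the one sketched above. The `EntitySource` enum and `CandidateEntity` record are assumptions for illustration, not structures from the paper.

```python
# Illustrative sketch: a candidate pool mixing the three entity categories
# the paper describes (onscreen, conversational, background).

from dataclasses import dataclass
from enum import Enum, auto


class EntitySource(Enum):
    ONSCREEN = auto()        # currently displayed on the user's screen
    CONVERSATIONAL = auto()  # surfaced earlier by the user or the assistant
    BACKGROUND = auto()      # e.g. a ringing alarm or music playing


@dataclass
class CandidateEntity:
    source: EntitySource
    description: str  # the text form that would go into the LLM prompt


# A mixed pool of candidates the model would choose between.
candidates = [
    CandidateEntity(EntitySource.CONVERSATIONAL, "contact: Mom"),
    CandidateEntity(EntitySource.BACKGROUND, "alarm ringing for 7:00 AM"),
    CandidateEntity(EntitySource.ONSCREEN, "phone number shown on the open page"),
]
print(*(f"{c.source.name}: {c.description}" for c in candidates), sep="\n")
```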
That said, Apple argues its latest AI model compares favorably with OpenAI’s GPT-3.5 and GPT-4: “In the case of GPT-3.5, which only accepts text, our input consists of the prompt alone; however, in the case of GPT-4, which also can contextualize on images, we provide the system with a screenshot for the task of onscreen reference resolution, which we find helps substantially improve performance. Note that our ChatGPT prompt and prompt+image formulation are, to the best of our knowledge, in and of themselves novel. While we believe it might be possible to further improve results, for example, by sampling semantically similar utterances up until we hit the prompt length, this more complex approach deserves further, dedicated exploration, and we leave this to future work.”
You can find the full paper here.