Apple GPT might soon become a reality. Over the past few months, we've heard several reports about the large language model Apple is working on. For example, The Information reported that Apple is spending millions of dollars daily to train its LLM.
While the publication says most of this investment focuses on AppleCare customers, the Siri team plans to incorporate these language models to make complex Shortcuts integrations more accessible. In addition, Haitong International Securities analyst Jeff Pu has reported that Apple built a few hundred AI servers throughout 2023 and plans to add more in 2024.
He believes Apple plans to combine cloud-based AI with on-device data processing to bring its generative AI to iPhone and iPad users by late 2024, during the iOS 18 cycle. Since we're all looking forward to this Apple GPT technology landing on our iPhones, one small detail would set it apart from the others: on-device processing instead of cloud-based.
While Pu believes Apple will mix both, the company is a big advocate of privacy as a "fundamental human right," so relying mainly on on-device processing would be a key differentiator from all the other companies. But since Large Language Models are… large, an iPhone technically shouldn't be able to run a future Apple GPT locally; it would need a proper server for that.
That said, some Apple researchers have published a paper, "LLM in a Flash: Efficient Large Language Model Inference with Limited Memory," showing how Large Language Models can run efficiently on devices with limited memory, which is very exciting.
In this paper, first spotted by MacRumors, the researchers say their "method involves constructing an inference cost model that harmonizes with the flash memory behavior, guiding us to optimize in two critical areas: reducing the volume of data transferred from flash and reading data in larger, more contiguous chunks." To do that, they rely on two techniques:
- Windowing: Loads parameters only for the past few tokens, reusing activations from recently computed tokens. This sliding window approach reduces the number of IO requests needed to load weights (see the first sketch after this list).
- Row-column bundling: Stores a concatenated row and column of the up-projection and down-projection layers together, so bigger contiguous chunks can be read from flash memory. Reading larger chunks increases throughput (a second sketch below illustrates this).
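To make the windowing idea more concrete, here's a minimal sketch of a sliding-window weight cache. This is not Apple's implementation; the class, the `load_from_flash` callback, and the notion of "active neurons" per token are all illustrative assumptions about how such a cache could work:

```python
# Sketch of the "windowing" idea: keep in DRAM only the weights needed
# for the most recent window of tokens, so each new token triggers a
# small incremental load from flash instead of a full re-read.
# All names here are illustrative, not from Apple's paper.

class SlidingWindowCache:
    def __init__(self, window_size: int):
        self.window_size = window_size
        self.loaded: dict[int, object] = {}  # neuron id -> weights held in DRAM
        self.history: list[set[int]] = []    # active neuron ids, per token

    def step(self, active_neurons: set[int], load_from_flash) -> None:
        """Process one token: fetch only weights not already resident."""
        missing = active_neurons - self.loaded.keys()
        for nid in missing:                  # small incremental IO per token
            self.loaded[nid] = load_from_flash(nid)

        self.history.append(active_neurons)
        if len(self.history) > self.window_size:
            # Evict weights used only by the token falling out of the window.
            expired = self.history.pop(0)
            still_needed = set().union(*self.history)
            for nid in expired - still_needed:
                del self.loaded[nid]

# Usage: with a window of 4 tokens, repeated neurons cost no extra IO.
cache = SlidingWindowCache(window_size=4)
cache.step({1, 2, 3}, load_from_flash=lambda nid: f"weights[{nid}]")
cache.step({2, 3, 5}, load_from_flash=lambda nid: f"weights[{nid}]")  # loads only 5
```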
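And here's a rough sketch of row-column bundling, assuming a standard feed-forward block with an up-projection and a down-projection matrix; the shapes and NumPy layout are purely illustrative:

```python
import numpy as np

# Sketch of "row-column bundling": for neuron i, the i-th row of the
# up-projection and the i-th column of the down-projection are stored
# next to each other, so loading one neuron's weights becomes a single
# larger contiguous read instead of two scattered ones.
# Shapes are illustrative assumptions, not from Apple's paper.

d_model, d_ff = 8, 32
W_up = np.random.randn(d_ff, d_model).astype(np.float32)    # rows = neurons
W_down = np.random.randn(d_model, d_ff).astype(np.float32)  # cols = neurons

# One record per neuron: row i of W_up concatenated with column i of W_down.
bundled = np.concatenate([W_up, W_down.T], axis=1)  # shape (d_ff, 2 * d_model)

def load_neuron(i: int):
    """One contiguous read recovers both halves of neuron i's weights."""
    record = bundled[i]
    return record[:d_model], record[d_model:]

up_row, down_col = load_neuron(5)
assert np.allclose(up_row, W_up[5]) and np.allclose(down_col, W_down[:, 5])
```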
Combined, these methods could bring a 4-5x increase in speed on CPUs and a 20-25x increase on GPUs, allowing AI models up to twice the size of the iPhone's memory to run on-device. At the end of the day, this technology could improve Siri's capabilities, real-time translation, and other AI features for photos, videos, and understanding how customers use their iPhones.
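As a back-of-the-envelope illustration of what "twice the device's memory" buys you, consider the arithmetic below. The model sizes, precision, and DRAM figure are hypothetical examples, not numbers from the paper:

```python
# Rough feasibility check under an "up to ~2x DRAM" rule.
# Figures are illustrative assumptions, not from Apple's paper.

def fits_with_flash_offloading(params_billion: float,
                               bytes_per_param: int,
                               dram_gb: float) -> bool:
    model_gb = params_billion * bytes_per_param  # 1e9 params * bytes / 1e9
    return model_gb <= 2 * dram_gb

# A 7B-parameter model at fp16 (~14 GB) vs. a phone with 8 GB of DRAM:
print(fits_with_flash_offloading(7, 2, 8))   # True: 14 GB <= 16 GB
print(fits_with_flash_offloading(13, 2, 8))  # False: 26 GB > 16 GB
```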