A group of researchers working with MIT has come up with a solution to a baffling problem with ChatGPT and other large language models. As these models carry on long conversations with users, they gradually start to break down, and the bot's performance eventually drops sharply. With this solution, though, that could be a thing of the past.
The issue, the researchers note, stems from the key-value cache, which is essentially the bot's conversation memory. When this cache fills up and needs to hold more, the first pieces of data are typically bumped out to make room for new ones.
Evicting those early entries is exactly what can cause the performance of ChatGPT and other LLMs to drop. As such, keeping the first few pieces of data in memory is what lets the LLM keep moving forward without issues, even if the conversation goes on for a long time.
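As a rough illustration of the idea (not the researchers' actual code), here is a minimal Python sketch of that kind of eviction policy: instead of letting the oldest entries fall out, the cache always holds on to the first few tokens and then keeps only a sliding window of the most recent ones. The class, method names, and sizes here are hypothetical and chosen only for readability.

```python
from collections import deque

class SlidingKVCache:
    """Toy key-value cache that pins the first few tokens of the
    conversation and keeps a sliding window of the most recent ones.
    (Hypothetical names and sizes, for illustration only.)"""

    def __init__(self, num_first_tokens=4, window_size=1024):
        self.num_first_tokens = num_first_tokens  # early tokens that are never evicted
        self.first = []                            # the protected early tokens
        self.recent = deque(maxlen=window_size)    # oldest recent token drops automatically

    def add(self, token_kv):
        # Protect the very first tokens of the conversation...
        if len(self.first) < self.num_first_tokens:
            self.first.append(token_kv)
        else:
            # ...and let only the middle of the conversation roll off.
            self.recent.append(token_kv)

    def contents(self):
        # What the model would attend over on its next step.
        return self.first + list(self.recent)

# Tiny usage example with small sizes so the effect is visible:
cache = SlidingKVCache(num_first_tokens=4, window_size=8)
for i in range(20):
    cache.add(f"kv_{i}")
print(cache.contents())
# ['kv_0', 'kv_1', 'kv_2', 'kv_3', 'kv_12', ..., 'kv_19']
```

A naive cache would discard the entries in `first` as soon as it ran out of room; keeping them pinned while the rest of the history slides is the small change the article describes.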
The researchers call the new method StreamingLLM, and it allows the AI to remain efficient even when a conversation extends to more than four million words. They tested it against another method, which avoids crashes and performance issues by constantly recomputing part of the earlier conversation.
StreamingLLM ran more than 22 times faster, which would keep performance in ChatGPT and other LLMs consistent even during longer conversations, letting you get better results from them. The study's authors say StreamingLLM would let a chatbot hold continual conversations throughout an entire day without needing to be rebooted.
Understanding the role the cache plays in how the chatbot responds to human input was important; it helped the researchers pinpoint the issue they needed to resolve. They've published their findings in a new paper on the arXiv preprint server.
So far, StreamingLLM has been incorporated into Nvidia's TensorRT-LLM, but it could make its way into other chatbots, such as ChatGPT and Claude, if those companies see the same value that Nvidia did.