Does GPT-4 Turbo live up to the hype?
This past Wednesday, Google and Google DeepMind announced their much-awaited AI model, Gemini. Google purports that Gemini Ultra, the powerhouse version geared toward enterprises, outperforms GPT-4 in all industry benchmarks, including reasoning, math, and coding. So far, The Verge notes that Gemini’s reported performance is only narrowly better than GPT-4’s. Since there is still little hands-on evaluation of Gemini, we’ll hold off on poking at that hype and instead turn our attention to its true enterprise rival, OpenAI’s GPT-4 Turbo, whose capabilities have been more thoroughly researched and continue to get a ton of publicity.
Upon its release by OpenAI, GPT-4 Turbo marked a significant milestone. With a 128,000-token context window, the model promises to push the boundaries of language processing and, in turn, open up new possibilities for tasks like summarization and complex question answering. Compared to its predecessor, it has a significantly more recent knowledge base: its training data runs through April 2023, adding 19 months of information beyond the previous models’ September 2021 cutoff. Its multimodal functions include image processing and visual content analysis, along with claimed improvements in text recognition within images.
At OpenAI DevDay, Sam Altman boldly declared GPT-4 Turbo to be “the most capable model in the world.” AlpacaEval, an automatic evaluator for instruction-following language models, echoed the sentiment, awarding GPT-4 Turbo the highest leaderboard score: an impressive 97.70% win rate. However, its lead over second place is narrow, at just over two percentage points: XwinLM 70b V0.1 reached a 95.57% win rate, outperforming GPT-4, Claude 2, and LLaMA 2 Chat. (You can dive into the detailed evaluation here to explore the contrasts between these LLMs.)
But do these upgraded, flashy capabilities hold up when it comes to performance and capacity? What are some of the challenges that come with these impressive, rapid advancements? Below, we compile some of the research we’ve seen that forces us to pause amidst the hype, so that we can remain agile and informed as we adapt to the ever-evolving boundaries of what these models can do.
Increased context window may lead to decreased accuracy
Precise information retrieval is crucial to tasks in every industry, including data analysis, customer service, and content creation (to name a few). Understanding that GPT-4 Turbo requires strategic input and context management will allow users to leverage its strengths while mitigating its weaknesses.
Takeaway: Despite the model’s larger context window, we still need accurate retrieval to get correct responses. When tasked with large-scale summarization, current long-context LLMs may still require prompt engineering to get things right. In other words, for both accuracy and cost-effectiveness, RAG (retrieval-augmented generation) isn’t going anywhere, at least for now.
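To make the retrieval step concrete, here is a minimal sketch of the idea behind RAG: rather than pasting an entire corpus into the context window, score each chunk against the question and send only the best matches. This is an illustration under simplifying assumptions, not a production design. The word-overlap cosine similarity stands in for the embedding search a real system would more likely use, and the function names (`retrieve`, `build_prompt`) and sample chunks are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(question, chunks, top_k=2):
    """Return the top_k chunks most similar to the question.

    Assumption: word overlap is a stand-in for embedding similarity.
    """
    q_vec = Counter(tokenize(question))
    scored = [(cosine_similarity(q_vec, Counter(tokenize(c))), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_k]]


def build_prompt(question, chunks, top_k=2):
    """Assemble a prompt containing only the retrieved chunks."""
    context = "\n".join(f"- {c}" for c in retrieve(question, chunks, top_k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Hypothetical sample data for illustration only.
chunks = [
    "The refund policy allows returns within 30 days.",
    "Our office is closed on public holidays.",
    "Shipping takes five business days.",
]
prompt = build_prompt("What is the refund policy?", chunks, top_k=2)
```

Because only the top-scoring chunks reach the model, the prompt stays small regardless of corpus size, which is where the accuracy and cost arguments for RAG come from.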
Stability and Capacity Issues
It’s unclear exactly how many resources OpenAI is using to keep ChatGPT up and running. But a quick glance at OpenAI’s incident history makes one thing clear: service availability is currently intermittent. Users frequently find that a reply takes so long to generate that the request times out. For larger-scale business uses, timeouts remain a concern for business continuity and reliability. As we were writing this review, ChatGPT went down for reasons not specified by OpenAI.
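One common way to soften intermittent timeouts on the client side is to retry with exponential backoff. The sketch below is a generic illustration, not OpenAI's recommended code: `request_fn` is a hypothetical zero-argument callable wrapping whatever client call you make, and we assume it raises Python's built-in `TimeoutError`, whereas a real SDK will likely raise its own exception types.

```python
import random
import time


def call_with_retries(request_fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call request_fn, retrying on TimeoutError with exponential backoff.

    request_fn is a hypothetical placeholder, not part of any real SDK.
    The sleep parameter is injectable so the logic can be tested quickly.
    """
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Wait 1x, 2x, 4x ... the base delay, plus random jitter so that
            # many clients do not all retry at the same instant.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Retries smooth over brief hiccups but do not help with a prolonged outage, so business-critical workflows would still need a fallback path.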
Takeaway: Stability and capacity are currently issues.
Usage Limitations and Scale
As we all eagerly embrace GPT-4 Turbo, let’s remember that our ability to work with the model is somewhat limited. Access via the OpenAI API is capped at a restricted number of requests per month. In its current preview stage, GPT-4 Turbo’s rate limits are enough for some testing by some organizations. For many enterprise-level organizations, however, testing during the preview stage may be challenging.
While OpenAI has indicated that it won’t be accommodating rate limit increases for this model at this time, its public announcement indicated that it may relax some of these limitations or introduce limit increase options.
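Until limits are raised, teams testing against a fixed quota can at least avoid burning requests on rejected calls by throttling on the client side. The sketch below shows one way to do that with a sliding-window counter. The class name and the numbers in the usage example are hypothetical; they are not OpenAI's actual limits, which you should take from your own account settings.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so tests can use a fake clock
        self.sent = deque()  # timestamps of requests still inside the window

    def try_acquire(self):
        """Return True and record the request if it fits in the budget."""
        now = self.clock()
        # Drop timestamps that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window_seconds:
            self.sent.popleft()
        if len(self.sent) < self.max_requests:
            self.sent.append(now)
            return True
        return False


# Hypothetical budget for illustration: 3 requests per 60 seconds.
limiter = SlidingWindowLimiter(max_requests=3, window_seconds=60)
if limiter.try_acquire():
    pass  # safe to send the request here
```

A caller that gets `False` back can queue the request or wait, rather than sending it and receiving a rate limit error.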
Takeaway: In its current preview stage, usage limitations make GPT-4 Turbo potentially unusable for business applications.
Conclusion
In the ever-evolving landscape of OpenAI, we’re constantly inundated with claims that the latest model is the grandest, the best, the most capable. GPT-4 Turbo is surely impressive. But the continued need for accurate retrieval, some stability issues, and current usage limitations show that there’s some nuance to the hype. Studies continue to support our conclusion that RAG isn’t going anywhere for the time being, for reasons of both accuracy and cost-effectiveness.
As organizations like OpenAI and Google continue to release impressive models, we at Proxet will continue to review, learn, and educate our clients and partners on what’s real and what’s just hype. Learn more about us at www.proxet.com and see how we can help your engineering, technology, and product development strategies.