Let's Play Jeopardy! with LLMs

Update 2024-05-14: Hot off the presses, the benchmark now includes the recently released GPT-4o model!

How good are LLMs at trivia? I used the Jeopardy! dataset from Kaggle to benchmark ChatGPT and the new Llama 3 models. Here are the results:

There you go. You’ve already gotten 90% of what you’re going to get out of this article. Some guy on the internet ran a half-baked benchmark on a handful of LLMs, and the results were largely in line with popular benchmarks and received wisdom on fine-tuning and RAG.
Read more...
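
For a rough idea of what scoring a trivia benchmark like this involves, here is a minimal sketch. The excerpt does not say how answers were graded, so the normalized exact-match rule below is an assumption, and `normalize`, `is_correct`, and `accuracy` are hypothetical helper names, not taken from the article.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, strip a leading 'What is'-style phrase, punctuation, and articles."""
    text = text.lower().strip()
    # Jeopardy! responses are phrased as questions; drop that framing.
    text = re.sub(r"^(what|who|where|when)\s+(is|are|was|were)\s+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def is_correct(response: str, expected: str) -> bool:
    """Assumed grading rule: exact match after normalization."""
    return normalize(response) == normalize(expected)


def accuracy(pairs) -> float:
    """Fraction of (model_response, expected_answer) pairs graded correct."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(is_correct(r, e) for r, e in pairs) / len(pairs)
```

Exact match is deliberately strict: it will mark a response wrong for a spelling variant or a partial name, so a real benchmark would likely need a more forgiving grader.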

My Dinner with ChatGPT

It's hard to talk about ChatGPT without cherry-picking. It's too easy to try a dozen different prompts, refresh each a handful of times, and report the most interesting or impressive thing from those sixty trials. While this problem plagues a lot of the public discourse around generative models, cherry-picking is particularly problematic for ChatGPT because it's actively using the chat history as context. (It might be using an $\mathcal{O}(n \log n)$ attention model like Reformer or it might just be brute-forcing it, but either way it has an impressively long memory; about 2048 "
Read more...
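
To see why sixty trials make cherry-picking so easy, here is a small sketch of the arithmetic. It assumes each trial independently produces an "impressive" output with some fixed probability `p`; both the independence and the 5% figure are illustrative assumptions, not numbers from the post.

```python
def chance_of_at_least_one_hit(p: float, trials: int) -> float:
    """Probability that at least one of `trials` independent attempts succeeds."""
    return 1.0 - (1.0 - p) ** trials


# Hypothetical: a 5% chance per trial of something worth screenshotting,
# over a dozen prompts refreshed five times each.
print(f"{chance_of_at_least_one_hit(0.05, 12 * 5):.1%}")
```

Under those assumptions, a rare per-trial event becomes close to a sure thing across the whole batch, which is why reporting only the best result says little about typical behavior.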