Benchmarks play an important role in the world of Large Language Models (LLMs), not least as a marketing tool. No announcement of new models from OpenAI, Google, Anthropic and the like comes without reference to some record score. Whether in programming, math problems, or general reasoning skills: new records are set practically every week. At least, that is the impression the companies themselves convey. It is an impression that a team of researchers at the Oxford Internet Institute is now fundamentally calling into question.