Software utensils: Thinking model expense is hard to predict

When Opus 4.6 came out, my colleagues were very excited: prompts that had previously elicited incorrect or disorganized answers were now yielding correct analysis. It was a significant advance, and the change really highlighted how central model quality is to our experience. The widely held opinion was that Anthropic was at the top of the quality rankings, which was consistent with what we saw when comparing it with GPT. But what about the many other models available? I kept hearing about high-quality open source models that could rival the frontier models we usually focused on, but up to that point I had not bothered to actually test them.

Since our access came via OpenRouter, it was relatively easy to run some queries across a wider variety of models without having to set up separate accounts, and the results were intriguing. Opus did the best and was, as promised, pretty expensive. To my surprise, though, some models with a reputation for being cheap turned out nearly as expensive, despite producing much worse answers and despite having much lower headline input and output token prices.

The key is that these were all "thinking" models, which also charge for intermediate reasoning tokens, and the number of those tokens apparently cannot be predicted accurately. In my experience the reasoning token counts were all over the map, so they can easily overwhelm the headline prices, and the only practical way to predict cost is to run actual tests. This makes me think we really need a tool that can easily run our prompts across a wide variety of models, evaluate their performance, and give us some grounding for our view of how the models stack up in terms of cost.
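
To make the arithmetic concrete, here is a minimal sketch of how reasoning tokens can swamp a low headline price. Everything in it is an assumption for illustration: the `Pricing` class, the `request_cost` function, the per-million-token prices, and the token counts are all made up and are not measurements from my tests or real price lists. It also assumes reasoning tokens are billed at the output rate, which is worth checking for any given provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical headline prices in dollars per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def request_cost(pricing: Pricing, input_tokens: int,
                 visible_output_tokens: int, reasoning_tokens: int = 0) -> float:
    """Dollar cost of one request.

    Assumption: reasoning tokens are billed like output tokens.
    """
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_mtok
             + billed_output * pricing.output_per_mtok)
    return total / 1_000_000


# Illustrative numbers only: a "premium" model and a "budget" model
# whose headline prices are 15x lower.
premium = Pricing(input_per_mtok=15.0, output_per_mtok=75.0)
budget = Pricing(input_per_mtok=1.0, output_per_mtok=5.0)

# Same prompt and same visible answer length; the budget model is
# assumed to burn 30x more reasoning tokens on the way there.
premium_cost = request_cost(premium, 2_000, 800, reasoning_tokens=1_000)
budget_cost = request_cost(budget, 2_000, 800, reasoning_tokens=30_000)

print(f"premium: ${premium_cost:.3f}  budget: ${budget_cost:.3f}")
```

With these made-up numbers the two requests land within a cent of each other even though the headline prices differ by a factor of 15. Since the reasoning token count is the one input I could not guess ahead of time, the only reliable way to fill it in is to read it from the usage data of real test runs.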

Wednesday, April 29, 2026
