Are Meta’s AI Benchmarks Telling the Whole Truth?

Benchmarks are a fundamental pillar in evaluating the effectiveness and efficiency of AI models, serving as a standard against which new systems and algorithms can be assessed. Lately, however, Meta's newly released model, Maverick, has been in the spotlight.
The model drew widespread public attention after researchers noticed a mismatch between two versions: according to reports, the version tested on well-known benchmarks and the one released to developers were not the same.
According to a TechCrunch report, Maverick ranked second on LM Arena, yet the version submitted there was not identical to the public release. In its blog post, Meta acknowledged that the LM Arena variant was an experimental chat version, and that it differed from the standard model available to developers.
Typically, companies submit unaltered versions of their models to benchmarking platforms, and sites like LM Arena exist precisely so that people can gauge real-world performance. Meta, however, chose to submit a modified variant while making only the standard version openly available to the public.
This can lead developers to misjudge the model's actual performance, and it undermines the purpose of benchmarks, which are supposed to serve as consistent performance snapshots.