Meta’s VP of GenAI denies manipulating Llama 4’s benchmark scores
The Hindu
Meta’s VP of GenAI, Ahmad Al-Dahle, posted a statement on X denying allegations that the company had manipulated its AI models to perform better on certain benchmarks while hiding their limitations. He also addressed complaints that the Llama 4 models didn’t offer the high-quality performance that was promised.
“We’re already hearing lots of great results people are getting with these models. That said, we’re also hearing some reports of mixed quality across different services. Since we dropped the models as soon as they were ready, we expect it’ll take several days for all the public implementations to get dialed in,” he noted.
He added that Meta was still working to fix bugs and that any drop in quality users were seeing should subside as the public implementations stabilised.
“We’ve also heard claims that we trained on test sets -- that’s simply not true and we would never do that,” he stated.
A test set is data held out from training and used afterwards to measure an AI model’s performance. Training on a test set would artificially inflate a model’s benchmark scores, making it appear more capable than it actually is.
The rumour started after a post went viral online, purportedly written by a former employee who claimed to have quit Meta over the company’s questionable benchmarking practices.
The viral post was never verified, but it sparked questions and concerns among Meta AI users.