The AIFARMS team is excited to announce that two papers on complementary benchmark data sets – MIRAGE and AgMMU – have been accepted to the NeurIPS 2025 Datasets & Benchmarks Track. Both data sets are curated from thousands of real-world user-expert conversations collected over many years by the Extension Foundation’s Ask Extension service, which provides professional advice to the public on agriculture and gardening topics. AgMMU focuses on factual knowledge and visual understanding of images, using multiple-choice and short-form answers. MIRAGE focuses on long-form answers for multimodal, expert-level reasoning and decision-making in real-world consultative settings. The MIRAGE metrics pipeline, which uses “judge LLMs” to evaluate long-form answers, is also being used for the upcoming AI AgriBench Benchmarking Consortium, which includes several leading AgTech companies and will be launched soon. For more information and links, read on:
AgMMU Paper at NeurIPS Datasets & Benchmarks Track
The AgMMU benchmark is built from 110K real-world user-expert conversations from the Ask Extension service. AgMMU curates this material into a high-quality multimodal evaluation benchmark targeting the question: Do existing vision-language models have the precise factual knowledge and reliable visual understanding needed to answer agricultural questions?

Key Results: Even state-of-the-art vision-language models cannot reliably answer these multimodal agricultural questions. However, their capabilities improve consistently with fine-tuning on our multimodal agricultural knowledge base, a finding that calls for more attention to agricultural knowledge and data when developing foundation models.
- Project Web Page: https://agmmu.github.io/
- Paper: https://arxiv.org/abs/2504.10568
- Dataset: https://huggingface.co/datasets/AgMMU/AgMMU_v1
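For readers who want to explore the data directly, here is a minimal sketch of loading AgMMU from the Hugging Face Hub with the Python datasets library. The split name used below is an assumption; check the dataset card linked above for the actual configurations and splits.

```python
# Minimal sketch: browsing the AgMMU dataset from the Hugging Face Hub.
# Requires the `datasets` library (pip install datasets).
# The split name ("train") is an assumption; see the dataset card for
# the actual configurations and splits.
from datasets import load_dataset

agmmu = load_dataset("AgMMU/AgMMU_v1", split="train")

print(agmmu)      # dataset size and column names
print(agmmu[0])   # inspect one user-expert example
```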
MIRAGE Paper at NeurIPS Datasets & Benchmarks Track
The MIRAGE benchmark is built on over 35,000 real user–expert conversations from the Ask Extension service. MIRAGE addresses the critical question: Can today’s large vision–language models be trusted when questions are technically complex and often open-ended, images are messy, and decisions have real-world consequences? MIRAGE introduces two complementary challenges:
- Multimodal Single-Turn – measuring grounded reasoning and decision quality given an image, metadata, and a single query
- Multimodal Multi-Turn – testing conversational abilities, including clarification-seeking, context tracking, and iterative decision-making across turns

Key Results: Even the most advanced frontier models struggle to maintain accuracy and reliability. There is a need for models that can reason over multimodal inputs, track context over multiple turns, and provide trustworthy, actionable outputs.
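As noted in the introduction, long-form answers like MIRAGE’s are scored with a “judge LLM” metrics pipeline. The sketch below illustrates the general LLM-as-judge idea only; it is not the actual MIRAGE pipeline, and the judge model name, rubric wording, and 1–5 scale are illustrative assumptions.

```python
# Generic LLM-as-judge sketch for scoring a long-form answer against an
# expert reference. This is NOT the MIRAGE metrics pipeline; the judge
# model name, rubric, and 1-5 scale are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_long_form_answer(question: str, reference: str, candidate: str) -> str:
    """Ask a judge LLM to rate a candidate answer against an expert reference."""
    prompt = (
        "You are grading an agricultural advice answer.\n"
        f"Question: {question}\n"
        f"Expert reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Rate the candidate from 1 (poor) to 5 (excellent) for factual accuracy "
        "and actionability, then briefly justify the score."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,   # deterministic judging
    )
    return response.choices[0].message.content
```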
- Project Web Page: https://mirage-benchmark.github.io/
- Paper: https://arxiv.org/abs/2506.20100
- Dataset: https://huggingface.co/MIRAGE-Benchmark