MCLEAN, Va.--(BUSINESS WIRE)--The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark for evaluating large language models (LLMs) on aerospace tasks. Given the safety-critical nature of aerospace, it is imperative that LLMs undergo thorough evaluation before they are integrated into aerospace systems.

The Aerospace Language Understanding Evaluation (ALUE) benchmark provides a crucial tool for guiding the assurance of LLMs tailored to the unique demands of the aerospace domain. It incorporates diverse datasets and tasks and introduces several metrics for evaluating the correctness of LLM-generated responses.

ALUE is designed to streamline and improve LLM evaluation and inference on aerospace domain-specific information. The versatile benchmark supports custom datasets, open-source and domain-specific LLMs, user-defined prompts, and various quantitative performance metrics. Such evaluations are essential not only for assessing a model’s performance but also for understanding its inherent limitations and potential risks, including hallucinations, biases, and privacy concerns.

“MITRE has deep expertise in both aviation safety and AI adoption, and is aligned with the FAA’s mission to provide the safest and most efficient aerospace in the world,” said Kerry Buckley, Ph.D., MITRE vice president and director, Center for Advanced Aviation System Development (CAASD). “ALUE allows the FAA and the aerospace community to create a definitive library of diverse and specific aviation nomenclature and terms that will enable the agency to harness the power of AI for tools and tasks that will continuously improve safety and efficiency today and into the future.”

Ongoing work will continue to expand the benchmark’s complexity and scope to address more intricate real-world aerospace challenges. This includes developing tasks for extracting complex information from charts, such as airspace boundaries or navigational aids, which require sophisticated spatial and symbolic reasoning.

Future work will also incorporate tasks that require LLMs to consult external data sources, such as aircraft operational manuals, to determine precise parameters such as flap and thrust settings under specific conditions, moving beyond simple information extraction to knowledge application.

CAASD’s engineers, scientists, and analysts pair cross-disciplinary capabilities with deep mission-centric expertise to deliver impactful solutions to advance aviation and aerospace safety.

ALUE is available via GitHub to airlines, academia, and aerospace stakeholders who are using or considering using LLMs on aerospace data. Active community collaboration is important for enhancing the benchmark with additional curated datasets and tasks, and organizations can run the benchmark on their own machines. ALUE is a starting point for assuring sophisticated and reliable AI tools that enhance the safety and efficiency of the National Airspace System.

Reference: Aerospace Language Understanding Evaluation (ALUE): Large Language Benchmark with Aerospace Datasets, AIAA

