MITRE and FAA Introduce Novel Aerospace Large Language Model Evaluation Benchmark

Aerospace Language Understanding Evaluation (ALUE) Benchmark Enables Thorough Evaluation of LLMs for Aerospace Tasks

MCLEAN, Va.--(BUSINESS WIRE)--The Federal Aviation Administration (FAA) and MITRE are introducing a new benchmark to enable the evaluation and assessment of large language models (LLMs) for aerospace tasks. Given the safety-critical nature of aerospace, it is imperative that LLMs undergo thorough evaluation prior to their integration into systems.

The Aerospace Language Understanding Evaluation (ALUE) benchmark provides a crucial tool for guiding the assurance of LLMs tailored to the unique demands of the aerospace domain. It incorporates diverse datasets and tasks and introduces several metrics for evaluating the correctness of LLM-generated responses.

ALUE is designed to streamline and improve the evaluation and inference of LLMs using aerospace domain-specific information. The versatile benchmark supports custom datasets, open-source and domain-specific LLMs, user-defined prompts, and various quantitative performance metrics. Such evaluations are essential not only for assessing a model’s performance but also for understanding its inherent limitations and potential risks, including issues such as hallucinations, biases, and privacy concerns.
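As a rough illustration of how a custom dataset, a user-defined prompt, and a quantitative metric fit together in an evaluation of this kind, consider the minimal Python sketch below. It does not use ALUE's actual interface: the names `exact_match`, `evaluate`, `toy_model`, and `toy_dataset`, as well as the two sample questions, are hypothetical and chosen only for illustration, and exact match is just one simple correctness metric among many.

```python
from typing import Callable, Dict, List


def exact_match(prediction: str, reference: str) -> float:
    """Return 1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def evaluate(
    model: Callable[[str], str],
    dataset: List[Dict[str, str]],
    prompt_template: str,
    metric: Callable[[str, str], float],
) -> float:
    """Average a metric over a dataset, using a user-defined prompt template."""
    scores = []
    for example in dataset:
        prompt = prompt_template.format(question=example["question"])
        scores.append(metric(model(prompt), example["answer"]))
    return sum(scores) / len(scores) if scores else 0.0


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call; it only knows one acronym."""
    return "Very High Frequency" if "VHF" in prompt else "unknown"


# Hypothetical two-item dataset, not taken from ALUE.
toy_dataset = [
    {"question": "What does VHF stand for?", "answer": "very high frequency"},
    {"question": "What does ATC stand for?", "answer": "air traffic control"},
]

PROMPT = "Answer concisely. Question: {question}"

score = evaluate(toy_model, toy_dataset, PROMPT, exact_match)
print(f"exact-match score: {score:.2f}")
```

Swapping in a different model callable, dataset, prompt template, or metric function changes the evaluation without touching the loop, which is the kind of flexibility the paragraph above describes.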

“MITRE has deep expertise in both aviation safety and AI adoption, and is aligned with the FAA’s mission to provide the safest and most efficient aerospace in the world,” said Kerry Buckley, Ph.D., MITRE vice president and director, Center for Advanced Aviation System Development (CAASD). “ALUE allows the FAA and the aerospace community to create a definitive library of diverse and specific aviation nomenclature and terms that will enable the agency to harness the power of AI for tools and tasks that will continuously improve safety and efficiency today and into the future.”

Ongoing work will continue to expand the benchmark’s complexity and scope to address more intricate real-world aerospace challenges. This includes developing tasks for extracting complex information from charts, such as airspace boundaries or navigational aids, which require sophisticated spatial and symbolic reasoning.

Future work will also incorporate tasks that require LLMs to consult external data sources, such as aircraft operational manuals, to determine precise parameters such as flap and thrust settings under specific conditions, moving beyond simple information extraction to knowledge application.

CAASD’s engineers, scientists, and analysts pair cross-disciplinary capabilities with deep mission-centric expertise to deliver impactful solutions to advance aviation and aerospace safety.

ALUE is available via GitHub to airlines, academia, and aerospace stakeholders who are using or considering using LLMs on aerospace data. Organizations can run the benchmark on their own machines, and active community collaboration is important for enhancing it with additional curated datasets and tasks. ALUE is a starting point for assuring that AI tools are sophisticated and reliable enough to enhance the safety and efficiency of the National Airspace System.

Reference: Aerospace Language Understanding Evaluation (ALUE): Large Language Benchmark with Aerospace Datasets, AIAA

About MITRE

MITRE’s mission-driven teams are dedicated to driving solutions to our nation’s most pressing challenges. As a not-for-profit research and development organization, MITRE’s staff leverage our unique multi-sponsor vantage point, systems expertise, and innovative solutions to ensure the health, prosperity, and security of our nation. www.mitre.org

Contacts

Media Contact: Jordan Graham at media@mitre.org

