Yahoo Releases the Largest-ever Machine Learning Dataset for Researchers

13.5TB dataset to help advance innovation in computer science

SUNNYVALE, Calif. -- Yahoo Inc. (NASDAQ: YHOO) today announced the public release of the largest-ever machine learning dataset to the academic research community. With this release, the company aims to advance the field of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research.

“Many academic researchers and data scientists don’t have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies,” said Suju Rajan, director of research, Yahoo Labs. “We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems.”

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. The dataset stands at a massive ~110B events (13.5TB uncompressed) of user-news item interaction data, collected by recording the user-item interactions of about 20M users from February 2015 to May 2015.
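For a rough sense of scale, the figures above imply some simple averages. The short Python sketch below works them out; it assumes 1 TB = 10^12 bytes and treats the approximate counts in the release as exact, so the results are only ballpark estimates. The function and constant names are illustrative and are not part of the dataset.

```python
# Back-of-envelope averages from the figures quoted in the release.
# Assumptions (not stated in the release): 1 TB = 10**12 bytes, and the
# approximate counts ("~110B", "about 20M") are treated as exact.
TOTAL_BYTES = 13.5 * 10**12   # 13.5 TB uncompressed
TOTAL_EVENTS = 110 * 10**9    # ~110B user-item interaction events
TOTAL_USERS = 20 * 10**6      # about 20M users


def bytes_per_event(total_bytes, total_events):
    """Average uncompressed size of one event, in bytes."""
    return total_bytes / total_events


def events_per_user(total_events, total_users):
    """Average number of events recorded per user."""
    return total_events / total_users


if __name__ == "__main__":
    print(f"bytes per event: {bytes_per_event(TOTAL_BYTES, TOTAL_EVENTS):.0f}")
    print(f"events per user: {events_per_user(TOTAL_EVENTS, TOTAL_USERS):.0f}")
```

Under these assumptions, the averages come to roughly 123 bytes per event and about 5,500 events per user over the collection period.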

“Yahoo’s release of the Yahoo News Feed dataset is a significant contribution to the research community. Academic researchers everywhere will finally have access to realistic scale data to study how to automatically discover which news articles are of interest to which users, and will be able to compare their methods using this as a shared test case,” said Tom Mitchell, machine learning department chair, Carnegie Mellon University. “Here at CMU we’ll certainly be using it for our research.”

The dataset provides categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, the title, summary, and key phrases of each news article are also included. Interaction data is timestamped with the user’s local time and also contains partial information about the device used to access the news feeds.

“Access to datasets of this size is essential to design and develop machine learning algorithms and technology that scales to truly ‘big’ data,” said Gert Lanckriet, professor, Department of Electrical and Computer Engineering, University of California, San Diego. “At the Jacobs School of Engineering at UC San Diego, it will directly and significantly benefit the wide variety of ongoing research in machine learning, artificial intelligence, information retrieval, and big data applications.”

“At the UMass Amherst Center for Data Science we have broad interests in developing new methods for scalable analytics on a wide variety of big-data domains,” said Andrew McCallum, director of the Center and professor in the College of Information and Computer Sciences. “The release of this large Yahoo News Feed dataset will be a tremendous asset for the academic research community, and for us at UMass particularly, given our major research activities in natural language processing, information retrieval, databases and computational social science.”

About the Webscope program:

The dataset is available as part of the Yahoo Labs Webscope data-sharing program, a reference library of scientifically useful datasets composed of anonymized user data for non-commercial use. The dataset we are releasing today is governed by our commitment to safeguard our users’ privacy and follows our practice of protecting and anonymizing user data.

About Yahoo Labs:

Yahoo Labs is the scientific engine guiding Yahoo innovation while powering impactful products for Yahoo’s users, partners, and advertisers. Yahoo Labs serves as Yahoo’s research arm: its incubator for bold new ideas and laboratory for rigorous experimentation. Yahoo Labs applies its scientific findings in powering products for Yahoo’s users and enhancing value for its partners and advertisers. Yahoo Labs’ forward-looking innovation also helps position Yahoo as an industry and scientific thought leader. For more information, visit Yahoo Labs’ blog.

About Yahoo:

Yahoo is a guide focused on informing, connecting, and entertaining our users. By creating highly personalized experiences for our users, we keep people connected to what matters most to them, across devices and around the world. In turn, we create value for advertisers by connecting them with the audiences that build their businesses. Yahoo is headquartered in Sunnyvale, California, and has offices located throughout the Americas, Asia Pacific (APAC) and the Europe, Middle East and Africa (EMEA) regions. For more information, visit the pressroom or the Company’s blog.


Yahoo Inc.
Fred Han, 415-713-1562

Release Summary

Yahoo is announcing the public release of the largest-ever machine learning dataset to the research community.

