Yandex releases world's largest event dataset for advancing recommender systems

May 29, 2025, 18:05

Yandex introduces the world's largest currently available dataset for recommender systems, advancing research and development on a global scale.
The open dataset contains 4.79B anonymized user interactions (listens, likes, dislikes) from the Yandex music streaming service collected over 10 months.
The dataset includes anonymized audio embeddings, organic interaction flags, and precise timestamps for real-world behavioral analysis.
It introduces Global Temporal Split (GTS) evaluation to preserve event sequences, paired with baseline algorithms for reference points.
The dataset is available on Hugging Face in three sizes — 5B, 500M, and 50M events — to accommodate diverse research and development needs.

SINGAPORE, May 29, 2025 /PRNewswire/ -- Yandex has published Yambda (Yandex Music Billion-Interactions Dataset), the world's largest open dataset for recommender systems, containing nearly 5 billion anonymized user interactions with audio tracks from its music streaming platform, Yandex Music.

Yambda serves as a universal benchmark for testing new approaches and algorithms across all domains utilizing recommender systems — e-commerce, social networks, and short-form video platforms.

The dataset enables researchers to develop and test new recommender algorithms against its baseline models, accelerating innovation. Startups with limited data can use Yambda to build and test systems before scaling, which speeds up the creation of advanced technologies tailored to business needs worldwide.

Bridging the research-industry gap

The quality and scale of training data are critical to delivering relevant recommendations on platforms like streaming services, social networks, short-form video apps, and e-commerce marketplaces. However, research in recommender systems has lagged behind rapidly advancing fields like large language models, largely due to limited access to large-scale datasets. Effective recommendation models require terabytes of behavioral data, which commercial platforms possess but rarely share publicly.

Researchers are often left with small, outdated datasets that fail to capture the complexity of modern usage:

Spotify's Million Playlist Dataset is too small for commercial-scale recommender systems.
The Netflix Prize dataset, with ~17,000 items and date-only timestamps, limits temporal modeling and large-scale research.
The Criteo 1TB Click Logs dataset lacks proper documentation and identifiers, and focuses narrowly on ad clicks.

"Recommender systems are inherently tied to sensitive data. Companies can only publish recommender system datasets publicly after exhaustive anonymization, a resource-intensive process that's slowed open innovation," explains Nikolai Savushkin, Head of Recommender Systems at Yandex.

This data scarcity creates a gap: models that excel in academic settings often underperform in real-world applications. Efforts to integrate recommender systems with advanced architectures are also constrained by the lack of suitable training data.

About the Yambda dataset

Yambda addresses these challenges by providing a massive, anonymized dataset from Yandex Music, a music streaming service with ~28 million monthly users. The dataset provides insights into how users interact with the content offered by Yandex Music, which is known for My Wave, a sophisticated recommendation system that tailors the listening experience to the tastes of each user. To protect privacy, all user and track data is anonymized and represented by numeric identifiers to meet privacy standards.

Key features of the dataset:

4.79 billion anonymized user interactions collected over 10 months.
Data from 1 million users and anonymized descriptors for 9.39 million tracks.
Includes two feedback types: implicit interactions (listens) and explicit interactions (likes, dislikes, and their removal).
Offers audio embeddings (vector representations generated via convolutional neural networks) and anonymized information about tracks.
Features an "is_organic" flag marking whether users discovered tracks independently or through recommendations, enabling deeper behavioral analysis.
All events are timestamped, which supports the analysis of user behavior over time and allows models to be evaluated under conditions that closely resemble real-world use.
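
As a rough illustration of how the features listed above might be used together, the following sketch separates implicit from explicit feedback and measures the share of organic listens. The record layout, field names, and event labels are hypothetical and chosen for readability; they are not taken from the dataset's documented schema, and the sample events are made up.

```python
from dataclasses import dataclass

# Hypothetical record layout: field names and event labels are illustrative
# assumptions, not the dataset's documented schema.
@dataclass(frozen=True)
class Event:
    user_id: int
    track_id: int
    timestamp: int     # seconds since the start of the collection window
    kind: str          # e.g. "listen", "like", "dislike", "unlike", "undislike"
    is_organic: bool   # True if the user found the track without a recommendation

IMPLICIT_KINDS = {"listen"}

def split_feedback(events):
    """Separate implicit events (listens) from explicit ones (likes, dislikes, removals)."""
    implicit = [e for e in events if e.kind in IMPLICIT_KINDS]
    explicit = [e for e in events if e.kind not in IMPLICIT_KINDS]
    return implicit, explicit

def organic_share(events):
    """Fraction of events marked organic; 0.0 for an empty list."""
    if not events:
        return 0.0
    return sum(e.is_organic for e in events) / len(events)

# Made-up sample events for demonstration only.
events = [
    Event(1, 10, 100, "listen", True),
    Event(1, 11, 160, "listen", False),
    Event(1, 11, 170, "like", False),
    Event(2, 10, 200, "listen", False),
    Event(2, 12, 260, "dislike", True),
]

implicit, explicit = split_feedback(events)
print(len(implicit), len(explicit))        # 3 2
print(round(organic_share(implicit), 2))   # 0.33
```

Keeping the organic flag on every event lets an analysis compare behavior on self-discovered tracks with behavior on recommended ones without a second data source.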

The dataset is released in Apache Parquet format, compatible with distributed processing systems such as Spark or Hadoop and analytical libraries like Pandas and Polars.

"Yambda empowers researchers to test innovative hypotheses and businesses to build smarter recommender systems. Ultimately, users benefit — finding the perfect song, product, or service effortlessly," notes Nikolai Savushkin.

Dataset versions and evaluation

Available in three sizes — approximately 5 billion, 500 million, and 50 million events — the Yambda dataset accommodates researchers and developers with different needs and computational resource capacities.

The dataset uses Global Temporal Split (GTS) for evaluation, a method that splits data by timestamps to preserve event sequences. Unlike Leave-One-Out, which removes the last positive interaction from each user's history for testing, GTS avoids breaking temporal dependencies between training and test sets. This makes model testing more realistic, mimicking real-world conditions where future data is unavailable.
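
The difference between the two splitting schemes can be sketched on a handful of toy events. This is a minimal illustration, not the evaluation code shipped with Yambda: the single cutoff timestamp, the tuple layout, and the function names are assumptions made for the example.

```python
from collections import defaultdict

# Toy interactions as (user, item, timestamp) tuples; all values are made up.
events = [
    ("u1", "a", 1), ("u1", "b", 5), ("u1", "c", 9),
    ("u2", "a", 2), ("u2", "d", 3),
    ("u3", "b", 8), ("u3", "e", 10),
]

def global_temporal_split(events, cutoff):
    """Events before `cutoff` form the training set; all later events form the test set."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test

def leave_one_out_split(events):
    """Hold out each user's last interaction, regardless of when it happened."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: e[2]):
        by_user[e[0]].append(e)
    train, test = [], []
    for history in by_user.values():
        train.extend(history[:-1])
        test.append(history[-1])
    return train, test

gts_train, gts_test = global_temporal_split(events, cutoff=8)
loo_train, loo_test = leave_one_out_split(events)

def train_precedes_test(train, test):
    """True if every training event is earlier than every test event."""
    return max(t for _, _, t in train) < min(t for _, _, t in test)

print(train_precedes_test(gts_train, gts_test))  # True
print(train_precedes_test(loo_train, loo_test))  # False
```

In the Leave-One-Out case, the held-out event of user u2 (timestamp 3) is older than several training events from other users, so a model could learn from interactions that happened after the moment it is asked to predict. The global cutoff rules this out by construction.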

Baseline implementations include MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec, providing benchmarks for comparing new recommender system approaches. These baselines are evaluated using standard metrics, including:

NDCG@k (ranking quality)
Recall@k (retrieval effectiveness)
Coverage@k (catalog diversity)
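
For readers unfamiliar with these metrics, the sketch below shows one common way to compute them for binary relevance. Definitions vary between papers (for example, Recall@k is sometimes normalized by min(k, number of relevant items)), so this should not be read as the exact formulation used in the Yambda benchmark. The function names and toy data are assumptions, and each recommendation list is assumed to contain no duplicates.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: discounted gain of the hits divided by the ideal gain."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def coverage_at_k(recommendation_lists, catalog_size, k):
    """Share of the catalog that appears in at least one top-k list."""
    shown = set()
    for recs in recommendation_lists:
        shown.update(recs[:k])
    return len(shown) / catalog_size

# Toy example: two users, a catalog of five items, k = 2.
recs = {"u1": ["a", "b", "c"], "u2": ["a", "d", "b"]}
truth = {"u1": {"b"}, "u2": {"d", "e"}}

for user in recs:
    print(user, recall_at_k(recs[user], truth[user], 2),
          round(ndcg_at_k(recs[user], truth[user], 2), 4))
print(coverage_at_k(recs.values(), catalog_size=5, k=2))
```

NDCG rewards placing hits near the top of the list, Recall only counts whether they appear at all, and Coverage is computed across all users rather than per user.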

"When industry leaders share hard-won tools and data, a rising tide lifts all boats: researchers gain real-world benchmarks, startups access resources once reserved for tech giants, and users everywhere enjoy greater personalization," added Nikolai Savushkin.

Yambda, the world's largest open recommender system dataset, is now available on Hugging Face.

About Yandex

Yandex is a global technology company that builds intelligent products and services powered by machine learning. The company's goal is to help consumers and businesses better navigate the online and offline world. Since 1997, Yandex has been delivering world-class, locally relevant search and information services and has also developed market-leading on-demand transportation services, navigation products, and other mobile applications for millions of consumers across the globe.

About My Wave

My Wave, a personalized recommendation system integrated into the multi-million-user music streaming service, Yandex Music, employs deep neural models and AI algorithms to analyze over a thousand factors — including user interactions, customizable mood/language settings, and real-time music analysis of spectrograms, frequency ranges, rhythm, vocal tone, and genre. By processing listening history and track sequences, it dynamically adapts to user preferences, identifies audio similarities, and predicts musical tastes to deliver tailored suggestions.

View original content: https://www.prnewswire.com/news-releases/yandex-releases-worlds-largest-event-dataset-for-advancing-recommender-systems-302468616.html

SOURCE Yandex
