Representation learning for music classification and retrieval: bridging the gap between natural language and music semantics

Won, Minz

Representation learning for music classification and retrieval: bridging the gap between natural language and music semantics / Minz Won. -- [Barcelona]: Universitat Pompeu Fabra. Departament de Tecnologies de la Informació i les Comunicacions, 2021

1 online resource (200 pages)


Supervisors: Xavier Serra i Casals; Horacio Saggion, Dept. of Information and Communication Technologies.
Defense date: 01-07-2022


The explosion of digital music has dramatically changed our music consumption behavior. Massive digital music libraries are now available through streaming platforms. Since the amount of information available to an individual listener has increased greatly, it is nearly impossible for them to go through the entire catalog exhaustively. As a result, we need robust knowledge management systems more than ever. Recent advances in deep learning have enabled data-driven music representation learning for classification and retrieval. However, there is still a gap between machine-learned representations and the human understanding of music. This dissertation aims at reducing this semantic gap in order to assist listeners' behavior around music information with advanced algorithmic support. To this end, we tackle three main challenges in representation learning: model architecture design, scalability, and multimodality. Firstly, we carefully review previous deep representation models and propose new architectures that improve the representation in qualitative and quantitative ways. The newly proposed models are more flexible, interpretable, and powerful than previous ones. Secondly, training schemes beyond supervised learning are explored as a way to achieve scalable research. Transfer learning, semi-supervised learning, and self-supervised learning approaches are addressed in detail; transfer learning and semi-supervised methods are applied to enhance music representation learning. Finally, metric learning is proposed as a way to bridge music audio representation and natural language semantics, forming a multimodal embedding space. This facilitates music retrieval using arbitrary tags beyond a fixed vocabulary, and makes it possible to match music to text stories based on mood. Although our work focuses on bridging music and natural language semantics, we believe the proposed approaches generalize to other modalities.
All implementation details of this thesis are available as open source for reproducibility. The knowledge gained throughout this thesis has been put into practice and grounded in research internships and collaborations with multiple industry partners.


Modality
Music and technology


Theses and academic writings


Saggion, Horacio
Serra, Xavier
