:::info Authors:
(1) Daniele Malitesta, Politecnico di Bari, Italy ([email protected], corresponding author);
(2) Giuseppe Gassi, Politecnico di Bari, Italy ([email protected], corresponding author);
(3) Claudio Pomo, Politecnico di Bari, Italy ([email protected]);
(4) Tommaso Di Noia, Politecnico di Bari, Italy ([email protected]).
:::
Abstract and 1 Introduction and Motivation
2 Architecture and 2.1 Dataset
5 Demonstrations and 5.1 Demo 1: visual + textual items features
5.2 Demo 2: audio + textual items features
5.3 Demo 3: textual items/interactions features
6 Conclusion and Future Work, Acknowledgments and References
5 DEMONSTRATIONS

This section proposes three use cases (i.e., demos) that show some of the main functionalities of Ducho and how to exploit them within a complete multimodal extraction pipeline. The guidelines and code to run the demos (i) on your local machine, (ii) in a Docker container, and (iii) on Google Colab are accessible at this link[4]. Note that we specifically selected these demos so as to replicate real recommendation tasks involving multimodal features.
5.1 Demo 1: visual + textual items features

Fashion recommendation is probably one of the most popular tasks involving multimodal features to describe items. Generally, fashion products come with images (i.e., visual) and descriptions (i.e., textual) that may captivate the customer's attention.
\ Input data. We use a small fashion dataset where each item has its own image and other metadata such as gender, category, colour, season, and product title. As for the visual modality, we save a subsample of 100 random images from the dataset in jpeg format; as for the textual modality, we produce, for each of these items, a description obtained by combining all the metadata fields above, and store it in a tsv file whose first and second columns hold item ids and descriptions, respectively. Note that, if no item column name is provided, Ducho selects, by default, the last column as the one holding the items’ descriptions.
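To make the expected input format concrete, here is a minimal pandas sketch that builds such a tsv file; the metadata column names, file paths, and the metadata CSV itself are illustrative assumptions of the sketch, not Ducho requirements.

```python
import pandas as pd

# Illustrative metadata table; column names are assumptions of this sketch.
meta = pd.read_csv("fashion_metadata.csv")

# Build one textual description per item by concatenating the metadata fields.
meta["description"] = (
    meta[["gender", "category", "colour", "season", "title"]]
    .astype(str)
    .agg(" ".join, axis=1)
)

# First column: item ids; last column: descriptions. When no item column
# name is given, Ducho defaults to reading descriptions from the last column.
meta[["item_id", "description"]].to_csv("descriptions.tsv", sep="\t", index=False)
```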
\ Extraction. In terms of extraction models, we adopt VGG19 and Xception for the product images, and Sentence-BERT, pre-trained for semantic textual similarity, for the descriptions. For each extraction model, we select the extraction layer, the pre-processing procedures, and the library the deep network should be retrieved from.
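To give an idea of what such a configuration drives under the hood, the following PyTorch sketch extracts a VGG19 intermediate-layer embedding and a Sentence-BERT embedding; the chosen layer (the first fully-connected one), the Sentence-BERT checkpoint, and the file path are assumptions of the sketch, not Ducho's API or defaults.

```python
import torch
from PIL import Image
from torchvision import models
from sentence_transformers import SentenceTransformer

# Visual side: VGG19 with ImageNet weights; we read activations from the
# first fully-connected layer, a common choice for item embeddings.
weights = models.VGG19_Weights.IMAGENET1K_V1
vgg19 = models.vgg19(weights=weights).eval()
preprocess = weights.transforms()

image = preprocess(Image.open("images/item_001.jpeg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    feats = vgg19.avgpool(vgg19.features(image)).flatten(1)
    visual_emb = vgg19.classifier[0](feats)  # 4096-d fc1 activations

# Textual side: a Sentence-BERT checkpoint tuned for semantic similarity.
sbert = SentenceTransformer("all-MiniLM-L6-v2")
textual_emb = sbert.encode("Men's blue cotton t-shirt for summer")
```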
\ Output. Through the configuration file, we set Ducho to save the visual and textual embeddings to custom folders, where each embedding is a numpy array whose filename corresponds to the item name from the original input data. Additionally, Ducho keeps a log file in a dedicated folder within the project.
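The resulting on-disk layout can be pictured as follows; the folder names and embedding sizes are illustrative assumptions, while the one-numpy-array-per-item naming convention is the one described above.

```python
import os
import numpy as np

os.makedirs("visual_embeddings", exist_ok=True)
os.makedirs("textual_embeddings", exist_ok=True)

# Stand-ins for the extracted features (sizes are assumptions of the sketch).
visual_emb = np.random.rand(4096).astype(np.float32)   # e.g., VGG19 fc layer
textual_emb = np.random.rand(384).astype(np.float32)   # e.g., Sentence-BERT

# One .npy file per item, named after the item id from the input data.
np.save("visual_embeddings/item_001.npy", visual_emb)
np.save("textual_embeddings/item_001.npy", textual_emb)

# A downstream recommender can then load embeddings by item id.
v = np.load("visual_embeddings/item_001.npy")
```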
\
5.2 Demo 2: audio + textual items features

When it comes to recommending songs to users, audio and textual features may enhance the representation of each song: the former are structured as a waveform, the latter as sentences referring, for instance, to the song's music genre.
\ Input data. We use a small music genres dataset where each item comes with the binary representation of its waveform (which we save as a wav audio track) and its music genre (which we interpret as a textual song description and save in a tsv file, similarly to the previous demo). Given the heavy computational costs deep learning-based audio extractors require, we select a small subset of the input songs (i.e., 10) just for the purpose of this demo.
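Assuming the raw waveform is available as an array of samples with a known sampling rate (both assumptions of this sketch, since datasets expose audio in different ways), dumping it to a wav track could look like this:

```python
import os
import numpy as np
import soundfile as sf

os.makedirs("songs", exist_ok=True)

# Assumed inputs: a 5-second mono waveform as float32 samples and its rate.
sample_rate = 22050
waveform = np.random.uniform(-1.0, 1.0, sample_rate * 5).astype(np.float32)

sf.write("songs/track_0001.wav", waveform, sample_rate)
```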
\ Extraction. For the extraction of audio features, we exploit Hybrid Demucs, pre-trained for the task of music source separation. As for the textual extraction, we re-use the same deep neural model from the previous demo, since we are not interested in extracting other specific high-level features from music genres.
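Outside of Ducho's configuration, a minimal way to run Hybrid Demucs is the pre-trained bundle shipped with torchaudio; the specific checkpoint, the stereo handling, and the time-pooling used to obtain a fixed-size vector below are all assumptions of this sketch rather than Ducho's actual extraction settings.

```python
import torch
import torchaudio

# Hybrid Demucs pre-trained on MUSDB-HQ for music source separation.
bundle = torchaudio.pipelines.HDEMUCS_HIGH_MUSDB_PLUS
model = bundle.get_model().eval()

waveform, sr = torchaudio.load("songs/track_0001.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)
if waveform.shape[0] == 1:          # Demucs expects stereo input
    waveform = waveform.repeat(2, 1)

with torch.no_grad():
    # Output shape: (batch, num_sources, channels, frames).
    sources = model(waveform.unsqueeze(0))

# One plausible fixed-size descriptor: time-pooled separated sources
# (an illustrative choice, not Ducho's extraction layer).
audio_emb = sources.mean(dim=-1).flatten()
```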
\ Output. Once again, we use the configuration file to specify the output folders for both the audio and textual embeddings. Please note that extracting the audio features might take some time depending on the machine you run Ducho on, as the deep audio extractor may require considerable computational resources.
\
:::info This paper is available on arxiv under CC BY 4.0 DEED license.
:::
[4] https://github.com/sisinflab/Ducho/tree/main/demos.