Jinhua Liang

Rapid advancements in artificial intelligence have significantly enhanced generative tasks involving music and images, employing both unimodal and multimodal approaches. This research develops a model capable of generating music that resonates with the emotions depicted in visual arts, integrating emotion labeling, image captioning, and language models to transform visual inputs into musical compositions. Addressing the scarcity of aligned art and music data, we curated the Emotion Painting Music Dataset, pairing paintings with corresponding music for effective training and evaluation. Our dual-stage framework converts images to text descriptions of emotional content and then transforms these descriptions into music, facilitating efficient learning with minimal data. Performance is evaluated using metrics such as Fréchet Audio Distance (FAD), Total Harmonic Distortion (THD), Inception Score (IS), and KL divergence, with audio-emotion text similarity confirmed by the pre-trained CLAP model to demonstrate high alignment between generated music and text. This synthesis tool bridges visual art and music, enhancing accessibility for the visually impaired and opening avenues in educational and therapeutic applications by providing enriched multi-sensory experiences.

10.48550/arxiv.2409.07827

FSD-FS

Zenodo (CERN European Organization for Nuclear Research) (2022)

Jinhua Liang Huy Phan Emmanouil Benetos

FSD-FS is a publicly available database of human-labelled sound events for few-shot learning. It spans 143 classes drawn from the AudioSet Ontology and contains 43030 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.

Citation

If you use the FSD-FS dataset, please cite our paper and FSD50K.

@article{liang2022learning,
  title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
  author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
  journal={arXiv preprint arXiv:2212.08952},
  year={2022}
}

@ARTICLE{9645159,
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
  year={2022},
  volume={30},
  number={},
  pages={829-852},
  doi={10.1109/TASLP.2021.3133208}
}

About FSD-FS

FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows a 7:2:1 ratio to split classes into base, validation, and evaluation sets, giving 98 classes in the base set, 30 in the validation set, and 15 in the evaluation set (more details can be found in our paper).

LICENSE

FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound; some require attribution to the original authors and some forbid further commercial reuse. For more details, refer to the link.

FILES

FSD-FS is organised in the following structure:

root
|
└─── base
|
└─── val
|
└─── eval

REFERENCES AND LINKS

[1] Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[2] Fonseca, Eduardo, et al. "FSD50K: An open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 829-852.

10.5281/zenodo.7452708

FSD-FS

Zenodo (CERN European Organization for Nuclear Research) (2022)

Jinhua Liang Huy Phan Emmanouil Benetos

FSD-FS is a publicly available database of human-labelled sound events for few-shot learning. It spans 143 classes drawn from the AudioSet Ontology and contains 43805 raw audio files collected from FSD50K. FSD-FS is curated at the Centre for Digital Music, Queen Mary University of London.

Citation

If you use the FSD-FS dataset, please cite our paper and FSD50K.

@article{liang2022learning,
  title={Learning from Taxonomy: Multi-label Few-Shot Classification for Everyday Sound Recognition},
  author={Liang, Jinhua and Phan, Huy and Benetos, Emmanouil},
  journal={arXiv preprint arXiv:2212.08952},
  year={2022}
}

@ARTICLE{9645159,
  author={Fonseca, Eduardo and Favory, Xavier and Pons, Jordi and Font, Frederic and Serra, Xavier},
  journal={IEEE/ACM Transactions on Audio, Speech, and Language Processing},
  title={FSD50K: An Open Dataset of Human-Labeled Sound Events},
  year={2022},
  volume={30},
  number={},
  pages={829-852},
  doi={10.1109/TASLP.2021.3133208}
}

About FSD-FS

FSD-FS is an open database for multi-label few-shot audio classification containing 143 classes drawn from FSD50K. It also inherits the AudioSet Ontology. FSD-FS follows a 7:2:1 ratio to split classes into base, validation, and evaluation sets, giving 98 classes in the base set, 30 in the validation set, and 15 in the evaluation set (more details can be found in our paper).

LICENSE

FSD-FS is released under Creative Commons (CC) licenses. As with FSD50K, each clip has its own license as defined by the clip uploader in Freesound; some require attribution to the original authors and some forbid further commercial reuse. For more details, refer to the link.

FILES

FSD-FS is organised in the following structure:

root
|
└─── dev_base
|
└─── dev_val
|
└─── eval

REFERENCES AND LINKS

[1] Gemmeke, Jort F., et al. "Audio Set: An ontology and human-labeled dataset for audio events." 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017.

[2] Fonseca, Eduardo, et al. "FSD50K: An open dataset of human-labeled sound events." IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022): 829-852.

10.5281/zenodo.7557107

DExter: Learning and Controlling Performance Expression with Diffusion Models

Applied Sciences (2024)

Huan Zhang Shreyan Chowdhury Carlos Cancino-Chacón Jinhua Liang Simon Dixon

In the pursuit of developing expressive music performance models using artificial intelligence, this paper introduces DExter, a new approach leveraging diffusion probabilistic models to render Western classical piano performances. The main challenge faced in performance rendering tasks is the continuous and sequential modeling of expressive timing and dynamics over time, which is critical for capturing the evolving nuances that characterize live musical performances. In this approach, performance parameters are represented in a continuous expression space, and a diffusion model is trained to predict these continuous parameters while being conditioned on a musical score. Furthermore, DExter also enables the generation of interpretations (expressive variations of a performance) guided by perceptually meaningful features by being jointly conditioned on score and perceptual-feature representations. Consequently, we find that our model is useful for learning expressive performance, generating perceptually steered performances, and transferring performance styles. We assess the model through quantitative and qualitative analyses, focusing on specific performance metrics regarding dimensions like asynchrony and articulation, as well as through listening tests that compare generated performances with different human interpretations. The results show that DExter is able to capture the time-varying correlation of the expressive parameters, and it compares well to existing rendering models in subjectively evaluated ratings. The perceptual-feature-conditioned generation and transferring capabilities of DExter are verified via a proxy model predicting perceptual characteristics of differently steered performances.

10.3390/app14156543

FSD-FS

Zenodo (CERN European Organization for Nuclear Research) (2022)

Jinhua Liang Huy Phan Emmanouil Benetos

10.5281/zenodo.7452707

Acoustic scene classification using deep CNN with fine-resolution feature

Expert Systems with Applications (2019)

Tao Zhang Jinhua Liang Biyun Ding

Feature (linguistics)

Spectrogram

Convolution (computer science)

Representation

10.1016/j.eswa.2019.113067
