The Psychoacoustics and Synthesis of Singing Harmony

Please use this identifier to cite or link to this item: https://oar.a-star.edu.sg/communities-collections/articles/16544

Title:

The Psychoacoustics and Synthesis of Singing Harmony

Journal Title:

Digital Repository of NTU

DOI:

10356/142516

Publication URL:

https://hdl.handle.net/10356/142516

Authors:

Paul Yaozhu Chan, Minghui Dong, Haizhou Li

Keywords:

Psychoacoustics

Publication Date:

07 May 2021

Citation:

Chan, Paul Yaozhu (2020). The psychoacoustics and synthesis of singing harmony.

Abstract:

The human singing voice is a remarkable instrument that compounds an immense amount of expressivity onto a single dimension. Apart from semantics and melody (pitch, duration and dynamics), accent, age, gender and emotion are all carried in the singing voice. While a single singing voice on its own is aesthetically pleasing to the ear, the addition of concurrent voices of different pitch is commonly known to be capable of producing a pleasing effect far greater than the sum of that produced by each contributing voice. This motivates the use of harmony in singing. Unfortunately, accompaniment voices are difficult to sing, even for professional singers. Thankfully, singing synthesis has made it viable for this task to be undertaken by machines. The overall objective of this thesis is to advance today’s understanding of singing harmony and ultimately develop novel techniques for its synthetic reproduction. This is broken down into three parts. The first focuses on a psychophysical basis of harmony, the second focuses on the synthesis of the singing voice, while the third combines the first two to focus on the synthesis of harmonized singing.

The first contribution is an attempt to find a psychoacoustic basis of harmony and is presented in chapter 2. Apart from stationary harmony (chords, or sonorities: the aesthetics of a group of concurrent notes at one point in time), this also includes transitional harmony (chord progression, or resolution: the aesthetics of a similar group of notes progressing to another). In order to explain both stationary and transitional harmony, it introduces a theory of harmony based on the notions of interharmonic and subharmonic modulations. Acoustic measures of stationary and transitional harmony are proposed and the answers to five fundamental questions of psychoacoustic harmony are presented, both based on this theory. Correlations with existing music theory and perception statistics support this contribution for both stationary and transitional harmony.

The second contribution is in the synthesis of the singing voice and is presented in chapter 3. Modern singing synthesis methods are at best capable of word-level runtime synthesis, with only two known ones dedicated to realtime synthesis. This means that most of them are applicable only to offline music production. A large part of the art of music and singing, however, is in realtime performance. With both of the existing realtime singing synthesis methods bounded by a tradeoff between phone coverage and realtime capability, a need remains for one that overcomes it. A novel realtime singing synthesis system, SERAPHIM, is proposed as an answer to this. SERAPHIM overcomes this tradeoff, and subjective listening tests also showed that listeners preferred voices synthesized by SERAPHIM over those of other realtime systems.

The third contribution is in the synthesis of singing harmony and is presented in chapter 4. With this contribution, a novel method for singing harmony synthesis is proposed. Current implementations can be classified into pitch-inaccurate rule-based systems, timing-inaccurate inference-based systems, and hybrid systems that trade off between pitch inaccuracies and timing inaccuracies. This means that existing systems are vulnerable to pitch errors, timing errors, or both in different degrees of compromise. The challenge in the task was to overcome this compromise to develop a robust technique that is simultaneously resilient to both pitch and timing errors while producing harmonious accompaniment. Our strategy was to leverage the pitch-accurate inference-based method while eliminating timing inaccuracies by use of machine-synchronization. Spectrograms revealed that harmonized voices produced by this method contain fewer dissonances than those produced by existing methods. Subjective listening tests also showed that harmonized voices produced by this method are perceived to be the best sounding, both by vocal experts and by casual listeners.

All in all, the work presented in this thesis contributes to the advancement of the psychoacoustic understanding and machine synthesis of singing harmony across one journal paper, three conference papers and three patents.

License type:

http://creativecommons.org/licenses/by-nc/4.0/

Funding Info:

No specific funding for research done.

Description:

A thesis report submitted to the School of Computer Science and Engineering, Nanyang Technological University, Singapore, in partial fulfilment of the requirements for the degree of Doctor of Philosophy.

URI:

https://oar.a-star.edu.sg/communities-collections/articles/16544

Collections:

Institute for Infocomm Research

Manuscripts in This Item:

File	Size	Format
harmonythesis-print.pdf	17.92 MB	PDF