University of Sheffield
COM3502-4502-6502
Speech Processing
Programming Assignment
Contents
1 Overview (please read carefully)
1.1 Logistics
1.2 Provided Materials
1.3 Hand-In Procedure
2 Part-I: Voice Manipulation
2.1 Continuous Speech Input
2.2 Filtering in the Time-Domain
2.3 Filtering in the Frequency-Domain
2.4 Low Frequency Oscillator
2.5 Amplitude Modulation: Tremolo
2.6 Ring Modulation
2.7 Frequency Shifting
2.8 Frequency Modulation: Vibrato
2.9 Time Delay Effects: Echo, Comb Filtering and Flanger
2.10 Feedback: Reverberation
3 Part-II: Real-Time Voice Changer
3.1 General Requirements
3.2 COM4502-6502 Only
3.3 Final Steps
1 Overview (please read carefully)
Before proceeding with this assignment you should have completed the ‘Pure Data’ (Pd)
Introductory Programming Exercise, and completed the quiz on Blackboard. If not, then
you are strongly advised to do so before moving on to the programming assignment described
below. In any event, it may be advisable to review the introductory exercise,
especially Sections 7 and 8 which cover working with real-time audio and speech.
Note: This programming assignment is worth 55% of the overall course mark.
This is an individual assignment.
You are permitted to re-use any of the Pd examples provided during the course.
1.1 Logistics
You are free to complete this assignment in your own time. However, feedback, advice
and guidance is available from the post-graduate teaching assistants via the appropriate
discussion board on Blackboard. Note that they will try to help you as much as possible,
but it is not their job to debug your code or provide solutions to the assignment itself.
Note: It will take some time to complete this assignment, so plan your work accordingly
over the coming weeks. Read these instructions carefully, and pay particular attention to
the marks associated with each component - especially the last.
Note: Please be aware that students registered on COM4502 and COM6502 have additional
tasks to perform. These are marked ‘COM4502-6502 Only’.
1.2 Provided Materials
In addition to these instructions, you have been provided with a .zip file containing a
number of items1 that you will need for this assignment. In particular, you have been
supplied with a LaTeX template - YourName.tex - which you will use to compile your
response sheet.
LaTeX (pronounced [lɑːtek] or [leɪtek]) is a free document preparation system for high-quality
typesetting2, and having a working knowledge of LaTeX is a key skill for any
computer scientist3. LaTeX is very easy to learn4, and there are many resources available on
the web, e.g. https://www.sharelatex.com/learn/Learn_LaTeX_in_30_minutes.
LaTeX distributions are available for Linux, Mac OSX and Windows (see:
https://www.latex-project.org/get/). However, for this assignment, you might like to use an on-line
collaborative environment such as Overleaf (https://www.overleaf.com).
1 The provided screenshot.jpg is simply a placeholder which you will replace with your own images.
2 LaTeX includes features designed for the production of technical and scientific documentation, so it
is the de-facto standard for the publication of scientific/technical papers and reports. Unlike WYSIWYG
(“What You See Is What You Get”) word processors (such as Microsoft Word or OpenOffice), LaTeX
is based on plain-text source files (containing html-style markup) which are compiled into a typeset
document. This separation of ‘content’ from ‘style’ facilitates the production of very professional scientific
documents in terms of their consistency, readability and design. You can find out more about LaTeX here:
https://www.latex-project.org.
3 The Department strongly recommends that you consider using LaTeX for your Dissertation, and Prof.
Moore has provided a template here:
http://staffwww.dcs.shef.ac.uk/people/R.K.Moore/campus_only/USFD_Academic-_Report_LaTeX-Template.zip.
4 Students who took COM2009-3009 Robotics should be familiar with LaTeX already.
⋆ Step 0: Before you start working with LaTeX, edit the filename of the provided .tex
file by replacing YourName with your name.
Your edited .tex file should compile to produce a .pdf document with your name on the
front cover, followed by your responses to a number of questions. You will be submitting
the .pdf file when you have completed the assignment, not the .tex file.
1.3 Hand-In Procedure
Once you have completed the assignment, you should submit a .zip file (via Blackboard)
containing your response sheet (in .pdf format) and the requested Pd source
files. Do not submit your LaTeX source files. The .zip filename should be of the form
YourName.zip.
Please make sure that your name is shown correctly on the front page of your response
sheet (as explained in the previous section).
Standard departmental penalties apply for late hand-in5 and plagiarism6.
Feedback (including provisional marks) will be provided via Blackboard within three working
weeks of the hand-in date.
The deadline for handing-in this assignment (via MOLE) is . . .
COM3502-4502: 17:00 Friday 18th December 2020
COM6502: 17:00 Friday 22nd January 2021
5 https://sites.google.com/sheffield.ac.uk/comughandbook/general-information/assessment/late-submission
6 https://sites.google.com/sheffield.ac.uk/comughandbook/general-information/assessment/unfair-means
2 Part-I: Voice Manipulation
As you saw in Lecture 1, speech processing has many applications across a wide variety
of market sectors. One area of particular interest is the creation of appropriate voices for
fictional characters in television, cinema, games and related areas of edutainment7. Such
robot, alien or cartoon voice effects are often achieved by manipulating the speech of a
voice actor (either in real-time or in post-production8) - a process that is referred to as
‘Voice FX’ or ‘Vocal FX’.
The aim of this assignment is to implement a range of speech processing algorithms for
real-time voice manipulation. The final outcome will be a ‘Voice Changer’ that can be
configured to perform a variety of different manipulations on your own voice. Part-I
involves creating the core components in Pd, and Part-II brings them together into a
single [VoiceChanger] application. Please work through the following sections in the
order given, providing responses to the relevant questions as you come to them (by editing
your .tex file).
Note: In Part-II you will reuse many of the components you create in Part-I. Hence, you
are strongly advised to adopt professional working practices, e.g. saving versioned copies
of each piece of software that you develop.
Note: Some of the steps below require you to listen critically to sounds generated by
your software. It is thus advisable for you to use headphones and/or work in a quiet
environment. Take care not to annoy other people with your sound output.
2.1 Continuous Speech Input
Your first task is to program a Pd patch that can provide a continuous stream of real-time
speech that can be fed into the various manipulation techniques which follow.
⋆ Step 1: Create the following patch . . .
Note: Recall from the introductory programming exercise that the audio output control
will appear when you create the [output∼] object.
7 Wilson, S., & Moore, R. K. (2017). Robot, alien and cartoon voices: implications for speech-enabled
systems. In 1st Int. Workshop on Vocal Interactivity in-and-between Humans, Animals and Robots
(VIHAR-2017). Skövde, Sweden.
8 Rose, J. (2012). Audio Postproduction for Film and Video. Taylor & Francis.
⋆ Step 2: Click on the bang at the top of the patch, increase the volume slider on
[output∼] and verify that you can hear the speech9 on a continuous loop.
⋆ Step 3: Now click on [wsprobe∼]. This should open a separate patch that displays
the waveform and spectrum in real-time (as you have seen several times in the lectures).
Experiment with the various options provided in the [wsprobe∼] GUI, and take particular
notice of the differences between voiced and voiceless sounds.
QUESTION 1 (worth up to 5 marks)
Provide a screenshot of [wsprobe∼] for a typical voiced sound, and explain the features
in the waveform and spectrum that distinguish it from an unvoiced sound. Hint: use the
‘snapshot’ feature in [wsprobe∼] to obtain a static display.
2.2 Filtering in the Time-Domain
One of the most basic manipulations in Voice FX is to ‘filter’ a signal in order to increase or
decrease energy at particular frequencies. The simplest types of filter are ‘low-pass’, ‘high-pass’
and ‘band-pass’, and these may be implemented in Pd using the [lop∼], [hip∼]
and [bp∼] objects respectively.
⋆ Step 4: Modify your patch to include a low-pass filter with a ‘cut-off frequency’ controlled
by a horizontal slider covering the range 0 to 10,000 Hz as shown here10 . . .
⋆ Step 5: Experiment with different values for the cut-off frequency, and observe the
consequences for the output sound as well as the waveform and spectrum displays.
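To make the idea of a cut-off frequency concrete outside Pd, the following is a minimal Python sketch of a textbook one-pole low-pass filter. It is an illustration only: the function names, the 44,100 Hz sample rate and the coefficient formula are assumptions, and it is not claimed to match the internals of [lop∼]. It shows that a 100 Hz tone passes a 500 Hz cut-off almost unchanged, whereas a 5,000 Hz tone is strongly attenuated.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate=44100.0):
    """Textbook one-pole low-pass: y[n] = y[n-1] + a * (x[n] - y[n-1])."""
    # Smoothing coefficient derived from the cut-off frequency (an assumed, common choice).
    a = 1.0 - math.exp(-2.0 * math.pi * cutoff_hz / sample_rate)
    out, y = [], 0.0
    for x in samples:
        y += a * (x - y)
        out.append(y)
    return out

def sine(freq_hz, n, sample_rate=44100.0):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def peak(samples):
    """Largest absolute sample value."""
    return max(abs(s) for s in samples)

# One second of each test tone, both filtered with a 500 Hz cut-off.
low = one_pole_lowpass(sine(100.0, 44100), cutoff_hz=500.0)
high = one_pole_lowpass(sine(5000.0, 44100), cutoff_hz=500.0)

# Compare the steady-state peaks (second half of each output).
print(round(peak(low[22050:]), 2), round(peak(high[22050:]), 2))
```

A high-pass version can be sketched by subtracting the low-pass output from the input, which is one reason these two filter types are often described together.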
9 The provided speech.wav file is the same as the one used in the Lectures (and as provided in
com3502-4502-6502_pd-examples.zip available on MOLE).
10 Recall that the slider’s range can be set by right-clicking and adjusting its ‘properties’.
QUESTION 2 (worth up to 5 marks)
Which sounds are most affected when the low-pass cut-off frequency is set to around 500
Hz - vowels or consonants - and why?
⋆ Step 6: Edit the [lop∼] object to become a [hip∼] object, and again experiment
with different values for the cut-off frequency.
QUESTION 3 (worth up to 5 marks)
How is it that the speech is still quite intelligible when the high-pass cut-off frequency is
set to 10 kHz?
As you have seen, these simple low-pass and high-pass filters have only one control parameter
- their cut-off frequency. A simple band-pass filter has two parameters - its ‘centre
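frequency’ and its ‘Q factor’. The Q of a band-pass filter is defined as Q = f/B, where
f is the centre frequency and B is the bandwidth (both in Hz). For example, a centre
frequency of 1,000 Hz with Q = 10 gives a bandwidth of 100 Hz. Hence, for a given centre
frequency, a high-Q filter has a narrow bandwidth and vice versa.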
⋆ Step 7: Edit the [hip∼] object to become a [bp∼] object, and add another horizontal
slider covering the range 0 to 100 as shown here . . .
⋆ Step 8: Experiment with different values for the centre frequency and Q.
2.3 Filtering in the Frequency-Domain
The [lop∼], [hip∼] and [bp∼] objects operate directly on the speech waveform. Hence,
they are examples of ‘time-domain’ signal processing. An alternative to time-domain
processing is to process signals in the ‘frequency-domain’, i.e. by modifying the spectrum
rather than the waveform.
A classic example of filtering in the frequency-domain is a ‘graphic equaliser’. This is a
common component of a high-fidelity (HiFi) audio system, and it allows the spectrum to
be shaped by setting gains across a range of frequencies. You have been provided with a
Pd ‘abstraction’11 [GraphicEqualiser∼] that can perform this function.
⋆ Step 9: Remove your [bp∼] object and its slider controls, and insert the provided
[GraphicEqualiser∼]. Connect it between [readsf∼] and [output∼], and experiment
with different filter profiles by clicking on the graph (in ‘run mode’) and dragging the curve
into different shapes. Note how you can make low-pass, high-pass and band-pass filters
(and many more), simply by drawing the appropriate shapes, as shown here . . .
⋆ Step 10: Open the [GraphicEqualiser∼] abstraction to see how it works (by right-clicking
on the object, and selecting open). Take particular note of how the GUI objects
(the graph and reset button) are made to appear in the parent patch using the ‘graph-on-parent’
option in the sub-patch properties12.
QUESTION 4 (worth up to 5 marks)
COM3502-4502-6502: The [GraphicEqualiser∼] object uses an FFT internally; what
does FFT stand for and what does an FFT do?
COM4502-6502 ONLY: What is a DFT and how is it different from an FFT?
⋆ Step 11: Use [GraphicEqualiser∼] to simulate the effect of a land-line telephone by
eliminating all energy below 300 Hz and above 3,400 Hz.
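The principle behind this kind of frequency-domain filtering can be sketched in a few lines of Python. The example below is an illustration under stated assumptions, not a description of how the provided [GraphicEqualiser∼] works internally: it uses a direct transform for brevity on a single short frame, the 8,000 Hz sample rate and all function names are hypothetical, and a practical system would process overlapping windowed frames. The test signal mixes tones at 125 Hz, 1,000 Hz and 3,750 Hz, and only the 1,000 Hz tone lies inside the 300 to 3,400 Hz telephone band.

```python
import cmath
import math

def dft(x):
    """Direct discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Inverse transform, returning the real part of each output sample."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]

def telephone_band(x, sample_rate, low_hz=300.0, high_hz=3400.0):
    """Zero every spectral bin outside [low_hz, high_hz], then resynthesise."""
    n = len(x)
    spectrum = dft(x)
    for k in range(n):
        # Bins above n/2 mirror the negative frequencies, so fold them back.
        freq = min(k, n - k) * sample_rate / n
        if freq < low_hz or freq > high_hz:
            spectrum[k] = 0.0
    return idft(spectrum)

SAMPLE_RATE = 8000.0   # assumed rate, chosen so each test tone falls exactly on a bin
N = 256

def tone(freq_hz):
    """One frame of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(N)]

mixture = [a + b + c for a, b, c in zip(tone(125.0), tone(1000.0), tone(3750.0))]
filtered = telephone_band(mixture, SAMPLE_RATE)

# Largest deviation between the filtered frame and the 1,000 Hz tone alone.
residual = max(abs(f - t) for f, t in zip(filtered, tone(1000.0)))
```

Replacing the hard zero with a per-bin gain taken from a drawn curve gives the general ‘graphic equaliser’ behaviour described above.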
2.4 Low Frequency Oscillator
Many ‘Voice FX’ are achieved by modifying some characteristic of the speech using a low
frequency oscillator or ‘LFO’. LFOs typically have two controls: speed (which is specified
by the frequency in Hertz) and depth (which specifies the magnitude of the effect). Your
[VoiceChanger] will require several LFOs, so it makes sense to implement one as a Pd
‘abstraction’.
11 Refer back to the introductory programming exercise if you have forgotten what an abstraction is.
12 Note that you can also look inside [output∼] or [wsprobe∼] to see the same ‘graph-on-parent’
functionality.
⋆ Step 12: Using the skills you acquired in the introductory programming exercise,
create a Pd abstraction [LFO∼] that is capable of generating an audio sine/square wave
between 0 and 50 Hertz (e.g. using the [osc∼] object). The speed and depth of the LFO
should be controllable by suitable sliders, and they should appear in the object using the
‘graph-on-parent’ approach.
⋆ Step 13: Add an option to select either ‘sine’ or ‘square’ wave output using a Vradio
button. Hint: to obtain a square wave, multiply the output of the [osc∼] object by a
very large number and pass the result through [clip∼ -1 1] before being multiplied by
depth.
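The square-wave hint can be checked numerically with a short Python sketch. This is an illustration only and not a Pd solution: the function name, the 44,100 Hz sample rate and the gain of 1e6 (standing in for ‘a very large number’) are assumptions. The oscillator is modelled as a cosine, and the clipping step mirrors [clip∼ -1 1].

```python
import math

def lfo(speed_hz, depth, n, sample_rate=44100.0, shape="sine"):
    """Generate n samples of a sine or square LFO scaled by depth."""
    out = []
    for i in range(n):
        s = math.cos(2.0 * math.pi * speed_hz * i / sample_rate)
        if shape == "square":
            # Amplify hugely, then clip to [-1, 1]: almost every sample lands on a rail.
            s = max(-1.0, min(1.0, s * 1e6))
        out.append(depth * s)
    return out

sine_wave = lfo(5.0, 0.5, 44100, shape="sine")
square_wave = lfo(5.0, 0.5, 44100, shape="square")

# Fraction of square-wave samples sitting at +depth or -depth.
on_rails = sum(1 for v in square_wave if abs(abs(v) - 0.5) < 1e-9) / len(square_wave)
```

Samples that fall exactly on a zero crossing are not pushed to either rail, which is why the check above counts a fraction rather than requiring every sample to equal ±depth.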
Additional features that will be useful later in the assignment are:
• [number] GUI objects for speed and depth
• a reset function
• external inputs for speed, depth and reset
• external outputs for speed and depth
• an ability to handle creation arguments for speed and depth (also linked to reset)
• an ability to reset the phase of the oscillator when speed = 0 in order to ensure
consistent output when initialising/resetting
Your resulting [LFO∼] abstraction should look something like this (note the use of speed
= 5 and depth = 0.5 as creation arguments) . . .
Although your [LFO∼] object outputs audio, you are unlikely to be able to hear it as the
frequency is so low. However, you can check that it is functioning correctly by connecting
it to [wsprobe∼] and/or by using the appropriate number GUI object to wind the speed
up to an audible frequency.
⋆ Step 14: Test your [LFO∼] object by creating an [LFO∼-help] object.
QUESTION 5 (worth up to 10 marks)
With speed = 50 and depth = 0.5, what are the minimum and maximum amplitudes of
your LFO output, and how do they vary with changes in these two settings? Also, please
provide two screenshots: (a) your [LFO∼-help] object and (b) the internal structure of
your [LFO∼] object.
2.5 Amplitude Modulation: Tremolo
‘Tremolo’ is one of the most basic voice manipulations that makes use of an LFO. In
this effect, the amplitude of a speech signal is modulated, i.e. the speech waveform is
multiplied by a variable gain that ranges between 0 and 1. Your LFO outputs an audio
signal between -depth and +depth. So, in order to modulate the amplitude of the speech
correctly, the output of the LFO has to be scaled appropriately - in this case, by adding
depth (e.g. using the audio math object [+∼] connected to the depth output of your
LFO) and dividing the result by 2 (e.g. using the audio math object [*∼ 0.5]).
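The scaling described above can be written out as a short Python sketch. It is only an illustration of the arithmetic, with a hypothetical function name and an assumed 44,100 Hz sample rate: the LFO value in the range -depth to +depth is shifted up by depth and halved, giving a gain in the range 0 to depth.

```python
import math

def tremolo(speech, speed_hz, depth, sample_rate=44100.0):
    """Multiply each speech sample by a slowly varying, non-negative gain."""
    out = []
    for i, x in enumerate(speech):
        lfo = depth * math.cos(2.0 * math.pi * speed_hz * i / sample_rate)
        gain = (lfo + depth) * 0.5   # shift and halve: gain now lies in [0, depth]
        out.append(x * gain)
    return out

# Feeding in a constant signal of 1.0 exposes the gain curve itself.
gains = tremolo([1.0] * 44100, speed_hz=5.0, depth=1.0)
```

With depth = 1 the gain sweeps the full range from 0 to 1, so the speech is periodically silenced completely.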
⋆ Step 15: Implement ‘tremolo’ using your [LFO∼] object using the principles given
above. The resulting program should look something like this (note the use of a [sig∼]
object to convert numerical data to audio data) . . .
⋆ Step 16: Experiment with different settings for speed and depth. In particular, note
that a square wave with speed between 3 and 4 Hz (and depth = 1) has a very destructive
effect on the intelligibility of the output. This is because 3-4 Hz corresponds to the typical
syllabic rate of speech.
2.6 Ring Modulation
Another basic effect is to multiply the speech signal by the output of an LFO. This is
known as ‘ring modulation’.
⋆ Step 17: Modify your implementation of ‘tremolo’ by removing the two scaling objects
([+∼] and [*∼ 0.5]) and connecting the audio output of your [LFO∼] directly to the
[*∼] object.
⋆ Step 18: Experiment with different settings for speed and depth, and note how the
timbre of the resulting sound is subtly different from ‘tremolo’.
Note: In the BBC TV series Dr. Who, the voices of the alien Daleks13 are generated by
a ring modulator with an LFO set to around 30 Hz. The voice actors also spoke using a
stilted monotonic intonation in order to enhance the effect. You can try this yourself by
adding a ‘live speech input’ option to your code (i.e. using [adc∼] and some means to
toggle between the prerecorded and live input)14.
QUESTION 6 (worth up to 5 marks)
In your own words15, why is this effect known as ‘ring modulation’?
2.7 Frequency Shifting
Many Vocal FX are the result of altering the frequencies present, e.g. changing the pitch
of a voice. There are many algorithms for frequency shifting, but you have already implemented
an approximate solution with your ring modulator.
⋆ Step 19: Use the number object in your [LFO∼] GUI to wind up the speed of the
LFO in your ring modulator to around 300 Hz. If all is working correctly, as you raise the
modulation frequency, you will hear that the frequencies in the speech seem to be both
increasing and decreasing. Indeed, if you listen carefully at a fixed speed of around 300
Hz, you may be able to detect two voices speaking simultaneously at different frequencies.
The reason for this result is that multiplying one signal by another - known as ‘heterodyning’
- produces an output which consists of the sums and differences of the frequencies
of the two input signals. This means that the frequencies in the original speech are both
raised and lowered by an amount corresponding to the speed of the LFO - hence the
strange mixture of higher pitched and lower pitched voices in the output.
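The sum-and-difference behaviour follows from a standard trigonometric product identity. In the form below, f stands for one frequency component of the speech and f_m for the modulation frequency; both symbols are introduced here for illustration.

```latex
\cos(2\pi f t)\,\cos(2\pi f_m t)
  = \tfrac{1}{2}\cos\bigl(2\pi (f + f_m)\, t\bigr)
  + \tfrac{1}{2}\cos\bigl(2\pi (f - f_m)\, t\bigr)
```

For example, a 1,000 Hz component multiplied by a 300 Hz modulator produces components at 1,300 Hz and 700 Hz, each at half the original amplitude.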
The sum and difference components are known as the upper and lower ‘sidebands’. So, to
avoid the distortion caused by having both sidebands present in the output, we need to
remove one of them, i.e. we only want one sideband. This can be achieved by creating two
versions of the original signal that differ in phase by 90°, then modulating the two versions
with two heterodynes that also differ in phase by 90°. These phase differences are such
that, when the two out-of-phase heterodyned signals are subtracted, the lower sideband
is cancelled out. This process is known as ‘single-sideband modulation’ or ‘SSB’16.
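The cancellation can be verified numerically for the simplest possible input, a single tone, where the 90° shifted version is known exactly (a cosine and the matching sine). The Python sketch below is an illustration with hypothetical names and an assumed 44,100 Hz sample rate. For real speech the quadrature version is not known in closed form, which is why a phase-splitting filter is needed in practice.

```python
import math

SAMPLE_RATE = 44100.0  # assumed sample rate

def ssb_shift_tone(freq_hz, shift_hz, n):
    """Shift a single cosine tone upwards by shift_hz using the two-path SSB method."""
    out = []
    for i in range(n):
        t = i / SAMPLE_RATE
        in_phase = math.cos(2.0 * math.pi * freq_hz * t)
        quadrature = math.sin(2.0 * math.pi * freq_hz * t)   # same tone, 90 degrees later
        carrier_i = math.cos(2.0 * math.pi * shift_hz * t)
        carrier_q = math.sin(2.0 * math.pi * shift_hz * t)   # carrier, 90 degrees later
        # Subtracting the two products cancels the lower sideband.
        out.append(in_phase * carrier_i - quadrature * carrier_q)
    return out

shifted = ssb_shift_tone(1000.0, 300.0, 1000)
expected = [math.cos(2.0 * math.pi * 1300.0 * i / SAMPLE_RATE) for i in range(1000)]
error = max(abs(a - b) for a, b in zip(shifted, expected))
```

The output matches a pure 1,300 Hz tone, with no 700 Hz component. Adding the two products instead of subtracting them would keep the lower sideband and shift the tone down.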
QUESTION 7 (worth up to 5 marks)
Why is SSB commonly used in long-distance radio voice communications?
Splitting a signal into two with a 90° phase difference between them can be achieved using
the ‘Hilbert Transform’. Hence the following patch shows the Pd object [hilbert∼] being
used to create a frequency shifter based on SSB modulation . . .
13 https://en.wikipedia.org/wiki/Dalek
14 You will need this feature later, in any case.
15 I.e. do not plagiarise from Wikipedia.
16 SSB is commonly used in long-distance radio voice communications, especially by amateur radio
enthusiasts.
⋆ Step 20: Implement the above patch, and experiment with different values for the
frequency shift. Note that the frequency-shifted output now sounds much cleaner than
the output of the ring modulator (i.e. only one voice). Also note that the pitch of the
voice can be shifted down as well as up.
QUESTION 8 (worth up to 5 marks)
COM3502-4502-6502: Why can the voice be shifted up in frequency much further than
it can be shifted down in frequency before it becomes severely distorted? Hint: look at
[wsprobe∼].
COM4502-6502 ONLY: Your frequency shifter changes all the frequencies present in an
input signal. How might it be possible to change the pitch of a voice without altering the
formant frequencies?
A classic ‘robotic’ voice can be achieved by simply adding frequency-shifted speech back
to the unprocessed original. This effect is known as ‘harmony’. However, rather than
simply adding the signals in equal amounts, your final [VoiceChanger] will benefit from
a more general purpose approach.
⋆ Step 21: Implement a ‘mixer’ that adds the original speech with the manipulated
speech in different proportions. Use a slider that has 100% original at one end, 100%
manipulated at the other end and 50-50 in the middle.
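The arithmetic of such a mixer is a simple linear crossfade, sketched here in Python as an illustration (the function name and the 0 to 1 convention for the slider value are assumptions, not part of the assignment specification).

```python
def mix(original, manipulated, amount):
    """Crossfade: amount 0.0 is 100% original, 1.0 is 100% manipulated, 0.5 is 50-50."""
    amount = max(0.0, min(1.0, amount))  # keep the slider value within range
    return [(1.0 - amount) * o + amount * m for o, m in zip(original, manipulated)]

half = mix([1.0, 2.0], [3.0, 4.0], 0.5)
dry = mix([1.0, 2.0], [3.0, 4.0], 0.0)
wet = mix([1.0, 2.0], [3.0, 4.0], 1.0)
```

Because the two weights always sum to 1, the overall level stays roughly constant as the slider moves.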
⋆ Step 22: With your mixer at the 50-50 setting, experiment with different frequency
shifts in order to produce the best robotic sounding output.
2.8 Frequency Modulation: Vibrato
Now that you have the ability to shift the frequencies in a speech signal, it is very easy to
implement another common voice manipulation technique - ‘vibrato’. All that is required
is for the frequency shifter to be controlled by the output of an LFO.
⋆ Step 23: Implement ‘vibrato’ by connecting an LFO to your frequency shifter, and
experiment with different values for speed and depth. Note that the LFO output will
need to be scaled to provide an appropriate frequency shift range and then added to the
output of the frequency shift slider.
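The control signal for vibrato can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, the sample rate is assumed, and the depth is taken to be expressed directly in Hz (one possible scaling choice). The result is the instantaneous frequency shift that would be passed to the frequency shifter.

```python
import math

def vibrato_shift(base_shift_hz, speed_hz, depth_hz, n, sample_rate=44100.0):
    """Slider value plus a scaled sine LFO, giving a time-varying frequency shift in Hz."""
    return [base_shift_hz
            + depth_hz * math.cos(2.0 * math.pi * speed_hz * i / sample_rate)
            for i in range(n)]

# A 100 Hz base shift wobbling by +/- 50 Hz, five times per second.
shifts = vibrato_shift(100.0, 5.0, 50.0, 44100)
```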
Note: With speech.wav as input, it is possible to simulate the voice of Gollum from the
Lord of the Rings by setting LFO = sine wave, speed = 50 Hz and depth = 350.
2.9 Time Delay Effects: Echo, Comb Filtering and Flanger
Many interesting voice FX can be achieved by delaying the signal and recombining it
with itself. Pd provides the possibility of writing to and reading from audio ‘delay
lines’ using the objects [delwrite∼] and [delread∼]. However, for full flexibility in
your [VoiceChanger] application, you need to be able to vary the time at which audio
data is read out of the delay line. So, rather than using [delread∼] it is better to use
[vd∼].
⋆ Step 24: Implement the following patch17 with the ‘delay’ slider’s properties set to
operate on a log scale from 1 to 1000 msecs . . .
17 Of course, you may choose to replace the [+∼] with the mixer you implemented earlier.
⋆ Step 25: Experiment with various values for the delay, and note the different effects
you can achieve with delays (a) below 20 msecs, (b) between 20 and 100 msecs, and (c)
above 100 msecs.
You should observe that, with delays below 20 msec, the signals combine to create a subtle
‘phasing’ effect. This is known as ‘comb filtering’ as the signal is effectively interfering
with itself: frequency components at multiples of the reciprocal of the delay time are
enhanced, while those midway between are cancelled out (due to ‘superposition’). Delays between 20 and 100 msecs give
the effect of the voice being in a reverberant room. Delays above 100 msecs sound like
distant echoes.
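The comb-filtering effect can be demonstrated with a short Python sketch that adds a signal to a delayed copy of itself. It is an illustration with hypothetical names and an assumed 44,100 Hz sample rate. With a delay of 10 msecs (441 samples), a 100 Hz tone arrives exactly one period late and is doubled, whereas a 50 Hz tone arrives half a period late and is cancelled.

```python
import math

SAMPLE_RATE = 44100.0  # assumed sample rate

def feedforward_comb(x, delay_samples):
    """y[n] = x[n] + x[n - delay], with silence before the delay line fills."""
    return [x[n] + (x[n - delay_samples] if n >= delay_samples else 0.0)
            for n in range(len(x))]

def tone(freq_hz, n):
    """Generate n samples of a unit-amplitude sine wave."""
    return [math.sin(2.0 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]

DELAY = 441  # 10 msecs at 44,100 Hz

reinforced = feedforward_comb(tone(100.0, 4410), DELAY)
cancelled = feedforward_comb(tone(50.0, 4410), DELAY)

# Peak levels once the delay line has filled.
peak_reinforced = max(abs(v) for v in reinforced[DELAY:])
peak_cancelled = max(abs(v) for v in cancelled[DELAY:])
```

Plotting the gain against frequency gives a series of evenly spaced peaks and notches, which is where the name ‘comb’ comes from.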
Finally, in this section, it is possible to use an LFO to vary the delay. The resulting effect
is known as a ‘flanger’.
⋆ Step 26: Add an LFO to your ‘delay’ patch to create a ‘flanger’, and experiment with
different settings. Note that you will need to scale the output of the LFO, and you will
get different effects depending on whether the delayed signal is mixed with the original or
not.
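A flanger can be sketched in Python as a delay line whose read position is moved by an LFO. The example below is an illustration under stated assumptions: all names and parameters are hypothetical, the delay is specified in msecs, and linear interpolation is used between samples for simplicity, which may differ from what [vd∼] does internally.

```python
import math

def flanger(x, base_delay_ms, depth_ms, speed_hz, mix=0.5, sample_rate=44100.0):
    """Mix the input with a copy read from a delay that is swept by a sine LFO."""
    out = []
    for n in range(len(x)):
        lfo = math.cos(2.0 * math.pi * speed_hz * n / sample_rate)
        # Scale the LFO into a delay in samples; never read from the future.
        delay = max(0.0, (base_delay_ms + depth_ms * lfo) * sample_rate / 1000.0)
        pos = n - delay
        if pos < 0:
            delayed = 0.0                      # delay line not yet filled
        else:
            i = int(pos)
            frac = pos - i
            nxt = x[i + 1] if i + 1 < len(x) else x[i]
            delayed = (1.0 - frac) * x[i] + frac * nxt   # linear interpolation
        out.append((1.0 - mix) * x[n] + mix * delayed)
    return out

# With no modulation and 100% processed output, the result is a plain delay of 2 samples.
ramp = [float(i) for i in range(10)]
plain_delay = flanger(ramp, 2.0, 0.0, 1.0, mix=1.0, sample_rate=1000.0)

# A modulated example: 5 msecs base delay swept by +/- 2 msecs at 10 Hz.
swept = flanger([math.sin(0.1 * i) for i in range(500)], 5.0, 2.0, 10.0)
```

Setting mix to 1.0 gives the delayed signal alone, and intermediate values blend it with the original, matching the two cases mentioned in Step 26.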
Note: Using ‘flanger’ it is possible to achieve an underwater effect by setting LFO = sine
wave, speed = 10 Hz, depth = 0.4, delay = 5 msecs and mix = 100% processed.
2.10 Feedback: Reverberation
The previous section noted that an echo occurring between 20 and 100 msecs after the
original sounds like reverberation. However, in a real environment there are many echoes
caused by multiple reflections from the different surfaces present, and a signal might
bounce around a very reverberant environment many times. A simple method for achieving
such an effect is to feed a proportion of the output signal back to the input. This is known
as ‘feedback’.
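The effect of feedback can be seen by passing a single impulse through a delay whose output is fed back to its input. The Python sketch below is an illustration with hypothetical names. Each successive echo is the previous one multiplied by the feedback gain.

```python
def feedback_delay(x, delay_samples, feedback):
    """y[n] = x[n] + feedback * y[n - delay]: a delay line with its output fed back."""
    y = []
    for n in range(len(x)):
        fed_back = feedback * y[n - delay_samples] if n >= delay_samples else 0.0
        y.append(x[n] + fed_back)
    return y

# A single click followed by silence, with a 3-sample delay and a feedback gain of 0.5.
impulse = [1.0] + [0.0] * 9
echoes = feedback_delay(impulse, delay_samples=3, feedback=0.5)
# echoes == [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.25, 0.0, 0.0, 0.125]
```

Trying other values of the feedback gain in this sketch is a quick way to explore how the train of echoes behaves over time.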
⋆ Step 27: Modify your delay patch as follows, and experiment with different values for
delay and feedback . . .
QUESTION 9 (worth up to 5 marks)
In a practical system, why is it important to keep the feedback gain less than 1?
Note: A short delay with a lot of feedback can give a metallic sound to a voice, e.g. C3PO
from Star Wars. Try delay = 10 msecs, feedback = 0.9 and mix = 90% processed.
Clearly feedback could be applied to any/all of the effects that you have implemented so
far, thereby increasing the range of possible voice effects considerably - as you will discover
in Part-II.
3 Part-II: Real-Time Voice Changer
3.1 General Requirements
The final part of this assignment is to compile all of the different components you developed
in Part-I into a single [VoiceChanger] application. The application should allow
‘live’ speech input in addition to the ability to select a particular prerecorded file (i.e. no
longer hard-wired for speech.wav). The GUI should be well thought out, easy to use and
attractive, and it should not only allow the different effects to be controlled, but also allow
them to be connected to each other in a logical sequence. You should also provide innovations,
for example preset effects (using Vradio buttons to select particular combinations
of settings).
3.2 COM4502-6502 Only
It is possible to convert a single voice into multiple voices by combining the outputs
of parallel manipulations, especially different frequency shifts. This effect is known as
‘chorus’. Your objective is to add ‘chorus’ to your [VoiceChanger] application by careful
use of abstractions and a method for selecting one or more output voices.
3.3 Final Steps
⋆ Step 28: Implement [VoiceChanger] according to the requirements stated above.
QUESTION 10 (worth up to 50 marks18)
Please provide a short19 description of the operation of your [VoiceChanger] application,
together with a screenshot of your final GUI.
Congratulations, you have almost completed this programming assignment. All that is
left now is to package up your code and submit it for marking.
⋆ Step 29: Create a .zip file of the form ‘YourName.zip’ containing . . .
• your response sheet in .pdf format with your name on the front cover and containing
your responses to all of the questions; and
• your final [VoiceChanger] code, together with any necessary abstractions (such as
[LFO∼], [LFO∼-help] etc.).
Important: For marking, we expect your code to work ‘out of the box’.
Note: Do not include your .tex file (and associated image files) or your work-in-progress
code examples.
⋆ Step 30: Hand-in your .zip file in accordance with the instructions at the beginning
of this document.
That’s it - you’re done!
18 25 for functionality, 15 for design/layout, 5 for Pd features, 5 for innovations
19 no more than 500 words