Received: by ATHENA-PO-2.MIT.EDU (5.45/4.7) id AA15412; Wed, 10 May 89 01:18:06 EDT
From: <smmalina@ATHENA.MIT.EDU>
Received: by ATHENA.MIT.EDU (5.45/4.7) id AA03632; Wed, 10 May 89 01:16:49 EDT
Received: by M37-312-6.MIT.EDU (5.45/4.7) id AA14453; Wed, 10 May 89 01:15:49 EDT
Message-Id: <8905100515.AA14453@M37-312-6.MIT.EDU>
To: mlk@ATHENA.MIT.EDU
Cc: smm%nc.mit.edu@mc.lcs.mit.edu, iagarcia@ATHENA.MIT.EDU
Subject: Revenge of the Backups
Date: Wed, 10 May 89 01:15:46 EDT


thesis.tex:

\documentstyle[11pt,mitthesis]{report}

\addtolength{\topmargin}{.125in}
\addtolength{\textheight}{-.25in}

\addtolength{\oddsidemargin}{.125in}
\addtolength{\evensidemargin}{.125in}
\addtolength{\textwidth}{-.25in}

\begin{document}


\title{Teaching a Computer to Recognize Musical Instruments}

\author{Stephen M. Malinak}
\department{Department of Electrical Engineering and Computer Science}

\degree{Bachelor of Science}

\degreemonth{May}
\degreeyear{1989}
\thesisdate{May 10, 1989}

\supervisor{Campbell L. Searle}{Professor of Electrical Engineering}

\chairman{Leonard A. Gould}
  {Chairman, Department Committee on Undergraduate Theses}

\begin{titlepage}

\begin{Large}

{\bf Teaching a Computer to Recognize Musical Instruments}

by

Stephen M. Malinak

\end{Large}

\vspace{1em}

Submitted to the Department of Electrical Engineering and Computer Science

In Partial Fulfillment of the Requirements for the Degree of

Bachelor of Science in Electrical Science and Engineering

at the Massachusetts Institute of Technology

May 1989

Copyright Stephen M. Malinak 1989


The author hereby grants to M.I.T. permission to reproduce \\
and to distribute copies of this thesis document in whole or in part.


\begin{flushright}
Author \hrulefill\\
Department of Electrical Engineering and Computer Science\\
May 10, 1989

Certified by \hrulefill\\
Prof. Campbell L. Searle\\
Thesis Supervisor

Accepted by \hrulefill\\
Leonard A. Gould\\
Chairman, Department Committee on Undergraduate Theses
\end{flushright}
\end{titlepage}



\begin{abstractpage}
Previous studies of musical timbre recognition have assumed that the
ear separates the input sound signal into narrow, overlapping
frequency bands.  Current research suggests that the auditory system
may be better modeled at normal sound levels as a collection of broad
bandwidth, overlapping filters.  In this thesis, a simple twelve
channel broad bandwidth filter bank model of the auditory system is
used to analyze a collection of notes from two woodwind instruments
(clarinet, flute) and three brass instruments (French horn, trombone,
trumpet).  Certain transient features of the attack and spectral
features of the steady-state provide the tools for construction and
testing of a musical instrument recognition algorithm.  Other
instrumental characteristics suggest more reliable alternative
algorithms which would use more criteria gained from the study of many
more notes.
\end{abstractpage}


\section*{Acknowledgments}

I would especially like to thank Ken Malsky for providing the
digitized notes and Richard Kim for providing the filter bank from
which I constructed the analyses in this thesis.  Hugh Secker-Walker
and Richard have written numerous programs for filtering, statistical
analysis, and graphical display of signals.  Hugh and Richard provided
greatly appreciated assistance with the digital signal processing
software library upon which this thesis was built.  Professor Searle,
my supervisor, was nearly always available (!) to assist me in
directing my research.

I would like to thank Mom, Dad, and the financial aid office for
helping to pay for MIT.  I also want to thank my classmates and
friends who have helped me to survive this place and still manage to
get an excellent education.

\tableofcontents
\listoffigures
\listoftables



\chapter{Background and Motivation}


\section{Modeling the peripheral auditory system}

Decades of research have shown that the signal processing done by the
peripheral auditory system includes a complicated filtering function.
The bandwidth of the filtering appears to increase with stimulus
intensity.  To model this phenomenon, Richard Kim created a
``semi-linear'' filter bank which consists of twenty sets of linear
fifty-channel filter banks.  Different stimulus levels trigger
different sets of filters: narrow bandwidth at the threshold of
hearing and wide bandwidth at normal ``conversational'' levels
\cite{Kim}.  This study of timbre recognition is based upon a subset
of Kim's model with the filters set at their widest bandwidths.


\section{Previous studies of musical timbre}

Studying musical timbre produces information useful for a variety of
research areas.  Musical notes provide simpler data than speech
waveforms for evaluating models of the auditory system such as Richard
Kim's filter bank.  Moreover, timbre research leads to better
synthesis of musical sounds.  In fact, John Grey states that timbre
studies will not be complete until it is possible to re-synthesize an
analyzed tone which is {\em indistinguishable} from the original tone
\cite[page 16]{Grey}.  Furthermore, many timbre researchers hope to
extend pitch detection and instrument recognition to computer
transcription of orchestral scores.

The human auditory system gathers three kinds of information from
sounds.  A listener detects loudness from the signal's intensity and
pitch from the signal's periodicity.  The auditory system then
determines the timbre of the signal \cite[page 138]{Roe}.  According to the
American National Standards Institute (1960), ``{\em Timbre} is that
attribute of auditory sensation in terms of which a listener can judge
two sounds similarly presented and having the same loudness and pitch
as dissimilar \cite[page 113]{Ross}.''  Because of its multi-faceted
nature, timbre perception depends on a tremendous variety of qualities
from the signal.  Pitch and loudness can each be graded on a
one-dimensional scale, but timbre requires several ``dimensions''.
Grey studied sixteen instruments and created a sketchy
three-dimensional model for ``timbral space'':

\begin{quote}
One dimension related to the {\em spectral energy distribution}, while
the other two related to the temporal pattern of the attack and decay
of the tones, namely, the presence of {\em low-amplitude,
high-frequency energy}\/ in the initial attack segment and the
presence of {\em synchronicity}\/ in the attacks and decays of the
higher harmonics.  This interpretation is qualified with the following
remarks: 1) it may be important that the energy referred to in the
former dimension was {\em inharmonic}\/; 2) it is also possible that
musical instrument {\em family}\/ relationships were the basis of the
latter dimension \cite[page 67]{Grey}.
\end{quote}



	\subsection{Importance of the attack}

Most timbre researchers have studied individual musical notes played
on one instrument at a time.  Musical notes consist of three temporal
sections: the attack, the steady-state, and the decay.  Generally, the
amplitude of the signal's envelope increases during the attack,
remains relatively constant during the steady-state, and decreases
during the decay.  Figure 1-1 shows a musical note.

\begin{figure}[htbp]
\vspace*{3in}
\caption{A musical note}  This is ``Middle C'' (262 Hz) played on a
clarinet.  The attack is from x msec to y msec, steady state from
y msec to z msec, and decay from z msec to q msec.
\end{figure}

Early timbre studies focused on the steady-state portion of musical
notes.  Hermann von Helmholtz determined in 1885 that the steady-state
spectrum of a note strongly correlates with its timbre \cite[page
52]{Dodge}.  The pitch of the tone appears at the {\em fundamental}\/
frequency in the spectrum, although the amplitude of this fundamental
is not necessarily the maximum amplitude of the spectrum.  Energy is
also usually present at several {\em harmonics}\/ --- frequencies
which are integral multiples of the fundamental.  Spectra often
contain some inharmonic energy caused by excitation actions such as
bowing or blowing.  Like the human voice, many instruments have {\em
formants}.  Formants are spectral maxima which tend to remain in a
fixed frequency range regardless of the pitch of the note.  All of
these features give an instrumental signature to the steady-state
spectrum of individual notes.

Recent studies, however, have determined that the attack may be at
least as important as the steady-state for recognition of instruments.
For instance, in 1967 Strong and Clark interchanged the spectra of
pairs of wind instrument notes while retaining the same time envelopes
and presented the hybrid waveforms to listeners.  They found the
spectrum to be more important for identification of certain
instruments, the time envelope for others, and both of equal
importance for the rest \cite[page 119]{Ross}.  Grey's study, quoted
above, found that the attack plays a major role in two dimensions of
timbral space.  John Bourne found risetimes and temporal onset
patterns of harmonics to be useful for a recognition algorithm
\cite[page 74]{Bourne}.



	\subsection{Characteristics of five instruments}

In 1978, J\"{u}rgen Meyer conducted a study of numerous orchestral
instruments, and the following subsections summarize his findings for
the five instruments analyzed in the next chapter.  Several of these
characteristics demonstrate how the timbre of a type of instrument
depends on several variables such as pitch and loudness of the note
played and actions of the human performer.


		\subsubsection{Clarinet}

\begin{itemize}
\item in the low register odd harmonics are much stronger than the
even harmonics; in the middle register the second harmonic is still
weaker than the first and third, although higher harmonics are more
nearly equal \cite[page 54]{umlatt}.
\item risetime varies from 15 ms for a ``hard'' attack to more than 50
ms for a ``soft'' attack \cite[page 55]{umlatt}.
\item entrance of higher harmonics is delayed in a soft attack
\cite[page 56]{umlatt}.
\end{itemize}


		\subsubsection{Flute}

\begin{itemize}
\item has a longer attack than any other orchestral instrument
\cite[page 50]{umlatt}.
\item harmonic content varies tremendously with dynamics: {\em pp}
notes sound like pure sinusoids while {\em ff} notes contain numerous
strong overtones \cite[page 50]{umlatt}.
\item notes start with ``preliminary tones'' --- high frequency resonances
caused by the start of blowing \cite[page 50]{umlatt}.
\end{itemize}

		\subsubsection{French horn}

\begin{itemize}
\item 10 -- 20 ms preliminary impulse in harmonics below 1000 Hz \cite[page
39]{umlatt}.
\item poor attacks can be characterized by repeated blips which sound like a
rolled ``r'' \cite[page 39]{umlatt}.
\item main formant occurs around 340 Hz \cite[page 37]{umlatt}.
\end{itemize}

		\subsubsection{Trombone}

\begin{itemize}
\item fast risetime: final amplitude reached in about 20 ms for notes
in the high register \cite[page 47]{umlatt}.
\item biggest formant at 480 Hz in low register, 600 Hz in high
register \cite[page 46]{umlatt}.
\item preliminary impulse which sounds harsh in strong attacks and
becomes barely detectable in smooth attacks \cite[page 47]{umlatt}.
\end{itemize}

		\subsubsection{Trumpet}

\begin{itemize}
\item richer in overtones than any other orchestral instrument
\cite[page 42]{umlatt}.
\item main formant at about 1200 Hz \cite[page 42]{umlatt}.
\item ``incisive'' attack lasts 20 -- 25 ms and is marked by high
frequencies and a 5 ms preliminary impulse \cite[page 44]{umlatt}.
\end{itemize}



\section{Timbre recognition: a ``conditioned response''}

Recognition of timbral sources is a complicated and learned process.
A person unfamiliar with a particular type of instrument can determine
pitch, loudness, and a few vague timbral characteristics (``bright'',
``nasal'', ``muddy''), but he cannot identify the source of the sound.
Juan Roederer describes timbre perception as a conditioned response
consisting of ``(1) {\em storage}\/ in the memory with an adequate
label of identification, and (2) {\em comparison}\/ with previously
stored and identified information \cite[page 138]{Roe}.''  Hence, a
reasonable approach to teaching a machine to recognize musical
instruments includes: (1) studying and labeling information derived
from various types of musical signals, and (2) constructing an
algorithm to recall this stored information and identify the tone
source.




\input{two}




\chapter{An Algorithm for Recognition of Musical Instruments}


\section{Creation of algorithm}

The data analysis explained in the previous chapter reveals several
characteristics of musical notes.  These characteristics provide
enough useful information for the creation of a detection algorithm
which can identify an instrument from the first 320 milliseconds of a
single note.  The simple algorithm presented here uses only three
characteristics: high frequency inharmonic bursts, presence of a
second harmonic, and spectral distribution of energy.  Each test
distributes a certain number of points to each possible instrument,
and the algorithm chooses the instrument with the most points at the
end of these tests.

The decision program includes several of the subroutines used to
analyze the original seventeen notes.  This program first determines
the fundamental pitch of the raw note.  Next, it finds the maximum
value and ``onset point'' of each channel.  Finally the program detects
the inharmonic bursts found at the start of woodwind notes and examines the
harmonics of the three bands above the fundamental band.  After
collection of this data, the decision algorithm awards points to the
five possible instruments according to the values in Table~3.1.


\begin{table}[tbp]

	\caption{Points for decision criteria.}  The values of channel
maxima for the energy criteria were picked as an approximate best
boundary for the data from the seventeen notes.

\centering

\vspace{1em}

\begin{tabular}{|l||l|l|l|l|l|}
\multicolumn{6}{c}{POINTS FOR CHARACTERISTICS}\\
\hline
instrument & 	clarinet & 	flute & 	french horn & 	trombone & 
	trumpet\\
\hline \hline
Inharmonic bursts & 15 &	15 &		0 &		0 &
	0\\
\hline
No inharmonic bursts &	0 &	0 &		10 &		10 &
	10\\
\hline \hline
No 2nd harmonic & 20 &		0 &		0&		0 &
	0\\
\hline
2nd harmonic & 	0 &		5 &		5 &		5 &
	5\\
\hline \hline
Channel 10 max $<$ .04 & 0 &	0 &		20 &		0 &
	0\\
\hline
Channel 10 max $\geq$ .04 & 5 &	5 &		0 &		5 &
	5\\
\hline \hline
Channel 12 max $>$ .15 & 0 &	0 &		0 &		0 &
	20\\
\hline
Channel 12 max $\leq$ .15 & 5 &	5 &		5 &		5 &
	0\\
\hline
\end{tabular}

\end{table}

Although it is somewhat arbitrary, this method of awarding points
provides a reasonably efficient way to utilize the small amount of
data gathered from the seventeen notes.  A test that only one
instrument can pass is worth twenty points.  If the note passes such a
test, that instrument gets all twenty points.  Otherwise, the points
are evenly distributed among the other four instruments, five to
each.  Since bursts can occur in two instruments, the burst criterion
distributes 30 points: fifteen to each woodwind when bursts are
present, or ten to each brass instrument when they are not.  This
simple algorithm works very
well for the seventeen notes studied.
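
The point scheme described above can be sketched in a few lines of
Python.  This is an illustration only, not the program used in this
study, and the names {\tt score\_note} and {\tt identify} are
hypothetical.  Two details are assumptions inferred from the scores
tabulated for the test notes later in this chapter rather than stated
in the text: a second harmonic is counted as present when any of the
three bands above the fundamental band reports a harmonic of 2, and
the twenty points of the channel 10 test are credited to the French
horn.

\begin{verbatim}
```python
INSTRUMENTS = ["clarinet", "flute", "french horn", "trombone", "trumpet"]


def score_note(bursts, harmonics, ch10_max, ch12_max):
    """Return a dict of points per instrument for one measured note."""
    points = dict.fromkeys(INSTRUMENTS, 0)

    def award(names, value):
        for name in names:
            points[name] += value

    # Test 1: inharmonic bursts at the start of the note (woodwinds).
    if bursts:
        award(["clarinet", "flute"], 15)
    else:
        award(["french horn", "trombone", "trumpet"], 10)

    # Test 2: second harmonic.  ASSUMPTION: "present" means that any of
    # the three bands above the fundamental band reports a harmonic of 2.
    if 2 in harmonics:
        award(["flute", "french horn", "trombone", "trumpet"], 5)
    else:
        award(["clarinet"], 20)

    # Test 3: little energy in channel 10.  ASSUMPTION: the twenty
    # points go to the French horn, as the tabulated scores imply.
    if ch10_max < 0.04:
        award(["french horn"], 20)
    else:
        award(["clarinet", "flute", "trombone", "trumpet"], 5)

    # Test 4: strong energy in channel 12 (trumpet).
    if ch12_max > 0.15:
        award(["trumpet"], 20)
    else:
        award(["clarinet", "flute", "french horn", "trombone"], 5)

    return points


def identify(points):
    """Pick the instrument with the most points."""
    return max(points, key=points.get)
```
\end{verbatim}

For the 262 Hz trumpet test note (no bursts, harmonics 2, 2, 4,
channel maxima 0.9019 and 0.3013) this sketch gives the trumpet 40
points and the trombone 20, in agreement with the tabulated scores.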

To speed and automate the processing of new edited notes, a single
shell script controls all of the filtering, data collection, and
decision making.  Hence, once a note has been edited from its
recording, all further processing is done without any human input.
This shell script can run in the ``background'' of a UNIX environment.
Processing a single note takes approximately five (** check this)
minutes on a VAX 11/750 if no one else is using the computer at the same
time.

\section{Testing}

The algorithm was first applied to the seventeen notes that had been
carefully analyzed.  Since the decision tests were derived from
characteristics measured for these notes, the algorithm matched all
seventeen notes to the appropriate instruments.  Tables 2.3 and 2.4
summarize the scoring for these notes.

Creating a more challenging test required editing more notes from the
recordings.  These new notes had the same fundamental pitches as the
first notes, but this time the notes were extracted from the downward
runs of the arpeggios rather than the upward runs.  These notes had
not been examined before the creation of the decision criteria.  The
algorithm matched fifteen of these seventeen new notes to the
appropriate instrument.  Results of this testing appear in Tables 3.2
and 3.3.

The algorithm's major flaw is that when it misidentifies a note, its
answer can be badly wrong.  One flute note did not have the inharmonic
attack transients, so the algorithm called it a trombone note, a brass
instrument rather than a woodwind.  The next chapter suggests a more
intelligent algorithm which would fail more gracefully if it did not
correctly identify an instrument.

Two other problems, however, were handled more successfully.  One
clarinet note had a detectable second harmonic, so the algorithm
called it a flute note.  The flute falls into the same orchestral
category as the clarinet.  Finally, one French horn note's preliminary
impulses circumvented the temporal criteria designed to catch them and
were considered woodwind bursts.  Nevertheless, the other tests
enabled the algorithm to correctly identify this note.




\begin{table}[hbp]
	\caption{Testing of algorithm on new woodwind notes}
\begin{large}
\centering
\vspace{1em}
 \begin{tabular}{|l||l|l|l|l||l|l|l|}
\multicolumn{8}{c}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{4}{c||}{Clarinet} &
		   \multicolumn{3}{c|}{Flute}\\
\hline
raw note &         C 262 & 	E 330 &		G 392 &		C 523 &
				E 330 & 	G 392 & 	C 523\\
\hline \hline
fundamental &	   262.3 &	326.5 &		395.1 &		524.6 &
				326.5 &		395.1 &		524.6\\
\hline
bursts? &	   yes &	yes &		yes &		yes &
				no &		yes &		yes\\
\hline
harmonic[1] &	   1 &		1 &		1 &		1 &
				2 &		2 &		2\\
\hline
harmonic[2] &	   3 &		1 &		1 &		3 &
				2 &		2 &		3\\
\hline
harmonic[3] &	   3 &		3 &		2 &		3 &
				2 &		3 &		3\\
\hline
channel 10 max &   0.1199 &	0.1284 &	0.0501 &	0.0720 &
				0.0778 &	0.1246 &	0.1121\\
\hline
channel 12 max &   0.0413 &	0.0323 &	0.0090 &	0.0147 &
				0.0317 &	0.0310 &	0.0340\\
\hline
\multicolumn{8}{c}{SCORES}\\
\hline
clarinet &	   45 ** &	45 ** &		25 &		45 ** &
				10 &		25 &		25\\
\hline
flute & 	   25 &		25 &		30 ** &		25 &
				15 &		30 ** &		30 **\\
\hline
french horn &	   5 &		5 &		10 &		5 &
				20 &		10 &		10\\
\hline
trombone &	   10 &		10 &		15 &		10 &
				25 ** &		15 &		15\\
\hline
trumpet &	   5 &		5 &		10 &		5 &
				20 &		10 &		10\\
\hline
\end{tabular}
\end{large}
\end{table}

\begin{table}[p]
	\caption{Testing of algorithm on new brass notes}
\centering
\vspace{1em}
\begin{tabular}{|l||l|l|l||l|l|l|}
\hline
\multicolumn{7}{|c|}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{3}{c||}{French Horn} &
		   \multicolumn{3}{c|}{Trombone}\\
\hline
raw note &      	C 262 &		E 330 &		G 392 &
	      		C 262 & 	E 330 & 	G 392\\
\hline \hline
fundamental &	   	262.3 &		333.3 &		395.1 &
			262.3 &		329.9 &		400.0\\
\hline
bursts? &	   	no &		yes &		no &
			no &		no &		no\\
\hline
harmonic[1] &	   	2 &		1 &		1 &
			2 &		2 &		2\\
\hline
harmonic[2] &		2 &		2 &		2 &
			3 &		2 &		2\\
\hline
harmonic[3] &	   	2 &		3 &		3 &
			3 &		3 &		2\\
\hline
channel 10 max &   	0.0214 &	0.0228 &	0.0205 &
			0.1141 &	0.1732 &	0.1160\\
\hline
channel 12 max &  	0.0050 &	0.0039 &	0.0043 &
			0.0188 &	0.0263 &	0.0201\\
\hline \hline
\multicolumn{7}{|c|}{SCORES}\\
\hline
clarinet &		5 &		20 &		5 &
			10 &		10 &		10\\
\hline
flute & 		10 &		25 &		10 &
			15 &		15 &		15\\
\hline
french horn &		40 ** &		30 ** &		40 ** &
			20 &		20 &		20\\
\hline
trombone &		20 &		10 &		20 &
			25 ** &		25 ** &		25 **\\
\hline
trumpet &		15 &		5 &		15 &
			20 &		20 &		20\\
\hline
\end{tabular}

\vspace{3em}

\begin{tabular}{|l||l|l|l|l|}
\hline
\multicolumn{5}{|c|}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{4}{c|}{Trumpet}\\
\hline
raw note &         C 262 & 	E 330 & 	G 392 & 	C 523\\
\hline \hline
fundamental &	   260.2 &	329.9 &		390.2 &		516.1\\
\hline
bursts? &	   no & 	no &		no &		no\\
\hline
harmonic[1] &	   2 &		2 &		2 &		2\\
\hline
harmonic[2] &	   2 &		3 &		3 &		2\\
\hline
harmonic[3] &	   4 &		3 &		4 &		3\\
\hline
channel 10 max &   0.9019 &	0.9268 &	0.9455 &	0.8304\\
\hline
channel 12 max &   0.3013 &	0.3603 &	0.2943 &	0.3196\\
\hline \hline
\multicolumn{5}{|c|}{SCORES}\\
\hline
clarinet &	   5 &  	5 &		5 &		5\\
\hline
flute & 	   10 &		10 &		10 &		10\\
\hline
french horn &	   15 &		15 &		15 &		15\\
\hline
trombone &	   20 &		20 &		20 &		20\\
\hline
trumpet &	   40 ** &	40 ** &		40 ** &		40 **\\
\hline
\end{tabular}
\end{table}






\chapter{Conclusions and Suggestions for Further Work}


\section{Conclusions}

This project was motivated by the desire to see if a wide-band
bandpass filter bank simulation of the auditory system can help a
computer to recognize musical instruments.  The filtered notes show
numerous characteristics of their instruments.  Some of these
characteristics are so strong that a simple algorithm can usually
identify an instrument from the first 320 milliseconds of a single
note.  The twelve channel filter bank could also be very useful for a
more rigorous statistical study of many notes to support a complex
decision algorithm.

The project was mostly successful, although the usefulness of the
simple recognition algorithm is rather limited.  Given the restricted
choice of five instruments played by a single performer at one
loudness over a single octave, the algorithm correctly identified the
instruments from which 32 of 34 notes came.  Hence, a wide band filter
bank can assist in a practical recognition task.



\section{Suggestions for further work}

	\subsection{Possible extensions for algorithm}

Studying more notes --- hundreds or thousands of them --- could lead
to a more reliable and ``intelligent'' algorithm.  Analyzing more than
one performer's notes would result in a more robust decision process.
Numerous simple extensions could expand this instrument recognition
system without changing its basic approach.  Notes from the same
instruments but from different octaves could be analyzed.  The present
algorithm could easily be extended to new instruments after a study of
several of their notes.  The addition of new instruments would lead to
the introduction of new criteria for decisions.

	\subsection{Statistical studies}

The methods of data collection used in this project could also be used
to construct a much more rigorous and ``intelligent'' decision-making
system.  A better algorithm would assign points differently.  Instead
of awarding points on an all-or-nothing basis for each test, different
measurement levels could result in different awarding of points.
Probabilistic decision analysis models are available for this kind of
algorithm.  These models would make decisions on the basis of assumed
probabilistic distributions derived from extensive study of several
thousand notes.  Characteristics such as harmonics, spectral
distribution of energy, attack transients, risetimes, onset patterns,
and pitch range could prove quite helpful for this kind
of study.  




\begin{thebibliography}{99}

\bibitem{Bourne} Bourne, John B.
{\em Musical Timbre Recognition Based on a Model of the Auditory
System.} Master's Thesis, Department of Electrical Engineering,
Massachusetts Institute of Technology, June 1972.

\bibitem{Dodge} Dodge, Charles and Thomas A. Jerse.
{\em Computer Music: Synthesis, Composition, and Performance.}  New
York: Schirmer Books, 1985.

\bibitem{Grey} Grey, John.
{\em An Exploration of Musical Timbre.}  PhD Thesis, Department of
Music, Stanford University, February 1975.

\bibitem{Kim} Kim, Richard.
{\em A Semi-Linear Filter Bank Model of the Peripheral Auditory
System.}  Master's Thesis, Department of Electrical Engineering and
Computer Science, Massachusetts Institute of Technology, May 1988.

\bibitem{umlatt} Meyer, J\"{u}rgen.
{\em Acoustics and the Performance of Music.}  Frankfurt: Verlag Das
Musikinstrument, 1978.

\bibitem{Roe} Roederer, Juan G.
{\em Introduction to the Physics and Psychophysics of Music,} Second
Edition.  New York: Springer-Verlag, 1975.

\bibitem{Ross} Rossing, Thomas D.
 {\em The Science of Sound.}  Reading, Massachusetts: Addison-Wesley
Publishing Company, 1982.


\end{thebibliography}


\end{document}


two.tex:

\chapter{Data Analysis}

\section{The raw notes}

A study of musical timbre based on a digital bandpass filter system
requires a set of digitized musical recordings.  The ideal raw data
would be a collection of notes played on only one instrument at a time
and separated by rests.  Furthermore, availability of the same notes
on several instruments allows direct comparisons of color for the
same fundamental tone.

	\subsection{Recordings}

Fortunately, Ken Malsky needed precisely the same kinds of notes when
he conducted a psychoacoustics experiment in 1987.  He could not find
an appropriate collection of notes, so he and Peter Andrews recorded
and digitized their own set.  These recordings contain arpeggios of
ten orchestral instruments: clarinet, flute, French horn, trombone,
trumpet, alto sax, tenor sax, violin, viola, and cello.  The arpeggios
are in the key of C and cover two octaves of each instrument's range.
The recordings have a sampling rate of 32 kHz, so they contain energy
up to about 16 kHz.  These recordings contain the musical notes used
in this study of timbre.

Selecting instruments with completely different ranges or timbres
would make the identification problem too simple.  Consequently, this
study focuses on five instruments with similar timbres: two woodwinds
(clarinet, flute) and three brasses (French horn, trombone, trumpet).
All four pitches studied --- middle C (262 Hz), E (330 Hz), G (392
Hz), and high C (523 Hz) --- come from a single octave.  Because of
differing instrumental ranges, some of the pitches are missing from
certain instruments.  There is no middle C for the flute, and there is
no high C for the French horn and trombone.  Thus, a total of
seventeen notes from these five instruments provide raw data for
timbre analysis.


	\subsection{Edited notes}

Previous studies of musical timbre found that most of the information
about the identity of an instrument comes from the attack and
steady-state sections of notes.  Consequently, the raw notes are
edited down to the first 320 milliseconds.  Although risetimes vary
significantly, this duration generally includes the entire attack and
a section of the steady-state.  This study completely ignores the
decay.  The original digitized recordings contain all of the notes
from one instrument in a single file.  Visual examination determined
the precise location of each attack.  Each edited note contains
approximately 500 samples (16 ms) of background noise before the
attack begins.  A total of 20,000 samples (640 ms) are stored on disk
for each note, although nearly all of the analytical programs use only
the first 10,000 samples.  Each edited note is individually normalized
to make its peak absolute value equal to one.  Figures 2-3 through 2-7
at the end of this chapter show a time plot of a representative note
from each of the five instruments.
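
The sample counts and durations quoted above are tied together by the
32 kHz sampling rate.  As a quick check, with $N$ the number of
samples and $t$ the corresponding duration (symbols introduced here
only for illustration),
\[ t = \frac{N}{32{,}000\ \mbox{samples/sec}}: \qquad
   \frac{500}{32{,}000} \approx 15.6\ \mbox{ms}, \qquad
   \frac{10{,}000}{32{,}000} = 312.5\ \mbox{ms}, \qquad
   \frac{20{,}000}{32{,}000} = 625\ \mbox{ms}. \]
The durations quoted in the text therefore appear to be rounded values.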


\section{Twelve channel wide bandwidth filter bank}

The edited notes are passed through a twelve channel filter bank
which uses the widest bandwidth setting of Richard Kim's semi-linear
filter bank.  Set at such wide bandwidths, adjacent channels contain
mostly the same information for a given musical note.  Visual analysis
of all fifty channels' outputs for the input of a 262 Hz clarinet note
showed that an evenly-spaced selection of twelve channels contains
almost as much information as all fifty.  Figure 2-1 shows the
frequency response of the twelve channel filter bank used in this
experiment.  Kim based the shape of the wide bandwidth filters on data
from a study of cats' auditory systems:
\begin{quote}
The high frequency skirt slope is very high, up to several hundred dB
per octave.  The lower frequency skirt is much less steep. It is as
low as 10 dB per octave \cite[page 18]{Kim}.
\end{quote}
Because of the
filters' asymmetry, the peaks of the filters are closer to the high
ends of the pass bands.  Signals at frequencies below a filter's
``center frequency'' are attenuated less than those at higher
frequencies.  Table 2.1 presents simplified pass bands whose endpoints
are at these center frequencies.  The analysis programs use this table
to determine which channel contains the fundamental frequency of a
note.


\begin{figure}[p]
\vspace{4.5in}
\caption{Twelve channel filter bank}
\end{figure}

\begin{table}[p]
\caption{Boundaries of the twelve channels}
\centering
\vspace{1em}
\begin{tabular}{|c|c|c||c|c|c||c|c|c|}
\hline
band & low & high & band & low & high & band & low & high\\
\hline \hline
1 &	0&    126&	5&	372&	522&	9&	1369&	1867\\
\hline
2 & 	127&  181&	6&	523&	725&	10&	1868&	2533\\
\hline
3 &	182&  261&	7&	726&	998&	11&	2534&	3431\\
\hline
4 &	262&  371&	8&	999&	1368&	12&	3432&	4596\\
\hline
\end{tabular}
\end{table}
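
The band lookup that the analysis programs perform with this table can
be sketched as follows.  This is a Python illustration under stated
assumptions, not the {\tt pickband} subroutine itself: the name
{\tt pick\_band} is hypothetical, and a frequency that falls in the
1 Hz gap between two tabulated integer edges is assigned to the upper
band.

\begin{verbatim}
```python
from bisect import bisect_left

# Upper edge (Hz) of each of the twelve bands, copied from the table
# of channel boundaries.
BAND_HIGH = [126, 181, 261, 371, 522, 725, 998, 1368, 1867, 2533, 3431, 4596]


def pick_band(freq_hz):
    """Return the band number (1..12) containing freq_hz.

    Returns None for a negative frequency or one above the top band.
    ASSUMPTION: a frequency between two tabulated integer edges
    (for example 261.6 Hz) belongs to the upper band.
    """
    if freq_hz < 0:
        return None
    index = bisect_left(BAND_HIGH, freq_hz)
    if index == len(BAND_HIGH):
        return None
    return index + 1
```
\end{verbatim}

With these boundaries, a 392.0 Hz fundamental falls in band 5 and its
fifth harmonic at 1960.0 Hz falls in band 10.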


After filtering, the notes no longer contain any energy above 8 kHz.
Consequently, the filtered outputs are downsampled to 16 kHz (by
removing every other sample) to save disk space and allow faster
analysis.  Figures 2-3 through 2-7 show the outputs of the twelve
channel filter bank for a representative note from each instrument.
Table 2.2 shows how these notes and their harmonics fall into the
twelve channels of the bandpass filter model.
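
The downsampling step can be checked with standard sampling
arithmetic.  Writing $x[n]$ for a channel output at the original rate
$f_s = 32$ kHz and $y[n]$ for the downsampled output (symbols
introduced here only for illustration), removing every other sample
gives
\[ y[n] = x[2n], \qquad f_s' = \frac{f_s}{2} = 16\ \mbox{kHz}, \qquad
   \frac{f_s'}{2} = 8\ \mbox{kHz}. \]
Because the filtered notes contain no energy above 8 kHz, the
half-rate signal can represent them without aliasing.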

\begin{table}[p]
 \caption{Notes, harmonics, and channels}
\centering
\vspace{1em}
\begin{tabular}{|l||r|l|r|l|r|l|r|l|r|l|}
\hline
Note & \multicolumn{2}{c|}{Fund.} & \multicolumn{2}{c|}{2nd} & 
\multicolumn{2}{c|}{3rd} & \multicolumn{2}{c|}{4th} & 
\multicolumn{2}{c|}{5th}\\
\hline \hline
	C &	261.6 &4 &  523.3 &6 &  784.9 &7 &  1046.5 &8 &  1308.1 &8\\
\hline
	E &	329.6 &4 &  659.3 &6 &  988.9 &7 &  1318.5 &8 &  1648.1 &9\\
\hline
	G &	392.0 &5 &  784.0 &7 & 1176.0 &8 &  1568.0 &9 &  1960.0 &10\\
\hline
	C &	523.3 &6 & 1046.5 &8 & 1569.8 &9 &  2093.0 &10 & 2616.3 &11\\
\hline
\end{tabular}
\end{table}


\section{Methods for analyzing filtered notes}

	\subsection{Visual analysis}

The filtered notes provide much valuable information for
identification of the five instruments.  After studying filter bank
outputs such as those in Figures 2-3 through 2-7, a person can usually
determine which notes belong to which instruments.  Visual analysis of
these filter bank outputs reveals several timbral qualities found in
previous studies.  These characteristics include risetimes, onset
patterns, presence of harmonics, distribution of energy, and
inharmonic attack transients.


	\subsection{Computational analysis}

Getting useful data for computational analysis requires repeatedly
reworking measurement programs.  The values of various parameters
strongly depend on how these parameters are measured.  Consequently,
the measurement techniques described below resulted from several
generations of trial programs.  These techniques provide the most
distinct and reliable contrasts between instruments while retaining
the most consistency among the notes of a single instrument.

One of the most important characteristics of a note is its pitch.  The
subroutine {\tt findfund} looks at the raw note's 20,000 samples to
determine its fundamental frequency.  This subroutine counts the
number of samples between consecutive positive peaks and examines the
resulting array of interpeak distances.  The fundamental period is the
interpeak distance that occurs most often (the mode).  The subroutine
attempts to ignore harmonics and noise by requiring each new peak to
be greater than a threshold of 90\% of the previous peak.  The
following formula determines the fundamental frequency:
\[ f\ \mbox{(Hz)} = \frac{32{,}000\ \mbox{samples/sec}}{\mbox{period (samples)}} \]
Another subroutine, {\tt pickband}, determines which of the twelve
bands in Table 2.1 contains this fundamental frequency.
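
The pitch measurement described above can be sketched in a few lines of
Python.  This is only an illustration under stated assumptions, not the
original {\tt findfund} code: the names {\tt find\_peaks} and
{\tt find\_fundamental} are hypothetical, a ``peak'' is assumed to be a
positive local maximum, and the 90\% threshold is assumed to compare each
candidate with the most recently accepted peak.

```python
import math
from collections import Counter

SAMPLE_RATE = 32000  # samples/sec, from the formula above


def find_peaks(samples):
    """Indices of positive local maxima (an assumed definition of 'peak')."""
    return [i for i in range(1, len(samples) - 1)
            if samples[i] > 0
            and samples[i - 1] < samples[i] >= samples[i + 1]]


def find_fundamental(samples, threshold=0.9):
    """Estimate the fundamental (Hz) as the mode of the interpeak distances."""
    peaks = find_peaks(samples)
    if not peaks:
        return None
    kept = [peaks[0]]
    for i in peaks[1:]:
        # Keep a new peak only if it reaches 90% of the last accepted peak.
        if samples[i] >= threshold * samples[kept[-1]]:
            kept.append(i)
    distances = [b - a for a, b in zip(kept, kept[1:])]
    if not distances:
        return None
    period = Counter(distances).most_common(1)[0][0]  # most frequent distance
    return SAMPLE_RATE / period


# Example: a 20,000-sample test tone whose period is 122 samples.
tone = [math.sin(2 * math.pi * n / 122 + 0.3) for n in range(20000)]
print(find_fundamental(tone))  # 32000 / 122, about 262.3 Hz
```

Because the period is an integer number of samples, the estimate can only
take the values 32,000/$n$, which may explain why a measured fundamental
differs slightly from the nominal pitch of a note.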

Finding the harmonics present in a signal requires more complicated
analysis.  A modified version of the {\tt findfund} routine with a
different threshold failed because of the phase of the harmonics; for
instance, the second harmonic often peaks somewhere other than the
middle of the fundamental period.  The final version of the harmonic
detector first tracks the fundamental: it finds a peak, looks for
another peak within the fundamental period $\pm$ two samples, then
looks for the next peak, etc.  If the chain of peaks breaks before the
end of the signal, the wrong starting peak has been selected ({\em
i.e.,} it is not part of the fundamental chain), so the algorithm
starts with the peak immediately following its original starting point
and tries to track again.  This tracking method worked on all of the
bands in which the fundamental period dominates --- the three bands
above the fundamental band.  The {\tt findharm} subroutine tallies the
number of maxima between each pair of fundamental peaks.  The harmonic
of the signal is defined to be the most often repeated number of
maxima per period.
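
A minimal Python sketch of this tracking and counting scheme follows.  It
is not the original {\tt findharm} routine; the function names are
hypothetical, and two details are assumptions: a chain counts as unbroken
once its next search window runs past the end of the signal, and the
closing fundamental peak of each period is included in the tally so that
a pure fundamental yields a count of one.

```python
import math
from collections import Counter


def local_maxima(samples):
    """Indices of all local maxima."""
    return [i for i in range(1, len(samples) - 1)
            if samples[i - 1] < samples[i] >= samples[i + 1]]


def track_fundamental(maxima, period, n_samples, tolerance=2):
    """Chain of maxima spaced period +/- tolerance samples apart.

    Each maximum is tried in turn as the starting peak; the first chain
    that runs to the end of the signal is returned, or [] if none does.
    """
    if period <= tolerance:
        raise ValueError("period must exceed the tolerance")
    maxima_set = set(maxima)
    offsets = sorted(range(-tolerance, tolerance + 1), key=abs)  # nearest first
    for start in maxima:
        chain = [start]
        while True:
            target = chain[-1] + period
            nxt = next((target + d for d in offsets
                        if target + d in maxima_set), None)
            if nxt is None:
                break
            chain.append(nxt)
        # Unbroken if the next search window extends past the signal.
        if chain[-1] + period + tolerance >= n_samples - 1:
            return chain
    return []


def find_harmonic(samples, period):
    """Most frequent number of maxima per fundamental period, or None."""
    maxima = local_maxima(samples)
    chain = track_fundamental(maxima, period, len(samples))
    if len(chain) < 2:
        return None
    # Count maxima in (a, b]: the closing fundamental peak is counted once.
    counts = [sum(1 for m in maxima if a < m <= b)
              for a, b in zip(chain, chain[1:])]
    return Counter(counts).most_common(1)[0][0]


# Example: a band in which the second harmonic dominates a 122-sample period.
band = [math.sin(4 * math.pi * n / 122) + 0.3 * math.sin(2 * math.pi * n / 122)
        for n in range(2000)]
print(find_harmonic(band, 122))
```

Counting maxima between tracked peaks avoids any assumption about where
within the period a harmonic peaks, which is the phase problem that
defeated the threshold-based approach.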

Risetimes of the envelopes of the filter bank outputs present a much
more difficult problem than harmonic detection.  These envelopes are
generally not simply rising exponentials.  Some of these envelopes
steadily increase during the entire period under study.  Other
envelopes undulate throughout the ``steady-state''.  For instance, the
``risetime'' of the trumpet note in Figure 2-2 is quite debatable.  The
filter bank outputs for this 262 Hz trumpet note appear in Figure 2-7.
Note that the fundamental 262 Hz channel finishes its rise at x**
ms, but the amplitudes of the higher channels grow tremendously
around y** ms.  Envelope modulations such as these tend to confuse
computer programs.  Ideally, a decision tree would allow different
criteria to be applied to different types of rises.  In this study,
however, the risetime of a signal was defined to be the time between
the first occurrence of 20\% of the peak value and the first occurrence
of 80\% of the peak value.  These thresholds are less susceptible
than traditional 10\% -- 90\% endpoints to noise and signal
variability.

\begin{figure}[htb]
\vspace*{4in}
\caption{What is the risetime of this note?}
\end{figure}

Good risetime measurements should make harmonic onset patterns trivial
to find.  The onset point of a filter bank output channel is merely
the number of the sample at which the signal's rise starts.  In this
analysis the ``onset point'' was the first occurrence of 20\% of the
signal's maximum.
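
The 20\%--80\% risetime and the 20\% onset point defined above can be
sketched as follows.  The sketch assumes that a nonnegative envelope of
each filter bank output is already available (the envelope extraction is
not shown), and the function names are hypothetical.

```python
SAMPLE_RATE = 32000  # samples/sec, as in the pitch formula


def first_crossing(envelope, fraction):
    """Index of the first sample at or above fraction * peak value.

    Assumes a nonnegative envelope; returns None if no sample qualifies.
    """
    level = fraction * max(envelope)
    for i, value in enumerate(envelope):
        if value >= level:
            return i
    return None


def onset_point(envelope):
    """Sample number of the first occurrence of 20% of the maximum."""
    return first_crossing(envelope, 0.20)


def risetime_ms(envelope, low=0.20, high=0.80):
    """Time in ms between the first 20% and first 80% crossings."""
    start = first_crossing(envelope, low)
    stop = first_crossing(envelope, high)
    return 1000.0 * (stop - start) / SAMPLE_RATE


def onset_pattern(envelopes):
    """Onset of each channel relative to the earliest onset, in samples."""
    onsets = [onset_point(e) for e in envelopes]
    earliest = min(onsets)
    return [o - earliest for o in onsets]


# Example: a linear rise over 1000 samples followed by a flat steady state,
# and a second channel that starts the same rise 300 samples later.
fast = [min(n, 1000) for n in range(3000)]
late = [0] * 300 + fast[:2700]
print(risetime_ms(fast), onset_pattern([fast, late]))
```

For the linear ramp the two crossings lie 600 samples apart, or about
18.75 ms.  Because only first crossings are used, an envelope that
undulates after reaching 80\% does not change the result, although a
preliminary impulse that exceeds 20\% would move the onset point earlier.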

Like the risetime, the energy of a signal can be measured several
ways.  Two alternatives were studied: (1) finding the short-term
root-mean-squared value of the signal, or (2) finding the signal's
maximum.  An {\sc rms} measurement of a steady-state waveform would
present the ideal way to measure energy, but several of the notes
studied have fluctuating ``steady-states''.  The amplitude
fluctuations lead to significantly different short-term {\sc rms}
values in different sections of the steady-state.  The short-term {\sc
rms} could be used if it were measured in several sections of the
signal and analyzed in some way.  The peak value of a signal, however,
gives a simpler indication of the relative energies of the channels.
In the seventeen notes analyzed (and the seventeen new notes tested
later), measuring the peak values of the twelve channels correctly
determines whether a note comes from a trumpet, a French horn, or
something else.
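
The two energy measures can be compared with a short Python sketch.  The
window length and the amplitude-modulated test signal are illustrative
assumptions, not values from this study, and the function names are
hypothetical.

```python
import math


def short_term_rms(samples, window):
    """RMS value of each consecutive, non-overlapping window."""
    return [math.sqrt(sum(x * x for x in samples[i:i + window]) / window)
            for i in range(0, len(samples) - window + 1, window)]


def peak_value(samples):
    """Largest absolute sample value in the signal."""
    return max(abs(x) for x in samples)


# Example: a tone whose "steady state" amplitude wavers slowly.
wavering = [(1 + 0.5 * math.sin(2 * math.pi * n / 4000))
            * math.sin(2 * math.pi * n / 61)
            for n in range(8000)]
print(short_term_rms(wavering, 2000))  # differs from section to section
print(peak_value(wavering))            # a single number per channel
```

The short-term {\sc rms} values of the wavering tone differ substantially
between sections, whereas the peak gives one value per channel that can
be compared directly across channels.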

Finally, high frequency inharmonic attack transients are detected on a
present/not present basis.  A subroutine checks each channel above the
fundamental for bursts of samples exceeding 10\% of the signal's
maximum value.  Early attempts at burst detection found the
inharmonic transients but also found the fundamental blips present
in the rise of some French horn notes.  To enable detection of
inharmonic woodwind transients and suppression of brass blips, the
inharmonic burst criterion only includes bursts which start more than
550 and finish more than 250 samples before the fundamental's onset
point.
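
The burst criterion can be sketched as below.  This is an illustration
rather than the original subroutine: the function names are hypothetical,
and the {\tt max\_gap} parameter, which merges runs of above-threshold
samples separated by short gaps, is an assumption introduced so that an
oscillating burst is treated as one event.

```python
def find_bursts(samples, fraction=0.10, max_gap=20):
    """(start, end) index pairs of runs where |x| exceeds fraction * max|x|.

    Runs separated by at most max_gap samples are merged (an assumption).
    """
    level = fraction * max(abs(x) for x in samples)
    bursts = []
    for i, x in enumerate(samples):
        if abs(x) > level:
            if bursts and i - bursts[-1][1] <= max_gap:
                bursts[-1][1] = i       # extend the current burst
            else:
                bursts.append([i, i])   # open a new burst
    return [tuple(b) for b in bursts]


def has_inharmonic_burst(channel, fundamental_onset,
                         start_lead=550, end_lead=250):
    """True if some burst starts more than start_lead samples and ends more
    than end_lead samples before the fundamental's onset point."""
    return any(start < fundamental_onset - start_lead
               and end < fundamental_onset - end_lead
               for start, end in find_bursts(channel))


# Example: an early burst well before the tone, and a blip just before it.
woodwind_like = [0.0] * 1000 + [0.3] * 200 + [0.0] * 800 + [1.0] * 1000
brass_like = [0.0] * 1900 + [0.3] * 100 + [1.0] * 1000
print(has_inharmonic_burst(woodwind_like, 2000))
print(has_inharmonic_burst(brass_like, 2000))
```

In the first test signal the burst ends 800 samples before the assumed
onset at sample 2000 and is accepted; in the second the blip begins only
100 samples before the onset and is rejected.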




\section{Conclusions of analyses}

Some characteristics prove to be immediately useful for timbre
recognition based on a single note.  The other characteristics apply
better to a statistical study of several notes from the same instrument.
A ``useful'' characteristic is defined here to be a quality of a note
which places it into one of two mutually exclusive groups; for
instance, only clarinets have no second harmonic.  Characteristics
good for statistical analysis do not individually lead to definite
conclusions; instead, they increase or decrease the probability of an
instrument. Tables 2.3 and 2.4 at the end of this chapter summarize
measurements of characteristics from the seventeen notes.  The next
chapter explains the scores in these tables.

	\subsection{Immediately useful characteristics}

		\subsubsection{Inharmonic attack transients}

Only woodwind notes contain high frequency inharmonic sputterings
before the start of the tone.  These bursts look like ``noise'' at the
start of the raw notes in Figures 2-2 and 2-3.  In the filtered
waveforms, the bursts appear in the high frequency channels (above 523
Hz), and they occur before or during the rise of the signal in the
fundamental channel.  The flute bursts tend to be longer and more
energetic than the clarinet bursts.  This burst characteristic
provides an excellent way to separate the instruments into two
categories: woodwinds (clarinet, flute) and brass (French horn,
trombone, trumpet).

		\subsubsection{Presence of harmonics}

As expected, all of the clarinet notes lack a second harmonic.  All of
the notes from the other instruments have a second harmonic in at least
one of the three channels above the fundamental channel.

		\subsubsection{Distribution of energy}

The energy characteristics allow easy differentiation of the brasses.
French horns have very little energy in the high frequency channels.
Channel 10 (2534 Hz center frequency) supplies the French horn
criterion because the French horn's energy differs more from other
instruments' energies in this particular frequency range than in any
other.  The trumpet, on the other hand, has most of its energy at high
frequencies.  For notes in the octave studied, the trumpet has far
more energy in channel 12 (4597 Hz center frequency) than any other
instrument.
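
Taken together, the three immediately useful characteristics suggest a
simple chain of two-way decisions, sketched here in Python.  This is not
the scoring scheme explained in the next chapter.  The two threshold
constants are hypothetical values read by eye from the channel maxima in
the data tables at the end of this chapter, and the function name is
likewise an assumption.

```python
# Hypothetical thresholds chosen by eye from the tabulated channel maxima;
# they are not values stated in the text.
HORN_CH10_MAX = 0.05     # French horn notes fall below this in channel 10
TRUMPET_CH12_MIN = 0.20  # trumpet notes exceed this in channel 12


def classify(bursts, harmonics, ch10_max, ch12_max):
    """Apply the burst, second-harmonic, and energy criteria in turn.

    bursts    -- True if inharmonic attack transients were detected
    harmonics -- harmonic numbers found in the three channels above
                 the fundamental channel
    """
    if bursts:                              # woodwinds
        return "flute" if 2 in harmonics else "clarinet"
    if ch10_max < HORN_CH10_MAX:            # little high-frequency energy
        return "french horn"
    if ch12_max > TRUMPET_CH12_MIN:         # most energy at high frequencies
        return "trumpet"
    return "trombone"


# Example rows taken from the tables at the end of this chapter.
print(classify(True, [3, 3, 3], 0.1250, 0.0286))   # clarinet, C 262
print(classify(False, [2, 2, 2], 0.0194, 0.0044))  # French horn, C 262
print(classify(False, [2, 2, 3], 0.6825, 0.3457))  # trumpet, C 262
```

Each test places a note into one of two mutually exclusive groups, so the
trombone is identified only by elimination.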


	\subsection{Characteristics suitable for statistical analysis}

		\subsubsection{Risetimes}

The risetimes of instruments differ widely, but risetimes of notes
from the same instrument also vary significantly.  Furthermore, the
risetime is not well defined for several of the notes studied.
Risetimes of a trumpet and a clarinet differ greatly, but
unsophisticated risetime measurements could place a wavering French
horn into either category.  Risetimes have provided an important
criterion in other studies of musical timbre, and they could certainly
assist classification of a group of notes.  Studying many more notes
to determine probabilistic distributions would make the risetime
measurement a more reliable way to identify which instrument produced
a single note.


		\subsubsection{Onset patterns}

The clarinet shows delayed entry of harmonics, while the trumpet gets
energy at all frequencies instantaneously.  Unfortunately, the wide
bandwidth filters allow the fundamental to obscure the rise of some of
the harmonics.  The filter bank outputs for the clarinet note
in Figure 2-3 demonstrate the problem of obscured onset patterns.
Teaching a computer to see one waveform rising inside another presents
more difficulties than merely determining the outer risetime.

		\subsubsection{Preliminary impulses}

Most of the brass notes show preliminary impulses in the filter bank
outputs.  These impulses generate discontinuities in the slope of a
signal's envelope.  The strength and shape of the impulses differ from
one note to the next.


		\subsubsection{Ranges}

Finding the pitch range of a collection of notes from a single
instrument would assist classification of instruments with different
ranges.  In this single octave study, however, ranges of all
instruments are the same except for a few missing high and low notes.



\begin{figure}[p]

\vspace*{7.5in}

	\caption{Clarinet note: C 262 Hz}

\end{figure}

\begin{figure}[p]

\vspace*{7.5in}

	\caption{Flute note: G 392 Hz}

\end{figure}

\begin{figure}[p]

\vspace*{7.5in}

	\caption{French horn note: C 262 Hz}

\end{figure}

\begin{figure}[p]

\vspace*{7.5in}

	\caption{Trombone note: C 262 Hz}

\end{figure}

\begin{figure}[p]

\vspace*{7.5in}

	\caption{Trumpet note: C 262 Hz}

\end{figure}



\clearpage

\begin{table}[p]

	\caption{Woodwind data}

\begin{large}

\centering

\vspace{1em}

 \begin{tabular}{|l||l|l|l|l||l|l|l|}
\hline
\multicolumn{8}{|c|}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{4}{c||}{Clarinet} &
		   \multicolumn{3}{c|}{Flute}\\
\hline
raw note &         C 262 & 	E 330 &		G 392 &		C 523 &
				E 330 & 	G 392 & 	C 523\\
\hline \hline
fundamental &	   262.3 &	326.5 &		395.1 &		524.6 &
				329.9 &		395.1 &		524.6\\
\hline
bursts? &	   yes &	yes &		yes &		yes &
				yes &		yes &		yes\\
\hline
harmonic[1] &	   3 &		1 &		1 &		1 &
				2 &		2 &		2\\
\hline
harmonic[2] &	   3 &		1 &		3 &		3 &
				2 &		2 &		2\\
\hline
harmonic[3] &	   3 &		3 &		4 &		3 &
				2 &		3 &		3\\
\hline
channel 10 max &   0.1250 &	0.1378 &	0.1071 &	0.0925 &
				0.1382 &	0.1619 &	0.727\\
\hline
channel 12 max &   0.0286 &	0.0253 &	0.0220 &	0.0152 &
				0.0352 &	0.0807 &	0.0214\\
\hline \hline
\multicolumn{8}{|c|}{SCORES}\\
\hline
clarinet &	   45 ** &	45 ** &		45 ** &		45 ** &
				25 &		25 &		25\\
\hline
flute & 	   25 &		25 &		25 &		25 &
				30 ** &		30 ** &		30 **\\
\hline
french horn &	   5 &		5 &		5 &		5 &
				10 &		10 &		10\\
\hline
trombone &	   10 &		10 &		10 &		10 &
				15 &		15 &		15\\
\hline
trumpet &	   5 &		5 &		5 &		5 &
				10 &		10 &		10\\
\hline
\end{tabular}

\end{large}

\end{table}




\begin{table}[p]

	\caption{Brass data}

\centering

\vspace{1em}

\begin{tabular}{|l||l|l|l||l|l|l|}
\hline
\multicolumn{7}{|c|}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{3}{c||}{French Horn} &
		   \multicolumn{3}{c|}{Trombone}\\
\hline
raw note &      	C 262 &		E 330 &		G 392 &
	      		C 262 & 	E 330 & 	G 392\\
\hline \hline
fundamental &	   	260.2 &		333.3 &		395.1 &
			262.3 &		329.9 &		400.0\\
\hline
bursts? &	   	no &		no &		no &
			no &		no &		no\\
\hline
harmonic[1] &	   	2 &		1 &		1 &
			2 &		2 &		2\\
\hline
harmonic[2] &		2 &		2 &		2 &
			3 &		2 &		2\\
\hline
harmonic[3] &	   	2 &		3 &		3 &
			3 &		3 &		2\\
\hline
channel 10 max &	0.0194 &	0.0234 &	0.0200 &
			0.1232 &	0.1150 &	0.1155\\
\hline
channel 12 max &	0.0044 &	0.0036 &	0.0052 &
			0.0214 &	0.0204 &	0.0198\\
\hline \hline
\multicolumn{7}{|c|}{SCORES}\\
\hline
clarinet &		5 &		5 &		5 &
			10 &		10 &		10\\
\hline
flute & 		10 &		10 &		10 &
			15 &		15 &		15\\
\hline
french horn &		40 ** &		40 ** &		40 ** &
			20 &		20 &		20\\
\hline
trombone &		20 &		20 &		20 &
			25 ** &		25 ** &		25 **\\
\hline
trumpet &		15 &		15 &		15 &
			20 &		20 &		20\\
\hline
\end{tabular}

\vspace{3em}

\begin{tabular}{|l||l|l|l|l|}
\hline
\multicolumn{5}{|c|}{CHARACTERISTICS}\\
\hline
instrument &	   \multicolumn{4}{c|}{Trumpet}\\
\hline
raw note &         C 262 & 	E 330 & 	G 392 & 	C 523\\
\hline \hline
fundamental &	   260.2 &	329.9 &		390.2 &		524.6\\
\hline
bursts? &	   no & 	no &		no &		no\\
\hline
harmonic[1] &	   2 &		2 &		2 &		2\\
\hline
harmonic[2] &	   2 &		2 &		3 &		3\\
\hline
harmonic[3] &	   3 &		3 &		4 &		4\\
\hline
channel 10 max &   0.6825 &	0.9500 &	0.9612 &	0.7664\\
\hline
channel 12 max &   0.3457 &	0.3864 &	0.3404 &	0.2512\\
\hline \hline
\multicolumn{5}{|c|}{SCORES}\\
\hline
clarinet &	   5 &  	5 &		5 &		5\\
\hline
flute & 	   10 &		10 &		10 &		10\\
\hline
french horn &	   15 &		15 &		15 &		15\\
\hline
trombone &	   20 &		20 &		20 &		20\\
\hline
trumpet &	   40 ** &	40 ** &		40 ** &		40 **\\
\hline
\end{tabular}

\end{table}



