Spectrogram Analysis of Speech Data

Spectrogram Analysis Of Speech Using MATLAB

Abstract-The speech signal of a word is a combination of frequencies that produces characteristic frequency-transition shapes. A spectrogram is a visual representation of the frequency spectrum of a signal as it varies with time. When applied to an audio signal, spectrograms are also referred to as voiceprints, sonographs, or voicegrams. When the data are represented in 3D plots, they are referred to as waterfalls. This work presents the modeling of the spectrogram analysis of speech data and its simulation in MATLAB.
Index Terms-Augmentation, Euler, Spectrogram
Speech is used to relay information from one person to other listeners. Speech recognition involves the conversion of speech data into a form that a computer program can process.
Spectrograms are widely applied in the fields of music, linguistics, sonar, radar, speech processing, and seismology [1,2,3].
The common format is a graph with two geometric dimensions, i.e., one axis represents time while the other represents frequency [4]. A third dimension, the amplitude of a given frequency at a given time, is represented by the intensity or color of each point. This is shown in Fig. 1.

Fig. 1. Variable spectrogram
Spectrograms are created from a time-domain signal in one of two ways: approximated by a filterbank built from a series of band-pass filters, or calculated from the time signal using the Fourier transform.
The spectrogram of a signal is usually depicted as a heat map with varying colors or brightness [5].
The technique used for computing the spectrogram of speech input signals is the Fast Fourier Transform (FFT). The FFT is an efficient algorithm for computing the Discrete Fourier Transform (DFT): compared with direct evaluation of the DFT, which needs on the order of N^2 operations for N samples, the FFT normally requires on the order of N log N operations to calculate the discrete Fourier representation of a digital signal [6].
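The relationship between the DFT and the FFT can be illustrated with a minimal sketch in base MATLAB. The test signal, its length N = 256, and all variable names are assumptions made for illustration; they are not taken from the original work. The sketch evaluates the DFT directly as a matrix product and compares the result with the built-in fft function.

```matlab
% Sketch: direct DFT (about N^2 operations) versus the built-in fft.
N = 256;                                       % number of samples (illustrative)
n = 0:N-1;                                     % sample index
x = cos(2*pi*10*n/N) + 0.5*sin(2*pi*40*n/N);   % illustrative test signal

% Direct DFT: X(k) = sum over n of x(n) * exp(-i*2*pi*k*n/N)
W = exp(-1i*2*pi*(n.')*n/N);                   % N-by-N DFT matrix
X_direct = W * x.';                            % direct evaluation

X_fft = fft(x).';                              % same transform via the FFT

maxDiff   = max(abs(X_direct - X_fft));        % agreement up to rounding error
opsDirect = N^2;                               % rough operation count, direct DFT
opsFFT    = N*log2(N);                         % rough operation count, radix-2 FFT
fprintf('max difference = %.2e, N^2 = %d, N*log2(N) = %d\n', ...
        maxDiff, opsDirect, opsFFT);
```

Both routes give the same coefficients up to rounding error, while the rough operation counts differ by a factor of N/log2(N), which is 32 for this choice of N.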
Linear prediction is a technique employed in the analysis of signals where large amounts of data are involved in processing [7].
The linear predictive technique presumes that each sample of a signal can be approximated as a linear combination of its preceding samples $s(n-k)$, $k = 1, 2, 3, \ldots, p$. The weighting factors are the LP coefficients.
The linear predictor with coefficients $a_k$ can be defined as a system with the output [7]:
$$\hat{s}(n) = \sum_{k=1}^{p} a_k\, s(n-k) \qquad (1)$$
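The predictor in equation (1) can be sketched numerically. The example below is an illustration under stated assumptions, not the method used in the original work: it builds a hypothetical second-order signal from known coefficients (aTrue), then estimates the LP coefficients by least squares from the signal alone. The order p = 2, the coefficient values, and all variable names are chosen only for this sketch.

```matlab
% Sketch: estimate LP coefficients a_k by least squares (base MATLAB only).
p       = 2;                          % prediction order (illustrative)
aTrue   = [1.5; -0.7];                % hypothetical coefficients used to build s
N       = 200;                        % number of samples
impulse = [1, zeros(1, N-1)];         % single excitation pulse

% Synthesize s(n) = 1.5*s(n-1) - 0.7*s(n-2) + impulse(n)
s = filter(1, [1, -aTrue.'], impulse);

% Each row of X holds the p preceding samples [s(n-1), ..., s(n-p)]
X = zeros(N-p, p);
for k = 1:p
    X(:, k) = s(p+1-k : N-k).';
end
y = s(p+1:N).';                       % samples to be predicted

aEst    = X \ y;                      % least-squares LP coefficients
sHat    = X * aEst;                   % predicted samples, as in equation (1)
predErr = max(abs(y - sHat));         % largest prediction error
```

Because the synthetic signal follows the assumed recursion exactly after the first sample, the estimated coefficients match aTrue up to rounding error. For real speech the prediction error would not vanish, and the order p would normally be larger.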
The concept of speech modulation is consistent with Dudley's view of the carrier nature of frequency. The short-term spectrum in the spectrogram $S(\omega, t)$ at the frequency $\omega_0$ is described by a one-dimensional time series $S(\omega_0, t)$.
The Discrete Fourier Transform (DFT) of the logarithm of this time series within a time window $\Delta T$ around a time $t_0$ is expressed as follows [8].
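Spectral analysis involves the decomposition of a signal into a sum of sinusoidal signals such that [7]:
$$x(t) = A_0 + \sum_{k=1}^{N} A_k \cos(2\pi f_k t + \phi_k)$$
where
$A_0$ is a constant,
$A_k$ is the signal amplitude,
$f_k$ is the frequency of the signal,
$\phi_k$ is the signal phase.
For a single sinusoid with zero offset and zero phase, this reduces to
$$x(t) = A \cos(2\pi f_k t)$$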
Applying Euler's formula, equation (3) can be rewritten as
$$\cos(2\pi f_k t) = \frac{e^{i 2\pi f_k t} + e^{-i 2\pi f_k t}}{2}$$
The signal then becomes a sum of complex exponential signals, which decomposes as shown in Fig. 2.

Fig. 2. Fourier transform of cosine function.
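The decomposition of a cosine into two complex exponentials can be checked numerically. The following is a minimal sketch in base MATLAB; the sampling rate of 1 kHz, the 50 Hz cosine, and the variable names are assumptions made for illustration. It verifies the exponential form of the cosine and shows that the normalized DFT has two components of magnitude 0.5, one at the positive and one at the negative frequency.

```matlab
% Sketch: a cosine as the sum of two complex exponentials.
fs = 1000;                      % sampling rate in Hz (illustrative)
N  = 1000;                      % number of samples, i.e. 1 s of signal
t  = (0:N-1)/fs;                % time vector
f0 = 50;                        % cosine frequency in Hz (illustrative)

x_cos = cos(2*pi*f0*t);
x_exp = (exp(1i*2*pi*f0*t) + exp(-1i*2*pi*f0*t))/2;
eulerErr = max(abs(x_cos - x_exp));   % rounding error only

X = fft(x_cos)/N;               % normalized DFT coefficients
posBin = f0 + 1;                % bin of +f0 (1 Hz resolution, 1-based index)
negBin = N - f0 + 1;            % bin of -f0
peakPos = abs(X(posBin));       % close to 0.5
peakNeg = abs(X(negBin));       % close to 0.5
```

With these values the frequency resolution is fs/N = 1 Hz, so the 50 Hz component falls exactly on a DFT bin and no spectral leakage appears.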
The best approximation of the signal is obtained by the summation of the sinusoids.

$$x(t) = \sum_{k=-N}^{N} a_k e^{i 2\pi f_k t} \qquad (6)$$
$a_k$ is a complex number representing the amplitude and phase of the signal.
$f_k$ is the frequency of the signal.
Consider a square signal decomposed into sinusoids. The location, the frequency, and the coefficient $a_k$ of each sinusoid are defined. This process reproduces the form of the original signal [9].

Fig. 3. Sinusoidal and square periodic signals
Through the addition of more sinusoids, the approximation approaches the square periodic wave, as shown in Figs. 3 and 4 [10].

Fig. 4. Different harmonic sinusoidal signals

Fig. 5. Square periodic wave sinusoidal approximation
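The effect of adding more sinusoids can be sketched in base MATLAB. The example assumes a square wave of amplitude 1 and a fundamental of 5 Hz sampled at 10 kHz; these values and the variable names are illustrative only. It uses the standard Fourier series of such a square wave, whose odd harmonics k have amplitude 4/(pi*k), and measures the RMS error of the partial sum for three choices of the highest harmonic.

```matlab
% Sketch: square wave approximated by a growing number of odd harmonics.
fs = 10000;                          % sampling rate in Hz (illustrative)
t  = (0:fs-1)/fs;                    % 1 s of time samples
f0 = 5;                              % fundamental frequency in Hz (illustrative)
target = sign(sin(2*pi*f0*t));       % reference square wave, amplitude 1

maxHarmonic = [1 9 49];              % highest harmonic in each partial sum
rmsErr = zeros(size(maxHarmonic));
for m = 1:numel(maxHarmonic)
    approx = zeros(size(t));
    for k = 1:2:maxHarmonic(m)       % odd harmonics only
        approx = approx + (4/pi) * sin(2*pi*k*f0*t) / k;
    end
    rmsErr(m) = sqrt(mean((target - approx).^2));
end
disp(rmsErr);                        % error shrinks as harmonics are added
```

The RMS error falls as harmonics are added, although the overshoot next to each jump of the square wave (the Gibbs phenomenon) does not disappear.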
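The MATLAB code below was developed for the spectrogram analysis of speech.
t = 0:0.001:3;                    % time vector, 3 s at 1 kHz (assumed from Fig. 6)
y = chirp(t,0,1,150);             % user-defined speech data (linear chirp as a stand-in)
spectrogram(y,128,120,128,1E3);   % assumed: same settings as the second case below
title('Spectrogram of speech');

figure;                           % new figure so the first plot is kept
t = 0:0.001:5;                    % time vector, 5 s at 1 kHz
y = chirp(t,100,1,200,'q');       % user-defined speech data (quadratic chirp)
spectrogram(y,128,120,128,1E3);   % 128-sample window, 120-sample overlap, 128-point FFT, fs = 1 kHz
title('Spectrogram of speech');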

Fig. 6. MATLAB simulation results of spectrogram of speech, Time = 3 sec

Fig. 7. MATLAB simulation results of spectrogram of speech, Time = 5 sec

The multi-layer neural network phase consists of first modifying the last layers of the original Convolutional Neural Network (CNN) by resetting and adapting them, following state-of-the-art techniques. The CNN is then trained to recognize speech using the information in the spectrogram. The trained and fine-tuned CNN is a well-suited tool for speech processing.

Fig. 8. Machine learning model of Spectrogram of speech

Other areas of application of Machine Learning in relation to spectrogram analysis include:
Digitization of sound through deep learning to help solve everyday problems [11].
Generation of audio data using Mel spectrograms in the processing of audio data in Python [12].
Hyper-parameter tuning and data augmentation [13].
End-to-end architecture for the classification of ordinary sounds.
Speech-to-text algorithm and architecture, using CTC loss and decoding for aligning sequences [14-15].
Classification of colorectal cancer tumour tissue in whole-slide images.


Spectral analysis of a periodic signal such as a square wave is sufficient when the frequency content stays the same over time. When the frequency of the signal changes, the spectral analysis of the speech signal differs with the observation time, here from 3 sec to 5 sec. This calls for including time in the spectral analysis of speech, so that the spectrum is evaluated at each point in time, i.e., time-frequency spectral analysis (TFSA).
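The need for a time axis can be illustrated with a minimal short-time Fourier sketch in base MATLAB. The test signal is hypothetical: a 50 Hz tone for the first second followed by a 200 Hz tone for the next second, sampled at 1 kHz. The frame length, hop size, and variable names are likewise assumptions for this sketch, and the Hamming window is written out explicitly so that no toolbox function is needed.

```matlab
% Sketch: frame-by-frame spectra reveal when each frequency occurs.
fs = 1000;                                   % sampling rate in Hz (illustrative)
t  = (0:2*fs-1)/fs;                          % 2 s of time samples
x  = [cos(2*pi*50*t(1:fs)), cos(2*pi*200*t(fs+1:end))];  % 50 Hz, then 200 Hz

L   = 200;                                   % frame length in samples
hop = 100;                                   % hop between frames in samples
w   = 0.54 - 0.46*cos(2*pi*(0:L-1)/(L-1));   % Hamming window, written out
nFrames = floor((numel(x) - L)/hop) + 1;
f = (0:L/2)*fs/L;                            % frequency axis, 5 Hz resolution

S = zeros(numel(f), nFrames);                % magnitude spectrogram
for m = 1:nFrames
    seg = x((m-1)*hop + (1:L)) .* w;         % windowed frame
    Xf  = fft(seg);
    S(:, m) = abs(Xf(1:L/2+1)).';            % keep non-negative frequencies
end

[~, idx] = max(S, [], 1);                    % strongest bin in each frame
peakFreq = f(idx);                           % dominant frequency over time

wholeSpec = abs(fft(x));                     % one spectrum of the whole signal
```

The single spectrum wholeSpec contains both tones but gives no indication of when each occurs, whereas peakFreq moves from 50 Hz in the first frames to 200 Hz in the last frames.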

Spectrogram analysis of speech data can be achieved through the application of Neural Networks (NN) in coordination with current state-of-the-art technologies. The multi-layer structure of the NN allows speech data to be synthesized accurately.


R. Steinberg and D. O’Shaughnessy, “Segmentation of speech spectrogram using mathematical morphology,” ICASSP, pp. 4, 2008.
N. Pieplow, “A Brief History of Spectrograms,” 2009. Retrieved 10/12/2012, from http://earbirding.com/blog/archives/1229
V.W. Zue and L.F. Lamel, “An expert spectrogram reader: a knowledge-based approach to speech recognition,” ICASSP vol. 11, pp. 3, 1986.

O.S. Douglas, Speech Communication: Human and Machine. Addison-Wesley Publishing, 1990.
B. Pinkowski, “Multiscale Fourier descriptors for classifying semivowels in spectrograms,” Pattern Recognition vol. 30, (no. 26), pp. 9, 1993.
R. Steinberg and D. O’Shaughnessy, “Segmentation of speech spectrogram using mathematical morphology,” in Proc. Acoustics, Speech and Signal Processing, ICASSP, IEEE, 2008.
R.C. Gonzalez and R.E. Woods, Digital Image Processing: Addison-Wesley Publishing Company, Inc., 1992, pp. 413-481.
A.K. Jain, Fundamentals of Digital Image Processing: Prentice-Hall, 1989.
W.K. Pratt, Digital Image Processing: John Wiley and Sons Ltd, 1991.
W. Niblack, An Introduction to Digital Image Processing: Prentice Hall 1986.
G.S. Cox, “Template Matching and Measures of Match in Image processing,” 1995.
A. Mallawaarachchi, S.H. Ong, M. Chitre, and E. Taylor, “Spectrogram denoising and automated extraction of the fundamental frequency variation of dolphin whistles,” J Acoust Soc Am, vol. 124, (no. 2), pp. 1159-70, Aug 2008.
A.V. Oppenheim, R.W. Schafer, and J.A. Buck, Discrete Time Signal Processing: Upper Saddle River, N.J., Prentice Hall, 1999.
W.B. Hussein, “Spectrogram Enhancement By Edge Detection Approach Applied To Bioacoustics Calls Classification,” Signal & Image Processing : An International Journal, vol. 3, (no. 2), pp. 1-20, 2012.

B. Owsinski, The Recording Engineer's Handbook: Thomson Course Technology PTR, 1989.