Acoustic Scene Classification using Multi Feature Extraction and Convolutional Neural Networks with Multi Task Learning
Ali Haider (17-MS-CSc-41), Muhammad Saqib (17-MS-CSc-35)
University of Engineering and Technology, Taxila

Abstract:
Acoustic scenes are collections of varied sound events obtained from different sources. The content of an acoustic scene shows wide variation in both the time and frequency domains. Convolutional neural networks (CNNs) offer an effective technique for extracting spatial information from multidimensional data such as audio, video, and images, and they can learn hierarchical representations from the time-domain and frequency-domain features of audio signals. Previous work addressed acoustic scene classification with convolutional neural networks and multi-scale, multi-feature extraction approaches. We propose a Deep Neural Network (DNN) with a Multi-Task Learning (MTL) framework combined with a multi-feature extraction technique. The DNN with MTL approach shows significantly better performance than a plain CNN in detection, classification, and recognition. We conduct experiments on the TUT Acoustic Scenes 2016 dataset. Experimental results show that DNN with MTL and multi-feature extraction significantly improves system performance, reaching an accuracy of 85.9%.
Keywords: DNN, MTL, CNN.

Introduction:
Acoustic scene classification (ASC) is the task of classifying sounds produced in different environments over a period of time.
Classification of acoustic scenes would be valuable in many fields, such as security surveillance and context-aware services, if a machine could automatically determine the category of its environment from audio recordings of acoustic events [1].
ASC is a challenging problem that has been studied for many years. Because of the heterogeneity of audio scenes and the acoustic properties shared between them, detecting features that discriminate between different scene types is difficult. For example, sounds such as bird song and rustling leaves are common to both the forest path and park scene classes [2].
Many approaches have been proposed for acoustic scene classification (ASC), including feature extraction algorithms and machine learning algorithms for modeling acoustic scenes. Features used include spectrogram image features (SIF) [3], log-frequency filter banks, Mel-frequency cepstral coefficients (MFCC), time-domain temporal features, frequency-domain spectral features, and combined time-frequency features. Several studies used the Gaussian Mixture Model (GMM) and Support Vector Machine (SVM) machine learning algorithms to model environmental sounds. More recently, deep learning approaches have gained attention, and researchers now apply deep learning methods such as DNNs and CNNs to the acoustic scene classification task. The Convolutional Neural Network (CNN) is one of the most effective modeling methods.
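As a concrete illustration of the features mentioned above, the following is a minimal numpy-only sketch of computing log-mel filter-bank energies and MFCCs from a mono signal. The frame size, hop size, and numbers of mel bands and coefficients are illustrative assumptions, not the values used in the experiments:

```python
import numpy as np

def log_mel_and_mfcc(signal, sr=16000, n_fft=512, hop=256, n_mels=40, n_mfcc=13):
    """Compute log-mel filter-bank energies and MFCCs for a mono signal."""
    # Frame the signal and apply a Hann window
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        frames.append(signal[start:start + n_fft] * np.hanning(n_fft))
    spec = np.abs(np.fft.rfft(np.array(frames), axis=1)) ** 2  # power spectrum

    # Triangular mel filter bank
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fbank[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[m - 1, k] = (r - k) / max(r - c, 1)

    log_mel = np.log(spec @ fbank.T + 1e-10)          # (frames, n_mels)
    # MFCC = DCT-II of the log-mel energies
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_mfcc), (2 * n + 1)) / (2 * n_mels))
    mfcc = log_mel @ dct.T                            # (frames, n_mfcc)
    return log_mel, mfcc
```

Both feature matrices are time-by-coefficient arrays, so they can be stacked or fed to a CNN as two-dimensional inputs.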
Recently, Deep Neural Networks (DNNs) with Multi-Task Learning (MTL) have been used to perform classification, detection, prediction, and recognition of acoustic scenes. MTL learns multiple related tasks simultaneously, and it offers regularization and optimization benefits that improve system performance. Various applications benefit from the MTL approach, including multilingual character recognition and semantic classification. In this paper, we propose a multi-task representation learning framework based on CNNs for acoustic scene classification (ASC): a Multi Feature Extraction and Convolutional Neural Network (CNN) with Multi Task Learning (MTL) framework.
Literature review:
In [4], a novel framework for acoustic scene classification is presented that characterizes both sound textures and sound events, emulating the psychoacoustic process of auditory scene cognition. Effective acoustic features are employed for labeling sound textures and events, and the two channels of information are integrated according to their importance for scene classification. The framework achieved strong results in evaluation on real data.

In [5], a method is presented that recognizes the audio context of everyday environments. Each audio context is represented by a histogram of audio events, which are identified by a supervised classifier. In the training stage, each context is modeled with a histogram estimated from annotated training data. In the testing stage, individual sound events in an unknown recording are detected and a histogram of their occurrences is constructed. Context recognition is performed by computing the cosine distance between this histogram and the event histograms of all contexts in the training database. Term frequency-inverse document frequency (TF-IDF) weighting is applied to control the influence of different events on the histogram distance.
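The matching step described above can be sketched in a few lines; the event counts and context labels below are invented toy data, and the exact TF-IDF variant used in [5] may differ:

```python
import numpy as np

def tfidf_weights(context_histograms):
    """IDF weight per event: log(N / number of contexts containing the event)."""
    H = np.array(context_histograms, dtype=float)   # (contexts, events)
    df = np.count_nonzero(H > 0, axis=0)            # document frequency per event
    return np.log(len(H) / np.maximum(df, 1))

def cosine_distance(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def recognize_context(test_hist, context_histograms, labels):
    """Return the label of the training context whose weighted histogram is closest."""
    w = tfidf_weights(context_histograms)
    dists = [cosine_distance(np.asarray(test_hist) * w, np.asarray(h) * w)
             for h in context_histograms]
    return labels[int(np.argmin(dists))]
```

Events that occur in every context receive zero IDF weight, so only discriminative events influence the distance.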

In [6], a system is proposed that performs acoustic scene classification (ASC) automatically using deep learning. The scheme is based on an architecture adopted from the computer vision field: an ASC system built on a convolutional neural network (CNN), called the self-determination CNN (SD-CNN). The training process of this deep neural network improves data separability. Additionally, the SD approach is applied to the Network-in-Network (NIN) architecture to build an SD-NIN-CNN. The results show that the SD-NIN-CNN system achieves the best performance over previous works.

In [7], a novel neural network framework is proposed for commercial smart devices with microphones to identify acoustic contextual information. The approach observes that an acoustic signal has stronger local connectivity on the time axis than on the frequency axis. Experimental results show that the proposed method outperforms two conventional approaches, Gaussian Mixture Models (GMMs) and the Multi-Layer Perceptron (MLP), by 8.6% and 7.8% respectively in overall accuracy.
In [8], a TDCNN architecture is proposed that preserves full connectivity among frequency-related nodes while restricting time-related nodes to local connectivity. The DCASE 2017 development dataset was used for evaluation. The proposed method outperforms the two conventional methods, GMMs and MLP. The conclusion is that the TDCNN system is well suited to the ASC task because local connectivity on the time axis is properly taken into account.
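The filter shape implied by this idea can be sketched as follows: the kernel spans the full frequency axis but only a few frames on the time axis. The spectrogram and kernel sizes here are illustrative assumptions, not the dimensions used in [8]:

```python
import numpy as np

# Hypothetical sizes: 40 mel bands x 100 time frames
n_freq, n_time = 40, 100
kernel_t = 5                      # local receptive field along time only

x = np.random.randn(n_freq, n_time)      # input time-frequency representation
w = np.random.randn(n_freq, kernel_t)    # full-frequency, time-local filter

# Valid convolution: the output varies along time only, because the filter
# is fully connected across frequency
out = np.array([np.sum(x[:, t:t + kernel_t] * w)
                for t in range(n_time - kernel_t + 1)])
```

The resulting output is one-dimensional along time, reflecting that frequency structure is summarized entirely inside each filter.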

In [9], a system is proposed for the ASC problem. The structure combines two key ideas: a simple CNN model with two convolutional layers and two max-pooling layers, and the combination of MFCC and log-mel filter bank features computed over several window sizes. Overall, the model beats the baseline provided by the organizers of the DCASE 2016 challenge, achieving an average accuracy of 85.9% compared to the baseline of 77.2%.

In [10], a Convolutional Neural Network with a Multi-Task Learning architecture is presented to classify 15 acoustic scenes. The scenes are grouped by environment type in order to cast the scene classification problem into a multi-task learning framework. The proposed method benefits from the regularization effect of the cross-task knowledge-sharing layers and from the optimization effect of the within-task knowledge-sharing layers. The results show that the CNN with MTL framework outperforms the baseline CNN, and the advantage is more significant when scene classes are grouped by environment type.

In [11], a report is presented on the state of the art in automatically classifying audio scenes and detecting and classifying audio events. The paper surveys previous work as well as the systems submitted to the challenge by various research groups. It details the organization of the challenge, offering experience useful to those planning challenges in similar domains. New audio datasets and baseline systems created for the challenge are publicly available under open licenses, supporting further research.

Proposed Methodology:

CONVOLUTIONAL NEURAL NETWORKS (CNN)
Convolutional Neural Networks (CNNs) have shown compelling efficiency in many classification fields, including image classification, multivariate time series classification, and acoustic scene classification. In essence, a CNN acts as a feature extractor in which a final softmax layer gives the result. The convolutional layer is the main building block and uses the ReLU (rectified linear unit) activation function, as shown in the following equations:

y = W ∗ x + b
h = ReLU(y) (1)
where ∗ is the convolution operator. In this paper, we build the standard CNN within an MTL framework. The conventional CNN is formed by stacking three convolution blocks, each with its own filter size. The convolution layers perform multi-feature extraction, and every convolution layer is followed by a max-pooling layer. Finally, a softmax layer produces the scene classification output.
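A minimal single-channel sketch of one such convolution block (convolution as in Eq. (1), then ReLU, then max pooling) might look as follows; the input and filter sizes are illustrative assumptions:

```python
import numpy as np

def conv2d(x, w, b):
    """Valid 2-D convolution (single channel, cross-correlation form): y = w * x + b."""
    kh, kw = w.shape
    H, W = x.shape
    y = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(y.shape[0]):
        for j in range(y.shape[1]):
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return y

def relu(y):
    return np.maximum(y, 0.0)

def maxpool2(h):
    """2x2 max pooling with stride 2 (truncates odd borders)."""
    H, W = h.shape
    H2, W2 = H // 2, W // 2
    return h[:H2 * 2, :W2 * 2].reshape(H2, 2, W2, 2).max(axis=(1, 3))

# One convolution block: conv -> ReLU -> max pooling
x = np.random.randn(8, 8)     # toy time-frequency patch
w = np.random.randn(3, 3)     # toy filter
out = maxpool2(relu(conv2d(x, w, b=0.1)))
```

Stacking three such blocks with different filter sizes and ending in a softmax layer gives the conventional CNN described above.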
CNN with Multi-Task Learning
Multi-task learning learns multiple related classification/prediction tasks simultaneously while sharing information across the tasks. The MTL framework has two parts. In the first part, knowledge is shared and learned jointly across many tasks. In the second part, knowledge is learned individually for each specific task; that is, common knowledge is learned in the first part and then adapted to a specific task in the second. Because common knowledge is learned jointly across different tasks, the MTL framework generalizes better than learning separable tasks independently, and the modeling process shows an improved optimization effect when the generalized knowledge is adapted to the specific task.
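Structurally, the two parts described above amount to a shared trunk followed by task-specific heads. The following is a forward-pass sketch with random weights; the layer sizes, the two task names, and their class counts are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Part one: a layer whose weights are learned jointly across all tasks
W_shared = rng.standard_normal((16, 32))

# Part two: task-specific heads, e.g. one per environment group
heads = {
    "indoor":  rng.standard_normal((32, 4)),   # 4 indoor scene classes (assumed)
    "outdoor": rng.standard_normal((32, 5)),   # 5 outdoor scene classes (assumed)
}

def forward(features, task):
    """Shared representation, then the head for the requested task."""
    h = np.maximum(features @ W_shared, 0.0)   # shared ReLU layer
    return softmax(h @ heads[task])

x = rng.standard_normal(16)
p_indoor = forward(x, "indoor")
```

During training, gradients from every task would update W_shared (the regularizing, jointly learned part), while each head is updated only by its own task.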
Experimental Results:
Experiments are conducted to evaluate the effectiveness of casting scene classification into an MTL framework based on CNNs. There are two experiments: multi-feature extraction with a CNN, and scene classification with a CNN with Multi-Task Learning. For the CNN with MTL, there are two experimental groups: environment-based scene grouping and randomly selected scene grouping.
We used the TUT Acoustic Scenes 2016 dataset, which consists of 15 different acoustic scenes: beach, bus, cafe/restaurant, car, city center, forest path, grocery store, home, library, metro station, office, urban park, residential area, train, and tram. The dataset contains 13 hours of audio and is divided into training and test sets. We extract features using the multi-feature CNN and classify audio using the CNN with multi-task learning. In the training set, every scene class has a total duration of 39 minutes; in the test set, 13 minutes. Each audio segment is 10 seconds long. We consider that acoustic scenes are associated with certain environment types. For example, library, office, cafe/restaurant, and home are indoor acoustic scenes, while car, metro station, bus, and train are outdoor acoustic scenes related to transport.
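One way to express such an environment-based grouping of the 15 scene classes is a simple lookup table. The exact assignment below is an illustrative assumption, not necessarily the grouping used in the experiments:

```python
# Hypothetical grouping of the 15 scene classes into environment-based tasks
SCENE_GROUPS = {
    "indoor": ["grocery store", "home", "library", "office", "cafe/restaurant"],
    "outdoor": ["beach", "city center", "forest path", "urban park",
                "residential area"],
    "transport": ["bus", "train", "tram", "metro station", "car"],
}

def group_of(scene):
    """Map a scene label to its environment group."""
    for group, scenes in SCENE_GROUPS.items():
        if scene in scenes:
            return group
    raise KeyError(scene)
```

In the MTL setup, each group can then be treated as one task that shares knowledge with the others through the common layers.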

The confusion matrix shows the results obtained by the proposed method on the evaluation dataset.

Conclusion:
We proposed a multi-feature extraction technique using a CNN for feature extraction, together with a CNN for acoustic scene classification; both extraction and classification are performed within the system. Experimental results show that the use of DNN with MTL and multi-feature extraction methods significantly improves system performance. Our proposed method achieves a high accuracy of 85.9%.
References:
1. Hertel, L., Phan, H., and Mertins, A., "Classifying variable-length audio files with all-convolutional networks and masked global pooling," arXiv preprint arXiv:1607.02857, 2016.
2. Mulimani, M. and Koolagudi, S. G., "Acoustic scene classification using MFCC and MP features," in Detection and Classification of Acoustic Scenes and Events 2016, Budapest, Hungary, 2016.
3. Zhang, H., McLoughlin, I., and Song, Y., "Robust sound event recognition using convolutional neural networks," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 559-563.
4. Ye, J., Kobayashi, T., Murakawa, M., and Higuchi, T., "Acoustic scene classification based on sound textures and events," in Proceedings of the 23rd ACM International Conference on Multimedia, pp. 1291-1294, ACM, 2015.
5. Heittola, T., Mesaros, A., Eronen, A., and Virtanen, T., "Audio context recognition using audio event histograms," in 18th European Signal Processing Conference (EUSIPCO), pp. 1272-1276, IEEE, 2010.
6. Wang, C.-Y., Santoso, A., and Wang, J.-C., "Acoustic scene classification using self-determination convolutional neural network," in Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), IEEE, 2017.
7. Lee, Y., Park, S., and Ko, H., "A time delay convolutional neural network for acoustic scene classification," in IEEE International Conference on Consumer Electronics (ICCE), IEEE, 2018.
8. Lee, Y., Park, S., and Ko, H., "A time delay convolutional neural network for acoustic scene classification," in IEEE International Conference on Consumer Electronics (ICCE), IEEE, 2018.
9. Dang, A., Vu, T. H., and Wang, J.-C., "Acoustic scene classification using convolutional neural networks and multi-scale multi-feature extraction," in IEEE International Conference on Consumer Electronics (ICCE), IEEE, 2018.
10. Nwe, T. L., Dat, T. H., and Ma, B., "Convolutional neural network with multi-task learning scheme for acoustic scene classification," 2017.
11. Giannoulis, D., et al., "Detection and classification of acoustic scenes and events: An IEEE AASP challenge," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, 2013.