DeepEar: Robust Smartphone Audio Sensing in Unconstrained Acoustic Environments using Deep Learning

Nicholas D. Lane, Petko Georgiev, Lorena Qendro
Bell Labs, University of Cambridge, University of Bologna

ABSTRACT
Microphones are remarkably powerful sensors of human behavior and context. However, audio sensing is highly susceptible to wild fluctuations in accuracy when used in the diverse acoustic environments (such as bedrooms, vehicles, or cafes) that users encounter on a daily basis. Towards addressing this challenge, we turn to the field of deep learning, an area of machine learning that has radically changed related audio modeling domains like speech recognition. In this paper, we present DeepEar, the first mobile audio sensing framework built from coupled Deep Neural Networks (DNNs) that simultaneously perform common audio sensing tasks. We train DeepEar with a large-scale dataset including unlabeled data from 168 place visits. The resulting learned model, involving 2.3M parameters, enables DeepEar to significantly increase inference robustness to background noise beyond conventional approaches present in mobile devices. Finally, we show DeepEar is feasible for smartphones by building a cloud-free DSP-based prototype that runs continuously, using only 6% of the smartphone's battery daily.

Author Keywords
Mobile Sensing, Deep Learning, Audio Sensing

ACM Classification Keywords
H.5.2 User/Machine Systems; I.5 Pattern Recognition

INTRODUCTION
Advances in audio-based computational models of behavior and context continue to broaden the range of inferences available to mobile users [26]. Through the microphone it is possible to infer, for example: daily activities (e.g., eating [9], coughing [49], driving [58]), internal user states (e.g., stress [55], emotion [64]), and ambient conditions (e.g., the number of nearby people [78]).
Audio sensing has evolved into a key building block for various novel mobile applications that enable users to monitor and improve their health and wellbeing [63], productivity [73, 50], and environment [61, 19]. However, despite its progress, audio sensing is plagued by the challenge of diverse acoustic environments. Mobile applications deployed in the real world must make accurate inferences regardless of where they are used. But this is problematic because each environment (such as the gym, the office, or a train station) contains its own mixture of background noises that often confuse the audio sensing classifiers in use today. Places are filled with noises that can overlap the sound targeted for classification, or may contain confounding noises that sound similar but are not tied to the same target event or activity of interest. Locations can even alter the acoustic characteristics of sounds (e.g., a user's voice) due to the materials used for furniture or decoration. For such reasons, the accuracy of audio sensing often falls, or is otherwise unpredictable, when performed in a range of different places. In recent years, a new direction in the modeling of data has emerged, known as deep learning [25].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. UbiComp '15, September 7-11, 2015, Osaka, Japan. Copyright © 2015 ACM.
Through a series of new learning architectures and algorithms, domains such as object recognition [46] and machine translation [, 11] have been transformed; deep learning methods are now the state-of-the-art in many of these areas. In particular, deep learning has been the driving force behind large leaps in accuracy and model robustness in audio-related domains like speech recognition []. One of the key ideas behind this progress is representational learning through the use of large-scale datasets. This allows models to stop relying on hand-crafted (often sensor-specific) or generic features; instead, robust representations of targeted inference categories are automatically learned from both labeled and unlabeled data. These representations are captured in a dense interconnected network of units, in which each unit contributes a relatively simple function parameterized by the data.

Deep learning has the potential to broadly impact the fields of activity recognition and the modeling of user behavior and context; early explorations of this potential are already underway [47, 35]. In this work, we examine a specific aspect of this larger puzzle, namely: Can deep learning assist audio sensing in coping with unconstrained acoustic environments? The outcome of our investigation is DeepEar, an audio sensing framework for mobile devices that is designed using deep learning principles and algorithms. At the heart of DeepEar are four coupled 5-layer, 1024-unit Deep Neural Networks (DNNs), each responsible for a separate type of audio inference (viz. ambient audio scene analysis, speaker identification, emotion recognition, and stress detection). Every DNN is parameterized using a large-scale audio dataset (12 hours) composed of both conventional labeled data and unlabeled data gathered from 168 place visits.
By applying a mixed condition approach to data pre-processing, we further synthesize additional labeled data by combining examples of audio categories with background noise of varying intensity. To utilize this dataset, we adopt state-of-the-art deep learning algorithms during pre-training and fine-tuning phases. As a result, unlike most existing mobile audio sensing frameworks, DeepEar is able to exploit even unlabeled audio segments. Collectively, the stages of this framework represent a rethinking of how mobile audio sensing is performed. To achieve its primary goal of robustness to different acoustic environments, it rejects manually-selected features designed for specific audio inferences. Only simple frequency-domain information is presented to DeepEar at training time. Instead, representations of the audio data for each inference category are learned within the 3,300 units used in the framework.

We experimentally validate the design of DeepEar in two ways. First, we compare the model accuracy and robustness of DeepEar, under unconstrained environments, against existing audio sensing systems designed for mobile devices [54, 64, 55, 72]; each system is selected because it is purpose-designed to provide one or more of the inferences supported by our framework. Second, we measure the energy and latency of a prototype DeepEar system designed for modern phone hardware (i.e., a programmable DSP 1). Our findings show DeepEar can cope with significant acoustic diversity while also being feasible for use on standard phones. In summary, this paper makes the following contributions:

Deep Learning for Audio-based Sensing of Behavior and Context. Our design of DeepEar represents the first time computational models with a deep architecture have been developed to infer a broad set of human behavior and context from audio streams.
By integrating techniques including unsupervised pre-training, our model is able to utilize the large amounts of unlabeled data that are readily collected by mobile systems. DeepEar is an important step towards understanding how deep approaches to modeling sensor data can benefit activity and context recognition.

Large-scale Study of Acoustic Environment Diversity. We quantify the challenge of performing audio sensing in diverse environments using real-world audio datasets spanning four common audio sensing tasks, along with audio captured from 168 place visits. There are two key findings. First, conventional modeling approaches for mobile devices suffer dramatic fluctuations in accuracy when used in a set of common everyday places. Second, DeepEar, when compared to these state-of-the-art mobile audio sensing techniques, not only offers higher average accuracy across all four tested tasks but also has a much tighter range of accuracy as the acoustic environment changes.

Low-energy Smartphone Prototype. To show DeepEar is suitable for mobile devices, we implement the framework using off-the-shelf smartphone hardware. By directly utilizing the DSP present in most smartphones sold today, we demonstrate DeepEar can perform continuous sensing with acceptable levels of energy and latency. For example, with modest reductions in model complexity, DeepEar uses only 6% of the phone's battery per day of continuous use; this comes at the expense of only a 3% drop in accuracy (on average) relative to a model built without regard to mobile resource concerns.

1 Digital Signal Processor

Figure 1: Current Practice in Mobile Audio Sensing (Audio Stream -> Hand-Engineered Features -> GMM Inference -> HMM Smoothing)

STATE-OF-THE-ART IN MOBILE AUDIO SENSING
Mobile audio sensing has been an intensely active area of interest, with many techniques and end-to-end systems developed [49, 9, 78, 50, 61, 31, 56, 65].
Although all would acknowledge the difficulty of coping with diverse acoustic conditions, most study other challenges, for example developing methods to recognize new inference categories. As a result, they often do not explicitly measure the variability of accuracy across many background conditions. We assume, however, that they will struggle due to the lack of compensating techniques, coupled with the fundamental nature of the problem, evidenced, for example, by decades of speech recognition research towards noise resistance [51, 22, 41, 68, 59].

Coping with Diverse Acoustic Environments. Due to the severity of the problem, a number of approaches for coping with unconstrained acoustic environments have been developed specifically within the mobile sensing community. One popular technique is to adapt the underlying model to accommodate the changes in distributions and characteristics of audio categories under new surroundings [57, 55]. For example, [55] proposes a method using Maximum a Posteriori (MAP) estimation to adjust model parameters to new conditions. [57], in contrast, uses a semi-supervised procedure to recruit new labeled data compatible with the new environment, enabling model retraining. In a related approach, [12, 62, 74] all propose mechanisms to crowdsource labels, a side-effect of which is the ability to model a variety of environments. However, these methods are general and do not specifically consider the nuances of modeling audio. Similarly, semi-supervised learning and automated model adaptation are difficult to control, as the model can degenerate when exposed to uncontrolled data, and there can be few opportunities to tune performance. In the broader speech and signal processing community, a much wider diversity of techniques exists, for example: a variety of model adaptation techniques (e.g., [29, 22, 59]); approaches built upon projections into low-dimensional subspaces robust to noise [68, 41]; and even some based on source separation [32].
Because of the maturity of this community (in comparison to mobile sensing), and the fact that it is less bound by mobile resource limitations, these techniques for handling background noise are typically even more robust than those used within mobile systems. In fact, most of these approaches are yet to appear in mobile prototypes. But as we discuss in the next section, for many audio tasks the state-of-the-art is migrating towards the use of deep learning. Consequently, in this work we do not compare deep learning techniques to the latest offline server-side shallow learning audio algorithms (i.e., those outside of deep learning). Instead, we focus on the core question of whether deep approaches to audio modeling are beneficial and feasible for mobile systems.

Figure 2: Deep Neural Network Training and Inference Stages (Greedy Pre-Training -> Fine-tuning (backpropagation) -> Classification (forward propagation))

Current Practice in Mobile Audio Sensing. Largely due to the limitations of current mobile solutions to the problem of unconstrained acoustic environments, such solutions are rarely used in practice. Figure 1 sketches the de facto standard stages of audio processing found in the majority of mobile sensing applications today. As highlighted in the next section, current practice in audio sensing is radically different from the design of DeepEar. In the figure, the process begins with the segmentation of raw audio in preparation for feature extraction. The representation of the data, even for very different types of audio inferences, is often surprisingly similar. Generally, banks of either Perceptual Linear Prediction (PLP) [38] or Mel Frequency Cepstral Coefficient (MFCC) [28] features are used, as they are fairly effective across many audio modeling scenarios. Features are tuned (e.g., the number of coefficients, the length of the audio frames) depending on the type of sensing task. Additional task-specific features can also be added; for example, [55] incorporates features like TEO-CB-AutoEnv 2.
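For concreteness, frame-level frequency-domain features of the kind this pipeline consumes might be computed as in the sketch below. This is a minimal illustration with made-up frame and filter counts, not code from any cited system; a real MFCC front-end would additionally use mel-spaced triangular filters and a DCT step.

```python
# Illustrative sketch: split audio into short frames and reduce each frame
# to log filter-bank energies, a common precursor of MFCC features.
# Frame length, hop, and band count are arbitrary example values.
import numpy as np

def log_filterbank(audio, frame_len=400, hop=160, n_filters=24):
    frames = [audio[i:i + frame_len]
              for i in range(0, len(audio) - frame_len + 1, hop)]
    feats = []
    for f in frames:
        # Windowed magnitude spectrum of one frame.
        spectrum = np.abs(np.fft.rfft(f * np.hamming(frame_len)))
        # Crude linearly spaced band pooling (mel spacing omitted for brevity).
        bands = np.array_split(spectrum, n_filters)
        feats.append(np.log([b.sum() + 1e-10 for b in bands]))
    return np.array(feats)  # shape: (n_frames, n_filters)

audio = np.sin(2 * np.pi * 300 * np.arange(8000) / 8000)  # toy 1 s signal
features = log_filterbank(audio)
```

Each row of the result is one frame's feature vector, which the downstream classifier (GMM, or a DNN input layer) consumes independently.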
The modeling of these features is sometimes performed by Decision Trees (especially in early systems) or Support Vector Machines (SVMs) (both detailed in [14]), but by far the most popular technique is the Gaussian Mixture Model (GMM) [30]. This is in line with their decades of dominance in the speech recognition domain; only recently have they been replaced by deep learning methods. As with features, model tuning also takes place; for instance, the number of mixture components is selected. Finally, a secondary model (e.g., a Hidden Markov Model [14]) is sometimes used, though largely to smooth the transitions of classes towards more likely sequences of real-life events rather than to truly model the data.

AUDIO APPLICATIONS OF DEEP LEARNING
As already discussed, it is in the domain of audio, and more specifically speech recognition, that deep learning has had some of its largest impact. For example, in 2012, by adopting deep learning methods, Google decreased speech recognition error in Android devices by 30% [24]. Such success has spawned a rich set of audio-focused deep learning techniques and algorithms [23, 33, 10]. However, many are not directly applicable to mobile audio sensing due to their focus on speech tasks; thus they leverage speech-specific elements, for instance words and phonemes, which do not cleanly translate into the inference categories targeted by audio sensing.

Deep Neural Networks for Audio Modeling. Figure 2 shows the core phases of audio modeling under a Deep Neural Network (see [25] for more). In describing these phases, we intend to contrast their fundamental differences with conventional audio sensing methods (see Figure 1), as well as to provide a brief primer on core concepts.

2 Teager Energy Operator Critical Band Autocorrelation Envelope, which has been shown to be a discriminator of stress in voicing frames
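As a concrete baseline for that contrast, the conventional GMM classification stage of Figure 1 can be sketched as below: one mixture model per inference class, with a frame assigned to the class whose model gives it the highest likelihood. The features, class names, and parameter choices here are synthetic placeholders, not those of any cited system.

```python
# Minimal sketch of per-class GMM classification (the "GMM Inference"
# box of Figure 1), using synthetic feature vectors in place of real
# MFCC/PLP frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
train = {  # per-class matrices of 13-dim feature frames (synthetic)
    "music": rng.normal(0.0, 1.0, (200, 13)),
    "voice": rng.normal(3.0, 1.0, (200, 13)),
}
# Fit one GMM per class on that class's training frames.
models = {c: GaussianMixture(n_components=4, random_state=0).fit(X)
          for c, X in train.items()}

def classify(frame):
    # Label a frame by the class with the highest per-frame log-likelihood.
    scores = {c: m.score_samples(frame[None, :])[0] for c, m in models.items()}
    return max(scores, key=scores.get)

print(classify(np.full(13, 3.0)))  # -> voice
```

An HMM smoothing stage, when present, would then re-score sequences of these per-frame labels towards likelier real-life event transitions.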
While there are a variety of deep learning algorithms for audio, DNNs and the related techniques we now describe (which are also adopted in DeepEar) are some of the most widely used. The architecture of a DNN is composed of a series of fully-connected layers; each layer in turn contains a number of units that assume a scalar state based primarily on the state of all units in the immediately prior layer. The state of the first layer (the input layer) is initialized by raw data (e.g., audio frames). The last layer (the output layer) contains units that correspond to inference classes; for example, a category of sound like music. All layers in between these two are hidden layers; these play the critical role of collectively transforming the state of the input layer (raw data) into an inference.

Inference is performed with a DNN using a feed-forward algorithm that operates on each audio frame separately. Initially, the state of each input layer unit is set by a representation (e.g., frequency banks or even raw values) of the audio samples in the frame. Next, the algorithm updates the state of all subsequent layers on a unit-by-unit basis. Each unit has an activation function and additional parameters that specify how its state is calculated based on the units in the prior layer (see the next section for more). This process terminates once all units in the output layer are updated. The inferred class corresponds to the output layer unit with the largest state.

To train a DNN (i.e., to tune the activation function and parameters of each unit) two techniques are applied. The first of these stages is unsupervised, with the aim of enabling the network to produce synthetic output with the same characteristics and distributions as real input data (i.e., it is generative). This process, referred to as pre-training, allows unlabeled data to be leveraged during model training.
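The feed-forward inference described above can be sketched in a few lines: each layer applies an activation to a weighted sum of the previous layer's states, and the inferred class is the index of the largest output unit. The layer sizes, random weights, and ReLU-style activation below are illustrative placeholders, not DeepEar's actual architecture or parameters.

```python
# Sketch of DNN feed-forward inference on a single audio frame.
import numpy as np

def feed_forward(frame, weights, biases):
    state = frame
    for W, b in zip(weights[:-1], biases[:-1]):
        state = np.maximum(0.0, W @ state + b)  # hidden layers (ReLU-style)
    logits = weights[-1] @ state + biases[-1]   # output layer
    return int(np.argmax(logits))               # unit with largest state wins

rng = np.random.default_rng(0)
sizes = [40, 64, 64, 4]  # input features -> two hidden layers -> 4 classes
weights = [rng.normal(0, 0.1, (n, m)) for m, n in zip(sizes, sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
predicted_class = feed_forward(rng.standard_normal(40), weights, biases)
```

In a trained network the weights and biases would come from the pre-training and fine-tuning stages described next, rather than from a random generator.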
The next stage of training is called fine-tuning and is based on backpropagation algorithms that adjust the activation functions initialized by pre-training. This supervised process optimizes parameters globally throughout the network by minimizing a loss function defined by the disagreement between ground-truth labels (which set the output layer) and the network inferences given the existing activation functions. In most cases large-scale unlabeled data is crucial because of the difficulty of globally optimizing all network parameters using backpropagation alone. For decades, building deep networks with many units and hidden layers was impractical; it was not until the discovery that greedy layer-by-layer pre-training vastly simplifies backpropagation that deep learning became possible [42].

Figure 3: DeepEar Model Training Process (Unlabeled Audio from Multiple Environments -> Unsupervised Pre-Training -> Initialized DNN Parameters; Labeled Audio Data -> Mixed Condition Synthesis -> Multi-Environment Labeled Data; both -> Supervised Fine-Tuning -> Trained DNN)

Deep versus Shallow Learning. Understanding why deep learning models are able to outperform alternatives has been an area of considerable study [13, 27, 45, ]. Within such analysis, shallow learning is defined to include SVMs, GMMs, Decision Trees, single-layer Neural Networks, and other commonly used models; the essential characteristic being that they incorporate nonlinear feature transformations of only one or at most two layers, in contrast to models like a DNN. Considerable emphasis is assigned to the benefits of representational learning, as well as to the way deep learning moves towards the closer integration of features and classifiers [45, 27]. It is the ability to learn features from the data, and a rejection of manual feature engineering, that is seen as the foundation of the advantage of deep learning.
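A toy illustration of this supervised objective follows. It is a minimal sketch, not DeepEar's training code: for brevity it tunes only a single softmax output layer on one synthetic example, whereas real fine-tuning backpropagates the same gradient rule through every hidden layer of the network.

```python
# Toy fine-tuning step: minimize the cross-entropy disagreement between
# a ground-truth label and the network's output by gradient descent.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (3, 5))  # 5 input features -> 3 classes
x = np.ones(5)                   # one fixed toy input "frame"
y = 2                            # its ground-truth class index

for _ in range(100):             # repeated gradient descent steps
    p = softmax(W @ x)
    grad = np.outer(p - np.eye(3)[y], x)  # dLoss/dW for softmax + CE
    W -= 0.5 * grad

final_loss = -np.log(softmax(W @ x)[y])  # shrinks towards zero
```

Pre-training gives the full network good initial weights, so that this loss surface becomes tractable for backpropagation across many layers.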
Through studies of deep learning applied to images [25] (e.g., face recognition), this behavior can be observed directly through individual layers. Findings show deep models learn hierarchies of increasingly more complex concepts (e.g., detectors of eyes or ears), beginning first with simple low-level features (e.g., recognizers of round head-like shapes). Studies also find that heavily pre-processing raw data (through manual feature engineering) actually leads to large amounts of information being discarded that, under deep learning, is instead utilized []. More theoretical work [13] has found shallow architectures like GMMs are inefficient in representing important aspects of audio, and this causes them to require much more training data than DNNs before these aspects are captured. Other results [] similarly identify manifolds on which key audio discriminators exist but that are resistant to being learned within GMMs, a problem that is not present in deep models.

Emerging Low Complexity and Hybrid DNNs. Extending from the success in speech, work is underway on a broader set of audio tasks that utilize deep methods. [36], for ex