Assessing positive matrix factorization model fit: a new method to estimate uncertainty and bias in factor contributions at the measurement time scale

Atmos. Chem. Phys., 9, 2009
Author(s). This work is distributed under the Creative Commons Attribution 3.0 License.
Atmospheric Chemistry and Physics

J. G. Hemann (1), G. L. Brinkman (2), S. J. Dutton (2), M. P. Hannigan (2), J. B. Milford (2), and S. L. Miller (2)
(1) Department of Applied Mathematics, University of Colorado, Boulder, USA
(2) Department of Mechanical Engineering, University of Colorado, Boulder, USA

Received: 3 January 2008
Published in Atmos. Chem. Phys. Discuss.: 14 February 2008
Revised: 4 December 2008
Accepted: 5 December 2008
Published: 22 January 2009

Abstract. A Positive Matrix Factorization receptor model for aerosol pollution source apportionment was fit to a synthetic dataset simulating one year of daily measurements of ambient PM2.5 concentrations, comprising 39 chemical species from nine pollutant sources. A novel method was developed to estimate model fit uncertainty and bias at the daily time scale, as related to factor contributions. A circular block bootstrap is used to create replicate datasets, and the same receptor model is then fit to each replicate. Neural networks are trained to classify factors based upon chemical profiles, as opposed to correlating contribution time series, and this classification is used to align factor orderings across the model results associated with the replicate datasets. Factor contribution uncertainty is assessed from the distribution of results associated with each factor. Bias is assessed by comparing modeled factors with the input factors used to create the synthetic data. The results indicate that variability in factor contribution estimates does not necessarily encompass model error: contribution estimates can have small associated variability across results yet also be very biased. These findings are likely dependent on characteristics of the data.
Correspondence to: J. G. Hemann

1 Introduction

Air pollution composed of particulate matter smaller than 2.5 µm in aerodynamic diameter (PM2.5) has been associated with a significantly increased risk of morbidity and mortality (Dockery et al., 1993; Pope et al., 2002; Peel et al., 2005). Existing regulations have focused on average and peak PM2.5 concentrations (µg m⁻³). To help policy makers design more targeted and cost-effective approaches to protecting public health and welfare, an understanding of the association between PM2.5 sources and morbidity and/or mortality needs to be developed.

The Denver Aerosol Sources & Health study (DASH) has been undertaken to understand the sources of PM2.5 that are detrimental to human health. PM2.5 filter samples are collected daily from a centrally located site in Denver, CO. Speciated PM2.5 is quantified, including sulfate, nitrate, bulk elemental and organic carbon, trace metals, and trace organic compounds. These speciated PM2.5 data are used as input to a receptor model, Positive Matrix Factorization (PMF), for pollution source apportionment. The PMF model fit yields characterizations of pollution sources, known as factors, with respect to their contributions to total measured PM2.5, as well as their chemical profiles. Ultimately, an association will be explored between the individual factor contributions and short-term, adverse health effects, including daily mortality, daily hospitalizations for cardiovascular and respiratory conditions, and measures of poor asthma. For example, historical records of daily hospitalizations due to respiratory problems might be regressed against the daily concentrations of PM2.5 pollution from diesel fuel combustion (as estimated by PMF) over the same time span.
Having measures of uncertainty associated with the contribution of diesel fuel combustion to PM2.5, at the daily time scale, may lead to more reliable characterization of the role diesel fuel combustion has in daily health effects data.

PMF is a factor analytic method developed by Paatero and Tapper (1994) that has been widely used for pollution source apportionment modeling (Anderson et al., 2001; Kim and Hopke, 2007; Larsen and Baker, 2003; Lee et al., 1999; Polissar et al., 1998; Ramadan et al., 2000). The objective of this paper is to present a novel method that has been developed to quantify uncertainty and bias in a PMF source apportionment model as it is applied to speciated PM2.5 data.

Published by Copernicus Publications on behalf of the European Geosciences Union.

Uncertainty in a PMF solution exists at a number of levels and is important to quantify, especially if the solutions will inform environmental and health policy decisions. Uncertainty can stem from the data and from the PMF model itself. With respect to the data, uncertainty in the solution is imparted through measurement error as well as random sampling error. For the PMF model, there is generally rotational ambiguity in the solutions (i.e. solutions are not unique); further, solutions based upon the same data can vary depending upon how the model parameters are set. Past studies have considered these aspects, primarily by using the statistical method of the bootstrap to analyze model fit results. For example, Heidam (1987) considered the uncertainty in factor profiles due to receptor model uncertainty by varying the model parameters in models fit to bootstrapped datasets.
The Environmental Protection Agency's Office of Research and Development distributes two software products, EPA PMF 1.1 (Eberly, 2005) and EPA Unmix 6.0 (Norris et al., 2007), which incorporate the bootstrap to analyze receptor model fit results. The software can be used to assess uncertainty in factor profile estimates and has been used by studies such as Chen et al. (2007) and Olson et al. (2007) to characterize sources of PM2.5. Few studies, however, have addressed uncertainty in factor contribution estimates. Two examples are Nitta et al. (1994) and Lewis et al. (2003), though the estimates come from different source apportionment models and pertain to average contribution variability.

The method presented in this paper estimates, at the measurement time scale, bias and variability due to random sampling error in factor contribution estimates. Replicate datasets are created using a circular block bootstrap, and the subsequent application of two novel techniques makes such estimation possible. First, neural networks are used for matching factors across PMF results on those data. Second, the measurements resampled across the replicate datasets are tracked within the PMF solutions. This discussion describes the method in the context of application to a synthetic PM2.5 dataset, which was designed to simulate DASH data, fit by the PMF model. Using synthetic data allows assessment of model fit and also provides a way to validate the method itself.

2 Methodology

Presented here is a method of assessing uncertainty in source apportionment model results using two different measures: bias and variability due to random sampling error. The method goes beyond computing these measures in terms of average values and gives estimates at the measurement time scale. A synthetic time series of daily PM2.5 measurements is used in which the concentrations of chemical species are derived from published source profiles and source contributions consistent with the Denver area.
The solution from applying PMF can be compared with known profiles and contributions, allowing estimates of bias to be computed. A circular block bootstrap generates additional data by resampling, with replacement, from the original synthetic measurement series. Each new dataset, or replicate, is again fit by the PMF model to apportion the PM2.5 mass to factors.

The first novel aspect pertains to how factors are sorted between solutions. For each solution the factors should correspond to the same real-world pollution sources. The factors need to be aligned such that factor k in each solution always refers to the same factor. To accomplish this factor alignment, or matching, the standard approach has been to use scalar metrics like linear correlation to match a factor from one solution to the closest factor in another solution. This is the approach taken by the EPA PMF 1.1 software, where it is specifically the time series of factor contributions that are matched between solutions. In contrast, the present work takes the novel approach of using Multilayer Feed Forward Neural Networks (NN), trained to perform pattern recognition, to align factors between PMF solutions. Further, using the intuitive notion that pollution sources are characterized best by the chemical species they emit, the matching is based on the factors' profiles rather than their contribution time series. The NN approach is a robust factor matching technique: it avoids the sensitivity to outliers that is problematic when using measures such as linear correlation and replaces it with a method that is capable of capturing linear as well as non-linear relationships.

The second novel aspect in the method presented here is the tracking of the measurement days resampled in each bootstrapped dataset. Through this bookkeeping it is possible to arrive at a collection of PMF results for each factor's contribution on each day.
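The resampling and bookkeeping steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the block length of 7 days, the number of replicates, the function names, and the noisy stand-in used in place of a real PMF fit are all assumptions made for illustration, and factor alignment is assumed to have been done already.

```python
import numpy as np

def circular_block_bootstrap_indices(n_days, block_len, rng):
    """Return n_days resampled day indices built from wrap-around blocks."""
    n_blocks = -(-n_days // block_len)  # ceiling division
    starts = rng.integers(0, n_days, size=n_blocks)
    # Each block covers block_len consecutive days, wrapping past the last day.
    idx = (starts[:, None] + np.arange(block_len)[None, :]) % n_days
    return idx.ravel()[:n_days]

def collect_daily_contributions(G_replicates, index_sets, n_days):
    """Group replicate contributions by the original day they were drawn from.

    Row r of G_replicates[b] is the fitted contribution vector for the
    measurement taken on original day index_sets[b][r].
    """
    per_day = [[] for _ in range(n_days)]
    for G_b, idx_b in zip(G_replicates, index_sets):
        for row, day in zip(G_b, idx_b):
            per_day[day].append(row)
    return per_day

# Toy demonstration with hypothetical data (no real PMF fit is performed).
rng = np.random.default_rng(42)
n_days, p, n_rep = 365, 9, 50
G_true = rng.gamma(2.0, 1.0, size=(n_days, p))

index_sets, G_replicates = [], []
for _ in range(n_rep):
    idx = circular_block_bootstrap_indices(n_days, block_len=7, rng=rng)
    # A real workflow would fit PMF to the resampled data and align factors;
    # here the "fit" is just the true contributions plus noise.
    G_fit = G_true[idx] + rng.normal(0.0, 0.1, size=(n_days, p))
    index_sets.append(idx)
    G_replicates.append(G_fit)

per_day = collect_daily_contributions(G_replicates, index_sets, n_days)
# Per-day mean and spread; days never resampled are left as NaN.
daily_mean = np.array([np.mean(d, axis=0) if d else np.full(p, np.nan) for d in per_day])
daily_std = np.array([np.std(d, axis=0) if d else np.full(p, np.nan) for d in per_day])
```

Because each replicate stores the original day index of every resampled row, a given day accumulates one contribution vector each time it is drawn, which is what allows per-day summaries rather than only averages over the whole series.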
Accordingly, descriptive statistics can be computed for each factor contribution on each day.

2.1 Positive Matrix Factorization

PM2.5 pollution is typically composed of dozens of chemical species emitted from multiple sources. The concentration of each species may be treated as a random variable observed over time. The statistical technique of factor analysis can be used to explain the variability in these observations as linear combinations of some unknown subset of the sources, called factors. In traditional factor analysis approaches, including Principal Components Analysis, the variance-covariance matrix of the observations is used in an eigen-analysis to find the factors that explain most of the variability observed. The uncertainty in the observations, for all variables, is assumed to be independent and normally distributed. These assumptions are often not valid in the context of air pollution measurement data. In contrast, PMF, a receptor-based source apportionment model, offers an alternative technique that is based upon a least squares method, and measurement uncertainties can be specific to each observation, correlated, and non-normal in distribution. Further, the factors resultant from PMF need not be orthogonal, which is an important quality when trying to associate modeled factors to real-world pollution sources that can be highly temporally correlated but are nonetheless important to characterize separately (e.g. diesel versus gasoline fuel combustion).

Given a matrix of observed PM2.5 concentrations, $X$, PMF attempts to solve

$$X = GF + E \tag{1}$$

by finding the matrices $G$ and $F$ that recover $X$ most closely, with all elements of $G$ and $F$ strictly non-negative. $G$ is the matrix of factor contributions (or scores in traditional factor analysis terminology), where $G_{ik}$ is the concentration that factor $k$ contributed to the total PM2.5 observed in sample $i$.
$F$ is the matrix of factor profiles (or loadings), where $F_{kj}$ is the fraction at which species $j$ makes up factor $k$. Finally, $E$ is the matrix of residuals defined by

$$E_{ij} = X_{ij} - \sum_{k=1}^{p} G_{ik} F_{kj} \tag{2}$$

$G$ and $F$ are found through an alternating least squares algorithm that minimizes the sum of the normalized, squared residuals, $Q$:

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{m} \left( \frac{E_{ij}}{S_{ij}} \right)^{2} \tag{3}$$

where $E_{ij}$ is weighted by $S_{ij}$, the uncertainty associated with the measurement of the $j$th pollutant species in the $i$th sample. The ability to weight specific observations with specific uncertainties allows PMF to handle data that include heterogeneous measurement uncertainty, outliers, values below measurement detection limits, and missing values. As such, PMF can often yield better results than traditional factor analysis methods (Huang et al., 1999).

An algorithm for implementing PMF is available as a commercial software library, PMF2 (Paatero, 1997). The work presented here uses PMF2 version 4.2, and specifically, the pmf2wopt executable file (Paatero, 2007). PMF2 has numerous optimization parameters that can be set by the user, and methods of choosing these values have been published elsewhere (Paatero, 2000; Paatero et al., 2002, 2005). Since the focus of this paper is on a method of assessing uncertainty and bias in PMF solutions, the discussion of fine-tuning the numerous algorithm parameters is kept to a minimum.

Two PMF2 parameters are especially important to the PMF model fit and deserve mention. First, the number of factors in the model, $p$, must be set by the user. In the present work, eight and nine factor solutions are considered, with the primary focus on the results for the nine factor solutions. The other important parameter is FPEAK, which controls the rotational freedom of the possible solutions. It is advised that FPEAK values range between −1 and 1, with positive values causing extremes in the $F$ matrix (values near 0 or 1) and negative values causing extremes in the $G$ matrix.
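As a small illustration of Eqs. (2) and (3), the following Python sketch evaluates the residual matrix and the objective Q for given G, F, and S. It only evaluates the objective; it is not the PMF2 minimization algorithm, and the function name and toy matrices are assumptions made for this example.

```python
import numpy as np

def pmf_objective(X, G, F, S):
    """Evaluate Q = sum_ij (E_ij / S_ij)^2 with residuals E = X - G F."""
    E = X - G @ F  # Eq. (2) written as a matrix product
    return float(np.sum((E / S) ** 2))  # Eq. (3)

# Toy example: 2 samples, 2 factors, 2 species (hypothetical values).
G = np.array([[1.0, 2.0],
              [0.5, 0.0]])
F = np.array([[0.2, 0.8],
              [0.6, 0.4]])
X = G @ F                      # data reproduced exactly by the factors
S = np.full(X.shape, 0.1)      # constant uncertainty, for illustration only

q_exact = pmf_objective(X, G, F, S)

X_off = X.copy()
X_off[0, 0] += 0.2             # one entry misfit by two uncertainty units
q_off = pmf_objective(X_off, G, F, S)
```

Dividing each residual by its own uncertainty means that a misfit on a poorly measured value (large S) is penalized less than the same misfit on a well measured one, which is how the weighting described above enters the fit.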
In the present work, FPEAK is zero for all PMF2 solutions, which corresponds to the default setting.

2.2 Synthetic data

Given that the results of pollution source apportionment models may ultimately be used as critical components of environmental policy and regulatory decisions, it is especially important to assess their quality. One approach for evaluating receptor models is the use of synthetic data, defined as simulated PM2.5 measurements rather than actual observations (Willis, 2000). Predefined sources are used, along with their respective contributions and profiles, to create the $G$ and $F$ matrices in Eq. (2). With $G$ and $F$ defined, $X$ can be calculated directly and given as input (along with uncertainty estimates) to the PMF2 software, where the resultant $G$ and $F$ matrices can then be compared with the actual values to assess model fit.

The method of creating synthetic datasets followed in this paper is described in detail in Brinkman et al. (2006) and Vedal et al. (2007). Briefly, nine pollutant sources were used (Table 1), which contributed concentrations of 39 chemical species (Table 3) over 365 synthetic sampling days. The synthetic measurements were assumed to come from a single receptor site. Table 1 also lists the references used to generate the annual contributions, chemical profile, temporal patterns and variability for each source. Table 2 shows the lag-zero cross-correlations between the source contributions. With respect to PMF modeling, the relatively high cross-correlations between some of the input source contribution time series imply that some of these sources may be harder to separate cleanly from others.
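The construction of noise-free synthetic data and the lag-zero cross-correlations can be sketched in a few lines of Python. The contribution and profile matrices below are random placeholders, not the published Denver source contributions or profiles; only the dimensions (365 days, nine sources, 39 species) come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days, n_sources, n_species = 365, 9, 39

# Hypothetical stand-ins for the source contributions and profiles.
G = rng.gamma(shape=2.0, scale=1.0, size=(n_days, n_sources))   # contributions
F = rng.dirichlet(np.ones(n_species), size=n_sources)           # rows sum to 1

# Noise-free synthetic concentrations: each day is a mix of source profiles.
X = G @ F                                  # shape (365, 39)

# Lag-zero cross-correlations between source contribution time series,
# i.e. the kind of quantity tabulated in Table 2.
R = np.corrcoef(G, rowvar=False)           # shape (9, 9)
```

Large off-diagonal entries of R flag pairs of sources whose contributions rise and fall together, which are the pairs a factorization is most likely to have difficulty separating.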
Distinct time series for the contributions from each source were generated by starting with average contribution estimates from preliminary DASH studies and the Northern Front Range Air Quality Study (Watson et al., 1998), then adding day-to-day variations reflecting both random variability and hypothesized weekly or seasonal patterns, as appropriate. Daily totals for the nine source contributions were normalized to match actual daily PM2.5 levels observed in Denver in [year missing in source]. It should be noted that the presence of additional sources, such as secondary organic aerosols, could complicate application of PMF to observed data.

The matrix of data uncertainties, $S$ from Eq. (3), is computed as follows. Measurement detection limits, detection limit uncertainty, and measurement uncertainty associated with typical analytical techniques used to speciate PM2.5 filter samples (Ion Chromatography, Thermal Optical Transmission, and Gas Chromatography/Mass Spectrometry) were incorporated into the PMF input via

$$S_{ij} = \sqrt{(\alpha_j X_{ij})^{2} + (\beta_j D_j)^{2}} \tag{4}$$

Table 1. Synthetic PM2.5 sources (source: references).
Secondary Ammonium Sulfate: Lough (2004)
Secondary Ammonium Nitrate: Lough (2004)
Gasoline Vehicles: Watson et al. (1998); Chinkin et al. (2003); Cadle et al. (1999); Hildeman et al. (1991); Rogge et al. (1993a)
Diesel Vehicles: Watson et al. (1998); Chinkin et al. (2003); Hildeman et al. (1991); Rogge et al. (1993a); Schauer (1998)
Paved Road Dust: Watson et al. (1998); Chinkin et al. (2003); Hildeman et al. (1991); Rogge et al. (1993b)
Wood Combustion: Watson et al. (1998); Fine et al. (2004)
Meat Cooking: Watson et al. (1998); Schauer et al. (1999)
Natural Gas Combustion: Hildeman et al. (1991); Hannigan (1997); Rogge et al. (1993d)
Vegetative Detritus: Hildeman et al. (1991); Hannigan (1997); Rogge et al. (1993c)

Table 2. Source contribution cross-correlations, Lag=0.
Rows and columns: Ammonium Sulfate, Ammonium Nitrate, Gasoline Vehicles, Diesel Vehicles, Paved Road Dust, Wood Combustion, Meat Cooking, Natural Gas, Vegetative Detritus (correlation values missing in source).

In Eq. (4), for species $j$, $\alpha_j$ is the measurement uncertainty, $\beta_j$ is the detection limit uncertainty, and $D_j$ is the detection limit. Table 3 contains the $\alpha$, $\beta$ and $D$ associated with each species. The $S_{ij}$ uncertainties were incorporated into the final data matrix with the following formula

$$X'_{ij} = X_{ij} + S_{ij} Z_{ij} \tag{5}$$

where $X'_{ij}$ is the perturbed concentration and $Z_{ij}$ is a random number drawn from a standard normal distribution. If $X'_{ij}$ was less than the detection limit associated with measuring species $j$, then a value of one-half the detection limit was substituted in the final data matrix.

2.3 The bootstrap

The bootstrap is a computationally intensive method for estimating the distribution of a statistic, the statistic itself being an estimator of some parameter of interest (Efron, 1979). The essence of the method is to create replicate data by resampling, with replacement, from the original observations of a random variable. For each replicate dataset the statistic of interest is computed, and the distribution of these values serves as an estimate for the random sampling distribution of the statistic. The properties of this distribution are then used to make inferences about the parameter of interest.

In the present context, each pollutant species time series represents realizations of a random variable. The $F$ and $G$ matrices resulting from PMF's fitting of these data are functions of these random variables; thus, each element of those matrices may be considered a statistic. Previous studies using PMF have focused on analyzing the $F$ matrix, the matrix of factor profiles.
This discussion takes a different tack, with the statistic of interest being each element of the $G$ matrix, the matrix of factor contributions over time.

Dependent data considerations

Much of bootstrap theory is based upon the assumption that the data consist of observations of independent and identically distributed (iid) random variables. Time series data, however, are typically serially correlated. Singh (1981) showed that the bootstrap can be inconsistent in estimating the distribution of statistics based upon dependent data. Since then, numerous modifications of the original iid bootstrap have been formulated to better handle dependent data

Table 3. Synthetic PM2.5 species, measurement detection limits (D, ng/m³), measurement errors (α, %), and detection limit uncertainties (β, %).
Species, in order from species 1 (numeric columns missing in source): Elemental Carbon, Organic carbon, Nitrate, Sulfate, Ammonium, n-tricosane, n-tetracosane, n-pentacosane, n-hexacosane, n-heptacosane, n-octacosane, n-nonacosane, n-triacontane, n-hentriacontane, n-dotriacontane, n-tritriacontane, n-tetratriacontane, Oleic acid, n-