Outlier Detection and Treatment for the Conversion of Chemical Oxygen Demand to Total Organic Carbon

  • Cho, Beom Jun (Marine Environments & Conservation Research Division, Korea Institute of Ocean Science & Technology) ;
  • Cho, Hong Yeon (Marine Environments & Conservation Research Division, Korea Institute of Ocean Science & Technology) ;
  • Kim, Sung (Marine Ecosystem Research Division, Korea Institute of Ocean Science & Technology)
  • Received : 2014.06.26
  • Accepted : 2014.08.22
  • Published : 2014.08.30

Abstract

Total organic carbon (TOC) is an important indicator used as a direct biological index in marine carbon cycle research. Because available TOC data are scarce relative to chemical oxygen demand (COD) data, sufficient TOC estimates can be produced from the COD data. Outlier detection and treatment (removal) should be carried out reasonably and objectively, because outliers directly affect the COD-TOC conversion equation and hence the TOC estimates. This study aims to suggest an optimal regression model using the available salinity, COD, and TOC data observed in the Korean coastal zone. The optimal regression model is selected by applying diverse methods for detecting outliers and influential observations and comparing the number of data points before and after removal, the coefficients of variation, and the root mean square (RMS) error. The results show that a diagnostic case combining the SIQR (semi-interquartile range) boxplot and Cook's distance is the most suitable for outlier detection. The optimal regression function is estimated as $\mathrm{TOC} = 0.44 \cdot \mathrm{COD} + 1.53$ (TOC and COD in mg/L), with a coefficient of determination of 0.47 and an RMS error of 0.85 mg/L. The RMS error and the coefficient of variation of the leverage values are greatly reduced, to 31% and 80%, respectively, of their values before outlier removal. The suggested method can provide a more appropriate regression curve because it removes the excessive influence of the outliers frequently included in COD and TOC monitoring data.
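
The two-stage screening described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the fence multiplier, the Cook's distance cutoff, or which variables the boxplot is applied to, so the multiplier `k=3.0` (one common form of the SIQR boxplot rule), the `4/n` cutoff, the application of the fences to both COD and TOC, and all function names are assumptions made here for illustration. Salinity is omitted, and the data are synthetic.

```python
import numpy as np

def siqr_fences(x, k=3.0):
    """SIQR boxplot fences: each side uses its own semi-interquartile range.
    k=3.0 is an assumed multiplier, not a value taken from the paper."""
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    return q1 - k * (q2 - q1), q3 + k * (q3 - q2)

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X, beta

def cooks_distance(x, y):
    """Cook's distance of every observation in a simple linear regression."""
    X, beta = fit_line(x, y)
    n, p = X.shape
    resid = y - X @ beta
    # Leverage values: diagonal of the hat matrix H = X (X'X)^-1 X'
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)
    s2 = resid @ resid / (n - p)          # residual variance estimate
    return resid**2 / (p * s2) * h / (1.0 - h) ** 2

def clean_and_fit(cod, toc, k=3.0):
    """Stage 1: SIQR fences on COD and TOC. Stage 2: Cook's distance.
    Returns slope, intercept, RMS error, and the number of points kept."""
    cod = np.asarray(cod, dtype=float)
    toc = np.asarray(toc, dtype=float)
    lo_c, hi_c = siqr_fences(cod, k)
    lo_t, hi_t = siqr_fences(toc, k)
    keep = (cod >= lo_c) & (cod <= hi_c) & (toc >= lo_t) & (toc <= hi_t)
    cod, toc = cod[keep], toc[keep]
    # Assumed rule of thumb: flag points whose Cook's distance exceeds 4/n
    influential = cooks_distance(cod, toc) > 4.0 / len(cod)
    cod, toc = cod[~influential], toc[~influential]
    X, beta = fit_line(cod, toc)
    rmse = float(np.sqrt(np.mean((toc - X @ beta) ** 2)))
    return float(beta[1]), float(beta[0]), rmse, len(cod)

# Synthetic demonstration data (not the monitoring data used in the study)
rng = np.random.default_rng(0)
cod = rng.uniform(0.5, 6.0, 200)
toc = 0.44 * cod + 1.53 + rng.normal(0.0, 0.3, 200)
toc[:5] = 15.0                            # inject a few gross outliers
slope, intercept, rmse, n_kept = clean_and_fit(cod, toc)
print(f"TOC = {slope:.2f} * COD + {intercept:.2f}, RMSE = {rmse:.2f}, n = {n_kept}")
```

Running the boxplot screen first removes gross outliers before the regression is fitted, so that the Cook's distance values in the second stage are not themselves distorted by those points.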
