Comparison of machine learning algorithms for Chl-a prediction in the middle of Nakdong River (focusing on water quality and quantity factors)

  • Lee, Sang-Min (Department of Environmental Engineering, Pukyong National University) ;
  • Park, Kyeong-Deok (Department of Marine Design Convergence Engineering, Pukyong National University) ;
  • Kim, Il-Kyu (Department of Environmental Engineering, Pukyong National University)
  • Received : 2020.07.08
  • Accepted : 2020.08.14
  • Published : 2020.08.15

Abstract

In this study, we applied machine learning algorithms to predict chlorophyll-a (Chl-a), an indicator of algae, using water quality and quantity data from the middle Nakdong River area. First, we analyzed the correlation between Chl-a and the water quality and quantity data. From these data we extracted the ten most important factors for the two weirs, and the algorithms predicted how these ten factors affected Chl-a occurrence. Four algorithms were run in Python: decision tree, random forest, elastic net, and gradient boosting. The root mean square error (RMSE) was used to evaluate the algorithms. Gradient boosting showed excellent results at both sites, with an RMSE of 10.55 at the Gangjeonggoryeong (GG) site and 11.43 at the Dalsung (DS) site. The predictions of the four algorithms were also evaluated using the receiver operating characteristic (ROC) curve and the area under the curve (AUC). The AUC was 0.877 at the GG site and 0.951 at the DS site, so the algorithm's ability to interpret the data appeared to be excellent.
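The two evaluation measures named in the abstract can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the authors' code. The sample values, the 25.0 cutoff used to turn Chl-a into a binary "occurrence" label, and the function names `rmse` and `auc` are all assumptions made for the example. The abstract does not state how the ROC labels were defined.

```python
import math


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def auc(labels, scores):
    """Area under the ROC curve, computed by pairwise comparison.

    This equals the probability that a randomly chosen positive case gets
    a higher score than a randomly chosen negative case (ties count 0.5).
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical values for illustration only (not data from the study).
observed_chl_a = [12.0, 30.0, 8.0, 45.0, 20.0, 60.0]
predicted_chl_a = [15.0, 26.0, 10.0, 40.0, 27.0, 50.0]

# Assumed cutoff separating "occurrence" from "no occurrence".
threshold = 25.0
labels = [1 if value >= threshold else 0 for value in observed_chl_a]

print("RMSE:", round(rmse(observed_chl_a, predicted_chl_a), 2))
print("AUC:", round(auc(labels, predicted_chl_a), 3))
```

A lower RMSE means the predicted concentrations lie closer to the observations, while an AUC near 1 means the predictions rank high-Chl-a samples above low-Chl-a samples. The two measures answer different questions, which is why an evaluation may report both.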
