Statistical data preparation: management of missing values and outliers

  • Kwak, Sang Kyu (Department of Medical Statistics, School of Medicine, Catholic University of Daegu)
  • Kim, Jong Hae (Department of Anesthesiology and Pain Medicine, School of Medicine, Catholic University of Daegu)
  • Received : 2017.05.11
  • Accepted : 2017.06.20
  • Published : 2017.08.01

Abstract

Missing values and outliers are frequently encountered during data collection. Missing values reduce the amount of data available for analysis, which compromises the statistical power of the study and, eventually, the reliability of its results. They can also introduce significant bias into the results and reduce the efficiency of the data. Outliers distort the estimation of statistics (e.g., the mean and standard deviation of a sample), resulting in overestimated or underestimated values. The results of data analysis therefore depend considerably on how missing values and outliers are processed. This review discusses the types of missing values, ways of identifying outliers, and methods of dealing with both.
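
As a rough illustration of the point about outliers, the sketch below compares the mean, sample standard deviation, and median of a small data set with and without one extreme value. It is not taken from the article: the measurements, the value 100 used as the outlier, and the helper name `summarize` are all hypothetical, and only Python's standard `statistics` module is assumed.

```python
import statistics


def summarize(values):
    """Return (mean, sample standard deviation, median) of a list of numbers."""
    return (
        statistics.mean(values),
        statistics.stdev(values),
        statistics.median(values),
    )


# Hypothetical measurements; the appended value of 100 plays the role of an outlier.
clean = [10, 12, 11, 13, 12, 11, 12]
with_outlier = clean + [100]

clean_mean, clean_sd, clean_median = summarize(clean)
out_mean, out_sd, out_median = summarize(with_outlier)

print(f"without outlier: mean={clean_mean:.2f} sd={clean_sd:.2f} median={clean_median}")
print(f"with outlier:    mean={out_mean:.2f} sd={out_sd:.2f} median={out_median}")
```

In this made-up sample, the single extreme value pulls the mean and standard deviation upward while the median stays where it was, which is one reason the handling of outliers can change the conclusions of an analysis.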
