Enhancing Dependability of Systems by Exploiting Storage Class Memory

  • 김효진 (Department of Computer Engineering, Hongik University)
  • 노삼혁 (Department of Information and Computer Engineering, Hongik University)
  • Published: 2010.02.15

Abstract

In this paper, we adopt Storage Class Memory (SCM), a next-generation non-volatile RAM technology, as part of main memory alongside DRAM, and exploit the resulting SCM+DRAM main memory system to improve dependability. Our system provides instant system on/off without bootstrapping, dynamic selection of persistence or non-persistence for processes, and, through these, fast recovery from power and software failures. Unlike checkpointing, our approach does not incur heavy overhead or recovery delay. Furthermore, because the system is fully transparent to applications, it can provide persistence to general applications and is easy to apply in real-world environments. As a proof of concept, we implemented the system on the commodity Linux 2.6.21 kernel. Our experiments verify that persistence-enabled processes resume execution instantly after a system off-on cycle without any loss of state or data. We therefore conclude that our system can improve availability and reliability.
