Study of Scheduling Optimization through the Batch Job Logs Analysis

  • Yoon, JunWeon (Department of Supercomputing Center, KISTI)
  • Song, Ui-Sung (Department of Computer Education, Busan National University of Education)
  • Received : 2017.10.18
  • Accepted : 2017.11.25
  • Published : 2017.11.30

Abstract

A batch job scheduler recognizes the computational resources configured in a cluster environment and places jobs on them efficiently and in order. To use the cluster's limited resources efficiently, the characteristics of user jobs must be analyzed and reflected in the scheduling policy, which in turn requires understanding the available scheduling algorithms and applying them in a way that suits the system environment. Most scheduler software tracks the resource inventory and system status of all managed resources, as well as each user's job execution environment from submission to termination. It also stores information related to job execution, such as job scripts, environment variables, libraries, and the wait, start, and end times of each job. In this paper, we analyze the scheduler's execution logs, covering users' job success rates, execution times, and resource sizes, to identify problems. The results can serve as a basis for optimizing the system by increasing resource utilization.
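The kind of log analysis described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method or any scheduler's actual accounting format: the `JobRecord` fields, the `summarize` function, the sample data, and the convention that an exit status of 0 means success are all assumptions made for illustration. The sketch groups finished jobs by user and reports success rate, mean wait time, mean run time, and mean resource size.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class JobRecord:
    """One finished batch job. All field names are hypothetical."""
    user: str
    submit_time: int  # seconds since an arbitrary epoch
    start_time: int
    end_time: int
    slots: int        # number of CPU slots requested (resource size)
    exit_status: int  # assumption: 0 means the job succeeded


def summarize(jobs):
    """Return per-user success rate, mean wait, mean run time, and mean slots."""
    by_user = {}
    for job in jobs:
        by_user.setdefault(job.user, []).append(job)

    summary = {}
    for user, records in by_user.items():
        summary[user] = {
            "jobs": len(records),
            "success_rate": sum(r.exit_status == 0 for r in records) / len(records),
            "mean_wait_s": mean(r.start_time - r.submit_time for r in records),
            "mean_run_s": mean(r.end_time - r.start_time for r in records),
            "mean_slots": mean(r.slots for r in records),
        }
    return summary


# Made-up sample records, only to show the shape of the input.
sample_jobs = [
    JobRecord("alice", 0, 60, 660, 16, 0),
    JobRecord("alice", 100, 400, 500, 32, 1),
    JobRecord("bob", 50, 50, 3650, 64, 0),
]
report = summarize(sample_jobs)
for user, stats in sorted(report.items()):
    print(user, stats)
```

In practice the records would be parsed from the scheduler's own accounting log, whose field names and units differ between schedulers, so the parsing step is deliberately left out here.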
