DOI QR코드

DOI QR Code

Correct Implementation of Sub-warp Parallel Prefix Operations based on GPU Hardware Architecture

GPU 하드웨어 아키텍처 기반 sub-warp 단위 병렬 프리픽스(prefix) 연산의 정확한 구현

  • Park, Taejung (Department of Digital Media, Duksung Women's University)
  • 박태정 (덕성여자대학교 디지털미디어학과)
  • Received : 2017.06.09
  • Accepted : 2017.06.25
  • Published : 2017.06.30

Abstract

This paper presents a CUDA (Compute Unified Device Architecture) code to achieve correct GPU parallel segmented prefix operation results with less than 32 segment length for large data arrays. Mark Harris and Michael Garland had published CUDA code to address the tasks. This paper shows that their code does not generate correct results when the local segment length is less than 32, discusses the cause of the problem, and presents a CUDA code that generates correct results. The segmented parallel prefix operation presented in this paper can be applied as a building block to various large parallel processing algorithms including the k-nearest neighbor search problems.

본 논문에서는 대규모 데이터를 길이가 32 미만인 로컬 세그먼트 단위로 구분하고 이 로컬 세그먼트 내에서 정확한 GPU 병렬 프리픽스(prefix) 연산 결과를 출력하는 CUDA (Compute Unified Device Architecture) 코드를 제시한다. 이미 Mark Harris와 Michael Garland가 이러한 목적을 수행하기 위한 CUDA 코드를 이미 발표한 바 있으나 본 논문에서는 로컬 세그먼트의 길이가 32 미만일 때 기존 코드의 결과가 정확하지 않다는 사실을 살펴 보고 그 원인을 논의한 후, 정확한 결과를 출력하는 코드를 제안한다. 본 논문에서 다루는 로컬 세그먼트 단위의 병렬 프리픽스 연산은 최인접 요소 탐색(k-nearest neighbor search) 등은 물론 다양한 대규모 병렬 처리 알고리즘을 구성하는 기본 연산으로 활용 가능하다.

Keywords

References

  1. wikipedia. Available: https://en.wikipedia.org/wiki/Prefix_sum
  2. Fermi architecture white paper. Available: http://www.nvidia.com/content/pdf/fermi_white_papers/nvidia_fermi_compute_architecture_whitepaper.pdf
  3. M. Harris, M. Garland, and W. Hwu (editor-in-chiefs), GPU Computing Gems Jade Edition, 1st ed. Morgan Kaufmann Pub., ch. 3, pp. 29-38, 2011.
  4. Parallel Prefix Sum on the GPU (Scan). Available: http://www.umiacs.umd.edu/-ramani/cmsc828e_gpusci/ScanTalk.pdf
  5. CUDA C Programming guide. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#axzz4jURrxaId
  6. T. Park, "Analysis of Morton Code Conversion for 32 Bit IEEE 754 Floating Point Variables", The Journal of Digital Contents Society, Vol. 17, No. 3, pp. 165-172, June 2016. https://doi.org/10.9728/dcs.2016.17.3.165
  7. S. Li, L. Simons, J. B. Pakaravoor, F. Abbasinejad, J. D. Owens, and N. Amenta, "kANN on the GPU with shifted sorting,". In Proceedings of the Fourth ACM SIGGRAPH / Eurographics conference on High-Performance Graphics (EGGH-HPG'12), Switzerland, pp. 39-47, 2012.
  8. J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, 1st ed. Wrox, pp. 90-93, 2014.
  9. Mark Harris, GPU Gems 3, ch. 39. "Parallel Prefix Sum (Scan) with CUDA". Available: https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
  10. CUDA Toolkit documentation. Available: http://docs.nvidia.com/cuda/
  11. J. Cheng, M. Grossman, and T. McKercher, Professional CUDA C Programming, 1st ed. Wrox, pp. 84-87, 2014.