Implementation of SIMD-based Many-Core Processor for Efficient Image Data Processing

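  • 최병국 (School of Electrical Engineering, University of Ulsan) ;
  • 김철홍 (Department of Electronics and Computer Engineering, Chonnam National University) ;
  • 김종면 (School of Electrical Engineering, University of Ulsan)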
  • Received : 2010.08.24
  • Accepted : 2010.10.29
  • Published : 2011.01.31

Abstract

As mobile multimedia devices become more widely used, demand for high-performance, low-energy multimedia processors is increasing. Application-specific integrated circuits (ASICs) can meet the performance requirements of mobile multimedia, but they provide little, if any, of the generality needed to support varied application requirements. DSP-based systems can be used for many types of applications because of their generality, but they incur higher cost and energy consumption and deliver lower performance than ASICs. To solve this problem, this paper proposes a single instruction multiple data (SIMD) based many-core processor that supports high-performance, low-power image data processing while retaining generality. The proposed processor, composed of 16 processing elements (PEs), exploits the large data-level parallelism inherent in image data processing. Simulation results indicate that the proposed SIMD-based many-core processor achieves, on average, 22 times higher performance, 7 times higher energy efficiency, and 3 times higher area efficiency than conventional commercial high-performance processors.
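
The SIMD execution model described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' design: the contiguous pixel-to-PE mapping, the thresholding operation, and the names `partition`, `broadcast`, and `simd_threshold` are all assumptions chosen for illustration. Only the PE count of 16 comes from the abstract.

```python
NUM_PES = 16  # the abstract states 16 processing elements


def partition(pixels, num_pes=NUM_PES):
    """Split a flat pixel list into num_pes contiguous chunks.

    Hypothetical mapping: the abstract does not say how pixels are
    assigned to PEs.
    """
    chunk = (len(pixels) + num_pes - 1) // num_pes  # ceiling division
    return [pixels[i * chunk:(i + 1) * chunk] for i in range(num_pes)]


def broadcast(instruction, local_memories):
    """Apply one instruction to every PE's local data.

    Models the SIMD idea: a single instruction stream, many data items.
    The PEs are simulated sequentially here; hardware would run them in lockstep.
    """
    return [[instruction(p) for p in mem] for mem in local_memories]


def threshold(pixel, level=128):
    """Example per-pixel operation (assumed, not taken from the paper)."""
    return 255 if pixel >= level else 0


def simd_threshold(pixels):
    """Threshold an image by distributing it over the simulated PE array."""
    memories = partition(pixels)
    results = broadcast(threshold, memories)
    # Gather the per-PE results back into one flat image.
    return [p for mem in results for p in mem]
```

Because every pixel is processed independently with the same operation, the work divides evenly across the PEs, which is the kind of data-level parallelism the abstract refers to.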
