
Using VTune Amplifier 2016 for Systems for HelloOpenCL GPU Application Analysis

Prerequisite:

We recommend learning how to run a basic profiling session in VTune before reading this article. If you have not done so yet, refer to the tutorial documents first to understand the VTune basics.

Introduction

VTune Amplifier 2016 for Systems can also be used to analyze OpenCL™ programs. This article shows how to use this capability and how to create a simple OpenCL program, HelloOpenCL, with Microsoft Visual Studio and the Intel OpenCL Code Builder.

OpenCL is an open standard for portable parallel programming on heterogeneous systems, for example systems that combine CPUs, GPUs, DSPs, FPGAs and other hardware devices. An OpenCL application consists mainly of two parts: 1) host API code and 2) device program code, which is called kernel code. The host APIs fall into two groups. The platform APIs query the available devices and their capabilities, and select and initialize OpenCL devices. The runtime APIs set up and execute kernels on the selected devices. To develop kernel code that runs on the OpenCL runtime, you can use the Intel OpenCL Code Builder. Different hardware vendors ship their own OpenCL runtime implementations, so make sure the runtime required for your device is installed.

VTune's OpenCL analysis support helps identify the hotspot kernels, that is, the kernels that take the most time, and shows how often each kernel is invoked. Copying data between different hardware devices also takes time because it causes hardware context switches. In VTune, OpenCL memory read/write bandwidth metrics can help investigate possible stalls caused by memory accesses. The following sections show how to create a simple HelloOpenCL program and how to use the VTune OpenCL analysis with the new architecture diagram feature.

Start the first OpenCL GPU program: HelloOpenCL

Before you start developing the HelloOpenCL program, you need to download a few things. To build kernel code and check the platform capabilities, download the OpenCL Code Builder, which is part of the Intel INDE package. In addition, an OpenCL runtime implementation must be installed on the target device. The Intel OpenCL implementation is included in the Intel graphics driver package. The links under See also lead to the Code Builder getting-started guide and to the driver downloads with further options and instructions.

After installing the OpenCL Code Builder, you can check which OpenCL devices it supports. The test machine used here is based on a 4th Generation Intel® Core™ processor for client systems, code-named Haswell.

After confirming that your environment has OpenCL device support, as the figure above shows, you can use Microsoft Visual Studio Professional 2013 to create your first OpenCL program from the installed HelloOpenCL template, or directly use the sample code included with this article. The sample code asks the GPU device to add two 2-dimensional buffers element by element and produce a 2-dimensional output buffer. The same pattern applies to typical image filter applications. The HelloOpenCL sample code is attached to this article as a download.
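The computation described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the code shipped with the sample: the kernel source string is a guess at what an "Add" kernel could look like, the element type int is assumed, and the names add_kernel_source, add_2d_reference and first_mismatch are hypothetical. The CPU reference function is the kind of helper a host program might use to check the buffer read back from the GPU.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: illustrative OpenCL C source for an "Add" kernel.
   Each work-item handles one (x, y) element of the 2-D range. */
const char *const add_kernel_source =
    "__kernel void Add(__global const int *a,\n"
    "                  __global const int *b,\n"
    "                  __global int *c)\n"
    "{\n"
    "    const size_t x = get_global_id(0);\n"
    "    const size_t y = get_global_id(1);\n"
    "    const size_t width = get_global_size(0);\n"
    "    const size_t i = y * width + x;\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

/* CPU reference (hypothetical helper): element-wise sum of two
   row-major width x height buffers, written to out. */
void add_2d_reference(const int *a, const int *b, int *out,
                      size_t width, size_t height)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t y = 0; y < height; ++y) {
        for (size_t x = 0; x < width; ++x) {
            const size_t i = y * width + x;
            out[i] = a[i] + b[i];
        }
    }
}

/* Returns the index of the first element where the two buffers differ,
   or -1 if all count elements match. */
long first_mismatch(const int *gpu_result, const int *reference,
                    size_t count)
{
    assert(gpu_result != NULL && reference != NULL);
    for (size_t i = 0; i < count; ++i) {
        if (gpu_result[i] != reference[i]) {
            return (long)i;
        }
    }
    return -1;
}
```

A host program could run add_2d_reference on the same inputs it sends to the GPU and pass both results to first_mismatch; a return value of -1 means the kernel output agrees with the reference.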

Profiling HelloOpenCL with VTune Amplifier 2016 for Systems

Once the HelloOpenCL program builds successfully, VTune can be launched directly from the Visual Studio IDE to profile the application. Follow the setup steps shown in the following figure to configure OpenCL GPU profiling in VTune.

  1. Launch VTune in the VS 2013 IDE.
  2. Choose the Advanced Hotspots analysis type.
  3. Choose the graphics hardware events for memory accesses.
  4. Check the OpenCL programs option.

After VTune has collected the data successfully, switch to the Graphics tab to see the analysis timeline view shown below. The numbered items that follow briefly describe each marked area.

  1. VTune provides several grouping views for the function call list. For OpenCL GPU programs, the dedicated “Computing Task Purpose/*” grouping is provided to present the efficiency of the OpenCL API calls with OpenCL-aware metrics.
  2. These annotations mark the OpenCL host API calls running on the CPU side. They also show how much CPU time each task function takes. In detail, clBuildProgram compiles the kernel source into a program that the OpenCL runtime implementation can execute. clCreateKernel selects one kernel function from a previously built OpenCL program, which may contain multiple kernel functions. clEnqueueNDRangeKernel places a kernel function into an OpenCL command queue, from which the GPU picks it up and executes it.
  3. The “Intel(R) HD Graphics 4…” timeline shows that “Add” is the kernel function scheduled on the Intel GPU runtime implementation.
  4. This area highlights when the actual “Add” activity occurs on the GPU hardware. Between the time the kernel function is scheduled and the time it actually executes, there is a delay caused by preparation work and context switching.
  5. This is a new feature in VTune Amplifier 2016. As the following figure shows, it presents data transfer efficiency as statistics and shows bandwidth figures on a general GPU architecture diagram. The untyped memory read bandwidth is twice the write bandwidth, which matches the behavior of the HelloOpenCL application: it reads two input buffers and writes one output buffer.
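The 2:1 ratio between read and write bandwidth in item 5 follows from simple arithmetic, sketched below in C. The struct traffic_t, the function add_kernel_traffic and the buffer dimensions in the usage note are hypothetical, and the model assumes that every element of both inputs is read exactly once and every output element is written exactly once, ignoring caching effects.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for the expected memory traffic of one kernel run. */
typedef struct {
    size_t bytes_read;     /* bytes loaded from the two input buffers */
    size_t bytes_written;  /* bytes stored to the output buffer */
} traffic_t;

/* Estimates the traffic of an element-wise add over width x height
   elements of elem_size bytes each (simplified model, no caching). */
traffic_t add_kernel_traffic(size_t width, size_t height, size_t elem_size)
{
    assert(elem_size > 0);
    const size_t buffer_bytes = width * height * elem_size;
    traffic_t t;
    t.bytes_read = 2 * buffer_bytes;   /* inputs A and B are both read */
    t.bytes_written = buffer_bytes;    /* only output C is written */
    return t;
}
```

For example, with assumed 1024 x 1024 buffers of 4-byte elements, add_kernel_traffic(1024, 1024, 4) gives 8388608 bytes read and 4194304 bytes written, so the read volume is twice the write volume regardless of the buffer size.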

From this architecture diagram, you can also see that the buffers used by the HelloOpenCL application are allocated in the L3 cache. The GPU has plenty of headroom, since it is stalled or idle most of the time. In other words, the Intel OpenCL device can take on more complex tasks.

See also

https://software.intel.com/articles/getting-started-with-opencl-code-builder

https://software.intel.com/en-us/articles/opencl-drivers

