CSE Parallel and High Performance Computing
Lecture 7 - Introduction to OpenCL

HPC Software Analyst, SHARCNET, University of Waterloo

GPU computing timeline
Before: calculations on the GPU were done through the graphics API. The steady increase in CPU clock speed comes to a halt and vendors switch to multicore chips to compensate; at the same time, the computational power of GPUs keeps increasing.
November 2006 - CUDA released by NVIDIA
November 2006 - CTM (Close To Metal) released by ATI
December 2007 - CTM succeeded by the AMD Stream SDK
December 2008 - technical specification for OpenCL 1.0 released
April 2009 - first OpenCL 1.0 GPU drivers released by NVIDIA
August 2009 - Mac OS X 10.6 Snow Leopard released, with OpenCL 1.0 included
September 2009 - public release of OpenCL drivers by NVIDIA
December 2009 - AMD releases ATI Stream SDK 2.0 with OpenCL support
March 2010 - CUDA 3.0 released, incorporating OpenCL
June 2010 - OpenCL 1.1 specified
August 2011 - ATI switches to OpenCL
November 2011 - OpenCL 1.2 specified
November 2013 - OpenCL 2.0 specified

OpenCL
OpenCL (Open Computing Language) is an open standard for parallel programming of heterogeneous systems, managed by the Khronos Group. The aim is for it to form the foundation layer of a parallel computing ecosystem of platform-independent tools. OpenCL includes a language for writing kernels, plus APIs used to define and control platforms. It is targeted at GPUs, but also at multicore CPUs and other hardware (Cell etc.), though to run efficiently each family of devices needs different code.

OpenCL is largely derived from CUDA (no need to reinvent the wheel) and has the same basic functioning: a kernel is sent to the accelerator (compute device), which is composed of compute units whose processing elements work on work-items. Some names changed between CUDA and OpenCL:
CUDA thread - OpenCL work-item
CUDA block - OpenCL work-group
Both CUDA and OpenCL allow for detailed low-level optimization of the GPU.
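The renaming is easiest to see in code. Below is a small illustrative sketch, not taken from the lecture, of the same element-wise kernel written first in CUDA and then in OpenCL C; the kernel name and arguments are invented for the example.

// CUDA: each thread handles one element; threads are grouped into blocks
__global__ void scale(float *x, float a, unsigned int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        x[i] = a * x[i];
}

// OpenCL C: each work-item handles one element; work-items are grouped into work-groups
__kernel void scale(__global float *x, float a, unsigned int n)
{
    int i = get_global_id(0);                        // global work-item index
    if (i < n)
        x[i] = a * x[i];
}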
OpenCL device independent - really?
Actually, there is no way to write code that will run equally efficiently on both GPU and multi-CPU architectures. However, OpenCL can detect the features of the architecture and run code appropriate for each. OpenCL code should run efficiently on many GPUs with a bit of fine tuning. On the other hand, OpenCL code highly optimized for NVIDIA hardware will not run that efficiently on AMD hardware. As High Performance Computing code needs to be highly optimized, OpenCL may not offer a practical ability to be device independent.

Advantages of running on multiple devices
The more machines can run the code, the better. Some devices are much better than others for certain problems: NVIDIA and ATI GPUs can have widely different performance for certain types of problems. For example, recent ATI GPUs had a major advantage in the speed of certain types of integer operations, which made them much better for some digital coin mining computations.

More fundamental differences in architecture
Compare two GPU cards of similar generation from NVIDIA and ATI:
NVIDIA GTX 580 - 1544 MHz, 512 execution units (CUDA cores), in 16 multiprocessors, each with 32 cores
ATI Radeon HD 5870 - 850 MHz, 1600 execution units (shaders), in 20 multiprocessors, each with 16 thread processors, each with 5 execution units in a VLIW (Very Long Instruction Word) architecture
The GTX 580 runs 512 threads at 1544 MHz, each thread executing one instruction per cycle, with the same instruction for each of the 32 threads in a multiprocessor. The HD 5870 runs 320 threads at 850 MHz, each thread executing five instructions per cycle, with the same instructions for each of the 16 threads in a multiprocessor. The HD 5870 will show higher performance if the problem fits the VLIW architecture well, that is, if it breaks down into groups of 5 completely independent instructions that can be executed on a VLIW unit at the same time.

CUDA vs. OpenCL
NVIDIA fully supports OpenCL even though it does not run exclusively on their hardware. The strategy is to increase the number of GPU users by making software more portable and accessible, which OpenCL is meant to do; if there are more users, NVIDIA will sell more hardware. As CUDA is fully controlled by NVIDIA, at any given time it contains more bleeding-edge features than OpenCL, which is overseen by a consortium and hence slower to incorporate new features. If you want your GPU code to run on both NVIDIA and AMD/ATI devices (still the two main players at present), OpenCL is the only way to accomplish that.

Available OpenCL environments
NVIDIA OpenCL - distributed with CUDA since version 3.0; no CPU support, for NVIDIA GPU cards only
Apple OpenCL - included as a standard feature since Mac OS X 10.6 Snow Leopard (requires Xcode); supports both graphics cards and CPUs (Apple hardware only)
AMD (ATI) OpenCL - supports AMD GPU cards and CPUs
Intel OpenCL - supports Intel CPUs and Integrated Graphics Processors (IGP)

OpenCL References
The OpenCL Programming Book
Khronos consortium
NVIDIA CUDA OpenCL: https://developer.nvidia.com/opencl
NVIDIA GPU Computing SDK code samples, on monk: /opt/sharcnet/cuda/4.1/sdk/opencl
Apple OpenCL
AMD (ATI) OpenCL

OpenCL on SHARCNET
OpenCL is part of CUDA, so it is available on the same systems where CUDA is installed. The OpenCL SDK was discontinued as part of the CUDA SDK but is still available for download. Four viz stations have ATI cards: viz7-uwo, viz9-uwo, viz10-uwo, viz11-uwo. The Intel Phi machine should have OpenCL.

Know your hardware
OpenCL programs should aim to be hardware agnostic, since one of the goals of OpenCL is to have programs run on multiple devices. The program should find the relevant system information at runtime; OpenCL provides methods for this. At minimum, the programmer must determine what OpenCL devices are available and choose which are to be used by the program (a minimal device-selection sketch follows below). A sample program that queries this information follows.
Source code on monk: /home/ppomorsk/cse746_lec7/get_info/get_opencl_information.c
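As an illustration of that minimum step (this sketch is not part of the lecture's sample code, and the variable names are invented), a program can ask the first platform for a GPU device and fall back to a CPU device if none is found:

cl_platform_id platform;
cl_device_id device;
cl_int err;

/* take the first available platform */
err = clGetPlatformIDs(1, &platform, NULL);

/* prefer a GPU device, fall back to a CPU device if no GPU is present */
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
if (err == CL_DEVICE_NOT_FOUND)
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL);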
Program to get information - What is the platform?

#include <stdio.h>
#include <CL/cl.h>

int main(int argc, char** argv)
{
    char dname[500];
    cl_device_id devices[10];
    cl_uint num_devices, entries;
    cl_ulong long_entries;
    int d;
    cl_int err;
    cl_platform_id platform_id = NULL;
    size_t p_size;

    /* obtain list of platforms available */
    err = clGetPlatformIDs(1, &platform_id, NULL);
    if (err != CL_SUCCESS)
    {
        printf("Error: Failure in clGetPlatformIDs, error code=%d\n", err);
        return 0;
    }

    /* obtain information about platform */
    clGetPlatformInfo(platform_id, CL_PLATFORM_NAME, 500, dname, NULL);
    printf("CL_PLATFORM_NAME = %s\n", dname);
    clGetPlatformInfo(platform_id, CL_PLATFORM_VERSION, 500, dname, NULL);
    printf("CL_PLATFORM_VERSION = %s\n", dname);

Program to get information cont. - What are the devices?

    /* obtain list of devices available on platform */
    clGetDeviceIDs(platform_id, CL_DEVICE_TYPE_ALL, 10, devices, &num_devices);
    printf("%d devices found\n", num_devices);

    /* query devices for information */
    for (d = 0; d < num_devices; ++d) {
        clGetDeviceInfo(devices[d], CL_DEVICE_NAME, 500, dname, NULL);
        printf("Device #%d name = %s\n", d, dname);
        clGetDeviceInfo(devices[d], CL_DRIVER_VERSION, 500, dname, NULL);
        printf("\tDriver version = %s\n", dname);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tGlobal Memory Cache (MB):\t%llu\n", long_entries/1024/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tLocal Memory (KB):\t%llu\n", long_entries/1024);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_CLOCK_FREQUENCY, sizeof(cl_ulong), &long_entries, NULL);
        printf("\tMax clock (MHz) :\t%llu\n", long_entries);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(size_t), &p_size, NULL);
        printf("\tMax Work Group Size:\t%zu\n", p_size);
        clGetDeviceInfo(devices[d], CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(cl_uint), &entries, NULL);
        printf("\tNumber of parallel compute cores:\t%d\n", entries);
    }
    return 0;
}

Output on Apple MacBook laptop - both GPU and CPU seen

ppomorsk-mac:tryopencl pawelpomorski$ gcc -o test.x get_opencl_information.c -w -m32 -lm -lstdc++ -framework OpenCL
ppomorsk-mac:tryopencl pawelpomorski$ ./test.x
CL_PLATFORM_NAME = Apple
CL_PLATFORM_VERSION = OpenCL 1.0 (Feb :46:58)
2 devices found
Device #0 name = GeForce 9400M
    Driver version = CLH 1.0
    Global Memory (MB): 256
    Global Memory Cache (MB): 0
    Local Memory (KB): 16
    Max clock (MHz) : 1100
    Max Work Group Size: 512
    Number of parallel compute cores: 2
Device #1 name = Intel(R) Core(TM)2 Duo CPU 2.00GHz
    Driver version = 1.0
    Global Memory (MB): 1536
    Global Memory Cache (MB): 3
    Local Memory (KB): 16
    Max clock (MHz) : 2000
    Max Work Group Size: 1
    Number of parallel compute cores: 2
ppomorsk-mac:tryopencl pawelpomorski$

Getting a code to run on a GPU
Take the existing serial program and separate out the parts that will continue to run on the host from the parts that will be sent to the GPU. The GPU parts need to be rewritten in the form of kernel functions. Add code to the host that manages the GPU overhead: creating kernels, moving data between host and GPU, etc.

OpenCL "Hello, World!" example
It is not practical to do a proper "Hello, World!", as OpenCL devices cannot access standard output directly. Our example program will instead pass an array of numbers from the host to the GPU, square each number in the array on the GPU, then return the modified array to the host. This program is for learning purposes only and will not run efficiently on a GPU, but it demonstrates the basic structure common to all OpenCL programs. Error checks are not shown on the slides for clarity, but having them is essential when writing OpenCL code; a simple checking macro is sketched below.
Source code on monk:
/home/ppomorsk/cse746_lec7/hello/no_errorchecks_hello.c
/home/ppomorsk/cse746_lec7/hello/hello.c
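One common way to add those checks without cluttering the code is a small wrapper macro. The sketch below only illustrates the idea and is not part of the lecture's source files; the macro name CHECK_ERR and its message format are invented here.

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* abort with a message if an OpenCL call did not return CL_SUCCESS */
#define CHECK_ERR(err, msg)                                                   \
    do {                                                                      \
        if ((err) != CL_SUCCESS) {                                            \
            fprintf(stderr, "%s failed, error code=%d\n", (msg), (int)(err)); \
            exit(1);                                                          \
        }                                                                     \
    } while (0)

/* usage, for example:
     err = clGetPlatformIDs(1, &platform_id, NULL);
     CHECK_ERR(err, "clGetPlatformIDs");
*/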
Program flow - OpenCL function calls
clGetPlatformIDs, clGetDeviceIDs, clCreateContext, clCreateCommandQueue - organize resources, create the command queue
clCreateProgramWithSource, clBuildProgram, clCreateKernel - compile the kernel
clCreateBuffer, clEnqueueWriteBuffer - transfer data from host to GPU memory
clSetKernelArg, clGetKernelWorkGroupInfo, clEnqueueNDRangeKernel, clFinish - launch threads running kernels on the GPU, perform the main computation
clEnqueueReadBuffer - transfer data from GPU to host memory
clRelease... - free all allocated memory

Kernel code - string with OpenCL C code

// headers and array size needed by the example (not shown on the original slides; the DATA_SIZE value is assumed)
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

#define DATA_SIZE (1024)

// Simple compute kernel which computes the square of an input array
const char *KernelSource = "\n" \
"__kernel void square(                       \n" \
"   __global float* input,                   \n" \
"   __global float* output,                  \n" \
"   const unsigned int count)                \n" \
"{                                           \n" \
"   int i = get_global_id(0);                \n" \
"   if(i < count)                            \n" \
"       output[i] = input[i] * input[i];     \n" \
"}                                           \n" \
"\n";

Define variables, set input data

int main(int argc, char** argv)
{
    int err;                            // error code returned from API calls
    float data[DATA_SIZE];              // original data set given to device
    float results[DATA_SIZE];           // results returned from device
    unsigned int correct;               // number of correct results returned

    size_t global;                      // global domain size for our calculation
    size_t local;                       // local domain size for our calculation

    cl_device_id device_id;             // compute device id
    cl_context context;                 // compute context
    cl_command_queue commands;          // compute command queue
    cl_program program;                 // compute program
    cl_kernel kernel;                   // compute kernel

    cl_mem input;                       // device memory used for the input array
    cl_mem output;                      // device memory used for the output array

    cl_platform_id platform_id = NULL;

    // Fill our data set with random float values
    int i = 0;
    unsigned int count = DATA_SIZE;
    for(i = 0; i < count; i++)
        data[i] = rand() / (float)RAND_MAX;
Organise resources, create command queue

    // Determine the OpenCL platform
    err = clGetPlatformIDs(1, &platform_id, NULL);

    // Connect to a compute device
    int gpu = 1;
    err = clGetDeviceIDs(platform_id, gpu ? CL_DEVICE_TYPE_GPU : CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

    // Create a compute context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    commands = clCreateCommandQueue(context, device_id, 0, &err);

Compile kernel

    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1, (const char **) &KernelSource, NULL, &err);

    // Build the program executable
    err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "square", &err);

Transfer data from host to GPU memory

    // Create the input and output arrays in device memory for our calculation
    input = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(float) * count, NULL, NULL);
    output = clCreateBuffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * count, NULL, NULL);

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(commands, input, CL_TRUE, 0, sizeof(float) * count, data, 0, NULL, NULL);

Launch threads running kernels on GPU, perform main computation

    // Set the arguments to our compute kernel
    err = 0;
    err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &input);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &output);
    err |= clSetKernelArg(kernel, 2, sizeof(unsigned int), &count);

    // Execute the kernel over the entire range of our 1D input data set,
    // using one work-item per work-group (allows for arbitrary length of the data array)
    global = count;
    local = 1;
    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

    // Wait for the commands to get serviced before reading back results
    clFinish(commands);

Transfer data from GPU to host memory

    // Read back the results from the device to verify the output
    err = clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(float) * count, results, 0, NULL, NULL);

    // Validate our results
    correct = 0;
    for(i = 0; i < count; i++)
    {
        if(results[i] == data[i] * data[i])
            correct++;
    }

    // Print a brief summary detailing the results
    printf("Computed '%d/%d' correct values!\n", correct, count);

Free all allocated memory

    // Shutdown and cleanup
    clReleaseMemObject(input);
    clReleaseMemObject(output);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(commands);
    clReleaseContext(context);

    return 0;
}

If you are trying to use an OpenCL profiler and it crashes at the end of the run, not freeing all memory could be the cause.

Compiling kernel - Online approach
In our example we compiled the kernel code at runtime (an online compile), with the kernel code specified as a string in the main program file. Kernel code can also be stored in a file (.cl) and loaded at runtime; a possible kernel.cl for this example is sketched below.

    const char filename[] = "./kernel.cl";
    FILE *fp;
    char *source_str;
    size_t source_size;

    /* Load kernel source code */
    fp = fopen(filename, "r");
    if (!fp) {
        fprintf(stderr, "Failed to load kernel.\n");
        exit(1);
    }
    source_str = (char *)malloc(MAX_SOURCE_SIZE);
    source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
    fclose(fp);

    // ...

    /* Create kernel program from the read-in source */
    program = clCreateProgramWithSource(context, 1, (const char **)&source_str, (const size_t *)&source_size, &ret);
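For completeness, the kernel.cl file for this example would contain the same OpenCL C source that was embedded as a string earlier, just without the C string escaping. The file below is an illustration, not one of the lecture's distributed files.

// kernel.cl - square each element of the input array
__kernel void square(
    __global float* input,
    __global float* output,
    const unsigned int count)
{
    int i = get_global_id(0);
    if (i < count)
        output[i] = input[i] * input[i];
}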
Compiling kernel - Online approach
Code compiled at runtime may help with portability, but it makes debugging somewhat harder. If the compiler encounters an error, the build fails but the error is not shown by default. Even correct code may fail to build if some environment variables are not set; on AMD devices, kernels using double precision variables require setting CL_KHR_FP64=1. To see what went wrong, retrieve the build log when clBuildProgram fails:

    if (clBuildProgram(program, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
        printf("Error building program\n");
        printf("Build log output:\n");
        size_t len;
        char buffer[2048];
        clGetProgramBuildInfo(program, devices[device_used], CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
        /* print error log */
        printf("%s\n", buffer);
        return 1;
    }

Compiling kernel - Offline approach
It is possible to compile OpenCL source code (kernels) offline; this is the approach used in CUDA. Upside: no time needs to be spent compiling during runtime. Downside: the executable is not portable, so a separate binary must be compiled for each device the code runs on. There is no freestanding compiler for OpenCL (like nvcc for CUDA); kernels have to be compiled at runtime once and the resulting binary saved, and that binary can then be loaded in the future to avoid the compiling step.

Task queue
The queue is used to launch kernels, in a precisely controlled order if required. Event objects contain information about whether various operations have finished. For example, clEnqueueTask launches a single kernel after checking the list of provided event objects for completion, and returns its own event object.

    // create a queue enabled for out-of-order (parallel) execution
    commands = clCreateCommandQueue(context, device_id, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
    // ...

    // no synchronization
    clEnqueueTask(commands, kernel_e, 0, NULL, NULL);

    // synchronize so that kernel E starts only after kernels A, B, C, D finish
    cl_event events[4];   // define event object array
    clEnqueueTask(commands, kernel_a, 0, NULL, &events[0]);
    clEnqueueTask(commands, kernel_b, 0, NULL, &events[1]);
    clEnqueueTask(commands, kernel_c, 0, NULL, &events[2]);
    clEnqueueTask(commands, kernel_d, 0, NULL, &events[3]);
    clEnqueueTask(commands, kernel_e, 4, events, NULL);

Queueing data parallel tasks
clEnqueueNDRangeKernel is used to launch data parallel tasks, i.e. copies of a kernel which are identical except that they operate on different data. For example, to launch #(global) work-items (copies of the kernel, i.e. threads) grouped into work-groups of size #(local), one would use:

    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, NULL);

#(global) must be divisible by #(local), and the maximum value of #(local) is device dependent. Each work-item can retrieve:
get_global_id(0) - its index among all work-items
get_local_id(0) - its index among the work-items in its work-group
get_group_id(0) - the index of its work-group
This information can be used to determine which data to operate on.

Organizing work-items in more than 1D
Convenient and efficient for many tasks; 2D and 3D arrangements are possible (a 2D kernel sketch follows below).

    // 2D example
    // launch 256 work-items, organised in a 16x16 grid,
    // grouped into work-groups of 16, organised as 4x4
    size_t global[2], local[2];
    global[0] = 16; global[1] = 16;
    local[0]  = 4;  local[1]  = 4;
    err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL, global, local, 0, NULL, NULL);

The maximum possible values in the global and local arrays are device dependent. For 2D, a work-item can retrieve its global index coordinates as (x,y) = (get_global_id(0), get_global_id(1)); for 3D, it can similarly retrieve (x,y,z) = (get_global_id(0), get_global_id(1), get_global_id(2)).
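As an illustration (not taken from the lecture), a kernel launched with that 2D configuration could address a 16x16 matrix as follows; the kernel name, its arguments, and the row-major layout are assumptions made for this example.

// each work-item computes one element of the sum of two matrices
__kernel void add_matrices(
    __global const float* a,
    __global const float* b,
    __global float* c,
    const unsigned int width)   // number of columns, 16 in the launch above
{
    int x = get_global_id(0);   // column index
    int y = get_global_id(1);   // row index
    c[y * width + x] = a[y * width + x] + b[y * width + x];
}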
Measuring execution time
OpenCL is an abstraction layer that allows the same code to be executed on different platforms. The code is guaranteed to run, but its speed of execution depends on the device and on the type of parallelism used. To get maximum performance, device- and parallelism-dependent tuning must be performed, and to do this the execution time must be measured. OpenCL provides convenient ways to do so: event objects can also contain information about how long the task associated with an event took to execute. clGetEventProfilingInfo can obtain the start and end time of the event, and taking the difference gives the event duration in nanoseconds.

    commands = clCreateCommandQueue(context, device_id, CL_QUEUE_PROFILING_ENABLE, &err);

    cl_event event;
    cl_ulong start, end;

    err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL, &global, &local, 0, NULL, &event);
    clWaitForEvents(1, &event);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
    printf("time for event (ms): %10.5f \n", (end - start) / 1000000.0);   /* profiling times are in ns */
    // ...
    clReleaseEvent(event);   // important to free memory

Minimizing memory transfers
Memory transfers between the host and the GPU must be minimized, as they are quite slow. If you have a kernel which does not show any performance improvement when run on the GPU, run it on the GPU anyway if doing so eliminates the need for a device-to-host memory transfer. Intermediate data structures should be created in GPU memory, operated on by the GPU, and destroyed without ever being mapped or copied to host memory (a sketch of this pattern follows below). Because of the per-transfer overhead, batching small transfers into one large transfer works better than making each transfer separately.
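As a closing illustration of keeping intermediate data on the device, two kernels can be chained through a temporary device buffer that is never read back to the host. This sketch reuses the variable names of the earlier example; the kernel objects kernel_a and kernel_b are assumed to exist, and this is not code from the lecture.

    // intermediate buffer lives only in device memory
    cl_mem tmp = clCreateBuffer(context, CL_MEM_READ_WRITE, sizeof(float) * count, NULL, &err);

    // kernel_a writes its output into tmp ...
    clSetKernelArg(kernel_a, 1, sizeof(cl_mem), &tmp);
    clEnqueueNDRangeKernel(commands, kernel_a, 1, NULL, &global, &local, 0, NULL, NULL);

    // ... and kernel_b consumes tmp directly, still on the device
    // (the default in-order queue guarantees kernel_a completes first)
    clSetKernelArg(kernel_b, 0, sizeof(cl_mem), &tmp);
    clEnqueueNDRangeKernel(commands, kernel_b, 1, NULL, &global, &local, 0, NULL, NULL);

    // only the final result is copied back to the host
    clEnqueueReadBuffer(commands, output, CL_TRUE, 0, sizeof(float) * count, results, 0, NULL, NULL);

    clReleaseMemObject(tmp);   // tmp is never mapped or copied to host memory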