Intro OpenCL Tutorial

Benedict R. Gaster, AMD Architect, OpenCL™

OpenCL™ is a young technology, and, while a specification has been published (www.khronos.org/registry/cl/), there are currently few documents that provide a basic introduction with examples. This article helps make OpenCL™ easier to understand and implement.

Note that:

  • I work at AMD, and, as such, I will test all example code on our implementation for both Windows® and Linux®; however, my intention is to illustrate the use of OpenCL™ regardless of platform. All examples are written in pure OpenCL™ and should run equally well on any implementation.
  • I have done my best to provide examples that work out-of-the-box on non-AMD implementations of OpenCL™, but I will not be testing them on non-AMD implementations; therefore, it is possible that an example might not work as expected on such systems. If this is the case, please let me know via our OpenCL™ forum, and I will do my best to rectify the code and publish an update.

The following “Hello World” tutorial provides a simple introduction to OpenCL™. I hope to follow up this first tutorial with additional ones covering topics such as:

  • Using platform and device layers to build robust OpenCL™
  • Program compilation and kernel objects
  • Managing buffers
  • Kernel execution
  • Kernel programming – basics
  • Kernel programming – synchronization
  • Matrix multiply – a case study
  • Kernel programming – built-ins

The “Hello World” program in OpenCL™

Here are some notes of caution on how the OpenCL™ samples are written:

  • OpenCL™ specifies a host API that is defined to be compatible with C89 and does not make any mention of C++ or other programming language bindings. Currently, there are several efforts to develop bindings for other languages (see the links at the end of this article), and, specifically, there has been a strong push to develop C++ bindings. In this and subsequent tutorials, I use the C++ bindings exclusively and describe OpenCL™ in these terms. See the OpenCL™ 1.0 specification for the corresponding C API.Alternatively, you can view the source for the C++ bindings to see what underlying OpenCL™ function is used, and with what arguments by the particular C++ binding.
  • OpenCL™ defines a C-like language for programming compute device programs. These programs are passed to the OpenCL™ runtime via API calls expecting values of type char *. Often, it is convenient to keep these programs in separate source files. For this and subsequent tutorials, I assume the device programs are stored in files with names of the form name_kernels.cl, where name varies, depending on the context, but the suffix _kernels.cl does not. The corresponding device programs are loaded at runtime and passed to the OpenCL™ API. There are many alternative approaches to this; this one is chosen for readability.

For this first OpenCL™ program, we start with the source for the host application.

Header files

Just like any other external API used in C++, you must include a header file when using the OpenCL™ API. Usually, this is in the directory CL within the primary include directory. For the C++ bindings we have (replace the straight C API with cl.h):

  1. #include <utility>
  2. #define __NO_STD_VECTOR // Use cl::vector instead of STL version
  3. #include <CL/cl.hpp>

For our program, we use a small number of additional C++ headers, which are agnostic to OpenCL™.

  1. #include <cstdio>
  2. #include <cstdlib>
  3. #include <fstream>
  4. #include <iostream>
  5. #include <string>
  6. #include <iterator>

As we will dynamically request an OpenCL™ device to return the “Hello World\n” string, we define it as a constant to use in calculations.

  1. const std::string hw("Hello World\n");

Errors

A common property of most OpenCL™ API calls is that they either return an error code (type cl_int) as the result of the function itself, or they store the error code at a location passed by the user as a parameter to the call. As with any API call that can fail, it is important, for the application to check its behavior correctly in the case of error. For the most part we will not concern ourselves with recovering from an error; for simplicity, we define a function, checkErr, to see that a certain call has completed successfully. OpenCL™ returns the value CL_SUCCESS in this case. If it is not, it outputs a user message and exits; otherwise, it simply returns.

  1. inline void
  2. checkErr(cl_int err, const char * name)
  3. {
  4. if (err != CL_SUCCESS) {
  5. std::cerr << "ERROR: " << name
  6. << " (" << err << ")" << std::endl;
  7. exit(EXIT_FAILURE);
  8. }
  9. }

A common paradigm for error handling in C++ is through the use of exceptions, and the OpenCL™ C++ bindings provide just such an interface. A later tutorial will cover the use of exceptions and other optional features provided by the C++ bindings. For now, let’s look at the one remaining function, “main,” necessary for our first OpenCL™ application.

OpenCL™ Contexts

The first step to initializing and using OpenCL™ is to create a context. The rest of the OpenCL™ work (creating devices and memory, compiling and running programs) is performed within this context. A context can have a number of associated devices (for example, CPU or GPU devices), and, within a context, OpenCL™ guarantees a relaxed memory consistency between devices. We will look at this in detail in a later tutorial; for now, we use a single device, CL_DEVICE_TYPE_CPU, for the CPU device. We could have used CL_DEVICE_TYPE_GPU or some other support device type, assuming that the OpenCL™ implementation supports that device. But before we can create a context we must first queue the OpenCL runtime to determine which platforms, i.e. different vendor’s OpenCL implementations, are present. The classcl::Platform provides the static method cl::Platform::get for this and returns a list of platforms. For now we select the first platform and use this to create a context. The constructor cl::Context should be successful and, in this case, the value of err is CL_SUCCESS.

  1. int
  2. main(void)
  3. {
  4. cl_int err;
  5. cl::vector< cl::Platform > platformList;
  6. cl::Platform::get(&platformList);
  7. checkErr(platformList.size()!=0 ? CL_SUCCESS : -1, "cl::Platform::get");
  8. std::cerr << "Platform number is: " << platformList.size() << std::endl;std::string platformVendor;
  9. platformList[0].getInfo((cl_platform_info)CL_PLATFORM_VENDOR, &platformVendor);
  10. std::cerr << "Platform is by: " << platformVendor << "\n";
  11. cl_context_properties cprops[3] =
  12. {CL_CONTEXT_PLATFORM, (cl_context_properties)(platformList[0])(), 0};cl::Context context(
  13. CL_DEVICE_TYPE_CPU,
  14. cprops,
  15. NULL,
  16. NULL,
  17. &err);
  18. checkErr(err, "Conext::Context()");

Before delving into compute devices, where the ‘real’ work happens, we first allocate an OpenCL™ buffer to hold the result of the kernel that will be run on the device, i.e. the string “Hello World\n.” For now we simply allocate some memory on the host and request that OpenCL™ use this memory directly, passing the flagCL_MEM_USE_HOST_PTR, when creating the buffer.

  1. char * outH = new char[hw.length()+1];
  2. cl::Buffer outCL(
  3. context,
  4. CL_MEM_WRITE_ONLY | CL_MEM_USE_HOST_PTR,
  5. hw.length()+1,
  6. outH,
  7. &err);
  8. checkErr(err, "Buffer::Buffer()");

 

OpenCL™ Devices

In OpenCL™ many operations are performed with respect to a given context. For example, buffers (1D regions of memory) and images (2D and 3D regions of memory) allocation are all context operations. But there are also device specific operations. For example, program compilation and kernel execution are on a per device basis, and for these a specific device handle is required. So how do we obtain a handle for a device? We simply query a context for it. OpenCL™ provides the ability to queue information about particular objects, and using the C++ API it comes in the form of object.getInfo<CL_OBJECT_QUERY>(). In the specific case of getting the device from a context:

  1. cl::vector<cl::Device> devices;
  2. devices = context.getInfo<CL_CONTEXT_DEVICES>();
  3. checkErr(
  4. devices.size() > 0 ? CL_SUCCESS : -1, "devices.size() > 0");

Now that we have the list of associated devices for a context, in this case a single CPU device, we need to load and build the compute program (the program we intend to run on the device, or more generally: devices). The first few lines of the following code simply load the OpenCL™ device program from disk, convert it to a string, and create a cl::Program::Sources object using the helper constructor. Given an object of type cl::Program::Sources a cl::Program, an object is created and associated with a context, then built for a particular set of devices.

  1. std::ifstream file("lesson1_kernels.cl");
  2. checkErr(file.is_open() ? CL_SUCCESS:-1, "lesson1_kernel.cl");std::string prog(
  3. std::istreambuf_iterator<char>(file),
  4. (std::istreambuf_iterator<char>()));cl::Program::Sources source(1,
  5. std::make_pair(prog.c_str(), prog.length()+1));cl::Program program(context, source);
  6. err = program.build(devices,"");
  7. checkErr(err, "Program::build()");

A given program can have many entry points, called kernels, and to call one we must build a kernel object. There is assumed to exist a straightforward mapping from kernel names, represented as strings, to a function defined with the __kernel attribute in the compute program. In this case we can build a cl::kernel object, kernel. Kernel arguments are set using the C++ API with kernel.setArg(), which takes the index and value for the particular argument.

  1. cl::Kernel kernel(program, "hello", &err);
  2. checkErr(err, "Kernel::Kernel()");err = kernel.setArg(0, outCL);
  3. checkErr(err, "Kernel::setArg()");

Now that the boiler plate code is done, it is time to compute the result (the output buffer with the string “Hello World\n”). All device computations are done using a command queue, which is a virtual interface for the device in question. Each command queue has a one-to-one mapping with a given device; it is created with the associated context using a call to the constructor for the class cl::CommandQueue. Given a cl::CommandQueue queue,kernels can be queued usingqueue.enqueuNDRangeKernel. This queues a kernel for execution on the associated device. The kernel can be executed on a 1D, 2D, or 3D domain of indexes that execute in parallel, given enough resources. The total number of elements (indexes) in the launch domain is called the global work size; individual elements are known as work-itemsWork-items can be grouped into work-groups when communication between work-items is required. Work-groups are defined with a sub-index function (called the local work size), describing the size in each dimension corresponding to the dimensions specified for the global launch domain. There is a lot to consider with respect to kernel launches, and we will cover this in more detail in future tutorials. For now, it is enough to note that for Hello World, each work-item computes a letter in the resulting string; and it is enough to launch hw.length()+1, where hw is the const std::string we defined at the beginning of the program. We need the extra work-item to account for the NULL terminator.

  1. cl::CommandQueue queue(context, devices[0], 0, &err);
  2. checkErr(err, "CommandQueue::CommandQueue()");cl::Event event;
  3. err = queue.enqueueNDRangeKernel(
  4. kernel,
  5. cl::NullRange,
  6. cl::NDRange(hw.length()+1),
  7. cl::NDRange(1, 1),
  8. NULL,
  9. &event);
  10. checkErr(err, "ComamndQueue::enqueueNDRangeKernel()");

The final argument to the enqueueNDRangeKernel call above was a cl::Event object, which can be used to query the status of the command with which it is associated, (for example, it has completed). It supports the method wait() that blocks until the command has completed. This is required to ensure the kernel has finished execution before reading the result back into host memory with queue.enqueueReadBuffer(). With the compute result back in host memory, it is simply a matter of outputting the result to stdout and exiting the program.

  1. event.wait();
  2. err = queue.enqueueReadBuffer(
  3. outCL,
  4. CL_TRUE,
  5. 0,
  6. hw.length()+1,
  7. outH);
  8. checkErr(err, "ComamndQueue::enqueueReadBuffer()");
  9. std::cout << outH;
  10. return EXIT_SUCCESS;
  11. }

Finally, to make the program complete an implementation for the device program (lesson1_kernels.cl), requires defining the external entry point, hello. The kernel implementation is straightforward: it calculates a unique index as a function of the launch domain using get_global_id(), it uses it as an index into the string, hw, then writes its value to the output array, out.

  1. #pragma OPENCL EXTENSION cl_khr_byte_addressable_store : enable
  2. __constant char hw[] = "Hello World\n";
  3. __kernel void hello(__global char * out)
  4. {
  5. size_t tid = get_global_id(0);
  6. out[tid] = hw[tid];
  7. }

For robustness, it would make sense to check that the thread id (tid) is not out of range of the hw; for now, we assume that the corresponding call toqueue.enqueueNDRangeKernel() is correct.

Building and running

On Linux, it should be enough to use a single command to build the OpenCL™ program; for example:
gcc –o hello_world –Ipath-OpenCL-include –Lpath-OpenCL-libdir lesson1.cpp –lOpenCL

To run:
LD_LIBRARY_PATH=path-OpenCL-libdir ./hello_world

On Windows, with a Visual Studio command window, an example is:
cl /Fehello_world.exe /Ipath-OpenCL-include lesson.cpp path-OpenCL-libdir/OpenCL.lib

Let’s assume that OpenCL.dll is on the path, then, running
.\hello_world

outputs the following string pm stdout:
Hello World

This completes our introductory tutorial to OpenCL™. Your feedback, comments, and questions are requested. Please visit our  OpenCL™ forum.

Useful Links

The following list provides links to some specific programming bindings, other than C, for OpenCL™. I have not tested these and cannot vouch for their correctness, but hope they will be useful:

  • OpenCL™ specification and headers:
    http://www.khronos.org/registry/cl/
  • OpenCL™ technical forum:
    http://www.khronos.org/message_boards/viewforum.php?f=28
  • The C++ bindings used in this tutorial can be found on the OpenCL™ web page at Khronos, along with complete documentation:
    http://www.khronos.org/registry/cl/
  • Python bindings can be found here:
    http://wiki.tiker.net/PyOpenCL
  • C# bindings can be found here:
    http://www.khronos.org/message_boards/viewtopic.php?f=28&t=1932

 

OpenCL and the OpenCL logo are trademarks of Apple Inc. used by permission by Khronos.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/448870.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

雷林鹏分享:codeigniter框架文件上传处理

CodeIgniter 框架input表单的重新填充&#xff0c;主要是针对text、radio、checkbox、select等input表单&#xff0c;那么对于文件上传表单file该如何处理呢? 自己的处理方式&#xff1a; //设置文件上传属性 $webroot $_SERVER[DOCUMENT_ROOT]; $time time(); $year date(…

jQuery基本使用

一.what 1&#xff09;.一个优秀的JS函数库&#xff1b; 2&#xff09;.中大型WEB项目开发的首选&#xff1b; 3&#xff09;.使用了jQuery的网站超过90%&#xff1b; 4&#xff09;.http://jquery.com/; 二.why(即jq的好处) html元素选取&#xff08;选择器&#xff09;&#…

解决:-bash: telnet: command not found

前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。点击跳转到教程。 报错如题 -bash: telnet: command not found只是因为没有安装这个命令&#xff0c;不识别。 安装命令&#xff1a; yum install telne…

钱荒下银行理财收益率角逐:邮储银行垫底

21世纪资管研究员松壑 由于银行理财的收益定价机制为设定预期收益率的“先行定价”&#xff0c;而银行对产品本金收益又保有或明或暗的兑付要求&#xff0c;其业绩往往在理财产品发行前就已决定。 因此&#xff0c;本次榜单根据已披露最高预期收益率&#xff08;下称收益率&a…

数据结构7.3_图的遍历

我们希望从图中某一顶点出发访遍图中其余顶点&#xff0c;且使每一个顶点仅被访问一次。 这一过程就叫做图的遍历。 图的遍历算法是求解图的连通性问题、拓扑排序和求关键路径等算法的基础。 然而&#xff0c;图的遍历要比树的遍历复杂得多。 因为图的任一顶点都可能和其余的顶…

CentOS7 使用 firewalld 打开关闭防火墙与端口

前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。点击跳转到教程。 1、firewalld的基本使用 启动&#xff1a; systemctl start firewalld关闭&#xff1a; systemctl stop firewalld查看状态&#xff1a…

HCL实验四

PC端配置&#xff1a;配置ip地址 配置网关 交换机配置&#xff1a;①创建VLAN system-view vlan 10 vlan 20 ②配置PC端接口 interface vlan-interface 10 ip add 192.168.10.254 24 interface vlan-interface 20 ip add 192.168.20.254 24 转载于:https://www.cnblogs.com/zy5…

程序员/设计师能用上的 75 份速查表

本文由 伯乐在线 - 黄利民 翻译自 designzum。欢迎加入 技术翻译小组。转载请参见文章末尾处的要求。75 份速查表&#xff0c;由 vikas 收集整理&#xff0c;包括&#xff1a;jQuery、HTML、HTML5、CSS、CSS3、JavaScript、Photoshop 、git、Linux、Java、Perl、PHP、Python、…

移动端真机测试怎么做

准备工作&#xff1a; 1、必须安装了node 环境和npm&#xff1b; 2、手机和电脑在同一个热点或者wifi下&#xff1b; 3、知道你的IP地址&#xff1b; 步骤一、 启动cmd&#xff0c;进入项目根目录&#xff0c;使用指令&#xff1a;npm i -g live-server 进行全局安装 步骤二、 …

Linux 下清空或删除大文件内容的 5 种方法

前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。点击跳转到教程。 下面的这些方法都是从命令行中达到清空文件的目的。 使用名为 access.log 的文件作为示例样本。 1. 通过重定向到 Null 来清空文件内容…

管理飞扬跋扈的技术部

摘要&#xff1a;有的管理人员认为最头疼的就是技术部的管理。因为技术工作看起来棘手&#xff0c;管理人员不能轻易了解技术工作的内涵&#xff0c;技术人员也觉得很难和管理人员沟通。要管理好技术人员&#xff0c;就一定要懂技术&#xff0c;这是其他管理方法都无法替代的。…

rocketmq 解决:There is insufficient memory for the Java Runtime Environment to continue

前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。点击跳转到教程。 1.场景描述 linux 安装 rocketmq 启动 mqnameserver、mqbroker 以及运行测试类生产者时报错。 运行命令为&#xff1a; nohup sh bin…

GWAS: 网页版的基因型填充(genotype imputation)

在全基因组关联分析中&#xff0c;处理芯片数据时&#xff0c;必须走的一个流程就是基因型数据填充&#xff08;imputation&#xff09;。 当然&#xff0c;如果你拿到的是全测序的数据&#xff0c;请忽略这一步。 下面直奔主题&#xff0c;怎么在网页版进行基因型填充。 1 进入…

【案例】图片无缝轮播效果

知识点&#xff1a; 1、scrollLeft属性 2、克隆节点 3、定时器 4、鼠标移入移除事件 <!DOCTYPE html> <html lang"en"> <head> <meta charset"UTF-8"> <title>无缝轮播</title> <style> *{ margin: 0; padding:…

腾讯CKV海量分布式存储系统

摘要&#xff1a;腾讯CKV&#xff0c;是腾讯自主研发的高性能、低延时、持久化、分布式KV存储服务。在腾讯的微信平台、开放平台、腾讯云、腾讯游戏和电商平台广泛使用&#xff0c;日访问量超过万亿次。本文将全面剖析CKV的实现原理和技术挑战。 与Memcached和Redis等开源NoSQ…

Apache RocketMQ 安装、测试、报错解决

1. 准备 前些天发现了一个巨牛的人工智能学习网站&#xff0c;通俗易懂&#xff0c;风趣幽默&#xff0c;忍不住分享一下给大家。点击跳转到教程。 64bit OS, Linux/Unix/Mac 64bit JDK 1.8; Maven 3.2.x 2.下载和构建 下载 4.2.0 源代码版本地址&#xff1a;http://mirro…

编程之法:面试和算法心得

《编程之法&#xff1a;面试和算法心得》高清中文版PDF 含书目录 下载地址&#xff1a; 链接&#xff1a;https://pan.baidu.com/s/1Kcd2bRsIfhagKZR6NaOgXg 提取码&#xff1a;054s 《编程之法&#xff1a;面试和算法心得》高清中文版PDF高清中文版PDF 含书目录&#xff0c;36…

localStorage存、取数组

localStorage存储数组时需要先使用JSON.stringify()转成字符串&#xff0c;取的时候再字符串转数组JSON.parse()。 var arr[1,2,3,4];localStorage.setItem(key,arr);console.log(localStorage(key); //打印出字符串&#xff1a;1,2,3,4 正常存储&#xff1a;localStorage.setI…

10岁起编程,并不认为自己是“黑客”

摘要&#xff1a;一直以来&#xff0c;女性在“黑客”群体中缺乏代表性&#xff0c;但这不是因为她们缺乏兴趣。麻省理工学院的Liz Denys从十岁开始接触编程&#xff0c;但由于被忽视以及性别歧视问题&#xff0c;她和许多女性一样&#xff0c;游走在“黑客”圈子之外。 我10岁…

Redis原理及拓展

Redis是单线程程序。单线程的Redis为何还能这么快&#xff1f; 1、所有的数据都在内存中&#xff0c;所有的运算都是内存级别的运算&#xff08;因此时间复杂度为O(n)的指令要谨慎使用&#xff09; 2、单线程操作&#xff0c;避免了频繁的上下文切换 3、多路复用&#xff08;非…