C# - OpenCL kernel buffers leaking after 12k float elements
I've written a dot-product kernel in OpenCL with C++. It works for a vector length of 4096 (I also tried 12k elements, and it works flawlessly). When I increase the vector length to 16k elements, the result becomes infinity, although it should not exceed a small float value. Is there a leak or something similar? It works fine for n < 16k elements. 16k elements at 4 bytes each is 64 KB, so the three buffers total 192 KB, not even 1/1000th of the GPU's memory. I compared the result with the same reduction algorithm in host code (C#), and the host result is small, as expected. Precision errors cannot build up to infinity either (they would be capped at a finite value).
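The memory arithmetic above, written out under the assumption that "16k" means 16 384 (that is, 2^14) elements of 4-byte `float`:

$$
16\,384 \times 4\ \text{B} = 65\,536\ \text{B} = 64\ \text{KiB}, \qquad 3 \times 64\ \text{KiB} = 192\ \text{KiB}
$$

So the buffer sizes alone are far too small to explain the failure.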
Here is the kernel (ln = local work size, n = global work size), passed from C# to C++ through a DLL call:
```csharp
"__kernel void skalarcarpim(__global float * v1, __global float * v2, __global float * v3)" +
"{" +
"    int i = get_global_id(0);" +
"    int j = get_local_id(0);" +
"    __local float biriktirici[" + ln.ToString() + "];" +
"    barrier(CLK_LOCAL_MEM_FENCE);" +
"    biriktirici[j] = v1[i] * v2[i];" +
"    barrier(CLK_LOCAL_MEM_FENCE);" +
"    barrier(CLK_GLOBAL_MEM_FENCE);" +
"    float toplam = 0.0f;" +
"    if (j == 0)" +
"    {" +
"        for (int k = 0; k < " + ln.ToString() + "; k++)" + // reduction
"        {" +
"            toplam += biriktirici[k];" +
"        }" +
"    }" +
"    barrier(CLK_GLOBAL_MEM_FENCE);" +
"    v3[i] = toplam;" +
"    barrier(CLK_GLOBAL_MEM_FENCE);" +
"    toplam = 0.0f;" +
"    for (int k = 0; k < " + (n / ln).ToString() + "; k++)" +
"    {" +
"        toplam += v3[k * " + ln.ToString() + "];" + // sum of temporary sums
"    }" +
"    v3[i] = toplam;" +
"}";
```

Here are the C++ OpenCL buffers:
```cpp
buf1 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
buf2 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
buf3 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
// CL_MEM_READ_ONLY gives the same error; I tried the other flags too, no solution :(
```

Here is how the buffers are sent:
```cpp
komutsirasi.enqueueWriteBuffer(buf1, CL_TRUE, 0, sizeof(cl_float) * n, v1);
komutsirasi.enqueueWriteBuffer(buf2, CL_TRUE, 0, sizeof(cl_float) * n, v2);
// CL_TRUE makes the call blocking: it waits until the transfer has finished
```

Execution:
```cpp
komutsirasi.enqueueNDRangeKernel(kernel, 0, global, local);
// I took this from an example; I don't know whether it is blocking or not
```

Here is how the result buffer is read (every element holds the result; I know it is unfinished):
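On the blocking question: `enqueueNDRangeKernel` only enqueues the kernel and returns without waiting for it. Assuming the command queue was created without the out-of-order flag (the default), the blocking `enqueueReadBuffer` that follows cannot complete before the kernel has finished, so the read order is not the problem. The launch sizes do matter, though: in OpenCL 1.2 the global work size must be a multiple of the local work size, and a two-pass reduction needs one partial sum per work-group. The helper below is a hypothetical sketch (`LaunchConfig` and `make_launch_config` are illustrative names, not part of the OpenCL API) that rounds `n` up to a multiple of `ln`; padded work-items would then have to contribute 0 in the kernel, for example behind an `i < n` check.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the OpenCL API: launch sizes for a
// two-pass reduction. OpenCL 1.2 rejects a global size that is not a
// multiple of the local size, so n is padded up to the next multiple of ln.
struct LaunchConfig {
    std::size_t global;         // padded global work size
    std::size_t local;          // work-group size (ln)
    std::size_t groups;         // number of partial sums written by pass 1
    std::size_t partial_bytes;  // size of the buffer holding those sums
};

LaunchConfig make_launch_config(std::size_t n, std::size_t ln) {
    assert(n > 0 && ln > 0);
    const std::size_t groups = (n + ln - 1) / ln;  // round up
    return {groups * ln, ln, groups, groups * sizeof(float)};
}
```

For the 16k case with an assumed `ln` of 256, this gives a global size of 16 384, 64 work-groups and a 256-byte buffer of partial sums.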
```cpp
komutsirasi.enqueueReadBuffer(buf3, CL_TRUE, 0, sizeof(cl_float) * n, v3);
// CL_TRUE makes the call blocking: it waits until the transfer has finished
```

Question: is there some configuration I must do before diving into C++ OpenCL? This was not an issue in Java/Aparapi/JOCL.
I'm using the OpenCL 1.2 headers from the Khronos site, and AMD's OpenCL.lib + OpenCL.dll, if that helps (the target device is an HD 7870).
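Your second reduction, the sum of `v3[k*ln]`, assumes that all the values in `v3` have already been computed by the other work-groups. That would require synchronization between different work-groups, which is not possible in the general case: `barrier()`, even with `CLK_GLOBAL_MEM_FENCE`, only synchronizes the work-items of a single work-group. It may happen to work by accident when there is only a single work-group.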
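To make this concrete, here is a hypothetical CPU emulation of the kernel as posted, written in plain C++ rather than OpenCL. It assumes the work-groups execute one after another (a legal schedule, since nothing orders one group against another) and, as an assumption, models the never-initialized `buf3` as +infinity; real hardware may interleave groups differently and the buffer may hold other garbage.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical CPU emulation of the question's kernel (not OpenCL code).
// Work-groups run one after another, which is a schedule the GPU is allowed
// to choose: barrier() only orders work-items inside one work-group.
std::vector<float> emulate_original(const std::vector<float>& v1,
                                    const std::vector<float>& v2,
                                    std::size_t ln) {
    const std::size_t n = v1.size();
    assert(ln > 0 && v2.size() == n && n % ln == 0);
    const std::size_t groups = n / ln;
    // Assumption: buf3 is never written by the host, so its contents are
    // unspecified; +infinity stands in for that garbage here.
    std::vector<float> v3(n, std::numeric_limits<float>::infinity());
    std::vector<float> biriktirici(ln);  // stands in for __local memory
    std::vector<float> toplam(ln);       // one private toplam per work-item
    for (std::size_t g = 0; g < groups; ++g) {
        // First reduction: products into local memory, work-item 0 sums
        // them, every other work-item writes 0 to its slot of v3.
        for (std::size_t j = 0; j < ln; ++j)
            biriktirici[j] = v1[g * ln + j] * v2[g * ln + j];
        for (std::size_t j = 0; j < ln; ++j) {
            float s = 0.0f;
            if (j == 0)
                for (std::size_t k = 0; k < ln; ++k) s += biriktirici[k];
            v3[g * ln + j] = s;
        }
        // Second reduction: each work-item reads v3[k * ln] for ALL groups,
        // including groups that have not run yet (all reads before writes).
        for (std::size_t j = 0; j < ln; ++j) {
            toplam[j] = 0.0f;
            for (std::size_t k = 0; k < groups; ++k) toplam[j] += v3[k * ln];
        }
        for (std::size_t j = 0; j < ln; ++j) v3[g * ln + j] = toplam[j];
    }
    return v3;
}
```

With a single work-group (`n == ln`) every element comes out as the dot product. With several groups, group 0 already reads `v3[ln]`, `v3[2*ln]` and so on before those groups have written them, and the garbage then reaches every later group through `v3[0]`.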
After the first reduction, you should store `toplam` in `v3[get_group_id(0)]`, and run a second kernel for the second reduction.
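A minimal sketch of that two-pass scheme, again emulated on the CPU in C++ rather than written as OpenCL C; the function names `pass1_partial_sums` and `pass2_total` are hypothetical. Pass 1 mirrors what the first kernel would do per work-group, and pass 2 only starts after pass 1 has returned, which is the ordering a separate kernel launch (or a blocking read followed by a host-side sum) provides on a real device.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU emulation of the two-kernel scheme (not OpenCL code).
// Pass 1 ("kernel 1"): each work-group reduces ln products in local memory,
// and work-item 0 stores the group's sum at index get_group_id(0).
std::vector<float> pass1_partial_sums(const std::vector<float>& v1,
                                      const std::vector<float>& v2,
                                      std::size_t ln) {
    assert(ln > 0 && v1.size() == v2.size() && v1.size() % ln == 0);
    const std::size_t groups = v1.size() / ln;
    std::vector<float> partial(groups);  // plays the role of v3[get_group_id(0)]
    std::vector<float> biriktirici(ln);  // stands in for __local memory
    for (std::size_t g = 0; g < groups; ++g) {       // get_group_id(0)
        for (std::size_t j = 0; j < ln; ++j)          // get_local_id(0)
            biriktirici[j] = v1[g * ln + j] * v2[g * ln + j];
        // A barrier(CLK_LOCAL_MEM_FENCE) belongs here in the real kernel.
        float toplam = 0.0f;
        for (std::size_t k = 0; k < ln; ++k) toplam += biriktirici[k];
        partial[g] = toplam;
    }
    return partial;
}

// Pass 2 ("kernel 2", or the host): runs only after pass 1 has finished,
// so every partial sum is guaranteed to be written before it is read.
float pass2_total(const std::vector<float>& partial) {
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

With 16 384 elements and an assumed `ln` of 256, pass 1 produces 64 partial sums; for so few values, summing them on the host after reading the partial-sum buffer back is a reasonable alternative to a second kernel.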