C# - OpenCL kernel buffers leaking after 12k float elements
I have written a dot-product kernel for OpenCL in C++. It works for a vector length of 4096 (I also tried 12k elements, which works flawlessly). When I increase the vector length to 16k elements, the result becomes infinity, while it should not go beyond a small float number. I suspect a leak or something similar, because it works fine for n < 16k elements. 16k elements at 4 bytes each makes 64 KB, and the 3 buffers sum to 192 KB, which is not even 1/1000th of the GPU's memory. I compared the result with the same reduction algorithm in host code (C#), and the host result is small, as expected. There are no precision errors that would build up to infinity either (it may be capped at a finite value).
Here is the kernel (ln = local work size, n = global work size), built in C# and passed to C++ through a DLL call:
    "__kernel void skalarcarpim(__global float * v1, __global float * v2, __global float * v3)" +
    "{" +
    "    int i = get_global_id(0);" +
    "    int j = get_local_id(0);" +
    "    __local float biriktirici [" + ln.ToString() + "];" +
    "    barrier(CLK_LOCAL_MEM_FENCE);" +
    "    biriktirici[j]=v1[i]*v2[i];" +
    "    barrier(CLK_LOCAL_MEM_FENCE);" +
    "    barrier(CLK_GLOBAL_MEM_FENCE);" +
    "    float toplam=0.0f;" +
    "    if(j==0)" +
    "    {" +
    "        for(int k=0;k<" + ln.ToString() + ";k++)" + // reduction
    "        {" +
    "            toplam+=biriktirici[k];" +
    "        }" +
    "    }" +
    "    barrier(CLK_GLOBAL_MEM_FENCE);" +
    "    v3[i]=toplam;" +
    "    barrier(CLK_GLOBAL_MEM_FENCE);" +
    "    toplam=0.0f;" +
    "    for(int k=0;k<" + (n/ln).ToString() + ";k++)" +
    "    {" +
    "        toplam+=v3[k*" + ln.ToString() + "]; " + // sum of temporary sums
    "    }" +
    "    v3[i]=toplam;" +
    "}";
Here are the C++ OpenCL buffers:
    buf1 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
    buf2 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
    buf3 = cl::Buffer(altyapi, CL_MEM_READ_WRITE, sizeof(cl_float) * n);
    // CL_MEM_READ_ONLY gives the same error; I tried the other flags too, no solution :(
Here is how the buffers are sent:
    komutsirasi.enqueueWriteBuffer(buf1, CL_TRUE, 0, sizeof(cl_float) * n, v1);
    komutsirasi.enqueueWriteBuffer(buf2, CL_TRUE, 0, sizeof(cl_float) * n, v2);
    // CL_TRUE makes the call blocking: it waits until the write has finished
Execution:
    komutsirasi.enqueueNDRangeKernel(kernel, 0, global, local);
    // I got this from an example; I don't know whether it is blocking or not.
Here is how the result buffer is read back (every element holds the result; I know this is unfinished):
    komutsirasi.enqueueReadBuffer(buf3, CL_TRUE, 0, sizeof(cl_float) * n, v3);
    // CL_TRUE makes the call blocking: it waits until the read has finished
Question: is there any configuration I must do before diving into C++ OpenCL? I did not have this issue in Java with Aparapi/JOCL.
I am using the OpenCL 1.2 headers from Khronos' site and AMD's OpenCL.lib + OpenCL.dll, if that helps (the target device is an HD 7870).
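Your second reduction, the sum of v3[k*ln], assumes that all of those values in v3 have already been computed. That would require synchronization between different work-groups, which is not possible in the general case: barrier() only synchronizes the work-items within one work-group. It may accidentally work when there is a single work-group, or when all work-groups happen to finish their first reduction in time.
After the first reduction, you should store toplam in v3[get_group_id(0)], and then run a second kernel to perform the second reduction.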
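The two-kernel approach described above could look roughly like the following sketch. The kernel names partial_dot and final_sum, the partial and result buffers, the scratch argument (a __local buffer whose size the host would set to ln floats when binding the kernel argument) and the host helper dot_two_pass are all assumptions made for illustration; none of them come from the original code. The sketch also assumes n is a multiple of ln and that both kernels are enqueued on the same in-order command queue, so the second one starts only after the first has finished. The host function just mirrors the two passes on the CPU so the GPU result can be checked against it.

```cpp
#include <cstddef>
#include <vector>

// Kernel 1 (OpenCL C, hypothetical name): each work-group writes ONE partial
// sum to partial[get_group_id(0)]. Only a local barrier is needed, because no
// work-group reads another group's data.
static const char* kPartialDotSrc = R"CLC(
__kernel void partial_dot(__global const float* v1,
                          __global const float* v2,
                          __global float* partial,
                          __local float* scratch)
{
    int i = get_global_id(0);
    int j = get_local_id(0);
    scratch[j] = v1[i] * v2[i];
    barrier(CLK_LOCAL_MEM_FENCE);      /* synchronizes this work-group only */
    if (j == 0) {
        float sum = 0.0f;
        for (int k = 0; k < (int)get_local_size(0); k++)
            sum += scratch[k];
        partial[get_group_id(0)] = sum;
    }
}
)CLC";

// Kernel 2 (hypothetical name): enqueued after kernel 1, so every partial sum
// is already in global memory. A single work-item adds them up.
static const char* kFinalSumSrc = R"CLC(
__kernel void final_sum(__global const float* partial,
                        __global float* result,
                        int groups)
{
    if (get_global_id(0) == 0) {
        float sum = 0.0f;
        for (int k = 0; k < groups; k++)
            sum += partial[k];
        result[0] = sum;
    }
}
)CLC";

// Host-side reference with the same two-pass structure (assumes the vector
// length is a multiple of ln), useful for validating the device result.
float dot_two_pass(const std::vector<float>& a,
                   const std::vector<float>& b,
                   std::size_t ln)
{
    const std::size_t groups = a.size() / ln;
    std::vector<float> partial(groups, 0.0f);
    for (std::size_t g = 0; g < groups; ++g)       // pass 1: one sum per group
        for (std::size_t j = 0; j < ln; ++j)
            partial[g] += a[g * ln + j] * b[g * ln + j];
    float total = 0.0f;
    for (float p : partial)                        // pass 2: sum of partial sums
        total += p;
    return total;
}
```

With this layout the host only needs to read back a single float from result instead of n floats, and the partial buffer needs just n/ln elements.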