Compute Unified Device Architecture (CUDA)
The CUDA API lets you use most C++ features in device code, whereas OpenCL (which is meant to be cross-platform) is more restrictive about the language subset you can write kernels in.
CUDA supports both task parallelism and data parallelism. Data parallelism is the central feature here: we want to evaluate a function (a kernel) over a set of points, with one thread per point.
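A minimal sketch of this data-parallel pattern: one thread per point, each evaluating the same function. The names (`square`, `n`) and the squaring function are illustrative, not from the notes.

```cuda
// Evaluate f(x) = x*x over n points, one CUDA thread per point.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void square(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        out[i] = in[i] * in[i];
}

int main() {
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    int block = 256;                      // threads per block
    int grid  = (n + block - 1) / block;  // round up to cover all n points
    square<<<grid, block>>>(d_in, d_out, n);

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("%f\n", h_out[3]);  // squares of the inputs come back on the host
    cudaFree(d_in);
    cudaFree(d_out);
}
```

Each thread computes its own global index from its block and thread coordinates; the bounds check is needed because the grid is rounded up to a whole number of blocks.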
A thread (what OpenCL calls a work-item) is the fundamental unit of work in CUDA. Threads live on an N-dimensional grid, where N is up to 3. Threads are grouped into blocks; threads within a block may share memory, though they execute independently. You choose the block size yourself when launching a kernel (the CUDA runtime can suggest one via its occupancy API). To make the best use of the hardware, pick a multiple of the warp size (the hardware's unit of execution, 32 threads on NVIDIA GPUs).
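A sketch of choosing launch dimensions, both by hand and via the runtime's occupancy helper. The kernel here is a placeholder; the point is the configuration, and the specific numbers (256 threads, 2^20 items) are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: does nothing unless given a real buffer.
__global__ void noop(float *p) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (p) p[i] = 0.f;
}

int main() {
    const int n = 1 << 20;  // total work items (CUDA threads) we need

    // Hand-picked: 256 threads per block, a multiple of the 32-thread warp,
    // so no warp in a block is left partially filled.
    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);  // enough blocks to cover n

    // Or ask the runtime for a block size that maximizes occupancy:
    int minGrid = 0, suggested = 0;
    cudaOccupancyMaxPotentialBlockSize(&minGrid, &suggested, noop, 0, 0);
    printf("suggested block size: %d\n", suggested);

    noop<<<grid, block>>>(nullptr);
    cudaDeviceSynchronize();
}
```

`cudaOccupancyMaxPotentialBlockSize` is how you "let the system decide" in CUDA: unlike OpenCL, the launch itself always requires an explicit block size.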