GPU programming is usually a three-step process:
- Copy (transfer input data from host to device)
- Compute in parallel on the GPU device
- Copy (transfer results back from device to host)
When the steps run one after another, total runtime is the sum of the three.
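The three steps above can be sketched with the CUDA runtime API. This is a minimal illustration; the `scale` kernel and the problem size are made up for the example, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <stdlib.h>

// Hypothetical kernel for illustration: doubles each element.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *h = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, bytes);

    // Step 1: copy host -> device
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    // Step 2: compute in parallel on the device
    scale<<<(n + 255) / 256, 256>>>(d, n);
    // Step 3: copy device -> host
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d);
    free(h);
    return 0;
}
```

Because `cudaMemcpy` is synchronous and the default stream serializes work, the three steps here cannot overlap; their durations simply add up.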
- (CUDA streams) overlap memory transfers with compute
With overlap, total runtime is no longer than the non-overlapped version, and often substantially shorter.
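A common way to get this overlap is to split the data into chunks and give each chunk its own stream, so one chunk's transfer can run while another's kernel executes. A minimal sketch, assuming the same hypothetical `scale` kernel as above and that `n` divides evenly by the stream count; async copies require pinned host memory (`cudaMallocHost`), and error checking is again omitted.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: doubles each element.
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main(void) {
    const int n = 1 << 20;
    const int nstreams = 4;
    const int chunk = n / nstreams;  // assumes n % nstreams == 0
    size_t bytes = n * sizeof(float);

    float *h, *d;
    cudaMallocHost(&h, bytes);  // pinned host memory, needed for true async copies
    cudaMalloc(&d, bytes);
    for (int i = 0; i < n; i++) h[i] = (float)i;

    cudaStream_t s[nstreams];
    for (int i = 0; i < nstreams; i++) cudaStreamCreate(&s[i]);

    for (int i = 0; i < nstreams; i++) {
        int off = i * chunk;
        size_t cb = chunk * sizeof(float);
        // Within a stream, copy -> compute -> copy stay ordered;
        // across streams, these operations can overlap.
        cudaMemcpyAsync(d + off, h + off, cb, cudaMemcpyHostToDevice, s[i]);
        scale<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, cb, cudaMemcpyDeviceToHost, s[i]);
    }
    cudaDeviceSynchronize();  // wait for all streams to finish

    for (int i = 0; i < nstreams; i++) cudaStreamDestroy(s[i]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

In the best case, only the first chunk's upload and the last chunk's download are fully exposed; the remaining transfers hide behind kernel execution in other streams.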