* Refactored extract_image_patches op helpers.
* Eliminated compiler errors in helper implementation.
* Finished implementation of both cpu and cuda helpers for extract_image_patches.
* Improved cpu implementation.
* Improved cuda implementation for extract_image_patches helper.
* Added omp to ClipByGlobalNorm helpers implementation.
* Added implementation for thresholdedrelu_bp op.
* Fixed cuda kernel with F order.
* Fixed tests for subarray.
* Refactored tests for Gaussian_3 and Truncated_22.
* Added tests for GaussianDistribution with native ops.
* Modified tests for Gaussian distribution.
* Fixed random tests.
* Fixed atomicMin/atomicMax for 64bit cases.
* Fixed execReduce3TAD tests.
* Eliminated unneeded comments.
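
For reviewers unfamiliar with the thresholded ReLU backprop op listed above, here is a minimal CPU sketch of the usual gradient rule: the upstream gradient passes through where the input exceeds the threshold and is zero elsewhere. The function name `thresholdedReluBp`, its signature, and the use of `std::vector<float>` are assumptions for illustration only; they are not the helper's actual interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical standalone helper (not the library's signature).
// Forward pass:  y = x if x > theta, else 0.
// Backward pass: dL/dx = dL/dy if x > theta, else 0.
std::vector<float> thresholdedReluBp(const std::vector<float>& input,
                                     const std::vector<float>& gradOut,
                                     float theta) {
    // Input and upstream gradient must have the same length.
    assert(input.size() == gradOut.size());

    std::vector<float> gradIn(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        // Strict comparison: elements equal to theta get zero gradient.
        gradIn[i] = input[i] > theta ? gradOut[i] : 0.0f;
    }
    return gradIn;
}
```

The comparison is strict here, so an input exactly equal to the threshold receives zero gradient; whether the op in this change uses the same boundary convention is not stated in these notes.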