raver119 3c4e959e21 [WIP] More of CUDA (#95)
* initial commit

Signed-off-by: raver119 <raver119@gmail.com>

* Implementation of hashcode cuda helper. Working edition.

* Fixed parallel test input arrangements.

* Fixed tests for hashcode op.

* Fixed shape calculation for image:crop_and_resize op and test.

* NativeOps tests. Initial test suite.

* Added tests for indexReduce methods.

* Added test on execBroadcast with NDArray as dimensions.

* Added test on execBroadcastBool with NDArray as dimensions.

* Added tests on execPairwiseTransform and execPairwiseTransformBool.

* Added tests for execReduce with scalar results.

* Added reduce tests for non-empty dims array.

* Added tests for reduce3.

* Added tests for execScalar.

* Added tests for execSummaryStats.

* - provide cpu/cuda code for batch_to_space
- testing it

Signed-off-by: Yurii <yurii@skymind.io>

* - remove old test for batch_to_space (had wrong format and numbers were not checked)

Signed-off-by: Yurii <yurii@skymind.io>

* Fixed compilation errors with test.

* Added test for execTransformFloat.

* Added test for execTransformSame.

* Added test for execTransformBool.

* Added test for execTransformStrict.

* Added tests for execScalar/execScalarBool with TADs.

* Added test for flatten.

* - provide cpu/cuda code for space_to_batch operation

Signed-off-by: Yurii <yurii@skymind.io>

* Added test for concat.

* comment out unnecessary stuff in s_t_b

Signed-off-by: Yurii <yurii@skymind.io>

* Added test for specialConcat.

* Added tests for memcpy/set routines.

* Fixed pullRow cuda test.

* Added pullRow test.

* Added average test.

* - correct typo in NDArray::applyPairwiseTransform(nd4j::pairwise::BoolOps op...)

Signed-off-by: Yurii <yurii@skymind.io>

* - debugging and fixing cuda tests in JavaInteropTests file

Signed-off-by: Yurii <yurii@skymind.io>

* - correct some tests

Signed-off-by: Yurii <yurii@skymind.io>

* Added test for shuffle.

* Fixed ops declarations.

* Restored omp and added shuffle test.

* Added convertTypes test.

* Added tests for execRandom. Eliminated usage of RandomBuffer with NativeOps.

* Added sort tests.

* Added tests for execCustomOp.

* - further debugging and fixing of tests that terminated with a crash

Signed-off-by: Yurii <yurii@skymind.io>

* Added tests for calculateOutputShapes.

* Added Benchmarks test.

* Commented benchmark tests.

* change assertion

Signed-off-by: raver119 <raver119@gmail.com>

* Added tests for apply_sgd op. Added cpu helper for that op.

* Implemented cuda helper for apply_sgd op. Fixed tests for NativeOps.

* Added test for assign broadcastable.

* Added tests for assign_bp op.

* Added tests for axpy op.

* - assign/execScalar/execTransformAny signature change
- minor test fix

Signed-off-by: raver119 <raver119@gmail.com>

* Fixed axpy op.

* meh

Signed-off-by: raver119 <raver119@gmail.com>

* - fix tests for nativeOps::concat

Signed-off-by: Yurii <yurii@skymind.io>

* sequential transform/scalar

Signed-off-by: raver119 <raver119@gmail.com>

* allow nested parallelism

Signed-off-by: raver119 <raver119@gmail.com>

* assign_bp leak fix

Signed-off-by: raver119 <raver119@gmail.com>

* block setRNG fix

Signed-off-by: raver119 <raver119@gmail.com>

* enable parallelism by default

Signed-off-by: raver119 <raver119@gmail.com>

* enable nested parallelism by default

Signed-off-by: raver119 <raver119@gmail.com>

* Added cuda implementation for row_count helper.

* Added implementation for tsne gains op helper.

* - take into account possible situations when input arrays are empty in reduce_ cuda stuff

Signed-off-by: Yurii <yurii@skymind.io>

* Implemented tsne/edge_forces op cuda-based helper. Parallelized cpu-based helper for edge_forces.

* Added kernel for tsne/symmetrized op helper.

* Implementation of tsne/symmetrized op cuda helper. Working edition.

* Eliminated stray printfs.

* Added test for broadcastgradientargs op.

* host-only fallback for empty reduce float

Signed-off-by: raver119 <raver119@gmail.com>

* - some tests fixes

Signed-off-by: Yurii <yurii@skymind.io>

* - correct the rest of reduce_ stuff

Signed-off-by: Yurii <yurii@skymind.io>

* - further correction of reduce_ stuff

Signed-off-by: Yurii <yurii@skymind.io>

* Added test for Cbow op. Also added cuda implementation for cbow helpers.

* - improve code of stack operation for scalar case

Signed-off-by: Yurii <yurii@skymind.io>

* - provide cuda kernel for gatherND operation

Signed-off-by: Yurii <yurii@skymind.io>

* Implementation of cbow helpers with cuda kernels.

* minor tests tweaks

Signed-off-by: raver119 <raver119@gmail.com>

* minor tests tweaks

Signed-off-by: raver119 <raver119@gmail.com>

* - further correction of cuda stuff

Signed-off-by: Yurii <yurii@skymind.io>

* Implementation of cbow op helper with cuda kernels. Working edition.

* Skip random testing for cudablas case.

* lstmBlockCell context fix

Signed-off-by: raver119 <raver119@gmail.com>

* Added tests for ELU and ELU_BP ops.

* Added tests for eq_scalar, gt_scalar, gte_scalar and lte_scalar ops.

* Added tests for neq_scalar.

* Added test for noop.

* - further work on clipbynorm_bp

Signed-off-by: Yurii <yurii@skymind.io>

* - get rid of concat op call, use direct concat helper call instead

Signed-off-by: Yurii <yurii@skymind.io>

* lstmBlockCell context fix

Signed-off-by: raver119 <raver119@gmail.com>

* Added tests for lrelu and lrelu_bp.

* Added tests for selu and selu_bp.

* Fixed lrelu derivative helpers.

* - some corrections in lstm

Signed-off-by: Yurii <yurii@skymind.io>

* operator * result shape fix

Signed-off-by: raver119 <raver119@gmail.com>

* - correct typo in lstmCell

Signed-off-by: Yurii <yurii@skymind.io>

* few tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* CUDA inverse broadcast bool fix

Signed-off-by: raver119 <raver119@gmail.com>

* disable MMAP test for CUDA

Signed-off-by: raver119 <raver119@gmail.com>

* BooleanOp syncToDevice

Signed-off-by: raver119 <raver119@gmail.com>

* meh

Signed-off-by: raver119 <raver119@gmail.com>

* additional data types for im2col/col2im

Signed-off-by: raver119 <raver119@gmail.com>

* Added test for firas_sparse op.

* one more RandomBuffer test excluded

Signed-off-by: raver119 <raver119@gmail.com>

* Added tests for flatten op.

* Added test for Floor op.

* bunch of tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* mmulDot tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* more tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* Implemented floordiv_bp op and tests.

* Fixed scalar case with cuda implementation for bds.

* - work on cuda kernel for clip_by_norm backprop op is completed

Signed-off-by: Yurii <yurii@skymind.io>

* Eliminated cbow crash.

* more tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* more tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* Eliminated abort in batched NLP test.

* more tests fixed

Signed-off-by: raver119 <raver119@gmail.com>

* Fixed shared flag initialization.

* disabled bunch of cpu workspaces tests

Signed-off-by: raver119 <raver119@gmail.com>

* scalar operators fix: missing registerSpecialUse call

Signed-off-by: raver119 <raver119@gmail.com>

* Fixed logdet for cuda and tests.

* - correct clipbynorm_bp

Signed-off-by: Yurii <yurii@skymind.io>

* Fixed crop_and_resize shape datatype.

* - correct some mmul tests

Signed-off-by: Yurii <yurii@skymind.io>


/*******************************************************************************
* Copyright (c) 2015-2018 Skymind, Inc.
*
* This program and the accompanying materials are made available under the
* terms of the Apache License, Version 2.0 which is available at
* https://www.apache.org/licenses/LICENSE-2.0.
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*
* SPDX-License-Identifier: Apache-2.0
******************************************************************************/
//
// @author raver119@gmail.com
//
#include <ops/declarable/helpers/sg_cb.h>
#include <cuda_exception.h>
#include <NDArrayFactory.h>
#define HS_MAX_EXP 6.0f
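// CUDA helpers for the SkipGram/CBOW (word2vec) ops.
// expTable is a lookup table of precomputed activation values (the sigmoid, in word2vec)
// covering dot products in (-HS_MAX_EXP, HS_MAX_EXP); dots outside that range are treated
// as saturated, so the kernels below either skip the update or clamp the gradient.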
namespace nd4j {
namespace ops {
namespace helpers {
template <typename T>
__global__ void hSoftmaxKernel(void *vsyn0, void *vsyn1, void *vexpTable, void *vneu1e, double alpha, int vectorLength, int code, int expLength, bool isInference) {
    auto syn0 = reinterpret_cast<T*>(vsyn0);
    auto syn1 = reinterpret_cast<T*>(vsyn1);
    auto expTable = reinterpret_cast<T*>(vexpTable);
    auto neu1e = reinterpret_cast<T*>(vneu1e);
    T dot(0.0f);
    T g(0.0f);
    T f(0.0f);

    // dot
    for (int e = 0; e < vectorLength; e++) {
        dot += syn0[e] * syn1[e];
    }

    // gradient
    if (dot < (T) -HS_MAX_EXP || dot >= (T) HS_MAX_EXP)
        return;

    int idx = static_cast<int>((dot + HS_MAX_EXP) * ((float) expLength / HS_MAX_EXP / 2.0f));
    if (idx >= expLength || idx < 0)
        return;

    f = expTable[idx];
    g = (static_cast<T>(1.0f) - static_cast<T>(code) - f) * (T) alpha;

    // axpy1
    for (int e = 0; e < vectorLength; e++) {
        neu1e[e] = g * syn1[e] + neu1e[e];
    }

    // axpy2
    if (!isInference) {
        for (int e = 0; e < vectorLength; e++) {
            syn1[e] = g * syn0[e] + syn1[e];
        }
    }
}

template <typename T>
void hSoftmax_(void *vsyn0, void *vsyn1, void *vexpTable, void *vneu1e, double alpha, int vectorLength, int code, int expLength, bool isInference, cudaStream_t* stream) {
    hSoftmaxKernel<T><<<1, 1, 128, *stream>>>(vsyn0, vsyn1, vexpTable, vneu1e, alpha, vectorLength, code, expLength, isInference);
}

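// Negative-sampling counterpart of the kernel above: code == 1 marks the positive (target)
// sample, code == 0 a negative one, so g = (code - sigmoid(dot)) * alpha, with saturated
// dots short-circuited through the first two branches below.
// Note the <<<1, 1, ...>>> launch configuration: these per-row kernels run single-threaded.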
template <typename T>
__global__ void nSamplingKernel(void *vsyn0, void *vsyn1Neg, void *vexpTable, void *vneu1e, double alpha, int vectorLength, int code, int expLength, bool isInference) {
    auto syn0 = reinterpret_cast<T*>(vsyn0);
    auto syn1Neg = reinterpret_cast<T*>(vsyn1Neg);
    auto expTable = reinterpret_cast<T*>(vexpTable);
    auto neu1e = reinterpret_cast<T*>(vneu1e);
    T dot = (T) 0.0f;
    T g = (T) 0.0f;

    for (int e = 0; e < vectorLength; e++) {
        dot += syn0[e] * syn1Neg[e];
    }

    if (dot > HS_MAX_EXP)
        g = (code - 1) * alpha;
    else if (dot < (T) -HS_MAX_EXP)
        g = (code - 0) * alpha;
    else {
        int idx = (int) ((dot + (T) HS_MAX_EXP) * ((T) expLength / HS_MAX_EXP / 2.0));
        if (idx >= expLength || idx < 0)
            return;

        g = ((T) code - expTable[idx]) * alpha;
    }

    // axpy1
    for (int e = 0; e < vectorLength; e++) {
        neu1e[e] = g * syn1Neg[e] + neu1e[e];
    }

    // axpy2
    if (!isInference) {
        for (int e = 0; e < vectorLength; e++) {
            syn1Neg[e] = g * syn0[e] + syn1Neg[e];
        }
    }
}

template <typename T>
void nSampling_(void *vsyn0, void *vsyn1Neg, void *vexpTable, void *vneu1e, double alpha, int vectorLength, int code, int expLength, bool isInference, cudaStream_t* stream) {
    nSamplingKernel<T><<<1, 1, 128, *stream>>>(vsyn0, vsyn1Neg, vexpTable, vneu1e, alpha, vectorLength, code, expLength, isInference);
}

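// stub: device-side binary search is not implemented yet, so callers always get 0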
int binarySearch(const int *haystack, const int needle, const int totalElements) {
    return 0;
}

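// skipgram: the CUDA implementation is not wired up yet - the body only reads the data
// type and returns, so the op is effectively a no-op on this backend at this point.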
void skipgram(NDArray &syn0, NDArray &syn1, NDArray &syn1Neg, NDArray &expTable, NDArray &negTable, NDArray &target, NDArray &ngStarter, int nsRounds, NDArray &indices, NDArray &codes, NDArray &alpha, NDArray &randomValue, NDArray &inferenceVector, const bool preciseMode, const int numWorkers) {
    auto xType = syn0.dataType();
}

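// Validates context word indices against vocabSize while accumulating the corresponding
// syn0 rows into neu1. Device code can't throw, so on error the kernel signals failure by
// writing DataTypeUtils::infOrMax<T>() into neu1[0], which the host checks after a D2H copy.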
template <typename T>
static __global__ void checkContextKernel(int* context, T* syn0, T* neu1, int contextWidth, int vectorLength, int vocabSize) {
    __shared__ bool hasError;
    if (0 == threadIdx.x) {
        hasError = false;
    }

    auto start = blockIdx.x * blockDim.x + threadIdx.x;
    auto step = blockDim.x * gridDim.x;

    for (int c = start; c < contextWidth; c += step) {
        if (context[c] >= vocabSize)
            hasError = true; // can't throw std::runtime_error("Bad context 4") on device
        if (!hasError) {
            T *syn0word = syn0 + (context[c] * vectorLength);
            for (int i = 0; i < vectorLength; i++) {
                neu1[i] += syn0word[i];
            }
        }
    }

    if (threadIdx.x == 0) {
        if (hasError)
            neu1[0] = DataTypeUtils::infOrMax<T>();
    }
}

template <typename T>
__global__ void addInfVectorKernel(T* neu1, T* infVector, int vectorLength) {
    auto start = blockIdx.x * blockDim.x + threadIdx.x;
    auto step = blockDim.x * gridDim.x;

    for (auto i = start; i < vectorLength; i += step) {
        neu1[i] += infVector[i];
    }
}

template <typename T>
__global__ void shiftKernel(T* neu1, T* infVector, int contextWidth, int vectorLength) {
    auto start = blockIdx.x * blockDim.x + threadIdx.x;
    auto step = blockDim.x * gridDim.x;

    for (int i = start; i < vectorLength; i += step) {
        neu1[i] /= contextWidth + int(infVector != nullptr); // the inference vector counts as one extra context slot
    }
}

template <typename T>
__global__ void fillUpSynonymsKernel(int starter, int contextWidth, int vectorLength, int* lockedWords, int* context, T* neu1e, T* syn0) {
    auto start = threadIdx.x + blockIdx.x * blockDim.x;
    auto step = blockDim.x * gridDim.x;

    for (int c = starter + start; c < contextWidth; c += step) {
        if (lockedWords[c] == 1)
            continue;

        T *syn0word = syn0 + (context[c] * vectorLength);
        for (int i = 0; i < vectorLength; i++) {
            syn0word[i] += neu1e[i];
        }
    }
}

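// Single-row CBOW training step: average the context rows into neu1, run hierarchical
// softmax and/or negative sampling against it, then propagate the accumulated error
// neu1e back into syn0 (training) or into the inference vector (inference).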
template <typename T>
void cbow_(LaunchContext* lc, void *vsyn0, void *vsyn1, void *vsyn1Neg, void *vexpTable, void *vnegTable, void *vinfVector, int target, int ngStarter, int *context, int *lockedWords, int *indices, int8_t *codes, double alpha, Nd4jLong randomValue, const int contextWidth, const int hsRounds, const int nsRounds, const int vocabSize, const int vectorLength, const int expLength, const int negLength, const int numLabels, const bool trainWords) {
    auto syn0 = reinterpret_cast<T *>(vsyn0);
    auto syn1 = reinterpret_cast<T *>(vsyn1);
    auto syn1Neg = reinterpret_cast<T *>(vsyn1Neg);
    auto expTable = reinterpret_cast<T *>(vexpTable);
    auto negTable = reinterpret_cast<T *>(vnegTable);
    auto infVector = reinterpret_cast<T *>(vinfVector);
    auto stream = lc->getCudaStream();

    T* neu1;
    T* neu1e;
    size_t buffSize = sizeof(T) * vectorLength;
    auto err = cudaMalloc(&neu1, buffSize);
    err = cudaMalloc(&neu1e, buffSize);
    err = cudaMemset(neu1, 0, buffSize);
    err = cudaMemset(neu1e, 0, buffSize);
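    // note: these cudaMalloc/cudaMemset return codes are captured but never checked;
    // a failed allocation would only surface later as a kernel fault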
    // building neu1 for current window
    checkContextKernel<T><<<1, 1, 128, *stream>>>(context, syn0, neu1, contextWidth, vectorLength, vocabSize);

    T checkVal;
    err = cudaMemcpy(&checkVal, neu1, sizeof(T), cudaMemcpyDeviceToHost);
    if (DataTypeUtils::infOrMax<T>() == checkVal)
        throw std::runtime_error("Bad context 4");

    // for inference we add an additional inference vector
    if (infVector != nullptr) {
        addInfVectorKernel<T><<<128, 256, 128, *stream>>>(neu1, infVector, vectorLength);
    }

    // average neu1
    if (contextWidth > 0) {
        shiftKernel<T><<<128, 256, 128, *stream>>>(neu1, infVector, contextWidth, vectorLength);
    }

    // softmax round
    if (hsRounds > 0) {
        for (int i = 0; i < hsRounds; i++) {
            if (indices[i] < 0 || indices[i] >= vocabSize)
                throw std::runtime_error("Bad context 5");

            T* syn1Shifted = syn1 + (indices[i] * vectorLength);
            hSoftmax_<T>(neu1, syn1Shifted, expTable, neu1e, alpha, vectorLength, codes[i], expLength, infVector != nullptr, stream);
        }
    }
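    // negative-sampling rounds: round 0 uses the known target row (nsStarter); later
    // rounds draw rows from negTable with the same LCG as java.util.Random
    // (multiplier 25214903917 = 0x5DEECE66D, increment 11), keeping parity with the CPU path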
    auto nsStarter = ngStarter;
    auto irow = nsStarter;
    if (nsRounds > 0) {
        for (int r = 0; r < nsRounds + 1; r++) {
            if (r == 0) {
                // target is known in advance
            } else {
                randomValue = randomValue * (unsigned long long) 25214903917 + 11;
                auto idx = nd4j::math::nd4j_abs<Nd4jLong>((randomValue >> 16) % negLength);
                irow = idx >= negLength ? -1 : static_cast<int>(negTable[idx]);

                if (irow < 0 || irow >= vocabSize)
                    irow = randomValue % (vocabSize - 1) + 1;
                if (irow == nsStarter)
                    continue;
            }

            nSampling_<T>(neu1, syn1Neg + (irow * vectorLength), expTable, neu1e, alpha, vectorLength, r == 0 ? 1 : 0, expLength, infVector != nullptr, stream);
        }
    }
    // if we don't train words - we skip the start of idxSyn0
    int starter = trainWords == 1 ? 0 : contextWidth - numLabels;

    // propagate neu1e -> syn0
    if (infVector == nullptr) {
        fillUpSynonymsKernel<T><<<1, 1, 128, *stream>>>(starter, contextWidth, vectorLength, lockedWords, context, neu1e, syn0);
    } else {
        for (int i = 0; i < vectorLength; i++) {
            infVector[i] += neu1e[i];
        }
    }

    err = cudaFree(neu1);
    err = cudaFree(neu1e);
}

BUILD_SINGLE_TEMPLATE(template void cbow_, (LaunchContext* lc, void *syn0, void *syn1, void *syn1Neg, void *expTable, void *vnegTable, void *vinfVector, int target, int ngStarter, int *context, int *lockedWords, int *indices, int8_t *codes, double alpha, Nd4jLong randomValue, const int contextWidth, const int hsRounds, const int nsRounds, const int vocabSize, const int vectorLength, const int expLength, const int negLength, const int numLabels, const bool trainWords), FLOAT_TYPES);
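// Batch mode below: one host-side loop over batch rows, with per-row kernels for building
// the averaged context (buildCurrentWindowKernel), normalizing it (arrangeNeuKernel) and
// applying the accumulated error back to syn0 (applyShiftKernel).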
template <typename T>
static __global__ void buildCurrentWindowKernel(int vocabSize, int contextWidth, int vectorLength, int* bContext, T* syn0, T* neu1, int* actualContext, int e) {
    // building neu1 for current window
    auto start = blockIdx.x * blockDim.x + threadIdx.x;
    auto step = blockDim.x * gridDim.x;

    for (int c = start; c < contextWidth; c += step) {
        // getting next context word
        auto cContext = bContext[c + (e * contextWidth)];

        // skipping padded values
        if (cContext < 0)
            continue;
        // device code can't throw if cContext >= vocabSize ("ContextID can't be >= vocab size")

        T *syn0word = syn0 + (cContext * vectorLength);
        for (int i = 0; i < vectorLength; i++)
            neu1[i] += syn0word[i];

        atomicAdd(actualContext, 1);
    }
}

template <typename T>
__global__ void arrangeNeuKernel(int vectorLength, T* neu1, T* infVector, int* actualContext) {
    auto start = blockIdx.x * blockDim.x + threadIdx.x;
    auto step = blockDim.x * gridDim.x;

    for (int i = start; i < vectorLength && *actualContext > 0; i += step)
        neu1[i] /= (*actualContext + int(infVector != nullptr));
}

template <typename T>
__global__ void applyShiftKernel(int* bContext, int* bLocker, T* syn0, T* neu1e, int contextWidth, int vectorLength, int e, int starter) {
    auto step = blockDim.x * gridDim.x;
    auto start = blockDim.x * blockIdx.x + threadIdx.x;

    for (int c = starter + start; c < contextWidth; c += step) {
        // getting context
        auto cContext = bContext[c + (e * contextWidth)];
        auto cLock = bLocker[c + (e * contextWidth)];

        // skipping padded or locked values
        if (cContext < 0 || cLock == 1)
            continue;

        // one word from context
        T *syn0word = syn0 + (cContext * vectorLength);
        for (int i = 0; i < vectorLength; i++)
            syn0word[i] += neu1e[i];
    }
}

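// Host-side driver for batch CBOW: context/codes/indices/starters are synced to host once,
// then each batch row e gets its own sequence of kernel launches on the shared stream.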
template <typename T>
void cbowBatchExec_(LaunchContext* lc, NDArray &s0, NDArray &s1, NDArray &s1n, void *vexpTable, void *vnegTable, void *vinfVector, NDArray &context, NDArray &lockedWords, NDArray &targets, NDArray &negStarters, NDArray &indices, NDArray &codes, NDArray &lr, NDArray &nextRandom, NDArray &nLabels, const int nsRounds, const int vocabSize, const int vectorLength, const int expLength, const int negLength, const bool trainWords, const int numThreads) {
    const auto syn0 = reinterpret_cast<T*>(s0.specialBuffer());
    const auto syn1 = reinterpret_cast<T*>(s1.specialBuffer());
    const auto syn1Neg = reinterpret_cast<T*>(s1n.specialBuffer());
    const auto expTable = reinterpret_cast<T*>(vexpTable);
    const auto negTable = reinterpret_cast<T*>(vnegTable);
    const auto infVector = reinterpret_cast<T*>(vinfVector);
    auto stream = lc->getCudaStream();

    indices.syncToHost();
    codes.syncToHost();
    negStarters.syncToHost();
    context.syncToHost();

    const auto idxShift = indices.isEmpty() ? 0 : indices.sizeAt(1);
    const auto hsRounds = codes.isEmpty() ? 0 : codes.sizeAt(1);
    const auto numTargets = context.sizeAt(0);
    const int contextWidth = context.sizeAt(1);

    const auto bContext = reinterpret_cast<int*>(context.buffer());
    const auto dContext = reinterpret_cast<int*>(context.specialBuffer());
    const auto bLocker = reinterpret_cast<int*>(lockedWords.buffer());
    const auto dLocker = reinterpret_cast<int*>(lockedWords.specialBuffer());
    const auto bIndices = reinterpret_cast<int*>(indices.buffer());
    const auto bCodes = reinterpret_cast<int8_t*>(codes.buffer());
    const auto bStarters = reinterpret_cast<int*>(negStarters.buffer());
    const auto numIndices = indices.isEmpty() ? 0 : indices.sizeAt(1);

    lr.syncToHost();
    nLabels.syncToHost();

    // temp device buffers for the averaged context and the accumulated error
    T* neu1;
    T* neu1e;
    auto cerr = cudaMalloc(&neu1, sizeof(T) * vectorLength);
    if (cerr) {
        throw cuda_exception::build("Cannot allocate temp vector buffer", cerr);
    }
    cerr = cudaMalloc(&neu1e, sizeof(T) * vectorLength);
    if (cerr) {
        throw cuda_exception::build("Cannot allocate temp vector buffer", cerr);
    }

    int* actualContext;
    cerr = cudaMalloc(&actualContext, sizeof(int));
    if (cerr) {
        throw cuda_exception::build("Cannot allocate counter buffer", cerr);
    }
    for (int e = 0; e < numTargets; e++) {
        auto alpha = lr.e<double>(e);
        auto numLabels = nLabels.isEmpty() ? 0 : nLabels.e<int>(e);

        buildCurrentWindowKernel<T><<<1, 1, 128, *stream>>>(vocabSize, contextWidth, vectorLength, dContext, syn0, neu1, actualContext, e);
        arrangeNeuKernel<T><<<1, 1, 128, *stream>>>(vectorLength, neu1, infVector, actualContext);

        // hierarchic softmax step
        if (!indices.isEmpty()) {
            for (int i = 0; i < numIndices; i++) {
                const int cIndex = bIndices[(e * numIndices) + i];
                const int cCode = bCodes[(e * numIndices) + i];

                // we're skipping padded values
                if (cIndex < 0)
                    continue;

                if (cIndex >= vocabSize)
                    throw std::runtime_error("Index can't be > vocab size");

                hSoftmax_<T>(neu1, syn1 + (cIndex * vectorLength), expTable, neu1e, alpha, vectorLength, cCode, expLength, false, stream);
            }
        }

        // negative sampling step
        if (!negStarters.isEmpty() && nsRounds > 0) {
            int irow = bStarters[e];
            const int nsStarter = irow;
            unsigned long long randomValue = nextRandom.e<Nd4jLong>(e);

            for (int r = 0; r < nsRounds + 1; r++) {
                // we're skipping rng on step 0: the positive sample (nsStarter) is known in advance
                if (r != 0) {
                    randomValue = randomValue * (unsigned long long) 25214903917 + 11;
                    auto idx = nd4j::math::nd4j_abs<Nd4jLong>((randomValue >> 16) % negLength);
                    irow = idx >= negLength ? -1 : static_cast<int>(negTable[idx]);

                    if (irow < 0 || irow >= vocabSize)
                        irow = randomValue % (vocabSize - 1) + 1;
                    if (irow == nsStarter)
                        continue;
                }

                nSampling_<T>(neu1, s1n.bufferWithOffset(irow * vectorLength), expTable, neu1e, alpha, vectorLength, r == 0 ? 1 : 0, expLength, infVector != nullptr, stream);
            }
        }

        // if we're skipping labels
        int starter = trainWords == 1 ? 0 : contextWidth - numLabels;

        // applying previously averaged results
        applyShiftKernel<T><<<1, 1, 128, *stream>>>(dContext, dLocker, syn0, neu1e, contextWidth, vectorLength, e, starter);
    }
    cerr = cudaFree(neu1);
    if (cerr) {
        throw cuda_exception::build("Cannot deallocate temp buffer neu1", cerr);
    }
    cerr = cudaFree(neu1e);
    if (cerr) {
        throw cuda_exception::build("Cannot deallocate temp buffer neu1e", cerr);
    }
    cerr = cudaFree(actualContext);
    if (cerr) {
        throw cuda_exception::build("Cannot deallocate counter buffer", cerr);
    }
}

BUILD_SINGLE_TEMPLATE(template void cbowBatchExec_, (LaunchContext* lc, NDArray &s0, NDArray &s1, NDArray &s1n, void *vexpTable, void *vnegTable, void *vinfVector, NDArray &context, NDArray &lockedWords, NDArray &targets, NDArray &negStarters, NDArray &indices, NDArray &codes, NDArray &lr, NDArray &nextRandom, NDArray &nLabels, const int nsRounds, const int vocabSize, const int vectorLength, const int expLength, const int negLength, const bool trainWords, const int numThreads), FLOAT_TYPES);
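// Public entry point: the rank of context/indices decides between the single-round path
// (cbow_, scalar or vector inputs) and the batch path (cbowBatchExec_, rank-2 inputs).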
void cbow(NDArray &syn0, NDArray &syn1, NDArray &syn1Neg, NDArray &expTable, NDArray &negTable, NDArray &target, NDArray &ngStarter, int nsRounds, NDArray &context, NDArray &lockedWords, NDArray &indices, NDArray &codes, NDArray &alpha, NDArray &randomValue, NDArray &numLabels, NDArray &inferenceVector, const bool trainWords, int numWorkers) {
    auto xType = syn0.dataType();
    auto lc = context.getContext();
    indices.syncToHost();
    NDArray::prepareSpecialUse({&syn0, &syn1, &syn1Neg, &expTable, &negTable, &target, &ngStarter}, {&context, &lockedWords, &indices, &codes, &alpha, &randomValue, &numLabels, &inferenceVector});

    if ((context.rankOf() == 0 || context.rankOf() == 1) && (indices.rankOf() == 1 || indices.rankOf() == 0)) {
        // single round case
        auto hsRounds = codes.lengthOf();

        target.syncToHost();
        numLabels.syncToHost();
        alpha.syncToHost();
        codes.syncToHost();
        negTable.syncToHost();

        BUILD_SINGLE_SELECTOR(xType, cbow_, (lc, syn0.specialBuffer(), syn1.specialBuffer(), syn1Neg.specialBuffer(), expTable.specialBuffer(), negTable.buffer(), inferenceVector.specialBuffer(), target.isEmpty() ? -1 : target.e<int>(0), ngStarter.isEmpty() ? -1 : ngStarter.e<int>(0), reinterpret_cast<int *>(context.specialBuffer()), reinterpret_cast<int *>(lockedWords.specialBuffer()), reinterpret_cast<int *>(indices.buffer()), reinterpret_cast<int8_t *>(codes.buffer()), alpha.e<double>(0), randomValue.e<Nd4jLong>(0), (int) context.lengthOf(), hsRounds, nsRounds, (int) syn0.sizeAt(0), (int) syn0.sizeAt(1), (int) expTable.lengthOf(), (int) negTable.lengthOf(), numLabels.isEmpty() ? 0 : numLabels.e<int>(0), trainWords), FLOAT_TYPES);
    } else if (context.rankOf() == 2 && indices.rankOf() == 2) {
        // batch mode
        BUILD_SINGLE_SELECTOR(xType, cbowBatchExec_, (lc, syn0, syn1, syn1Neg, expTable.specialBuffer(), negTable.specialBuffer(), nullptr, context, lockedWords, target, ngStarter, indices, codes, alpha, randomValue, numLabels, nsRounds, syn0.sizeAt(0), syn0.sizeAt(1), expTable.lengthOf(), negTable.isEmpty() ? 0 : negTable.lengthOf(), trainWords, numWorkers), FLOAT_TYPES);
    } else
        throw std::runtime_error("CBOW: context must have rank 0/1 or 2");

    NDArray::registerSpecialUse({&syn0, &syn1, &syn1Neg, &expTable, &negTable, &target, &ngStarter}, {&context, &lockedWords, &indices, &codes, &alpha, &randomValue, &numLabels, &inferenceVector});
}
} // namespace helpers
} // namespace ops
} // namespace nd4j