Merge master to upstream (#7945)

* Shugeo strided slice zeros (#14) * Modified strided_slice op to properly work with empty-like shapes. * Fixed test for reduce_mean with empty-like input. * [WIP] Last merge (#15) * correct logsoftmax looss (#2) * Small SameDiff listener fix (#4) * Various fixes (#6) * #7839 Fix for asXMatrix and tests * #7866 EmbeddingSequenceLayer dtype fix + test * #7856 SameDiff save/load stream methods * #7859 RegressionEvaluation rank 4 fix + tests + axis configuration * EvaluationBinary 3d/4d * More evaluation 3d/4d tests * #7847 Evaluation empty checks * Small test ifx * #7848 Fix median edge case * Improve DL4J samediff layer tests * [WIP] FastText wrapper implemented (#8) * FastText implemented * Some fixes * Fix shapes for wordsNearest * Validation of input vectors * Fixes * Fixed test * Thread tagged * Some tweaks * setContextClassLoader for DeallocatorServiceThread * Numpy format tests (#1) * Various fixes (#11) * #7852 SameDiff gather fix * #7892 SameDiff placeholder to constant conversion * #7890 validate input rank for MLN/CG init methods * Fix broken permute shape calculation * Permute and gather fixes * Tests * #7850 LogSumExp fix + test * Handful of test fixes * Empty arrays with non-scalar shapes (#10) * minor rearrangements for lambdas * empty tensors with non-scalar shapes * numpy empty tensors with non-scalar shapes * few more empty tweaks * Small fixes * conv3d signature update * micro fix in batchnorm mkldnn * Import fixes * Fix * MKL-DNN update * Small fill fix * fill with empty input + test * Fixes * Small error improvement * Fix * one special test * couple of fixes for lstm * Rewrite TFGraphMapper.getNDArrayFromTensor to be maintainable and less error prone * Fixes * FP16 * Unsigned * BFloat16 * Fill op - empty tweaks * - couple of fixes for empty arrays construction - stack updated * strided slice fix * one transform test * provide method for reducing shapeInfo in case of input array is empty * Fixed reduceAlongDimensions to use empty input properly. 
* couple of broadcast tests * couple of tests broadcast tests + tweak to make them pass * add check of non-empty to methods producing sub-arrays * Fixed reshapeC with zeros in shape. * complete empty check in reduce_... legacy ops * Concat and cumsum/prod * Tweak to empty shape inference on import * add empty check to the rest of reduce legacy ops * one more test * correct typo in evalReduceShapeInfoEmpty * Added tests for reduce_* ops to tests with zero shapes. * few more tests for empty reductions * Fixed strided_slice op with empty case and tests. * one more empty reduction test * Fixed strided_slice test. * add empty check to NDArray::reshapei * infOrMax * empty min/max with infinity tests * made unstack working correctly with empty arrays * few IndexReduce tests + tweaks for empty shapes * add test for empty concat * few tests fixed * Validation fix for reductions on empty shapes * Reverse fix * Reduction shape calc fixes * SameDiff.generateOutputVariable: don't use shape function to determine number of outputs * Range fix * - NDArray constructor updated for scalars/empty arrays - few tests fixed * More fixes * Empty creator fixes * concat fix * concat fix * TF import tests: allow 'both all NaN' and 'both all inf' to pass * Slice, zero fraction, and reshape fixes * transpose, gather * Zero fraction * scalar cast fix * Empty reduction axis support * few more tests fixed * Fixed input checks conforming with TF for concat op and tests. * few tests fixed * matmul scalar shape fix * Fixed checkout for data type and scalarity with concat to allow non-empty scalars with vector concats. 
* broadcast bool fix * few more tests * few more tests * correct evalReduceShapeInfoEmpty * argmax/argmin + tests * one more empty edge case + one more test * argmax/argmin/realdiv_bp tweaks * empty reshape test + fix * Helper fixes * Small fixes * Gather test fix * Gather test fix * Small fixes * reduce scalar zero values * scalar mean workaround * Remove debug code * along dim mean workaround * one more test * - equalsTo() tweak for empty arrays - one more test * broadcast tweaks * [WIP] Fixing outstanding issues for NLP (#9) * Avoid using not-inited objects * Test fixed. * Redundant method avoided for models like FastText * KMeans++ implementation * KMeans++ implementation * Disable parallel execution * KMeans++ * Tests * Dev branch merge (#16) * SameDiff: convertDataType and gradient check util improvements (#12) * GradCheck util improvements * StopGradient constructor + test * SameDiff: Add datatype conversion * Javadoc and add DataType.isNumerical() * Small fix * Fix SameDiff TF import test cases intermediate naming (workaround for bad default) * TFGraphTestAllHelper: check intermediates in execution order * Add missing debug listener * [WIP] lstmBlock fix + other changes (#13) - fixes lstmBlock issue - changes NDArray method reshape(), permute(), transpose() by making them return instance instead of pointer - CheckNumerics op - fixes for ReduceBool IsInfOrNan & IsFinite * Small test fix * CheckNumerics op wrapper * Fix some issues on master (#17) * Fix DataVec test issue * Fix issue with dl4j SameDiff output layer * Dtype fix for lambda layers * #7912 BertIterator dtype fix (use float32 not global default) * [WIP] Next set of CUDA stuff (#7) New CUDA implementations and improvements * bad file * Dev branch master merge (#23) * SameDiff: convertDataType and gradient check util improvements (#12) * GradCheck util improvements * StopGradient constructor + test * SameDiff: Add datatype conversion * Javadoc and add DataType.isNumerical() * Small fix * Fix 
SameDiff TF import test cases intermediate naming (workaround for bad default) * TFGraphTestAllHelper: check intermediates in execution order * Add missing debug listener * [WIP] lstmBlock fix + other changes (#13) - fixes lstmBlock issue - changes NDArray method reshape(), permute(), transpose() by making them return instance instead of pointer - CheckNumerics op - fixes for ReduceBool IsInfOrNan & IsFinite * Small test fix * CheckNumerics op wrapper * Compatibility of deserialization (#18) Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com> * SameDiff: add activation gradient checking support for debugging (#19) * SameDiff gradient checker: first pass on activation gradient checks * Fixes + tests for activation gradient checking * Javadoc * [WIP] Some nd4j data type corrections (#20) * Adjust data type * Set correct Data type. * Size of proper data type. * fix averaged cpu load (#22) * SameDiff ops, TF import and fixes (#24) * CheckNumerics tests + fixes + misc fixes Signed-off-by: AlexDBlack <blacka101@gmail.com> * Fake quant Signed-off-by: AlexDBlack <blacka101@gmail.com> * Fixes Signed-off-by: AlexDBlack <blacka101@gmail.com> * FakeQuantWithMinMaxArgs Signed-off-by: AlexDBlack <blacka101@gmail.com> * CheckNumerics fix Signed-off-by: AlexDBlack <blacka101@gmail.com> * Fix libnd4j ALL_INTS and ALL_FLOATS declaration (uint and bfloat types) Signed-off-by: AlexDBlack <blacka101@gmail.com> * Small fix Signed-off-by: AlexDBlack <blacka101@gmail.com> * Javadoc Signed-off-by: AlexDBlack <blacka101@gmail.com> * Exception tweak Signed-off-by: AlexDBlack <blacka101@gmail.com> * fix Signed-off-by: AlexDBlack <blacka101@gmail.com> * Fix for out of scope stack allocated var use Signed-off-by: AlexDBlack <blacka101@gmail.com> * Ignores Signed-off-by: AlexDBlack <blacka101@gmail.com> * Ignore for known failing test (already logged issue) Signed-off-by: AlexDBlack <blacka101@gmail.com> * Merge upstream to fork (#25) * Add thousand-separator commas to TotalParams 
(#7915) * Add thousand-separator commas to TotalParams The number of parameters can be quite large, and it would help the reading of the summary printout to have the TotalParams column & values at the bottom have thousand-separator-commas in them. * Add thousand-separator commas to MultiLayerNetwork Corresponding change to MultiLayerNetwork Signed-off-by: Jxtps Jxtps <jxtps435@gmail.com> * Update contributing and issue/PR templates (#7934) Signed-off-by: AlexDBlack <blacka101@gmail.com> * Fix link to AdaDelta paper (#7942) Fix link to AdaDelta paper hosted on matthewzeiler.com Signed-off-by: Jxtps * Fixes, and ignores for known/logged failing issues (#7943) Signed-off-by: AlexDBlack <blacka101@gmail.com> * SameDiff + DL4J/SameDiff: Multiple fixes (#28) * #7919 HDF5 attribute buffer length fix Signed-off-by: AlexDBlack <blacka101@gmail.com> * #7909 Arbiter constructor exception ux improvements Signed-off-by: AlexDBlack <blacka101@gmail.com> * #7925 RNN output layer length checks Signed-off-by: AlexDBlack <blacka101@gmail.com> * #7939 Add listener for validating inputs are not incorrectly modified Signed-off-by: AlexDBlack <blacka101@gmail.com> * #7939 Integrate NonInplaceValidationListener into tests * #7844 DL4J SameDiff fixes for variable minibatch size * DL4J SameDiff fixes - ensure gradient for input placeholder is available Signed-off-by: AlexDBlack <blacka101@gmail.com> * Tweaks to ExternalErrorsFunction - use placeholders, make more robust * Another fix * More fixes * More SameDiff/DL4J fixes * Scope out scalar array creation in BaseScalarOp * Remove debug code Signed-off-by: AlexDBlack <blacka101@gmail.com> * [WIP] Final dev branch merge (#29) * SameDiff: convertDataType and gradient check util improvements (#12) * GradCheck util improvements * StopGradient constructor + test * SameDiff: Add datatype conversion * Javadoc and add DataType.isNumerical() * Small fix * Fix SameDiff TF import test cases intermediate naming (workaround for bad default) * 
TFGraphTestAllHelper: check intermediates in execution order * Add missing debug listener * [WIP] lstmBlock fix + other changes (#13) - fixes lstmBlock issue - changes NDArray method reshape(), permute(), transpose() by making them return instance instead of pointer - CheckNumerics op - fixes for ReduceBool IsInfOrNan & IsFinite * Small test fix * CheckNumerics op wrapper * Compatibility of deserialization (#18) Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com> * SameDiff: add activation gradient checking support for debugging (#19) * SameDiff gradient checker: first pass on activation gradient checks * Fixes + tests for activation gradient checking * Javadoc * [WIP] Some nd4j data type corrections (#20) * Adjust data type * Set correct Data type. * Size of proper data type. * fix averaged cpu load (#22) * [WIP] Multiple dataset iterators (#27) * Splitting dataset into arbitrary number * Fixes * Multiple split of iterator * Test * Test * Some fixes * signature change * one more tweak Signed-off-by: raver119 <raver119@gmail.com> * one more test for sequential use of DataSetIteratorSplitter Signed-off-by: raver119 <raver119@gmail.com> * Fixes * Fixes * one more test for Alexander Signed-off-by: raver119 <raver119@gmail.com> * Some fixes * Some fixes * one more test for Alexander Signed-off-by: raver119 <raver119@gmail.com> * minor test fix Signed-off-by: raver119 <raver119@gmail.com> * Some fixes * Some fixes * couple of assertions tweaked Signed-off-by: raver119 <raver119@gmail.com> * MDS splitter test :/ Signed-off-by: raver119 <raver119@gmail.com> * Minor refactoring * Multi dataset * Some fixes * More tests * Small number of test fixes/improvements (failures on CI) (#31) Signed-off-by: AlexDBlack <blacka101@gmail.com> * [WIP] More CUDA stuff (#26) * initial commit Signed-off-by: raver119 <raver119@gmail.com> * LRN BP CUDA Signed-off-by: raver119 <raver119@gmail.com> * less memory Signed-off-by: raver119 <raver119@gmail.com> * Fixed bug with 
crop_and_resize op helper. * get rid of unnecessary index-calculation dunction Signed-off-by: Yurii <yurii@skymind.io> * Fixed sort with nth_element cuda-based helper. * Refactored nth_element. * Refactored nth_element op and tests. * Modified usage of dim array with sortTad routine. * Refactored main routine of helper for non_max_image_suppression op. * non_max_image_suppression op helper with cuda kernel implementation. Initial revision. * fix vol2col cuda kernel * meh Signed-off-by: raver119 <raver119@gmail.com> * topK concept Signed-off-by: raver119 <raver119@gmail.com> * unsorted topK with scanWitdh of 1 Signed-off-by: raver119 <raver119@gmail.com> * correct vol2col tests * sorted/unsorted topK Signed-off-by: raver119 <raver119@gmail.com> * implementation and fixing col2im/col2vol * Corrected usage flags with input/output with reverse op. * dup is const now Signed-off-by: raver119 <raver119@gmail.com> * percentile op Signed-off-by: raver119 <raver119@gmail.com> * group tests for mapool2d Signed-off-by: Yurii <yurii@skymind.io> * special test for george Signed-off-by: raver119 <raver119@gmail.com> * less threads for sortTad Signed-off-by: raver119 <raver119@gmail.com> * provide conv2d for cuda Signed-off-by: Yurii <yurii@skymind.io> * remove auther in sort tad kernel code Signed-off-by: Yurii <yurii@skymind.io> * provide depthwise_conv2d for cuda Signed-off-by: Yurii <yurii@skymind.io> * - max_pooling_with_argmax - null check for special use Signed-off-by: raver119 <raver119@gmail.com> * dts cuda Signed-off-by: raver119 <raver119@gmail.com> * provide sconv2d for cuda Signed-off-by: Yurii <yurii@skymind.io> * std cuda Signed-off-by: raver119 <raver119@gmail.com> * Refactored non_max_suppression op to conform TF implementation. * Improved suppression helper. 
* provide pooling3d for cuda Signed-off-by: Yurii <yurii@skymind.io> * minor lstm rearrangements Signed-off-by: raver119 <raver119@gmail.com> * more of minor lstm rearrangements Signed-off-by: raver119 <raver119@gmail.com> * (bi)dynamic_rnn Signed-off-by: raver119 <raver119@gmail.com> * templates init order Signed-off-by: raver119 <raver119@gmail.com> * Refactored non_max_suppression op. * Added cuda kernel for non_max_suppression. * CPU sort by key/value Signed-off-by: raver119 <raver119@gmail.com> * CPU sort TAD by key/value Signed-off-by: raver119 <raver119@gmail.com> * CPU sort TAD by key/value tests Signed-off-by: raver119 <raver119@gmail.com> * Eliminate compiler error with cuda implementation. * - repaired gradCheck in cuda - provide conv2d_bp for cuda Signed-off-by: Yurii <yurii@skymind.io> * missed signature Signed-off-by: raver119 <raver119@gmail.com> * provide depthwise_conv2d_bp for cuda Signed-off-by: Yurii <yurii@skymind.io> * Implementation of lup helper with cuda kernel. Initial commit. * further work on backprops for convolutions Signed-off-by: Yurii <yurii@skymind.io> * CUDA linear sort by key/val Signed-off-by: raver119 <raver119@gmail.com> * CUDA tad sort by key/val Signed-off-by: raver119 <raver119@gmail.com> * start providing of backprop for pooling2d/3d Signed-off-by: Yurii <yurii@skymind.io> * Added atomicAdd for bool datatype. * dynamic partition concept Signed-off-by: raver119 <raver119@gmail.com> * dynamic partition concept Signed-off-by: raver119 <raver119@gmail.com> * dynamic partition scalar CUDA Signed-off-by: raver119 <raver119@gmail.com> * important comment Signed-off-by: raver119 <raver119@gmail.com> * fix pooling2d/3d backprop helpers Signed-off-by: Yurii <yurii@skymind.io> * Added non-linear test with dynamic_partition. * Improved test for dynamic_partition. 
* dynamic_partition TAD concept Signed-off-by: raver119 <raver119@gmail.com> * - dynamic_partition TAD CUDA impl - dynamic_partition TAD CPU fix Signed-off-by: raver119 <raver119@gmail.com> * - rewrite cpu code for usampling2d/3d - write cuda code for usampling2d/3d Signed-off-by: Yurii <yurii@skymind.io> * dynamic_stitch CUDA vector case Signed-off-by: raver119 <raver119@gmail.com> * dynamic_stitch CUDA TAD case concept Signed-off-by: raver119 <raver119@gmail.com> * dynamic_stitch CUDA TAD case impl Signed-off-by: raver119 <raver119@gmail.com> * Added tests for dynamic_stitch 3D-4D cases. * minor tests tweaks Signed-off-by: raver119 <raver119@gmail.com> * Fixed type check for dynamic stitch. * min/max bp Signed-off-by: raver119 <raver119@gmail.com> * rewrite code for upsampling2d/3d cpu Signed-off-by: Yurii <yurii@skymind.io> * reduce min/max/norm_max bp Signed-off-by: raver119 <raver119@gmail.com> * lup implementation. Additional enhancements. * provide code for upsamling2d/3d backprop Signed-off-by: Yurii <yurii@skymind.io> * weightedCrossEntropyWithLogits Signed-off-by: raver119 <raver119@gmail.com> * Fixed template math atomicMul for 64bit ints. * Refactored dynamic_partition_bp op. * inverseBroadcast fix Signed-off-by: raver119 <raver119@gmail.com> * DynamicPartitionBP test datatype fixed. * - nd4j_atomicMul Windows fix - cpu/NDArrayLambda.hpp excluded from CUDA Signed-off-by: raver119 <raver119@gmail.com>
2019-06-28 01:37:04 +10:00 · 1170827c18
commit 1170827c18
parent cae4fc9760
331 changed files with 17959 additions and 7363 deletions
--- a/arbiter/arbiter-core/src/main/java/org/deeplearning4j/arbiter/optimize/api/TaskCreatorProvider.java
+++ b/arbiter/arbiter-core/src/main/java/org/deeplearning4j/arbiter/optimize/api/TaskCreatorProvider.java
@@ -31,7 +31,7 @@ public class TaskCreatorProvider {
            }
            return c.newInstance();
        } catch (Exception e){
-            throw new RuntimeException("Could not create new instance of task creator class: " + c, e);
+            throw new RuntimeException("Could not create new instance of task creator class: " + c + " - missing no-arg constructor?", e);
        }
    }

--- a/arbiter/arbiter-core/src/main/java/org/deeplearning4j/arbiter/optimize/api/data/DataSetIteratorFactoryProvider.java
+++ b/arbiter/arbiter-core/src/main/java/org/deeplearning4j/arbiter/optimize/api/data/DataSetIteratorFactoryProvider.java
@@ -83,7 +83,7 @@ public class DataSetIteratorFactoryProvider implements DataProvider {
                            (Class<? extends DataSetIteratorFactory>) Class.forName(value);
            return clazz.newInstance();
        } catch (Exception e) {
-            throw new RuntimeException(e);
+            throw new RuntimeException("Could not create DataSetIteratorFactory instance - missing no-arg constructor?", e);
        }
    }
 }
--- a/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/data/DataSetIteratorFactoryProvider.java
+++ b/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/data/DataSetIteratorFactoryProvider.java
@@ -79,7 +79,7 @@ public class DataSetIteratorFactoryProvider implements DataProvider {
                            (Class<? extends DataSetIteratorFactory>) Class.forName(value);
            return clazz.newInstance();
        } catch (Exception e) {
-            throw new RuntimeException(e);
+            throw new RuntimeException("Could not create DataSetIteratorFactory instance - missing no-arg constructor?", e);
        }
    }
 }
--- a/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/scoring/impl/BaseNetScoreFunction.java
+++ b/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/scoring/impl/BaseNetScoreFunction.java
@@ -54,7 +54,7 @@ public abstract class BaseNetScoreFunction implements ScoreFunction {
                ds.configure(dataSourceProperties);
            }
        } catch (Exception e){
-            throw new RuntimeException(e);
+            throw new RuntimeException("Error creating DataSource instance - missing no-arg constructor?", e);
        }
        return score(model, ds.testData());
    }
--- a/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/task/ComputationGraphTaskCreator.java
+++ b/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/task/ComputationGraphTaskCreator.java
@@ -188,10 +188,15 @@ public class ComputationGraphTaskCreator implements TaskCreator {
            //For DataSetIterator: wraps in a MultiDataSetIterator, hence method can be used for both
            MultiDataSetIterator iterator;
            if(dataSource != null){
-                DataSource dsInstance = dataSource.newInstance();
-                if(dataSourceProperties != null)
-                    dsInstance.configure(dataSourceProperties);
-                iterator = ScoreUtil.getMultiIterator(dsInstance.trainData());
+                try {
+                    DataSource dsInstance = dataSource.newInstance();
+                    if (dataSourceProperties != null)
+                        dsInstance.configure(dataSourceProperties);
+                    iterator = ScoreUtil.getMultiIterator(dsInstance.trainData());
+                } catch (Exception e){
+                    throw new RuntimeException("Error instantiating instance of DataSource for class " + dataSource.getName() +
+                            " - no zero-arg constructor?",e);
+                }
            } else {
                iterator = ScoreUtil.getMultiIterator(dataProvider.trainData(candidate.getDataParameters()));
            }
--- a/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/task/MultiLayerNetworkTaskCreator.java
+++ b/arbiter/arbiter-deeplearning4j/src/main/java/org/deeplearning4j/arbiter/task/MultiLayerNetworkTaskCreator.java
@@ -190,7 +190,8 @@ public class MultiLayerNetworkTaskCreator implements TaskCreator {
                try{
                    dsInstance = dataSource.newInstance();
                } catch (Exception e){
-                    throw new RuntimeException("Error instantiating instance of DataSource for class " + dataSource.getName());
+                    throw new RuntimeException("Error instantiating instance of DataSource for class " + dataSource.getName() +
+                            " - no zero-arg constructor?",e);
                }
                if(dataSourceProperties != null)
                    dsInstance.configure(dataSourceProperties);
--- a/datavec/datavec-api/src/test/java/org/datavec/api/transform/transform/ndarray/TestNDArrayWritableTransforms.java
+++ b/datavec/datavec-api/src/test/java/org/datavec/api/transform/transform/ndarray/TestNDArrayWritableTransforms.java
@@ -26,6 +26,7 @@ import org.datavec.api.writable.NDArrayWritable;
 import org.datavec.api.writable.Text;
 import org.datavec.api.writable.Writable;
 import org.junit.Test;
+import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.api.ndarray.INDArray;
 import org.nd4j.linalg.factory.Nd4j;
 import org.nd4j.linalg.ops.transforms.Transforms;
@@ -78,14 +79,14 @@ public class TestNDArrayWritableTransforms {
        assertEquals(expColNames, tp.getFinalSchema().getColumnNames());


-        List<Writable> in = Arrays.<Writable>asList(new DoubleWritable(0), new NDArrayWritable(Nd4j.linspace(0, 9, 10)),
-                        new NDArrayWritable(Nd4j.valueArrayOf(1, 10, 2.0)));
+        List<Writable> in = Arrays.<Writable>asList(new DoubleWritable(0), new NDArrayWritable(Nd4j.linspace(DataType.DOUBLE,0, 10, 1).reshape(1,10)),
+                        new NDArrayWritable(Nd4j.valueArrayOf(1, 10, 2.0).castTo(DataType.DOUBLE)));
        List<Writable> out = tp.execute(in);

        List<Writable> exp =
-                        Arrays.<Writable>asList(new DoubleWritable(0), new NDArrayWritable(Nd4j.linspace(0, 9, 10)),
-                                        new NDArrayWritable(Nd4j.valueArrayOf(1, 10, 2.0)),
-                                        new NDArrayWritable(Nd4j.linspace(0, 9, 10).addi(2.0)));
+                        Arrays.<Writable>asList(new DoubleWritable(0), new NDArrayWritable(Nd4j.linspace(DataType.DOUBLE,0, 10, 1).reshape(1,10)),
+                                        new NDArrayWritable(Nd4j.valueArrayOf(1, 10, 2.0).castTo(DataType.DOUBLE)),
+                                        new NDArrayWritable(Nd4j.linspace(DataType.DOUBLE, 0, 10, 1).addi(2.0).reshape(1,10)));

        assertEquals(exp, out);
    }
--- a/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/datasets/iterator/DataSetSplitterTests.java
+++ b/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/datasets/iterator/DataSetSplitterTests.java
@@ -20,9 +20,15 @@ import lombok.val;
 import org.deeplearning4j.BaseDL4JTest;
 import org.deeplearning4j.datasets.iterator.tools.DataSetGenerator;
 import org.junit.Test;
+import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
 import org.nd4j.linalg.exception.ND4JIllegalStateException;
+import org.nd4j.linalg.factory.Nd4j;

-import static org.junit.Assert.assertEquals;
+import java.util.Collections;
+import java.util.List;
+import java.util.Random;
+
+import static org.junit.Assert.*;

 public class DataSetSplitterTests extends BaseDL4JTest {
    @Test
@@ -39,7 +45,7 @@ public class DataSetSplitterTests extends BaseDL4JTest {
        int gcntTest = 0;
        int global = 0;
        // emulating epochs here
-        for (int e = 0; e < numEpochs; e++){
+        for (int e = 0; e < numEpochs; e++) {
            int cnt = 0;
            while (train.hasNext()) {
                val data = train.next().getFeatures();
@@ -79,7 +85,7 @@ public class DataSetSplitterTests extends BaseDL4JTest {
        int gcntTest = 0;
        int global = 0;
        // emulating epochs here
-        for (int e = 0; e < numEpochs; e++){
+        for (int e = 0; e < numEpochs; e++) {
            int cnt = 0;
            while (train.hasNext()) {
                val data = train.next().getFeatures();
@@ -117,7 +123,7 @@ public class DataSetSplitterTests extends BaseDL4JTest {
        int gcntTest = 0;
        int global = 0;
        // emulating epochs here
-        for (int e = 0; e < numEpochs; e++){
+        for (int e = 0; e < numEpochs; e++) {
            int cnt = 0;
            while (train.hasNext()) {
                val data = train.next().getFeatures();
@@ -144,4 +150,245 @@ public class DataSetSplitterTests extends BaseDL4JTest {

        assertEquals(1000 * numEpochs, global);
    }
+
+    @Test
+    public void testSplitter_4() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new DataSetIteratorSplitter(back, 1000, new double[]{0.5, 0.3, 0.2});
+        List<DataSetIterator> iteratorList = splitter.getIterators();
+        val numEpochs = 10;
+        int global = 0;
+        // emulating epochs here
+        for (int e = 0; e < numEpochs; e++) {
+            int iterNo = 0;
+            int perEpoch = 0;
+            for (val partIterator : iteratorList) {
+                int cnt = 0;
+                partIterator.reset();
+                while (partIterator.hasNext()) {
+                    val data = partIterator.next().getFeatures();
+                    assertEquals("Train failed on iteration " + cnt + "; epoch: " + e,
+                            (float) perEpoch, data.getFloat(0), 1e-5);
+                    //gcntTrain++;
+                    global++;
+                    cnt++;
+                    ++perEpoch;
+                }
+                ++iterNo;
+            }
+        }
+
+        assertEquals(1000* numEpochs, global);
+    }
+
+    @Test
+    public void testSplitter_5() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new DataSetIteratorSplitter(back, new int[]{900, 100});
+
+        List<DataSetIterator> iteratorList = splitter.getIterators();
+        val numEpochs = 10;
+
+        int global = 0;
+        // emulating epochs here
+        for (int e = 0; e < numEpochs; e++) {
+            int iterNo = 0;
+            int perEpoch = 0;
+            for (val partIterator : iteratorList) {
+                partIterator.reset();
+                while (partIterator.hasNext()) {
+                    int cnt = 0;
+                    val data = partIterator.next().getFeatures();
+
+                    assertEquals("Train failed on iteration " + cnt + "; epoch: " + e,
+                            (float) perEpoch, data.getFloat(0), 1e-5);
+                    //gcntTrain++;
+                    global++;
+                    cnt++;
+                    ++perEpoch;
+                }
+                ++iterNo;
+            }
+        }
+
+        assertEquals(1000 * numEpochs, global);
+    }
+
+    @Test
+    public void testSplitter_6() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        // we're going to mimic train+test+validation split
+        val splitter = new DataSetIteratorSplitter(back, new int[]{800, 100, 100});
+
+        assertEquals(3, splitter.getIterators().size());
+
+        val trainIter = splitter.getIterators().get(0);
+        val testIter = splitter.getIterators().get(1);
+        val validationIter = splitter.getIterators().get(2);
+
+        // we're going to have multiple epochs
+        int numEpochs = 10;
+        for (int e = 0; e < numEpochs; e++) {
+            int globalIter = 0;
+            trainIter.reset();
+            testIter.reset();
+            validationIter.reset();
+
+            boolean trained = false;
+            while (trainIter.hasNext()) {
+                trained = true;
+                val ds = trainIter.next();
+                assertNotNull(ds);
+
+                assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures().getDouble(0), 1e-5f);
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", trained);
+            assertEquals(800, globalIter);
+
+
+            // test set is used every epoch
+            boolean tested = false;
+            //testIter.reset();
+            while (testIter.hasNext()) {
+                tested = true;
+                val ds = testIter.next();
+                assertNotNull(ds);
+
+                assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures().getDouble(0), 1e-5f);
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", tested);
+            assertEquals(900, globalIter);
+
+            // validation set is used every 5 epochs
+            if (e % 5 == 0) {
+                boolean validated = false;
+                //validationIter.reset();
+                while (validationIter.hasNext()) {
+                    validated = true;
+                    val ds = validationIter.next();
+                    assertNotNull(ds);
+
+                    assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures().getDouble(0), 1e-5f);
+                    globalIter++;
+                }
+                assertTrue("Failed at epoch [" + e + "]", validated);
+            }
+
+            // all 3 iterators have exactly 1000 elements combined
+            if (e % 5 == 0)
+                assertEquals(1000, globalIter);
+            else
+                assertEquals(900, globalIter);
+            trainIter.reset();
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_1() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new DataSetIteratorSplitter(back, new int[]{500, 500});
+
+        List<DataSetIterator> iteratorList = splitter.getIterators();
+        val numEpochs = 10;
+
+        int global = 0;
+        // emulating epochs here
+        for (int e = 0; e < numEpochs; e++) {
+
+            // Get data from second part, then rewind for the first one.
+            int cnt = 0;
+            int partNumber = 1;
+            while (iteratorList.get(partNumber).hasNext()) {
+                int farCnt = (1000 / 2) * (partNumber) + cnt;
+                val data = iteratorList.get(partNumber).next().getFeatures();
+
+                assertEquals("Train failed on iteration " + cnt + "; epoch: " + e, (float) farCnt, data.getFloat(0), 1e-5);
+                cnt++;
+                global++;
+            }
+            iteratorList.get(partNumber).reset();
+            partNumber = 0;
+            cnt = 0;
+            while (iteratorList.get(0).hasNext()) {
+                val data = iteratorList.get(0).next().getFeatures();
+
+                assertEquals("Train failed on iteration " + cnt + "; epoch: " + e, (float) cnt++, data.getFloat(0), 1e-5);
+                global++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_2() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new DataSetIteratorSplitter(back, new int[]{2});
+
+        List<DataSetIterator> iteratorList = splitter.getIterators();
+
+        for (int partNumber = 0 ; partNumber < iteratorList.size(); ++partNumber) {
+            int cnt = 0;
+            while (iteratorList.get(partNumber).hasNext()) {
+                val data = iteratorList.get(partNumber).next().getFeatures();
+
+                assertEquals("Train failed on iteration " + cnt, (float) (500*partNumber + cnt), data.getFloat(0), 1e-5);
+                cnt++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_3() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new DataSetIteratorSplitter(back, new int[]{10});
+
+        List<DataSetIterator> iteratorList = splitter.getIterators();
+        Random random = new Random();
+        int[] indexes = new int[iteratorList.size()];
+        for (int i = 0; i < indexes.length; ++i) {
+            indexes[i] = random.nextInt(iteratorList.size());
+        }
+
+        for (int partNumber : indexes) {
+            int cnt = 0;
+            while (iteratorList.get(partNumber).hasNext()) {
+                val data = iteratorList.get(partNumber).next().getFeatures();
+
+                assertEquals("Train failed on iteration " + cnt, (float) (500*partNumber + cnt), data.getFloat(0), 1e-5);
+                cnt++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_4() {
+        val back = new DataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        // we're going to mimic train+test+validation split
+        val splitter = new DataSetIteratorSplitter(back, new int[]{80, 10, 5});
+
+        assertEquals(3, splitter.getIterators().size());
+
+        val trainIter = splitter.getIterators().get(0);  // 0..79
+        val testIter = splitter.getIterators().get(1);   // 80 ..89
+        val validationIter = splitter.getIterators().get(2); // 90..94
+
+        // we're skipping train/test and going for validation first. we're that crazy, right.
+        int valCnt = 0;
+        while (validationIter.hasNext()) {
+            val ds = validationIter.next();
+            assertNotNull(ds);
+
+            assertEquals("Validation failed on iteration " + valCnt, (float) valCnt + 90, ds.getFeatures().getFloat(0), 1e-5);
+            valCnt++;
+        }
+        assertEquals(5, valCnt);
+    }
 }
--- a/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/datasets/iterator/MultiDataSetSplitterTests.java
+++ b/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/datasets/iterator/MultiDataSetSplitterTests.java
@ -18,11 +18,17 @@ package org.deeplearning4j.datasets.iterator;

 import lombok.val;
 import org.deeplearning4j.BaseDL4JTest;
+import org.deeplearning4j.datasets.iterator.tools.DataSetGenerator;
 import org.deeplearning4j.datasets.iterator.tools.MultiDataSetGenerator;
 import org.junit.Test;
+import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
+import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;
 import org.nd4j.linalg.exception.ND4JIllegalStateException;

-import static org.junit.Assert.assertEquals;
+import java.util.List;
+import java.util.Random;
+
+import static org.junit.Assert.*;

 /**
 *
@ -150,4 +156,309 @@ public class MultiDataSetSplitterTests extends BaseDL4JTest {

        assertEquals(1000 * numEpochs, global);
    }
+
+    @Test
+    public void testMultiSplitter_1() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        // we're going to mimic train+test+validation split
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{800, 100, 100});
+
+        assertEquals(3, splitter.getIterators().size());
+
+        val trainIter = splitter.getIterators().get(0);
+        val testIter = splitter.getIterators().get(1);
+        val validationIter = splitter.getIterators().get(2);
+
+        // we're going to have multiple epochs
+        int numEpochs = 10;
+        for (int e = 0; e < numEpochs; e++) {
+            int globalIter = 0;
+            trainIter.reset();
+            testIter.reset();
+            validationIter.reset();
+
+            boolean trained = false;
+            while (trainIter.hasNext()) {
+                trained = true;
+                val ds = trainIter.next();
+                assertNotNull(ds);
+
+                for (int i = 0; i < ds.getFeatures().length; ++i) {
+                    assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures()[i].getDouble(0), 1e-5f);
+                }
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", trained);
+            assertEquals(800, globalIter);
+
+
+            // test set is used every epoch
+            boolean tested = false;
+            //testIter.reset();
+            while (testIter.hasNext()) {
+                tested = true;
+                val ds = testIter.next();
+                assertNotNull(ds);
+
+                for (int i = 0; i < ds.getFeatures().length; ++i) {
+                    assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures()[i].getDouble(0), 1e-5f);
+                }
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", tested);
+            assertEquals(900, globalIter);
+
+            // validation set is used every 5 epochs
+            if (e % 5 == 0) {
+                boolean validated = false;
+                //validationIter.reset();
+                while (validationIter.hasNext()) {
+                    validated = true;
+                    val ds = validationIter.next();
+                    assertNotNull(ds);
+
+                    for (int i = 0; i < ds.getFeatures().length; ++i) {
+                        assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures()[i].getDouble(0), 1e-5f);
+                    }
+                    globalIter++;
+                }
+                assertTrue("Failed at epoch [" + e + "]", validated);
+            }
+
+            // all 3 iterators have exactly 1000 elements combined
+            if (e % 5 == 0)
+                assertEquals(1000, globalIter);
+            else
+                assertEquals(900, globalIter);
+            trainIter.reset();
+        }
+    }
+
+    @Test
+    public void testSplitter_5() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{900, 100});
+
+        List<MultiDataSetIterator> iteratorList = splitter.getIterators();
+        val numEpochs = 10;
+
+        int global = 0;
+        // emulating epochs here
+        for (int e = 0; e < numEpochs; e++) {
+            int iterNo = 0;
+            int perEpoch = 0;
+            for (val partIterator : iteratorList) {
+                partIterator.reset();
+                while (partIterator.hasNext()) {
+                    int cnt = 0;
+                    val data = partIterator.next().getFeatures();
+
+                    for (int i = 0; i < data.length; ++i) {
+                        assertEquals("Train failed on iteration " + perEpoch + "; epoch: " + e,
+                                (float) perEpoch, data[i].getFloat(0), 1e-5);
+                    }
+                    //gcntTrain++;
+                    global++;
+                    cnt++;
+                    ++perEpoch;
+                }
+                ++iterNo;
+            }
+        }
+
+        assertEquals(1000 * numEpochs, global);
+    }
+
+    @Test
+    public void testSplitter_6() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        // we're going to mimic train+test+validation split
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{800, 100, 100});
+
+        assertEquals(3, splitter.getIterators().size());
+
+        val trainIter = splitter.getIterators().get(0);
+        val testIter = splitter.getIterators().get(1);
+        val validationIter = splitter.getIterators().get(2);
+
+        // we're going to have multiple epochs
+        int numEpochs = 10;
+        for (int e = 0; e < numEpochs; e++) {
+            int globalIter = 0;
+            trainIter.reset();
+            testIter.reset();
+            validationIter.reset();
+
+            boolean trained = false;
+            while (trainIter.hasNext()) {
+                trained = true;
+                val ds = trainIter.next();
+                assertNotNull(ds);
+
+                for (int i = 0; i < ds.getFeatures().length; ++i) {
+                    assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter,
+                            ds.getFeatures()[i].getDouble(0), 1e-5f);
+                }
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", trained);
+            assertEquals(800, globalIter);
+
+
+            // test set is used every epoch
+            boolean tested = false;
+            //testIter.reset();
+            while (testIter.hasNext()) {
+                tested = true;
+                val ds = testIter.next();
+                assertNotNull(ds);
+                for (int i = 0; i < ds.getFeatures().length; ++i) {
+                    assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter, ds.getFeatures()[i].getDouble(0), 1e-5f);
+                }
+                globalIter++;
+            }
+            assertTrue("Failed at epoch [" + e + "]", tested);
+            assertEquals(900, globalIter);
+
+            // validation set is used every 5 epochs
+            if (e % 5 == 0) {
+                boolean validated = false;
+                //validationIter.reset();
+                while (validationIter.hasNext()) {
+                    validated = true;
+                    val ds = validationIter.next();
+                    assertNotNull(ds);
+
+                    for (int i = 0; i < ds.getFeatures().length; ++i) {
+                        assertEquals("Failed at iteration [" + globalIter + "]", (double) globalIter,
+                                ds.getFeatures()[i].getDouble(0), 1e-5f);
+                    }
+                    globalIter++;
+                }
+                assertTrue("Failed at epoch [" + e + "]", validated);
+            }
+
+            // all 3 iterators have exactly 1000 elements combined
+            if (e % 5 == 0)
+                assertEquals(1000, globalIter);
+            else
+                assertEquals(900, globalIter);
+            trainIter.reset();
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_1() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{500, 500});
+
+        List<MultiDataSetIterator> iteratorList = splitter.getIterators();
+        val numEpochs = 10;
+
+        int global = 0;
+        // emulating epochs here
+        for (int e = 0; e < numEpochs; e++) {
+
+            // Get data from second part, then rewind for the first one.
+            int cnt = 0;
+            int partNumber = 1;
+            while (iteratorList.get(partNumber).hasNext()) {
+                int farCnt = (1000 / 2) * (partNumber) + cnt;
+                val data = iteratorList.get(partNumber).next().getFeatures();
+                for (int i = 0; i < data.length; ++i) {
+                    assertEquals("Train failed on iteration " + cnt + "; epoch: " + e, (float) farCnt, data[i].getFloat(0), 1e-5);
+                }
+                cnt++;
+                global++;
+            }
+            iteratorList.get(partNumber).reset();
+            partNumber = 0;
+            cnt = 0;
+            while (iteratorList.get(0).hasNext()) {
+                val data = iteratorList.get(0).next().getFeatures();
+                for (int i = 0; i < data.length; ++i) {
+                    assertEquals("Train failed on iteration " + cnt + "; epoch: " + e, (float) cnt, data[i].getFloat(0), 1e-5);
+                }
+                cnt++;
+                global++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_2() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{2});
+
+        List<MultiDataSetIterator> iteratorList = splitter.getIterators();
+
+        for (int partNumber = 0 ; partNumber < iteratorList.size(); ++partNumber) {
+            int cnt = 0;
+            while (iteratorList.get(partNumber).hasNext()) {
+                val data = iteratorList.get(partNumber).next().getFeatures();
+                for (int i = 0; i < data.length; ++i) {
+                    assertEquals("Train failed on iteration " + cnt, (float) (500 * partNumber + cnt), data[i].getFloat(0), 1e-5);
+                }
+                cnt++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_3() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{10});
+
+        List<MultiDataSetIterator> iteratorList = splitter.getIterators();
+        Random random = new Random();
+        int[] indexes = new int[iteratorList.size()];
+        for (int i = 0; i < indexes.length; ++i) {
+            indexes[i] = random.nextInt(iteratorList.size());
+        }
+
+        for (int partNumber : indexes) {
+            int cnt = 0;
+            while (iteratorList.get(partNumber).hasNext()) {
+                val data = iteratorList.get(partNumber).next().getFeatures();
+                for (int i = 0; i < data.length; ++i) {
+                    assertEquals("Train failed on iteration " + cnt, (float) (500 * partNumber + cnt),
+                            data[i].getFloat(0), 1e-5);
+                }
+                cnt++;
+            }
+        }
+    }
+
+    @Test
+    public void testUnorderedSplitter_4() {
+        val back = new MultiDataSetGenerator(1000, new int[]{32, 100}, new int[]{32, 5});
+
+        // we're going to mimic train+test+validation split
+        val splitter = new MultiDataSetIteratorSplitter(back, new int[]{80, 10, 5});
+
+        assertEquals(3, splitter.getIterators().size());
+
+        val trainIter = splitter.getIterators().get(0);  // 0..79
+        val testIter = splitter.getIterators().get(1);   // 80 ..89
+        val validationIter = splitter.getIterators().get(2); // 90..94
+
+        // we're skipping train/test and going for validation first. we're that crazy, right.
+        int valCnt = 0;
+        while (validationIter.hasNext()) {
+            val ds = validationIter.next();
+            assertNotNull(ds);
+            for (int i = 0; i < ds.getFeatures().length; ++i) {
+                assertEquals("Validation failed on iteration " + valCnt, (float) valCnt + 90,
+                        ds.getFeatures()[i].getFloat(0), 1e-5);
+            }
+            valCnt++;
+        }
+        assertEquals(5, valCnt);
+    }
 }
--- a/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/nn/layers/recurrent/TestRnnLayers.java
+++ b/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/nn/layers/recurrent/TestRnnLayers.java
@ -24,6 +24,7 @@ import org.deeplearning4j.nn.conf.dropout.TestDropout;
 import org.deeplearning4j.nn.conf.layers.GravesLSTM;
 import org.deeplearning4j.nn.conf.layers.LSTM;
 import org.deeplearning4j.nn.conf.layers.Layer;
+import org.deeplearning4j.nn.conf.layers.RnnLossLayer;
 import org.deeplearning4j.nn.conf.layers.RnnOutputLayer;
 import org.deeplearning4j.nn.conf.layers.recurrent.SimpleRnn;
 import org.deeplearning4j.nn.multilayer.MultiLayerNetwork;
@ -196,4 +197,43 @@ public class TestRnnLayers extends BaseDL4JTest {
        }
    }

+    @Test
+    public void testMismatchedInputLabelLength(){
+
+        for( int i=0; i<2; i++ ){
+
+            NeuralNetConfiguration.ListBuilder lb = new NeuralNetConfiguration.Builder()
+
+                    .list()
+                    .layer(new SimpleRnn.Builder().nIn(5).nOut(5).build());
+
+            switch (i){
+                case 0:
+                    lb.layer(new RnnOutputLayer.Builder().activation(Activation.SOFTMAX).lossFunction(LossFunctions.LossFunction.MCXENT).nIn(5).nOut(5).build());
+                    break;
+                case 1:
+                    lb.layer(new RnnLossLayer.Builder().activation(Activation.SOFTMAX).lossFunction(LossFunctions.LossFunction.MCXENT).build());
+                    break;
+                default:
+                    throw new RuntimeException();
+            }
+
+            MultiLayerConfiguration conf = lb.build();
+            MultiLayerNetwork net = new MultiLayerNetwork(conf);
+            net.init();
+
+            INDArray in = Nd4j.rand(DataType.FLOAT, 3, 5, 5);
+            INDArray l = TestUtils.randomOneHotTimeSeries(3, 5, 10);
+
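+            boolean threw = false;
+            try{
+                net.fit(in,l);
+            } catch (Throwable t){
+                threw = true;
+                String msg = t.getMessage();
+                assertTrue(msg, msg.contains("sequence length") && msg.contains("input") && msg.contains("label"));
+            }
+            assertTrue("Expected an exception for mismatched input/label sequence lengths", threw);
+        }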
+    }
 }
--- a/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/plot/BarnesHutTsneTest.java
+++ b/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/plot/BarnesHutTsneTest.java
@ -249,7 +249,6 @@ public class BarnesHutTsneTest extends BaseDL4JTest {
    }

    @Test
-    @Ignore("AB 2019/05/31 - Failing on CI and locally - see issues 7820 and 7657")
    public void testCorrectness1() {
        DataTypeUtil.setDTypeForContext(DataType.DOUBLE);
        Nd4j.getRandom().setSeed(123);
@ -270,30 +269,18 @@ public class BarnesHutTsneTest extends BaseDL4JTest {
                .useAdaGrad(false).build();

        b.fit(data);
-        System.out.println(b.getData());

-        /*double[] expectedData = new double[]{15.5392794313924, 19.25226403656672, -5.194955746137196, -31.787679714614757, 48.8674725273665,
-                24.92775755686273, -22.621939920239065, -29.790772278125395, 19.027362415188914, -16.013800175884274,
-                -27.454680593309185, 1.2929960811295493, -40.45000061571038, 61.23261682914338, 5.62278768938746,
-                -28.16665244970911, -20.05502814088798, 12.803274346870865, -24.877262522905497, 45.115883138175874,
-                21.597495694710616, 18.63254779638783, -4.029728632528419, -0.4596087279592638, -42.35340705500429,
-                -69.24727547461491, 40.94332685199673, -24.60866142208024, 17.689874972878723, -3.6779759693605314,
-                -30.91803590368529, 10.645452930824145, 36.58583235020565, -64.74975614289316, -39.364099390585956,
-                72.54886481127016, -35.30663155696714, 19.37116912936714, -7.790876543092118, 19.6586396288508,
-                58.1332709511154, -18.49217368496203, -3.5050200971182424, 5.662891294031322, 39.69533295638775,
-                -15.114610550011662, -32.42366951357609, 17.039297537056537, 42.25610885633673, -2.7013781552769904,
-                -16.338582630617925, 41.734027526336874, 20.941332646863426, -3.2145240561108244, -45.36033539684912};*/
-        double[] expectedData = {40.93810899235225, 50.90183660191448, -14.298857560948981, -86.2012232604988, 129.51281793466023,
-                66.29136854264247, -61.650213611972326, -80.42836756633497, 50.28325210727952, -44.29008119040566,
-                -74.82748570869279, 2.0170536250746807, -109.21462846594635, 162.3973196127918, 14.000621153511705,
-                -76.30892822919527, -54.251704596942275, 33.99763310539589, -67.6307009607032, 119.50868525237786,
-                57.17786598853867, 49.1489174572297, -11.25663463504983, -2.38899196609398, -114.27194947404686,
-                -185.93832011474473, 108.9022579845252, -66.14099037301474, 47.13683038425694, -10.037893631405792,
-                -83.88458799629637, 26.985651418254996, 96.68139337135332, -174.2832443285551, -106.0999118697521,
-                193.02622700008175, -94.88003359113081, 51.39502524568139, -20.96021960048648, 52.32291574424741,
-                154.33973608321477, -50.90644802585217, -10.345744416395354, 13.721222143380892, 105.2111073677489,
-                -41.339268919407345, -87.73042354938127, 45.306865238870046, 112.53877133856602, -8.44454352074299,
-                -44.660828600669056, 110.72662022978719, 55.74660833987147, -9.613556053471232, -122.19953914048916};
+        double[] expectedData = new double[]{  63.8206,   80.4013,  -19.4424, -140.4326,  198.7239,
+                                              106.1148,  -96.6273, -124.3634,   78.4174,  -83.6621,
+                                             -121.8706,    3.0888, -172.8560,  255.1262,   20.7021,
+                                             -120.7942,  -78.1829,   56.6021, -112.3294,  185.4084,
+                                               88.5330,   78.0497,  -18.8673,  -11.0155, -175.1564,
+                                            -297.8463,  174.2511, -103.8793,   72.5455,  -15.8498,
+                                            -134.5235,   42.3300,  154.0391, -280.1010, -167.9765,
+                                               306.9938, -150.9666,   83.4419,  -36.0877,   83.9992,
+                                               245.1813,  -81.5018,  -14.8430,   16.1557,  166.8651,
+                                               -65.9247, -138.1783,   72.5444,  176.3088,  -25.6732,
+                                               -69.6843,  167.3360,   87.6238,  -18.5874, -187.3806};

        INDArray expectedArray = Nd4j.createFromArray(expectedData).reshape(11,5);
        for (int i = 0; i < expectedArray.rows(); ++i)
--- a/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/util/TimeSeriesUtilsTest.java
+++ b/deeplearning4j/deeplearning4j-core/src/test/java/org/deeplearning4j/util/TimeSeriesUtilsTest.java
@ -18,6 +18,7 @@ package org.deeplearning4j.util;

 import org.deeplearning4j.BaseDL4JTest;
 import org.junit.Test;
+import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.api.ndarray.INDArray;
 import org.nd4j.linalg.factory.Nd4j;

@ -30,7 +31,7 @@ public class TimeSeriesUtilsTest extends BaseDL4JTest {

    @Test
    public void testMovingAverage() {
-        INDArray a = Nd4j.arange(0, 20);
+        INDArray a = Nd4j.arange(0, 20).castTo(DataType.DOUBLE);
        INDArray result = Nd4j.create(new double[] {1.5f, 2.5f, 3.5f, 4.5f, 5.5f, 6.5f, 7.5f, 8.5f, 9.5f, 10.5f, 11.5f,
                        12.5f, 13.5f, 14.5f, 15.5f, 16.5f, 17.5f});

--- a/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/DataSetIteratorSplitter.java
+++ b/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/DataSetIteratorSplitter.java
@ -24,6 +24,7 @@ import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
 import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
 import org.nd4j.linalg.exception.ND4JIllegalStateException;

+import java.util.ArrayList;
 import java.util.List;
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicLong;
@ -42,14 +43,20 @@ public class DataSetIteratorSplitter {
    protected DataSetIterator backedIterator;
    protected final long totalExamples;
    protected final double ratio;
+    protected final double[] ratios;
    protected final long numTrain;
    protected final long numTest;
+    protected final long numArbitrarySets;
+    protected final int[] splits;
+

    protected AtomicLong counter = new AtomicLong(0);

    protected AtomicBoolean resetPending = new AtomicBoolean(false);
    protected DataSet firstTrain = null;

+    protected int partNumber = 0;
+
    /**
     * The only constructor
     *
@ -71,17 +78,94 @@ public class DataSetIteratorSplitter {
        this.backedIterator = baseIterator;
        this.totalExamples = totalBatches;
        this.ratio = ratio;
+        this.ratios = null;
        this.numTrain = (long) (totalExamples * ratio);
        this.numTest = totalExamples - numTrain;
+        this.numArbitrarySets = 2;
+        this.splits = null;

        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
    }

+    public DataSetIteratorSplitter(@NonNull DataSetIterator baseIterator, long totalBatches, double[] ratios) {
+        for (double ratio : ratios) {
+            if (!(ratio > 0.0 && ratio < 1.0))
+                throw new ND4JIllegalStateException("Ratio value should be in range of 0.0 < X < 1.0");
+        }
+
+        if (totalBatches < 0)
+            throw new ND4JIllegalStateException("totalExamples number should be positive value");
+
+        if (!baseIterator.resetSupported())
+            throw new ND4JIllegalStateException("Underlying iterator doesn't support reset, so it can't be used for runtime-split");
+
+
+        this.backedIterator = baseIterator;
+        this.totalExamples = totalBatches;
+        this.ratio = 0.0;
+        this.ratios = ratios;
+        this.numTrain = 0; //(long) (totalExamples * ratio);
+        this.numTest = 0; //totalExamples - numTrain;
+        this.numArbitrarySets = ratios.length;
+
+        this.splits = new int[this.ratios.length];
+        for (int i = 0; i < this.splits.length; ++i) {
+            this.splits[i] = (int)(totalExamples * ratios[i]);
+        }
+
+        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
+    }
+
+    public DataSetIteratorSplitter(@NonNull DataSetIterator baseIterator, int[] splits) {
+
+        /*if (!(simpleRatio > 0.0 && simpleRatio < 1.0))
+           throw new ND4JIllegalStateException("Ratio value should be in range of 0.0 > X < 1.0");*/
+
+        int totalBatches = 0;
+        for (val v:splits)
+            totalBatches += v;
+
+        if (totalBatches < 0)
+            throw new ND4JIllegalStateException("totalExamples number should be positive value");
+
+        if (!baseIterator.resetSupported())
+            throw new ND4JIllegalStateException("Underlying iterator doesn't support reset, so it can't be used for runtime-split");
+
+
+        this.backedIterator = baseIterator;
+        this.totalExamples = totalBatches;
+        this.ratio = 0.0;
+        this.ratios = null;
+
+        this.numTrain = 0; //(long) (totalExamples * ratio);
+        this.numTest = 0; //totalExamples - numTrain;
+        this.splits = splits;
+        this.numArbitrarySets = splits.length;
+
+        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
+    }
+
+    public List<DataSetIterator> getIterators() {
+        List<DataSetIterator> retVal = new ArrayList<>();
+        int partN = 0;
+        int bottom = 0;
+        for (final int split : splits) {
+                ScrollableDataSetIterator partIterator =
+                        new ScrollableDataSetIterator(partN++, backedIterator, counter, resetPending, firstTrain,
+                                new int[]{bottom,split});
+                bottom += split;
+                retVal.add(partIterator);
+        }
+        return retVal;
+    }
+
+
    /**
     * This method returns train iterator instance
     *
     * @return
     */
+    @Deprecated
    public DataSetIterator getTrainIterator() {
        return new DataSetIterator() {
            @Override
@ -184,6 +268,7 @@ public class DataSetIteratorSplitter {
     *
     * @return
     */
+    @Deprecated
    public DataSetIterator getTestIterator() {
        return new DataSetIterator() {
            @Override
--- a/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/MultiDataSetIteratorSplitter.java
+++ b/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/MultiDataSetIteratorSplitter.java
@ -21,9 +21,12 @@ import lombok.extern.slf4j.Slf4j;
 import lombok.val;
 import org.nd4j.linalg.dataset.api.MultiDataSet;
 import org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor;
+import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
 import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;
 import org.nd4j.linalg.exception.ND4JIllegalStateException;

+import java.util.ArrayList;
+import java.util.List;
 import java.util.concurrent.atomic.AtomicBoolean;
 import java.util.concurrent.atomic.AtomicLong;

@ -43,6 +46,9 @@ public class MultiDataSetIteratorSplitter {
    protected final double ratio;
    protected final long numTrain;
    protected final long numTest;
+    protected final double[] ratios;
+    protected final long numArbitrarySets;
+    protected final int[] splits;

    protected AtomicLong counter = new AtomicLong(0);

@ -71,15 +77,87 @@ public class MultiDataSetIteratorSplitter {
        this.ratio = ratio;
        this.numTrain = (long) (totalExamples * ratio);
        this.numTest = totalExamples - numTrain;
+        this.ratios = null;
+        this.numArbitrarySets = 0;
+        this.splits = null;

        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
    }

+    public MultiDataSetIteratorSplitter(@NonNull MultiDataSetIterator baseIterator, long totalBatches, double[] ratios) {
+        for (double ratio : ratios) {
+            if (!(ratio > 0.0 && ratio < 1.0))
+                throw new ND4JIllegalStateException("Ratio value should be in range of 0.0 < X < 1.0");
+        }
+
+        if (totalBatches < 0)
+            throw new ND4JIllegalStateException("totalExamples number should be positive value");
+
+        if (!baseIterator.resetSupported())
+            throw new ND4JIllegalStateException("Underlying iterator doesn't support reset, so it can't be used for runtime-split");
+
+
+        this.backedIterator = baseIterator;
+        this.totalExamples = totalBatches;
+        this.ratio = 0.0;
+        this.numTrain = (long) (totalExamples * ratio);
+        this.numTest = totalExamples - numTrain;
+        this.ratios = ratios;
+        this.numArbitrarySets = ratios.length;
+
+        this.splits = new int[this.ratios.length];
+        for (int i = 0; i < this.splits.length; ++i) {
+            this.splits[i] = (int)(totalExamples * ratios[i]);
+        }
+
+        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
+    }
+
+    public MultiDataSetIteratorSplitter(@NonNull MultiDataSetIterator baseIterator, int[] splits) {
+
+        int totalBatches = 0;
+        for (val v:splits)
+            totalBatches += v;
+
+        if (totalBatches < 0)
+            throw new ND4JIllegalStateException("totalExamples number should be positive value");
+
+        if (!baseIterator.resetSupported())
+            throw new ND4JIllegalStateException("Underlying iterator doesn't support reset, so it can't be used for runtime-split");
+
+
+        this.backedIterator = baseIterator;
+        this.totalExamples = totalBatches;
+        this.ratio = 0.0;
+        this.numTrain = (long) (totalExamples * ratio);
+        this.numTest = totalExamples - numTrain;
+        this.ratios = null;
+        this.numArbitrarySets = splits.length;
+        this.splits = splits;
+
+        log.warn("IteratorSplitter is used: please ensure you don't use randomization/shuffle in underlying iterator!");
+    }
+
+    public List<MultiDataSetIterator> getIterators() {
+        List<MultiDataSetIterator> retVal = new ArrayList<>();
+        int partN = 0;
+        int bottom = 0;
+        for (final int split : splits) {
+            ScrollableMultiDataSetIterator partIterator =
+                    new ScrollableMultiDataSetIterator(partN++, backedIterator, counter, firstTrain,
+                            new int[]{bottom,split});
+            bottom += split;
+            retVal.add(partIterator);
+        }
+        return retVal;
+    }
+
    /**
     * This method returns train iterator instance
     *
     * @return
     */
+    @Deprecated
    public MultiDataSetIterator getTrainIterator() {
        return new MultiDataSetIterator() {
            @Override
@@ -162,6 +240,7 @@ public class MultiDataSetIteratorSplitter {
     *
     * @return
     */
+    @Deprecated
    public MultiDataSetIterator getTestIterator() {
        return new MultiDataSetIterator() {
            @Override
--- a/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/ScrollableDataSetIterator.java
+++ b/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/ScrollableDataSetIterator.java
@@ -0,0 +1,158 @@
+package org.deeplearning4j.datasets.iterator;
+
+import lombok.val;
+import org.nd4j.linalg.dataset.DataSet;
+import org.nd4j.linalg.dataset.MultiDataSet;
+import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
+import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
+import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;
+
+import java.util.List;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicLong;
+
+public class ScrollableDataSetIterator implements DataSetIterator {
+    private int thisPart = 0;
+    private int top = 0;
+    private int bottom = 0;
+    protected DataSetIterator backedIterator;
+    protected AtomicLong counter = new AtomicLong(0);
+
+    protected AtomicBoolean resetPending = new AtomicBoolean(false);
+    protected DataSet firstTrain = null;
+    protected MultiDataSet firstMultiTrain = null;
+    private double ratio;
+    private long totalExamples;
+    private long itemsPerPart;
+    private long current;
+
+
+    public ScrollableDataSetIterator(int num, DataSetIterator backedIterator, AtomicLong counter,
+                                     AtomicBoolean resetPending, DataSet firstTrain, double ratio,
+                                     int totalExamples) {
+        this.thisPart = num;
+        this.backedIterator = backedIterator;
+        this.counter = counter;
+        this.resetPending = resetPending;
+        this.firstTrain = firstTrain;
+        this.ratio = ratio;
+        this.totalExamples = totalExamples;
+        this.itemsPerPart = (long)(totalExamples * ratio);
+        this.current = 0;
+    }
+
+    public ScrollableDataSetIterator(int num, DataSetIterator backedIterator, AtomicLong counter,
+                                     AtomicBoolean resetPending, DataSet firstTrain,
+                                     int[] itemsPerPart) {
+        this.thisPart = num;
+        this.bottom = itemsPerPart[0];
+        this.top = bottom + itemsPerPart[1];
+        this.itemsPerPart = top;
+
+        this.backedIterator = backedIterator;
+        this.counter = counter;
+        //this.resetPending = resetPending;
+        this.firstTrain = firstTrain;
+        //this.totalExamples = totalExamples;
+        this.current = 0;
+    }
+
+    @Override
+    public DataSet next(int i) {
+        throw new UnsupportedOperationException();
+    }
+
+    @Override
+    public List<String> getLabels() {
+        return backedIterator.getLabels();
+    }
+
+    @Override
+    public int inputColumns() {
+        return backedIterator.inputColumns();
+    }
+
+    @Override
+    public void remove() {
+        throw new UnsupportedOperationException();
+    }
+
+    @Override
+    public int totalOutcomes() {
+        return backedIterator.totalOutcomes();
+    }
+
+    @Override
+    public boolean resetSupported() {
+        return backedIterator.resetSupported();
+    }
+
+    @Override
+    public boolean asyncSupported() {
+        return backedIterator.asyncSupported();
+    }
+
+    @Override
+    public void reset() {
+        resetPending.set(true);
+    }
+
+    @Override
+    public int batch() {
+        return backedIterator.batch();
+    }
+
+    @Override
+    public void setPreProcessor(DataSetPreProcessor dataSetPreProcessor) {
+        backedIterator.setPreProcessor(dataSetPreProcessor);
+    }
+
+    @Override
+    public DataSetPreProcessor getPreProcessor() {
+
+        return backedIterator.getPreProcessor();
+    }
+
+
+    @Override
+    public boolean hasNext() {
+        if (resetPending.get()) {
+            if (resetSupported()) {
+                backedIterator.reset();
+                counter.set(0);
+                current = 0;
+                resetPending.set(false);
+            } else
+                throw new UnsupportedOperationException("Reset isn't supported by underlying iterator");
+        }
+
+        boolean state = false;
+        if (current >= top)
+            return false;
+        state = backedIterator.hasNext();
+        if (!state)
+            return false;
+        if (state && counter.get() < itemsPerPart)
+            return true;
+        else
+            return false;
+
+    }
+
+    @Override
+    public DataSet next() {
+        counter.incrementAndGet();
+        if ((current == 0) && (bottom != 0)) {
+            backedIterator.reset();
+            long cnt = current;
+            for (; cnt < bottom; ++cnt) {
+                if (backedIterator.hasNext())
+                    backedIterator.next();
+            }
+            current = cnt+1;
+        }
+        else current++;
+        val p = backedIterator.next();
+        return p;
+    }
+}
--- a/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/ScrollableMultiDataSetIterator.java
+++ b/deeplearning4j/deeplearning4j-data/deeplearning4j-utility-iterators/src/main/java/org/deeplearning4j/datasets/iterator/ScrollableMultiDataSetIterator.java
@@ -0,0 +1,121 @@
+package org.deeplearning4j.datasets.iterator;
+
+import lombok.val;
+import org.nd4j.linalg.dataset.DataSet;
+import org.nd4j.linalg.dataset.api.MultiDataSet;
+import org.nd4j.linalg.dataset.api.DataSetPreProcessor;
+import org.nd4j.linalg.dataset.api.MultiDataSetPreProcessor;
+import org.nd4j.linalg.dataset.api.iterator.DataSetIterator;
+import org.nd4j.linalg.dataset.api.iterator.MultiDataSetIterator;
+
+import javax.naming.OperationNotSupportedException;
+import java.util.List;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicLong;
+
+public class ScrollableMultiDataSetIterator implements MultiDataSetIterator {
+    private int thisPart = 0;
+    private int top = 0;
+    private int bottom = 0;
+    protected MultiDataSetIterator backedIterator;
+    protected AtomicLong counter = new AtomicLong(0);
+
+    protected AtomicBoolean resetPending = new AtomicBoolean(false);
+    protected DataSet firstTrain = null;
+    protected MultiDataSet firstMultiTrain = null;
+    private double ratio;
+    private long totalExamples;
+    private long itemsPerPart;
+    private long current;
+
+    public ScrollableMultiDataSetIterator(int num, MultiDataSetIterator backedIterator, AtomicLong counter,
+                                     MultiDataSet firstTrain,  int[] itemsPerPart) {
+        this.thisPart = num;
+        this.bottom = itemsPerPart[0];
+        this.top = bottom + itemsPerPart[1];
+        this.itemsPerPart = top;
+
+        this.counter = counter;
+        //this.resetPending = resetPending;
+        this.firstTrain = null;
+        this.firstMultiTrain = firstTrain;
+        //this.totalExamples = totalExamples;
+        this.current = 0;
+        this.backedIterator = backedIterator;
+        // No resetPending parameter here: the field keeps its own AtomicBoolean(false)
+    }
+
+    @Override
+    public boolean resetSupported() {
+        return backedIterator.resetSupported();
+    }
+
+    @Override
+    public boolean asyncSupported() {
+        return backedIterator.asyncSupported();
+    }
+
+    @Override
+    public void reset() {
+        resetPending.set(true);
+    }
+
+    @Override
+    public void setPreProcessor(MultiDataSetPreProcessor dataSetPreProcessor) {
+        backedIterator.setPreProcessor(dataSetPreProcessor);
+    }
+
+    @Override
+    public MultiDataSetPreProcessor getPreProcessor() {
+
+        return backedIterator.getPreProcessor();
+    }
+
+
+    @Override
+    public boolean hasNext() {
+        if (resetPending.get()) {
+            if (resetSupported()) {
+                backedIterator.reset();
+                counter.set(0);
+                current = 0;
+                resetPending.set(false);
+            } else
+                throw new UnsupportedOperationException("Reset isn't supported by underlying iterator");
+        }
+
+        boolean state = false;
+        if (current >= top)
+            return false;
+        state = backedIterator.hasNext();
+        if (!state)
+            return false;
+        if (state && counter.get() < itemsPerPart)
+            return true;
+        else
+            return false;
+
+    }
+
+    @Override
+    public MultiDataSet next() {
+        counter.incrementAndGet();
+        if ((current == 0) && (bottom != 0)) {
+            backedIterator.reset();
+            long cnt = current;
+            for (; cnt < bottom; ++cnt) {
+                if (backedIterator.hasNext())
+                    backedIterator.next();
+            }
+            current = cnt+1;
+        }
+        else current++;
+        val p = backedIterator.next();
+        return p;
+    }
+
+    @Override
+    public MultiDataSet next(int i) {
+        throw new UnsupportedOperationException();
+    }
+}
--- a/deeplearning4j/deeplearning4j-modelimport/src/main/java/org/deeplearning4j/nn/modelimport/keras/Hdf5Archive.java
+++ b/deeplearning4j/deeplearning4j-modelimport/src/main/java/org/deeplearning4j/nn/modelimport/keras/Hdf5Archive.java
@@ -47,6 +47,8 @@ import static org.bytedeco.hdf5.global.hdf5.*;
@Slf4j
 public class Hdf5Archive implements Closeable {

+    public static final int MAX_BUFFER_SIZE_BYTES = (int)Math.pow(2, 28);       //256 MB
+
    /**
     * HDF5 library is not thread safe - possible to crash if multiple reads etc are performed concurrently
     * in multiple threads. This object is used for locking read etc activity using synchronized blocks
@@ -338,7 +340,7 @@ public class Hdf5Archive implements Closeable {
    private String readAttributeAsJson(Attribute attribute) throws UnsupportedKerasConfigurationException {
        synchronized (Hdf5Archive.LOCK_OBJECT) {
            VarLenType vl = attribute.getVarLenType();
-            int bufferSizeMult = 1;
+            int currBufferLength = 2048;
            String s;
            /* TODO: find a less hacky way to do this.
             * Reading variable length strings (from attributes) is a giant
@@ -349,8 +351,8 @@ public class Hdf5Archive implements Closeable {
             * buffer and repeat.
             */
            while (true) {
-                byte[] attrBuffer = new byte[bufferSizeMult * 2000];
-                BytePointer attrPointer = new BytePointer(attrBuffer);
+                byte[] attrBuffer = new byte[currBufferLength];
+                BytePointer attrPointer = new BytePointer(currBufferLength);
                attribute.read(vl, attrPointer);
                attrPointer.get(attrBuffer);
                s = new String(attrBuffer);
@@ -362,9 +364,11 @@ public class Hdf5Archive implements Closeable {
                } catch (IOException e) {
                    //OK - we don't know how long the buffer needs to be, so we'll try again with larger buffer
                }
-                bufferSizeMult *= 2;
-                if (bufferSizeMult > 1024) {
-                    throw new UnsupportedKerasConfigurationException("Could not read abnormally long HDF5 attribute");
+
+                if(currBufferLength == MAX_BUFFER_SIZE_BYTES){
+                    throw new UnsupportedKerasConfigurationException("Could not read abnormally long HDF5 attribute: size exceeds " + currBufferLength + " bytes");
+                } else {
+                    currBufferLength = (int)Math.min(MAX_BUFFER_SIZE_BYTES, currBufferLength * 4L);
                }
            }
            vl.deallocate();
--- a/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/algorithm/BaseClusteringAlgorithm.java
+++ b/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/algorithm/BaseClusteringAlgorithm.java
@@ -21,6 +21,7 @@ import lombok.NoArgsConstructor;
 import lombok.extern.slf4j.Slf4j;
 import lombok.val;
 import org.apache.commons.lang3.ArrayUtils;
+import org.apache.commons.math3.ml.clustering.KMeansPlusPlusClusterer;
 import org.deeplearning4j.clustering.cluster.Cluster;
 import org.deeplearning4j.clustering.cluster.ClusterSet;
 import org.deeplearning4j.clustering.cluster.ClusterUtils;
@@ -62,12 +63,13 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
    private ClusterSet clusterSet;
    private List<Point> initialPoints;
    private transient ExecutorService exec;
+    private boolean useKmeansPlusPlus;


-
-    protected BaseClusteringAlgorithm(ClusteringStrategy clusteringStrategy) {
+    protected BaseClusteringAlgorithm(ClusteringStrategy clusteringStrategy, boolean useKmeansPlusPlus) {
        this.clusteringStrategy = clusteringStrategy;
        this.exec = MultiThreadUtils.newExecutorService();
+        this.useKmeansPlusPlus = useKmeansPlusPlus;
    }

    /**
@@ -75,8 +77,8 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
     * @param clusteringStrategy
     * @return
     */
-    public static BaseClusteringAlgorithm setup(ClusteringStrategy clusteringStrategy) {
-        return new BaseClusteringAlgorithm(clusteringStrategy);
+    public static BaseClusteringAlgorithm setup(ClusteringStrategy clusteringStrategy, boolean useKmeansPlusPlus) {
+        return new BaseClusteringAlgorithm(clusteringStrategy, useKmeansPlusPlus);
    }

    /**
@@ -86,7 +88,7 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
     */
    public ClusterSet applyTo(List<Point> points) {
        resetState(points);
-        initClusters();
+        initClusters(useKmeansPlusPlus);
        iterations();
        return clusterSet;
    }
@@ -130,7 +132,7 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
     * Initialize the
     * cluster centers at random
     */
-    protected void initClusters() {
+    protected void initClusters(boolean kMeansPlusPlus) {
        log.info("Generating initial clusters");
        List<Point> points = new ArrayList<>(initialPoints);

@@ -152,7 +154,10 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
        //Thus, we are more likely to select (as a new cluster center) a point that is far from an existing cluster
        while (clusterSet.getClusterCount() < initialClusterCount && !points.isEmpty()) {
            dxs = ClusterUtils.computeSquareDistancesFromNearestCluster(clusterSet, points, dxs, exec);
-            double r = random.nextFloat() * dxs.maxNumber().doubleValue();
+            double summed = Nd4j.sum(dxs).getDouble(0);
+            double r = kMeansPlusPlus ? random.nextDouble() * summed:
+                                        random.nextFloat() * dxs.maxNumber().doubleValue();
+
            for (int i = 0; i < dxs.length(); i++) {
                double distance = dxs.getDouble(i);
                Preconditions.checkState(distance >= 0, "Encountered negative distance: distance function is not valid? Distance " +
@@ -170,6 +175,7 @@ public class BaseClusteringAlgorithm implements ClusteringAlgorithm, Serializabl
                        new IterationInfo(currentIteration, initialClusterSetInfo));
    }

+
    protected void applyClusteringStrategy() {
        if (!isStrategyApplicableNow())
            return;
--- a/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/cluster/ClusterUtils.java
+++ b/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/cluster/ClusterUtils.java
@@ -79,8 +79,8 @@ public class ClusterUtils {
        int nClusters = clusterSet.getClusterCount();
        for (int i = 0; i < nClusters; i++) {
            final Cluster cluster = clusterSet.getClusters().get(i);
-            tasks.add(new Runnable() {
-                public void run() {
+            //tasks.add(new Runnable() {
+            //    public void run() {
                    try {
                        final ClusterInfo clusterInfo = clusterSetInfo.getClusterInfo(cluster.getId());
                        refreshClusterCenter(cluster, clusterInfo);
@@ -88,10 +88,10 @@ public class ClusterUtils {
                    } catch (Throwable t) {
                        log.warn("Error refreshing cluster centers", t);
                    }
-                }
-            });
+            //    }
+            //});
        }
-        MultiThreadUtils.parallelTasks(tasks, executorService);
+        //MultiThreadUtils.parallelTasks(tasks, executorService);
    }

    public static void refreshClusterCenter(Cluster cluster, ClusterInfo clusterInfo) {
@@ -146,28 +146,29 @@ public class ClusterUtils {
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < pointsCount; i++) {
            final int i2 = i;
-            tasks.add(new Runnable() {
-                public void run() {
+            //tasks.add(new Runnable() {
+            //    public void run() {
                    try {
                        Point point = points.get(i2);
                        double dist = clusterSet.isInverse() ? newCluster.getDistanceToCenter(point)
                                : Math.pow(newCluster.getDistanceToCenter(point), 2);
-                        dxs.putScalar(i2, clusterSet.isInverse() ? dist : dist);
+                        dxs.putScalar(i2, dist);
                    } catch (Throwable t) {
                        log.warn("Error computing squared distance from nearest cluster", t);
                    }
-                }
-            });
+            //    }
+            //});

        }

-        MultiThreadUtils.parallelTasks(tasks, executorService);
-
+        //MultiThreadUtils.parallelTasks(tasks, executorService);
        for (int i = 0; i < pointsCount; i++) {
            double previousMinDistance = previousDxs.getDouble(i);
            if (clusterSet.isInverse()) {
-                if (dxs.getDouble(i) < previousMinDistance)
+                if (dxs.getDouble(i) < previousMinDistance) {
+
                    dxs.putScalar(i, previousMinDistance);
+                }
            } else if (dxs.getDouble(i) > previousMinDistance)
                dxs.putScalar(i, previousMinDistance);
        }
@@ -175,6 +176,23 @@ public class ClusterUtils {
        return dxs;
    }

+    public static INDArray computeWeightedProbaDistancesFromNearestCluster(final ClusterSet clusterSet,
+                                                                    final List<Point> points, INDArray previousDxs) {
+        final int pointsCount = points.size();
+        final INDArray dxs = Nd4j.create(pointsCount);
+        final Cluster newCluster = clusterSet.getClusters().get(clusterSet.getClusters().size() - 1);
+
+        double sum = 0.0;
+        for (int i = 0; i < pointsCount; i++) {
+            // Store the running (cumulative) sum of squared distances to the newest cluster center
+            Point point = points.get(i);
+            double dist = Math.pow(newCluster.getDistanceToCenter(point), 2);
+            sum += dist;
+            dxs.putScalar(i, sum);
+        }
+
+        return dxs;
+    }
    /**
     *
     * @param clusterSet
@@ -194,27 +212,27 @@ public class ClusterUtils {
        List<Runnable> tasks = new ArrayList<>();
        for (int i = 0; i < clusterCount; i++) {
            final Cluster cluster = clusterSet.getClusters().get(i);
-            tasks.add(new Runnable() {
-                public void run() {
+            //tasks.add(new Runnable() {
+            //    public void run() {
                    try {
                        info.getClustersInfos().put(cluster.getId(),
                                computeClusterInfos(cluster, clusterSet.getDistanceFunction()));
                    } catch (Throwable t) {
                        log.warn("Error computing cluster set info", t);
                    }
-                }
-            });
+                //}
+            //});
        }


-        MultiThreadUtils.parallelTasks(tasks, executorService);
+        //MultiThreadUtils.parallelTasks(tasks, executorService);

-        tasks = new ArrayList<>();
+        //tasks = new ArrayList<>();
        for (int i = 0; i < clusterCount; i++) {
            final int clusterIdx = i;
            final Cluster fromCluster = clusterSet.getClusters().get(i);
-            tasks.add(new Runnable() {
-                public void run() {
+            //tasks.add(new Runnable() {
+                //public void run() {
                    try {
                        for (int k = clusterIdx + 1, l = clusterSet.getClusterCount(); k < l; k++) {
                            Cluster toCluster = clusterSet.getClusters().get(k);
@@ -230,12 +248,12 @@ public class ClusterUtils {
                    } catch (Throwable t) {
                        log.warn("Error computing distances", t);
                    }
-                }
-            });
+            //    }
+            //});

        }

-        MultiThreadUtils.parallelTasks(tasks, executorService);
+        //MultiThreadUtils.parallelTasks(tasks, executorService);

        return info;
    }
--- a/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/kmeans/KMeansClustering.java
+++ b/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/main/java/org/deeplearning4j/clustering/kmeans/KMeansClustering.java
@@ -37,8 +37,8 @@ public class KMeansClustering extends BaseClusteringAlgorithm {
     *
     * @param clusteringStrategy
     */
-    protected KMeansClustering(ClusteringStrategy clusteringStrategy) {
-        super(clusteringStrategy);
+    protected KMeansClustering(ClusteringStrategy clusteringStrategy, boolean useKMeansPlusPlus) {
+        super(clusteringStrategy, useKMeansPlusPlus);
    }

    /**
@@ -50,11 +50,11 @@ public class KMeansClustering extends BaseClusteringAlgorithm {
     * @return
     */
    public static KMeansClustering setup(int clusterCount, int maxIterationCount, Distance distanceFunction,
-                    boolean inverse) {
+                    boolean inverse, boolean useKMeansPlusPlus) {
        ClusteringStrategy clusteringStrategy =
                        FixedClusterCountStrategy.setup(clusterCount, distanceFunction, inverse);
        clusteringStrategy.endWhenIterationCountEquals(maxIterationCount);
-        return new KMeansClustering(clusteringStrategy);
+        return new KMeansClustering(clusteringStrategy, useKMeansPlusPlus);
    }

    /**
@@ -66,10 +66,10 @@ public class KMeansClustering extends BaseClusteringAlgorithm {
     * @return
     */
    public static KMeansClustering setup(int clusterCount, double minDistributionVariationRate, Distance distanceFunction,
-                    boolean inverse, boolean allowEmptyClusters) {
+                    boolean inverse, boolean allowEmptyClusters, boolean useKMeansPlusPlus) {
        ClusteringStrategy clusteringStrategy = FixedClusterCountStrategy.setup(clusterCount, distanceFunction, inverse)
                        .endWhenDistributionVariationRateLessThan(minDistributionVariationRate);
-        return new KMeansClustering(clusteringStrategy);
+        return new KMeansClustering(clusteringStrategy, useKMeansPlusPlus);
    }


@@ -81,8 +81,8 @@ public class KMeansClustering extends BaseClusteringAlgorithm {
     * @param distanceFunction the distance function to use for grouping
     * @return
     */
-    public static KMeansClustering setup(int clusterCount, int maxIterationCount, Distance distanceFunction) {
-        return setup(clusterCount, maxIterationCount, distanceFunction, false);
+    public static KMeansClustering setup(int clusterCount, int maxIterationCount, Distance distanceFunction, boolean useKMeansPlusPlus) {
+        return setup(clusterCount, maxIterationCount, distanceFunction, false, useKMeansPlusPlus);
    }

    /**
@@ -94,17 +94,17 @@ public class KMeansClustering extends BaseClusteringAlgorithm {
     * @return
     */
    public static KMeansClustering setup(int clusterCount, double minDistributionVariationRate, Distance distanceFunction,
-                    boolean allowEmptyClusters) {
+                    boolean allowEmptyClusters, boolean useKMeansPlusPlus) {
        ClusteringStrategy clusteringStrategy = FixedClusterCountStrategy.setup(clusterCount, distanceFunction, false);
        clusteringStrategy.endWhenDistributionVariationRateLessThan(minDistributionVariationRate);
-        return new KMeansClustering(clusteringStrategy);
+        return new KMeansClustering(clusteringStrategy, useKMeansPlusPlus);
    }

    public static KMeansClustering setup(int clusterCount, Distance distanceFunction,
-                                         boolean allowEmptyClusters) {
+                                         boolean allowEmptyClusters, boolean useKMeansPlusPlus) {
        ClusteringStrategy clusteringStrategy = FixedClusterCountStrategy.setup(clusterCount, distanceFunction, false);
        clusteringStrategy.endWhenDistributionVariationRateLessThan(VARIATION_TOLERANCE);
-        return new KMeansClustering(clusteringStrategy);
+        return new KMeansClustering(clusteringStrategy, useKMeansPlusPlus);
    }

 }
--- a/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/test/java/org/deeplearning4j/clustering/kmeans/KMeansTest.java
+++ b/deeplearning4j/deeplearning4j-nearestneighbors-parent/nearestneighbor-core/src/test/java/org/deeplearning4j/clustering/kmeans/KMeansTest.java
@@ -16,6 +16,7 @@

 package org.deeplearning4j.clustering.kmeans;

+import lombok.val;
 import org.apache.commons.lang3.time.StopWatch;
 import org.deeplearning4j.clustering.BaseDL4JTest;
 import org.deeplearning4j.clustering.algorithm.Distance;
@@ -28,22 +29,25 @@ import org.nd4j.linalg.factory.Nd4j;

 import java.util.List;

-import static org.junit.Assert.assertEquals;
-import static org.junit.Assert.fail;
+import static org.junit.Assert.*;

 /**
 * Created by agibsonccc on 7/2/17.
 */
 public class KMeansTest extends BaseDL4JTest {

+    private boolean[] useKMeansPlusPlus = {true, false};
+
    @Test
    public void testKMeans() {
        Nd4j.getRandom().setSeed(7);
-        KMeansClustering kMeansClustering = KMeansClustering.setup(5, 5, Distance.EUCLIDEAN);
-        List<Point> points = Point.toPoints(Nd4j.randn(5, 5));
-        ClusterSet clusterSet = kMeansClustering.applyTo(points);
-        PointClassification pointClassification = clusterSet.classifyPoint(points.get(0));
-        System.out.println(pointClassification);
+        for (boolean mode : useKMeansPlusPlus) {
+            KMeansClustering kMeansClustering = KMeansClustering.setup(5, 5, Distance.EUCLIDEAN, mode);
+            List<Point> points = Point.toPoints(Nd4j.randn(5, 5));
+            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+            PointClassification pointClassification = clusterSet.classifyPoint(points.get(0));
+            System.out.println(pointClassification);
+        }
    }

    @Test
@@ -51,20 +55,22 @@ public class KMeansTest extends BaseDL4JTest {

        Nd4j.getRandom().setSeed(7);
        int numClusters = 5;
-        KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.COSINE_DISTANCE, true);
-        List<Point> points = Point.toPoints(Nd4j.rand(5, 300));
-        ClusterSet clusterSet = kMeansClustering.applyTo(points);
-        PointClassification pointClassification = clusterSet.classifyPoint(points.get(0));
+        for (boolean mode : useKMeansPlusPlus) {
+            KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.COSINE_DISTANCE, mode);
+            List<Point> points = Point.toPoints(Nd4j.rand(5, 300));
+            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+            PointClassification pointClassification = clusterSet.classifyPoint(points.get(0));


-        KMeansClustering kMeansClusteringEuclidean = KMeansClustering.setup(numClusters, 1000, Distance.EUCLIDEAN);
-        ClusterSet clusterSetEuclidean = kMeansClusteringEuclidean.applyTo(points);
-        PointClassification pointClassificationEuclidean = clusterSetEuclidean.classifyPoint(points.get(0));
-        System.out.println("Cosine " + pointClassification);
-        System.out.println("Euclidean " + pointClassificationEuclidean);
+            KMeansClustering kMeansClusteringEuclidean = KMeansClustering.setup(numClusters, 1000, Distance.EUCLIDEAN, mode);
+            ClusterSet clusterSetEuclidean = kMeansClusteringEuclidean.applyTo(points);
+            PointClassification pointClassificationEuclidean = clusterSetEuclidean.classifyPoint(points.get(0));
+            System.out.println("Cosine " + pointClassification);
+            System.out.println("Euclidean " + pointClassificationEuclidean);

-        assertEquals(pointClassification.getCluster().getPoints().get(0),
-                        pointClassificationEuclidean.getCluster().getPoints().get(0));
+            assertEquals(pointClassification.getCluster().getPoints().get(0),
+                    pointClassificationEuclidean.getCluster().getPoints().get(0));
+        }
    }

    @Ignore
@@ -73,22 +79,24 @@ public class KMeansTest extends BaseDL4JTest {
        Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE);
        Nd4j.getRandom().setSeed(7);
        int numClusters = 20;
-        StopWatch watch = new StopWatch();
-        watch.start();
-        KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.COSINE_DISTANCE, true);
-        List<Point> points = Point.toPoints(Nd4j.linspace(0, 5000*300, 5000*300).reshape(5000,300 ));
+        for (boolean mode : useKMeansPlusPlus) {
+            StopWatch watch = new StopWatch();
+            watch.start();
+            KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.COSINE_DISTANCE, mode);
+            List<Point> points = Point.toPoints(Nd4j.linspace(0, 5000 * 300, 5000 * 300).reshape(5000, 300));

-        ClusterSet clusterSet = kMeansClustering.applyTo(points);
-        watch.stop();
-        System.out.println("Elapsed for clustering : " + watch);
+            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+            watch.stop();
+            System.out.println("Elapsed for clustering : " + watch);

-        watch.reset();
-        watch.start();
-        for (Point p : points) {
-            PointClassification pointClassification = clusterSet.classifyPoint(p);
+            watch.reset();
+            watch.start();
+            for (Point p : points) {
+                PointClassification pointClassification = clusterSet.classifyPoint(p);
+            }
+            watch.stop();
+            System.out.println("Elapsed for search: " + watch);
        }
-        watch.stop();
-        System.out.println("Elapsed for search: " + watch);
    }

    @Test
@@ -97,41 +105,43 @@ public class KMeansTest extends BaseDL4JTest {
        Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE);
        Nd4j.getRandom().setSeed(7);
        int numClusters = 20;
-        StopWatch watch = new StopWatch();
-        watch.start();
-        KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, Distance.COSINE_DISTANCE, false);
+        for (boolean mode : useKMeansPlusPlus) {
+            StopWatch watch = new StopWatch();
+            watch.start();
+            KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, Distance.COSINE_DISTANCE, false, mode);

-        List<Point> points = Point.toPoints(Nd4j.linspace(0, 10000*300, 10000*300).reshape(10000,300 ));
+            List<Point> points = Point.toPoints(Nd4j.linspace(0, 10000 * 300, 10000 * 300).reshape(10000, 300));

-        ClusterSet clusterSet = kMeansClustering.applyTo(points);
-        watch.stop();
-        System.out.println("Elapsed for clustering : " + watch);
+            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+            watch.stop();
+            System.out.println("Elapsed for clustering : " + watch);

-        watch.reset();
-        watch.start();
-        for (Point p : points) {
-            PointClassification pointClassification = clusterSet.classifyPoint(p);
+            watch.reset();
+            watch.start();
+            for (Point p : points) {
+                PointClassification pointClassification = clusterSet.classifyPoint(p);
+            }
+            watch.stop();
+            System.out.println("Elapsed for search: " + watch);
+
+            watch.reset();
+            watch.start();
+            kMeansClustering = KMeansClustering.setup(numClusters, 0.05, Distance.COSINE_DISTANCE, false, mode);
+
+            points = Point.toPoints(Nd4j.linspace(0, 10000 * 300, 10000 * 300).reshape(10000, 300));
+
+            clusterSet = kMeansClustering.applyTo(points);
+            watch.stop();
+            System.out.println("Elapsed for clustering : " + watch);
+
+            watch.reset();
+            watch.start();
+            for (Point p : points) {
+                PointClassification pointClassification = clusterSet.classifyPoint(p);
+            }
+            watch.stop();
+            System.out.println("Elapsed for search: " + watch);
        }
-        watch.stop();
-        System.out.println("Elapsed for search: " + watch);
-
-        watch.reset();
-        watch.start();
-        kMeansClustering = KMeansClustering.setup(numClusters, 0.05, Distance.COSINE_DISTANCE, false);
-
-        points = Point.toPoints(Nd4j.linspace(0, 10000*300, 10000*300).reshape(10000,300 ));
-
-        clusterSet = kMeansClustering.applyTo(points);
-        watch.stop();
-        System.out.println("Elapsed for clustering : " + watch);
-
-        watch.reset();
-        watch.start();
-        for (Point p : points) {
-            PointClassification pointClassification = clusterSet.classifyPoint(p);
-        }
-        watch.stop();
-        System.out.println("Elapsed for search: " + watch);
    }

    @Test
@@ -141,45 +151,47 @@ public class KMeansTest extends BaseDL4JTest {
            Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE);
            Nd4j.getRandom().setSeed(7);
            int numClusters = 3;
-            KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.EUCLIDEAN, true);
-            double[] data = new double[]{
-                    15, 16,
-                    16, 18.5,
-                    17, 20.2,
-                    16.4, 17.12,
-                    17.23, 18.12,
-                    43, 43,
-                    44.43, 45.212,
-                    45.8, 54.23,
-                    46.313, 43.123,
-                    50.21, 46.3,
-                    99, 99.22,
-                    100.32, 98.123,
-                    100.32, 97.423,
-                    102, 93.23,
-                    102.23, 94.23
-            };
-            List<Point> points = Point.toPoints(Nd4j.createFromArray(data).reshape(15, 2));
+            for (boolean mode : useKMeansPlusPlus) {
+                KMeansClustering kMeansClustering = KMeansClustering.setup(numClusters, 1000, Distance.EUCLIDEAN, mode);
+                double[] data = new double[]{
+                        15, 16,
+                        16, 18.5,
+                        17, 20.2,
+                        16.4, 17.12,
+                        17.23, 18.12,
+                        43, 43,
+                        44.43, 45.212,
+                        45.8, 54.23,
+                        46.313, 43.123,
+                        50.21, 46.3,
+                        99, 99.22,
+                        100.32, 98.123,
+                        100.32, 97.423,
+                        102, 93.23,
+                        102.23, 94.23
+                };
+                List<Point> points = Point.toPoints(Nd4j.createFromArray(data).reshape(15, 2));

-            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+                ClusterSet clusterSet = kMeansClustering.applyTo(points);


-            INDArray row0 = Nd4j.createFromArray(new double[]{16.6575, 18.4850});
-            INDArray row1 = Nd4j.createFromArray(new double[]{32.6050, 31.1500});
-            INDArray row2 = Nd4j.createFromArray(new double[]{75.9348, 74.1990});
+                INDArray row0 = Nd4j.createFromArray(new double[]{16.6575, 18.4850});
+                INDArray row1 = Nd4j.createFromArray(new double[]{32.6050, 31.1500});
+                INDArray row2 = Nd4j.createFromArray(new double[]{75.9348, 74.1990});

            /*List<Cluster> clusters = clusterSet.getClusters();
            assertEquals(row0, clusters.get(0).getCenter().getArray());
            assertEquals(row1, clusters.get(1).getCenter().getArray());
            assertEquals(row2, clusters.get(2).getCenter().getArray());*/

-            PointClassification pointClassification = null;
-            for (Point p : points) {
-                pointClassification = clusterSet.classifyPoint(p);
-                System.out.println("Point: " + p.getArray() + " " + " assigned to cluster: " + pointClassification.getCluster().getCenter().getArray());
-                List<Cluster> clusters = clusterSet.getClusters();
-                for (int i = 0; i < clusters.size(); ++i)
-                    System.out.println("Choice: " + clusters.get(i).getCenter().getArray());
+                PointClassification pointClassification = null;
+                for (Point p : points) {
+                    pointClassification = clusterSet.classifyPoint(p);
+                    System.out.println("Point: " + p.getArray() + " " + " assigned to cluster: " + pointClassification.getCluster().getCenter().getArray());
+                    List<Cluster> clusters = clusterSet.getClusters();
+                    for (int i = 0; i < clusters.size(); ++i)
+                        System.out.println("Choice: " + clusters.get(i).getCenter().getArray());
+                }
            }
            /*assertEquals(Nd4j.createFromArray(new double[]{75.9348, 74.1990}),
                    pointClassification.getCluster().getCenter().getArray());*/
@@ -233,4 +245,39 @@ public class KMeansTest extends BaseDL4JTest {
            System.out.println();
        }
    }
+
+    @Test
+    public void testInitClusters() {
+        Nd4j.setDefaultDataTypes(DataType.DOUBLE, DataType.DOUBLE);
+        Nd4j.getRandom().setSeed(7);
+        {
+            KMeansClustering kMeansClustering = KMeansClustering.setup(5, 1, Distance.EUCLIDEAN, true);
+
+            double[][] dataArray = {{1000000.0, 2.8E7, 5.5E7, 8.2E7}, {2.8E7, 5.5E7, 8.2E7, 1.09E8}, {5.5E7, 8.2E7, 1.09E8, 1.36E8},
+                    {8.2E7, 1.09E8, 1.36E8, 1.63E8}, {1.09E8, 1.36E8, 1.63E8, 1.9E8}, {1.36E8, 1.63E8, 1.9E8, 2.17E8},
+                    {1.63E8, 1.9E8, 2.17E8, 2.44E8}, {1.9E8, 2.17E8, 2.44E8, 2.71E8}, {2.17E8, 2.44E8, 2.71E8, 2.98E8},
+                    {2.44E8, 2.71E8, 2.98E8, 3.25E8}, {2.71E8, 2.98E8, 3.25E8, 3.52E8}, {2.98E8, 3.25E8, 3.52E8, 3.79E8},
+                    {3.25E8, 3.52E8, 3.79E8, 4.06E8}, {3.52E8, 3.79E8, 4.06E8, 4.33E8}, {3.79E8, 4.06E8, 4.33E8, 4.6E8},
+                    {4.06E8, 4.33E8, 4.6E8, 4.87E8}, {4.33E8, 4.6E8, 4.87E8, 5.14E8}, {4.6E8, 4.87E8, 5.14E8, 5.41E8},
+                    {4.87E8, 5.14E8, 5.41E8, 5.68E8}, {5.14E8, 5.41E8, 5.68E8, 5.95E8}, {5.41E8, 5.68E8, 5.95E8, 6.22E8},
+                    {5.68E8, 5.95E8, 6.22E8, 6.49E8}, {5.95E8, 6.22E8, 6.49E8, 6.76E8}, {6.22E8, 6.49E8, 6.76E8, 7.03E8},
+                    {6.49E8, 6.76E8, 7.03E8, 7.3E8}, {6.76E8, 7.03E8, 7.3E8, 7.57E8}, {7.03E8, 7.3E8, 7.57E8, 7.84E8}};
+            INDArray data = Nd4j.createFromArray(dataArray);
+            List<Point> points = Point.toPoints(data);
+
+            ClusterSet clusterSet = kMeansClustering.applyTo(points);
+
+            double[] centroid1 = {2.44e8,    2.71e8,    2.98e8,    3.25e8};
+            double[] centroid2 = {5.14e8,    5.41e8,    5.68e8,    5.95e8};
+            double[] centroid3 = {1.63e8,     1.9e8,    2.17e8,    2.44e8};
+            double[] centroid4 = {6.76e8,    7.03e8,     7.3e8,    7.57e8};
+            double[] centroid5 = {4.06e8,    4.33e8,     4.6e8,    4.87e8};
+
+            assertArrayEquals(centroid1, clusterSet.getClusters().get(0).getCenter().getArray().toDoubleVector(), 1e-4);
+            assertArrayEquals(centroid2, clusterSet.getClusters().get(1).getCenter().getArray().toDoubleVector(), 1e-4);
+            assertArrayEquals(centroid3, clusterSet.getClusters().get(2).getCenter().getArray().toDoubleVector(), 1e-4);
+            assertArrayEquals(centroid4, clusterSet.getClusters().get(3).getCenter().getArray().toDoubleVector(), 1e-4);
+            assertArrayEquals(centroid5, clusterSet.getClusters().get(4).getCenter().getArray().toDoubleVector(), 1e-4);
+        }
+    }
 }
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp-uima/src/test/java/org/deeplearning4j/models/WordVectorSerializerTest.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp-uima/src/test/java/org/deeplearning4j/models/WordVectorSerializerTest.java
@@ -23,6 +23,8 @@ import org.apache.commons.io.FileUtils;
 import org.apache.commons.lang.ArrayUtils;
 import org.apache.commons.lang3.RandomUtils;
 import org.deeplearning4j.BaseDL4JTest;
+import org.deeplearning4j.models.sequencevectors.SequenceVectors;
+import org.deeplearning4j.models.sequencevectors.serialization.VocabWordFactory;
 import org.junit.Rule;
 import org.junit.rules.TemporaryFolder;
 import org.nd4j.linalg.io.ClassPathResource;
@@ -857,4 +859,34 @@ public class WordVectorSerializerTest extends BaseDL4JTest {
        }
    }

+    @Test
+    public void testBackwardsCompatibleWord2Vec() {
+        File model_v3 = Resources.asFile("deeplearning4j-nlp/model_beta3.zip");
+        File model_v4 = Resources.asFile("deeplearning4j-nlp/model_beta4.zip");
+        Word2Vec word2Vec1 = WordVectorSerializer.readWord2VecModel(model_v3, true);
+        Word2Vec word2Vec2 = WordVectorSerializer.readWord2VecModel(model_v4, true);
+        try {
+            assertEquals(word2Vec1.toJson(), word2Vec2.toJson());
+        } catch (Exception e) {
+            fail(e.getMessage());
+        }
+    }
+
+    @Test
+    public void testBackwardsCompatibleSequenceVectors() {
+        File model_v3 = Resources.asFile("deeplearning4j-nlp/seqv_beta3.csv");
+        File model_v4 = Resources.asFile("deeplearning4j-nlp/seqv_beta4.csv");
+        try {
+            SequenceVectors vectors1 = WordVectorSerializer.readSequenceVectors(new VocabWordFactory(), model_v3);
+            SequenceVectors vectors2 = WordVectorSerializer.readSequenceVectors(new VocabWordFactory(), model_v4);
+
+            assertEquals(vectors1.vocab().numWords(), vectors2.vocab().numWords());
+            for (int i = 0; i < vectors1.vocab().numWords(); ++i) {
+                assertEquals(vectors1.vocab().words().toArray()[i], vectors2.vocab().words().toArray()[i]);
+            }
+        } catch (Exception e) {
+            fail(e.getMessage());
+        }
+    }
+
 }
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/iterator/BertIterator.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/iterator/BertIterator.java
@@ -249,7 +249,7 @@ public class BertIterator implements MultiDataSetIterator {
            } else {
                throw new RuntimeException();
            }
-            l[0] = Nd4j.create(Nd4j.defaultFloatingPointType(), mbPadded, numClasses);
+            l[0] = Nd4j.create(DataType.FLOAT, mbPadded, numClasses);
            for( int i=0; i<mb; i++ ){
                l[0].putScalar(i, classLabels[i], 1.0);
            }
@@ -277,9 +277,9 @@ public class BertIterator implements MultiDataSetIterator {
            if(unsupervisedLabelFormat == UnsupervisedLabelFormat.RANK2_IDX){
                labelArr = Nd4j.create(DataType.INT, mbPadded, outLength);
            } else if(unsupervisedLabelFormat == UnsupervisedLabelFormat.RANK3_NCL){
-                labelArr = Nd4j.create(Nd4j.defaultFloatingPointType(), mbPadded, vocabSize, outLength);
+                labelArr = Nd4j.create(DataType.FLOAT, mbPadded, vocabSize, outLength);
            } else if(unsupervisedLabelFormat == UnsupervisedLabelFormat.RANK3_LNC){
-                labelArr = Nd4j.create(Nd4j.defaultFloatingPointType(), outLength, mbPadded, vocabSize);
+                labelArr = Nd4j.create(DataType.FLOAT, outLength, mbPadded, vocabSize);
            } else {
                throw new IllegalStateException("Unknown unsupervised label format: " + unsupervisedLabelFormat);
            }
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/iterator/CnnSentenceDataSetIterator.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/iterator/CnnSentenceDataSetIterator.java
@@ -201,7 +201,7 @@ public class CnnSentenceDataSetIterator implements DataSetIterator {
        List<String> tokens = new ArrayList<>();
        while (t.hasMoreTokens()) {
            String token = t.nextToken();
-            if (!wordVectors.hasWord(token)) {
+            if (!wordVectors.outOfVocabularySupported() && !wordVectors.hasWord(token)) {
                switch (unknownWordHandling) {
                    case RemoveWord:
                        continue;
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/sequencevectors/SequenceVectors.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/sequencevectors/SequenceVectors.java
@@ -1312,10 +1312,12 @@ public class SequenceVectors<T extends SequenceElement> extends WordVectorsImpl<
                            int rest = batchSequences.size() % batchSize;
                            int chunks = ((batchSequences.size() >= batchSize) ? batchSequences.size() / batchSize : 0) + ((rest > 0)? 1 : 0);
                            for (int j = 0; j < chunks; ++j) {
-                                if (elementsLearningAlgorithm instanceof SkipGram)
-                                    ((SkipGram)elementsLearningAlgorithm).iterateSample(batchSequences.get(j));
-                                else if (elementsLearningAlgorithm instanceof CBOW)
-                                    ((CBOW)elementsLearningAlgorithm).iterateSample(batchSequences.get(j));
+                                if (trainElementsVectors) {
+                                    if (elementsLearningAlgorithm instanceof SkipGram)
+                                        ((SkipGram) elementsLearningAlgorithm).iterateSample(batchSequences.get(j));
+                                    else if (elementsLearningAlgorithm instanceof CBOW)
+                                        ((CBOW) elementsLearningAlgorithm).iterateSample(batchSequences.get(j));
+                                }

                                if (trainSequenceVectors) {
                                    if (sequenceLearningAlgorithm instanceof DBOW)
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/word2vec/VocabWord.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/main/java/org/deeplearning4j/models/word2vec/VocabWord.java
@@ -32,7 +32,7 @@ import java.io.Serializable;
 *
 * @author Adam Gibson
 */
-@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@class")
+@JsonTypeInfo(use = JsonTypeInfo.Id.CLASS, include = JsonTypeInfo.As.PROPERTY, property = "@class", defaultImpl =  VocabWord.class)
 @JsonAutoDetect(fieldVisibility = JsonAutoDetect.Visibility.ANY, getterVisibility = JsonAutoDetect.Visibility.NONE,
        setterVisibility = JsonAutoDetect.Visibility.NONE)
 public class VocabWord extends SequenceElement implements Serializable {
--- a/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/test/java/org/deeplearning4j/iterator/TestBertIterator.java
+++ b/deeplearning4j/deeplearning4j-nlp-parent/deeplearning4j-nlp/src/test/java/org/deeplearning4j/iterator/TestBertIterator.java
@@ -224,6 +224,7 @@ public class TestBertIterator extends BaseDL4JTest {

    @Test(timeout = 20000L)
    public void testMinibatchPadding() throws Exception {
+        Nd4j.setDefaultDataTypes(DataType.FLOAT, DataType.FLOAT);
        String toTokenize1 = "I saw a girl with a telescope.";
        String toTokenize2 = "Donaudampfschifffahrts Kapitänsmützeninnenfuttersaum";
        BertWordPieceTokenizerFactory t = new BertWordPieceTokenizerFactory(pathToVocab, false, false, c);
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/api/TrainingConfig.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/api/TrainingConfig.java
@@ -17,6 +17,7 @@
 package org.deeplearning4j.nn.api;

 import org.deeplearning4j.nn.conf.GradientNormalization;
+import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.learning.config.IUpdater;
 import org.nd4j.linalg.learning.regularization.Regularization;

@@ -73,4 +74,6 @@ public interface TrainingConfig {
     */
    double getGradientNormalizationThreshold();

+    void setDataType(DataType dataType);
+
 }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/graph/GraphVertex.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/graph/GraphVertex.java
@@ -93,4 +93,9 @@ public abstract class GraphVertex implements Cloneable, Serializable {
     */
    public abstract MemoryReport getMemoryReport(InputType... inputTypes);

+
+    public void setDataType(DataType dataType) {
+        //No-op for most vertices
+    }
+
 }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/graph/LayerVertex.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/graph/LayerVertex.java
@@ -146,4 +146,9 @@ public class LayerVertex extends GraphVertex {
        //TODO preprocessor memory
        return layerConf.getLayer().getMemoryReport(it);
    }
+
+    @Override
+    public void setDataType(DataType dataType){
+        layerConf.getLayer().setDataType(dataType);
+    }
 }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/Layer.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/Layer.java
@@ -223,6 +223,11 @@ public abstract class Layer implements TrainingConfig, Serializable, Cloneable {
                        "Not supported: all layers with parameters should override this method");
    }

+    @Override
+    public void setDataType(DataType dataType) {
+        //No-op for most layers
+    }
+
    /**
     * This is a report of the estimated memory consumption for the given layer
     *
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/samediff/SameDiffLambdaVertex.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/samediff/SameDiffLambdaVertex.java
@@ -96,7 +96,7 @@ public abstract class SameDiffLambdaVertex extends SameDiffVertex {

            if (!map.containsKey(inputNum)) {
                //Lazily define extra input variable as required
-                SDVariable var = sameDiff.var("var_" + inputNum, 1); //TODO is this shape safe?
+                SDVariable var = sameDiff.var("var_" + inputNum, dataType, -1); //TODO is this shape safe?
                map.put(inputNum, var);
            }

--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/samediff/SameDiffVertex.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/layers/samediff/SameDiffVertex.java
@@ -62,6 +62,7 @@ public abstract class SameDiffVertex extends GraphVertex implements TrainingConf
    protected IUpdater biasUpdater;
    protected GradientNormalization gradientNormalization;
    protected double gradientNormalizationThreshold = Double.NaN;
+    protected DataType dataType;

    /**
     * Define the vertex
@@ -234,4 +235,9 @@ public abstract class SameDiffVertex extends GraphVertex implements TrainingConf
    public double getGradientNormalizationThreshold() {
        return gradientNormalizationThreshold;
    }
+
+    @Override
+    public void setDataType(DataType dataType) {
+        this.dataType = dataType;
+    }
 }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/misc/DummyConfig.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/conf/misc/DummyConfig.java
@@ -19,6 +19,7 @@ package org.deeplearning4j.nn.conf.misc;
 import lombok.AllArgsConstructor;
 import org.deeplearning4j.nn.api.TrainingConfig;
 import org.deeplearning4j.nn.conf.GradientNormalization;
+import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.learning.config.IUpdater;
 import org.nd4j.linalg.learning.config.NoOp;
 import org.nd4j.linalg.learning.regularization.Regularization;
@@ -63,4 +64,9 @@ public class DummyConfig implements TrainingConfig {
    public double getGradientNormalizationThreshold() {
        return 1.0;
    }
+
+    @Override
+    public void setDataType(DataType dataType) {
+        //No-op: DummyConfig holds no data-type-dependent state
+    }
 }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/graph/ComputationGraph.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/graph/ComputationGraph.java
@@ -512,6 +512,7 @@ public class ComputationGraph implements Serializable, Model, NeuralNetwork {
        for(; i<topologicalOrder.length; i++ ){
            String name = indices.getIdxToName().get(i);
            org.deeplearning4j.nn.conf.graph.GraphVertex n = configVertexMap.get(name);
+            n.setDataType(netDtype);
            numParamsForVertex[i] = n.numParams(true);
            numParams += numParamsForVertex[i];
        }
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/recurrent/RnnLossLayer.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/recurrent/RnnLossLayer.java
@@ -26,6 +26,7 @@ import org.deeplearning4j.nn.gradient.DefaultGradient;
 import org.deeplearning4j.nn.gradient.Gradient;
 import org.deeplearning4j.nn.layers.BaseLayer;
 import org.deeplearning4j.util.TimeSeriesUtils;
+import org.nd4j.base.Preconditions;
 import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.api.ndarray.INDArray;
 import org.nd4j.linalg.dataset.api.DataSet;
@@ -35,6 +36,7 @@ import org.nd4j.linalg.primitives.Pair;
 import org.deeplearning4j.nn.workspace.ArrayType;
 import org.deeplearning4j.nn.workspace.LayerWorkspaceMgr;

+import java.util.Arrays;
 import java.util.List;

 /**
@@ -60,10 +62,16 @@ public class RnnLossLayer extends BaseLayer<org.deeplearning4j.nn.conf.layers.Rn
        assertInputSet(true);
        if (input.rank() != 3)
            throw new UnsupportedOperationException(
-                            "Input is not rank 3. Got input with rank " + input.rank() + " " + layerId());
+                            "Input is not rank 3. Expected rank 3 input of shape [minibatch, size, sequenceLength]. Got input with rank " +
+                                    input.rank() + " with shape " + Arrays.toString(input.shape()) + " for layer " + layerId());
        if (labels == null)
            throw new IllegalStateException("Labels are not set (null)");

+        Preconditions.checkState(labels.rank() == 3, "Expected rank 3 labels array, got label array with shape %ndShape", labels);
+        Preconditions.checkState(input.size(2) == labels.size(2), "Sequence lengths do not match for RnnLossLayer input and labels: " +
+                "Arrays should be rank 3 with shape [minibatch, size, sequenceLength] - mismatch on dimension 2 (sequence length) - input=%ndShape vs. label=%ndShape", input, labels);
+
+
        INDArray input2d = TimeSeriesUtils.reshape3dTo2d(input, workspaceMgr, ArrayType.BP_WORKING_MEM);
        INDArray labels2d = TimeSeriesUtils.reshape3dTo2d(labels, workspaceMgr, ArrayType.BP_WORKING_MEM);
        INDArray maskReshaped;
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/recurrent/RnnOutputLayer.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/recurrent/RnnOutputLayer.java
@@ -23,6 +23,7 @@ import org.deeplearning4j.nn.gradient.Gradient;
 import org.deeplearning4j.nn.layers.BaseOutputLayer;
 import org.deeplearning4j.nn.params.DefaultParamInitializer;
 import org.deeplearning4j.util.TimeSeriesUtils;
+import org.nd4j.base.Preconditions;
 import org.nd4j.linalg.api.buffer.DataType;
 import org.nd4j.linalg.api.ndarray.INDArray;
 import org.nd4j.linalg.factory.Nd4j;
@@ -57,8 +58,13 @@ public class RnnOutputLayer extends BaseOutputLayer<org.deeplearning4j.nn.conf.l
                    "Input is not rank 3. RnnOutputLayer expects rank 3 input with shape [minibatch, layerInSize, sequenceLength]." +
                            " Got input with rank " + input.rank() + " and shape " + Arrays.toString(input.shape()) + " - " + layerId());
        }
+        Preconditions.checkState(labels.rank() == 3, "Expected rank 3 labels array, got label array with shape %ndShape", labels);
+        Preconditions.checkState(input.size(2) == labels.size(2), "Sequence lengths do not match for RnnOutputLayer input and labels: " +
+                "Arrays should be rank 3 with shape [minibatch, size, sequenceLength] - mismatch on dimension 2 (sequence length) - input=%ndShape vs. label=%ndShape", input, labels);
+
        INDArray inputTemp = input;
        this.input = TimeSeriesUtils.reshape3dTo2d(input, workspaceMgr, ArrayType.BP_WORKING_MEM);
+
        Pair<Gradient, INDArray> gradAndEpsilonNext = super.backpropGradient(epsilon, workspaceMgr);    //Also applies dropout
        this.input = inputTemp;
        INDArray epsilon2d = gradAndEpsilonNext.getSecond();
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffGraphVertex.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffGraphVertex.java
@@ -39,9 +39,7 @@ import org.nd4j.linalg.api.ops.impl.layers.ExternalErrorsFunction;
 import org.nd4j.linalg.factory.Nd4j;
 import org.nd4j.linalg.primitives.Pair;

-import java.util.Arrays;
-import java.util.LinkedHashMap;
-import java.util.Map;
+import java.util.*;

 /**
 * Implementation of a SameDiff graph vertex.
@@ -96,12 +94,11 @@ public class SameDiffGraphVertex extends BaseGraphVertex {

    @Override
    public INDArray doForward(boolean training, LayerWorkspaceMgr workspaceMgr) {
-        if(sameDiff == null){
-            doInit();
-        }
-
        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()) {
-//            sameDiff.clearExecutionCache();
+            if(sameDiff == null){
+                doInit();
+            }
+
            config.validateInput(inputs);
            for(int i=0; i<inputs.length; i++ ){
                String name = config.getVertexParams().getInputs().get(i);
@@ -121,6 +118,10 @@ public class SameDiffGraphVertex extends BaseGraphVertex {
            }
            Map<String,INDArray> out = sameDiff.exec(null, outputKey);
            INDArray result = out.get(outputKey);
+
+            //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+            sameDiff.clearPlaceholders(true);
+            sameDiff.clearOpInputs();
            return workspaceMgr.dup(ArrayType.ACTIVATIONS, result);
        }
    }
@@ -131,27 +132,42 @@ public class SameDiffGraphVertex extends BaseGraphVertex {

        INDArray[] dLdIns;
        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()){
-//            sameDiff.clearExecutionCache();
+            if(sameDiff == null){
+                doInit();
+            }
+
+            if(!sameDiff.hasGradientFunction()) {
+                //Create when scoped out, to ensure any arrays are not in WS
+                List<String> inputs = config.getVertexParams().getInputs();
+                String[] inArr = inputs.toArray(new String[inputs.size()]);
+                sameDiff.createGradFunction(inArr);
+            }
            config.validateInput(inputs);
-            //Set inputs
-            for(int i=0; i<inputs.length; i++ ){
-                String name = config.getVertexParams().getInputs().get(i);
-                final String maskName = name + "_mask";
-                sameDiff.associateArrayWithVariable(inputs[i].dup(), sameDiff.getVariable(name));
-                if(maskArrays != null && maskArrays[i] != null) {
-                    sameDiff.associateArrayWithVariable(maskArrays[i].dup(), maskName);
-                }else{
-                    sameDiff.associateArrayWithVariable(createMask(dataType, inputs[i].shape()), maskName);
+            Map<String,INDArray> phMap = new HashMap<>();
+            List<String> inputs = config.getVertexParams().getInputs();
+            int i=0;
+            for(String s : inputs){
+                phMap.put(s, this.inputs[i++]);
+            }
+            if(maskArrays != null){
+                for( int j=0; j<maskArrays.length; j++ ){
+                    String name = inputs.get(j);
+                    final String maskName = name + "_mask";
+                    if(maskArrays[j] != null) {
+                        sameDiff.associateArrayWithVariable(maskArrays[j].dup(), maskName);
+                    }
                }
            }
-            fn.updateVariable(outputVar.getVarName(), epsilon.dup());
+            String epsName = fn.getGradPlaceholderName();
+            phMap.put(epsName, epsilon);
+

            for(String s : paramTable.keySet() ){
                //TODO this should only be necessary, in theory, once!
                sameDiff.associateArrayWithVariable(paramTable.get(s), s);
            }

-            sameDiff.execBackwards(null);
+            sameDiff.execBackwards(phMap);
            for(String s : paramTable.keySet() ){
                INDArray sdGrad = sameDiff.grad(s).getArr();
                INDArray dl4jGrad = gradTable.get(s);
@ -159,10 +175,10 @@ public class SameDiffGraphVertex extends BaseGraphVertex {
                g.gradientForVariable().put(s, dl4jGrad);
            }

-            dLdIns = new INDArray[inputs.length];
-            for(int i=0; i<inputs.length; i++ ){
-                String name = config.getVertexParams().getInputs().get(i);
-                dLdIns[i] = sameDiff.grad(name).getArr();
+            dLdIns = new INDArray[inputs.size()];
+            for(int j=0; j<inputs.size(); j++ ){
+                String name = inputs.get(j);
+                dLdIns[j] = sameDiff.grad(name).getArr();
            }
        }

@ -171,6 +187,9 @@ public class SameDiffGraphVertex extends BaseGraphVertex {
            dLdIns[i] = workspaceMgr.dup(ArrayType.ACTIVATION_GRAD, dLdIns[i]);
        }

+        //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+        sameDiff.clearPlaceholders(true);
+        sameDiff.clearOpInputs();
        return new Pair<>(g, dLdIns);
    }

--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffLayer.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffLayer.java
@ -35,6 +35,7 @@ import org.nd4j.linalg.factory.Nd4j;
 import org.nd4j.linalg.primitives.Pair;
 import org.deeplearning4j.nn.workspace.ArrayType;
 import org.deeplearning4j.nn.workspace.LayerWorkspaceMgr;
+import org.nd4j.linalg.util.ArrayUtil;

 import java.util.*;

@ -78,25 +79,32 @@ public class SameDiffLayer extends AbstractLayer<AbstractSameDiffLayer> {
    @Override
    public INDArray activate(boolean training, LayerWorkspaceMgr workspaceMgr) {
        assertInputSet(false);
-        if(sameDiff == null){
-            doInit();
-        }

        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()) {
+            if(sameDiff == null){
+                doInit();
+            }
+
            org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer bl = (org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer) layerConf();
            bl.validateInput(input);
-            sameDiff.associateArrayWithVariable(input.dup(), sameDiff.getVariable(INPUT_KEY));
+
+            Map<String,INDArray> phMap = new HashMap<>();
+            phMap.put(INPUT_KEY, input);
            if(maskArray != null){
-                sameDiff.associateArrayWithVariable(maskArray, sameDiff.getVariable(MASK_KEY));
-            }else{
-                sameDiff.associateArrayWithVariable(SameDiffGraphVertex.createMask(dataType, input.shape()), sameDiff.getVariable(MASK_KEY));
+                phMap.put(MASK_KEY, maskArray);
            }
+
            for(String s : paramTable.keySet() ) {
                sameDiff.associateArrayWithVariable(paramTable.get(s), s);
            }

-            Map<String,INDArray> out = sameDiff.exec(null, outputKey);
+            Map<String,INDArray> out = sameDiff.exec(phMap, outputKey);
            INDArray result = out.get(outputKey);
+
+            //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+            sameDiff.clearPlaceholders(true);
+            sameDiff.clearOpInputs();
+
            return workspaceMgr.dup(ArrayType.ACTIVATIONS, result);
        }
    }
@ -110,24 +118,36 @@ public class SameDiffLayer extends AbstractLayer<AbstractSameDiffLayer> {

        INDArray dLdIn;
        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()){
-//            sameDiff.clearExecutionCache();
+            if(sameDiff == null){
+                doInit();
+            }
+            if(!sameDiff.hasGradientFunction()) {
+                //Create when scoped out, to ensure any arrays are not in WS
+                sameDiff.createGradFunction(INPUT_KEY);
+            }
+
            org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer bl = (org.deeplearning4j.nn.conf.layers.samediff.SameDiffLayer) layerConf();
            bl.validateInput(input);

-            sameDiff.associateArrayWithVariable(input.dup(), sameDiff.getVariable(INPUT_KEY));
-            if(maskArray != null){
-                sameDiff.associateArrayWithVariable(maskArray, sameDiff.getVariable(MASK_KEY));
-            }else{
-                sameDiff.associateArrayWithVariable(SameDiffGraphVertex.createMask(dataType, input.shape()), sameDiff.getVariable(MASK_KEY));
-            }
-            fn.updateVariable(outputVar.getVarName(), epsilon.dup());
-
            for(String s : paramTable.keySet() ){
                //TODO this should only be necessary, in theory, once!
                sameDiff.associateArrayWithVariable(paramTable.get(s), s);
            }

-            sameDiff.execBackwards(Collections.<String, INDArray>emptyMap());
+            Map<String,INDArray> phMap = new HashMap<>();
+            phMap.put(INPUT_KEY, input);
+            phMap.put(fn.getGradPlaceholderName(), epsilon);
+            if(maskArray != null){
+                phMap.put(MASK_KEY, maskArray);
+            }
+
+            List<String> requiredGrads = new ArrayList<>(paramTable.size() + 1);
+            requiredGrads.add(sameDiff.grad(INPUT_KEY).getVarName());
+            for(String s : paramTable.keySet()){
+                requiredGrads.add(sameDiff.grad(s).getVarName());
+            }
+
+            sameDiff.execBackwards(phMap, requiredGrads);
            for(String s : paramTable.keySet() ){
                INDArray sdGrad = sameDiff.grad(s).getArr();
                INDArray dl4jGrad = gradTable.get(s);
@ -138,6 +158,11 @@ public class SameDiffLayer extends AbstractLayer<AbstractSameDiffLayer> {
            dLdIn = sameDiff.grad(INPUT_KEY).getArr();
        }

+        //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+        sameDiff.clearPlaceholders(true);
+        sameDiff.clearOpInputs();
+
+        System.out.println(dLdIn);
        return new Pair<>(g, workspaceMgr.dup(ArrayType.ACTIVATION_GRAD, dLdIn));   //TODO OPTIMIZE THIS
    }

@ -225,8 +250,9 @@ public class SameDiffLayer extends AbstractLayer<AbstractSameDiffLayer> {
            sameDiff = SameDiff.create();
            Map<String, INDArray> p = paramTable();

-            val inputShape = input.shape().clone();
-            SDVariable inputVar = sameDiff.var(INPUT_KEY, dataType, inputShape);
+            long[] inputShape = input.shape().clone();
+            inputShape[0] = -1;
+            SDVariable inputVar = sameDiff.placeHolder(INPUT_KEY, dataType, inputShape);
            Map<String, long[]> paramShapes = layerConf().getLayerParams().getParamShapes();
            Map<String, SDVariable> params = new LinkedHashMap<>();
            for (String s : paramShapes.keySet()) {
@ -235,7 +261,8 @@ public class SameDiffLayer extends AbstractLayer<AbstractSameDiffLayer> {
                params.put(s, v);
            }

-            SDVariable mask = sameDiff.constant(MASK_KEY, SameDiffGraphVertex.createMask(dataType, inputShape));
+            long[] maskShape = ArrayUtil.nTimes((long)inputShape.length, -1);
+            SDVariable mask = sameDiff.placeHolder(MASK_KEY, dataType, maskShape);

            SDVariable layerOutput = bl.defineLayer(sameDiff, inputVar, params, mask);
            Preconditions.checkNotNull(layerOutput, "Invalid output: layer output is null");
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffOutputLayer.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/layers/samediff/SameDiffOutputLayer.java
@ -87,35 +87,43 @@ public class SameDiffOutputLayer extends AbstractLayer<org.deeplearning4j.nn.con
    private INDArray activateHelper(boolean activations, LayerWorkspaceMgr workspaceMgr){
        assertInputSet(false);

-        //Check where the output occors. If it's a simple loss layer (no params) this could
+        //Check where the output occurs. If it's a simple loss layer (no params) this could
        // just be the input!
        if(activations && INPUT_KEY.equals(layerConf().activationsVertexName())){
            return workspaceMgr.leverageTo(ArrayType.ACTIVATIONS, input);
        }

-        if(sameDiff == null){
-            doInit();
-        }
-
        //TODO optimize
        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()) {
-            sameDiff.associateArrayWithVariable(input.dup(), sameDiff.getVariable(INPUT_KEY));
-            if(layerConf().labelsRequired() && labels != null) {
-                sameDiff.associateArrayWithVariable(labels.dup(), sameDiff.getVariable(LABELS_KEY));
+            if(sameDiff == null){
+                doInit();
            }
+
            for(String s : paramTable.keySet() ) {
                sameDiff.associateArrayWithVariable(paramTable.get(s), s);
            }

-            INDArray score = sameDiff.execAndEndResult();
+            Map<String,INDArray> phMap = new HashMap<>();
+            phMap.put(INPUT_KEY, input);
+            if(!activations && layerConf().labelsRequired() && labels != null) {
+                phMap.put(LABELS_KEY, labels);
+            }
+
+            String s = activations ? layerConf().activationsVertexName() : outputVar.getVarName();
+
+            INDArray out = sameDiff.execSingle(phMap, s);
+
+            //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+            sameDiff.clearPlaceholders(true);
+            sameDiff.clearOpInputs();
+
            if(activations) {
-                INDArray result = sameDiff.getArrForVarName(layerConf().activationsVertexName());
-                Preconditions.checkNotNull(result, "Activations (result) array for variable \"%s\" was " +
+                Preconditions.checkNotNull(out, "Activations (result) array for variable \"%s\" was " +
                        "null - error during execution or this variable (as defined by method activationsVertexName()) " +
                        "does not exist", layerConf().activationsVertexName());
-                return workspaceMgr.dup(ArrayType.ACTIVATIONS, result);
+                return workspaceMgr.dup(ArrayType.ACTIVATIONS, out);
            } else {
-                return score;
+                return out;
            }
        }
    }
@ -127,23 +135,26 @@ public class SameDiffOutputLayer extends AbstractLayer<org.deeplearning4j.nn.con
        Preconditions.checkState(!layerConf().labelsRequired() || labels != null, "Cannot execute backprop: Labels are not set. " +
                "If labels are not required for this SameDiff output layer, override SameDiffOutputLayer.labelsRequired()" +
                " to return false instead");
-
-        if(sameDiff == null){
-            //Usually doInit will be called in forward pass; not necessarily the case in output layers
-            // (for efficiency, we skip output layer forward pass in MultiLayerNetwork/ComputationGraph)
-            doInit();
-        }
-
        Gradient g = new DefaultGradient();

        INDArray dLdIn;
        try(MemoryWorkspace ws = Nd4j.getWorkspaceManager().scopeOutOfWorkspaces()){
-            INDArray castInput = input.castTo(Nd4j.defaultFloatingPointType());
+            if(sameDiff == null){
+                //Usually doInit will be called in forward pass; not necessarily the case in output layers
+                // (for efficiency, we skip output layer forward pass in MultiLayerNetwork/ComputationGraph)
+                doInit();
+            }
+            if(!sameDiff.hasGradientFunction()) {
+                //Create when scoped out, to ensure any arrays are not in WS
+                sameDiff.createGradFunction(INPUT_KEY);
+            }
+
+            INDArray castInput = input.castTo(dataType);
            if(castInput.isAttached())
                castInput = castInput.dup();
            sameDiff.associateArrayWithVariable(castInput, sameDiff.getVariable(INPUT_KEY));
            if(layerConf().labelsRequired()) {
-                INDArray castLabels = labels.castTo(Nd4j.defaultFloatingPointType());
+                INDArray castLabels = labels.castTo(dataType);
                if(castLabels.isAttached())
                    castLabels = castLabels.dup();
                sameDiff.associateArrayWithVariable(castLabels, sameDiff.getVariable(LABELS_KEY));
@ -154,7 +165,17 @@ public class SameDiffOutputLayer extends AbstractLayer<org.deeplearning4j.nn.con
                sameDiff.associateArrayWithVariable(paramTable.get(s), s);
            }

-            sameDiff.execBackwards(Collections.<String, INDArray>emptyMap());
+            List<String> gradVarNames = new ArrayList<>();
+            for(String s : paramTable.keySet()){
+                gradVarNames.add(sameDiff.getVariable(s).getGradient().getVarName());
+            }
+            gradVarNames.add(sameDiff.grad(INPUT_KEY).getVarName());
+
+            Map<String,INDArray> phMap = new HashMap<>();
+            phMap.put(INPUT_KEY, input);
+            phMap.put(LABELS_KEY, labels);
+
+            sameDiff.execBackwards(phMap, gradVarNames);
            for(String s : paramTable.keySet() ){
                INDArray sdGrad = sameDiff.grad(s).getArr();
                INDArray dl4jGrad = gradTable.get(s);
@ -165,6 +186,10 @@ public class SameDiffOutputLayer extends AbstractLayer<org.deeplearning4j.nn.con
            dLdIn = sameDiff.grad(INPUT_KEY).getArr();
        }

+        //Clear placeholders and op inputs to ensure no out-of-scope arrays are still referenced anywhere
+        sameDiff.clearPlaceholders(true);
+        sameDiff.clearOpInputs();
+
        return new Pair<>(g, workspaceMgr.dup(ArrayType.ACTIVATION_GRAD, dLdIn));   //TODO OPTIMIZE THIS
    }

@ -252,18 +277,20 @@ public class SameDiffOutputLayer extends AbstractLayer<org.deeplearning4j.nn.con
            sameDiff = SameDiff.create();
            Map<String, INDArray> p = paramTable();

-            val inputShape = input.shape().clone();
-            SDVariable inputVar = sameDiff.var(INPUT_KEY, dataType, inputShape);
+            long[] inputShape = input.shape().clone();
+            inputShape[0] = -1;
+            SDVariable inputVar = sameDiff.placeHolder(INPUT_KEY, dataType, inputShape);
            SDVariable labelVar = null;
            if(layerConf().labelsRequired()){
-                long[] labelShape = labels == null ? new long[]{1} : labels.shape().clone();
-                labelVar = sameDiff.var(LABELS_KEY, dataType, labelShape);
+                long[] labelShape = labels == null ? new long[]{-1, -1} : labels.shape().clone();
+                labelShape[0] = -1;
+                labelVar = sameDiff.placeHolder(LABELS_KEY, dataType, labelShape);
            }
            Map<String, long[]> paramShapes = layerConf().getLayerParams().getParamShapes();
            Map<String, SDVariable> params = new LinkedHashMap<>();
            for (String s : paramShapes.keySet()) {
                val ps = paramShapes.get(s);
-                SDVariable v = sameDiff.var(s, ps);
+                SDVariable v = sameDiff.var(s, dataType, ps);
                params.put(s, v);
            }
            SDVariable layerOutput = bl.defineLayer(sameDiff, inputVar, labelVar, params);
--- a/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/multilayer/MultiLayerNetwork.java
+++ b/deeplearning4j/deeplearning4j-nn/src/main/java/org/deeplearning4j/nn/multilayer/MultiLayerNetwork.java
@ -660,6 +660,7 @@ public class MultiLayerNetwork implements Serializable, Classifier, Layer, Neura
            val nParamsPerLayer = new long[nLayers];
            for (int i = 0; i < nLayers; i++) {
                NeuralNetConfiguration conf = layerWiseConfigurations.getConf(i);
+                conf.getLayer().setDataType(netDtype);
                nParamsPerLayer[i] = conf.getLayer().initializer().numParams(conf);
                paramLength += nParamsPerLayer[i];
            }
--- a/deeplearning4j/dl4j-perf/src/main/java/org/deeplearning4j/perf/listener/HardwareMetric.java
+++ b/deeplearning4j/dl4j-perf/src/main/java/org/deeplearning4j/perf/listener/HardwareMetric.java
@ -152,7 +152,7 @@ public class HardwareMetric implements Serializable {
        return builder.logicalProcessorCount(processor.getLogicalProcessorCount())
                .physicalProcessorCount(processor.getPhysicalProcessorCount())
                .name(name)
-                .averagedCpuLoad((long) processor.getSystemCpuLoad() * 100)
+                .averagedCpuLoad((long)(processor.getSystemCpuLoad() * 100))
                .ioWaitTime(iowait).gpuMetrics(gpuMetric)
                .hostName(networkParams.getHostName()).diskInfo(diskInfoMap)
                .currentMemoryUse(globalMemory.getTotal() - globalMemory.getAvailable())
--- a/libnd4j/blas/CMakeLists.txt
+++ b/libnd4j/blas/CMakeLists.txt
@ -48,8 +48,6 @@ if(WIN32)
    SET(CMAKE_NINJA_FORCE_RESPONSE_FILE 1 CACHE INTERNAL "")
 endif()

-
-
 if ("${LIBND4J_ALL_OPS}")
    set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DLIBND4J_ALL_OPS=true")
 else()
@ -234,21 +232,21 @@ if(CUDA_BLAS)
            endif()
        endif()

-        if (NOT BUILD_TESTS)
-            file(GLOB_RECURSE EXCEPTIONS_SOURCES false ../include/exceptions/*.cpp ../include/exceptions/*.h)
-            file(GLOB_RECURSE EXEC_SOURCES false ../include/execution/*.cpp ../include/execution/*.h)
-            file(GLOB_RECURSE TYPES_SOURCES false ../include/types/*.cpp ../include/types/*.h)
-            file(GLOB_RECURSE ARRAY_SOURCES false ../include/array/impl/*.cpp ../include/array/cuda/*.cu ../include/array/*.h)
-            file(GLOB_RECURSE MEMORY_SOURCES false ../include/memory/*.cpp ../include/memory/*.h)
-            file(GLOB_RECURSE GRAPH_SOURCES false ../include/graph/*.cpp ../include/graph/*.h)
-            file(GLOB_RECURSE CUSTOMOPS_SOURCES false ../include/ops/declarable/generic/*.cpp)
-            file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES false ../include/ops/declarable/helpers/cpu/*.cpp)
-            file(GLOB_RECURSE OPS_SOURCES false ../include/ops/impl/*.cpp ../include/ops/declarable/impl/*.cpp  ../include/ops/*.h)
-            file(GLOB_RECURSE INDEXING_SOURCES false ../include/indexing/*.cpp ../include/indexing/*.h)
-            file(GLOB_RECURSE HELPERS_SOURCES false ../include/helpers/*.cpp ../include/helpers/cuda/*.cu ../include/helpers/*.h)
-            file(GLOB_RECURSE LOOPS_SOURCES false ../include/loops/*.cpp ../include/loops/*.h)
-            file(GLOB_RECURSE LOOPS_SOURCES_CUDA false ../include/loops/*.cu)
+        file(GLOB_RECURSE EXCEPTIONS_SOURCES false ../include/exceptions/*.cpp ../include/exceptions/*.h)
+        file(GLOB_RECURSE EXEC_SOURCES false ../include/execution/impl/*.cpp ../include/execution/*.cu ../include/execution/*.h)
+        file(GLOB_RECURSE TYPES_SOURCES false ../include/types/*.cpp ../include/types/*.h)
+        file(GLOB_RECURSE ARRAY_SOURCES false ../include/array/impl/*.cpp ../include/array/cuda/*.cu ../include/array/*.h)
+        file(GLOB_RECURSE MEMORY_SOURCES false ../include/memory/impl/*.cpp ../include/memory/cuda/*.cu ../include/memory/*.h)
+        file(GLOB_RECURSE GRAPH_SOURCES false ../include/graph/*.cpp ../include/graph/*.cu ../include/graph/*.h)
+        file(GLOB_RECURSE CUSTOMOPS_SOURCES false ../include/ops/declarable/generic/*.cpp)
+        file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES false ../include/ops/declarable/helpers/cuda/*.cu ../include/ops/declarable/helpers/impl/*.cpp)
+        file(GLOB_RECURSE OPS_SOURCES false ../include/ops/impl/*.cpp ../include/ops/declarable/impl/*.cpp  ../include/ops/*.h)
+        file(GLOB_RECURSE HELPERS_SOURCES false ../include/helpers/impl/*.cpp ../include/helpers/*.cu ../include/helpers/*.cupp ../include/helpers/*.h)
+        file(GLOB_RECURSE INDEXING_SOURCES false ../include/indexing/*.cpp ../include/indexing/*.h)
+        file(GLOB_RECURSE LOOPS_SOURCES false ../include/loops/*.cpp ../include/loops/*.h)
+        file(GLOB_RECURSE LOOPS_SOURCES_CUDA false ../include/loops/*.cu)

+        if (NOT BUILD_TESTS)
 			CUDA_ADD_LIBRARY(${LIBND4J_NAME} SHARED cuda/NativeOps.cu cuda/NativeOpExecutioner.cu ${LOOPS_SOURCES_CUDA}
                ${CUSTOMOPS_HELPERS_SOURCES} ${HELPERS_SOURCES} ${EXEC_SOURCES}
                ../include/cnpy/cnpy.cpp ../include/nd4jmemset.h ../include/nd4jmalloc.h
@ -258,26 +256,12 @@ if(CUDA_BLAS)
 		else()
            set (CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -DBUILD_TESTS=true")

-            file(GLOB_RECURSE EXCEPTIONS_SOURCES false ../include/exceptions/*.cpp ../include/exceptions/*.h)
-            file(GLOB_RECURSE EXEC_SOURCES false ../include/execution/impl/*.cpp ../include/execution/*.cu ../include/execution/*.h)
-            file(GLOB_RECURSE TYPES_SOURCES false ../include/types/*.cpp ../include/types/*.h)
-            file(GLOB_RECURSE ARRAY_SOURCES false ../include/array/impl/*.cpp ../include/array/cuda/*.cu ../include/array/*.h)
-            file(GLOB_RECURSE MEMORY_SOURCES false ../include/memory/impl/*.cpp ../include/memory/cuda/*.cu ../include/memory/*.h)
-            file(GLOB_RECURSE GRAPH_SOURCES false ../include/graph/*.cpp ../include/graph/*.cu ../include/graph/*.h)
-            file(GLOB_RECURSE CUSTOMOPS_SOURCES false ../include/ops/declarable/generic/*.cpp)
-            file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES false ../include/ops/declarable/helpers/cuda/*.cu)
-            file(GLOB_RECURSE OPS_SOURCES false ../include/ops/impl/*.cpp ../include/ops/declarable/impl/*.cpp  ../include/ops/*.h)
-            file(GLOB_RECURSE HELPERS_SOURCES false ../include/helpers/impl/*.cpp ../include/helpers/*.cu ../include/helpers/*.cupp ../include/helpers/*.h)
-            file(GLOB_RECURSE INDEXING_SOURCES false ../include/indexing/*.cpp ../include/indexing/*.h)
-            file(GLOB_RECURSE LOOPS_SOURCES false ../include/loops/*.cpp ../include/loops/*.h)
-            file(GLOB_RECURSE LOOPS_SOURCES_CUDA false ../include/loops/*.cu)
-
 			CUDA_ADD_LIBRARY(${LIBND4J_NAME} STATIC cuda/NativeOps.cu cuda/NativeOpExecutioner.cu ${LOOPS_SOURCES_CUDA}
                ${CUSTOMOPS_HELPERS_SOURCES} ${HELPERS_SOURCES} ${EXEC_SOURCES}
                ../include/cnpy/cnpy.cpp ../include/nd4jmemset.h ../include/nd4jmalloc.h
                cpu/GraphExecutioner.cpp cuda/NDArray.cu cpu/NDArrayFactory.cpp
                Environment.cpp Environment.h ${LOOPS_SOURCES} ${ARRAY_SOURCES} ${TYPES_SOURCES}
-                ${MEMORY_SOURCES} ${GRAPH_SOURCES} ${CUSTOMOPS_SOURCES} ${INDEXING_SOURCES} ${EXCEPTIONS_SOURCES}  ${OPS_SOURCES})
+                ${MEMORY_SOURCES} ${GRAPH_SOURCES} ${CUSTOMOPS_SOURCES} ${INDEXING_SOURCES} ${EXCEPTIONS_SOURCES} ${OPS_SOURCES})
 		endif()


@ -308,7 +292,7 @@ elseif(CPU_BLAS)
    file(GLOB_RECURSE MEMORY_SOURCES false ../include/memory/*.cpp ../include/memory/*.h)
    file(GLOB_RECURSE GRAPH_SOURCES false ../include/graph/*.cpp ../include/graph/*.h)
    file(GLOB_RECURSE CUSTOMOPS_SOURCES false ../include/ops/declarable/generic/*.cpp)
-    file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES false ../include/ops/declarable/helpers/cpu/*.cpp)
+    file(GLOB_RECURSE CUSTOMOPS_HELPERS_SOURCES false ../include/ops/declarable/helpers/cpu/*.cpp ../include/ops/declarable/helpers/impl/*.cpp)
    file(GLOB_RECURSE OPS_SOURCES false ../include/ops/impl/*.cpp ../include/ops/declarable/impl/*.cpp  ../include/ops/*.h)
    file(GLOB_RECURSE INDEXING_SOURCES false ../include/indexing/*.cpp ../include/indexing/*.h)
    file(GLOB_RECURSE HELPERS_SOURCES false ../include/helpers/*.cpp ../include/helpers/*.h)
--- a/libnd4j/blas/NDArray.h
+++ b/libnd4j/blas/NDArray.h
@ -372,8 +372,8 @@ namespace nd4j {
        /**
        *  if _bufferD==nullptr return _buffer, else return _bufferD
        */
-        FORCEINLINE void* specialBuffer();
-        FORCEINLINE void* getSpecialBuffer() const;
+        void* specialBuffer();
+        void* getSpecialBuffer() const;

        /**
        *   returns device buffer if compilation is for cuda case, otherwise returns host buffer
@ -429,16 +429,16 @@ namespace nd4j {
        /**
        *  permutes the dimensions in array according to "dimensions" array, new array points on _buffer of this array
        */
-		NDArray* permute(const std::initializer_list<int>& dimensions) const;
-        NDArray* permute(const std::vector<int>& dimensions) const;
-        NDArray* permute(const int* dimensions, const int rank) const;
+		NDArray permute(const std::initializer_list<int>& dimensions) const;
+        NDArray permute(const std::vector<int>& dimensions) const;
+        NDArray permute(const int* dimensions, const int rank) const;

        void permute(const int* dimensions, const int rank, NDArray& target) const;
        void permute(const std::vector<int>& dimensions, NDArray& target) const;

-        NDArray* permute(const std::initializer_list<Nd4jLong>& dimensions) const;
-        NDArray* permute(const std::vector<Nd4jLong>& dimensions) const;
-        NDArray* permute(const Nd4jLong* dimensions, const int rank) const;
+        NDArray permute(const std::initializer_list<Nd4jLong>& dimensions) const;
+        NDArray permute(const std::vector<Nd4jLong>& dimensions) const;
+        NDArray permute(const Nd4jLong* dimensions, const int rank) const;

        void permute(const Nd4jLong* dimensions, const int rank, NDArray& target) const;
        void permute(const std::vector<Nd4jLong>& dimensions, NDArray& target) const;
@ -508,7 +508,7 @@ namespace nd4j {
        /**
        *  returns new copy of this array, optionally in different order
        */
-        NDArray *dup(const char newOrder = 'a');
+        NDArray *dup(const char newOrder = 'a') const;

        /**
        *  returns sum of all elements of array
@ -687,7 +687,7 @@ namespace nd4j {
        void applyScalarArr(nd4j::scalar::BoolOps op, const NDArray* scalar, NDArray* target, ExtraArguments *extraParams = nullptr) const;


-#if defined(__CUDABLAS__) && defined(BUILD_TESTS)
+#if defined(__CUDABLAS__) //&& defined(BUILD_TESTS)
        template <typename Lambda>
        FORCEINLINE void applyLambda(Lambda func, NDArray* target = nullptr);

@ -790,8 +790,7 @@ namespace nd4j {
        /**
        *   apply transpose operation to the copy of this array, that is this array remains unaffected
        */
-        NDArray* transpose() const;
-        NDArray  transp() const;
+        NDArray transpose() const;

        /**
        *  perform transpose operation and store result in target, this array remains unaffected
@ -915,7 +914,7 @@ namespace nd4j {
        *
        * if permute has been applied before or there are weird strides, then a new buffer is allocated for the new array
        */
-		NDArray* reshape(const char order, const std::vector<Nd4jLong>& shape) const;
+		NDArray reshape(const char order, const std::vector<Nd4jLong>& shape) const;

        /**
        *  calculate strides and set given order
@ -2093,15 +2092,6 @@ Nd4jLong* NDArray::shapeInfo() {
    return _shapeInfo;
 }

-////////////////////////////////////////////////////////////////////////
-void* NDArray::specialBuffer() {
-
-    if (_buffer->special() == nullptr)
-        return getBuffer();
-    // FIXME: this should be fixed once CUDA backend added
-    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
-}
-
 ////////////////////////////////////////////////////////////////////////
 Nd4jLong* NDArray::specialShapeInfo() {
    if (_shapeInfoD == nullptr)
@ -2110,14 +2100,6 @@ Nd4jLong* NDArray::specialShapeInfo() {
    return _shapeInfoD;
 }

-////////////////////////////////////////////////////////////////////////
-void* NDArray::getSpecialBuffer() const {
-      if (_buffer->special() == nullptr)
-        return getBuffer();
-    // FIXME: this should be fixed once CUDA backend added
-    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
-}
-
 ////////////////////////////////////////////////////////////////////////
 Nd4jLong NDArray::getBufferOffset() const {
    return _offset;
@ -2137,7 +2119,7 @@ Nd4jLong* NDArray::getSpecialShapeInfo() const{
 }


-#if defined(__CUDACC__) && defined(BUILD_TESTS)
+#if defined(__CUDACC__) //&& defined(BUILD_TESTS)
 // for CUDA we still need this stuff inline
 #include "cuda/NDArrayLambda.hpp"
 #endif
--- a/libnd4j/blas/NDArray.hpp
+++ b/libnd4j/blas/NDArray.hpp
@ -39,9 +39,9 @@ NDArray* NDArray::asT() const{
    auto result = isScalar() ? new NDArray('c', {}, {0.}, DataTypeUtils::fromT<T>(), this->getContext()) : new NDArray(ordering(), getShapeAsVector(), DataTypeUtils::fromT<T>(), this->getContext());
    auto l = this->lengthOf();

-    prepareSpecialUse({result}, {this});
+    NDArray::prepareSpecialUse({result}, {this});
    NativeOpExecutioner::execTransformAny(getContext(), transform::AnyOps::Assign, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result->getBuffer(), result->getShapeInfo(), result->getSpecialBuffer(), result->getSpecialShapeInfo(), nullptr, nullptr, nullptr);
-    registerSpecialUse({result}, {this});
+    NDArray::registerSpecialUse({result}, {this});

    return result;
 }
@ -583,117 +583,130 @@ void NDArray::copyBuffersContinuouslyFrom(const NDArray& other, size_t sizeToCop
 void NDArray::assign(const double value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const float value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const float16 value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const bfloat16& value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const Nd4jLong value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const int value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), _shapeInfo, specialBuffer(), _shapeInfoD, buffer(), _shapeInfo, specialBuffer(), _shapeInfoD, temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp._shapeInfoD, nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const int16_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), _shapeInfo, specialBuffer(), _shapeInfoD, buffer(), _shapeInfo, specialBuffer(), _shapeInfoD, temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp._shapeInfoD, nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const uint8_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const uint16_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const uint32_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const uint64_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const int8_t value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
 void NDArray::assign(const bool value) {
    // just fire scalar
    auto temp = NDArrayFactory::create(this->dataType(), value, this->getContext());
-    prepareSpecialUse({this}, {&temp});
+
+    NDArray::prepareSpecialUse({this}, {&temp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::CopyPws, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), temp.buffer(), temp.shapeInfo(), temp.specialBuffer(), temp.getSpecialShapeInfo(), nullptr);
-    registerSpecialUse({this}, {&temp});
+    NDArray::registerSpecialUse({this}, {&temp});
 }

 //////////////////////////////////////////////////////////////////////////
@ -716,9 +729,9 @@ NDArray NDArray::varianceNumber(nd4j::variance::Ops op, bool biasCorrected) {

    NDArray res(DataTypeUtils::pickFloatingType(dataType()), getContext());

-    prepareSpecialUse({&res}, {this});
+    NDArray::prepareSpecialUse({&res}, {this});
    NativeOpExecutioner::execSummaryStatsScalar(getContext(), op, buffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), nullptr, res.buffer(), res.shapeInfo(), res.specialBuffer(), res.specialShapeInfo(), biasCorrected);
-    registerSpecialUse({&res}, {this});
+    NDArray::registerSpecialUse({&res}, {this});

    return res;
 }
@ -918,9 +931,9 @@ NDArray NDArray::reduceNumber(nd4j::reduce::FloatOps op, void *extraParams) cons
    auto shape = ConstantShapeHelper::getInstance()->scalarShapeInfo(DataTypeUtils::pickFloatingType(dataType()));
    NDArray result(shape, true, this->getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execReduceFloatScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo());
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -932,9 +945,9 @@ NDArray NDArray::reduceNumber(nd4j::reduce::SameOps op, void *extraParams) const

    NDArray result(dataType(), getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execReduceSameScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo());
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -947,9 +960,9 @@ NDArray NDArray::reduceNumber(nd4j::reduce::BoolOps op, void *extraParams) const
    auto shape = ConstantShapeHelper::getInstance()->scalarShapeInfo(DataType::BOOL);
    NDArray result(shape, true, this->getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execReduceBoolScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo());
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -962,9 +975,9 @@ NDArray NDArray::reduceNumber(nd4j::reduce::LongOps op, void *extraParams) const
    auto shape = ConstantShapeHelper::getInstance()->scalarShapeInfo(DataType::INT64);
    NDArray result(shape, true, this->getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execReduceLongScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo());
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -976,9 +989,9 @@ void NDArray::reduceNumber(nd4j::reduce::FloatOps op, NDArray& target, void *ext
    if(!target.isScalar() || target.dataType() != DataTypeUtils::pickFloatingType(dataType()))
        throw std::invalid_argument("NDArray::reduceNumber FloatOps: target array should be scalar and have corresponding float type!");

-    prepareSpecialUse({&target}, {this});
+    NDArray::prepareSpecialUse({&target}, {this});
    NativeOpExecutioner::execReduceFloatScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, target.buffer(), target.shapeInfo(), target.specialBuffer(), target.specialShapeInfo());
-    registerSpecialUse({&target}, {this});
+    NDArray::registerSpecialUse({&target}, {this});
 }

 //////////////////////////////////////////////////////////////////////////
@ -989,9 +1002,9 @@ void NDArray::reduceNumber(nd4j::reduce::SameOps op, NDArray& target, void *extr
    if(!target.isScalar() || target.dataType() != dataType())
        throw std::invalid_argument("NDArray::reduceNumber SameOps: target array should be scalar and have same type as this array!");

-    prepareSpecialUse({&target}, {this});
+    NDArray::prepareSpecialUse({&target}, {this});
    NativeOpExecutioner::execReduceSameScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, target.getBuffer(), target.getShapeInfo(), target.specialBuffer(), target.getSpecialShapeInfo());
-    registerSpecialUse({&target}, {this});
+    NDArray::registerSpecialUse({&target}, {this});
 }

 //////////////////////////////////////////////////////////////////////////
@ -1002,9 +1015,9 @@ void NDArray::reduceNumber(nd4j::reduce::BoolOps op, NDArray& target, void *extr
    if(!target.isScalar() || target.dataType() != DataType::BOOL)
        throw std::invalid_argument("NDArray::reduceNumber BoolOps: target array should be scalar and have bool type!");

-    prepareSpecialUse({&target}, {this});
+    NDArray::prepareSpecialUse({&target}, {this});
    NativeOpExecutioner::execReduceBoolScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, target.getBuffer(), target.getShapeInfo(), target.specialBuffer(), target.getSpecialShapeInfo());
-    registerSpecialUse({&target}, {this});
+    NDArray::registerSpecialUse({&target}, {this});
 }

 //////////////////////////////////////////////////////////////////////////
@ -1015,9 +1028,9 @@ void NDArray::reduceNumber(nd4j::reduce::LongOps op, NDArray& target, void *extr
    if(!target.isScalar() || target.dataType() != DataType::INT64)
        throw std::invalid_argument("NDArray::reduceNumber LongOps: target array should be scalar and have long type!");

-    prepareSpecialUse({&target}, {this});
+    NDArray::prepareSpecialUse({&target}, {this});
    NativeOpExecutioner::execReduceLongScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams, target.getBuffer(), target.getShapeInfo(), target.specialBuffer(), target.getSpecialShapeInfo());
-    registerSpecialUse({&target}, {this});
+    NDArray::registerSpecialUse({&target}, {this});
 }

 //////////////////////////////////////////////////////////////////////////
@ -1027,9 +1040,9 @@ NDArray NDArray::indexReduceNumber(nd4j::indexreduce::Ops op, ExtraArguments *ex

    auto res = NDArrayFactory::create<Nd4jLong>(0);

-    NDArray::prepareSpecialUse({&res}, {this});
+    NDArray::NDArray::prepareSpecialUse({&res}, {this});
    NativeOpExecutioner::execIndexReduceScalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), extraParams == nullptr ? nullptr : extraParams->argumentsAsT(this->dataType()), res.buffer(), res.shapeInfo(), res.specialBuffer(), res.specialShapeInfo());
-    NDArray::registerSpecialUse({&res}, {this});
+    NDArray::NDArray::registerSpecialUse({&res}, {this});

    return res;
 }
@ -1240,17 +1253,10 @@ BUILD_SINGLE_TEMPLATE(template void* NDArray::templatedPointerShift, (const Nd4j

 //////////////////////////////////////////////////////////////////////////
 // method makes copy of this array and applies to the copy transpose operation, this array remains unaffected
-NDArray* NDArray::transpose() const {
-    auto newArr = new NDArray(getBuffer(), getSpecialBuffer(), getShapeInfo(), getContext(), false, false);
-    newArr->transposei();
-
-    return newArr;
-}
-
-////////////////////////////////////////////////////////////////////////
-NDArray NDArray::transp() const {
-    NDArray newArr(getBuffer(), getShapeInfo(), getContext(), false);
+NDArray NDArray::transpose() const {
+    NDArray newArr(getDataBuffer(), ShapeDescriptor(getShapeInfo()), getContext(), getBufferOffset());
    newArr.transposei();
+
    return newArr;
 }

@ -1360,10 +1366,10 @@ Nd4jLong NDArray::argMax(std::initializer_list<int> dimensions) {

 //////////////////////////////////////////////////////////////////////////
 // create new array with corresponding order and shape, new array will point to the same _buffer as this array
-NDArray* NDArray::reshape(const char order, const std::vector<Nd4jLong>& shape) const {
+NDArray NDArray::reshape(const char order, const std::vector<Nd4jLong>& shape) const {

-    auto newArr = new NDArray(getDataBuffer(), ShapeDescriptor(getShapeInfo()), getContext());
-    newArr->reshapei(order, shape);
+    NDArray newArr(getDataBuffer(), ShapeDescriptor(getShapeInfo()), getContext(), getBufferOffset());
+    newArr.reshapei(order, shape);

    return newArr;
 }
@ -1420,43 +1426,43 @@ bool NDArray::permutei(const std::vector<Nd4jLong>& dimensions) {
 }

 //////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const int* dimensions, const int rank) const {
+NDArray NDArray::permute(const int* dimensions, const int rank) const {

    // evaluate shapeInfo for output (permuted) array ret
    auto shapeInfoPermuted = ShapeUtils::evalPermShapeInfo(dimensions, rank, *this, getContext()->getWorkspace());
-    auto ret = new NDArray(_buffer, ShapeDescriptor(shapeInfoPermuted), getContext(), getBufferOffset());
-	ret->_isView = true;
+    NDArray ret(getDataBuffer(), ShapeDescriptor(shapeInfoPermuted), getContext(), getBufferOffset());
+    ret._isView = true;
    return ret;
 }

 /////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const Nd4jLong* dimensions, const int rank) const {
+NDArray NDArray::permute(const Nd4jLong* dimensions, const int rank) const {
    int tempDims[MAX_RANK];
    shape::convertT<Nd4jLong, int>(const_cast<Nd4jLong *>(dimensions), tempDims, rank);
    return permute(tempDims, rank);
 }

 //////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const std::vector<int>& dimensions) const {
+NDArray NDArray::permute(const std::vector<int>& dimensions) const {
    auto data = dimensions.data();
    auto size = dimensions.size();
    return permute(data, size);
 }

 //////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const std::vector<Nd4jLong>& dimensions) const {
+NDArray NDArray::permute(const std::vector<Nd4jLong>& dimensions) const {
    return permute(dimensions.data(), dimensions.size());
 }


 //////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const std::initializer_list<int>& dimensions) const {
+NDArray NDArray::permute(const std::initializer_list<int>& dimensions) const {
    std::vector<int> vec(dimensions);
    return permute(vec);
 }

 //////////////////////////////////////////////////////////////////////////
-NDArray* NDArray::permute(const std::initializer_list<Nd4jLong>& dimensions) const {
+NDArray NDArray::permute(const std::initializer_list<Nd4jLong>& dimensions) const {
    std::vector<Nd4jLong> vec(dimensions);
    return permute(vec);
 }
@ -1528,10 +1534,9 @@ bool NDArray::isUnitary() {
        throw std::runtime_error("isUnitary method: matrix must be square and have rank = 2 !");

    auto tr = this->transpose();
-    auto trMul = MmulHelper::mmul(this, tr, nullptr, 1.f, 0.f);
+    auto trMul = MmulHelper::mmul(this, &tr, nullptr, 1.f, 0.f);

    bool result = trMul->isIdentityMatrix();
-    delete tr;
    delete trMul;

    return result;
@ -1777,11 +1782,11 @@ NDArray NDArray::operator*(const T& scalar) const {

    auto tmp = NDArrayFactory::create(dataType(), scalar, getContext());
    NDArray result(_shapeInfo, DataTypeUtils::pickPairwiseResultType(dataType(), DataTypeUtils::fromT<T>()), false, getContext());
+
    NDArray::prepareSpecialUse({&result}, {this, &tmp});
-
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Multiply, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), result.getSpecialShapeInfo(), tmp.buffer(), tmp.shapeInfo(), tmp.specialBuffer(), tmp.specialShapeInfo(), nullptr);
-
    NDArray::registerSpecialUse({&result}, {this, &tmp});
+
    return result;
 }
 template NDArray NDArray::operator*(const double&   scalar) const;
@ -1811,6 +1816,7 @@ NDArray NDArray::operator/(const T& scalar) const {
    NDArray::prepareSpecialUse({&result}, {this, &tmp});
    NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Divide, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), result.getSpecialShapeInfo(), tmp.buffer(), tmp.shapeInfo(), tmp.specialBuffer(), tmp.specialShapeInfo(), nullptr);
    NDArray::registerSpecialUse({&result}, {this, &tmp});
+
    return result;
 }
 template NDArray NDArray::operator/(const double&   scalar) const;
@ -2050,14 +2056,14 @@ void NDArray::operator+=(const NDArray& other) {
        throw nd4j::datatype_exception::build("NDArray operator+=: Cannot add different types", this->dataType(), other.dataType());

    if (!this->isScalar() && other.isScalar()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Add, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Add, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else{
        Nd4jLong *bShape = nullptr;
@ -2084,14 +2090,14 @@ void NDArray::operator-=(const NDArray& other) {
        throw nd4j::datatype_exception::build("NDArray operator-=: Cannot subtract different types", this->dataType(), other.dataType());

    if (!this->isScalar() && other.isScalar()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Subtract, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Subtract, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else{
        Nd4jLong *bShape = nullptr;
@ -2117,14 +2123,14 @@ void NDArray::operator*=(const NDArray& other) {
        throw nd4j::datatype_exception::build("NDArray operator*=: Cannot multiply different types", this->dataType(), other.dataType());

    if (!this->isScalar() && other.isScalar()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Multiply, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Multiply, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else{
        Nd4jLong *bShape = nullptr;
@ -2154,14 +2160,14 @@ void NDArray::operator/=(const NDArray& other) {
    }

    if (!this->isScalar() && other.isScalar()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execScalar(getContext(), nd4j::scalar::Divide, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
-        prepareSpecialUse({this}, {this, &other});
+        NDArray::prepareSpecialUse({this}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Divide, buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), buffer(), getShapeInfo(), specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({this}, {this, &other});
+        NDArray::registerSpecialUse({this}, {this, &other});
    }
    else{
        Nd4jLong *bShape = nullptr;
@ -2264,9 +2270,9 @@ NDArray NDArray::operator-(const NDArray& other) const {
    if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
        NDArray result(getShapeInfo(), DataTypeUtils::pickPairwiseResultType(getShapeInfo(), other.getShapeInfo()), false, getContext());

-        prepareSpecialUse({&result}, {this, &other});
+        NDArray::prepareSpecialUse({&result}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Subtract, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({&result}, {this, &other});
+        NDArray::registerSpecialUse({&result}, {this, &other});

        return result;
    }
@ -2285,9 +2291,9 @@ NDArray NDArray::operator*(const NDArray& other) const {
    if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
        NDArray result(getShapeInfo(), DataTypeUtils::pickPairwiseResultType(getShapeInfo(), other.getShapeInfo()), false, this->getContext());

-        prepareSpecialUse({&result}, {this, &other});
+        NDArray::prepareSpecialUse({&result}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Multiply, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({&result}, {this, &other});
+        NDArray::registerSpecialUse({&result}, {this, &other});

        return result;
    }
@ -2308,9 +2314,9 @@ NDArray NDArray::operator/(const NDArray& other) const {
    if (other.lengthOf() == lengthOf() && this->rankOf() == other.rankOf()) {
        NDArray result(getShapeInfo(), DataTypeUtils::pickPairwiseResultType(getShapeInfo(), other.getShapeInfo()), false, getContext());

-        prepareSpecialUse({&result}, {this, &other});
+        NDArray::prepareSpecialUse({&result}, {this, &other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), nd4j::pairwise::Divide, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), getSpecialShapeInfo(), nullptr);
-        registerSpecialUse({&result}, {this, &other});
+        NDArray::registerSpecialUse({&result}, {this, &other});

        return result;
    }
@ -2326,9 +2332,9 @@ NDArray NDArray::operator-() const {

    NDArray result(getShapeInfo(), false, getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execTransformSame(getContext(), nd4j::transform::Neg, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.getShapeInfo(), result.specialBuffer(), result.getSpecialShapeInfo(), nullptr, nullptr, nullptr);
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -2631,7 +2637,7 @@ void NDArray::applyBroadcast(nd4j::broadcast::Ops op, const std::vector<int>& di
    if (other->lengthOf() == lengthOf() && this->rankOf() == other->rankOf()) {
        NDArray::prepareSpecialUse({result}, {this, other});
        NativeOpExecutioner::execPairwiseTransform(getContext(), fromBroadcastToPairwise(op), buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), result->buffer(), result->shapeInfo(), result->specialBuffer(), result->specialShapeInfo(), nullptr);
-        registerSpecialUse({result}, {this, other});
+        NDArray::registerSpecialUse({result}, {this, other});
        return;
    }

@ -2688,7 +2694,7 @@ void NDArray::applyBroadcast(nd4j::broadcast::BoolOps op, const std::vector<int>
    if (other->lengthOf() == lengthOf() && this->rankOf() == other->rankOf()) {
        NDArray::prepareSpecialUse({result}, {this, other});
        NativeOpExecutioner::execPairwiseBoolTransform(getContext(), fromBroadcastToPairwiseBool(op), buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), result->buffer(), result->shapeInfo(), result->specialBuffer(), result->specialShapeInfo(), nullptr);
-        registerSpecialUse({result}, {this, other});
+        NDArray::registerSpecialUse({result}, {this, other});
        return;
    }

@ -2896,7 +2902,7 @@ bool NDArray::reshapei(const char order, const std::vector<Nd4jLong>& cshape) {
    Nd4jLong *shapeInfoNew;
    ALLOCATE(shapeInfoNew, getContext()->getWorkspace(), shape::shapeInfoLength(rank), Nd4jLong);

-    bool canReshape = shape::reshapeC(this->rankOf(), this->_shapeInfo, shape.size(), shape.data(), shapeInfoNew);
+    bool canReshape = shape::reshapeC(rankOf(), shapeInfo(), shape.size(), shape.data(), shapeInfoNew);

    // we can do this only if there was no permute applied, or there are no weird strides
    if (canReshape) {
@ -2948,11 +2954,9 @@ void NDArray::applyPairwiseTransform(nd4j::pairwise::Ops op, const NDArray* othe
    if (target->dataType() != this->dataType() && target->dataType() != other->dataType())
        throw std::invalid_argument("NDArray::applyPairwiseTransform method - type of target array must be the same as type of this or other array !");

-    prepareSpecialUse({target}, {this, other});
-
+    NDArray::prepareSpecialUse({target}, {this, other});
    NativeOpExecutioner::execPairwiseTransform(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr);
-
-    registerSpecialUse({target}, {this, other});
+    NDArray::registerSpecialUse({target}, {this, other});

    if (extraParams != nullptr)
        synchronize("NDArray::applyPairwiseTransform");
@ -2969,9 +2973,9 @@ void NDArray::applyPairwiseTransform(nd4j::pairwise::BoolOps op, const NDArray *
    if (dataType() != other->dataType())
        throw std::invalid_argument("NDArray::applyPairwiseTransform BoolOps method - this and other arrays must have the same type !");

-    prepareSpecialUse({target}, {this, other});
+    NDArray::prepareSpecialUse({target}, {this, other});
    NativeOpExecutioner::execPairwiseBoolTransform(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr);
-    registerSpecialUse({target}, {this, other});
+    NDArray::registerSpecialUse({target}, {this, other});
 }

 //////////////////////////////////////////////////////////////////////////
@ -3070,22 +3074,23 @@ void NDArray::assign(const NDArray& other) {
    if (other.isScalar()) {

        if(this->isScalar()) {
-            preparePrimaryUse({this}, {&other});
+            NDArray::preparePrimaryUse({this}, {&other});
            BUILD_DOUBLE_SELECTOR(dataType(), other.dataType(), templatedDoubleAssign, (buffer(), 0, other.getBuffer(), 0), LIBND4J_TYPES, LIBND4J_TYPES);
-            registerPrimaryUse({this}, {&other});
+            NDArray::registerPrimaryUse({this}, {&other});
+            this->syncToDevice();
        }
        else {
            if (dataType() != other.dataType()) {
                auto tmp = other.cast(dataType());
-                prepareSpecialUse({this}, {tmp});
+                NDArray::prepareSpecialUse({this}, {tmp});
                NativeOpExecutioner::execScalar(getContext(), scalar::CopyPws, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), tmp->getBuffer(), tmp->getShapeInfo(), tmp->getSpecialBuffer(), tmp->getSpecialShapeInfo(), nullptr);
-                registerSpecialUse({this}, {});
+                NDArray::registerSpecialUse({this}, {});
                delete tmp;
            }
            else {
-                prepareSpecialUse({this}, {&other});
+                NDArray::prepareSpecialUse({this}, {&other});
                NativeOpExecutioner::execScalar(getContext(), scalar::CopyPws, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), nullptr);
-                registerSpecialUse({this}, {&other});
+                NDArray::registerSpecialUse({this}, {&other});
            }
        }
    }
@ -3101,16 +3106,16 @@ void NDArray::assign(const NDArray& other) {
        if (ordering() == other.ordering() && dataType() == other.dataType() && ews() == 1 && other.ews() == 1)
            copyBuffersContinuouslyFrom(other, other.lengthOf() * other.sizeOfT());
        else {
-            prepareSpecialUse({this}, {&other});
+            NDArray::prepareSpecialUse({this}, {&other});
            NativeOpExecutioner::execTransformAny(getContext(), transform::Assign, other.getBuffer(), other.getShapeInfo(), other.getSpecialBuffer(), other.getSpecialShapeInfo(), buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), nullptr, nullptr, nullptr);
-            registerSpecialUse({this}, {&other});
+            NDArray::registerSpecialUse({this}, {&other});
        }
    }
 }

 ////////////////////////////////////////////////////////////////////////
 // This method returns a new copy of this NDArray, optionally in a different order
-NDArray* NDArray::dup(const char newOrder) {
+NDArray* NDArray::dup(const char newOrder) const {

    if (isEmpty())
        return NDArrayFactory::empty_(dataType(), getContext());
@ -3170,7 +3175,7 @@ std::string NDArray::e(const Nd4jLong i) const {
    if (!isS())
        throw std::runtime_error("Can't get std::string out of non-string array");

-    preparePrimaryUse({}, {this});
+    NDArray::preparePrimaryUse({}, {this});

    // getting "virtual" offset. It's not real though, since it doesn't take lengths into account
    auto offset = getOffset(i);
@ -3208,8 +3213,8 @@ T NDArray::e(const Nd4jLong i) const {

    const auto rp = getOffset(i);

-    preparePrimaryUse({}, {this});
-    registerPrimaryUse({}, {this});
+    NDArray::preparePrimaryUse({}, {this});
+    NDArray::registerPrimaryUse({}, {this});
    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), return templatedGet<, T>(getBuffer(), rp), LIBND4J_TYPES);

 }
@ -3226,8 +3231,8 @@ T NDArray::e(const Nd4jLong i, const Nd4jLong j) const {
    const Nd4jLong coords[2] = {i, j};
    const auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());

-    preparePrimaryUse({}, {this});
-    registerPrimaryUse({}, {this});
+    NDArray::preparePrimaryUse({}, {this});
+    NDArray::registerPrimaryUse({}, {this});

    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), return templatedGet<, T>(getBuffer(), xOffset), LIBND4J_TYPES);

@ -3246,8 +3251,8 @@ T NDArray::e(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k) const {
    const Nd4jLong coords[3] = {i, j, k};
    const auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());

-    preparePrimaryUse({}, {this});
-    registerPrimaryUse({}, {this});
+    NDArray::preparePrimaryUse({}, {this});
+    NDArray::registerPrimaryUse({}, {this});

    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), return templatedGet<, T>(getBuffer(), xOffset), LIBND4J_TYPES);

@ -3266,8 +3271,8 @@ T NDArray::e(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const Nd4jLon
    const Nd4jLong coords[4] = {i, j, k, l};
    const auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());

-    preparePrimaryUse({}, {this});
-    registerPrimaryUse({}, {this});
+    NDArray::preparePrimaryUse({}, {this});
+    NDArray::registerPrimaryUse({}, {this});

    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), return templatedGet<, T>(getBuffer(), xOffset), LIBND4J_TYPES);

@ -3300,9 +3305,9 @@ void NDArray::applyTransform(nd4j::transform::FloatOps op, NDArray *target, Extr
    if (!target->isR())
        throw std::runtime_error("NDArray::applyTransform FloatOps: target array must have one of FLOAT types");

-    prepareSpecialUse({target}, {this});
+    NDArray::prepareSpecialUse({target}, {this});
    NativeOpExecutioner::execTransformFloat(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr, nullptr, nullptr);
-    registerSpecialUse({target}, {this});
+    NDArray::registerSpecialUse({target}, {this});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3314,9 +3319,9 @@ void NDArray::applyTransform(nd4j::transform::AnyOps op, NDArray *target, ExtraA
    if (target == nullptr)
        target = this;

-    prepareSpecialUse({target}, {this});
+    NDArray::prepareSpecialUse({target}, {this});
    NativeOpExecutioner::execTransformAny(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr, nullptr, nullptr);
-    registerSpecialUse({target}, {this});
+    NDArray::registerSpecialUse({target}, {this});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3331,9 +3336,9 @@ void NDArray::applyTransform(nd4j::transform::SameOps op, NDArray *target, Extra
    if (target->dataType() != dataType())
        throw std::runtime_error("NDArray::applyTransform SameOps: target array must have the same data type as original array");

-    prepareSpecialUse({target}, {this});
+    NDArray::prepareSpecialUse({target}, {this});
    NativeOpExecutioner::execTransformSame(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr, nullptr, nullptr);
-    registerSpecialUse({target}, {this});
+    NDArray::registerSpecialUse({target}, {this});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3347,9 +3352,9 @@ void NDArray::applyTransform(nd4j::transform::StrictOps op, NDArray *target, Ext
    if (!this->isR() || !target->isR() || (this->dataType() != target->dataType()))
        throw std::runtime_error("NDArray::applyTransform StrictOps: both Source and Target array must have same FLOAT type !");

-    registerSpecialUse({target}, {this});
+    NDArray::prepareSpecialUse({target}, {this});
    NativeOpExecutioner::execTransformStrict(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr, nullptr, nullptr);
-    prepareSpecialUse({target}, {this});
+    NDArray::registerSpecialUse({target}, {this});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3363,9 +3368,9 @@ void NDArray::applyTransform(nd4j::transform::BoolOps op, NDArray *target, Extra
    if (!target->isB())
        throw std::runtime_error("NDArray::applyTransform BoolOps: target array must have one of BOOL types");

-    prepareSpecialUse({target}, {this});
+    NDArray::prepareSpecialUse({target}, {this});
    NativeOpExecutioner::execTransformBool(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()) : nullptr, nullptr, nullptr);
-    registerSpecialUse({target}, {this});
+    NDArray::registerSpecialUse({target}, {this});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3375,9 +3380,9 @@ NDArray NDArray::transform(nd4j::transform::FloatOps op, void *extraParams) cons

    NDArray result(ordering(), getShapeAsVector(), DataTypeUtils::pickFloatingType(dataType()), getContext());

-    registerSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execTransformFloat(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo(), extraParams, nullptr, nullptr);
-    prepareSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -3389,9 +3394,9 @@ NDArray NDArray::transform(nd4j::transform::SameOps op, void *extraParams) const

    NDArray result(getShapeInfo(), false, getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execTransformSame(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo(), extraParams, nullptr, nullptr);
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -3403,9 +3408,9 @@ NDArray NDArray::transform(nd4j::transform::StrictOps op, void *extraParams) con

    NDArray result(getShapeInfo(), false, getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execTransformStrict(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo(), extraParams, nullptr, nullptr);
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -3417,9 +3422,9 @@ NDArray NDArray::transform(nd4j::transform::BoolOps op, void *extraParams) const

    NDArray result(ordering(), getShapeAsVector(), nd4j::DataType::BOOL, getContext());

-    prepareSpecialUse({&result}, {this});
+    NDArray::prepareSpecialUse({&result}, {this});
    NativeOpExecutioner::execTransformBool(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), result.buffer(), result.shapeInfo(), result.specialBuffer(), result.specialShapeInfo(), extraParams, nullptr, nullptr);
-    registerSpecialUse({&result}, {this});
+    NDArray::registerSpecialUse({&result}, {this});

    return result;
 }
@ -3435,9 +3440,9 @@ void NDArray::applyScalarArr(nd4j::scalar::Ops op, const NDArray* scalar, NDArra
    if(target->dataType() != DataTypeUtils::pickPairwiseResultType(shapeInfo(), scalar->getShapeInfo()) && !(target->dataType() == dataType() || target->dataType() == scalar->dataType()))
        throw std::invalid_argument("NDArray::applyScalarArr method: wrong type of target array!");

-    prepareSpecialUse({target}, {this, scalar});
+    NDArray::prepareSpecialUse({target}, {this, scalar});
    NativeOpExecutioner::execScalar(getContext(), op, buffer(), shapeInfo(), specialBuffer(), specialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), scalar->getBuffer(), scalar->getShapeInfo(), scalar->getSpecialBuffer(), scalar->getSpecialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()): nullptr);
-    registerSpecialUse({target}, {this, scalar});
+    NDArray::registerSpecialUse({target}, {this, scalar});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3471,10 +3476,9 @@ void NDArray::applyScalarArr(nd4j::scalar::BoolOps op, const NDArray* scalar, ND
        throw std::invalid_argument("NDArray::applyScalarArr bool method: this and scalar arrays must have the same type!");
    }

-    prepareSpecialUse({target}, {this, scalar});
+    NDArray::prepareSpecialUse({target}, {this, scalar});
    NativeOpExecutioner::execScalarBool(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), target->buffer(), target->shapeInfo(), target->specialBuffer(), target->specialShapeInfo(), scalar->getBuffer(), scalar->getShapeInfo(), scalar->getSpecialBuffer(), scalar->getSpecialShapeInfo(), extraParams != nullptr ? extraParams->argumentsAsT(target->dataType()): nullptr);
-
-    registerSpecialUse({target}, {this, scalar});
+    NDArray::registerSpecialUse({target}, {this, scalar});
 }

 ////////////////////////////////////////////////////////////////////////
@ -3557,7 +3561,7 @@ NDArray* NDArray::applyReduce3(nd4j::reduce3::Ops op, const NDArray* other, cons

    NDArray::prepareSpecialUse({result}, {this, other});
    NativeOpExecutioner::execReduce3Scalar(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), params, other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), result->buffer(), result->shapeInfo(), result->specialBuffer(), result->specialShapeInfo());
-    registerSpecialUse({result}, {this, other});
+    NDArray::registerSpecialUse({result}, {this, other});

    return result;
 }
@ -3635,9 +3639,9 @@ NDArray* NDArray::applyAllReduce3(nd4j::reduce3::Ops op, const NDArray *other, c

    auto pDims = nd4j::Environment::getInstance()->isCPU() ? copy.data() : nullptr;

-    prepareSpecialUse({result}, {this, other});
+    NDArray::prepareSpecialUse({result}, {this, other});
    NativeOpExecutioner::execReduce3All(getContext(), op, getBuffer(), getShapeInfo(), getSpecialBuffer(), getSpecialShapeInfo(), params, other->getBuffer(), other->getShapeInfo(), other->getSpecialBuffer(), other->getSpecialShapeInfo(), result->buffer(), result->shapeInfo(), result->specialBuffer(), result->specialShapeInfo(), pDims, copy.size(), packX.platformShapeInfo(), packX.platformOffsets(), packY.platformShapeInfo(), packY.platformOffsets());
-    registerSpecialUse({result}, {this, other});
+    NDArray::registerSpecialUse({result}, {this, other});

    return result;
 }
@ -3780,9 +3784,9 @@ void NDArray::p(const Nd4jLong i, const T value) {
    auto rp = getOffset(i);
    const void *pV = reinterpret_cast<const void*>(const_cast<T *>(&value));

-    preparePrimaryUse({this}, {}, true);
+    NDArray::preparePrimaryUse({this}, {}, true);
    BUILD_SINGLE_PARTIAL_SELECTOR(this->dataType(), templatedSet<, T>(this->getBuffer(), rp, pV), LIBND4J_TYPES);
-    registerPrimaryUse({this}, {});
+    NDArray::registerPrimaryUse({this}, {});
 }

 template void NDArray::p(const Nd4jLong i, const double value);
@ -3811,9 +3815,9 @@ void NDArray::p(const Nd4jLong i, const Nd4jLong j, const T value) {
    Nd4jLong coords[2] = {i, j};
    auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());

-    preparePrimaryUse({this}, {}, true);
+    NDArray::preparePrimaryUse({this}, {}, true);
    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), templatedSet<, T>(this->getBuffer(), xOffset, p), LIBND4J_TYPES);
-    registerPrimaryUse({this}, {});
+    NDArray::registerPrimaryUse({this}, {});
 }
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const double value);
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const float value);
@ -3837,13 +3841,13 @@ void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const T va
    if (rankOf() != 3 || i >= shapeOf()[0] || j >= shapeOf()[1] || k >= shapeOf()[2])
        throw std::invalid_argument("NDArray::p(i,j,k, value): one of input indexes is out of array length or rank!=3 !");

-    preparePrimaryUse({this}, {}, true);
+    NDArray::preparePrimaryUse({this}, {}, true);

    void *p = reinterpret_cast<void *>(const_cast<T *>(&value));
    Nd4jLong coords[3] = {i, j, k};
    auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());
    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), templatedSet<, T>(this->getBuffer(), xOffset, p), LIBND4J_TYPES);
-    registerPrimaryUse({this}, {});
+    NDArray::registerPrimaryUse({this}, {});
 }
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const double value);
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const float value);
@ -3870,9 +3874,9 @@ void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const Nd4j
    Nd4jLong coords[4] = {i, j, k, l};
    auto xOffset = shape::getOffset(0, shapeOf(), stridesOf(), coords, rankOf());

-    preparePrimaryUse({this}, {}, true);
+    NDArray::preparePrimaryUse({this}, {}, true);
    BUILD_SINGLE_PARTIAL_SELECTOR(dataType(), templatedSet<, T>(this->getBuffer(), xOffset, p), LIBND4J_TYPES);
-    registerPrimaryUse({this}, {});
+    NDArray::registerPrimaryUse({this}, {});
 }
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const Nd4jLong l, const double value);
 template void NDArray::p(const Nd4jLong i, const Nd4jLong j, const Nd4jLong k, const Nd4jLong l, const float value);
@ -3896,10 +3900,10 @@ void NDArray::p(const Nd4jLong i, const NDArray& scalar) {
    if (i >= _length)
        throw std::invalid_argument("NDArray::p(i, NDArray_scalar): input index is out of array length !");

-    preparePrimaryUse({this}, {&scalar}, true);
+    NDArray::preparePrimaryUse({this}, {&scalar}, true);
    auto rp = getOffset(i);
    BUILD_SINGLE_SELECTOR(scalar.dataType(), templatedSet, (getBuffer(), rp, scalar.dataType(), scalar.getBuffer()), LIBND4J_TYPES);
-    registerPrimaryUse({this}, {&scalar});
+    NDArray::registerPrimaryUse({this}, {&scalar});
 }

 //////////////////////////////////////////////////////////////////////////
@ -4195,7 +4199,7 @@ ResultSet* NDArray::allTensorsAlongDimension(const std::vector<int> &dimensions)


    auto pack = ConstantTadHelper::getInstance()->tadForDimensions(_shapeInfo, const_cast<int*>(dimensions.data()), dimensions.size());
-    auto numTads = lengthOf() / shape::length(pack.primaryShapeInfo());
+    auto numTads = pack.numberOfTads();

    for (int idx = 0; idx < numTads; idx++ ) {
        auto array = new NDArray(_buffer, ShapeDescriptor(pack.primaryShapeInfo()), getContext(), pack.primaryOffsets()[idx] + getBufferOffset());
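Note on the prepare/register pairing: two hunks in this file (applyTransform with StrictOps and transform with FloatOps) previously called registerSpecialUse before the kernel and prepareSpecialUse after it, and this change restores the prepare, execute, register order. One way to make that ordering hard to get wrong is a scope guard. The sketch below is illustrative only: TrackedArray, SpecialUseScope and runGuardedOp are hypothetical names, not part of libnd4j, and the event log merely stands in for the real host/device bookkeeping.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for an array; it only records lifecycle events in order.
struct TrackedArray {
    std::vector<std::string> events;
};

// Hypothetical scope guard: logs "prepare" on construction and "register" on
// destruction, so the two calls cannot be swapped or forgotten.
class SpecialUseScope {
public:
    explicit SpecialUseScope(std::vector<TrackedArray*> arrays)
        : arrays_(std::move(arrays)) {
        for (auto* a : arrays_) a->events.push_back("prepare");
    }
    ~SpecialUseScope() {
        for (auto* a : arrays_) a->events.push_back("register");
    }
    SpecialUseScope(const SpecialUseScope&) = delete;
    SpecialUseScope& operator=(const SpecialUseScope&) = delete;

private:
    std::vector<TrackedArray*> arrays_;
};

// Runs a fake "kernel" inside the guarded scope and returns the event order.
inline std::vector<std::string> runGuardedOp() {
    TrackedArray target;
    std::vector<TrackedArray*> writes{&target};
    {
        SpecialUseScope scope(writes);
        target.events.push_back("exec");  // the kernel launch would go here
    }
    return target.events;
}
```

With this shape the "register" step always follows the kernel call, including on early return or exception, which is the invariant the corrected hunks rely on.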
--- a/libnd4j/blas/NativeOps.h
+++ b/libnd4j/blas/NativeOps.h
@ -1578,6 +1578,20 @@ public:
            void *dx, Nd4jLong *dxShapeInfo,
            bool descending);

+    void sortByKey(Nd4jPointer *extraPointers,
+                   void *x, Nd4jLong *xShapeInfo,
+                   void *dx, Nd4jLong *dxShapeInfo,
+                   void *y, Nd4jLong *yShapeInfo,
+                   void *dy, Nd4jLong *dyShapeInfo,
+                   bool descending);
+
+    void sortByValue(Nd4jPointer *extraPointers,
+                     void *x, Nd4jLong *xShapeInfo,
+                     void *dx, Nd4jLong *dxShapeInfo,
+                     void *y, Nd4jLong *yShapeInfo,
+                     void *dy, Nd4jLong *dyShapeInfo,
+                     bool descending);
+
    void sortTad(Nd4jPointer *extraPointers,
            void *x, Nd4jLong *xShapeInfo,
            void *dx, Nd4jLong *dxShapeInfo,
@ -1587,6 +1601,24 @@ public:
            Nd4jLong *tadOffsets,
            bool descending);

+    void sortTadByKey(Nd4jPointer *extraPointers,
+                 void *x, Nd4jLong *xShapeInfo,
+                 void *dx, Nd4jLong *dxShapeInfo,
+                 void *y, Nd4jLong *yShapeInfo,
+                 void *dy, Nd4jLong *dyShapeInfo,
+                 int *dimension,
+                 int dimensionLength,
+                 bool descending);
+
+    void sortTadByValue(Nd4jPointer *extraPointers,
+                 void *x, Nd4jLong *xShapeInfo,
+                 void *dx, Nd4jLong *dxShapeInfo,
+                 void *y, Nd4jLong *yShapeInfo,
+                 void *dy, Nd4jLong *dyShapeInfo,
+                 int *dimension,
+                 int dimensionLength,
+                 bool descending);
+

    // special sort impl for sorting out COO indices and values
    void sortCooIndices(Nd4jPointer *extraPointers, Nd4jLong *indices, void *values, Nd4jLong length, int rank);
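Note on the new sort entry points: sortByKey and sortByValue each take two arrays, x and y. The diff does not spell out their semantics, so the sketch below shows only one plausible reading, stated as an assumption: x holds keys, y holds values, and sortByKey reorders both arrays by the keys (sortByValue would do the same using y as the ordering array). sortPairsByKey is a hypothetical helper written against std::vector, not the libnd4j implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Assumed semantics: reorder keys and values together so that keys end up sorted.
template <typename K, typename V>
void sortPairsByKey(std::vector<K>& keys, std::vector<V>& values, bool descending) {
    assert(keys.size() == values.size());

    // Sort a permutation of indices instead of the data itself.
    std::vector<std::size_t> order(keys.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) {
                         return descending ? keys[b] < keys[a] : keys[a] < keys[b];
                     });

    // Apply the permutation to both arrays.
    std::vector<K> sortedKeys;
    std::vector<V> sortedValues;
    sortedKeys.reserve(keys.size());
    sortedValues.reserve(values.size());
    for (std::size_t idx : order) {
        sortedKeys.push_back(keys[idx]);
        sortedValues.push_back(values[idx]);
    }
    keys.swap(sortedKeys);
    values.swap(sortedValues);
}
```

The TAD variants (sortTadByKey, sortTadByValue) additionally take dimension and dimensionLength, which suggests the same pairing applied per sub-array along the given dimensions; that is likewise an inference from the signatures.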
--- a/libnd4j/blas/cpu/NDArray.cpp
+++ b/libnd4j/blas/cpu/NDArray.cpp
@ -208,6 +208,23 @@ void* NDArray::specialBufferWithOffset(Nd4jLong offset) const {
    return nullptr;
 }

+////////////////////////////////////////////////////////////////////////
+void* NDArray::specialBuffer() {
+    if (_buffer->special() == nullptr)
+        return getBuffer();
+    // FIXME: this should be fixed once the CUDA backend is added
+    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
+}
+
+////////////////////////////////////////////////////////////////////////
+void* NDArray::getSpecialBuffer() const {
+    if (_buffer->special() == nullptr)
+        return getBuffer();
+    // FIXME: this should be fixed once the CUDA backend is added
+    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
+}
+
+
 //////////////////////////////////////////////////////////////////////////
 // change an array by repeating it the number of times given by reps.
 NDArray NDArray::tile(const std::vector<Nd4jLong>& reps) const {
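Note on the CPU specialBuffer()/getSpecialBuffer() added above: both fall back to the host buffer when no special (device) buffer exists, and otherwise advance the special pointer by _offset * sizeOfT() bytes. The sketch below restates that selection logic on raw pointers. pickBuffer and elementAddress are hypothetical names, and the host pointer is assumed to have its offset already applied, as getBuffer() presumably does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Advances a raw base pointer by a number of elements of the given byte size.
inline void* elementAddress(void* base, std::size_t elementOffset, std::size_t elementSize) {
    return static_cast<std::int8_t*>(base) + elementOffset * elementSize;
}

// Returns the special buffer (with offset applied) if present, else the host buffer.
// Assumption: `host` already points at the first element of the view.
inline void* pickBuffer(void* special, void* host,
                        std::size_t elementOffset, std::size_t elementSize) {
    if (special == nullptr)
        return host;
    return elementAddress(special, elementOffset, elementSize);
}
```

Casting to a one-byte type before the addition is what makes the offset a byte offset; arithmetic on void* itself is not valid standard C++.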
--- a/libnd4j/blas/cpu/NDArrayFactory.cpp
+++ b/libnd4j/blas/cpu/NDArrayFactory.cpp
@ -27,6 +27,52 @@

 namespace nd4j {

+    ////////////////////////////////////////////////////////////////////////
+    template <>
+    NDArray NDArrayFactory::create<bool>(const char order, const std::vector<Nd4jLong> &shape, const std::vector<bool> &data, nd4j::LaunchContext * context) {
+
+        if ((int) shape.size() > MAX_RANK)
+            throw std::invalid_argument("NDArrayFactory::create: rank of NDArray can't exceed 32 !");
+
+        ShapeDescriptor descriptor(nd4j::DataType::BOOL, order, shape);
+
+        if (descriptor.arrLength() != data.size()) {
+            nd4j_printf("NDArrayFactory::create: data size [%i] doesn't match shape length [%lld]\n", data.size(), descriptor.arrLength());
+            throw std::runtime_error("NDArrayFactory::create: data size doesn't match shape");
+        }
+
+        bool* hostBuffer = nullptr;
+        ALLOCATE(hostBuffer, context->getWorkspace(), data.size(), bool);
+        std::copy(data.begin(), data.end(), hostBuffer);
+
+        std::shared_ptr<DataBuffer> buffer = std::make_shared<DataBuffer>(hostBuffer, data.size() * sizeof(bool), nd4j::DataType::BOOL, true, context->getWorkspace());
+
+        NDArray result(buffer, descriptor, context);
+
+        return result;
+    }
+
+    ////////////////////////////////////////////////////////////////////////
+    template <typename T>
+    NDArray NDArrayFactory::create(const char order, const std::vector<Nd4jLong> &shape, const std::vector<T> &data, nd4j::LaunchContext * context) {
+
+        if ((int) shape.size() > MAX_RANK)
+            throw std::invalid_argument("NDArrayFactory::create: rank of NDArray can't exceed 32 !");
+
+        ShapeDescriptor descriptor(DataTypeUtils::fromT<T>(), order, shape);
+
+        if (descriptor.arrLength() != data.size()) {
+            nd4j_printf("NDArrayFactory::create: data size [%i] doesn't match shape length [%lld]\n", data.size(), descriptor.arrLength());
+            throw std::runtime_error("NDArrayFactory::create: data size doesn't match shape");
+        }
+
+        std::shared_ptr<DataBuffer> buffer = std::make_shared<DataBuffer>(data.data(), DataTypeUtils::fromT<T>(), descriptor.arrLength() * sizeof(T), context->getWorkspace());
+
+        NDArray result(buffer, descriptor, context);
+
+        return result;
+
+    }

    NDArray NDArrayFactory::string(const char *str, nd4j::LaunchContext * context) {
        std::string s(str);
@ -227,10 +273,13 @@ template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<float16> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<bfloat16> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<int> &data, nd4j::LaunchContext * context);
+template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<unsigned int> &data, nd4j::LaunchContext * context);
+template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<unsigned long> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<Nd4jLong> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<int8_t> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<uint8_t> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<int16_t> &data, nd4j::LaunchContext * context);
+template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<uint16_t> &data, nd4j::LaunchContext * context);
 template NDArray* NDArrayFactory::create_(const char order, const std::vector<Nd4jLong> &shape, const std::vector<bool> &data, nd4j::LaunchContext * context);


@ -391,6 +440,7 @@ template NDArray NDArrayFactory::create(const std::vector<bfloat16> &values, nd4
 template NDArray NDArrayFactory::create(const std::vector<Nd4jLong> &values, nd4j::LaunchContext * context);
 template NDArray NDArrayFactory::create(const std::vector<int> &values, nd4j::LaunchContext * context);
 template NDArray NDArrayFactory::create(const std::vector<int16_t> &values, nd4j::LaunchContext * context);
+template NDArray NDArrayFactory::create(const std::vector<uint16_t> &values, nd4j::LaunchContext * context);
 template NDArray NDArrayFactory::create(const std::vector<int8_t> &values, nd4j::LaunchContext * context);
 template NDArray NDArrayFactory::create(const std::vector<uint8_t> &values, nd4j::LaunchContext * context);
 template NDArray NDArrayFactory::create(const std::vector<bool> &values, nd4j::LaunchContext * context);
@ -452,53 +502,6 @@ template NDArray NDArrayFactory::create(const std::vector<bool> &values, nd4j::L
        return new NDArray(order, shape, dataType, context);
    }

-////////////////////////////////////////////////////////////////////////
-    template <typename T>
-    NDArray NDArrayFactory::create(const char order, const std::vector<Nd4jLong> &shape, const std::vector<T> &data, nd4j::LaunchContext * context) {
-
-        if ((int) shape.size() > MAX_RANK)
-            throw std::invalid_argument("NDArrayFactory::create: rank of NDArray can't exceed 32 !");
-
-        ShapeDescriptor descriptor(DataTypeUtils::fromT<T>(), order, shape);
-
-        if (descriptor.arrLength() != data.size()) {
-            nd4j_printf("NDArrayFactory::create: data size [%i] doesn't match shape length [%lld]\n", data.size(), descriptor.arrLength());
-            throw std::runtime_error("NDArrayFactory::create: data size doesn't match shape");
-        }
-
-        std::shared_ptr<DataBuffer> buffer = std::make_shared<DataBuffer>(data.data(), DataTypeUtils::fromT<T>(), descriptor.arrLength() * sizeof(T), context->getWorkspace());
-
-        NDArray result(buffer, descriptor, context);
-
-        return result;
-
-    }
-    ////////////////////////////////////////////////////////////////////////
-    template <>
-    NDArray NDArrayFactory::create<bool>(const char order, const std::vector<Nd4jLong> &shape, const std::vector<bool> &data, nd4j::LaunchContext * context) {
-
-        if ((int) shape.size() > MAX_RANK)
-            throw std::invalid_argument("NDArrayFactory::create: rank of NDArray can't exceed 32 !");
-
-        ShapeDescriptor descriptor(nd4j::DataType::BOOL, order, shape);
-
-        if (descriptor.arrLength() != data.size()) {
-            nd4j_printf("NDArrayFactory::create: data size [%i] doesn't match shape length [%lld]\n", data.size(), descriptor.arrLength());
-            throw std::runtime_error("NDArrayFactory::create: data size doesn't match shape");
-        }
-
-        bool* hostBuffer = nullptr;
-        ALLOCATE(hostBuffer, context->getWorkspace(), data.size(), bool);
-        std::copy(data.begin(), data.end(), hostBuffer);
-
-        std::shared_ptr<DataBuffer> buffer = std::make_shared<DataBuffer>(hostBuffer, data.size() * sizeof(bool), nd4j::DataType::BOOL, true, context->getWorkspace());
-
-        NDArray result(buffer, descriptor, context);
-
-        return result;
-
-    }
-
 ////////////////////////////////////////////////////////////////////////
 template <typename T>
 NDArray NDArrayFactory::create(T* buffer, const char order, const std::initializer_list<Nd4jLong>& shape, nd4j::LaunchContext * context) {
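Note on the bool specialization moved by this change: std::vector<bool> is a bit-packed specialization in the standard library and does not expose a bool* through data(), so the generic path that passes data.data() cannot be used for it. The specialization therefore copies element by element into a freshly allocated bool array. A minimal standalone sketch of that copy follows; toContiguousBools is a hypothetical helper, and std::unique_ptr is used here in place of the workspace allocation in the diff.

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Copies a bit-packed std::vector<bool> into a contiguous array of bool.
inline std::unique_ptr<bool[]> toContiguousBools(const std::vector<bool>& data) {
    std::unique_ptr<bool[]> out(new bool[data.size()]);
    // Iterators of vector<bool> yield proxy objects that convert to bool.
    std::copy(data.begin(), data.end(), out.get());
    return out;
}
```

Moving the specialization ahead of the primary template's other uses also matters in C++: an explicit specialization has to be declared before the first use that would otherwise trigger an implicit instantiation.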
--- a/libnd4j/blas/cpu/NativeOps.cpp
+++ b/libnd4j/blas/cpu/NativeOps.cpp
@ -2736,6 +2736,60 @@ Nd4jPointer NativeOps::shapeBufferForNumpy(Nd4jPointer npyArray) {
    return reinterpret_cast<Nd4jPointer>(shapeBuffer);
 }

+void NativeOps::sortByKey(Nd4jPointer *extraPointers,
+                          void *x, Nd4jLong *xShapeInfo,
+                          void *dx, Nd4jLong *dxShapeInfo,
+                          void *y, Nd4jLong *yShapeInfo,
+                          void *dy, Nd4jLong *dyShapeInfo,
+                          bool descending) {
+    auto xType = ArrayOptions::dataType(xShapeInfo);
+    auto yType = ArrayOptions::dataType(yShapeInfo);
+
+    BUILD_DOUBLE_SELECTOR(xType, yType, nd4j::DoubleMethods, ::sortByKey(x, xShapeInfo, y, yShapeInfo, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+}
+
+void NativeOps::sortByValue(Nd4jPointer *extraPointers,
+                            void *x, Nd4jLong *xShapeInfo,
+                            void *dx, Nd4jLong *dxShapeInfo,
+                            void *y, Nd4jLong *yShapeInfo,
+                            void *dy, Nd4jLong *dyShapeInfo,
+                            bool descending) {
+
+    auto xType = ArrayOptions::dataType(xShapeInfo);
+    auto yType = ArrayOptions::dataType(yShapeInfo);
+
+    BUILD_DOUBLE_SELECTOR(xType, yType, nd4j::DoubleMethods, ::sortByValue(x, xShapeInfo, y, yShapeInfo, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+}
+
+void NativeOps::sortTadByKey(Nd4jPointer *extraPointers,
+                  void *x, Nd4jLong *xShapeInfo,
+                  void *dx, Nd4jLong *dxShapeInfo,
+                  void *y, Nd4jLong *yShapeInfo,
+                  void *dy, Nd4jLong *dyShapeInfo,
+                  int *dimension,
+                  int dimensionLength,
+                  bool descending) {
+    auto xType = ArrayOptions::dataType(xShapeInfo);
+    auto yType = ArrayOptions::dataType(yShapeInfo);
+
+    BUILD_DOUBLE_SELECTOR(xType, yType, nd4j::DoubleMethods, ::sortTadByKey(x, xShapeInfo, y, yShapeInfo, dimension, dimensionLength, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+}
+
+void NativeOps::sortTadByValue(Nd4jPointer *extraPointers,
+                    void *x, Nd4jLong *xShapeInfo,
+                    void *dx, Nd4jLong *dxShapeInfo,
+                    void *y, Nd4jLong *yShapeInfo,
+                    void *dy, Nd4jLong *dyShapeInfo,
+                    int *dimension,
+                    int dimensionLength,
+                    bool descending) {
+    auto xType = ArrayOptions::dataType(xShapeInfo);
+    auto yType = ArrayOptions::dataType(yShapeInfo);
+
+    BUILD_DOUBLE_SELECTOR(xType, yType, nd4j::DoubleMethods, ::sortTadByValue(x, xShapeInfo, y, yShapeInfo, dimension, dimensionLength, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+}
+
+
 BUILD_SINGLE_TEMPLATE(template void flattenGeneric,(Nd4jPointer*, int, char, void*, Nd4jLong*, void*, Nd4jLong*), LIBND4J_TYPES);
 BUILD_SINGLE_TEMPLATE(template void pullRowsGeneric, (void *, Nd4jLong*, void*, Nd4jLong*, const int, Nd4jLong*, Nd4jLong*, Nd4jLong*, Nd4jLong*, Nd4jLong*), LIBND4J_TYPES);
 BUILD_SINGLE_TEMPLATE(template void tearGeneric, (void *, Nd4jLong*, Nd4jPointer*, Nd4jLong*, Nd4jLong*, Nd4jLong*), LIBND4J_TYPES);
--- a/libnd4j/blas/cuda/NDArray.cu
+++ b/libnd4j/blas/cuda/NDArray.cu
@@ -192,8 +192,8 @@ void NDArray::setIdentity() {
    if (isS())
        throw std::runtime_error("NDArray::setIdentity: you can't use this method on String array!");

-    if (rankOf() != 2)
-        throw std::runtime_error("NDArray::setIdentity: method should work only for 2D tensors. But " + toStringValue(rankOf()) + " was given.");
+    // if (rankOf() != 2)
+    //     throw std::runtime_error("NDArray::setIdentity: method should work only for 2D tensors. But " + toStringValue(rankOf()) + " was given.");

    const int threadsPerBlock = MAX_NUM_THREADS / 4;
    const int blocksPerGrid = (lengthOf() + threadsPerBlock - 1) / threadsPerBlock;
@@ -234,12 +234,15 @@ void NDArray::synchronize(const char* msg) const {
 void NDArray::prepareSpecialUse(const std::initializer_list<const NDArray*>& writeList, const std::initializer_list<const NDArray*>& readList, bool synchronizeWritables) {

    for (const auto& a : readList)
-        a->syncToDevice();
+        if(a != nullptr)
+            a->syncToDevice();

    for (const auto& a : writeList) {
-        a->getDataBuffer()->allocateSpecial();
-        if (synchronizeWritables)
-            a->syncToDevice();
+        if (a != nullptr) {
+            a->getDataBuffer()->allocateSpecial();
+            if (synchronizeWritables)
+                a->syncToDevice();
+        }
    }
 }

@@ -247,22 +250,27 @@ void NDArray::prepareSpecialUse(const std::initializer_list<const NDArray*>& wri
 void NDArray::registerSpecialUse(const std::initializer_list<const NDArray*>& writeList, const std::initializer_list<const NDArray*>& readList) {

    for (const auto& p : readList)
-        p->tickReadDevice();
+        if(p != nullptr)
+            p->tickReadDevice();

    for (const auto& p : writeList)
-        p->tickWriteDevice();
+        if (p != nullptr)
+            p->tickWriteDevice();
 }

 ////////////////////////////////////////////////////////////////////////
 void NDArray::preparePrimaryUse(const std::initializer_list<const NDArray*>& writeList, const std::initializer_list<const NDArray*>& readList, bool synchronizeWritables) {

    for (const auto& a : readList)
+        if(a != nullptr)
            a->syncToHost();

    for (const auto& a : writeList) {
-        a->getDataBuffer()->allocatePrimary();
-        if (synchronizeWritables)
-            a->syncToHost();
+        if (a != nullptr) {
+            a->getDataBuffer()->allocatePrimary();
+            if (synchronizeWritables)
+                a->syncToHost();
+        }
    }
 }

@@ -270,10 +278,12 @@ void NDArray::preparePrimaryUse(const std::initializer_list<const NDArray*>& wri
 void NDArray::registerPrimaryUse(const std::initializer_list<const NDArray*>& writeList, const std::initializer_list<const NDArray*>& readList) {

    for (const auto& p : readList)
-        p->tickReadHost();
+        if(p != nullptr)
+            p->tickReadHost();

    for (const auto& p : writeList)
-        p->tickWriteHost();
+        if (p != nullptr)
+            p->tickWriteHost();
 }

 //////////////////////////////////////////////////////////////////////////
@@ -427,9 +437,26 @@ void NDArray::repeat(int dimension, NDArray& target) const {
    NDArray::registerSpecialUse({&target}, {this});
 }

+////////////////////////////////////////////////////////////////////////
+void* NDArray::specialBuffer() {
+
+    if (_buffer->special() == nullptr)
+        return getBuffer();
+    // FIXME: this should be fixed once CUDA backend added
+    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
+}
+
+////////////////////////////////////////////////////////////////////////
+void* NDArray::getSpecialBuffer() const {
+    if (_buffer->special() == nullptr)
+        return getBuffer();
+    // FIXME: this should be fixed once CUDA backend added
+    return static_cast<int8_t*>(_buffer->special()) + (_offset * sizeOfT());
+}
+
 //////////////////////////////////////////////////////////////////////////
 template<typename T>
-void NDArray::printCurrentBuffer(const bool host, const char* msg, const int precision) const {\
+void NDArray::printCurrentBuffer(const bool host, const char* msg, const int precision) const {

    if(_length == 0)
            { printf("NDArray::printActualBuffer: array length is zero !\n"); return; }
@@ -477,7 +504,7 @@ template void NDArray::printCurrentBuffer<double>(const bool host, const char* m

 #if defined(__CUDACC__) && !defined(BUILD_TESTS)

-#include <cpu/NDArrayLambda.hpp>
+//#include <cpu/NDArrayLambda.hpp>

 #endif

--- a/libnd4j/blas/cuda/NativeOps.cu
+++ b/libnd4j/blas/cuda/NativeOps.cu
@@ -2321,6 +2321,163 @@ void NativeOps::sort(Nd4jPointer *extraPointers,
 }


+void NativeOps::sortByKey(Nd4jPointer *extraPointers,
+               void *x, Nd4jLong *xShapeInfo,
+               void *dX, Nd4jLong *dXShapeInfo,
+               void *y, Nd4jLong *yShapeInfo,
+               void *dy, Nd4jLong *dyShapeInfo,
+               bool descending) {
+
+    auto stream = reinterpret_cast<cudaStream_t *>(extraPointers[1]);
+
+    auto xLength = shape::length(xShapeInfo);
+    auto xEWS = shape::elementWiseStride(xShapeInfo);
+    auto xType = nd4j::ArrayOptions::dataType(xShapeInfo);
+    auto yType = nd4j::ArrayOptions::dataType(yShapeInfo);
+
+
+    // check if xLength is a power of 2, and use bitonic sort, if that's the case
+    if ((xLength != 0) && ((xLength & (xLength - 1)) == 0) && (xLength <= 1024 * 1024 * 10)) {
+        int numThreads = nd4j::math::nd4j_min<int>(512, xLength);
+        int numBlocks = xLength / numThreads;
+        if (xLength % numThreads > 0 || numBlocks == 0)
+            numBlocks++;
+
+        dim3 launchDims(numBlocks, numThreads, 32768);
+
+        for (int k = 2; k <= xLength; k = 2*k) {
+            for (int j = k >> 1; j > 0; j = j >> 1) {
+                BUILD_DOUBLE_SELECTOR(xType, yType, bitonicSortStepGenericKey, (launchDims, stream, dX, dXShapeInfo, dy, dyShapeInfo, j, k, xLength, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+            }
+        }
+    } else {
+        int numThreads = nd4j::math::nd4j_min<int>(512, xLength);
+        int numBlocks = xLength / numThreads;
+        if (xLength % numThreads > 0 || numBlocks == 0)
+            numBlocks++;
+
+        numBlocks = nd4j::math::nd4j_min<int>(512, numBlocks);
+        dim3 launchDims(numBlocks, numThreads, 32768);
+
+        int max = 2, dg = 0;
+        while (max < xLength) {
+            max <<= 1;
+            dg++;
+        }
+        max <<= 1;
+
+        for (int window = 2; window < max; window<<=1) {
+            int n = window;
+            int rev = 0;
+            do{
+                int half = n >> 1;
+                BUILD_DOUBLE_SELECTOR(xType, yType, bitonicArbitraryStepGenericKey, (launchDims, stream, dX, dXShapeInfo, dy, dyShapeInfo, n, xLength, rev, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+                n>>=1;
+                rev = 1;
+            } while(n > 1);
+        }
+    }
+}
+
+void NativeOps::sortByValue(Nd4jPointer *extraPointers,
+                 void *x, Nd4jLong *xShapeInfo,
+                 void *dX, Nd4jLong *dXShapeInfo,
+                 void *y, Nd4jLong *yShapeInfo,
+                 void *dy, Nd4jLong *dyShapeInfo,
+                 bool descending) {
+    auto stream = reinterpret_cast<cudaStream_t *>(extraPointers[1]);
+
+    auto xLength = shape::length(xShapeInfo);
+    auto xEWS = shape::elementWiseStride(xShapeInfo);
+    auto xType = nd4j::ArrayOptions::dataType(xShapeInfo);
+    auto yType = nd4j::ArrayOptions::dataType(yShapeInfo);
+
+
+    // check if xLength is a power of 2, and use bitonic sort, if that's the case
+    if ((xLength != 0) && ((xLength & (xLength - 1)) == 0) && (xLength <= 1024 * 1024 * 10)) {
+        int numThreads = nd4j::math::nd4j_min<int>(512, xLength);
+        int numBlocks = xLength / numThreads;
+        if (xLength % numThreads > 0 || numBlocks == 0)
+            numBlocks++;
+
+        dim3 launchDims(numBlocks, numThreads, 32768);
+
+        for (int k = 2; k <= xLength; k = 2*k) {
+            for (int j = k >> 1; j > 0; j = j >> 1) {
+                BUILD_DOUBLE_SELECTOR(xType, yType, bitonicSortStepGenericValue, (launchDims, stream, dX, dXShapeInfo, dy, dyShapeInfo, j, k, xLength, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+            }
+        }
+    } else {
+        int numThreads = nd4j::math::nd4j_min<int>(512, xLength);
+        int numBlocks = xLength / numThreads;
+        if (xLength % numThreads > 0 || numBlocks == 0)
+            numBlocks++;
+
+        numBlocks = nd4j::math::nd4j_min<int>(512, numBlocks);
+        dim3 launchDims(numBlocks, numThreads, 32768);
+
+        int max = 2, dg = 0;
+        while (max < xLength) {
+            max <<= 1;
+            dg++;
+        }
+        max <<= 1;
+
+        for (int window = 2; window < max; window<<=1) {
+            int n = window;
+            int rev = 0;
+            do{
+                int half = n >> 1;
+                BUILD_DOUBLE_SELECTOR(xType, yType, bitonicArbitraryStepGenericValue, (launchDims, stream, dX, dXShapeInfo, dy, dyShapeInfo, n, xLength, rev, descending), LIBND4J_TYPES, LIBND4J_TYPES);
+                n>>=1;
+                rev = 1;
+            } while(n > 1);
+        }
+    }
+}
+
+
+
+void NativeOps::sortTadByKey(Nd4jPointer *extraPointers,
+                             void *x, Nd4jLong *xShapeInfo,
+                             void *dX, Nd4jLong *dXShapeInfo,
+                             void *y, Nd4jLong *yShapeInfo,
+                             void *dy, Nd4jLong *dyShapeInfo,
+                             int *dimension,
+                             int dimensionLength,
+                             bool descending) {
+    auto stream = reinterpret_cast<cudaStream_t *>(extraPointers[1]);
+    auto context = extraPointers[0] == 0 ? LaunchContext::defaultContext(): reinterpret_cast<LaunchContext*>(extraPointers[0]);
+    auto tadPack = nd4j::ConstantTadHelper::getInstance()->tadForDimensions(xShapeInfo, dimension, dimensionLength);
+    dim3 launchDims((int) tadPack.numberOfTads(), 256, 2048);
+    auto xType = nd4j::ArrayOptions::dataType(xShapeInfo);
+    auto yType = nd4j::ArrayOptions::dataType(yShapeInfo);
+    BUILD_DOUBLE_SELECTOR(xType, yType, oesTadGenericKey, (launchDims, stream, dX, dXShapeInfo, dy, dyShapeInfo, nullptr, dimensionLength, tadPack.platformShapeInfo(), tadPack.platformOffsets(), descending), LIBND4J_TYPES, LIBND4J_TYPES);
+
+    nd4j::DebugHelper::checkErrorCode(stream, "sortTadKey(...) failed");
+}
+
+void NativeOps::sortTadByValue(Nd4jPointer *extraPointers,
+                               void *x, Nd4jLong *xShapeInfo,
+                               void *dX, Nd4jLong *dXShapeInfo,
+                               void *y, Nd4jLong *yShapeInfo,
+                               void *dy, Nd4jLong *dyShapeInfo,
+                               int *dimension,
+                               int dimensionLength,
+                               bool descending) {
+    auto stream = reinterpret_cast<cudaStream_t *>(extraPointers[1]);
+    auto context = extraPointers[0] == 0 ? LaunchContext::defaultContext(): reinterpret_cast<LaunchContext*>(extraPointers[0]);
+    auto tadPack = nd4j::ConstantTadHelper::getInstance()->tadForDimensions(xShapeInfo, dimension, dimensionLength);
+    dim3 launchDims((int) tadPack.numberOfTads(), 256, 2048);
+    auto xType = nd4j::ArrayOptions::dataType(yShapeInfo);
+    auto yType = nd4j::ArrayOptions::dataType(xShapeInfo);
+
+    BUILD_DOUBLE_SELECTOR(xType, yType, oesTadGenericKey, (launchDims, stream, dy, dyShapeInfo, dX, dXShapeInfo, nullptr, dimensionLength, tadPack.platformShapeInfo(), tadPack.platformOffsets(), descending), LIBND4J_TYPES, LIBND4J_TYPES);
+
+    nd4j::DebugHelper::checkErrorCode(stream, "sortTadValue(...) failed");
+}
+
+
 void NativeOps::sortTad(Nd4jPointer *extraPointers,
 						void *x, Nd4jLong *xShapeInfo,
 						void *dX, Nd4jLong *dXShapeInfo,
@@ -2331,15 +2488,13 @@ void NativeOps::sortTad(Nd4jPointer *extraPointers,
 						bool descending) {
    // to be implemented
    auto stream = reinterpret_cast<cudaStream_t *>(extraPointers[1]);
-
+    auto context = extraPointers[0] == 0 ? LaunchContext::defaultContext(): reinterpret_cast<LaunchContext*>(extraPointers[0]);
    auto tadPack = nd4j::ConstantTadHelper::getInstance()->tadForDimensions(xShapeInfo, dimension, dimensionLength);
-
-    dim3 launchDims(tadPack.numberOfTads(), 1024, 33768);
-
+    dim3 launchDims((int) tadPack.numberOfTads(), 512, 33768);
 	auto xType = nd4j::ArrayOptions::dataType(xShapeInfo);
-    BUILD_SINGLE_SELECTOR(xType, oesTadGeneric, (launchDims, stream, dX, dXShapeInfo, dimension, dimensionLength, tadShapeInfo, tadOffsets, descending), LIBND4J_TYPES);
+    BUILD_SINGLE_SELECTOR(xType, oesTadGeneric, (launchDims, stream, dX, dXShapeInfo, nullptr, dimensionLength, tadShapeInfo, tadOffsets, descending), LIBND4J_TYPES);

-    nd4j::DebugHelper::checkErrorCode(stream, "sortTadFloat(...) failed");
+    nd4j::DebugHelper::checkErrorCode(stream, "sortTad(...) failed");
 }

 void NativeOps::sortCooIndices(Nd4jPointer *extraPointers, Nd4jLong *indices, void *values, Nd4jLong length, int rank) {
--- a/libnd4j/include/array/ConstantDataBuffer.h
+++ b/libnd4j/include/array/ConstantDataBuffer.h
@@ -38,11 +38,11 @@ namespace nd4j {
        ConstantDataBuffer() = default;
        ~ConstantDataBuffer() = default;

-        Nd4jLong sizeOf();
-        Nd4jLong length();
+        Nd4jLong sizeOf() const;
+        Nd4jLong length() const;

-        Nd4jPointer primary();
-        Nd4jPointer special();
+        Nd4jPointer primary() const;
+        Nd4jPointer special() const;

        ConstantDataBuffer& operator=(const ConstantDataBuffer& other) = default;
        ConstantDataBuffer& operator=(ConstantDataBuffer&& other) noexcept = default;
--- a/libnd4j/include/array/DataBuffer.h
+++ b/libnd4j/include/array/DataBuffer.h
@@ -261,6 +261,8 @@ DataBuffer& DataBuffer::operator=(const DataBuffer& other) {

    allocateBuffers();
    copyBufferFrom(other);
+
+    return *this;
 }

 ////////////////////////////////////////////////////////////////////////
@@ -285,6 +287,8 @@ DataBuffer& DataBuffer::operator=(DataBuffer&& other) noexcept {
    other._primaryBuffer = other._specialBuffer = nullptr;
    other.setAllocFlags(false, false);
    other._lenInBytes = 0;
+
+    return *this;
 }

 ////////////////////////////////////////////////////////////////////////
--- a/libnd4j/include/array/DataTypeUtils.h
+++ b/libnd4j/include/array/DataTypeUtils.h
@@ -335,6 +335,8 @@ FORCEINLINE std::string DataTypeUtils::asString(DataType dataType) {
            return std::string("INT8");
        case INT16:
            return std::string("INT16");
+        case UINT16:
+            return std::string("UINT16");
        case INT32:
            return std::string("INT32");
        case INT64:
@@ -375,7 +377,7 @@ FORCEINLINE bool DataTypeUtils::castShapeInfo(const Nd4jLong *originalShapeInfo,
 ///////////////////////////////////////////////////////////////////
 // returns the difference between 1.0 and the next representable value of the given floating-point type
 template <typename T>
-FORCEINLINE T DataTypeUtils::eps() {
+FORCEINLINE _CUDA_HD T DataTypeUtils::eps() {
        if (std::is_same<T, double>::value)
            return std::numeric_limits<double>::epsilon();
        else if (std::is_same<T, float>::value)
--- a/libnd4j/include/array/ExtraArguments.h
+++ b/libnd4j/include/array/ExtraArguments.h
@@ -26,6 +26,7 @@
 #include <vector>
 #include <array/DataType.h>
 #include <pointercast.h>
+#include <stdlib.h>

 namespace nd4j {
    class ND4J_EXPORT ExtraArguments {
--- a/libnd4j/include/array/TadPack.h
+++ b/libnd4j/include/array/TadPack.h
@@ -35,21 +35,21 @@ namespace nd4j {
        TadPack() = default;
        ~TadPack() = default;

-        Nd4jLong* primaryShapeInfo();
-        Nd4jLong* primaryOffsets();
+        Nd4jLong* primaryShapeInfo() const;
+        Nd4jLong* primaryOffsets() const;

-        Nd4jLong* specialShapeInfo();
-        Nd4jLong* specialOffsets();
+        Nd4jLong* specialShapeInfo() const;
+        Nd4jLong* specialOffsets() const;

-        Nd4jLong numberOfTads();
-        int shapeInfoLength();
+        Nd4jLong numberOfTads() const;
+        int shapeInfoLength() const;

        /**
         * These methods return either primary or special pointers depending on platform binaries were compiled for
         * @return
         */
-        Nd4jLong *platformShapeInfo();
-        Nd4jLong *platformOffsets();
+        Nd4jLong *platformShapeInfo() const;
+        Nd4jLong *platformOffsets() const;
    };
 }

--- a/libnd4j/include/array/impl/ConstantDataBuffer.cpp
+++ b/libnd4j/include/array/impl/ConstantDataBuffer.cpp
@@ -28,19 +28,19 @@ namespace nd4j {
        _sizeOf = sizeOf;
    }

-    Nd4jPointer ConstantDataBuffer::primary() {
+    Nd4jPointer ConstantDataBuffer::primary() const {
        return _primaryBuffer;
    }

-    Nd4jPointer ConstantDataBuffer::special() {
+    Nd4jPointer ConstantDataBuffer::special() const {
        return _specialBuffer;
    }

-    Nd4jLong ConstantDataBuffer::sizeOf() {
+    Nd4jLong ConstantDataBuffer::sizeOf() const {
        return _sizeOf;
    }

-    Nd4jLong ConstantDataBuffer::length() {
+    Nd4jLong ConstantDataBuffer::length() const {
        return _length;
    }

--- a/libnd4j/include/array/impl/NDArrayList.cpp
+++ b/libnd4j/include/array/impl/NDArrayList.cpp
@@ -54,7 +54,7 @@ namespace nd4j {
    NDArray* NDArrayList::readRaw(int idx) {
        if (_chunks.count(idx) < 1) {
            nd4j_printf("Non-existent chunk requested: [%i]\n", idx);
-            throw std::runtime_error("Bad index");
+            throw std::invalid_argument("Bad index");
        }

        return _chunks[idx];
@@ -120,7 +120,7 @@ namespace nd4j {
        // storing reference
        _chunks[idx] = array;

-        return ND4J_STATUS_OK;
+        return Status::OK();
    }

    std::vector<Nd4jLong>& NDArrayList::shape() {
@@ -152,8 +152,10 @@ namespace nd4j {
        std::vector<bool> bargs;
        int numElements = _elements.load();

-        for (int e = 0; e < numElements; e++)
+        for (int e = 0; e < numElements; e++) {
+            _chunks[e]->syncToDevice();
            inputs.emplace_back(_chunks[e]);
+        }

        iargs.push_back(_axis);

--- a/libnd4j/include/array/impl/TadPack.cpp
+++ b/libnd4j/include/array/impl/TadPack.cpp
@@ -29,34 +29,34 @@ namespace nd4j {
        _numTads = numTads;
    }

-    Nd4jLong* TadPack::primaryShapeInfo() {
+    Nd4jLong* TadPack::primaryShapeInfo() const {
        return reinterpret_cast<Nd4jLong *>(_tadShape.primary());
    }
-    Nd4jLong* TadPack::primaryOffsets() {
+    Nd4jLong* TadPack::primaryOffsets() const {
        return reinterpret_cast<Nd4jLong *>(_tadOffsets.primary());
    }

-    Nd4jLong* TadPack::specialShapeInfo() {
+    Nd4jLong* TadPack::specialShapeInfo() const {
        return reinterpret_cast<Nd4jLong *>(_tadShape.special());
    }

-    Nd4jLong* TadPack::specialOffsets() {
+    Nd4jLong* TadPack::specialOffsets() const {
        return reinterpret_cast<Nd4jLong *>(_tadOffsets.special());
    }

-    Nd4jLong TadPack::numberOfTads() {
+    Nd4jLong TadPack::numberOfTads() const {
        return _numTads;
    }

-    Nd4jLong* TadPack::platformShapeInfo() {
+    Nd4jLong* TadPack::platformShapeInfo() const {
        return nd4j::Environment::getInstance()->isCPU() ? primaryShapeInfo() : specialShapeInfo();
    }

-    Nd4jLong* TadPack::platformOffsets() {
+    Nd4jLong* TadPack::platformOffsets() const {
        return nd4j::Environment::getInstance()->isCPU() ? primaryOffsets() : specialOffsets();
    }

-    int TadPack::shapeInfoLength() {
+    int TadPack::shapeInfoLength() const {
        return (int) shape::shapeInfoLength(primaryShapeInfo());
    }
 }
--- a/libnd4j/include/helpers/AttentionHelper.h
+++ b/libnd4j/include/helpers/AttentionHelper.h
@@ -27,7 +27,7 @@ namespace nd4j {
    class AttentionHelper {

    public:
-        static nd4j::NDArray* multiHeadProject(const nd4j::NDArray* input, const nd4j::NDArray* projectionMatrix, nd4j::LaunchContext * context = nd4j::LaunchContext ::defaultContext());
+        static nd4j::NDArray multiHeadProject(const nd4j::NDArray* input, const nd4j::NDArray* projectionMatrix, nd4j::LaunchContext * context = nd4j::LaunchContext ::defaultContext());
        static void multiHeadProjectBp(const nd4j::NDArray* input, const nd4j::NDArray* projectionMatrix, const nd4j::NDArray* eps, nd4j::NDArray* dLdInput, nd4j::NDArray* dLdProjectionMatrix, nd4j::LaunchContext * context = nd4j::LaunchContext ::defaultContext());
    };
 }
--- a/libnd4j/include/helpers/benchmark/MatrixBenchmark.h
+++ b/libnd4j/include/helpers/benchmark/MatrixBenchmark.h
@@ -69,10 +69,10 @@ namespace nd4j {
        }

        void executeOnce() override {
-            auto xT = (_tA ? _x->transpose() : _x);
-            auto yT = (_tB ? _y->transpose() : _y);
+            auto xT = (_tA ? _x->transpose() : *_x);
+            auto yT = (_tB ? _y->transpose() : *_y);

-            MmulHelper::mmul(xT, yT, _z, _alpha, _beta);
+            MmulHelper::mmul(&xT, &yT, _z, _alpha, _beta);
        }

        std::string axis() override {
--- a/libnd4j/include/helpers/cpu/householder.cpp
+++ b/libnd4j/include/helpers/cpu/householder.cpp
@@ -133,10 +133,10 @@ void Householder<T>::mulLeft(NDArray& matrix, const NDArray& tail, const T coeff
 	// if(matrix.rankOf() != 2)
 	// 	throw "ops::helpers::Householder::mulLeft method: input array must be 2D matrix !";

-	if(matrix.sizeAt(0) == 1)   
-    	matrix *= (T)1.f - coeff;
-  	
-  	else if(coeff != (T)0.f) {
+	if(matrix.sizeAt(0) == 1) {
+        matrix *= (T) 1.f - coeff;
+    }
+    else if(coeff != (T)0.f) {

  		auto bottomPart = new NDArray(matrix({1,matrix.sizeAt(0), 0,0}, true));
 		auto bottomPartCopy = *bottomPart;
@@ -145,13 +145,11 @@ void Householder<T>::mulLeft(NDArray& matrix, const NDArray& tail, const T coeff

 			auto column = tail;
 			auto row = tail.transpose();
-    		auto resultingRow = mmul(*row, bottomPartCopy);
+    		auto resultingRow = mmul(row, bottomPartCopy);
    		auto fistRow = matrix({0,1, 0,0}, true);
    		resultingRow += fistRow;
    		fistRow -= resultingRow * coeff;
    		*bottomPart -= mmul(column, resultingRow) * coeff;
-
-			delete row;
 		}
 		else {

@@ -161,9 +159,7 @@ void Householder<T>::mulLeft(NDArray& matrix, const NDArray& tail, const T coeff
    		auto fistRow = matrix({0,1, 0,0}, true);
    		resultingRow += fistRow;
    		fistRow -= resultingRow * coeff;
-    		*bottomPart -= mmul(*column, resultingRow) * coeff;    	
-
-			delete column;
+    		*bottomPart -= mmul(column, resultingRow) * coeff;
 		}
 		delete bottomPart;
 	}
@@ -193,21 +189,16 @@ void Householder<T>::mulRight(NDArray& matrix, const NDArray& tail, const T coef
    		auto resultingCol = mmul(rightPartCopy, column);
    		resultingCol += *fistCol;
    		*fistCol -= resultingCol * coeff;
-    		*rightPart -= mmul(resultingCol, *row) * coeff;    		
-
-			delete row;			
+    		*rightPart -= mmul(resultingCol, row) * coeff;
 		}
 		else {

 			auto row = tail;
 			auto column = tail.transpose();
-    		auto resultingCol = mmul(rightPartCopy, *column);
+    		auto resultingCol = mmul(rightPartCopy, column);
    		resultingCol += *fistCol;
    		*fistCol -= resultingCol * coeff;
    		*rightPart -= mmul(resultingCol, row) * coeff;
-
-			delete column;
-			
 		}
  		delete rightPart;
  		delete fistCol;
--- a/libnd4j/include/helpers/cpu/jacobiSVD.cpp
+++ b/libnd4j/include/helpers/cpu/jacobiSVD.cpp
@@ -157,8 +157,7 @@ bool JacobiSVD<T>::isBlock2x2NotDiag(NDArray& block, int p, int q, T& maxElem) {

        if(_calcU) {
            auto temp2 = rotation.transpose();
-            mulRotationOnRight(p, q, _u, *temp2);
-            delete temp2;
+            mulRotationOnRight(p, q, _u, temp2);
        }
    }

@@ -251,9 +250,7 @@ void JacobiSVD<T>::svd2x2(const NDArray& block, int p, int q, NDArray& left, NDA
    m.p<T>(1, 1, _z);

    auto temp = right.transpose();
-    left.assign(mmul(rotation, *temp));
-    delete temp;
-
+    left.assign(mmul(rotation, temp));
 }


@@ -289,7 +286,7 @@ void JacobiSVD<T>::evalData(const NDArray& matrix) {
    else if(_rows < _cols) {

        auto matrixT = matrix.transpose();
-        HHcolPivQR qr(*matrixT / scale);
+        HHcolPivQR qr(matrixT / scale);
        _m.assign(qr._qr({0,_rows, 0,_rows}));
        _m.fillAsTriangular<T>(0., 0, 0, 'l');
        _m.transposei();
@@ -305,8 +302,6 @@ void JacobiSVD<T>::evalData(const NDArray& matrix) {

        if(_calcU)
            _u.assign(qr._permut);
-
-        delete matrixT;
    }
    else {

@@ -352,8 +347,7 @@ void JacobiSVD<T>::evalData(const NDArray& matrix) {

                        if(_calcU) {
                            auto temp = rotLeft.transpose();
-                            mulRotationOnRight(p, q, _u, *temp);
-                            delete temp;
+                            mulRotationOnRight(p, q, _u, temp);
                        }

                        mulRotationOnRight(p, q, _m, rotRight);
--- a/libnd4j/include/helpers/cpu/svd.cpp
+++ b/libnd4j/include/helpers/cpu/svd.cpp
@@ -920,7 +920,7 @@ void SVD<T>::evalData(const NDArray& matrix) {
    auto temp1 = biDiag._HHbidiag.transpose();
    auto temp2 = _m({0,_diagSize, 0,0}, true);
    temp2.assign(temp1);
-    delete temp1;   
+

    auto temp3 = _m({_m.sizeAt(0)-1,_m.sizeAt(0), 0,0}, true);
    temp3.assign(0.);
--- a/libnd4j/include/helpers/cuda_off/MmulHelper.cu
+++ b/libnd4j/include/helpers/cuda_off/MmulHelper.cu
@@ -184,9 +184,9 @@ NDArray* MmulHelper::mmulMxM(const NDArray* A, const NDArray* B, NDArray* C, dou

    if(pC->ordering() != 'f') {
        auto temp = pA;
-        pA = pB  ->permute({1,0});
-        pB = temp->permute({1,0});
-        pC = pC  ->permute({1,0});
+        pA = new NDArray(pB  ->permute({1,0}));
+        pB = new NDArray(temp->permute({1,0}));
+        pC = new NDArray(pC  ->permute({1,0}));
        toDelete.push_back(pA);
        toDelete.push_back(pB);
        toDelete.push_back(pC);
@@ -251,7 +251,8 @@ NDArray* MmulHelper::mmulMxM(const NDArray* A, const NDArray* B, NDArray* C, dou
            blocksPerGrid.y = math::nd4j_ceil<double, int>(static_cast<double>(M) / threadsPerBlock.y);    // rows
        }

-        BUILD_TRIPLE_SELECTOR(aType, bType, cType, usualGemm, (blocksPerGrid, threadsPerBlock, stream, transA, transB, M, N, K, alpha, pA->getSpecialBuffer(), lda, pB->getSpecialBuffer(), ldb, beta, pC->getSpecialBuffer(), ldc), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+        //BUILD_TRIPLE_SELECTOR(aType, bType, cType, usualGemm, (blocksPerGrid, threadsPerBlock, stream, transA, transB, M, N, K, alpha, pA->getSpecialBuffer(), lda, pB->getSpecialBuffer(), ldb, beta, pC->getSpecialBuffer(), ldc), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+        BUILD_SINGLE_SELECTOR_THRICE(aType, usualGemm, (blocksPerGrid, threadsPerBlock, stream, transA, transB, M, N, K, alpha, pA->getSpecialBuffer(), lda, pB->getSpecialBuffer(), ldb, beta, pC->getSpecialBuffer(), ldc), LIBND4J_TYPES)
    }

    if (status != CUBLAS_STATUS_SUCCESS) throw cuda_exception::build("MmulHelper::mmulMxM cuda failed !", status);
@@ -339,7 +340,8 @@ NDArray* MmulHelper::mmulMxV(const NDArray* A, const NDArray* X, nd4j::NDArray*
            threadsPerBlock.x = 512;
            blocksPerGrid.x = math::nd4j_ceil<double, int>(static_cast<double>(M) / threadsPerBlock.x);    // rows
        }
-        BUILD_TRIPLE_SELECTOR(aType, xType, yType, usualGemv, (blocksPerGrid, threadsPerBlock, stream, transA, M, N, alpha, pA->getSpecialBuffer(), lda, X->getSpecialBuffer(), incx, beta, Y->getSpecialBuffer(), incy), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+        //BUILD_TRIPLE_SELECTOR(aType, xType, yType, usualGemv, (blocksPerGrid, threadsPerBlock, stream, transA, M, N, alpha, pA->getSpecialBuffer(), lda, X->getSpecialBuffer(), incx, beta, Y->getSpecialBuffer(), incy), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+        BUILD_SINGLE_SELECTOR_THRICE(xType, usualGemv, (blocksPerGrid, threadsPerBlock, stream, transA, M, N, alpha, pA->getSpecialBuffer(), lda, X->getSpecialBuffer(), incx, beta, Y->getSpecialBuffer(), incy), LIBND4J_TYPES)
    }

    if (status != CUBLAS_STATUS_SUCCESS) throw cuda_exception::build("MmulHelper::mmulMxV cuda failed !", status);
@@ -396,7 +398,8 @@ NDArray* MmulHelper::dot(const NDArray* X, const NDArray* Y, nd4j::NDArray* Z, c

    NDArray::prepareSpecialUse({Z}, {X, Y});

-    BUILD_TRIPLE_SELECTOR(xType, yType, zType, usualDot, (blocksPerGrid, threadsPerBlock, stream, length, alpha, X->getSpecialBuffer(), incx, Y->getSpecialBuffer(), incy, beta, Z->getSpecialBuffer()), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+    //BUILD_TRIPLE_SELECTOR(xType, yType, zType, usualDot, (blocksPerGrid, threadsPerBlock, stream, length, alpha, X->getSpecialBuffer(), incx, Y->getSpecialBuffer(), incy, beta, Z->getSpecialBuffer()), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+    BUILD_SINGLE_SELECTOR_THRICE(xType, usualDot, (blocksPerGrid, threadsPerBlock, stream, length, alpha, X->getSpecialBuffer(), incx, Y->getSpecialBuffer(), incy, beta, Z->getSpecialBuffer()), LIBND4J_TYPES)

    auto cudaResult = cudaStreamSynchronize(*stream);
    if (cudaResult != 0) throw cuda_exception::build("MmulHelper::dot cuda failed !", cudaResult);
@ -406,8 +409,8 @@ NDArray* MmulHelper::dot(const NDArray* X, const NDArray* Y, nd4j::NDArray* Z, c
    return Z;
 }

-BUILD_TRIPLE_TEMPLATE(template void usualGemm, (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const bool transA, const bool transB, const int M, const int N, const int K, const double alpha, const void* vA, const int lda, const void* vB, const int ldb, const double beta, void* vC, const int ldc), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
-BUILD_TRIPLE_TEMPLATE(template void usualGemv, (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const bool transA, const int M, const int N, const double alpha, const void* vA, const int lda, const void* vB, const int incx, const double beta, void* vC, const int incy), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
-BUILD_TRIPLE_TEMPLATE(template void usualDot,  (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const Nd4jLong length, const double alpha, const void* vX, const Nd4jLong incx, const void* vY, const Nd4jLong incy, const double beta, void* vZ), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+//BUILD_TRIPLE_TEMPLATE(template void usualGemm, (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const bool transA, const bool transB, const int M, const int N, const int K, const double alpha, const void* vA, const int lda, const void* vB, const int ldb, const double beta, void* vC, const int ldc), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+//BUILD_TRIPLE_TEMPLATE(template void usualGemv, (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const bool transA, const int M, const int N, const double alpha, const void* vA, const int lda, const void* vB, const int incx, const double beta, void* vC, const int incy), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);
+//BUILD_TRIPLE_TEMPLATE(template void usualDot,  (const dim3 &blocksPerGrid, const dim3 &threadsPerBlock, cudaStream_t *stream, const Nd4jLong length, const double alpha, const void* vX, const Nd4jLong incx, const void* vY, const Nd4jLong incy, const double beta, void* vZ), LIBND4J_TYPES, FLOAT_TYPES, FLOAT_TYPES);

 }
--- a/libnd4j/include/helpers/impl/AttentionHelper.cpp
+++ b/libnd4j/include/helpers/impl/AttentionHelper.cpp
@ -28,33 +28,27 @@

 namespace nd4j {

-    nd4j::NDArray *
-    AttentionHelper::multiHeadProject(const nd4j::NDArray *input, const nd4j::NDArray *projectionMatrix, nd4j::LaunchContext * context) {
+    nd4j::NDArray AttentionHelper::multiHeadProject(const nd4j::NDArray *input, const nd4j::NDArray *projectionMatrix, nd4j::LaunchContext * context) {
        auto miniBatchSize = input->sizeAt(0);
        auto seqLength = input->sizeAt(2);
        auto numHeads = projectionMatrix->sizeAt(0);
        auto projectedSize = projectionMatrix->sizeAt(1);

        auto inputPerm = input->permute({1, 0, 2});
-        auto inputPrep = inputPerm->reshape('c', {input->sizeAt(1), (miniBatchSize * seqLength)});
+        auto inputPrep = inputPerm.reshape('c', {input->sizeAt(1), (miniBatchSize * seqLength)});
        auto projectionPrep = projectionMatrix->reshape('c', {numHeads * projectionMatrix->sizeAt(1), projectionMatrix->sizeAt(2)});

-        NDArray* projected = new NDArray('c', {numHeads * projectionMatrix->sizeAt(1), (miniBatchSize * seqLength)}, input->dataType(), context);
+        NDArray projected('c', {numHeads * projectionMatrix->sizeAt(1), (miniBatchSize * seqLength)}, input->dataType(), context);
        nd4j::ops::matmul mmul;
-        mmul.execute({projectionPrep, inputPrep}, {projected},  {}, {}, {});
+        mmul.execute({&projectionPrep, &inputPrep}, {&projected},  {}, {}, {});

-        projected->reshapei({numHeads, projectedSize, miniBatchSize, seqLength});
-        projected->permutei({2, 0, 1, 3});
-
-        delete inputPerm;
-        delete inputPrep;
-        delete projectionPrep;
+        projected.reshapei({numHeads, projectedSize, miniBatchSize, seqLength});
+        projected.permutei({2, 0, 1, 3});

        return projected;
    }

-    void
-    AttentionHelper::multiHeadProjectBp(const nd4j::NDArray *input, const nd4j::NDArray *projectionMatrix,
+    void AttentionHelper::multiHeadProjectBp(const nd4j::NDArray *input, const nd4j::NDArray *projectionMatrix,
                                        const nd4j::NDArray *eps, nd4j::NDArray *dLdInput,
                                        nd4j::NDArray *dLdProjectionMatrix, nd4j::LaunchContext * context) {
        auto miniBatchSize = input->sizeAt(0);
@ -63,16 +57,16 @@ namespace nd4j {
        auto projectedSize = projectionMatrix->sizeAt(1);

        auto epsPerm = eps->permute({1, 2, 0, 3});
-        auto epsReshaped = epsPerm->reshape('c', {numHeads * projectedSize, miniBatchSize * seqLength});
+        auto epsReshaped = epsPerm.reshape('c', {numHeads * projectedSize, miniBatchSize * seqLength});

        auto inputPerm = input->permute({1, 0, 2});
-        auto inputPrep = inputPerm->reshape('c', {input->sizeAt(1), (miniBatchSize * seqLength)});
+        auto inputPrep = inputPerm.reshape('c', {input->sizeAt(1), (miniBatchSize * seqLength)});
        auto projectionPrep = projectionMatrix->reshape('c', {numHeads * projectionMatrix->sizeAt(1), projectionMatrix->sizeAt(2)});

        nd4j::ops::matmul_bp mmulBp;
-        NDArray dLdProjectionPrep(projectionPrep->shapeInfo(), false, context);
-        NDArray dLdInputPrep(inputPrep->shapeInfo(), false, context);
-        mmulBp.execute({projectionPrep, inputPrep, epsReshaped}, {&dLdProjectionPrep, &dLdInputPrep}, {}, {}, {});
+        NDArray dLdProjectionPrep(projectionPrep.shapeInfo(), false, context);
+        NDArray dLdInputPrep(inputPrep.shapeInfo(), false, context);
+        mmulBp.execute({&projectionPrep, &inputPrep, &epsReshaped}, {&dLdProjectionPrep, &dLdInputPrep}, {}, {}, {});

        dLdProjectionPrep.reshapei({numHeads, projectionMatrix->sizeAt(1), projectionMatrix->sizeAt(2)});
        dLdProjectionMatrix->assign(dLdProjectionPrep);
@ -80,12 +74,6 @@ namespace nd4j {
        dLdInputPrep.reshapei({input->sizeAt(1), miniBatchSize, seqLength});
        dLdInputPrep.permutei({1, 0, 2});
        dLdInput->assign(dLdInputPrep);
-
-        delete inputPerm;
-        delete inputPrep;
-        delete epsPerm;
-        delete epsReshaped;
-        delete projectionPrep;
    }
 }

--- a/libnd4j/include/helpers/impl/GradCheck.cpp
+++ b/libnd4j/include/helpers/impl/GradCheck.cpp
@ -53,7 +53,7 @@ void GradCheck::fillGradArrays(const LossFunc loss, const std::vector<NDArray*>&
 bool GradCheck::checkGrad(ops::DeclarableOp& opFF, ops::DeclarableOp& opBP, const OpArgsHolder& argsHolderFF, const OpArgsHolder& argsHolderBP,
 	                      const std::vector<bool>& whatArrsToCheck, const std::vector<double>& idxRange, const LossFunc loss ) {

-	const int numInArrsFF     = argsHolderFF.getNumInArrs();						// also numInArrsFF = number of output arrays in opBP
+	const int numInArrsFF     = argsHolderFF.getNumInArrs();						// at the same time numInArrsFF = number of output arrays in opBP
 	const int numInGradArrsBP = argsHolderBP.getNumInArrs() - numInArrsFF;  		// because argsHolderBP.getNumInArrs() = numInArrsFF + numInGradArrsBP
 	const std::vector<NDArray*>& inArrsFF = argsHolderFF.getInArrs();
 	const std::vector<NDArray*>& inArrsBP = argsHolderBP.getInArrs();
@ -65,6 +65,7 @@ bool GradCheck::checkGrad(ops::DeclarableOp& opFF, ops::DeclarableOp& opBP, cons
 	ResultSet* outArrsBP = opBP.execute(argsHolderBP);		// number of output arrays in back prop = numInArrsFF;

 	NDArray tmpScalar(nd4j::DataType::DOUBLE, inArrsFF[0]->getContext()); // scalar = 0
+
 	for(int i = 0; i < numInArrsFF; ++i) {							// loop through input array

 		if(!whatArrsToCheck.empty() && static_cast<bool>(whatArrsToCheck[i]) == false)
@ -75,39 +76,39 @@ bool GradCheck::checkGrad(ops::DeclarableOp& opFF, ops::DeclarableOp& opBP, cons

 		for(Nd4jLong j = idxStart; j < idxEnd; ++j) {			// loop through all elements for current array

-			double& elem = inArrsFF[i]->t<double>(j);
-			const double orig = elem;
+			const double orig = inArrsFF[i]->e<double>(j);

 			// add epsilon, feed forward
-			elem = orig + EPSILON;
+			inArrsFF[i]->p<double>(j, orig + EPSILON);
 			ResultSet* outArrsFF = opFF.execute(argsHolderFF);
 			int numOutArrs = outArrsFF->size();
 			double scorePlus = 0.;
-			for(int k = 0; k < numOutArrs; ++k) {                // loop through output array
+
+			for(int k = 0; k < numOutArrs; ++k) {                // loop through output arrays
 				if(loss == SUM)
-					NativeOpExecutioner::execReduceSameScalar(LaunchContext::defaultContext(), reduce::Sum, outArrsFF->at(k)->getBuffer(), outArrsFF->at(k)->getShapeInfo(), outArrsFF->at(k)->getSpecialBuffer(), outArrsFF->at(k)->getSpecialShapeInfo(), nullptr, tmpScalar.buffer(), tmpScalar.shapeInfo(), tmpScalar.specialBuffer(), tmpScalar.specialShapeInfo());
+					outArrsFF->at(k)->reduceNumber(reduce::Sum, tmpScalar);
 				else
-					NativeOpExecutioner::execReduceFloatScalar(LaunchContext::defaultContext(), reduce::Mean, outArrsFF->at(k)->getBuffer(), outArrsFF->at(k)->getShapeInfo(), outArrsFF->at(k)->getSpecialBuffer(), outArrsFF->at(k)->getSpecialShapeInfo(), nullptr, tmpScalar.buffer(), tmpScalar.shapeInfo(), tmpScalar.specialBuffer(), tmpScalar.specialShapeInfo());
+					outArrsFF->at(k)->reduceNumber(reduce::Mean, tmpScalar);
 				scorePlus += tmpScalar.e<double>(0);
 			}
 			delete outArrsFF;

 			// subtract epsilon, feed forward
-			elem = orig - EPSILON;
+			inArrsFF[i]->p<double>(j, orig - EPSILON);
 			outArrsFF = opFF.execute(argsHolderFF);
 			double scoreMinus = 0.;

-			for(int k = 0; k < numOutArrs; ++k) {            // loop through output array
+			for(int k = 0; k < numOutArrs; ++k) {            // loop through output arrays
 				if(loss == SUM)
-					NativeOpExecutioner::execReduceSameScalar(LaunchContext::defaultContext(), reduce::Sum, outArrsFF->at(k)->getBuffer(), outArrsFF->at(k)->getShapeInfo(), outArrsFF->at(k)->getSpecialBuffer(), outArrsFF->at(k)->getSpecialShapeInfo(), nullptr, tmpScalar.buffer(), tmpScalar.shapeInfo(), tmpScalar.specialBuffer(), tmpScalar.specialShapeInfo());
+					outArrsFF->at(k)->reduceNumber(reduce::Sum, tmpScalar);
 				else
-					NativeOpExecutioner::execReduceFloatScalar(LaunchContext::defaultContext(), reduce::Mean, outArrsFF->at(k)->getBuffer(), outArrsFF->at(k)->getShapeInfo(), outArrsFF->at(k)->getSpecialBuffer(), outArrsFF->at(k)->getSpecialShapeInfo(), nullptr, tmpScalar.buffer(), tmpScalar.shapeInfo(), tmpScalar.specialBuffer(), tmpScalar.specialShapeInfo());
+					outArrsFF->at(k)->reduceNumber(reduce::Mean, tmpScalar);
 				scoreMinus += tmpScalar.e<double>(0);
 			}
 			delete outArrsFF;

 			// restore initial element value
-			elem = orig;
+			inArrsFF[i]->p<double>(j, orig);

 			// calculate numerical gradient
 			const double numericalGrad = (scorePlus - scoreMinus) / (2 * EPSILON);
--- a/libnd4j/include/helpers/impl/MmulHelper.cpp
+++ b/libnd4j/include/helpers/impl/MmulHelper.cpp
@ -43,22 +43,19 @@ nd4j::NDArray* nd4j::MmulHelper::tensorDot(const nd4j::NDArray* a, const nd4j::N

    auto outShape = ShapeUtils::evalShapeForTensorDot(a, b, axes_0, axes_1, permutAt, permutBt, shapeAt, shapeBt);

-    NDArray* aPR = a->permute(permutAt);
-    NDArray* bPR = b->permute(permutBt);
+    NDArray aPR = a->permute(permutAt);
+    NDArray bPR = b->permute(permutBt);

    // check whether reshape is necessary
-    if(!aPR->isSameShape(shapeAt))
-        aPR->reshapei( shapeAt);
-    if(!bPR->isSameShape(shapeBt))
-        bPR->reshapei( shapeBt);
+    if(!aPR.isSameShape(shapeAt))
+        aPR.reshapei( shapeAt);
+    if(!bPR.isSameShape(shapeBt))
+        bPR.reshapei( shapeBt);

-    NDArray* c = mmul(aPR, bPR, nullptr, 1.0, 0.0);
+    NDArray* c = mmul(&aPR, &bPR, nullptr, 1.0, 0.0);

    c->reshapei(outShape);

-    delete aPR;
-    delete bPR;
-
    return c;
 }

@ -74,21 +71,21 @@ void nd4j::MmulHelper::tensorDot(const nd4j::NDArray* a, const nd4j::NDArray* b,

    // check whether permutation is required
    if(!permutForC.empty())
-        cP = c->permute(permutForC);
+        cP = new NDArray(c->permute(permutForC));

    auto aPR = a->permute(permutAt);
    auto bPR = b->permute(permutBt);

    // check whether reshape is necessary
-    if(!aPR->isSameShape(shapeAt))
-            aPR->reshapei(shapeAt);
-    if(!bPR->isSameShape(shapeBt))
-            bPR->reshapei(shapeBt);
+    if(!aPR.isSameShape(shapeAt))
+            aPR.reshapei(shapeAt);
+    if(!bPR.isSameShape(shapeBt))
+            bPR.reshapei(shapeBt);

-    if(!cP->isSameShape({aPR->sizeAt(0), bPR->sizeAt(1)}))
-        cPR = cP->reshape(cP->ordering(), {aPR->sizeAt(0), bPR->sizeAt(1)});
+    if(!cP->isSameShape({aPR.sizeAt(0), bPR.sizeAt(1)}))
+        cPR = new NDArray(cP->reshape(cP->ordering(), {aPR.sizeAt(0), bPR.sizeAt(1)}));

-    mmul(aPR, bPR, cPR, 1.0, 0.0);
+    mmul(&aPR, &bPR, cPR, 1.0, 0.0);

    if(cPR->getBuffer() != cP->getBuffer() || cPR->getSpecialBuffer() != cP->getSpecialBuffer() )   // this means both permute and reshape have been performed on c, cP always points on c->getBuffer()
        cP->assign(cPR);
@ -97,40 +94,42 @@ void nd4j::MmulHelper::tensorDot(const nd4j::NDArray* a, const nd4j::NDArray* b,
        delete cPR;
    if(cP != c)
        delete cP;
-    delete aPR;
-    delete bPR;
 }


 #ifndef __JAVACPP_HACK__
 //////////////////////////////////////////////////////////////////////////
 void nd4j::MmulHelper::tensorDot(const NDArray* a, const NDArray* b, NDArray* c, const std::vector<std::vector<Nd4jLong>>& modifA, const std::vector<std::vector<Nd4jLong>>& modifB, const std::vector<std::vector<Nd4jLong>>& modifC) {
+
    NDArray *aPR(const_cast<NDArray*>(a)), *bPR(const_cast<NDArray*>(b));
    std::string whatToDoWithA, whatToDoWithB, whatToDoWithC;         // "" - nothing; "p" - permutation; "r" - reshaping; "pr" - permutation+reshaping; "rp" - reshaping/permutation, and so on; if another string is produced - throw exception
+
    for(const auto& arr : modifA)
        whatToDoWithA = (std::find(arr.begin(), arr.end(), 0) != arr.end()) ? whatToDoWithA + "p" : whatToDoWithA + "r";        // when 0 is present in arr then it is permutation array, otherwise - it is reshaping array
    for(const auto& arr : modifB)
        whatToDoWithB = (std::find(arr.begin(), arr.end(), 0) != arr.end()) ? whatToDoWithB + "p" : whatToDoWithB + "r";
    for(const auto& arr : modifC)
        whatToDoWithC = (std::find(arr.begin(), arr.end(), 0) != arr.end()) ? whatToDoWithC + "p" : whatToDoWithC + "r";
+
    // first step for a array
    if(!whatToDoWithA.empty())
-        aPR = (whatToDoWithA[0] == 'p') ? a->permute(modifA[0]) : a->reshape(a->ordering(), modifA[0]);
+        aPR = (whatToDoWithA[0] == 'p') ? new NDArray(a->permute(modifA[0])) : new NDArray(a->reshape(a->ordering(), modifA[0]));
    // first step for b array
    if(!whatToDoWithB.empty())
-        bPR = (whatToDoWithB[0] == 'p') ? b->permute(modifB[0]) : b->reshape(b->ordering(), modifB[0]);
+        bPR = (whatToDoWithB[0] == 'p') ? new NDArray(b->permute(modifB[0])) : new NDArray(b->reshape(b->ordering(), modifB[0]));
    // rest steps for a array
    for(int i = 1; i < whatToDoWithA.size(); ++i)
        if(whatToDoWithA[i] == 'p') aPR->permutei(modifA[i]); else aPR->reshapei(modifA[i]);
    // rest steps for b array
    for(int i = 1; i < whatToDoWithB.size(); ++i)
        if(whatToDoWithB[i] == 'p') bPR->permutei(modifB[i]); else bPR->reshapei(modifB[i]);
+
    // now work with c array
    std::vector<NDArray*> cArrs = {c};
    if(!whatToDoWithC.empty()) {
        cArrs = std::vector<NDArray*>(whatToDoWithC.size()+1, c);
        for(int i = 0; i < cArrs.size()-1; ++i)
-            cArrs[i+1] = (whatToDoWithC[i] == 'p') ? cArrs[i]->permute(modifC[i]) : cArrs[i]->reshape(c->ordering(), modifC[i]);  // since we ignore first element in cArrs (that is cArrs[0]) then it is always equal to c
+            cArrs[i+1] = (whatToDoWithC[i] == 'p') ? new NDArray(cArrs[i]->permute(modifC[i])) : new NDArray(cArrs[i]->reshape(c->ordering(), modifC[i]));  // since we ignore first element in cArrs (that is cArrs[0]) then it is always equal to c
    }

    mmul(aPR, bPR, cArrs[cArrs.size()-1], 1.0, 0.0);
@ -152,18 +151,21 @@ void nd4j::MmulHelper::tensorDot(const NDArray* a, const NDArray* b, NDArray* c,

 //////////////////////////////////////////////////////////////////////////
 NDArray* nd4j::MmulHelper::tensorDot(const nd4j::NDArray* a, const nd4j::NDArray* b, const std::vector<std::vector<Nd4jLong>>& modifA, const std::vector<std::vector<Nd4jLong>>& modifB) {
+
    NDArray *aPR(const_cast<NDArray*>(a)), *bPR(const_cast<NDArray*>(b));
    std::string whatToDoWithA, whatToDoWithB;         // "" - nothing; "p" - permutation only; "r" - reshaping only; "pr" - permutation+reshaping; "rp" - reshaping/permutation; another string - throw exception
+
    for(const auto& arr : modifA)
        whatToDoWithA = (std::find(arr.begin(), arr.end(), 0) != arr.end()) ? whatToDoWithA + "p" : whatToDoWithA + "r";        // when 0 is present in arr then it is permutation array, otherwise - it is reshaping array
    for(const auto& arr : modifB)
        whatToDoWithB = (std::find(arr.begin(), arr.end(), 0) != arr.end()) ? whatToDoWithB + "p" : whatToDoWithB + "r";
+
    // first step for a array
    if(!whatToDoWithA.empty())
-        aPR = (whatToDoWithA[0] == 'p') ? a->permute(modifA[0]) : a->reshape(a->ordering(), modifA[0]);
+        aPR = (whatToDoWithA[0] == 'p') ? new NDArray(a->permute(modifA[0])) : new NDArray(a->reshape(a->ordering(), modifA[0]));
    // first step for b array
    if(!whatToDoWithB.empty())
-        bPR = (whatToDoWithB[0] == 'p') ? b->permute(modifB[0]) : b->reshape(b->ordering(), modifB[0]);
+        bPR = (whatToDoWithB[0] == 'p') ? new NDArray(b->permute(modifB[0])) : new NDArray(b->reshape(b->ordering(), modifB[0]));
    // rest steps for a array
    for(int i = 1; i < whatToDoWithA.size(); ++i)
        if(whatToDoWithA[i] == 'p') aPR->permutei(modifA[i]); else aPR->reshapei(modifA[i]);
@ -293,17 +295,17 @@ nd4j::NDArray* MmulHelper::mmul(const nd4j::NDArray* A, const nd4j::NDArray* B,
            permut[rank-1] = rank - 2;

            if(transX)
-                xT = x->permute(permut);
+                xT = new NDArray(x->permute(permut));

            if(transY)
-                yT = y->permute(permut);
+                yT = new NDArray(y->permute(permut));
        }

        if(xRank <= 2 && yRank <= 2) {  // dot (1Dx1D), vector-matrix (1Dx2D), matrix-vector (2Dx1D), matrix-matrix (2Dx2D) product cases

            if(xRank == 1 && yRank == 2) {   // reduce vector-matrix to matrix-matrix case
-                xT = x->reshape(x->ordering(), {1, x->lengthOf()}); // please note x is not transposed in this case (since xRank=1)
-                zT = z->reshape(z->ordering(), {1, z->lengthOf()});
+                xT = new NDArray(x->reshape(x->ordering(), {1, x->lengthOf()})); // please note x is not transposed in this case (since xRank=1)
+                zT = new NDArray(z->reshape(z->ordering(), {1, z->lengthOf()}));
            }

            mmul(xT, yT, zT, 1., 0.);
--- a/libnd4j/include/helpers/impl/ShapeUtils.cpp
+++ b/libnd4j/include/helpers/impl/ShapeUtils.cpp
@ -473,19 +473,9 @@ bool ShapeUtils::evalBroadcastShapeInfo(Nd4jLong *max, Nd4jLong *min, const bool
    // FIXME: get rid of memcpy here
    memcpy(tmpShapeInfo, maxShapeInfo, shape::shapeInfoByteLength(maxRank));
    for (int i = 0; i < minRank; ++i)
-        if(maxShapeInfo[maxRank-i] < minShapeInfo[minRank-i])
+        if((maxShapeInfo[maxRank-i] != 0 && maxShapeInfo[maxRank-i] < minShapeInfo[minRank-i]) || minShapeInfo[minRank-i] == 0)
            tmpShapeInfo[maxRank - i] = minShapeInfo[minRank-i];

-    // nullify zero axis
-    for (int e = 0; e < maxRank; e++)
-        if (maxShapeInfo[e+1] == 0)
-            tmpShapeInfo[e+1] = 0;
-
-    int delta = maxRank - minRank;
-    for (int e = minRank - 1; e >= 0; e--)
-        if (minShapeInfo[e + 1] == 0)
-            tmpShapeInfo[e + 1 + delta] = 0;
-
    ShapeUtils::updateStridesAndType(tmpShapeInfo, DataTypeUtils::pickPairwiseResultType(maxShapeInfo, minShapeInfo), shape::order(maxShapeInfo));

    if (shape::isEmpty(max) || shape::isEmpty(min)) {
--- a/libnd4j/include/helpers/impl/logger.cpp
+++ b/libnd4j/include/helpers/impl/logger.cpp
@ -40,7 +40,7 @@ namespace nd4j {
 #ifdef __CUDACC__
    __host__
 #endif
-     void Logger::printv(const char *format, std::vector<int>& vec) {
+     void Logger::printv(const char *format, const std::vector<int>& vec) {
        printf("%s: {", format);
        for(int e = 0; e < vec.size(); e++) {
            auto v = vec[e];
@ -55,7 +55,7 @@ namespace nd4j {
    #ifdef __CUDACC__
    __host__
 #endif
-     void Logger::printv(const char *format, std::vector<Nd4jLong>& vec) {
+     void Logger::printv(const char *format, const std::vector<Nd4jLong>& vec) {
        printf("%s: {", format);
        for(int e = 0; e < vec.size(); e++) {
            auto v = vec[e];
--- a/libnd4j/include/helpers/logger.h
+++ b/libnd4j/include/helpers/logger.h
@ -55,8 +55,8 @@ namespace nd4j {

        static void _CUDA_H info(const char *format, ...);

-        static void _CUDA_H printv(const char *format, std::vector<int>& vec);
-        static void _CUDA_H printv(const char *format, std::vector<Nd4jLong>& vec);
+        static void _CUDA_H printv(const char *format, const std::vector<int>& vec);
+        static void _CUDA_H printv(const char *format, const std::vector<Nd4jLong>& vec);
    };

 }
--- a/libnd4j/include/helpers/shape.h
+++ b/libnd4j/include/helpers/shape.h
@ -1023,23 +1023,6 @@ namespace shape {
    */
    ND4J_EXPORT _CUDA_HD void calcSubArrShapeAndOffsets(const Nd4jLong* wholeShapeInfo, const Nd4jLong numOfSubArrs, const int dimsSize, const int* dimsToExclude, Nd4jLong* subArrShapeInfo, Nd4jLong* subArrOffsets, bool keepUnitiesInShape = false);

-    /**
-    * insert dimension at shape[axis] position
-    * 1) for example: for given rank = 3, shape = {2,4,5}, axis = 1, dimension = 10 result is -> shape = {2,10,4,5}
-    * 2) for example: for given rank = 3, shape = {2,4,5}, axis = 3, dimension = 10 result is -> shape = {2,4,5,10}
-    * so be careful and provide shape buffer with enough (at least rank+1) length
-    * axis should be within [0, rank] range
-    */
-    ND4J_EXPORT _CUDA_HD void insertDimension(const int rank, Nd4jLong *shape, const Nd4jLong axis, const Nd4jLong dimension);
-
-    /**
-    * erase dimension at shape[axis] position
-    * 1) for example: for given rank = 3, shape = {2,4,5}, axis = 1, result is -> shape = {2,5}
-    * 2) for example: for given rank = 3, shape = {2,4,5}, axis = 2, result is -> shape = {2,4}
-    * axis should be within [0, rank-1] range
-    */
-    ND4J_EXPORT _CUDA_HD void eraseDimension(const int rank, Nd4jLong *shape, const Nd4jLong axis);
-



@ -4932,21 +4915,6 @@ INLINEDEF _CUDA_HD void calcOffsets(const Nd4jLong *xShapeInfo, Nd4jLong*& xOffs
    }
 }

-//////////////////////////////////////////////////////////////////////
-INLINEDEF _CUDA_HD void insertDimension(const int rank, Nd4jLong *shape, const Nd4jLong axis, const Nd4jLong dimension) {
-
-    for (int i = rank; i > axis; --i)
-        shape[i] = shape[i - 1];
-
-    shape[axis] = dimension;
-}
-
-//////////////////////////////////////////////////////////////////////
-INLINEDEF _CUDA_HD void eraseDimension(const int rank, Nd4jLong *shape, const Nd4jLong axis) {
-
-    for (int i = axis; i < rank - 1; ++i)
-        shape[i] = shape[i + 1];
-}


 }
--- a/libnd4j/include/loops/cpu/reduce/reduce_bool.cpp
+++ b/libnd4j/include/loops/cpu/reduce/reduce_bool.cpp
@ -244,8 +244,9 @@ namespace functions {
                        auto xi = x + threadOffset;
                        auto ulen = static_cast<unsigned int>(info.getItersPerThread(threadNum));

-                        for (Nd4jLong i = 0; i < ulen; i++)
+                        for (Nd4jLong i = 0; i < ulen; i++) {
                            local = OpType::update(local, OpType::op(xi[i], extraParams), extraParams);
+                        }

                        PRAGMA_OMP_CRITICAL
                        startingVal = OpType::update(startingVal, local, extraParams);
--- a/libnd4j/include/loops/cuda/broadcasting.cu
+++ b/libnd4j/include/loops/cuda/broadcasting.cu
@ -122,7 +122,7 @@ namespace functions {

                tadLength = shape::length(tadOnlyShapeInfo);
                tadEWS = shape::elementWiseStride(tadOnlyShapeInfo);
-                numTads = shape::length(xShapeInfo) / tadLength;
+                numTads = shape::length(yShapeInfo) / tadLength;
                xEWS = shape::elementWiseStride(xShapeInfo);
                zEWS = shape::elementWiseStride(tadOnlyShapeInfoZ);
            }
--- a/libnd4j/include/loops/cuda/specials/bitonicArbitraryStep.cu
+++ b/libnd4j/include/loops/cuda/specials/bitonicArbitraryStep.cu
@ -21,12 +21,165 @@

 #include <ops/specials_cuda.h>

+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__global__ void bitonicArbitraryStepKernelValue(void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending) {
+    auto x = static_cast<X*>(vx);
+    auto y = static_cast<Y*>(vy);
+
+    int tid = threadIdx.x + blockDim.x * blockIdx.x;
+    int half = window>>1;
+
+    __shared__ Nd4jLong xLength;
+    if (threadIdx.x == 0) {
+        xLength = shape::length(xShapeInfo);
+    }
+    __syncthreads();
+
+    //for (int i = 0; i < length; i+= window)
+    /*
+        if window == 4;
+        iterations will be: 0; 4; 8; 12; 16; 20
+        if gridDim = 3;
+        on first iteration we'll have: 0; 4; 8;
+        on second iteration we'll have: 0 + (3 * 4) = 12;  4 + (3 * 4) = 16; 8 + (3 * 4) = 20
+    */
+    int firstPosition;
+    int firstStep;
+    int secondPosition;
+    int secondStep;
+
+    int WARP_SIZE = 32;
+    int numWarps = (gridDim.x * blockDim.x) / 32;
+    int warpId = tid / WARP_SIZE;
+    int warpIdx = tid % WARP_SIZE;
+
+    if (half >= 128) {
+        firstPosition = blockIdx.x * window;
+        firstStep = gridDim.x * window;
+
+        secondPosition = threadIdx.x;
+        secondStep = blockDim.x;
+    } else if (half >= 32) {
+        firstPosition = warpId * window;
+        firstStep = numWarps * window;
+
+        secondPosition = warpIdx;
+        secondStep = WARP_SIZE;
+    } else {
+        firstPosition = tid * window;
+        firstStep = blockDim.x * gridDim.x * window;
+
+        secondPosition = 0;
+        secondStep = 1;
+    }
+
+
+    for (int i = firstPosition; i < length; i += firstStep) {
+        for (int j = secondPosition; j < half; j += secondStep) {
+            int it = (reverse) ? i + j + half : i + window - j - 1;
+            int ij = i+j;
+            if (it < length && ij < length ) {
+                int posIT = shape::getIndexOffset(it, yShapeInfo, xLength);
+                int posIJ = shape::getIndexOffset(ij, yShapeInfo, xLength);
+
+                Y v0 = y[posIJ];
+                Y v1 = y[posIT];
+
+                if(!descending == (v0 > v1)) {
+                    y[posIJ] = v1;
+                    y[posIT] = v0;
+
+                    X xtemp = x[posIJ];
+                    x[posIJ] = x[posIT];
+                    x[posIT] = xtemp;
+                }
+            }
+        }
+    }
+}
+
+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__global__ void bitonicArbitraryStepKernelKey(void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending) {
+    auto x = static_cast<X*>(vx);
+    auto y = static_cast<Y*>(vy);
+
+    int tid = threadIdx.x + blockDim.x * blockIdx.x;
+    int half = window>>1;
+
+    __shared__ Nd4jLong xLength;
+    if (threadIdx.x == 0) {
+        xLength = shape::length(xShapeInfo);
+    }
+    __syncthreads();
+
+    //for (int i = 0; i < length; i+= window)
+    /*
+        if window == 4;
+        iterations will be: 0; 4; 8; 12; 16; 20
+        if gridDim = 3;
+        on first iteration we'll have: 0; 4; 8;
+        on second iteration we'll have: 0 + (3 * 4) = 12;  4 + (3 * 4) = 16; 8 + (3 * 4) = 20
+    */
+    int firstPosition;
+    int firstStep;
+    int secondPosition;
+    int secondStep;
+
+    int WARP_SIZE = 32;
+    int numWarps = (gridDim.x * blockDim.x) / 32;
+    int warpId = tid / WARP_SIZE;
+    int warpIdx = tid % WARP_SIZE;
+
+    if (half >= 128) {
+        firstPosition = blockIdx.x * window;
+        firstStep = gridDim.x * window;
+
+        secondPosition = threadIdx.x;
+        secondStep = blockDim.x;
+    } else if (half >= 32) {
+        firstPosition = warpId * window;
+        firstStep = numWarps * window;
+
+        secondPosition = warpIdx;
+        secondStep = WARP_SIZE;
+    } else {
+        firstPosition = tid * window;
+        firstStep = blockDim.x * gridDim.x * window;
+
+        secondPosition = 0;
+        secondStep = 1;
+    }
+
+
+    for (int i = firstPosition; i < length; i += firstStep) {
+        for (int j = secondPosition; j < half; j += secondStep) {
+            int it = (reverse) ? i + j + half : i + window - j - 1;
+            int ij = i+j;
+            if (it < length && ij < length ) {
+                int posIT = shape::getIndexOffset(it, xShapeInfo, xLength);
+                int posIJ = shape::getIndexOffset(ij, xShapeInfo, xLength);
+
+                X v0 = x[posIJ];
+                X v1 = x[posIT];
+
+                if(!descending == (v0 > v1)) {
+                    x[posIJ] = v1;
+                    x[posIT] = v0;
+
+                    Y ytemp = y[posIJ];
+                    y[posIJ] = y[posIT];
+                    y[posIT] = ytemp;
+                }
+            }
+        }
+    }
+}

 //////////////////////////////////////////////////////////////////////////
 template<typename T>
-__device__
-void bitonicArbitraryStepKernel(void *vx, Nd4jLong *xShapeInfo, int window, int length,  int reverse, bool descending) {
-
+__global__ void execBitonicArbitraryStepKernel(void *vx, Nd4jLong *xShapeInfo, int window, int length,  int reverse, bool descending) {
    auto x = static_cast<T*>(vx);

    int tid = threadIdx.x + blockDim.x * blockIdx.x;
@ -85,8 +238,8 @@ void bitonicArbitraryStepKernel(void *vx, Nd4jLong *xShapeInfo, int window, int
            int it = (reverse) ? i + j + half : i + window - j - 1;
            int ij = i+j;
            if (it < length && ij < length ) {
-                int posIT = getDevicePosition(xShapeInfo,it, xLength);
-                int posIJ = getDevicePosition(xShapeInfo, ij, xLength);
+                int posIT = shape::getIndexOffset(it, xShapeInfo, xLength);
+                int posIJ = shape::getIndexOffset(ij, xShapeInfo, xLength);

                shmem[threadIdx.x] = x[posIJ];
                shmem[threadIdx.x + blockDim.x] = x[posIT];
@ -100,18 +253,22 @@ void bitonicArbitraryStepKernel(void *vx, Nd4jLong *xShapeInfo, int window, int
    }
 }

-//////////////////////////////////////////////////////////////////////////
-template<typename T>
-__global__ void execBitonicArbitraryStepKernel(void *vx, Nd4jLong *xShapeInfo, int window, int length,  int reverse, bool descending) {
-
-    bitonicArbitraryStepKernel<T>(vx, xShapeInfo, window, length, reverse, descending);
-}
-
 //////////////////////////////////////////////////////////////////////////
 template<typename T>
 __host__ void bitonicArbitraryStepGeneric(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int window, int length,  int reverse, bool descending) {
-
    execBitonicArbitraryStepKernel<T><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, window, length, reverse, descending);
-    nd4j::DebugHelper::checkErrorCode(stream, "bitonicArbitrary(...) failed");
 }
+
+template <typename X, typename Y>
+__host__ void bitonicArbitraryStepGenericKey(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending) {
+    bitonicArbitraryStepKernelKey<X,Y><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, vy, yShapeInfo, window, length, reverse, descending);
+}
+
+template <typename X, typename Y>
+__host__ void bitonicArbitraryStepGenericValue(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending) {
+    bitonicArbitraryStepKernelValue<X,Y><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, vy, yShapeInfo, window, length, reverse, descending);
+}
+
 BUILD_SINGLE_TEMPLATE(template void ND4J_EXPORT bitonicArbitraryStepGeneric, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int window, int length,  int reverse, bool descending), LIBND4J_TYPES);
+BUILD_DOUBLE_TEMPLATE(template void ND4J_EXPORT bitonicArbitraryStepGenericKey, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending), LIBND4J_TYPES, LIBND4J_TYPES);
+BUILD_DOUBLE_TEMPLATE(template void ND4J_EXPORT bitonicArbitraryStepGenericValue, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int window, int length,  int reverse, bool descending), LIBND4J_TYPES, LIBND4J_TYPES);
--- a/libnd4j/include/loops/cuda/specials/bitonicSortStep.cu
+++ b/libnd4j/include/loops/cuda/specials/bitonicSortStep.cu
@ -21,9 +21,119 @@

 #include <ops/specials_cuda.h>

+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__global__ void bitonicSortStepKernelValue(void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending) {
+
+    auto x = static_cast<X*>(vx);
+    auto y = static_cast<Y*>(vy);
+
+    unsigned int i, ixj; /* Sorting partners: i and ixj */
+    i = threadIdx.x + blockDim.x * blockIdx.x;
+
+    __shared__ Nd4jLong xLength;
+    if (threadIdx.x == 0)
+        xLength = shape::length(xShapeInfo);
+
+    __syncthreads();
+
+
+    if (i >= length)
+        return;
+
+    ixj = i^j;
+
+    /* The threads with the lowest ids sort the array. */
+    if ((ixj)>i) {
+        int posI = shape::getIndexOffset(i, yShapeInfo, xLength);
+        int posIXJ = shape::getIndexOffset(ixj, yShapeInfo, xLength);
+
+        if ((i&k)==0) {
+            /* Sort ascending */
+            if (!descending == (y[posI]>y[posIXJ])) {
+                /* exchange(i,ixj); */
+                X temp = x[posI];
+                x[posI] = x[posIXJ];
+                x[posIXJ] = temp;
+
+                Y ytemp = y[posI];
+                y[posI] = y[posIXJ];
+                y[posIXJ] = ytemp;
+            }
+        } else if ((i&k)!=0) {
+            /* Sort descending */
+            if (!descending == (y[posI]<y[posIXJ])) {
+                /* exchange(i,ixj); */
+                X temp = x[posI];
+                x[posI] = x[posIXJ];
+                x[posIXJ] = temp;
+
+                Y ytemp = y[posI];
+                y[posI] = y[posIXJ];
+                y[posIXJ] = ytemp;
+            }
+        }
+    }
+}
+
+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__global__ void bitonicSortStepKernelKey(void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending) {
+
+    auto x = static_cast<X*>(vx);
+    auto y = static_cast<Y*>(vy);
+
+    unsigned int i, ixj; /* Sorting partners: i and ixj */
+    i = threadIdx.x + blockDim.x * blockIdx.x;
+
+    __shared__ Nd4jLong xLength;
+    if (threadIdx.x == 0)
+        xLength = shape::length(xShapeInfo);
+
+    __syncthreads();
+
+
+    if (i >= length)
+        return;
+
+    ixj = i^j;
+
+    /* The threads with the lowest ids sort the array. */
+    if ((ixj)>i) {
+        int posI = shape::getIndexOffset(i, xShapeInfo, xLength);
+        int posIXJ = shape::getIndexOffset(ixj, xShapeInfo, xLength);
+
+        if ((i&k)==0) {
+            /* Sort ascending */
+            if (!descending == (x[posI]>x[posIXJ])) {
+                /* exchange(i,ixj); */
+                X temp = x[posI];
+                x[posI] = x[posIXJ];
+                x[posIXJ] = temp;
+
+                Y ytemp = y[posI];
+                y[posI] = y[posIXJ];
+                y[posIXJ] = ytemp;
+            }
+        } else if ((i&k)!=0) {
+            /* Sort descending */
+            if (!descending == (x[posI]<x[posIXJ])) {
+                /* exchange(i,ixj); */
+                X temp = x[posI];
+                x[posI] = x[posIXJ];
+                x[posIXJ] = temp;
+
+                Y ytemp = y[posI];
+                y[posI] = y[posIXJ];
+                y[posIXJ] = ytemp;
+            }
+        }
+    }
+}
+
 //////////////////////////////////////////////////////////////////////////
 template<typename T>
-__device__ void bitonicSortStepKernel(void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending) {
+__global__ void bitonicSortStepKernel(void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending) {

    auto x = static_cast<T*>(vx);

@ -44,8 +154,8 @@ __device__ void bitonicSortStepKernel(void *vx, Nd4jLong *xShapeInfo, int j, int

    /* The threads with the lowest ids sort the array. */
    if ((ixj)>i) {
-        int posI = getDevicePosition(xShapeInfo, i, xLength);
-        int posIXJ = getDevicePosition(xShapeInfo, ixj, xLength);
+        int posI = shape::getIndexOffset(i, xShapeInfo, xLength);
+        int posIXJ = shape::getIndexOffset(ixj, xShapeInfo, xLength);

        if ((i&k)==0) {
            /* Sort ascending */
@ -69,16 +179,23 @@ __device__ void bitonicSortStepKernel(void *vx, Nd4jLong *xShapeInfo, int j, int

 //////////////////////////////////////////////////////////////////////////
 template<typename T>
-__global__ void execBitonicSortStepKernel(void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending) {
-
-    bitonicSortStepKernel<T>(vx, xShapeInfo, j, k, length, descending);
+__host__ void bitonicSortStepGeneric(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending) {
+    bitonicSortStepKernel<T><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, j, k, length, descending);
 }

 //////////////////////////////////////////////////////////////////////////
-template<typename T>
-__host__ void bitonicSortStepGeneric(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending) {
-
-    execBitonicSortStepKernel<T><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, j, k, length, descending);
-    nd4j::DebugHelper::checkErrorCode(stream, "bitonicSortStep(...) failed");
+template <typename X, typename Y>
+__host__ void bitonicSortStepGenericKey(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending) {
+    bitonicSortStepKernelKey<X,Y><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, vy, yShapeInfo, j, k, length, descending);
 }
+
+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__host__ void bitonicSortStepGenericValue(dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending) {
+    bitonicSortStepKernelValue<X,Y><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, vy, yShapeInfo, j, k, length, descending);
+}
+
+
 BUILD_SINGLE_TEMPLATE(template void ND4J_EXPORT bitonicSortStepGeneric, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int j, int k, int length, bool descending), LIBND4J_TYPES);
+BUILD_DOUBLE_TEMPLATE(template void ND4J_EXPORT bitonicSortStepGenericKey, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending), LIBND4J_TYPES, LIBND4J_TYPES);
+BUILD_DOUBLE_TEMPLATE(template void ND4J_EXPORT bitonicSortStepGenericValue, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int j, int k, int length, bool descending), LIBND4J_TYPES, LIBND4J_TYPES);
--- a/libnd4j/include/loops/cuda/specials/oesTad.cu
+++ b/libnd4j/include/loops/cuda/specials/oesTad.cu
@ -16,18 +16,89 @@

 //
 // @author raver119@gmail.com
-// @author Yurii Shyrma, created on 28.11.2018
 //

 #include <ops/specials_cuda.h>

+//////////////////////////////////////////////////////////////////////////
+template <typename X, typename Y>
+__global__ void execOesTadKernelKey(void *vx, Nd4jLong *xShapeInfo,
+                                    void *vy, Nd4jLong *yShapeInfo,
+                                 int *dimension, int dimensionLength,
+                                 Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets,
+                                 bool descending) {
+
+    auto x = static_cast<X*>(vx);
+    auto y = static_cast<Y*>(vy);
+
+    __shared__ int xLength;
+    __shared__ int xTadLength;
+    __shared__ int numTads;
+    if (threadIdx.x == 0) {
+        xLength = shape::length(xShapeInfo);
+        xTadLength = shape::length(tadShapeInfo);
+        numTads = xLength / xTadLength;
+    }
+    __syncthreads();
+
+    for (int r = blockIdx.x; r < numTads; r += gridDim.x) {
+        auto dx = x + tadOffsets[r];
+        auto dy = y + tadOffsets[r];
+
+        // this is general loop, we go uncached
+        int iterations = xTadLength;
+
+        for (int i = 0; i < iterations; i++) {
+
+            if (i % 2 == 0) {
+                for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
+                    auto top = 2 * tid + 1;
+                    if (top < xTadLength) {
+                        auto t0 = shape::getIndexOffset(top - 1, tadShapeInfo, xTadLength);
+                        auto t1 = shape::getIndexOffset(top, tadShapeInfo, xTadLength);
+
+                        if (!descending == (dx[t0] > dx[t1])) {
+                            X dt0 = dx[t0];
+                            dx[t0] = dx[t1];
+                            dx[t1] = dt0;
+
+                            Y dy0 = dy[t0];
+                            dy[t0] = dy[t1];
+                            dy[t1] = dy0;
+                        }
+                    }
+                }
+            } else {
+                for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
+                    auto top = 2 * tid + 2;
+                    if (top < xTadLength) {
+                        auto t0 = shape::getIndexOffset(top - 1, tadShapeInfo, xTadLength);
+                        auto t1 = shape::getIndexOffset(top, tadShapeInfo, xTadLength);
+
+                        if (!descending == (dx[t0] > dx[t1])) {
+                            X dt0 = dx[t0];
+                            dx[t0] = dx[t1];
+                            dx[t1] = dt0;
+
+                            Y dy0 = dy[t0];
+                            dy[t0] = dy[t1];
+                            dy[t1] = dy0;
+                        }
+                    }
+                }
+            }
+            __syncthreads();
+        }
+    }
+}
+
+
 //////////////////////////////////////////////////////////////////////////
 template<typename T>
-__device__
-void oesTadKernel(void *vx, Nd4jLong *xShapeInfo, 
-                int *dimension, int dimensionLength, 
-                Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets, 
-                bool descending) {
+__global__ void execOesTadKernel(void *vx, Nd4jLong *xShapeInfo,
+                                int *dimension, int dimensionLength,
+                                Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets,
+                                bool descending) {

    auto x = static_cast<T*>(vx);
    const int sharedSize = 32768;
@ -56,7 +127,7 @@ void oesTadKernel(void *vx, Nd4jLong *xShapeInfo,
        int iterations = xTadLength;
        if (cached) {
            for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
-                auto t0 = getDevicePosition(tadShapeInfo, tid, xTadLength);
+                auto t0 = shape::getIndexOffset(tid, tadShapeInfo, xTadLength);
                shmem[tid] = dx[t0];
            }

@ -70,8 +141,8 @@ void oesTadKernel(void *vx, Nd4jLong *xShapeInfo,
                for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
                    auto top = 2 * tid + 1;
                    if (top < xTadLength) {
-                        auto t0 = cached ? top - 1 : getDevicePosition(tadShapeInfo, top - 1, xTadLength);
-                        auto t1 = cached ? top : getDevicePosition(tadShapeInfo, top, xTadLength);
+                        auto t0 = cached ? top - 1 : shape::getIndexOffset(top - 1, tadShapeInfo, xTadLength);
+                        auto t1 = cached ? top : shape::getIndexOffset(top, tadShapeInfo, xTadLength);

                        if (!descending == (dx[t0] > dx[t1])) {
                            T dt0 = dx[t0];
@ -84,8 +155,8 @@ void oesTadKernel(void *vx, Nd4jLong *xShapeInfo,
                for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
                    auto top = 2 * tid + 2;
                    if (top < xTadLength) {
-                        auto t0 = cached ? top - 1 : getDevicePosition(tadShapeInfo, top - 1, xTadLength);
-                        auto t1 = cached ? top : getDevicePosition(tadShapeInfo, top, xTadLength);
+                        auto t0 = cached ? top - 1 : shape::getIndexOffset(top - 1, tadShapeInfo, xTadLength);
+                        auto t1 = cached ? top : shape::getIndexOffset(top, tadShapeInfo, xTadLength);

                        if (!descending == (dx[t0] > dx[t1])) {
                            T dt0 = dx[t0];
@ -102,23 +173,13 @@ void oesTadKernel(void *vx, Nd4jLong *xShapeInfo,
        if (cached) {
            dx = x + tadOffsets[r];
            for (int tid = threadIdx.x; tid < xTadLength; tid += blockDim.x) {
-                auto t0 = getDevicePosition(tadShapeInfo, tid, xTadLength);
+                auto t0 = shape::getIndexOffset(tid, tadShapeInfo, xTadLength);
                dx[t0] = shmem[tid];
            }
        }
    }
 }

-//////////////////////////////////////////////////////////////////////////
-template<typename T>
-__global__ void execOesTadKernel(void *vx, Nd4jLong *xShapeInfo, 
-                                int *dimension, int dimensionLength, 
-                                Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets, 
-                                bool descending) {
-
-    oesTadKernel<T>(vx, xShapeInfo, dimension, dimensionLength, tadShapeInfo, tadOffsets, descending);
-}
-
 //////////////////////////////////////////////////////////////////////////
 template<typename T>
 __host__ void oesTadGeneric(dim3 &launchDims, cudaStream_t *stream,
@ -128,6 +189,18 @@ __host__ void oesTadGeneric(dim3 &launchDims, cudaStream_t *stream,
                                bool descending) {

    execOesTadKernel<T><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, dimension, dimensionLength, tadShapeInfo, tadOffsets, descending);
-    nd4j::DebugHelper::checkErrorCode(stream, "oesTad(...) failed");
 }
+
+template <typename X, typename Y>
+__host__ void oesTadGenericKey(dim3 &launchDims, cudaStream_t *stream,
+                            void *vx, Nd4jLong *xShapeInfo,
+                            void *vy, Nd4jLong *yShapeInfo,
+                            int *dimension, int dimensionLength,
+                            Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets,
+                            bool descending) {
+
+    execOesTadKernelKey<X,Y><<<launchDims.x, launchDims.y, launchDims.z, *stream>>>(vx, xShapeInfo, vy, yShapeInfo, dimension, dimensionLength, tadShapeInfo, tadOffsets, descending);
+}
+
 BUILD_SINGLE_TEMPLATE(template void ND4J_EXPORT oesTadGeneric, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, int *dimension, int dimensionLength, Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets, bool descending), LIBND4J_TYPES);
+BUILD_DOUBLE_TEMPLATE(template void ND4J_EXPORT oesTadGenericKey, (dim3 &launchDims, cudaStream_t *stream, void *vx, Nd4jLong *xShapeInfo, void *vy, Nd4jLong *yShapeInfo, int *dimension, int dimensionLength, Nd4jLong *tadShapeInfo, Nd4jLong *tadOffsets, bool descending), LIBND4J_TYPES, LIBND4J_TYPES);
--- a/libnd4j/include/ops/declarable/generic/activations/prelu.cpp
+++ b/libnd4j/include/ops/declarable/generic/activations/prelu.cpp
@ -65,13 +65,7 @@ CONFIGURABLE_OP_IMPL(prelu, 2, 1, true, 0, 0) {
    REQUIRE_TRUE(product == alphaLen, 0, "PRELU OP: wrong shape of alpha array, expected is %s, but got %s instead !", ShapeUtils::shapeAsString(expectedAlphaShape).c_str(), ShapeUtils::shapeAsString(alphaShape).c_str());
    // ***** end of validation ***** //

-    if(alphaShape != expectedAlphaShape)
-        alpha = alpha->reshape(alpha->ordering(), expectedAlphaShape);
-
-    helpers::prelu(block.launchContext(), *input, *alpha, *output);
-
-    if(alphaShape != expectedAlphaShape)
-        delete alpha;
+    helpers::prelu(block.launchContext(), *input,  alphaShape != expectedAlphaShape ? alpha->reshape(alpha->ordering(), expectedAlphaShape) : *alpha, *output);

    return Status::OK();
 }
@ -128,9 +122,10 @@ CONFIGURABLE_OP_IMPL(prelu_bp, 3, 2, true, 0, 0) {
    REQUIRE_TRUE(product == alphaLen, 0, "PRELU_BP OP: wrong shape of alpha array, expected is %s, but got %s instead !", ShapeUtils::shapeAsString(expectedAlphaShape).c_str(), ShapeUtils::shapeAsString(alphaShape).c_str());
    // ***** end of validation ***** //

+
    if(alphaShape != expectedAlphaShape) {
-        alpha = alpha->reshape(alpha->ordering(), expectedAlphaShape);
-        dLdA  = dLdA->reshape(dLdA->ordering(), expectedAlphaShape);
+        alpha = new NDArray(alpha->reshape(alpha->ordering(), expectedAlphaShape));
+        dLdA  = new NDArray(dLdA->reshape(dLdA->ordering(), expectedAlphaShape));
    }

    helpers::preluBP(block.launchContext(), *input, *alpha, *dLdO, *dLdI, *dLdA);
--- a/libnd4j/include/ops/declarable/generic/boolean/lt_scalar.cpp
+++ b/libnd4j/include/ops/declarable/generic/boolean/lt_scalar.cpp
@ -29,7 +29,6 @@ namespace nd4j {
            auto x = INPUT_VARIABLE(0);
            auto y = INPUT_VARIABLE(1);

-            nd4j_printf("Comparing [%f] to [%f]\n", x->e<float>(0), y->e<float>(0));
            if (x->e<float>(0) < y->e<float>(0))
                return ND4J_STATUS_TRUE;
            else
--- a/libnd4j/include/ops/declarable/generic/boolean/where.cpp
+++ b/libnd4j/include/ops/declarable/generic/boolean/where.cpp
@ -31,7 +31,7 @@ namespace nd4j {
            auto condition = INPUT_VARIABLE(0);
            auto z = OUTPUT_VARIABLE(0);
            if (z->isEmpty())
-                return ND4J_STATUS_OK;
+                return Status::OK();

            if (block.width() == 3) {
                auto x = INPUT_VARIABLE(1);
@ -44,12 +44,10 @@ namespace nd4j {
                    // FIXME: for perf it might be better to issue memcpy here, and fill only mismatched values from either X or Y
                    for (int e = 0; e < condition->lengthOf(); e++) {
                        if (y->isR()) {
-                            auto r = !condition->e<bool>(e) ? y->e<double>(e)
-                                                                           : x->e<double>(e);
+                            auto r = !condition->e<bool>(e) ? y->e<double>(e) : x->e<double>(e);
                            z->p(e, r);
                        } else {
-                            auto r = !condition->e<bool>(e) ? y->e<Nd4jLong>(e)
-                                                                           : x->e<Nd4jLong>(e);
+                            auto r = !condition->e<bool>(e) ? y->e<Nd4jLong>(e) : x->e<Nd4jLong>(e);
                            z->p(e, r);
                        }
                    }
@ -86,7 +84,7 @@ namespace nd4j {

                helpers::_where(block.launchContext(), *condition, *output, block.workspace());
            }
-            return ND4J_STATUS_OK;
+            return Status::OK();
        }

        DECLARE_SHAPE_FN(Where) {
--- a/libnd4j/include/ops/declarable/generic/boolean/where_np.cpp
+++ b/libnd4j/include/ops/declarable/generic/boolean/where_np.cpp
@ -120,7 +120,7 @@ namespace nd4j {
                }
            }

-            return ND4J_STATUS_OK;
+            return Status::OK();
        }

        DECLARE_SHAPE_FN(where_np) {
--- a/libnd4j/include/ops/declarable/generic/convo/conv1d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/conv1d.cpp
@ -81,11 +81,7 @@ CUSTOM_OP_IMPL(conv1d, 2, 1, false, 0, 4) {
    auto outputReshaped  = output ->reshape(output->ordering(),  reshapeForOutput);
    auto weightsReshaped = weights->reshape(weights->ordering(), {1, weights->sizeAt(0), weights->sizeAt(1), weights->sizeAt(2)});   // [kW, iC, oC] -> [1, kW, iC, oC]

-    ConvolutionUtils::conv2d(block, inputReshaped, weightsReshaped, bias, outputReshaped, 1,kW,  1,sW,  0,pW,  1,1,  isSameMode,  isNCW);
-
-    delete inputReshaped;
-    delete outputReshaped;
-    delete weightsReshaped;
+    ConvolutionUtils::conv2d(block, &inputReshaped, &weightsReshaped, bias, &outputReshaped, 1,kW,  1,sW,  0,pW,  1,1,  isSameMode,  isNCW);

    return Status::OK();
 }
@ -217,13 +213,7 @@ CUSTOM_OP_IMPL(conv1d_bp, 3, 2, false, 0, 4) {
    auto weightsReshaped = weights->reshape(weights->ordering(),{1, weights->sizeAt(0), weights->sizeAt(1), weights->sizeAt(2)});    // [kW, iC, oC] -> [1, kW, iC, oC]
    auto gradWReshaped   = gradW  ->reshape(gradW->ordering(),  {1, weights->sizeAt(0), weights->sizeAt(1), weights->sizeAt(2)});    // [kW, iC, oC] -> [1, kW, iC, oC]

-    ConvolutionUtils::conv2dBP(block, inputReshaped, weightsReshaped, bias, gradOReshaped, gradIReshaped, gradWReshaped, gradB, 1,kW,  1,sW,  0,pW,  1,1,  isSameMode,  isNCW);
-
-    delete inputReshaped;
-    delete gradIReshaped;
-    delete gradOReshaped;
-    delete weightsReshaped;
-    delete gradWReshaped;
+    ConvolutionUtils::conv2dBP(block, &inputReshaped, &weightsReshaped, bias, &gradOReshaped, &gradIReshaped, &gradWReshaped, gradB, 1,kW,  1,sW,  0,pW,  1,1,  isSameMode,  isNCW);

    return Status::OK();
 }
--- a/libnd4j/include/ops/declarable/generic/convo/conv3d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/conv3d.cpp
@ -151,10 +151,10 @@ CUSTOM_OP_IMPL(conv3dnew, 2, 1, false, 0, 13) {

    std::vector<int> permutForOutput;

-    if(!isNCDHW)
-        input = input->permute({0,4,1,2,3});                                    // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
-    else
+    if (isNCDHW)
        permutForOutput    = {0,2,3,4,1};                                        // [bS, oC, oD, oH, oW] -> [bS, oD, oH, oW, oC]
+    else
+        input = new NDArray(input->permute({0,4,1,2,3}));

    NDArray columns(input->ordering(), {bS, iC, kD, kH, kW, oD, oH, oW}, input->dataType(), block.launchContext());
    ConvolutionUtils::vol2col(block, *input, columns, sD, sH, sW, pD, pH, pW, dD, dH, dW);                 // [bS, iC, iD, iH, iW] is convoluted to [bS, iC, kD, kH, kW, oD, oH, oW]
@ -164,7 +164,7 @@ CUSTOM_OP_IMPL(conv3dnew, 2, 1, false, 0, 13) {
    if(bias)
        output->applyBroadcast(broadcast::Add, {indIOioC}, bias);

-    if(!isNCDHW)
+     if(!isNCDHW)
        delete input;

    return Status::OK();
@ -447,21 +447,23 @@ CUSTOM_OP_IMPL(conv3dnew_bp, 3, 2, false, 0, 13) {
    std::vector<int> gradOaxesForDot;

    if(!isNDHWC) {
-        input = input->permute({0,4,1,2,3});                                    // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
-        gradI = gradI->permute({0,4,1,2,3});                                    // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
        gradOaxesForDot  = {0,1,2,3};                                           // bS, oD, oH, oW
+        input = new NDArray(input->permute({0,4,1,2,3}));                       // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        gradI = new NDArray(gradI->permute({0,4,1,2,3}));                       // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
    }
-    else
+    else {
        gradOaxesForDot  = {0,2,3,4};                                           // bS, oD, oH, oW
+    }

    // ----- calculation of gradW and gradB ----- //
    NDArray columns(input->ordering(), {bS, iC, kD, kH, kW, oD, oH, oW}, input->dataType(), block.launchContext());
    ConvolutionUtils::vol2col(block, *input, columns, sD, sH, sW, pD, pH, pW, dD, dH, dW);                   // [bS, iC, iD, iH, iW] is convoluted to [bS, iC, kD, kH, kW, oD, oH, oW]
    MmulHelper::tensorDot(&columns, gradO, gradW, {0,5,6,7}, gradOaxesForDot, {3,0,1,2,4});     // [bS, iC, kD, kH, kW, oD, oH, oW] x [bS, oD, oH, oW, oC]/[bS, oC, oD, oH, oW] = [iC, kD, kH, kW, oC]

+    //----- calculation of gradB -----//
    if(gradB) {
        if(gradB->rankOf() == 2)
-            gradB = gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()});
+            gradB = new NDArray(gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()}));
        gradO->reduceAlongDimension(reduce::Sum, gradB, gradOaxesForDot);                          // sum over bS oD oH oW
        if(gradB != OUTPUT_VARIABLE(2))
            delete gradB;
--- a/libnd4j/include/ops/declarable/generic/convo/deconv2d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/deconv2d.cpp
@ -64,7 +64,7 @@ CUSTOM_OP_IMPL(deconv2d, 2, 1, false, 0, 9) {
        REQUIRE_TRUE(bias->rankOf() <= 2 && oC == bias->lengthOf(), 0, "CUSTOM DECONV2D OP: wrong shape of array with biases, expected rank, length: <=2, %i, but got %i, %i instead !", oC, bias->rankOf(), bias->lengthOf());

    if(!isNCHW)
-        output  = output->permute({0, 3, 1, 2});                                // [bS, oH, oW, oC] -> [bS, oC, oH, oW]
+        output = new NDArray(output->permute({0, 3, 1, 2}));       // [bS, oH, oW, oC] -> [bS, oC, oH, oW]

    if(isSameMode)                       // SAME
        ConvolutionUtils::calcPadding2D(pH, pW, oH, oW, iH, iW, kH, kW, sH, sW, dH, dW);
@ -82,7 +82,7 @@ CUSTOM_OP_IMPL(deconv2d, 2, 1, false, 0, 9) {
    if(bias)
        output->applyBroadcast(broadcast::Add, {1}, bias);

-    if(!isNCHW)
+     if(!isNCHW)
        delete output;

    return Status::OK();
@ -211,8 +211,9 @@ CUSTOM_OP_IMPL(deconv2d_bp, 3, 2, false, 0, 9) {

    // -----prepare permutation arrays and axes for dot product ----- //
    std::vector<int> inputAxesForDot;
+
    if(!isNCHW) {
-        gradO = gradO->permute({0, 3, 1, 2});                                   // [bS, oH, oW, oC] -> [bS, oC, oH, oW]
+        gradO = new NDArray(gradO->permute({0, 3, 1, 2}));                      // [bS, oH, oW, oC] -> [bS, oC, oH, oW]
        inputAxesForDot = {0, 1, 2};                                            // bS, iH, iW
    }
    else
@ -228,7 +229,7 @@ CUSTOM_OP_IMPL(deconv2d_bp, 3, 2, false, 0, 9) {
    // ----- calculation of gradB ----- //
    if(gradB) {
        if(gradB->rankOf() == 2)
-            gradB = gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()});
+            gradB = new NDArray(gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()}));
        gradO->reduceAlongDimension(reduce::Sum, gradB, {0, 2, 3});                                // sum over bS, oH, oW
        if(gradB != OUTPUT_VARIABLE(2))
            delete gradB;
@ -237,7 +238,7 @@ CUSTOM_OP_IMPL(deconv2d_bp, 3, 2, false, 0, 9) {
    if(!isNCHW)
        delete gradO;

-    return ND4J_STATUS_OK;
+    return Status::OK();
 }

 DECLARE_SHAPE_FN(deconv2d_bp) {
--- a/libnd4j/include/ops/declarable/generic/convo/deconv3d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/deconv3d.cpp
@ -39,20 +39,20 @@ CUSTOM_OP_IMPL(deconv3d, 2, 1, false, 0, 13) {
    REQUIRE_TRUE(input->rankOf()   == 5, 0, "CUSTOM DECONV3D OP: rank of input array must be equal to 5, but got %i instead !", input->rankOf());
    REQUIRE_TRUE(weights->rankOf() == 5, 0, "CUSTOM DECONV3D OP: rank of weights array must be equal to 5, but got %i instead !", weights->rankOf());

-    int kD = INT_ARG(0) > 0 ? INT_ARG(0) : static_cast<int>(weights->sizeAt(0));// filter(kernel) depth
-    int kH = INT_ARG(1) > 0 ? INT_ARG(1) : static_cast<int>(weights->sizeAt(1));// filter(kernel) height
-    int kW = INT_ARG(2) > 0 ? INT_ARG(2) : static_cast<int>(weights->sizeAt(2));// filter(kernel) width
-    int sD = INT_ARG(3);                                                        // strides depth
-    int sH = INT_ARG(4);                                                        // strides height
-    int sW = INT_ARG(5);                                                        // strides width
-    int pD = INT_ARG(6);                                                        // paddings depth
-    int pH = INT_ARG(7);                                                        // paddings height
-    int pW = INT_ARG(8);                                                        // paddings width
-    int dD = INT_ARG(9);                                                        // dilations depth
-    int dH = INT_ARG(10);                                                       // dilations height
-    int dW = INT_ARG(11);                                                       // dilations width
-    int isSameMode = INT_ARG(12);                                               // 0-SAME,  1-VALID
-    int isNCDHW  = block.getIArguments()->size() > 13 ? !INT_ARG(13) : 1;       // INT_ARG(13): 1-NDHWC, 0-NCDHW
+    int kD = INT_ARG(0) > 0 ? INT_ARG(0) : static_cast<int>(weights->sizeAt(0));    // filter(kernel) depth
+    int kH = INT_ARG(1) > 0 ? INT_ARG(1) : static_cast<int>(weights->sizeAt(1));    // filter(kernel) height
+    int kW = INT_ARG(2) > 0 ? INT_ARG(2) : static_cast<int>(weights->sizeAt(2));    // filter(kernel) width
+    int sD = INT_ARG(3);                                                            // strides depth
+    int sH = INT_ARG(4);                                                            // strides height
+    int sW = INT_ARG(5);                                                            // strides width
+    int pD = INT_ARG(6);                                                            // paddings depth
+    int pH = INT_ARG(7);                                                            // paddings height
+    int pW = INT_ARG(8);                                                            // paddings width
+    int dD = INT_ARG(9);                                                            // dilations depth
+    int dH = INT_ARG(10);                                                           // dilations height
+    int dW = INT_ARG(11);                                                           // dilations width
+    int isSameMode = INT_ARG(12);                                                   // 1-SAME,  0-VALID
+    int isNCDHW  = block.getIArguments()->size() > 13 ? !INT_ARG(13) : 1;           // INT_ARG(13): 1-NDHWC, 0-NCDHW

    int bS, iC, iD, iH, iW, oC, oD, oH, oW;                     // batch size, input channels, input depth/height/width, output channels, output depth/height/width;
    int indIOioC, indIOioD, indWoC, indWiC, indWkD;             // corresponding indexes
@ -64,7 +64,7 @@ CUSTOM_OP_IMPL(deconv3d, 2, 1, false, 0, 13) {
        REQUIRE_TRUE(bias->rankOf() <= 2 && oC == bias->lengthOf(), 0, "CUSTOM DECONV3D OP: wrong shape of array with biases, expected rank, length: <=2, %i, but got %i, %i instead !", oC, bias->rankOf(), bias->lengthOf());

    if(!isNCDHW)
-        output  = output->permute({0, 4, 1, 2, 3});                             // [bS, oD, oH, oW, oC] -> [bS, oC, oD, oH, oW] 
+        output = new NDArray(output->permute({0, 4, 1, 2, 3}));                 // [bS, oD, oH, oW, oC] -> [bS, oC, oD, oH, oW]

    if(isSameMode)                       // SAME
        ConvolutionUtils::calcPadding3D(pD, pH, pW, oD, oH, oW, iD, iH, iW, kD, kH, kW, sD, sH, sW, dD, dH, dW);
@ -225,8 +225,9 @@ CUSTOM_OP_IMPL(deconv3d_bp, 3, 2, false, 0, 13) {

    // -----prepare permutation arrays and axes for dot product ----- //
    std::vector<int> inputAxesForDot;
+
    if(!isNCDHW) {
-        gradO = gradO->permute({0, 4, 1, 2, 3});                                // [bS, oD, oH, oW, oC] -> [bS, oC, oD, oH, oW]
+        gradO = new NDArray(gradO->permute({0, 4, 1, 2, 3}));                   // [bS, oD, oH, oW, oC] -> [bS, oC, oD, oH, oW]
        inputAxesForDot = {0, 1, 2, 3};                                         // bS, iD, iH, iW
    }
    else
@ -240,7 +241,7 @@ CUSTOM_OP_IMPL(deconv3d_bp, 3, 2, false, 0, 13) {
    // ----- calculation of gradB ----- //
    if(gradB) {
        if(gradB->rankOf() == 2)
-            gradB = gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()});
+            gradB = new NDArray(gradB->reshape(gradB->ordering(), {(int)gradB->lengthOf()}));
        gradO->reduceAlongDimension(reduce::Sum, gradB, {0, 2, 3, 4});                                // sum over bS, oD, oH, oW
        if(gradB != OUTPUT_VARIABLE(2))
            delete gradB;
--- a/libnd4j/include/ops/declarable/generic/convo/dilation2d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/dilation2d.cpp
@ -71,7 +71,7 @@ namespace ops {
        int pad_top = 0, pad_left = 0;
        int out_rows = 0, out_cols = 0;

-        helpers::_dilation_hw(block.launchContext(), input->shapeInfo(), weights->shapeInfo(), strides, rates, isSameShape, &stride_rows, &stride_cols, &rate_rows, &rate_cols, &pad_top, &pad_left, &out_rows, &out_cols);
+        helpers::dilation_hw(block.launchContext(), input->shapeInfo(), weights->shapeInfo(), strides, rates, isSameShape, &stride_rows, &stride_cols, &rate_rows, &rate_cols, &pad_top, &pad_left, &out_rows, &out_cols);


        REQUIRE_TRUE(out_rows > 0 && out_cols > 0, 0, "Dilation2D: outY and outX should have positive values, but got [%i, %i] instead", out_rows, out_cols);
@ -126,7 +126,7 @@ namespace ops {
        int pad_top = 0, pad_left = 0;
        int out_rows = 0, out_cols = 0;

-        helpers::_dilation_hw(block.launchContext(), input, weights, strides, rates, isSameShape, &stride_rows, &stride_cols, &rate_rows, &rate_cols, &pad_top, &pad_left, &out_rows, &out_cols);
+        helpers::dilation_hw(block.launchContext(), input, weights, strides, rates, isSameShape, &stride_rows, &stride_cols, &rate_rows, &rate_cols, &pad_top, &pad_left, &out_rows, &out_cols);

        std::array<Nd4jLong, 4> shape = {{batch_size, out_rows, out_cols, depth}};
        newShape = ConstantShapeHelper::getInstance()->createShapeInfo(ArrayOptions::dataType(weights), 'c', 4, shape.data());
--- a/libnd4j/include/ops/declarable/generic/convo/pooling/avgpool2d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/pooling/avgpool2d.cpp
@ -59,9 +59,9 @@ CUSTOM_OP_IMPL(avgpool2d, 1, 1, false, 0, 10) {
    const int iH = static_cast<int>(isNCHW ? input->sizeAt(2) : input->sizeAt(1));
    const int iW = static_cast<int>(isNCHW ? input->sizeAt(3) : input->sizeAt(2));

-    if (!isNCHW) {
-        input  = input->permute({0, 3, 1, 2});                // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
-        output = output->permute({0, 3, 1, 2});               // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
+    if(!isNCHW) {
+        input  = new NDArray(input->permute({0, 3, 1, 2}));                // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        output = new NDArray(output->permute({0, 3, 1, 2}));               // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
    }

    ConvolutionUtils::calcOutSizePool2D(oH, oW, kH, kW, sH, sW, pH, pW, dH, dW, iH, iW, isSameMode);
@ -71,9 +71,8 @@ CUSTOM_OP_IMPL(avgpool2d, 1, 1, false, 0, 10) {

    // 0,1 - kernel Height/Width; 2,3 - stride Height/Width; 4,5 - pad Height/Width; 6,7 - dilation Height/Width; 8 - poolingMode; 9 - divisor;
    ConvolutionUtils::pooling2d(block, *input, *output, kH, kW, sH, sW, pH, pW, dH, dW, PoolingType::AVG_POOL, extraParam0);
-    //output->printBuffer("output op");

-    if (!isNCHW) {
+    if(!isNCHW) {
        delete input;
        delete output;
    }
@ -177,10 +176,11 @@ CUSTOM_OP_IMPL(avgpool2d_bp, 2, 1, false, 0, 10) {
    REQUIRE_TRUE(expectedGradOShape == ShapeUtils::shapeAsString(gradO), 0, "AVGPOOL2D_BP op: wrong shape of output's gradients array (next epsilon), expected is %s, but got %s instead !", expectedGradOShape.c_str(), ShapeUtils::shapeAsString(gradO).c_str());
    REQUIRE_TRUE(expectedGradIShape == ShapeUtils::shapeAsString(gradI), 0, "AVGPOOL2D_BP op: wrong shape of input's gradients array (epsilon), expected is %s, but got %s instead !", expectedGradIShape.c_str(), ShapeUtils::shapeAsString(gradI).c_str());

+
    if(!isNCHW) {
-        input = input->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradI = gradI->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradO = gradO->permute({0, 3, 1, 2});                                   // [bS, oH, oW, iC] -> [bS, iC, oH, oW]                        
+        input = new NDArray(input->permute({0, 3, 1, 2}));                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradI = new NDArray(gradI->permute({0, 3, 1, 2}));                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradO = new NDArray(gradO->permute({0, 3, 1, 2}));                                   // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
    }

    if(isSameMode)                       // SAME
@ -205,9 +205,6 @@ CUSTOM_OP_IMPL(avgpool2d_bp, 2, 1, false, 0, 10) {
        delete gradI;
        delete gradO;
    }
-    // delete columns;
-    // delete columns2d;
-    // delete gradOVector;

    return Status::OK();

--- a/libnd4j/include/ops/declarable/generic/convo/pooling/avgpool3d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/pooling/avgpool3d.cpp
@ -61,8 +61,8 @@ CUSTOM_OP_IMPL(avgpool3dnew, 1, 1, false, 0, 14) {
    REQUIRE_TRUE(expectedOutputShape == ShapeUtils::shapeAsString(output), 0, "AVGPOOL3D op: wrong shape of output array, expected is %s, but got %s instead !", expectedOutputShape.c_str(), ShapeUtils::shapeAsString(output).c_str());

    if(!isNCDHW) {
-        input  = input->permute({0, 4, 1, 2, 3});                                                       // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
-        output = output->permute({0, 4, 1, 2, 3});                                                      // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
+        input  = new NDArray(input->permute({0, 4, 1, 2, 3}));              // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        output = new NDArray(output->permute({0, 4, 1, 2, 3}));             // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
    }

    if(isSameMode)                       // SAME
@ -180,9 +180,9 @@ CUSTOM_OP_IMPL(avgpool3dnew_bp, 2, 1, false, 0, 14) {
    REQUIRE_TRUE(expectedGradIShape == ShapeUtils::shapeAsString(gradI), 0, "AVGPOOL3D_BP op: wrong shape of input's gradients array (epsilon), expected is %s, but got %s instead !", expectedGradIShape.c_str(), ShapeUtils::shapeAsString(gradI).c_str());

    if(!isNCDHW) {
-        input = input->permute({0, 4, 1, 2, 3});                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]                        
-        gradI = gradI->permute({0, 4, 1, 2, 3});                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]                        
-        gradO = gradO->permute({0, 4, 1, 2, 3});                                   // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]                        
+        input = new NDArray(input->permute({0, 4, 1, 2, 3}));                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        gradI = new NDArray(gradI->permute({0, 4, 1, 2, 3}));                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        gradO = new NDArray(gradO->permute({0, 4, 1, 2, 3}));                                   // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
    }

    if(isSameMode)                       // SAME
--- a/libnd4j/include/ops/declarable/generic/convo/pooling/maxpool2d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/pooling/maxpool2d.cpp
@ -59,9 +59,9 @@ CUSTOM_OP_IMPL(maxpool2d, 1, 1, false, 0, 9) {
    const int iH = isNCHW ? input->sizeAt(2) : input->sizeAt(1);
    const int iW = isNCHW ? input->sizeAt(3) : input->sizeAt(2);

-    if (!isNCHW) {
-        input  = input->permute({0, 3, 1, 2});                // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
-        output = output->permute({0, 3, 1, 2});               // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
+    if(!isNCHW) {
+        input  = new NDArray(input->permute({0, 3, 1, 2}));                // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        output = new NDArray(output->permute({0, 3, 1, 2}));               // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
    }

    ConvolutionUtils::calcOutSizePool2D(oH, oW, kH, kW, sH, sW, pH, pW, dH, dW, iH, iW, isSameMode);
@ -72,7 +72,7 @@ CUSTOM_OP_IMPL(maxpool2d, 1, 1, false, 0, 9) {
    // 0,1 - kernel Height/Width; 2,3 - stride Height/Width; 4,5 - pad Height/Width; 6,7 - dilation Height/Width; poolingMode; 9 - divisor;
    ConvolutionUtils::pooling2d(block, *input, *output, kH, kW, sH, sW, pH, pW, dH, dW, PoolingType::MAX_POOL, 1);

-    if (!isNCHW) {
+    if(!isNCHW) {
        delete input;
        delete output;
    }
@ -175,9 +175,9 @@ CUSTOM_OP_IMPL(maxpool2d_bp, 2, 1, false, 0, 10) {
    REQUIRE_TRUE(expectedGradIShape == ShapeUtils::shapeAsString(gradI), 0, "MAXPOOL2D_BP op: wrong shape of input's gradients array (epsilon), expected is %s, but got %s instead !", expectedGradIShape.c_str(), ShapeUtils::shapeAsString(gradI).c_str());

    if(!isNCHW) {
-        input = input->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradI = gradI->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradO = gradO->permute({0, 3, 1, 2});                                   // [bS, oH, oW, iC] -> [bS, iC, oH, oW]                        
+        input = new NDArray(input->permute({0, 3, 1, 2}));                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradI = new NDArray(gradI->permute({0, 3, 1, 2}));                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradO = new NDArray(gradO->permute({0, 3, 1, 2}));                                   // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
    }

    if(isSameMode)                       // SAME
@ -203,9 +203,6 @@ CUSTOM_OP_IMPL(maxpool2d_bp, 2, 1, false, 0, 10) {
        delete gradI;
        delete gradO;
    }
-    // delete columns;
-    // delete columns2d;
-    // delete gradOVector;

    return Status::OK();
 }
--- a/libnd4j/include/ops/declarable/generic/convo/pooling/maxpool3d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/pooling/maxpool3d.cpp
@ -63,8 +63,8 @@ CUSTOM_OP_IMPL(maxpool3dnew, 1, 1, false, 0, 14) {
    // REQUIRE_TRUE(kD/2 >= pD && kH/2 >= pH && kW/2 >= pW, 0, "MAXPOOL3D OP: pad depth/height/width must not be greater than half of kernel depth/height/width, but got [%i, %i, %i] and [%i, %i, %i] correspondingly !", pD,pH,pW, kD,kH,kW);

    if(!isNCDHW) {
-        input  = input->permute({0, 4, 1, 2, 3});                                                       // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
-        output = output->permute({0, 4, 1, 2, 3});                                                      // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
+        input  = new NDArray(input->permute({0, 4, 1, 2, 3}));          // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        output = new NDArray(output->permute({0, 4, 1, 2, 3}));         // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
    }

    if(isSameMode)                       // SAME
@ -182,9 +182,9 @@ CUSTOM_OP_IMPL(maxpool3dnew_bp, 2, 1, false, 0, 14) {
    REQUIRE_TRUE(expectedGradIShape == ShapeUtils::shapeAsString(gradI), 0, "MAXPOOL3D_BP op: wrong shape of input's gradients array (epsilon), expected is %s, but got %s instead !", expectedGradIShape.c_str(), ShapeUtils::shapeAsString(gradI).c_str());

    if(!isNCDHW) {
-        input = input->permute({0, 4, 1, 2, 3});                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]                        
-        gradI = gradI->permute({0, 4, 1, 2, 3});                                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]                        
-        gradO = gradO->permute({0, 4, 1, 2, 3});                                   // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]                        
+        input = new NDArray(input->permute({0, 4, 1, 2, 3}));                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        gradI = new NDArray(gradI->permute({0, 4, 1, 2, 3}));                   // [bS, iD, iH, iW, iC] -> [bS, iC, iD, iH, iW]
+        gradO = new NDArray(gradO->permute({0, 4, 1, 2, 3}));                   // [bS, oD, oH, oW, iC] -> [bS, iC, oD, oH, oW]
    }

    if(isSameMode)                       // SAME
@ -211,9 +211,6 @@ CUSTOM_OP_IMPL(maxpool3dnew_bp, 2, 1, false, 0, 14) {
        delete gradI;
        delete gradO;
    }
-    // delete columns;
-    // delete columns2d;
-    // delete gradOVector;

    return Status::OK();
 }
--- a/libnd4j/include/ops/declarable/generic/convo/pooling/pnormpool2d.cpp
+++ b/libnd4j/include/ops/declarable/generic/convo/pooling/pnormpool2d.cpp
@ -54,9 +54,9 @@ namespace nd4j {

            int isNCHW  = block.getIArguments()->size() > 10 ? !INT_ARG(10) : 1;       // 1-NHWC, 0-NCHW

-            if (!isNCHW) {
-                input  = input->permute({0, 3, 1, 2});                  // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
-                output = output->permute({0, 3, 1, 2});                 // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
+            if(!isNCHW) {
+                input  = new NDArray(input->permute({0, 3, 1, 2}));                  // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+                output = new NDArray(output->permute({0, 3, 1, 2}));                 // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
            }

            const auto inY = static_cast<int>(input->sizeAt(2));
@ -70,7 +70,7 @@ namespace nd4j {
            // 0,1 - kernel Height/Width; 2,3 - stride Height/Width; 4,5 - pad Height/Width; 6,7 - dilation Height/Width; 8 - poolingMode; 9 - divisor;
            ConvolutionUtils::pooling2d(block, *input, *output, kY, kX, sY, sX, pY, pX, dY, dX, PoolingType::PNORM_POOL, extraParam0);

-            if (!isNCHW) {
+            if(!isNCHW) {
                delete input;
                delete output;
            }
@ -175,9 +175,9 @@ CUSTOM_OP_IMPL(pnormpool2d_bp, 2, 1, false, 1, 10) {
    REQUIRE_TRUE(expectedGradIShape == ShapeUtils::shapeAsString(gradI), 0, "PNORMPOOL2D_BP op: wrong shape of input's gradients array (epsilon), expected is %s, but got %s instead !", expectedGradIShape.c_str(), ShapeUtils::shapeAsString(gradI).c_str());

    if(!isNCHW) {
-        input = input->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradI = gradI->permute({0, 3, 1, 2});                                   // [bS, iH, iW, iC] -> [bS, iC, iH, iW]                        
-        gradO = gradO->permute({0, 3, 1, 2});                                   // [bS, oH, oW, iC] -> [bS, iC, oH, oW]                        
+        input = new NDArray(input->permute({0, 3, 1, 2}));          // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradI = new NDArray(gradI->permute({0, 3, 1, 2}));          // [bS, iH, iW, iC] -> [bS, iC, iH, iW]
+        gradO = new NDArray(gradO->permute({0, 3, 1, 2}));          // [bS, oH, oW, iC] -> [bS, iC, oH, oW]
    }

    // if(isSameMode)                       // SAME
@ -216,10 +216,6 @@ CUSTOM_OP_IMPL(pnormpool2d_bp, 2, 1, false, 1, 10) {
        delete gradI;
        delete gradO;
    }
-    // delete columns;
-    // delete columns2d;
-    // delete gradOVector;
-    // delete denomVec;

    return Status::OK();
 }
--- a/libnd4j/include/ops/declarable/generic/loss/softmaxCrossEntropy.cpp
+++ b/libnd4j/include/ops/declarable/generic/loss/softmaxCrossEntropy.cpp
@ -78,7 +78,7 @@ CUSTOM_OP_IMPL(softmax_cross_entropy_loss, 3, 1, false, 1, 1) {
 	auto weightsBroad = weights;
 	if(!weights->isScalar() && !weights->isSameShape(&E)) {
 		if(E.rankOf() == 1 && weights->isVector() && weights->rankOf() > 1)
-    		weightsBroad = weights->reshape(weights->ordering(), {weights->lengthOf()});
+    		weightsBroad = new NDArray(weights->reshape(weights->ordering(), {weights->lengthOf()}));
    	else
 			weightsBroad = new NDArray(weights->tileToShape(E.getShapeInfo()));
 	}
--- a/libnd4j/include/ops/declarable/generic/nn/dot_product_attention.cpp
+++ b/libnd4j/include/ops/declarable/generic/nn/dot_product_attention.cpp
@ -74,7 +74,7 @@ namespace ops  {
        }

        if(mask != nullptr){
-            NDArray* reshapedMask;
+            NDArray reshapedMask;
            if(weights->rankOf() == 4){
                reshapedMask = mask->reshape(mask->ordering(), {mask->sizeAt(0), 1, mask->sizeAt(1), 1});
            }else{
@ -87,8 +87,7 @@ namespace ops  {
            // before going through the softmax, we effectively push all masked positions to zero after softmax.
            //
            // we are using 1e9 to mean effectively infinity
-            *weights += (*reshapedMask - 1) * 1e9;
-            delete reshapedMask;
+            *weights += (reshapedMask - 1) * 1e9;
        }

        nd4j::ops::softmax softmax;
@ -175,14 +174,13 @@ namespace ops  {
            preSoftmax /= factor;

        if(mask != nullptr){
-            NDArray* reshapedMask;
+            NDArray reshapedMask;
            if(preSoftmax.rankOf() == 4){
                reshapedMask = mask->reshape(mask->ordering(), {mask->sizeAt(0), 1, mask->sizeAt(1), 1});
            }else{
                reshapedMask = mask->reshape(mask->ordering(), {mask->sizeAt(0), mask->sizeAt(1), 1});
            }
-            preSoftmax += (*reshapedMask - 1) * 1e9;
-            delete reshapedMask;
+            preSoftmax += (reshapedMask - 1) * 1e9;
        }

        NDArray weights('c', weightShape, values->dataType(), block.launchContext());
--- a/libnd4j/include/ops/declarable/generic/nn/lrn.cpp
+++ b/libnd4j/include/ops/declarable/generic/nn/lrn.cpp
@ -70,7 +70,7 @@ namespace nd4j {
            float beta  = T_ARG(2);            
            int depth   = INT_ARG(0);

-            helpers::lrnBP(*input, *gradO, *gradI, depth, bias, alpha, beta);
+            helpers::lrnBP(block, *input, *gradO, *gradI, depth, bias, alpha, beta);

            return Status::OK();
        }
--- a/libnd4j/include/ops/declarable/generic/nn/multi_head_dot_product_attention.cpp
+++ b/libnd4j/include/ops/declarable/generic/nn/multi_head_dot_product_attention.cpp
@ -98,9 +98,9 @@ namespace ops  {
        auto projectedValues = AttentionHelper::multiHeadProject(values, Wv, block.launchContext());

        // Apply Attention
-        NDArray attnResults('c', {projectedQueries->sizeAt(0), projectedValues->sizeAt(1), projectedValues->sizeAt(2), projectedQueries->sizeAt(3)}, projectedValues->dataType(), block.launchContext());
+        NDArray attnResults('c', {projectedQueries.sizeAt(0), projectedValues.sizeAt(1), projectedValues.sizeAt(2), projectedQueries.sizeAt(3)}, projectedValues.dataType(), block.launchContext());
        nd4j::ops::dot_product_attention attention;
-        attention.execute({projectedQueries, projectedKeys, projectedValues, mask}, {&attnResults, weights ? OUTPUT_VARIABLE(1) : nullptr}, {}, {normalization, weights}, {});
+        attention.execute({&projectedQueries, &projectedKeys, &projectedValues, mask}, {&attnResults, weights ? OUTPUT_VARIABLE(1) : nullptr}, {}, {normalization, weights}, {});

        // Project attention results
        attnResults.permutei({0, 3, 1, 2});
@ -111,11 +111,9 @@ namespace ops  {
        mmul.execute({&attnResults, Wo},{&projRes}, {}, {}, {});
        projRes.reshapei(projRes.ordering(), {miniBatchSize, queryCount, outSize});
        projRes.permutei({0, 2, 1});
-        output->assign(projRes);

-        delete projectedQueries;
-        delete projectedKeys;
-        delete projectedValues;
+        // FIXME: bad for performance
+        output->assign(projRes);

        return Status::OK();
    }
@ -227,9 +225,9 @@ namespace ops  {
        auto projectedValues = AttentionHelper::multiHeadProject(values, Wv, block.launchContext());

        // Apply Attention
-        NDArray attnResults('c', {projectedQueries->sizeAt(0), projectedValues->sizeAt(1), projectedValues->sizeAt(2), projectedQueries->sizeAt(3)}, projectedValues->dataType(), block.launchContext());
+        NDArray attnResults('c', {projectedQueries.sizeAt(0), projectedValues.sizeAt(1), projectedValues.sizeAt(2), projectedQueries.sizeAt(3)}, projectedValues.dataType(), block.launchContext());
        nd4j::ops::dot_product_attention attention;
-        attention.execute({projectedQueries, projectedKeys, projectedValues, mask}, {&attnResults}, {}, {normalization, 0}, {});
+        attention.execute({&projectedQueries, &projectedKeys, &projectedValues, mask}, {&attnResults}, {}, {normalization, 0}, {});

        // Project attention results
        attnResults.permutei({0, 3, 1, 2});
@ -237,31 +235,25 @@ namespace ops  {

        // dLdWo
        auto epsPerm = eps->permute({0, 2, 1});
-        auto epsPostReshape = epsPerm->reshape(eps->ordering(), {miniBatchSize * queryCount, outSize});
+        auto epsPostReshape = epsPerm.reshape(eps->ordering(), {miniBatchSize * queryCount, outSize});
        nd4j::ops::matmul_bp matmulBp;
        NDArray dLdPreWo(attnResults.shapeInfo(), false, block.launchContext());
-        matmulBp.execute({&attnResults, Wo, epsPostReshape}, {&dLdPreWo, dLdWo}, {}, {}, {});
+        matmulBp.execute({&attnResults, Wo, &epsPostReshape}, {&dLdPreWo, dLdWo}, {}, {}, {});

        // dLdAttn
-        dLdPreWo.reshapei({miniBatchSize, queryCount, numHeads, projectedValues->sizeAt(2)});
+        dLdPreWo.reshapei({miniBatchSize, queryCount, numHeads, projectedValues.sizeAt(2)});
        dLdPreWo.permutei({0, 2, 3, 1});

        nd4j::ops::dot_product_attention_bp attentionBp;
-        NDArray dLdProjectedQueries(projectedQueries->shapeInfo(), false, block.launchContext());
-        NDArray dLdProjectedKeys(projectedKeys->shapeInfo(), false, block.launchContext());
-        NDArray dLdProjectedValues(projectedValues->shapeInfo(), false, block.launchContext());
-        attentionBp.execute({projectedQueries, projectedKeys, projectedValues, &dLdPreWo, mask},{&dLdProjectedQueries, &dLdProjectedKeys, &dLdProjectedValues}, {}, {normalization}, {});
+        NDArray dLdProjectedQueries(projectedQueries.shapeInfo(), false, block.launchContext());
+        NDArray dLdProjectedKeys(projectedKeys.shapeInfo(), false, block.launchContext());
+        NDArray dLdProjectedValues(projectedValues.shapeInfo(), false, block.launchContext());
+        attentionBp.execute({&projectedQueries, &projectedKeys, &projectedValues, &dLdPreWo, mask},{&dLdProjectedQueries, &dLdProjectedKeys, &dLdProjectedValues}, {}, {normalization}, {});

        AttentionHelper::multiHeadProjectBp(queries, Wq, &dLdProjectedQueries, dLdq, dLdWq, block.launchContext());
        AttentionHelper::multiHeadProjectBp(keys, Wk, &dLdProjectedKeys, dLdk, dLdWk, block.launchContext());
        AttentionHelper::multiHeadProjectBp(values, Wv, &dLdProjectedValues, dLdv, dLdWv, block.launchContext());

-        delete projectedQueries;
-        delete projectedKeys;
-        delete projectedValues;
-        delete epsPerm;
-        delete epsPostReshape;
-
        return Status::OK();
    }

--- a/libnd4j/include/ops/declarable/generic/parity_ops/betaInc.cpp
+++ b/libnd4j/include/ops/declarable/generic/parity_ops/betaInc.cpp
@ -51,7 +51,7 @@ CONFIGURABLE_OP_IMPL(betainc, 3, 1, false, 0, 0) {
        REQUIRE_TRUE(0.f <= x->e<float>(i) && x->e<float>(i) <= 1.f, 0, "BETAINC op: all elements of x array must be within [0, 1] range!");
    }

-    *output = helpers::betaInc(block.launchContext(), *a, *b, *x);
+    helpers::betaInc(block.launchContext(), *a, *b, *x, *output);

    return Status::OK();
 }
--- a/libnd4j/include/ops/declarable/generic/parity_ops/bias_add.cpp
+++ b/libnd4j/include/ops/declarable/generic/parity_ops/bias_add.cpp
@ -48,10 +48,7 @@ namespace nd4j {
                //nd4j_debug("Reshaping to: [%i, %i]\n", -1, (int) bias->lengthOf());
                auto tArr = input->reshape(input->ordering(), shape);
                auto zArr = z->reshape(z->ordering(), shape);
-                tArr->addRowVector(bias, zArr);
-
-                delete tArr;
-                delete zArr;
+                tArr.addRowVector(bias, &zArr);
            }

            STORE_RESULT(*z);
@ -87,13 +84,12 @@ namespace nd4j {
            // cnn case
            if (input->rankOf() == 4) {
                auto epsilonNext2d = epsilonNext->permute({1, 0, 2, 3});
-                epsilonNext2d->reshapei('c', {(int) bias->lengthOf(), -1});
+                epsilonNext2d.reshapei('c', {(int) bias->lengthOf(), -1});

-                auto sum = epsilonNext2d->reduceAlongDimension(reduce::Sum, {1});
+                auto sum = epsilonNext2d.reduceAlongDimension(reduce::Sum, {1});
                gradB->assign(sum);

                delete sum;
-                delete epsilonNext2d;
            } else if (input->rankOf() == 2) {
                // regular fully-connected case
                auto sum = epsilonNext->reduceAlongDimension(reduce::Sum, {0});
--- a/libnd4j/include/ops/declarable/generic/parity_ops/check_numerics.cpp
+++ b/libnd4j/include/ops/declarable/generic/parity_ops/check_numerics.cpp
@ -0,0 +1,56 @@
+/*******************************************************************************
+ * Copyright (c) 2015-2018 Skymind, Inc.
+ *
+ * This program and the accompanying materials are made available under the
+ * terms of the Apache License, Version 2.0 which is available at
+ * https://www.apache.org/licenses/LICENSE-2.0.
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+ * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
+ * License for the specific language governing permissions and limitations
+ * under the License.
+ *
+ * SPDX-License-Identifier: Apache-2.0
+ ******************************************************************************/
+
+//
+//  @author raver119@gmail.com
+//
+
+#include <op_boilerplate.h>
+#if NOT_EXCLUDED(OP_check_numerics)
+
+#include <ops/declarable/CustomOperations.h>
+
+namespace nd4j {
+    namespace ops {
+
+        CUSTOM_OP_IMPL(check_numerics, 2, 1, true, 0, 0) {
+            auto input = INPUT_VARIABLE(0);
+            auto message = INPUT_VARIABLE(1);
+            auto output = OUTPUT_VARIABLE(0);
+
+            auto allFinite = input->reduceNumber(reduce::BoolOps::IsFinite);
+            REQUIRE_TRUE(allFinite.e<bool>(0), 0, "CheckNumerics: %s", message->e<std::string>(0).c_str());
+
+            if (!block.isInplace())
+                output->assign(input);
+
+            return Status::OK();
+        }
+
+        DECLARE_SHAPE_FN(check_numerics) {
+            return SHAPELIST(ConstantShapeHelper::getInstance()->createShapeInfo(ShapeDescriptor(inputShape->at(0))));
+        }
+
+        DECLARE_TYPES(check_numerics) {
+            getOpDescriptor()
+                    ->setAllowedInputTypes(0, {ALL_FLOATS})
+                    ->setAllowedInputTypes(1, nd4j::DataType::UTF8)
+                    ->setAllowedOutputTypes({ALL_FLOATS});
+        }
+    }
+}
+
+#endif
--- a/libnd4j/include/ops/declarable/generic/parity_ops/crop_and_resize.cpp
+++ b/libnd4j/include/ops/declarable/generic/parity_ops/crop_and_resize.cpp
@ -56,7 +56,7 @@ namespace nd4j {
        }

        DECLARE_SHAPE_FN(crop_and_resize) {
-            auto in = inputShape->at(0);
+            auto in = inputShape->at(1);

            Nd4jLong outputShape[4];

@ -77,8 +77,13 @@ namespace nd4j {
        }
        DECLARE_TYPES(crop_and_resize) {
            getOpDescriptor()
-                    ->setAllowedInputTypes(nd4j::DataType::ANY)
-                    ->setAllowedOutputTypes({ALL_FLOATS});
+                    ->setAllowedInputTypes(0, {ALL_INTS, ALL_FLOATS})
+//                    ->setAllowedInputTypes(1, {ALL_FLOATS})
+                    ->setAllowedInputTypes(1, {FLOAT32}) // as TF
+                    ->setAllowedInputTypes(2, {ALL_INTS})
+                    ->setAllowedInputTypes(3, {ALL_INTS})
+                    ->setAllowedOutputTypes({FLOAT32}); // as TF
+//                    ->setAllowedOutputTypes({ALL_FLOATS});
        }
    }
 }
--- a/libnd4j/include/ops/declarable/generic/parity_ops/cross.cpp
+++ b/libnd4j/include/ops/declarable/generic/parity_ops/cross.cpp
@ -47,9 +47,9 @@ namespace ops {
        auto o = OUTPUT_VARIABLE(0);

        if (a->lengthOf() == 3) {
-            helpers::_cross(block.launchContext(), a, b, o);
+            helpers::cross(block.launchContext(), a, b, o);
        } else {
-            helpers::_crossBatched(block.launchContext(), a, b, o);
+            helpers::crossBatched(block.launchContext(), a, b, o);
        }

        return Status::OK();