---
title: DataVec Executors
short_title: Executors
description: Execute ETL and vectorization in a local instance.
category: DataVec
weight: 3
---

## Local or remote execution?

Because datasets are often large, you can choose the execution mechanism that best suits your needs. For example, if you are vectorizing a large training dataset, you can process it on a distributed Spark cluster. However, if you need to do real-time inference, DataVec also provides a local executor that doesn't require any additional setup.

## Executing a transform process

Once you've created your `TransformProcess` using your `Schema`, and you've either loaded your dataset into an Apache Spark `JavaRDD` or have a `RecordReader` that loads your dataset, you can execute a transform.
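
For context, a minimal sketch of that setup might look like the following. The column names and the `removeColumns` step are illustrative placeholders, not prescribed by DataVec:

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

// Define the schema of the raw input data (column names are placeholders)
Schema schema = new Schema.Builder()
    .addColumnString("name")
    .addColumnInteger("age")
    .addColumnDouble("income")
    .build();

// Define the transformations to apply against that schema
TransformProcess transformProcess = new TransformProcess.Builder(schema)
    .removeColumns("name")
    .build();
```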

Locally, this looks like:

```java
import org.datavec.local.transforms.LocalTransformExecutor;

// Transform standard (non-sequence) records
List<List<Writable>> transformed = LocalTransformExecutor.execute(recordReader, transformProcess);

// Transform sequence (e.g. time series) records
List<List<List<Writable>>> transformedSeq = LocalTransformExecutor.executeToSequence(sequenceReader, transformProcess);

// Join two sources of records
List<List<Writable>> joined = LocalTransformExecutor.executeJoin(join, leftReader, rightReader);
```
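
The local executor operates on records held in memory. One way to get them there, sketched here under the assumption of a comma-delimited input file (the file name is a placeholder, and `transformProcess` is reused from above), is to drain a `RecordReader` into a list first:

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

// Read a comma-delimited file ("data.csv" is a placeholder)
CSVRecordReader reader = new CSVRecordReader();
reader.initialize(new FileSplit(new File("data.csv")));

// Drain the reader into an in-memory list of records
List<List<Writable>> records = new ArrayList<>();
while (reader.hasNext()) {
    records.add(reader.next());
}

List<List<Writable>> transformed = LocalTransformExecutor.execute(records, transformProcess);
```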

When using Spark, this looks like:

```java
import org.datavec.spark.transform.SparkTransformExecutor;

// Transform standard (non-sequence) records
JavaRDD<List<Writable>> transformed = SparkTransformExecutor.execute(inputRdd, transformProcess);

// Transform sequence (e.g. time series) records
JavaRDD<List<List<Writable>>> transformedSeq = SparkTransformExecutor.executeToSequence(inputSequenceRdd, transformProcess);

// Join two RDDs of records
JavaRDD<List<Writable>> joined = SparkTransformExecutor.executeJoin(join, leftRdd, rightRdd);
```
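
To obtain the `inputRdd` of records in the first place, a common pattern, sketched here assuming a CSV text file and an existing `JavaSparkContext` named `sc` (both placeholders), is to parse each line with a `RecordReader`:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.misc.StringToWritablesFunction;

// Load the raw text file and parse each line into a List<Writable> record
JavaRDD<String> lines = sc.textFile("data.csv");
JavaRDD<List<Writable>> inputRdd = lines.map(new StringToWritablesFunction(new CSVRecordReader()));
```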

## Available executors

{{autogenerated}}