---
title: DataVec Transforms
short_title: Transforms
description: Data wrangling and mapping from one schema to another.
category: DataVec
weight: 1
---

## Data wrangling

Transformations are one of the key tools in DataVec. DataVec helps you map a dataset from one schema to another, and provides operations to convert types, format data, and convert a 2D dataset to sequence data.

## Building a transform process

A transform process requires a `Schema` to transform data. Both the `Schema` and `TransformProcess` classes come with a helper `Builder` class, which is useful for organizing code and avoiding complex constructors.

Combined, they look like the sample code below. Note how `inputDataSchema` is passed into the `Builder` constructor; the transform process will fail to compile without it.

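```java
import java.util.Arrays;
import java.util.HashSet;

import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
import org.datavec.api.writable.DoubleWritable;
import org.joda.time.DateTimeFieldType;
import org.joda.time.DateTimeZone;

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .removeColumns("CustomerID","MerchantID")
    //Filter out records whose country code is not in the set {USA, CAN}
    .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
    .conditionalReplaceValueTransform(
        "TransactionAmountUSD",       //Column to operate on
        new DoubleWritable(0.0),      //New value to use, when the condition is satisfied
        new DoubleColumnCondition("TransactionAmountUSD",ConditionOp.LessThan, 0.0)) //Condition: amount < 0.0
    .stringToTimeTransform("DateTimeString","yyyy-MM-dd HH:mm:ss.SSS", DateTimeZone.UTC) //Joda-Time pattern: dd = day of month
    .renameColumn("DateTimeString", "DateTime")
    .transform(new DeriveColumnsFromTimeTransform.Builder("DateTime").addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay()).build())
    .removeColumns("DateTime")
    .build();
```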
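
The sample above assumes that `inputDataSchema` already exists. As a minimal sketch, it might be defined with `Schema.Builder` along the following lines. Only the column names come from the transform process; the column types, the class name `InputSchemaSketch`, and the extra country codes (`"FR"`, `"MX"`) are assumptions made for illustration.

```java
import org.datavec.api.transform.schema.Schema;

import java.util.Arrays;

public class InputSchemaSketch {

    // Builds a schema with the columns referenced by the transform process.
    // Column types and the list of categories are illustrative assumptions.
    public static Schema buildInputSchema() {
        return new Schema.Builder()
            .addColumnString("DateTimeString")
            .addColumnsString("CustomerID", "MerchantID")
            .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
            .addColumnDouble("TransactionAmountUSD")
            .build();
    }

    public static void main(String[] args) {
        Schema inputDataSchema = buildInputSchema();
        // Prints the column names and types in order
        System.out.println(inputDataSchema);
    }
}
```

The order in which columns are added should match the order of the values in each record of your data.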

## Executing a transformation

Different executor "backends" are available. Using the `tp` transform process above, here is how to execute it locally with plain DataVec.

```java
import java.util.List;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

//originalData holds one List<Writable> per record
List<List<Writable>> processedData = LocalTransformExecutor.execute(originalData, tp);
```
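
The snippet above leaves `originalData` undefined. The following is a small end-to-end sketch of local execution on in-memory records, assuming the `datavec-api` and `datavec-local` artifacts are on the classpath. The two-column schema, the sample values, and the class name `LocalExecutionSketch` are hypothetical and deliberately simpler than the transform process shown earlier.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Text;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LocalExecutionSketch {

    public static List<List<Writable>> run() {
        // Hypothetical two-column input schema
        Schema schema = new Schema.Builder()
            .addColumnString("CustomerID")
            .addColumnDouble("TransactionAmountUSD")
            .build();

        // Drop the ID column and replace negative amounts with 0.0
        TransformProcess tp = new TransformProcess.Builder(schema)
            .removeColumns("CustomerID")
            .conditionalReplaceValueTransform(
                "TransactionAmountUSD",
                new DoubleWritable(0.0),
                new DoubleColumnCondition("TransactionAmountUSD", ConditionOp.LessThan, 0.0))
            .build();

        // Each record is a List<Writable> whose order matches the schema
        List<List<Writable>> originalData = new ArrayList<>();
        originalData.add(Arrays.<Writable>asList(new Text("c1"), new DoubleWritable(25.0)));
        originalData.add(Arrays.<Writable>asList(new Text("c2"), new DoubleWritable(-5.0)));

        return LocalTransformExecutor.execute(originalData, tp);
    }

    public static void main(String[] args) {
        for (List<Writable> record : run()) {
            System.out.println(record);
        }
    }
}
```

In practice the records usually come from a `RecordReader` rather than being built by hand; the in-memory list here only keeps the sketch self-contained.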

## Debugging

Each operation in a transform process represents a "step" that changes the schema. Sometimes the resulting transformation is not what you intended. You can debug this by printing the schema after each step of the transform process `tp`:

```java
//Now, print the schema after each step:
int numActions = tp.getActionList().size();

for(int i=0; i<numActions; i++ ){
    System.out.println("\n\n==================================================");
    System.out.println("-- Schema after step " + i + " (" + tp.getActionList().get(i) + ") --");

    System.out.println(tp.getSchemaAfterStep(i));
}
```
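
Besides printing every step, it can be useful to check the final schema programmatically, for example in a unit test. The sketch below uses `getFinalSchema()` on a small hypothetical transform process; the schema, the renamed column `AmountUSD`, and the class name `FinalSchemaCheck` are assumptions for illustration only.

```java
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.schema.Schema;

import java.util.List;

public class FinalSchemaCheck {

    // Returns the column names that remain after all steps have been applied
    public static List<String> finalColumns() {
        Schema schema = new Schema.Builder()
            .addColumnsString("CustomerID", "MerchantID")
            .addColumnDouble("TransactionAmountUSD")
            .build();

        TransformProcess tp = new TransformProcess.Builder(schema)
            .removeColumns("CustomerID", "MerchantID")
            .renameColumn("TransactionAmountUSD", "AmountUSD")
            .build();

        return tp.getFinalSchema().getColumnNames();
    }

    public static void main(String[] args) {
        System.out.println(finalColumns());
    }
}
```

Comparing this list against the columns your downstream code expects catches a misnamed or forgotten column before any data is processed.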

## Available transformations and conversions

{{autogenerated}}