title | short_title | description | category | weight |
---|---|---|---|---|
DataVec Transforms | Transforms | Data wrangling and mapping from one schema to another. | DataVec | 1 |
Data wrangling
Transformations are one of DataVec's key tools. DataVec helps you map a dataset from one schema to another, and provides operations to convert types, format data, and convert a 2D dataset to sequence data.
Building a transform process
A transform process requires a Schema to successfully transform data. Both the schema and transform process classes come with a helper Builder class, which is useful for organizing code and avoiding complex constructors.
When the two are combined, they look like the sample code below. Note how inputDataSchema is passed into the Builder constructor. Your transform process will fail to compile without it.
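import java.util.Arrays;
import java.util.HashSet;
import org.datavec.api.transform.TransformProcess;
import org.datavec.api.transform.condition.ConditionOp;
import org.datavec.api.transform.condition.column.CategoricalColumnCondition;
import org.datavec.api.transform.condition.column.DoubleColumnCondition;
import org.datavec.api.transform.filter.ConditionFilter;
import org.datavec.api.transform.transform.time.DeriveColumnsFromTimeTransform;
import org.datavec.api.writable.DoubleWritable;
import org.joda.time.DateTimeFieldType;
import org.joda.time.DateTimeZone;

TransformProcess tp = new TransformProcess.Builder(inputDataSchema)
    .removeColumns("CustomerID","MerchantID")
    //Remove every record whose country code is not USA or CAN
    .filter(new ConditionFilter(new CategoricalColumnCondition("MerchantCountryCode", ConditionOp.NotInSet, new HashSet<>(Arrays.asList("USA","CAN")))))
    .conditionalReplaceValueTransform(
        "TransactionAmountUSD", //Column to operate on
        new DoubleWritable(0.0), //New value to use, when the condition is satisfied
        new DoubleColumnCondition("TransactionAmountUSD",ConditionOp.LessThan, 0.0)) //Condition: amount < 0.0
    .stringToTimeTransform("DateTimeString","yyyy-MM-dd HH:mm:ss.SSS", DateTimeZone.UTC) //Joda-Time pattern: dd is day of month (DD would be day of year)
    .renameColumn("DateTimeString", "DateTime")
    .transform(new DeriveColumnsFromTimeTransform.Builder("DateTime").addIntegerDerivedColumn("HourOfDay", DateTimeFieldType.hourOfDay()).build())
    .removeColumns("DateTime")
    .build();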
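To make the effect of these steps concrete, here is a minimal plain-Java sketch that applies the same logic to a single record. It is not DataVec code: the class name TransformSketch, the method apply, and the sample values are hypothetical, and it uses java.time instead of the Joda-Time library that DataVec relies on. It only mirrors the filter, the conditional replacement, and the hour-of-day derivation.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class TransformSketch {
    // Countries that survive the filter step (anything else is removed)
    static final Set<String> KEPT_COUNTRIES = new HashSet<>(Arrays.asList("USA", "CAN"));
    // java.time equivalent of the timestamp pattern; an assumption for this sketch
    static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /**
     * Returns {amount, hourOfDay} for a kept record,
     * or null if the record would be filtered out.
     */
    static double[] apply(String country, double amountUsd, String dateTimeString) {
        // filter: drop records whose country is not in the set
        if (!KEPT_COUNTRIES.contains(country)) {
            return null;
        }
        // conditionalReplaceValueTransform: negative amounts become 0.0
        double amount = amountUsd < 0.0 ? 0.0 : amountUsd;
        // stringToTimeTransform + derived HourOfDay column
        int hourOfDay = LocalDateTime.parse(dateTimeString, FORMAT).getHour();
        return new double[] {amount, hourOfDay};
    }

    public static void main(String[] args) {
        double[] kept = apply("USA", -12.5, "2016-01-01 17:10:52.000");
        System.out.println(kept[0] + " " + (int) kept[1]);
        System.out.println(apply("FR", 40.0, "2016-01-01 09:00:00.000") == null);
    }
}
```

In the real transform process the output record also keeps the MerchantCountryCode column; the sketch leaves it out to stay short.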
Executing a transformation
Different "backends" for executors are available. Using the tp transform process above, here's how you can execute it locally using plain DataVec.
import java.util.List;
import org.datavec.api.writable.Writable;
import org.datavec.local.transforms.LocalTransformExecutor;
List<List<Writable>> processedData = LocalTransformExecutor.execute(originalData, tp);
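The call above assumes that originalData and the inputDataSchema used to build tp already exist. As a rough sketch, a matching schema and a one-record input might be declared as follows. The category values, the record contents, and the choice to include only the columns the transform process refers to are assumptions made for illustration; a real dataset would define its own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.Text;
import org.datavec.api.writable.Writable;

//Hypothetical schema covering only the columns that tp refers to
Schema inputDataSchema = new Schema.Builder()
    .addColumnsString("CustomerID", "MerchantID")
    .addColumnString("DateTimeString")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    .addColumnDouble("TransactionAmountUSD")
    .build();

//One made-up record; values appear in the same order as the schema columns
List<List<Writable>> originalData = new ArrayList<>();
originalData.add(Arrays.<Writable>asList(
    new Text("customer-1"),
    new Text("merchant-1"),
    new Text("2016-01-01 17:10:52.000"),
    new Text("USA"),
    new DoubleWritable(-12.5)));
```

Each inner list is one record, so the number and order of its values should line up with the columns declared in the schema.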
Debugging
Each operation in a transform process represents a "step" in schema changes. Sometimes the resulting transformation is not what you intended. You can debug this by printing the schema after each step in the transform process tp with the following:
//Print the schema after each step:
int numActions = tp.getActionList().size();
for (int i = 0; i < numActions; i++) {
    System.out.println("\n\n==================================================");
    System.out.println("-- Schema after step " + i + " (" + tp.getActionList().get(i) + ") --");
    System.out.println(tp.getSchemaAfterStep(i));
}
Available transformations and conversions
{{autogenerated}}