title | short_title | description | category | weight |
---|---|---|---|---|
DataVec Schema | Schema | Schemas for datasets and transformation. | DataVec | 1 |
Why use schemas?
The unfortunate reality is that data is dirty. When trying to vectorize a dataset for deep learning, it is quite rare to find files that have zero errors. A schema is important for maintaining the meaning of the data before using it for something like training a neural network.
Using schemas
Schemas are primarily used for programming transformations. Before you can execute a TransformProcess, you need to pass it the schema of the data being transformed.
An example of a schema for merchant records may look like:
Schema inputDataSchema = new Schema.Builder()
.addColumnsString("DateTimeString", "CustomerID", "MerchantID")
.addColumnInteger("NumItemsInTransaction")
.addColumnCategorical("MerchantCountryCode", Arrays.asList("USA","CAN","FR","MX"))
.addColumnDouble("TransactionAmountUSD",0.0,null,false,false) //$0.0 or more, no maximum limit, no NaN and no Infinite values
.addColumnCategorical("FraudLabel", Arrays.asList("Fraud","Legit"))
.build();
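To make the comment on TransactionAmountUSD concrete, here is a minimal plain-Java sketch of what a bounded double column constraint means: a minimum of 0.0, no maximum, and NaN and infinite values rejected. The class name DoubleColumnRule and its isValid method are hypothetical and written only for illustration; this is not DataVec's implementation, and it assumes a null bound means "unbounded".

```java
// Hypothetical illustration of a bounded double column rule.
// This is NOT part of DataVec; it only mirrors the constraint described above.
final class DoubleColumnRule {
    private final Double min;            // null means no lower bound (assumption)
    private final Double max;            // null means no upper bound (assumption)
    private final boolean allowNaN;
    private final boolean allowInfinite;

    DoubleColumnRule(Double min, Double max, boolean allowNaN, boolean allowInfinite) {
        this.min = min;
        this.max = max;
        this.allowNaN = allowNaN;
        this.allowInfinite = allowInfinite;
    }

    // Returns true if the value satisfies every configured constraint.
    boolean isValid(double value) {
        if (Double.isNaN(value)) {
            return allowNaN;
        }
        if (Double.isInfinite(value)) {
            return allowInfinite;
        }
        if (min != null && value < min) {
            return false;
        }
        if (max != null && value > max) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // Same arguments as the TransactionAmountUSD column: 0.0, null, false, false
        DoubleColumnRule rule = new DoubleColumnRule(0.0, null, false, false);
        System.out.println(rule.isValid(19.99));
        System.out.println(rule.isValid(-5.0));
        System.out.println(rule.isValid(Double.NaN));
    }
}
```

Declaring constraints like these in the schema lets invalid rows be detected before the data reaches training, rather than surfacing later as unexplained NaN values.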
Joining schemas
If you have two different datasets that you want to merge together, DataVec provides a Join class with different join strategies such as Inner or RightOuter.
Schema customerInfoSchema = new Schema.Builder()
.addColumnLong("customerID")
.addColumnString("customerName")
.addColumnCategorical("customerCountry", Arrays.asList("USA","France","Japan","UK"))
.build();
Schema customerPurchasesSchema = new Schema.Builder()
.addColumnLong("customerID")
.addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
.addColumnLong("productID")
.addColumnInteger("purchaseQty")
.addColumnDouble("unitPriceUSD")
.build();
Join join = new Join.Builder(Join.JoinType.Inner)
.setJoinColumns("customerID")
.setSchemas(customerInfoSchema, customerPurchasesSchema)
.build();
Once you've defined your join and you've loaded the data into DataVec, you must use an Executor to complete the join.
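The difference between the Inner and RightOuter strategies can be sketched in plain Java, independent of DataVec. The JoinSketch class, its join method, and the sample records below are hypothetical and exist only to illustrate the semantics. The sketch assumes customerID is unique on the left (customer info) side and that unmatched left-side values are represented as the text "null"; how DataVec itself represents missing values is not covered here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of join semantics. This is NOT DataVec's Join class.
final class JoinSketch {

    // left:  customerID -> customerName (customerID assumed unique)
    // right: rows of {customerID, productID}
    // keepUnmatchedRight = false behaves like an inner join,
    // keepUnmatchedRight = true behaves like a right outer join.
    static List<String> join(Map<Long, String> left, long[][] right, boolean keepUnmatchedRight) {
        List<String> out = new ArrayList<>();
        for (long[] row : right) {
            String name = left.get(row[0]);
            if (name != null) {
                // Key present on both sides: emit the combined row.
                out.add(row[0] + "," + name + "," + row[1]);
            } else if (keepUnmatchedRight) {
                // Right outer join: keep the right row, left columns are missing.
                out.add(row[0] + ",null," + row[1]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Long, String> customers = new LinkedHashMap<>();
        customers.put(1L, "Ada");
        customers.put(2L, "Grace");

        // Customer 3 has a purchase but no customer info record.
        long[][] purchases = { {1L, 100L}, {3L, 300L}, {1L, 101L} };

        System.out.println("Inner:      " + join(customers, purchases, false));
        System.out.println("RightOuter: " + join(customers, purchases, true));
    }
}
```

With an inner join, the purchase by customer 3 is dropped because no matching customer info exists; with a right outer join it is kept, and the customer name is left empty.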
Classes and utilities
DataVec comes with a few Schema classes and helper utilities for 2D and sequence types of data.
{{autogenerated}}