---
title: DataVec Schema
short_title: Schema
description: Schemas for datasets and transformation.
category: DataVec
weight: 1
---

## Why use schemas?

The unfortunate reality is that data is *dirty*. When you vectorize a dataset for deep learning, it is rare to find files with zero errors. A schema records what each column is supposed to contain, which preserves the meaning of the data before it is used for something like training a neural network.

## Using schemas

Schemas are primarily used when programming transformations. Before you can build and execute a `TransformProcess`, you need to pass it the schema of the data being transformed.

A schema for merchant transaction records may look like this:

```java
import java.util.Arrays;
import org.datavec.api.transform.schema.Schema;

Schema inputDataSchema = new Schema.Builder()
    .addColumnsString("DateTimeString", "CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA","CAN","FR","MX"))
    // $0.0 or more, no maximum limit, no NaN and no infinite values
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false)
    .addColumnCategorical("FraudLabel", Arrays.asList("Fraud","Legit"))
    .build();
```
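
To make the column constraints above concrete, here is a library-free sketch in plain Java of the checks that a categorical column and a bounded double column describe. The class and method names (`ColumnCheckSketch`, `isValidCategory`, `isValidAmount`) are hypothetical and are not part of DataVec; DataVec's own validation logic may differ.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free illustration; this is not DataVec code.
final class ColumnCheckSketch {

    // Allowed values, mirroring the "MerchantCountryCode" column above.
    static final List<String> COUNTRY_CODES = Arrays.asList("USA", "CAN", "FR", "MX");

    // A categorical value is valid only if it is one of the declared categories.
    static boolean isValidCategory(String value, List<String> categories) {
        return value != null && categories.contains(value);
    }

    // Mirrors the intent of addColumnDouble(name, 0.0, null, false, false):
    // minimum 0.0, no maximum, NaN and infinite values rejected.
    static boolean isValidAmount(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return false;
        }
        return value >= 0.0;
    }

    public static void main(String[] args) {
        System.out.println(isValidCategory("CAN", COUNTRY_CODES)); // true
        System.out.println(isValidCategory("UK", COUNTRY_CODES));  // false
        System.out.println(isValidAmount(19.99));                  // true
        System.out.println(isValidAmount(Double.NaN));             // false
    }
}
```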

## Joining schemas

To merge two datasets, DataVec provides a `Join` class that supports different join strategies, such as `Inner` or `RightOuter`.

```java
import java.util.Arrays;
import org.datavec.api.transform.join.Join;
import org.datavec.api.transform.schema.Schema;
import org.joda.time.DateTimeZone;

// Left side of the join: one row per customer
Schema customerInfoSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnString("customerName")
    .addColumnCategorical("customerCountry", Arrays.asList("USA","France","Japan","UK"))
    .build();

// Right side of the join: one row per purchase
Schema customerPurchasesSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
    .addColumnLong("productID")
    .addColumnInteger("purchaseQty")
    .addColumnDouble("unitPriceUSD")
    .build();

// Inner join on the column that both schemas share
Join join = new Join.Builder(Join.JoinType.Inner)
    .setJoinColumns("customerID")
    .setSchemas(customerInfoSchema, customerPurchasesSchema)
    .build();
```

Defining a join does not run it. Once you have defined the join and loaded the data into DataVec, you must use an executor (for example `LocalTransformExecutor` or `SparkTransformExecutor`) to complete the join.
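
As a rough, library-free illustration of what the `Inner` and `RightOuter` strategies mean for rows keyed on `customerID`, the plain-Java sketch below joins two small in-memory tables. Everything in it (the `JoinSketch` class, the `join` method, the made-up sample rows, and the output column order) is an assumption for illustration only. It is not DataVec's implementation, and DataVec's executors operate on `Writable` records rather than plain objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, library-free illustration of join semantics; this is not DataVec code.
final class JoinSketch {

    // Joins rows whose first element (the key, e.g. customerID) is equal.
    // Output row layout: key, remaining left values, remaining right values.
    // keepUnmatchedRight = false behaves like an inner join;
    // keepUnmatchedRight = true behaves like a right outer join,
    // padding the missing left-side values with nulls.
    static List<List<Object>> join(List<List<Object>> left, List<List<Object>> right,
                                   int leftWidth, boolean keepUnmatchedRight) {
        List<List<Object>> out = new ArrayList<>();
        for (List<Object> r : right) {
            boolean matched = false;
            for (List<Object> l : left) {
                if (l.get(0).equals(r.get(0))) {
                    matched = true;
                    List<Object> row = new ArrayList<>(l);
                    row.addAll(r.subList(1, r.size()));
                    out.add(row);
                }
            }
            if (!matched && keepUnmatchedRight) {
                List<Object> row = new ArrayList<>();
                row.add(r.get(0));
                for (int i = 1; i < leftWidth; i++) {
                    row.add(null); // no matching left row for this key
                }
                row.addAll(r.subList(1, r.size()));
                out.add(row);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample rows: (customerID, customerName, customerCountry)
        List<List<Object>> customers = Arrays.asList(
                Arrays.<Object>asList(1L, "Alice", "USA"),
                Arrays.<Object>asList(2L, "Bob", "UK"));
        // Made-up sample rows: (customerID, productID, purchaseQty)
        List<List<Object>> purchases = Arrays.asList(
                Arrays.<Object>asList(1L, 500L, 2),
                Arrays.<Object>asList(3L, 501L, 1));

        // Inner: only customer 1 appears on both sides.
        System.out.println(join(customers, purchases, 3, false));
        // prints [[1, Alice, USA, 500, 2]]

        // Right outer: the purchase by unknown customer 3 is kept, with nulls on the left.
        System.out.println(join(customers, purchases, 3, true));
        // prints [[1, Alice, USA, 500, 2], [3, null, null, 501, 1]]
    }
}
```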

## Classes and utilities

DataVec comes with a few `Schema` classes and helper utilities for 2D and sequence types of data.

{{autogenerated}}