---
title: DataVec Schema
short_title: Schema
description: Schemas for datasets and transformation.
category: DataVec
weight: 1
---
## Why use schemas?
The unfortunate reality is that data is *dirty*. When you vectorize a dataset for deep learning, it is rare to find files that have zero errors. A schema is important for preserving the meaning of the data before you use it for something like training a neural network.
## Using schemas
Schemas are primarily used for programming transformations. Before you can execute a `TransformProcess`, you need to supply the schema of the data being transformed.
An example of a schema for merchant records may look like:
```java
import java.util.Arrays;

import org.datavec.api.transform.schema.Schema;

Schema inputDataSchema = new Schema.Builder()
    .addColumnsString("DateTimeString", "CustomerID", "MerchantID")
    .addColumnInteger("NumItemsInTransaction")
    .addColumnCategorical("MerchantCountryCode", Arrays.asList("USA", "CAN", "FR", "MX"))
    // $0.0 or more, no maximum limit, no NaN and no Infinite values
    .addColumnDouble("TransactionAmountUSD", 0.0, null, false, false)
    .addColumnCategorical("FraudLabel", Arrays.asList("Fraud", "Legit"))
    .build();
```
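To make the column constraints above concrete, here is a minimal conceptual sketch in plain Java. It is not DataVec's implementation and does not use the DataVec API; the class and method names (`ColumnConstraintSketch`, `isValidAmount`, `isValidCategory`) are hypothetical and exist only for illustration. It shows the kind of check that the `TransactionAmountUSD` and `MerchantCountryCode` definitions describe.

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch only: plain Java, not DataVec's implementation.
class ColumnConstraintSketch {

    // Mirrors the intent of addColumnDouble(name, 0.0, null, false, false):
    // minimum 0.0, no maximum, NaN and infinite values rejected.
    static boolean isValidAmount(double value) {
        if (Double.isNaN(value) || Double.isInfinite(value)) {
            return false;
        }
        return value >= 0.0;
    }

    // Mirrors the intent of a categorical column: only the listed states are valid.
    static boolean isValidCategory(String value, List<String> allowedStates) {
        return allowedStates.contains(value);
    }

    public static void main(String[] args) {
        List<String> countries = Arrays.asList("USA", "CAN", "FR", "MX");
        System.out.println(isValidAmount(19.99));              // true
        System.out.println(isValidAmount(-5.0));               // false
        System.out.println(isValidAmount(Double.NaN));         // false
        System.out.println(isValidCategory("CAN", countries)); // true
        System.out.println(isValidCategory("UK", countries));  // false
    }
}
```

Declaring these rules once in the schema means that values breaking them can be identified against a single definition instead of ad hoc checks scattered through the pipeline.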
## Joining schemas
If you have two datasets that you want to merge, DataVec provides a `Join` class with several join strategies, such as `Inner` or `RightOuter`.
```java
import java.util.Arrays;

import org.datavec.api.transform.join.Join;
import org.datavec.api.transform.schema.Schema;
import org.joda.time.DateTimeZone;

// Left-hand dataset: one row per customer
Schema customerInfoSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnString("customerName")
    .addColumnCategorical("customerCountry", Arrays.asList("USA", "France", "Japan", "UK"))
    .build();

// Right-hand dataset: one row per purchase
Schema customerPurchasesSchema = new Schema.Builder()
    .addColumnLong("customerID")
    .addColumnTime("purchaseTimestamp", DateTimeZone.UTC)
    .addColumnLong("productID")
    .addColumnInteger("purchaseQty")
    .addColumnDouble("unitPriceUSD")
    .build();

// Inner join on the column both schemas share
Join join = new Join.Builder(Join.JoinType.Inner)
    .setJoinColumns("customerID")
    .setSchemas(customerInfoSchema, customerPurchasesSchema)
    .build();
```
Once you've defined your join and loaded the data into DataVec, you must use an executor (for example, `LocalTransformExecutor` or `SparkTransformExecutor`) to run the join.
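The difference between the `Inner` and `RightOuter` strategies can be illustrated with a small conceptual sketch in plain Java. This is not DataVec's executor and does not use the DataVec API: the class `JoinSemanticsSketch`, its `join` method, and the sample rows are all made up for illustration. The sketch assumes the key is the first column of every row and that an output row is the left row followed by the right row without its key column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: plain Java, not DataVec's implementation.
class JoinSemanticsSketch {

    // Joins two row lists on the value in column 0 of each row (the key).
    // If keepUnmatchedRight is true, right rows with no left match are kept and
    // the missing left-side values are filled with null (right outer join).
    // Otherwise only rows whose key appears on both sides are kept (inner join).
    static List<List<Object>> join(List<List<Object>> left,
                                   List<List<Object>> right,
                                   int leftWidth,
                                   boolean keepUnmatchedRight) {
        // Index left rows by key so each right row can look up its matches.
        Map<Object, List<List<Object>>> leftByKey = new HashMap<>();
        for (List<Object> row : left) {
            leftByKey.computeIfAbsent(row.get(0), k -> new ArrayList<>()).add(row);
        }

        List<List<Object>> out = new ArrayList<>();
        for (List<Object> r : right) {
            List<List<Object>> matches = leftByKey.get(r.get(0));
            if (matches != null) {
                for (List<Object> l : matches) {
                    List<Object> joined = new ArrayList<>(l);
                    joined.addAll(r.subList(1, r.size()));
                    out.add(joined);
                }
            } else if (keepUnmatchedRight) {
                List<Object> joined = new ArrayList<>();
                joined.add(r.get(0));                 // key comes from the right row
                for (int i = 1; i < leftWidth; i++) {
                    joined.add(null);                 // no left-side values available
                }
                joined.addAll(r.subList(1, r.size()));
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up rows: (customerID, customerName)
        List<List<Object>> customers = Arrays.asList(
                Arrays.<Object>asList(1L, "Alice"),
                Arrays.<Object>asList(2L, "Bob"));
        // Made-up rows: (customerID, productID)
        List<List<Object>> purchases = Arrays.asList(
                Arrays.<Object>asList(1L, 100L),
                Arrays.<Object>asList(3L, 200L));

        // Inner: only customerID 1 appears on both sides.
        System.out.println(join(customers, purchases, 2, false)); // [[1, Alice, 100]]
        // Right outer: the purchase by unknown customerID 3 is kept.
        System.out.println(join(customers, purchases, 2, true));  // [[1, Alice, 100], [3, null, 200]]
    }
}
```

Choosing the join type is therefore a decision about unmatched rows: an inner join silently drops them, while an outer join keeps them with missing values that later steps need to handle.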
## Classes and utilities
DataVec comes with a few `Schema` classes and helper utilities for 2D and sequence types of data.
{{autogenerated}}