---
title: DataVec Analysis
short_title: Analysis
description: Gather statistics on datasets.
category: DataVec
weight: 2
---
## Analysis of data
Sometimes datasets are too large, or too abstract in their format, to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities that perform a data analysis for you, computing maximums, minimums, means, and other useful metrics.
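To make the idea concrete, the snippet below is a plain-Java sketch (independent of DataVec's API) of the kind of per-column summary statistics such an analysis produces, using only the JDK's `DoubleSummaryStatistics`:

```java
import java.util.Arrays;
import java.util.DoubleSummaryStatistics;

// Hypothetical illustration, not DataVec code: summarizing one numeric
// column the way an automated analysis would (min, max, mean, count).
public class ColumnStats {
    static DoubleSummaryStatistics summarize(double[] column) {
        return Arrays.stream(column).summaryStatistics();
    }

    public static void main(String[] args) {
        double[] column = {3.0, 1.5, 4.0, 1.5, 5.5};
        DoubleSummaryStatistics stats = summarize(column);
        System.out.printf("min=%.1f max=%.1f mean=%.1f n=%d%n",
                stats.getMin(), stats.getMax(), stats.getAverage(), stats.getCount());
    }
}
```

DataVec's utilities compute this kind of information for every column at once, using the schema to decide which statistics make sense for each column type.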
## Using Spark for analysis
If you have loaded your data into Apache Spark, DataVec has a special `AnalyzeSpark` class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the `JavaRDD` and `Schema` to the class.
If you are using DataVec in Scala and your data was loaded into a regular `RDD` class, you can convert it by calling `.toJavaRDD()` which returns a `JavaRDD`. If you need to convert it back, call `rdd()`.
The code below demonstrates some of the available analyses for a 2D dataset in Spark, using the RDD `javaRdd` and the schema `mySchema`:
```java
import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;
import java.util.List;

// Full analysis: statistics and histograms for every column
int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

// Data quality report (missing or invalid values per column)
DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

// Maximum value of a single column
Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

// Random sample of values from a single column
int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);
```
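The `maxHistogramBuckets` argument caps how many equal-width bins the analysis uses when building a column histogram. As a rough illustration of what that binning involves, here is a plain-Java sketch (not the DataVec implementation):

```java
// Hypothetical sketch of equal-width histogram binning: each value is
// assigned to one of numBuckets bins spanning [min, max].
public class HistogramSketch {
    static int[] histogram(double[] values, int numBuckets) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        int[] counts = new int[numBuckets];
        double width = (max - min) / numBuckets;
        for (double v : values) {
            int bucket = width == 0 ? 0 : (int) ((v - min) / width);
            if (bucket == numBuckets) bucket--;  // the maximum lands in the last bucket
            counts[bucket]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        double[] values = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(java.util.Arrays.toString(histogram(values, 5)));
        // prints [2, 2, 2, 2, 2]
    }
}
```

A larger bucket count gives a finer-grained view of a column's distribution at the cost of a larger analysis result.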
Note that if you have sequence data, there are special methods for that as well:
```java
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(seqSchema, sequenceRdd);
List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);
```
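Conceptually, a "unique values" analysis boils down to deduplicating a column's values. As a plain-Java sketch (again, not the DataVec implementation), a `LinkedHashSet` captures the idea of collecting distinct values while preserving first-seen order:

```java
import java.util.*;

// Hypothetical sketch: collecting the distinct values of a column in
// first-seen order, conceptually what a unique-values query returns.
public class UniqueValues {
    static List<String> unique(List<String> columnValues) {
        return new ArrayList<>(new LinkedHashSet<>(columnValues));
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("a", "b", "a", "c", "b");
        System.out.println(unique(column));  // prints [a, b, c]
    }
}
```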
## Analyzing locally
The `AnalyzeLocal` class works much like its Spark counterpart and exposes a similar API. Instead of an RDD, it accepts a `RecordReader`, which it uses to iterate over the dataset.
```java
import org.datavec.local.transforms.AnalyzeLocal;
int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
```
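Under the hood, a local analysis amounts to a single pass over the records, accumulating statistics as each row is read. The snippet below is a plain-Java sketch of that pattern (the record iterator and `columnMax` helper are hypothetical, not DataVec's API):

```java
import java.util.*;

// Hypothetical sketch: consuming rows one at a time from an iterator,
// the way a local analysis consumes a record reader, and accumulating
// a statistic (here, the maximum of one column) in a single pass.
public class LocalPass {
    static double columnMax(Iterator<List<Double>> records, int columnIndex) {
        double max = Double.NEGATIVE_INFINITY;
        while (records.hasNext()) {
            max = Math.max(max, records.next().get(columnIndex));
        }
        return max;
    }

    public static void main(String[] args) {
        List<List<Double>> rows = Arrays.asList(
                Arrays.asList(1.0, 9.0),
                Arrays.asList(4.0, 2.0),
                Arrays.asList(3.0, 7.0));
        System.out.println(columnMax(rows.iterator(), 1));  // prints 9.0
    }
}
```

Because the data is streamed rather than held in memory, this approach works for datasets larger than RAM, which is the same reason `AnalyzeLocal` takes a `RecordReader` rather than an in-memory collection.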
## Utilities
{{autogenerated}}