---
title: DataVec Analysis
short_title: Analysis
description: Gather statistics on datasets.
category: DataVec
weight: 2
---

## Analysis of data

Sometimes datasets are too large or too abstract in their format to manually analyze and estimate statistics on certain columns or patterns. DataVec comes with helper utilities for performing data analysis and computing metrics such as maximums, means, minimums, and other useful statistics.
## Using Spark for analysis
If you have loaded your data into Apache Spark, DataVec has a special `AnalyzeSpark` class which can generate histograms, collect statistics, and return information about the quality of the data. Assuming you have already loaded your data into a Spark RDD, pass the `JavaRDD` and `Schema` to the class.

If you are using DataVec in Scala and your data was loaded into a regular `RDD`, you can convert it by calling `.toJavaRDD()`, which returns a `JavaRDD`. If you need to convert it back, call `rdd()`.
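
If it helps to see both directions side by side, here is a minimal Java sketch of the two calls; the `RddConversion` wrapper class and its method names are purely illustrative:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.rdd.RDD;
import org.datavec.api.writable.Writable;

import java.util.List;

public class RddConversion {

    // Wrap a Scala RDD so it can be passed to AnalyzeSpark, which expects a JavaRDD
    public static JavaRDD<List<Writable>> toJava(RDD<List<Writable>> scalaRdd) {
        return scalaRdd.toJavaRDD();
    }

    // Unwrap a JavaRDD back to the underlying Scala RDD if needed
    public static RDD<List<Writable>> toScala(JavaRDD<List<Writable>> javaRdd) {
        return javaRdd.rdd();
    }
}
```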
The code below demonstrates some of the many analyses available for a 2D dataset in Spark, using the RDD `javaRdd` and the schema `mySchema`:
```java
import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;

import java.util.List;

int maxHistogramBuckets = 10;

// Summary statistics (min, max, mean, standard deviation, histograms, ...) for each column
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

// Data quality report: missing or invalid values per column
DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

// Maximum value of a single column
Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

// Random sample of values from a single column
int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);
```
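
To make the snippet above concrete, the following is a minimal, self-contained sketch that runs the same analysis on a tiny in-memory dataset with a local Spark context and prints the result. The class name, column names, and values are made up for illustration, and it assumes Spark and DataVec's Spark module are on the classpath:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.DoubleWritable;
import org.datavec.api.writable.IntWritable;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.AnalyzeSpark;

import java.util.Arrays;
import java.util.List;

public class SparkAnalysisExample {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("DataVec analysis example");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A tiny in-memory dataset: each record is a List<Writable> matching the schema below
        List<List<Writable>> rows = Arrays.asList(
            Arrays.<Writable>asList(new IntWritable(1), new DoubleWritable(10.5)),
            Arrays.<Writable>asList(new IntWritable(2), new DoubleWritable(12.0)),
            Arrays.<Writable>asList(new IntWritable(3), new DoubleWritable(9.75))
        );
        JavaRDD<List<Writable>> javaRdd = sc.parallelize(rows);

        Schema mySchema = new Schema.Builder()
            .addColumnInteger("id")
            .addColumnDouble("myColumn")
            .build();

        int maxHistogramBuckets = 10;
        DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

        // Print a per-column summary of the collected statistics
        System.out.println(analysis);

        sc.stop();
    }
}
```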
Note that if you have sequence data, there are special methods for that as well:
```java
// Statistics for sequence (time series) data
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(seqSchema, sequenceRdd);

// Unique values of a single column in the sequence data
List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);
```
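
For reference, sequence data in DataVec's Spark transforms is typically represented as a `JavaRDD<List<List<Writable>>>`: each element is one sequence, the outer list holds the time steps, and the inner list holds the column values at one step. The sketch below only illustrates the expected shapes; the wrapper class is made up, and it assumes `SequenceDataAnalysis` is covered by the `org.datavec.api.transform.analysis.*` import used earlier:

```java
import org.apache.spark.api.java.JavaRDD;
import org.datavec.api.transform.analysis.SequenceDataAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.api.writable.Writable;
import org.datavec.spark.transform.AnalyzeSpark;

import java.util.List;

public class SequenceAnalysisSketch {

    // Each sequence is a List<List<Writable>>: outer list = time steps,
    // inner list = the values of each column at that step
    public static SequenceDataAnalysis analyze(Schema seqSchema,
                                               JavaRDD<List<List<Writable>>> sequenceRdd) {
        return AnalyzeSpark.analyzeSequence(seqSchema, sequenceRdd);
    }
}
```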
## Analyzing locally
The `AnalyzeLocal` class works much like its Spark counterpart and exposes a similar API. Instead of an RDD, it accepts a `RecordReader`, which allows it to iterate over the dataset.
```java
import org.datavec.local.transforms.AnalyzeLocal;

int maxHistogramBuckets = 10;

// Iterates over the record reader and gathers per-column statistics, as in the Spark version
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
```
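
For completeness, here is a minimal sketch of how the `csvRecordReader` above might be created and used end to end. The file path, schema, and class name are placeholders chosen for illustration:

```java
import org.datavec.api.records.reader.RecordReader;
import org.datavec.api.records.reader.impl.csv.CSVRecordReader;
import org.datavec.api.split.FileSplit;
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.api.transform.schema.Schema;
import org.datavec.local.transforms.AnalyzeLocal;

import java.io.File;

public class LocalAnalysisExample {
    public static void main(String[] args) throws Exception {
        // Read a CSV file from disk; CSVRecordReader has overloads for
        // skipping header lines or changing the delimiter if needed
        RecordReader csvRecordReader = new CSVRecordReader();
        csvRecordReader.initialize(new FileSplit(new File("path/to/data.csv")));

        // A schema describing the columns of the CSV file (illustrative)
        Schema mySchema = new Schema.Builder()
            .addColumnDouble("myColumn")
            .build();

        int maxHistogramBuckets = 10;
        DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
        System.out.println(analysis);
    }
}
```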
## Utilities
{{autogenerated}}