---
title: DataVec Analysis
short_title: Analysis
description: Gather statistics on datasets.
category: DataVec
weight: 2
---

## Analysis of data

Some datasets are too large, or too abstract in their format, to inspect manually and estimate statistics for particular columns or patterns. DataVec ships with helper utilities that analyze a dataset and report maximums, means, minimums, and other useful metrics.

## Using Spark for analysis

If you have loaded your data into Apache Spark, DataVec provides the `AnalyzeSpark` class, which can generate histograms, collect statistics, and return information about the quality of the data. Assuming your data is already in a Spark RDD, pass the `JavaRDD` and the `Schema` to the class.

If you are using DataVec in Scala and your data was loaded into a regular `RDD`, you can convert it by calling `.toJavaRDD()`, which returns a `JavaRDD`. To convert it back, call `.rdd()`.

The code below demonstrates a few of the available analyses on a 2D dataset in Spark, using the RDD `javaRdd` and the schema `mySchema`:

```java
import java.util.List;

import org.datavec.spark.transform.AnalyzeSpark;
import org.datavec.api.writable.Writable;
import org.datavec.api.transform.analysis.*;
import org.datavec.api.transform.quality.DataQualityAnalysis;

// Per-column statistics, with histograms limited to maxHistogramBuckets buckets
int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeSpark.analyze(mySchema, javaRdd, maxHistogramBuckets);

// Data quality summary for each column
DataQualityAnalysis qualityAnalysis = AnalyzeSpark.analyzeQuality(mySchema, javaRdd);

// Maximum value of a single column
Writable max = AnalyzeSpark.max(javaRdd, "myColumn", mySchema);

// A sample of values drawn from a single column
int numSamples = 5;
List<Writable> sample = AnalyzeSpark.sampleFromColumn(numSamples, "myColumn", mySchema, javaRdd);
```

If you have sequence data, there are dedicated methods for that as well:

```java
// sequenceRdd holds sequences, i.e. a JavaRDD<List<List<Writable>>>
SequenceDataAnalysis seqAnalysis = AnalyzeSpark.analyzeSequence(seqSchema, sequenceRdd);

List<Writable> uniqueSequence = AnalyzeSpark.getUniqueSequence("myColumn", seqSchema, sequenceRdd);
```

## Analyzing locally

The `AnalyzeLocal` class works much like its Spark counterpart and has a similar API.
Instead of passing an RDD, you pass a `RecordReader`, which lets it iterate over the dataset.

```java
import org.datavec.api.transform.analysis.DataAnalysis;
import org.datavec.local.transforms.AnalyzeLocal;

// csvRecordReader must already be initialized with the data to analyze
int maxHistogramBuckets = 10;
DataAnalysis analysis = AnalyzeLocal.analyze(mySchema, csvRecordReader, maxHistogramBuckets);
```

## Utilities

{{autogenerated}}
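
To make the metrics named at the top of this page concrete, here is a minimal plain-Java sketch of the kind of per-column summary (minimum, maximum, mean, and a bucketed histogram) that an analysis reports. It does not use DataVec and is not DataVec's implementation: the `ColumnStats` class, its `summarize` method, and the equal-width bucketing scheme are assumptions made purely for illustration.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of DataVec: summarizes one numeric column. */
final class ColumnStats {
    final double min;
    final double max;
    final double mean;
    final long[] histogram; // counts per equal-width bucket between min and max

    private ColumnStats(double min, double max, double mean, long[] histogram) {
        this.min = min;
        this.max = max;
        this.mean = mean;
        this.histogram = histogram;
    }

    static ColumnStats summarize(double[] values, int maxHistogramBuckets) {
        if (values.length == 0 || maxHistogramBuckets < 1) {
            throw new IllegalArgumentException("need at least one value and one bucket");
        }
        // First pass: minimum, maximum, and the running sum used for the mean.
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum = 0.0;
        for (double v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        // Second pass: assign each value to an equal-width bucket.
        long[] histogram = new long[maxHistogramBuckets];
        double width = (max - min) / maxHistogramBuckets;
        for (double v : values) {
            // A constant column has zero width, so every value lands in bucket 0.
            int bucket = width == 0.0 ? 0 : (int) ((v - min) / width);
            // The maximum would index one past the end; clamp it into the last bucket.
            histogram[Math.min(bucket, maxHistogramBuckets - 1)]++;
        }
        return new ColumnStats(min, max, sum / values.length, histogram);
    }

    public static void main(String[] args) {
        ColumnStats stats = summarize(new double[] {1.0, 2.0, 3.0, 4.0, 10.0}, 3);
        System.out.println("min=" + stats.min + " max=" + stats.max + " mean=" + stats.mean);
        System.out.println("histogram=" + Arrays.toString(stats.histogram));
    }
}
```

In this sketch the bucket count plays the same role as the `maxHistogramBuckets` argument in the examples above: it caps how finely the column's range is divided. For the sample column in `main`, the range 1 to 10 is split into three buckets of width 3, giving counts of 3, 1, and 1.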