History

* DataVec fixes for Jackson version upgrade

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DL4J jackson updates + databind version 2.9.9.3

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Shade snakeyaml along with jackson

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Version fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Switch DataVec legacy JSON format handling to mixins

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Next set of fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Cleanup for legacy JSON mapping

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Upgrade commons compress to 1.18; small test fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* New Jackson backward compatibility for DL4J - Round 1

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* New Jackson backward compatibility for DL4J - Round 2

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* More fixes, all but legacy custom passing

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Provide an upgrade path for custom layers for models in pre-1.0.0-beta JSON format

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Legacy deserialization cleanup

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Small amount of polish - legacy JSON

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Upgrade guava version

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* IEvaluation legacy format deserialization fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Upgrade play version to 2.7.3

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Update nd4j-parameter-server-status to new Play API

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Update DL4J UI for new play version

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* More play framework updates

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Small fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Remove Spark 1/2 adapter code from DataVec

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* datavec-spark dependency cleanup

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DL4J spark updates, pt 1

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DL4J spark updates, pt 2

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DL4J spark updates, pt 3

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DL4J spark updates, pt 4

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Test fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Another fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Breeze upgrade, dependency cleanup

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Add Scala 2.12 version to pom.xml

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* change-scala-versions.sh - add scala 2.12, remove 2.10

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Move Spark version properties to parent pom (now that only one spark version is supported)

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DataVec Play fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* datavec play dependency fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Clean up old spark/jackson stuff

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Cleanup jackson unused dependencies

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Dropping redundant dependency

Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com>

* Removed scalaxy dependency

Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com>

* Add shaded guava

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Ensure not possible to import pre-shaded classes, and remove direct guava dependencies in favor of shaded

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* ND4J Shaded guava import fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* DataVec and DL4J guava shading

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Arbiter, RL4J fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Build fixed

Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com>

* Fix dependency

Signed-off-by: Alexander Stoyakin <alexander.stoyakin@gmail.com>

* Fix bad merge

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Jackson shading fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Set play secret, datavec-spark-inference-server

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Fix for datavec-spark-inference-server

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Arbiter fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Arbiter fixes

Signed-off-by: AlexDBlack <blacka101@gmail.com>

* Small test fix

Signed-off-by: AlexDBlack <blacka101@gmail.com>

2019-08-30 14:35:27 +10:00

.github

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

contrib

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

datavec-api

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-arrow

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-camel

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

datavec-data

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-excel

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

datavec-geo

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

datavec-hadoop

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-jdbc

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

datavec-local

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-perf

Fix test profiles (#197)

2019-08-30 11:33:31 +10:00

datavec-python

Fix test profiles (#197)

2019-08-30 11:33:31 +10:00

datavec-spark

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

datavec-spark-inference-parent

Version upgrades (#199)

2019-08-30 14:35:27 +10:00

.travis.yml

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

buildmultiplescalaversions.sh

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

LICENSE

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

pom.xml

Fix (#198)

2019-08-30 11:40:39 +10:00

README.md

Eclipse Migration Initial Commit

2019-06-06 15:21:15 +03:00

README.md

DataVec

DataVec is an Apache 2.0-licensed library for machine-learning ETL (Extract, Transform, Load) operations. DataVec's purpose is to transform raw data into usable vector formats that can be fed to machine learning algorithms. By contributing code to this repository, you agree to make your contribution available under an Apache 2.0 license.

Why Would I Use DataVec?

Data handling is sometimes messy, and we believe it should be kept separate from high-performance algebra libraries (such as ND4J or Deeplearning4j).

DataVec allows a practitioner to take raw data and quickly produce vectorized data in an open standard format (SVMLight, etc.). The following input data types are supported out of the box:

CSV Data
Raw Text Data (Tweets, Text Documents, etc)
Image Data
LibSVM
SVMLight
MatLab (MAT) format
JSON, XML, YAML

DataVec draws inspiration from many of the Hadoop ecosystem tools. In particular, it accesses data on disk through the Hadoop API (as Spark does), which makes it compatible with many record formats.

DataVec also includes sophisticated functionality for feature engineering, data cleaning and data normalization both for static data and for sequences (time series). Such operations can be executed on Apache Spark using DataVec-Spark.

DataVec's architecture: API, transforms and filters, and schema management

Apart from providing readers for classic data formats, DataVec also provides an interface for reading custom data. If you want to ingest a custom format, you don't have to build the whole pipeline; you only have to write its very first step. For example, if you describe through the API how your data fits into a common format that complies with the interface, DataVec returns a list of Writables for each record. You'll find more detail on the API in the corresponding module.
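As a rough illustration of "writing only the first step", here is a minimal, self-contained sketch of a custom reader that turns raw lines into one list of values per record. SimpleRecordReader, KeyValueLineReader and CustomReaderDemo are hypothetical names invented for this sketch, and plain String values stand in for Writables; this is not DataVec's actual RecordReader API, which is documented in the datavec-api module.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a reader interface; NOT DataVec's actual RecordReader API.
// Each call to next() yields one record as a list of values (Strings stand in for Writables).
interface SimpleRecordReader extends Iterator<List<String>> {}

// Reads lines of the form "key=value; key=value" and emits the values of each line as one record.
class KeyValueLineReader implements SimpleRecordReader {
    private final Iterator<String> lines;

    KeyValueLineReader(List<String> lines) {
        this.lines = lines.iterator();
    }

    @Override
    public boolean hasNext() {
        return lines.hasNext();
    }

    @Override
    public List<String> next() {
        List<String> record = new ArrayList<>();
        for (String pair : lines.next().split(";")) {
            // Keep only the value part; a pair without '=' is kept whole.
            int eq = pair.indexOf('=');
            record.add(eq >= 0 ? pair.substring(eq + 1).trim() : pair.trim());
        }
        return record;
    }
}

class CustomReaderDemo {
    // Everything downstream only sees lists of values, regardless of the raw format.
    static List<List<String>> readAll(SimpleRecordReader reader) {
        List<List<String>> out = new ArrayList<>();
        while (reader.hasNext()) {
            out.add(reader.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("id=1; name=ada", "id=2; name=alan");
        System.out.println(readAll(new KeyValueLineReader(raw)));
    }
}
```

The point of the design is that only KeyValueLineReader knows about the raw format; the rest of a pipeline can be written once against the record-as-list shape.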

Another thing you can do with DataVec is data cleaning. Real-world data rarely arrives clean and ready to go: it may come in different forms or from different sources, and you might need sampling, filtering, or any number of messy ETL tasks to prepare it. DataVec offers filters and transformations that help with curating, preparing and massaging your data, and it leverages Apache Spark to do this at scale.

Finally, DataVec tracks a schema for your columnar data, across all transformations. This schema is actively checked through probing, and DataVec will raise exceptions if your data does not match the schema. You can specify filters as well: you can attach a regular expression to an input column of type String, for example, and DataVec will only keep data that matches this filter.
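The idea of a schema that is checked against the data, combined with a regular-expression filter on a String column, can be sketched as follows. ColumnType, SimpleSchema, validate and keepMatching are hypothetical names made up for this example; they are a simplified stand-in and do not reflect DataVec's own schema or filter classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical column types; a simplified stand-in, not DataVec's schema API.
enum ColumnType { STRING, INTEGER }

class SimpleSchema {
    private final List<String> names;
    private final List<ColumnType> types;

    SimpleSchema(List<String> names, List<ColumnType> types) {
        if (names.size() != types.size()) {
            throw new IllegalArgumentException("names and types must have the same length");
        }
        this.names = names;
        this.types = types;
    }

    // Throws if the record has the wrong number of columns or a value
    // that does not parse as its declared type.
    void validate(List<String> record) {
        if (record.size() != names.size()) {
            throw new IllegalArgumentException(
                    "Expected " + names.size() + " columns, got " + record.size());
        }
        for (int i = 0; i < record.size(); i++) {
            if (types.get(i) == ColumnType.INTEGER) {
                try {
                    Integer.parseInt(record.get(i));
                } catch (NumberFormatException e) {
                    throw new IllegalArgumentException(
                            "Column '" + names.get(i) + "' is not an integer: " + record.get(i));
                }
            }
        }
    }

    // Validates every record, then keeps those whose value in the named
    // String column fully matches the regular expression.
    List<List<String>> keepMatching(List<List<String>> records, String column, String regex) {
        int idx = names.indexOf(column);
        if (idx < 0 || types.get(idx) != ColumnType.STRING) {
            throw new IllegalArgumentException("No String column named '" + column + "'");
        }
        Pattern pattern = Pattern.compile(regex);
        List<List<String>> kept = new ArrayList<>();
        for (List<String> record : records) {
            validate(record);
            if (pattern.matcher(record.get(idx)).matches()) {
                kept.add(record);
            }
        }
        return kept;
    }
}

class SchemaDemo {
    public static void main(String[] args) {
        SimpleSchema schema = new SimpleSchema(
                Arrays.asList("id", "email"),
                Arrays.asList(ColumnType.INTEGER, ColumnType.STRING));
        List<List<String>> records = Arrays.asList(
                Arrays.asList("1", "a@x.org"),
                Arrays.asList("2", "nope"));
        // Only the first record has an email value matching the pattern.
        System.out.println(schema.keepMatching(records, "email", ".+@.+"));
    }
}
```

In this sketch a schema violation raises an exception, while a value that merely fails the filter is dropped silently, mirroring the distinction the paragraph above draws between schema checks and filters.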

On Distribution

Distributed processing through Apache Spark is optional, and Spark can also be run in local mode (where your cluster is emulated with multi-threading) when necessary. DataVec aims to abstract away from the actual execution and to create, at compile time, a logical set of operations to execute. While we have some code that uses Spark, we do not want to be locked into a single tool; using Apache Flink or Beam is a possibility, and these are projects on which we would welcome collaboration.
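To make the separation between a logical set of operations and its execution more concrete, here is a small sketch in which a plan is built first and then handed to interchangeable executors. LogicalPlan, PlanExecutor, SequentialExecutor and ParallelExecutor are hypothetical names for this illustration only; Java parallel streams stand in for a distributed backend such as Spark, and the plan is assembled at runtime rather than at compile time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Hypothetical: a logical plan is an ordered list of per-record operations.
class LogicalPlan {
    private final List<UnaryOperator<List<String>>> steps = new ArrayList<>();

    LogicalPlan then(UnaryOperator<List<String>> step) {
        steps.add(step);
        return this;
    }

    // Applies every step, in order, to a single record.
    List<String> applyTo(List<String> record) {
        List<String> current = record;
        for (UnaryOperator<List<String>> step : steps) {
            current = step.apply(current);
        }
        return current;
    }
}

// Hypothetical executor abstraction: the plan does not know how it will be run.
interface PlanExecutor {
    List<List<String>> execute(LogicalPlan plan, List<List<String>> records);
}

class SequentialExecutor implements PlanExecutor {
    @Override
    public List<List<String>> execute(LogicalPlan plan, List<List<String>> records) {
        return records.stream().map(plan::applyTo).collect(Collectors.toList());
    }
}

// Multi-threaded stand-in for a distributed backend; result order follows input order.
class ParallelExecutor implements PlanExecutor {
    @Override
    public List<List<String>> execute(LogicalPlan plan, List<List<String>> records) {
        return records.parallelStream().map(plan::applyTo).collect(Collectors.toList());
    }
}

class PlanDemo {
    // Upper-cases the first column, then drops the last column.
    static LogicalPlan buildPlan() {
        return new LogicalPlan()
                .then(r -> {
                    List<String> copy = new ArrayList<>(r);
                    copy.set(0, copy.get(0).toUpperCase(Locale.ROOT));
                    return copy;
                })
                .then(r -> new ArrayList<>(r.subList(0, r.size() - 1)));
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("a", "1", "x"),
                Arrays.asList("b", "2", "y"));
        LogicalPlan plan = buildPlan();
        // The same plan runs unchanged on either executor.
        System.out.println(new SequentialExecutor().execute(plan, records));
        System.out.println(new ParallelExecutor().execute(plan, records));
    }
}
```

Because the plan only describes what to do to each record, swapping the executor changes how the work is scheduled without touching the operations themselves.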

Examples

Examples for using DataVec are available here: https://github.com/deeplearning4j/dl4j-examples

Contribute

Where to contribute?

We have a lot in the pipeline, and we'd love to receive contributions. We want to support representing data not just as a collection of simple types ("writables") but as binary data; that will help with GC pressure across our pipelines and fit better with media-based use cases, where columnar data is not essential. We also expect it will streamline a lot of the specialized operations we now do on primitive types.

That being said, an area that could use a first contribution is the implementations of the RecordReader interface, since this is relatively self-contained. Of note, to support most of the distributed file formats of the Hadoop ecosystem, we use Apache Camel. Camel supports a pluggable DataFormat to allow messages to be marshalled to and from binary or text formats to support a kind of Message Translator.

Another area that is relatively self-contained is transformations, where you might find a filter or data munging operation that has not been implemented yet, and provide it in a self-contained way.

Which maintainers to contact?

It's useful to know which maintainers to contact for information on a particular part of the code, whether for reviewing your pull requests or for answering questions on our Gitter channel. For this you can use the following indicative mapping:

RecordReader implementations: @saudet and @agibsonccc
Transformations and their API: @agibsonccc and @AlexDBlack
Spark and distributed processing: @AlexDBlack, @agibsonccc and @huitseeker
Native formats, geodata: @saudet

How to contribute

Check for open issues, or open a new issue to start a discussion around a feature idea or a bug.
If you feel uncomfortable or uncertain about an issue or your changes, feel free to contact us on Gitter using the link above.
Fork the repository on GitHub to start making your changes.
Write a test, which shows that the bug was fixed or that the feature works as expected.
Note that the repository follows the Google Java style with two modifications: 120-character column wrap and 4-space indentation. You can format your code to this style by running mvn formatter:format in the subproject you work on, by using the contrib/formatter.xml at the root of the repository to configure the Eclipse formatter, or by using the IntelliJ plugin.
Send a pull request, and bug us on Gitter until it gets merged and published.

Eclipse Setup

Download the latest JAR from https://projectlombok.org/download
Double-click the JAR file to install the plugin for Eclipse
Clone DataVec to your system
Import the project as a Maven project
You will also need to clone and build ND4J and libnd4j