---
title: Tokenization
short_title: Tokenization
description: Breaking text into individual words for language processing in DL4J.
category: Language Processing
weight: 10
---

## What is Tokenization?

Tokenization is the process of breaking text down into individual words, or tokens. Word windows, which group neighboring words, are also composed of tokens. [Word2Vec](./word2vec.html) can output text windows that serve as training examples for input into neural nets.
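To make the idea of a word window concrete, here is a minimal sketch in plain Java that does not use DL4J at all. The class `WindowSketch`, its method names, and the whitespace-based split are assumptions made for illustration; DL4J's own tokenizers and window classes may behave differently in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration only; not part of DL4J.
class WindowSketch {

    // Naive tokenization: trim the text, then split on runs of whitespace.
    static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Slide a fixed-size window over the tokens, moving one token at a time.
    static List<List<String>> windows(List<String> tokens, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("window size must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        for (int start = 0; start + size <= tokens.size(); start++) {
            result.add(new ArrayList<>(tokens.subList(start, start + size)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("the quick brown fox jumps");
        // Prints one window of three tokens per line.
        for (List<String> window : windows(tokens, 3)) {
            System.out.println(window);
        }
    }
}
```

With the five-token sentence above and a window size of 3, the sketch yields three overlapping windows, each of which could serve as one training example.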

## Example

Here's an example of tokenization done with DL4J tools:
                 
    // Tokenization with lemmatization, part-of-speech tagging, and sentence segmentation
    TokenizerFactory tokenizerFactory = new UimaTokenizerFactory();
    Tokenizer tokenizer = tokenizerFactory.create("mystring");

    // Iterate over the tokens
    while (tokenizer.hasMoreTokens()) {
        String token = tokenizer.nextToken();
    }

    // Get the whole list of tokens
    List<String> tokens = tokenizer.getTokens();

The snippet above creates a tokenizer capable of stemming, that is, of reducing inflected words to a common base form.

In Word2Vec, that's the recommended way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.
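The effect on vocabulary size can be sketched in plain Java. Everything here is an assumption for illustration: `VocabSketch` and `naiveStem` are hypothetical names, and the toy rule of stripping one trailing "s" stands in for a real stemmer or lemmatizer, which handles many more cases.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.UnaryOperator;

// Hypothetical helper for illustration only; not part of DL4J.
class VocabSketch {

    // Toy normalizer: lowercases and strips one trailing "s" from longer words.
    // This is NOT a real stemmer; it only illustrates the idea.
    static String naiveStem(String token) {
        String lower = token.toLowerCase();
        if (lower.length() > 3 && lower.endsWith("s") && !lower.endsWith("ss")) {
            return lower.substring(0, lower.length() - 1);
        }
        return lower;
    }

    // Count how often each token occurs after applying the given normalizer.
    static Map<String, Integer> count(String[] tokens, UnaryOperator<String> normalizer) {
        Map<String, Integer> vocab = new TreeMap<>();
        for (String token : tokens) {
            vocab.merge(normalizer.apply(token), 1, Integer::sum);
        }
        return vocab;
    }

    public static void main(String[] args) {
        String[] tokens = {"cat", "cats", "dog", "dogs", "dog"};
        // Without normalization: four vocabulary entries.
        System.out.println(count(tokens, UnaryOperator.identity()));
        // With the toy normalizer: singular and plural share one entry.
        System.out.println(count(tokens, VocabSketch::naiveStem));
    }
}
```

Without normalization the five tokens produce four vocabulary entries; with the toy normalizer they collapse to two, `cat` and `dog`, so the counts for each noun are no longer split across its singular and plural forms.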