cavis

1.2 KiB

Raw Blame History

title	short_title	description	category	weight
Tokenization	Tokenization	Breaking text into individual words for language processing in DL4J.	Language Processing	10

What is Tokenization?

Tokenization is the process of breaking text down into individual words. Word windows are also composed of tokens. Word2Vec can output text windows that comprise training examples for input into neural nets, as seen here.

Example

Here's an example of tokenization done with DL4J tools:

     //tokenization with lemmatization,part of speech taggin,sentence segmentation
     TokenizerFactory tokenizerFactory = new UimaTokenizerFactory();
     Tokenizer tokenizer = tokenizerFactory.tokenize("mystring");

      //iterate over the tokens
      while(tokenizer.hasMoreTokens()) {
      	   String token = tokenizer.nextToken();
      }
      
      //get the whole list of tokens
      List<String> tokens = tokenizer.getTokens();

The above snippet creates a tokenizer capable of stemming.

In Word2Vec, that's the recommended a way of creating a vocabulary, because it averts various vocabulary quirks, such as the singular and plural of the same noun being counted as two different words.

1.2 KiB Raw Blame History

What is Tokenization?

Example

1.2 KiB

Raw Blame History