TextContentTokenizer

class TextContentTokenizer(language: Language?, overrideContentLanguage: Boolean = false, contextSnippetLength: Int = 50, textTokenizerFactory: (Language?) -> TextTokenizer) : ContentTokenizer

A ContentTokenizer using a TextTokenizer to split the text of the Content.Element into smaller portions.

Parameters

contextSnippetLength

Length of before and after snippets in the produced Locators.
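As a rough illustration of what a snippet length means, the sketch below cuts the text surrounding a token to at most contextSnippetLength characters on each side. The ContextSnippets type and the contextSnippets function are hypothetical names introduced for this example; they are not part of the documented API, and the real Locator construction may differ.

```kotlin
// Hypothetical holder for the text around a token; not a toolkit type.
data class ContextSnippets(val before: String, val highlight: String, val after: String)

// Keeps at most `contextSnippetLength` characters on each side of `range`.
fun contextSnippets(text: String, range: IntRange, contextSnippetLength: Int = 50): ContextSnippets {
    val before = text.substring(0, range.first).takeLast(contextSnippetLength)
    val after = text.substring(range.last + 1).take(contextSnippetLength)
    return ContextSnippets(before, text.substring(range), after)
}
```

A smaller value keeps the produced Locators compact, while a larger one gives more surrounding text for locating the token again.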

overrideContentLanguage

If true, language overrides any language information available in the content. If false, language is used only as a default when the content carries no language information of its own.
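The rule described for overrideContentLanguage can be sketched as a small function. This is a minimal sketch, assuming languages are represented as plain String codes; resolveLanguage is a hypothetical helper for illustration, not a function of the class.

```kotlin
// Hypothetical helper: picks the language handed to the text tokenizer.
// `language` is the constructor argument, `contentLanguage` is whatever
// the content itself declares (null when it declares nothing).
fun resolveLanguage(
    language: String?,
    contentLanguage: String?,
    overrideContentLanguage: Boolean = false,
): String? =
    if (overrideContentLanguage) language else contentLanguage ?: language
```

With overrideContentLanguage left at false, content tagged "fr" stays "fr" even when language is "en"; untagged content falls back to "en".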

Constructors

constructor(language: Language?, overrideContentLanguage: Boolean = false, contextSnippetLength: Int = 50, textTokenizerFactory: (Language?) -> TextTokenizer)
constructor(language: Language?, unit: TextUnit, overrideContentLanguage: Boolean = false)

Creates a ContentTokenizer that uses the default TextTokenizer to split the text of the Content.Element by the given unit.

Functions

open override fun tokenize(data: Content.Element): List<Content.Element>
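To show how a factory-driven tokenizer of this shape might turn one element into several, here is a self-contained sketch. SketchTextTokenizer, SketchElement, SketchContentTokenizer and wordTokenizer are all hypothetical stand-ins that only mirror the signatures above; they are not the toolkit's types. The overrideContentLanguage and contextSnippetLength handling is deliberately left out.

```kotlin
// Hypothetical stand-in for a text tokenizer: returns the ranges of the tokens.
fun interface SketchTextTokenizer {
    fun tokenize(text: String): List<IntRange>
}

// Hypothetical stand-in for a textual content element.
data class SketchElement(val text: String, val language: String?)

class SketchContentTokenizer(
    private val language: String?,
    private val textTokenizerFactory: (String?) -> SketchTextTokenizer,
) {
    // Splits one element into one element per token found by the text tokenizer.
    fun tokenize(data: SketchElement): List<SketchElement> {
        // Assumption: the content's own language wins, `language` is the fallback.
        val resolved = data.language ?: language
        return textTokenizerFactory(resolved)
            .tokenize(data.text)
            .map { range -> SketchElement(data.text.substring(range), resolved) }
    }
}

// Naive whitespace-based word tokenizer, used only for this demonstration.
val wordTokenizer = SketchTextTokenizer { text ->
    Regex("\\S+").findAll(text).map { it.range }.toList()
}
```

Passing a factory rather than a single tokenizer lets the text tokenizer be chosen per language, for example `SketchContentTokenizer("en") { wordTokenizer }`.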