Package de.danielnaber.jwordsplitter
Class AbstractWordSplitter
java.lang.Object
de.danielnaber.jwordsplitter.AbstractWordSplitter
- Direct Known Subclasses:
GermanWordSplitter
This class can split compound words into their smallest parts (atoms). For example "Erhebungsfehler"
will be split into "erhebung" and "fehler", if "erhebung" and "fehler" are in the dictionary
and "erhebungsfehler" is not. Thus how words are split only depends on the contents of
the dictionary. A dictionary for German is included.
This is especially useful for German words, but it works with any language. The parts are returned in the same order in which they appear in the compound word. Providing a large dictionary is recommended.
Please note: the input is expected to be a single word consisting of letters only, without special characters such as ! " : ; , . - _
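The dictionary-driven behavior described above can be illustrated with a minimal sketch. It is not the library's implementation: the class CompoundSplitSketch, the method splitInTwo, the restriction to two parts, and the single interfix "s" are all assumptions made for illustration. It only shows why "Erhebungsfehler" yields "erhebung" and "fehler" when both parts are in the dictionary and the whole word is not.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; this class is not part of jwordsplitter.
class CompoundSplitSketch {

    // Assumed interfix for this sketch: only "s", as in "Erhebung-s-fehler".
    private static final String INTERFIX = "s";

    /** Splits a word into at most two dictionary parts, or returns it unsplit. */
    static List<String> splitInTwo(String word, Set<String> dictionary) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (dictionary.contains(lower)) {
            return List.of(lower); // the whole word is known, so it is not split
        }
        for (int i = 1; i < lower.length(); i++) {
            String left = lower.substring(0, i);
            String right = lower.substring(i);
            if (!dictionary.contains(right)) {
                continue;
            }
            if (dictionary.contains(left)) {
                return List.of(left, right);
            }
            // Retry with a trailing interfix removed from the left part.
            if (left.endsWith(INTERFIX)) {
                String stem = left.substring(0, left.length() - INTERFIX.length());
                if (dictionary.contains(stem)) {
                    return List.of(stem, right);
                }
            }
        }
        return List.of(lower); // no split found: one element, the input itself
    }

    public static void main(String[] args) {
        Set<String> dictionary = Set.of("erhebung", "fehler");
        System.out.println(splitInTwo("Erhebungsfehler", dictionary)); // prints [erhebung, fehler]
    }
}
```

Adding "erhebungsfehler" to the dictionary in this sketch makes the method return the word unsplit, which mirrors the statement that the result depends only on the dictionary contents.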
-
Field Summary
Fields
Modifier and Type, Field, Description
private ExceptionSplits exceptionSplits
private final boolean hideInterfixCharacters
private int maximumWordLength
private int minimumWordLength
private boolean strictMode
words
-
Constructor Summary
Constructors
Constructor, Description
AbstractWordSplitter(boolean hideInterfixCharacters)
    Create a word splitter that uses the embedded dictionary.
AbstractWordSplitter(boolean hideInterfixCharacters, File plainTextDict)
AbstractWordSplitter(boolean hideInterfixCharacters, InputStream plainTextDict)
AbstractWordSplitter(boolean hideInterfixCharacters, Set<String> words)
-
Method Summary
Modifier and Type, Method, Description
void addException(String completeWord, List<String> wordParts)
private void cleanLeadingAndTrailingHyphens(List<String> disambiguatedParts)
private boolean endsWithInterfix(String word)
private String findInterfixOrNull(String word)
getAllSplits(String word)
    Experimental: Split a word with unknown parts, typically because one part has a typo.
getAllSplits(String word, boolean fromLeft)
protected abstract int getDefaultMinimumWordLength()
protected abstract GermanInterfixDisambiguator getDisambiguator
getExceptionSplitOrNull(String rightPart, String leftPart)
protected abstract Collection<String> getInterfixCharacters
    Interfix elements in lowercase, e.g. at least "s" for German.
getSubWords(String word)
getWordList(File file)
getWordList(InputStream stream)
private boolean isLoopEnd
private boolean isSimpleWord(String part)
private String removeInterfix(String word, String interfixOrNull)
void setExceptionFile(String filename)
void setMaximumWordLength(int len)
    Words longer than this will throw an IllegalArgumentException to avoid extremely long processing times.
void setMinimumWordLength(int len)
void setStrictMode(boolean strictMode)
    When set to true, words will only be split if all parts are words.
splitFromRight(String word, boolean collectSubwords)
-
Field Details
-
words
-
hideInterfixCharacters
private final boolean hideInterfixCharacters
-
exceptionSplits
-
strictMode
private boolean strictMode
-
minimumWordLength
private int minimumWordLength
-
maximumWordLength
private int maximumWordLength
-
-
Constructor Details
-
AbstractWordSplitter
Create a word splitter that uses the embedded dictionary.
- Parameters:
hideInterfixCharacters
- whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
- Throws:
IOException
-
AbstractWordSplitter
public AbstractWordSplitter(boolean hideInterfixCharacters, InputStream plainTextDict) throws IOException
- Parameters:
hideInterfixCharacters
- whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict
- a stream of a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
- Throws:
IOException
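The plainTextDict format described above (UTF-8, one word per line) can be read with a few lines of standard Java. The following is a minimal sketch, not the library's own loader: the class PlainTextDictSketch and the method readWords are hypothetical, and lowercasing entries and skipping blank lines are assumptions of this sketch.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration; this class is not part of jwordsplitter.
class PlainTextDictSketch {

    /** Reads a UTF-8 stream with one word per line into a set of words. */
    static Set<String> readWords(InputStream stream) {
        Set<String> words = new HashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    // Assumption of this sketch: entries are stored in lowercase.
                    words.add(word.toLowerCase(Locale.ROOT));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        byte[] data = "Erhebung\nFehler\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(readWords(new ByteArrayInputStream(data)).size()); // prints 2
    }
}
```

A set built this way has the shape expected by the constructor variant that takes a Set<String> of compound part words.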
-
AbstractWordSplitter
- Parameters:
hideInterfixCharacters
- whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
plainTextDict
- a stream of a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
- Throws:
IOException
-
AbstractWordSplitter
- Parameters:
hideInterfixCharacters
- whether the word parts returned by splitWord(String) still contain the connecting character (a.k.a. interfix)
words
- the compound part words
- Throws:
IOException
- Since:
- 4.1
-
-
Method Details
-
getWordList
- Throws:
IOException
-
getWordList
- Throws:
IOException
-
getDisambiguator
-
getDefaultMinimumWordLength
protected abstract int getDefaultMinimumWordLength()
-
getInterfixCharacters
Interfix elements in lowercase, e.g. at least "s" for German.
-
getWordList
- Throws:
IOException
-
setMinimumWordLength
public void setMinimumWordLength(int len)
-
setMaximumWordLength
public void setMaximumWordLength(int len)
Splitting a word longer than this throws an IllegalArgumentException, to avoid extremely long processing times. The default is 70.
- Since:
- 4.2
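The length limit can be pictured as a guard that runs before any splitting work starts. This is a minimal sketch under stated assumptions: the class LengthGuardSketch and the method checkLength are hypothetical, and only the default of 70 and the exception type come from the description above.

```java
// Hypothetical illustration; this class is not part of jwordsplitter.
class LengthGuardSketch {

    private int maximumWordLength = 70; // default stated in the description above

    void setMaximumWordLength(int len) {
        this.maximumWordLength = len;
    }

    /** Rejects over-long input before any splitting work is attempted. */
    void checkLength(String word) {
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException(
                "Word has " + word.length() + " characters, maximum is " + maximumWordLength);
        }
    }

    public static void main(String[] args) {
        LengthGuardSketch guard = new LengthGuardSketch();
        guard.checkLength("Erhebungsfehler"); // 15 characters, accepted
        guard.setMaximumWordLength(10);
        try {
            guard.checkLength("Erhebungsfehler");
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Callers that process untrusted text may want to catch the exception per word rather than abort a whole batch.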
-
setExceptionFile
- Parameters:
filename
- UTF-8 encoded file with exceptions in the classpath, one exception per line, using pipe as delimiter. Example: Pilot|sendung
- Throws:
IOException
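The line format of the exception file can be sketched as follows. The class ExceptionLineSketch and its methods are hypothetical and not taken from the library. That the lookup key is the lowercased concatenation of the parts is an assumption, based on addException treating the complete word case-insensitively.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; this class is not part of jwordsplitter.
class ExceptionLineSketch {

    /** Returns the word parts of one exception line, e.g. "Pilot|sendung". */
    static List<String> parseParts(String line) {
        return Arrays.asList(line.trim().split("\\|"));
    }

    /** Assumed lookup key: the parts joined together, in lowercase. */
    static String lookupKey(String line) {
        return String.join("", parseParts(line)).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(parseParts("Pilot|sendung")); // prints [Pilot, sendung]
        System.out.println(lookupKey("Pilot|sendung"));  // prints pilotsendung
    }
}
```

A line without any pipe yields a single part, which corresponds to an exception stating that the word should not be split.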
-
addException
- Parameters:
completeWord
- the word for which an exception is to be defined (matched case-insensitively)
wordParts
- the parts into which the word is to be split (use a list with a single element if the word should not be split)
-
setStrictMode
public void setStrictMode(boolean strictMode)
When set to true, words will only be split if all parts are words. Otherwise the splitting result might contain parts that are not words.
-
getAllSplits
Experimental: Split a word with unknown parts, typically because one part has a typo. This could be used to split three-part compounds where one part has a typo (the caller is then responsible for making useful corrections out of these parts). Results are returned in no specific order.
- Since:
- 4.0
-
getAllSplits
- Throws:
InterruptedException
-
isLoopEnd
-
getSubWords
- Since:
- 4.2
-
splitWord
-
splitWord
- Returns:
- a list of compound parts, with one element (the input word itself) if the input could not be split; returns an empty list if the input is null
- Since:
- 4.2
-
cleanLeadingAndTrailingHyphens
-
split
-
splitFromRight
-
getExceptionSplitOrNull
-
findInterfixOrNull
-
endsWithInterfix
-
removeInterfix
-
isSimpleWord
-