Class AbstractWordSplitter

java.lang.Object
de.danielnaber.jwordsplitter.AbstractWordSplitter
Direct Known Subclasses:
GermanWordSplitter

public abstract class AbstractWordSplitter extends Object
This class splits compound words into their smallest parts (atoms). For example, "Erhebungsfehler" is split into "erhebung" and "fehler" if both "erhebung" and "fehler" are in the dictionary and "erhebungsfehler" is not. How a word is split therefore depends only on the contents of the dictionary. A dictionary for German is included.

This is especially useful for German words, but it works with all languages. The parts are returned in the same order in which they appear in the compound. A large dictionary is recommended, since a part can only be recognized if it is in the dictionary.

Please note: the input is expected to be a single word made up of letters only, without special characters such as !":;,.-_ and the like.
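
The dictionary-driven behaviour described above can be illustrated with a small, self-contained sketch. It is not the jwordsplitter implementation: the class ToyCompoundSplitter, its longest-prefix-first search, and the hard-coded interfix "s" are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; not the library's algorithm.
class ToyCompoundSplitter {

    private final Set<String> dictionary; // expected to contain lowercase words

    public ToyCompoundSplitter(Set<String> dictionary) {
        this.dictionary = dictionary;
    }

    // Returns the parts in order of appearance, or the input itself if it
    // is a dictionary word or cannot be covered by dictionary words.
    public List<String> split(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        List<String> parts = new ArrayList<>();
        if (dictionary.contains(lower) || !splitInto(lower, parts)) {
            return Collections.singletonList(word);
        }
        return parts;
    }

    // Tries to cover 'rest' completely with dictionary words, optionally
    // skipping a single interfix "s" between two parts (an assumption).
    private boolean splitInto(String rest, List<String> parts) {
        if (rest.isEmpty()) {
            return !parts.isEmpty();
        }
        for (int end = rest.length(); end > 0; end--) { // longest prefix first
            String head = rest.substring(0, end);
            if (!dictionary.contains(head)) {
                continue;
            }
            String tail = rest.substring(end);
            parts.add(head);
            if (splitInto(tail, parts)) {
                return true;
            }
            if (tail.startsWith("s") && tail.length() > 1
                    && splitInto(tail.substring(1), parts)) {
                return true;
            }
            parts.remove(parts.size() - 1); // backtrack
        }
        return false;
    }
}
```

With a dictionary that contains only "erhebung" and "fehler", split("Erhebungsfehler") yields the two parts "erhebung" and "fehler". Once "erhebungsfehler" is added to the dictionary, the word is returned unsplit.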

  • Field Details

    • words

      protected Set<String> words
    • hideInterfixCharacters

      private final boolean hideInterfixCharacters
    • exceptionSplits

      private ExceptionSplits exceptionSplits
    • strictMode

      private boolean strictMode
    • minimumWordLength

      private int minimumWordLength
    • maximumWordLength

      private int maximumWordLength
  • Constructor Details

    • AbstractWordSplitter

      public AbstractWordSplitter(boolean hideInterfixCharacters) throws IOException
      Create a word splitter that uses the embedded dictionary.
      Parameters:
      hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
      Throws:
      IOException
    • AbstractWordSplitter

      public AbstractWordSplitter(boolean hideInterfixCharacters, InputStream plainTextDict) throws IOException
      Parameters:
      hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
      plainTextDict - a stream of a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
      Throws:
      IOException
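
      The dictionary format described for plainTextDict (UTF-8, one word per line) could be read roughly as in the sketch below. The class PlainTextDictSketch, the skipping of blank lines, and the lowercasing of entries are assumptions for illustration, not the library's actual loading code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper; shows one way to read a one-word-per-line dictionary.
class PlainTextDictSketch {

    public static Set<String> readWords(InputStream plainTextDict) throws IOException {
        Set<String> words = new HashSet<>();
        // Decode explicitly as UTF-8 instead of relying on the platform default.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(plainTextDict, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) { // skipping blank lines is an assumption
                    words.add(word.toLowerCase(Locale.ROOT)); // lowercasing is an assumption
                }
            }
        }
        return words;
    }
}
```

      Decoding with an explicit charset matters here because German dictionaries contain umlauts, which would be misread under a non-UTF-8 platform default.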
    • AbstractWordSplitter

      public AbstractWordSplitter(boolean hideInterfixCharacters, File plainTextDict) throws IOException
      Parameters:
      hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
      plainTextDict - a text file with one word per line, to be used instead of the embedded dictionary, must be in UTF-8 format
      Throws:
      IOException
    • AbstractWordSplitter

      public AbstractWordSplitter(boolean hideInterfixCharacters, Set<String> words) throws IOException
      Parameters:
      hideInterfixCharacters - if true, the connecting character (a.k.a. interfix) is removed from the word parts returned by splitWord(String); if false, the parts still contain it
      words - the compound part words
      Throws:
      IOException
      Since:
      4.1
  • Method Details

    • getWordList

      protected abstract Set<String> getWordList(InputStream stream) throws IOException
      Throws:
      IOException
    • getWordList

      protected abstract Set<String> getWordList() throws IOException
      Throws:
      IOException
    • getDisambiguator

      protected abstract GermanInterfixDisambiguator getDisambiguator()
    • getDefaultMinimumWordLength

      protected abstract int getDefaultMinimumWordLength()
    • getInterfixCharacters

      protected abstract Collection<String> getInterfixCharacters()
      Returns the interfix elements in lowercase. For German this includes at least "s".
    • getWordList

      private Set<String> getWordList(File file) throws IOException
      Throws:
      IOException
    • setMinimumWordLength

      public void setMinimumWordLength(int len)
    • setMaximumWordLength

      public void setMaximumWordLength(int len)
      Sets the maximum length of input words. Passing a longer word to the splitter causes an IllegalArgumentException to be thrown, which avoids extremely long processing times. The default is 70.
      Since:
      4.2
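
      The length limit can be pictured as a simple guard that runs before any splitting work. The sketch below is an assumption about the general shape of such a check: the class WordLengthGuardSketch and its check method are hypothetical, and only the default of 70 and the exception type come from the documentation above.

```java
// Hypothetical guard; only the default (70) and the exception type are documented.
class WordLengthGuardSketch {

    private int maximumWordLength = 70;

    public void setMaximumWordLength(int len) {
        this.maximumWordLength = len;
    }

    // Returns the word unchanged, or throws if it exceeds the configured limit.
    public String check(String word) {
        if (word.length() > maximumWordLength) {
            throw new IllegalArgumentException(
                    "Word too long: " + word.length() + " > " + maximumWordLength);
        }
        return word;
    }
}
```

      Callers that process untrusted text may want to catch IllegalArgumentException and treat the over-long token as unsplittable rather than abort the whole run.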
    • setExceptionFile

      public void setExceptionFile(String filename) throws IOException
      Parameters:
      filename - name of a UTF-8 encoded file on the classpath that lists exceptions, one per line, with the pipe character separating the parts. Example: Pilot|sendung
      Throws:
      IOException
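
      A line in the exception file format shown above ("Pilot|sendung") might be parsed as in the following sketch. The class ExceptionLineSketch is hypothetical, and the assumption that the complete word is the lowercased concatenation of the parts is ours, not something the documentation states.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical parser for one line of a pipe-delimited exception file.
class ExceptionLineSketch {

    // "Pilot|sendung" becomes ["Pilot", "sendung"]; the pipe must be escaped in the regex.
    public static List<String> parseParts(String line) {
        return Arrays.asList(line.trim().split("\\|"));
    }

    // Assumption: the word the exception applies to is the parts joined and lowercased.
    public static String completeWord(String line) {
        return String.join("", parseParts(line)).toLowerCase(Locale.ROOT);
    }
}
```

      Note that String.split takes a regular expression, so an unescaped "|" would split between every character instead of at the delimiter.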
    • addException

      public void addException(String completeWord, List<String> wordParts)
      Parameters:
      completeWord - the word for which an exception is to be defined (matched case-insensitively)
      wordParts - the parts in which the word is to be split (use a list with a single element if the word should not be split)
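
      One way to picture the case-insensitive matching of completeWord is a map keyed by the lowercased word. The class ExceptionRegistrySketch and its lookupOrNull method below are hypothetical and only illustrate the documented contract; the internals of ExceptionSplits are not shown in this documentation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical registry; illustrates case-insensitive exception lookup.
class ExceptionRegistrySketch {

    private final Map<String, List<String>> exceptions = new HashMap<>();

    public void addException(String completeWord, List<String> wordParts) {
        // Normalise the key so that lookups ignore case; copy the list defensively.
        exceptions.put(completeWord.toLowerCase(Locale.ROOT), new ArrayList<>(wordParts));
    }

    // Returns the registered parts, or null if no exception exists for the word.
    public List<String> lookupOrNull(String word) {
        return exceptions.get(word.toLowerCase(Locale.ROOT));
    }
}
```

      Registering a word with a single-element list, as the parameter description suggests, marks it as one that should never be split.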
    • setStrictMode

      public void setStrictMode(boolean strictMode)
      When set to true, a word is split only if all resulting parts are known words. Otherwise the splitting result might contain parts that are not words.
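
      The effect of strict mode can be sketched as a filter applied to a candidate split. The class StrictModeSketch and its accept method are hypothetical, and the assumption that "words" means dictionary entries compared in lowercase is ours.

```java
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical filter; shows the documented difference between the two modes.
class StrictModeSketch {

    // In strict mode a candidate split is rejected (the word is returned whole)
    // as soon as one part is not in the dictionary; otherwise it is accepted as-is.
    public static List<String> accept(List<String> candidateParts, String word,
                                      Set<String> dictionary, boolean strictMode) {
        if (strictMode) {
            for (String part : candidateParts) {
                if (!dictionary.contains(part.toLowerCase(Locale.ROOT))) {
                    return Collections.singletonList(word);
                }
            }
        }
        return candidateParts;
    }
}
```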
    • getAllSplits

      public List<List<String>> getAllSplits(String word)
      Experimental: Split a word with unknown parts, typically because one part has a typo. This could be used to split three-part compounds where one part has a typo (the caller is then responsible for making useful corrections out of these parts). Results are returned in no specific order.
      Since:
      4.0
    • getAllSplits

      List<List<String>> getAllSplits(String word, boolean fromLeft) throws InterruptedException
      Throws:
      InterruptedException
    • isLoopEnd

      private boolean isLoopEnd(boolean fromLeft, int i, String word)
    • getSubWords

      public List<String> getSubWords(String word)
      Since:
      4.2
    • splitWord

      public List<String> splitWord(String word)
    • splitWord

      public List<String> splitWord(String word, boolean collectSubwords)
      Returns:
      a list of compound parts, with one element (the input word itself) if the input could not be split; returns an empty list if the input is null
      Since:
      4.2
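
      The three cases of the return contract (empty list for null input, one element for an unsplittable word, several elements for a compound) can be handled by a caller as sketched below. The class SplitResultSketch is hypothetical and works on any list shaped like the documented return value; it does not call the library.

```java
import java.util.List;

// Hypothetical caller-side helper for interpreting a split result.
class SplitResultSketch {

    // True only if the word was actually decomposed into several parts.
    public static boolean wasSplit(List<String> parts) {
        return parts.size() > 1;
    }

    public static String describe(List<String> parts) {
        if (parts.isEmpty()) {
            return "no input";                       // documented result for null input
        }
        if (parts.size() == 1) {
            return "not split: " + parts.get(0);     // the input word itself
        }
        return "split into " + String.join(" + ", parts);
    }
}
```

      Checking the list size rather than comparing against the input string avoids surprises if the returned part differs from the input in case.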
    • cleanLeadingAndTrailingHyphens

      private void cleanLeadingAndTrailingHyphens(List<String> disambiguatedParts)
    • split

      private List<String> split(String word, boolean allowInterfixRemoval, boolean collectSubwords)
    • splitFromRight

      private List<String> splitFromRight(String word, boolean collectSubwords)
    • getExceptionSplitOrNull

      private List<String> getExceptionSplitOrNull(String rightPart, String leftPart)
    • findInterfixOrNull

      private String findInterfixOrNull(String word)
    • endsWithInterfix

      private boolean endsWithInterfix(String word)
    • removeInterfix

      private String removeInterfix(String word, String interfixOrNull)
    • isSimpleWord

      private boolean isSimpleWord(String part)