it.unimi.dsi.mg4j.util
Class MinimalPerfectHash

java.lang.Object
  extended byit.unimi.dsi.mg4j.index.AbstractTermMap
      extended byit.unimi.dsi.mg4j.util.MinimalPerfectHash
All Implemented Interfaces:
Serializable, TermMap
Direct Known Subclasses:
SignedMinimalPerfectHash

public class MinimalPerfectHash
extends AbstractTermMap
implements Serializable

Order-preserving minimal perfect hash tables.

Given a collection of terms without duplicates (it is not forced to be a Set so that one can also use arrays), the constructor of this class finds an order-preserving minimal perfect hash function for the collection. Subsequent calls to the hash(CharSequence) method will return the ordinal position of the provided string (if it was in the collection; otherwise, the result is a random position). The class can then be saved by serialisation and reused later.

If you also need to establish whether a given term was or not in the collection, you should consider using a SignedMinimalPerfectHash implementing class instead.

This class is very scalable, and if you have enough memory it will handle efficiently hundreds of millions of terms: in particular, the offline constructor can build a map without loading the terms in memory.

To do its job, the class must store three vectors of weights that are used to compute certain hash functions. By default, the vectors are long as the longest term, but if your collection is sorted you can ask (passing WEIGHT_UNKNOWN_SORTED_TERMS to a constructor) for the shortest possible vector length. This optimisation does not change the memory footprint, but it can improve performance.

As a commodity, this class provides a main method that reads from standard input a sequence of newline-separated terms, and writes a serialised minimal perfect hash table for the given list. You can specify on the command line the kind of table (e.g., CRC32SignedMinimalPerfectHash) and have it fetched by reflection.

Implementation Details

An object of this class uses about 1.25n integers between 0 and n-1 inclusive to hash n. This is asymptotically optimal, but for small collections using an integer wastes a large number of bits in each entry.

The technique used here is suggested (but not fully described) in the second edition of Managing Gigabytes. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in the monograph by Czech, Havas and Majewski on perfect hashing (Perfect Hashing, Theoretical Computer Science 182(1997), pages 1-143).

First, a triple of intermediate hash functions (generated via universal hashing) define for each term a 3-hyperedge in a hypergraph with 1.25n vertices. Each intermediate hash function uses a vector of random weights; the length of the vector is by default the length of the longest term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will distinguish any pair of term.

Then, by successively stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges. Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range 0 and n-1 in such a way that the sum of the three numbers modulo n is exactly the original index of the term corresponding to the hyperedge. This is possible with high probability, so we try until we succeed.

Note that the mathematical results guaranteeing that it is possible to find a function in expected constant time are asymptotic. Thus, when the number of terms is less than TERM_THRESHOLD, an instance of this class just stores the terms in a vector and scans them linearly. This behaviour, however, is completely transparent to the user.

Rounds and Logging

Building a minimal perfect hash map may take hours. As it happens with all probabilistic algorithms, one can just give estimates of the expected time.

There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs. The ratio between the number of vertices and the number of terms guarantee that acyclicity is true with high probability. However, when generating the hypergraph we use three hash functions, and it must never happen that the value of two of those functions coincide on a given term. Because of the birthday paradox, the probability of getting a nondegerate edge is just

(m-1)(m-2)/m2,
where m=1.25n. Since this must happen for n times, we must raise this probability to n, and as n grows the value quickly (and monotonically) reaches e-12/5. We conclude that the expected number of trials to generate a random 3-hypergraph is bounded by e12/5, which is about 11. Note that this bound does not depend on n.

Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number of trials tends to 1 as n approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms they obtained consistently one trial for more than 5000 experiments). The number of expected trials, of course, must be multiplied by 11, as every failure of the stripping procedure requires generating a new hypergraph.

To help diagnosing problem with the generation process class, a property named it.unimi.dsi.mg4j.util.minimalperfecthash.verbose is checked for: if set to true, then the class will log what's happening to standard error. In particular, it will signal whenever a degenerate hyperedge is generated, and the various construction phases.

Note that if during the generation process the log warns more than once about duplicate hyperedges, you should suspect that there are duplicates in the term list, as duplicate hyperedges are extremely unlikely.

Since:
0.1
Author:
Sebastiano Vigna, Paolo Boldi
See Also:
Serialized Form

Field Summary
protected  int[] g
          The final magick—the function turning the intermediate hash values into the final bucket.
protected  int m
          The number of vertices of the intermediate hypergraph.
protected  int n
          The number of buckets.
protected  long n4
          Four times the number of buckets.
protected  int rightShift
          The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m.
static long serialVersionUID
           
protected  CharSequence[] t
          If n is smaller than TERM_THRESHOLD, a vector containing the terms.
static int TERM_THRESHOLD
          The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector.
protected  Collection terms
          The term collection (it could be an anonymous class wrapping a file).
protected static boolean VERBOSE
          Whether we should log what's happening (it is set at class load time using the system property it.unimi.dsi.mg4j.util.minimalperfecthash.verbose).
static int WEIGHT_UNKNOWN
          A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term.
static int WEIGHT_UNKNOWN_SORTED_TERMS
          A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order.
protected  int[] weight0
          Vector of weights to compute the first intermediate hash function.
protected  int[] weight1
          Vector of weights to compute the second intermediate hash function.
protected  int[] weight2
          Vector of weights to compute the third intermediate hash function.
protected  int weightLength
          The length of the components of the weight vectors (it's faster than asking the length of the vectors).
 
Constructor Summary
MinimalPerfectHash(Collection terms)
          Creates a new order-preserving minimal perfect hash table for the given set of terms, using as many weights as the longest term in the collection.
MinimalPerfectHash(Collection terms, int weightLength)
          Creates a new order-preserving minimal perfect hash table for the given set of terms using the given number of weights.
MinimalPerfectHash(String termFile)
          Creates a new order-preserving minimal perfect hash table for the given file of terms, using as many weights as the longest term in the file.
MinimalPerfectHash(String termFile, String encoding, int weightLength)
          Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights.
 
Method Summary
 int get(CharSequence term)
          Hashes a given term.
 int get(MutableString term)
          Hashes a given term.
protected  int getFromT(CharSequence term)
          Gets a term out of the stored array t.
 int getWeightLength()
          Deprecated. As of mg4j 0.8, replaced by weightLength().
 int hash(CharSequence term)
          Deprecated. As of mg4j 1.0, replaced by get(CharSequence).
protected  void hash(CharSequence term, int[] h)
          Hashes a given term using the intermediate hash functions.
 int hash(MutableString term)
          Deprecated. As of mg4j 1.0, replaced by get(MutableString).
static void main(String[] arg)
           
 int size()
          Returns the number of terms hashed.
 int weightLength()
          Returns the length of the weight vectors.
 
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractTermMap
get, get
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

TERM_THRESHOLD

public static final int TERM_THRESHOLD
The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector.

See Also:
Constant Field Values

WEIGHT_UNKNOWN

public static final int WEIGHT_UNKNOWN
A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term.

See Also:
Constant Field Values

WEIGHT_UNKNOWN_SORTED_TERMS

public static final int WEIGHT_UNKNOWN_SORTED_TERMS
A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order.

See Also:
Constant Field Values

n

protected final int n
The number of buckets.


m

protected final int m
The number of vertices of the intermediate hypergraph.


rightShift

protected final int rightShift
The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m.


weight0

protected final int[] weight0
Vector of weights to compute the first intermediate hash function.


weight1

protected final int[] weight1
Vector of weights to compute the second intermediate hash function.


weight2

protected final int[] weight2
Vector of weights to compute the third intermediate hash function.


weightLength

protected final int weightLength
The length of the components of the weight vectors (it's faster than asking the length of the vectors).


g

protected final int[] g
The final magick—the function turning the intermediate hash values into the final bucket.


VERBOSE

protected static final boolean VERBOSE
Whether we should log what's happening (it is set at class load time using the system property it.unimi.dsi.mg4j.util.minimalperfecthash.verbose).


terms

protected final transient Collection terms
The term collection (it could be an anonymous class wrapping a file).


n4

protected transient long n4
Four times the number of buckets.


t

protected transient CharSequence[] t
If n is smaller than TERM_THRESHOLD, a vector containing the terms.


serialVersionUID

public static final long serialVersionUID
See Also:
Constant Field Values
Constructor Detail

MinimalPerfectHash

public MinimalPerfectHash(Collection terms)
Creates a new order-preserving minimal perfect hash table for the given set of terms, using as many weights as the longest term in the collection.

Caution: if the collection contains twice the same term, this constructor will never return.

Parameters:
terms - some terms to hash; it is assumed that this Collection does not contain duplicates or the empty term.

MinimalPerfectHash

public MinimalPerfectHash(Collection terms,
                          int weightLength)
Creates a new order-preserving minimal perfect hash table for the given set of terms using the given number of weights.

This constructor can be used to force the use of a reduced number of weights if you are sure that the first weightLength characters of each term in terms are distinct.

If you do not know your weight length in advance, you can pass two special values as weightLength: WEIGHT_UNKNOWN will force the computation of weightLength as the maximum length of a term, whereas WEIGHT_UNKNOWN_SORTED_TERMS forces the assumption that terms are sorted: in this case, we search for the minimum prefix that will disambiguate all terms in the collection.

Caution: if two terms in the collection have a common prefix of length weightLength this constructor will never return.

Parameters:
terms - some terms to hash; if weightLength is specified explicitly, it is assumed that this Collection does not contain terms with a common prefix of weightLength characters or the empty term.
weightLength - the number of weights used generating the intermediate hash functions, WEIGHT_UNKNOWN or WEIGHT_UNKNOWN_SORTED_TERMS.
See Also:
MinimalPerfectHash(Collection)

MinimalPerfectHash

public MinimalPerfectHash(String termFile,
                          String encoding,
                          int weightLength)
Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights.

Parameters:
termFile - an file containing one term on each line; it is assumed that it does not contain terms with a common prefix of weightLength characters or the empty term.
encoding - the encoding of termFile; if null, it is assumed to be the platform default encoding.
weightLength - the number of weights used generating the intermediate hash functions, WEIGHT_UNKNOWN or WEIGHT_UNKNOWN_SORTED_TERMS.
See Also:
MinimalPerfectHash(Collection, int)

MinimalPerfectHash

public MinimalPerfectHash(String termFile)
Creates a new order-preserving minimal perfect hash table for the given file of terms, using as many weights as the longest term in the file.

Parameters:
termFile - a file in the platform default encoding containing one term on each line; it is assumed that the file does not contain duplicates or the empty term.
See Also:
MinimalPerfectHash(String, String, int)
Method Detail

hash

protected void hash(CharSequence term,
                    int[] h)
Hashes a given term using the intermediate hash functions.

Parameters:
term - a term to hash.
h - a three-element vector that will be filled with the three intermediate hash values.

getFromT

protected int getFromT(CharSequence term)
Gets a term out of the stored array t.

Note: This method does not check for t being non-null.

Parameters:
term - a term.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is 0.

get

public int get(CharSequence term)
Hashes a given term.

Specified by:
get in interface TermMap
Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

get

public int get(MutableString term)
Hashes a given term.

Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

hash

public int hash(CharSequence term)
Deprecated. As of mg4j 1.0, replaced by get(CharSequence).

Hashes a given term.

Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

hash

public int hash(MutableString term)
Deprecated. As of mg4j 1.0, replaced by get(MutableString).

Hashes a given term.

Parameters:
term - a term to be hashed.
Returns:
the position of the given term in the generating collection, starting from 0. If the term was not in the original collection, the result is a random position.

getWeightLength

public int getWeightLength()
Deprecated. As of mg4j 0.8, replaced by weightLength().

Gets the length of the weight vectors.

Returns:
the length of weight vectors used in this table.

weightLength

public int weightLength()
Returns the length of the weight vectors.

Returns:
the length of weight vectors used in this table.

size

public int size()
Returns the number of terms hashed.

Returns:
the number of terms hashed.

main

public static void main(String[] arg)
                 throws IllegalArgumentException,
                        SecurityException,
                        InstantiationException,
                        IllegalAccessException,
                        InvocationTargetException,
                        NoSuchMethodException,
                        ClassNotFoundException,
                        FileNotFoundException,
                        IOException
Throws:
IllegalArgumentException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
ClassNotFoundException
FileNotFoundException
IOException