|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.mg4j.index.AbstractTermMap
it.unimi.dsi.mg4j.util.MinimalPerfectHash
Order-preserving minimal perfect hash tables.
Given a collection of terms without duplicates (it is not forced to be a Set
so that one can also
use arrays), the constructor of this class finds an order-preserving minimal perfect hash function for
the collection. Subsequent calls to the hash(CharSequence)
method will return the ordinal position of
the provided string (if it was in the collection; otherwise, the result is a random position). The class
can then be saved by serialisation and reused later.
If you also need to establish whether a given term was or not in the
collection, you should consider using a SignedMinimalPerfectHash
implementing class instead.
This class is very scalable, and if you have enough memory it will handle efficiently hundreds of millions of terms: in particular, the offline constructor can build a map without loading the terms in memory.
To do its job, the class must store three vectors of weights that are used to compute
certain hash functions. By default, the vectors are long as the longest term, but if
your collection is sorted you can ask (passing WEIGHT_UNKNOWN_SORTED_TERMS
to a constructor)
for the shortest possible vector length. This optimisation does not change the
memory footprint, but it can improve performance.
As a commodity, this class provides a main method that reads from
standard input a sequence of newline-separated terms, and
writes a serialised minimal perfect hash table for the given list. You can
specify on the command line the kind of table (e.g., CRC32SignedMinimalPerfectHash
) and have it fetched by reflection.
An object of this class uses about 1.25n integers between 0 and n-1 inclusive to hash n. This is asymptotically optimal, but for small collections using an integer wastes a large number of bits in each entry.
The technique used here is suggested (but not fully described) in the second edition of Managing Gigabytes. It is due to Havas, Majewski, Wormald and Czech, and it is fully described in the monograph by Czech, Havas and Majewski on perfect hashing (Perfect Hashing, Theoretical Computer Science 182(1997), pages 1-143).
First, a triple of intermediate hash functions (generated via universal hashing) define for each term a 3-hyperedge in a hypergraph with 1.25n vertices. Each intermediate hash function uses a vector of random weights; the length of the vector is by default the length of the longest term, but if the collection is sorted it is possible to compute the minimum length of a prefix that will distinguish any pair of term.
Then, by successively stripping hyperedges with one vertex of degree one, we create an ordering on the hyperedges. Finally, we assign (in reverse order) to each vertex of a hyperedge a number in the range 0 and n-1 in such a way that the sum of the three numbers modulo n is exactly the original index of the term corresponding to the hyperedge. This is possible with high probability, so we try until we succeed.
Note that the mathematical results guaranteeing that it is possible to find a
function in expected constant time are asymptotic.
Thus, when the number of terms is less than TERM_THRESHOLD
, an instance
of this class just stores the terms in a vector and scans them linearly. This behaviour,
however, is completely transparent to the user.
Building a minimal perfect hash map may take hours. As it happens with all probabilistic algorithms, one can just give estimates of the expected time.
There are two probabilistic sources of problems: degenerate hyperedges and non-acyclic hypergraphs. The ratio between the number of vertices and the number of terms guarantee that acyclicity is true with high probability. However, when generating the hypergraph we use three hash functions, and it must never happen that the value of two of those functions coincide on a given term. Because of the birthday paradox, the probability of getting a nondegerate edge is just
(m-1)(m-2)/m2,where m=1.25n. Since this must happen for n times, we must raise this probability to n, and as n grows the value quickly (and monotonically) reaches e-12/5. We conclude that the expected number of trials to generate a random 3-hypergraph is bounded by e12/5, which is about 11. Note that this bound does not depend on n.
Once the hypergraph has been generated, the stripping procedure may fail. However, the expected number of trials tends to 1 as n approaches infinity (Czech, Havas and Majewski, for instance, report that on a set of 50,000 terms they obtained consistently one trial for more than 5000 experiments). The number of expected trials, of course, must be multiplied by 11, as every failure of the stripping procedure requires generating a new hypergraph.
To help diagnosing problem with the generation process class, a property named it.unimi.dsi.mg4j.util.minimalperfecthash.verbose is checked for: if set to true, then the class will log what's happening to standard error. In particular, it will signal whenever a degenerate hyperedge is generated, and the various construction phases.
Note that if during the generation process the log warns more than once about duplicate hyperedges, you should suspect that there are duplicates in the term list, as duplicate hyperedges are extremely unlikely.
Field Summary | |
protected int[] |
g
The final magick—the function turning the intermediate hash values into the final bucket. |
protected int |
m
The number of vertices of the intermediate hypergraph. |
protected int |
n
The number of buckets. |
protected long |
n4
Four times the number of buckets. |
protected int |
rightShift
The maximum amount of right shift to perform on a 32-bit number so to obtain something greater than or equal to m . |
static long |
serialVersionUID
|
protected CharSequence[] |
t
If n is smaller than TERM_THRESHOLD , a vector containing the terms. |
static int |
TERM_THRESHOLD
The minimum number of terms that will trigger the construction of a minimal perfect hash; overwise, terms are simply stored in a vector. |
protected Collection |
terms
The term collection (it could be an anonymous class wrapping a file). |
protected static boolean |
VERBOSE
Whether we should log what's happening (it is set at class load time using the system property it.unimi.dsi.mg4j.util.minimalperfecthash.verbose). |
static int |
WEIGHT_UNKNOWN
A special value denoting that the weight length is unknown, and should be computed using the maximum length of a term. |
static int |
WEIGHT_UNKNOWN_SORTED_TERMS
A special value denoting that the weight length is unknown, and should be computed assuming that the terms appear in lexicographic order. |
protected int[] |
weight0
Vector of weights to compute the first intermediate hash function. |
protected int[] |
weight1
Vector of weights to compute the second intermediate hash function. |
protected int[] |
weight2
Vector of weights to compute the third intermediate hash function. |
protected int |
weightLength
The length of the components of the weight vectors (it's faster than asking the length of the vectors). |
Constructor Summary | |
MinimalPerfectHash(Collection terms)
Creates a new order-preserving minimal perfect hash table for the given set of terms, using as many weights as the longest term in the collection. |
|
MinimalPerfectHash(Collection terms,
int weightLength)
Creates a new order-preserving minimal perfect hash table for the given set of terms using the given number of weights. |
|
MinimalPerfectHash(String termFile)
Creates a new order-preserving minimal perfect hash table for the given file of terms, using as many weights as the longest term in the file. |
|
MinimalPerfectHash(String termFile,
String encoding,
int weightLength)
Creates a new order-preserving minimal perfect hash table for the given file of terms using the given number of weights. |
Method Summary | |
int |
get(CharSequence term)
Hashes a given term. |
int |
get(MutableString term)
Hashes a given term. |
protected int |
getFromT(CharSequence term)
Gets a term out of the stored array t . |
int |
getWeightLength()
Deprecated. As of mg4j 0.8, replaced by weightLength() . |
int |
hash(CharSequence term)
Deprecated. As of mg4j 1.0, replaced by get(CharSequence) . |
protected void |
hash(CharSequence term,
int[] h)
Hashes a given term using the intermediate hash functions. |
int |
hash(MutableString term)
Deprecated. As of mg4j 1.0, replaced by get(MutableString) . |
static void |
main(String[] arg)
|
int |
size()
Returns the number of terms hashed. |
int |
weightLength()
Returns the length of the weight vectors. |
Methods inherited from class it.unimi.dsi.mg4j.index.AbstractTermMap |
get, get |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
public static final int TERM_THRESHOLD
public static final int WEIGHT_UNKNOWN
public static final int WEIGHT_UNKNOWN_SORTED_TERMS
protected final int n
protected final int m
protected final int rightShift
m
.
protected final int[] weight0
protected final int[] weight1
protected final int[] weight2
protected final int weightLength
protected final int[] g
protected static final boolean VERBOSE
protected final transient Collection terms
protected transient long n4
protected transient CharSequence[] t
n
is smaller than TERM_THRESHOLD
, a vector containing the terms.
public static final long serialVersionUID
Constructor Detail |
public MinimalPerfectHash(Collection terms)
Caution: if the collection contains twice the same term, this constructor will never return.
terms
- some terms to hash; it is assumed that this Collection
does not contain duplicates or the empty term.public MinimalPerfectHash(Collection terms, int weightLength)
This constructor can be used to force the use of a reduced number of weights if you are sure that
the first weightLength
characters of each term in terms
are
distinct.
If you do not know your weight length in advance, you can pass two
special values as weightLength
: WEIGHT_UNKNOWN
will force the computation of weightLength
as the maximum
length of a term, whereas WEIGHT_UNKNOWN_SORTED_TERMS
forces
the assumption that terms are sorted: in this case, we search for
the minimum prefix that will disambiguate all terms in the collection.
Caution: if two terms in the collection have a
common prefix of length weightLength
this constructor will
never return.
terms
- some terms to hash; if weightLength
is specified explicitly, it is assumed that this Collection
does not contain terms with a common prefix of
weightLength
characters or the empty term.weightLength
- the number of weights used generating the
intermediate hash functions, WEIGHT_UNKNOWN
or WEIGHT_UNKNOWN_SORTED_TERMS
.MinimalPerfectHash(Collection)
public MinimalPerfectHash(String termFile, String encoding, int weightLength)
termFile
- an file containing one term on each line; it is assumed that
it does not contain terms with a common prefix of
weightLength
characters or the empty term.encoding
- the encoding of termFile
; if null
, it
is assumed to be the platform default encoding.weightLength
- the number of weights used generating the
intermediate hash functions, WEIGHT_UNKNOWN
or WEIGHT_UNKNOWN_SORTED_TERMS
.MinimalPerfectHash(Collection, int)
public MinimalPerfectHash(String termFile)
termFile
- a file in the platform default encoding
containing one term on each line; it is assumed that the file does
not contain duplicates or the empty term.MinimalPerfectHash(String, String, int)
Method Detail |
protected void hash(CharSequence term, int[] h)
term
- a term to hash.h
- a three-element vector that will be filled with the three intermediate hash values.protected int getFromT(CharSequence term)
t
.
Note: This method does not check for t
being non-null
.
term
- a term.
public int get(CharSequence term)
get
in interface TermMap
term
- a term to be hashed.
public int get(MutableString term)
term
- a term to be hashed.
public int hash(CharSequence term)
mg4j
1.0, replaced by get(CharSequence)
.
term
- a term to be hashed.
public int hash(MutableString term)
mg4j
1.0, replaced by get(MutableString)
.
term
- a term to be hashed.
public int getWeightLength()
mg4j
0.8, replaced by weightLength()
.
public int weightLength()
public int size()
public static void main(String[] arg) throws IllegalArgumentException, SecurityException, InstantiationException, IllegalAccessException, InvocationTargetException, NoSuchMethodException, ClassNotFoundException, FileNotFoundException, IOException
IllegalArgumentException
SecurityException
InstantiationException
IllegalAccessException
InvocationTargetException
NoSuchMethodException
ClassNotFoundException
FileNotFoundException
IOException
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |