java.lang.Object
  it.unimi.dsi.mg4j.tool.FirstPass
Builds batched occurrence files from a list of documents read from standard input.
This class reads a sequence of documents from standard input and produces
corresponding compressed occurrence files, containing a configurable number of
occurrences, which must then be merged using SecondPass. Documents are
separated by a special Unicode character (default is 10, i.e., newline)
that can be set with an option. Every document is treated as a sequence
of words: a word is a maximal sequence of letters and digits (in the Java sense).
The only mandatory argument is a basename, which is used as the stem of the names of all generated files.
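The word definition above can be illustrated with a minimal sketch. It assumes that "letters and digits in the Java sense" means characters accepted by Character.isLetterOrDigit(char); the class WordSplitter and its method words are hypothetical names used only for illustration and are not part of MG4J. The sketch works on char values, so it does not treat supplementary code points specially.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of MG4J): splits a document into words. */
class WordSplitter {
    /** Returns the maximal runs of letters and digits found in the document, in order. */
    static List<String> words(CharSequence document) {
        List<String> result = new ArrayList<>();
        int start = -1; // start of the current word, or -1 when between words
        for (int i = 0; i < document.length(); i++) {
            // Assumption: "the Java sense" is Character.isLetterOrDigit
            if (Character.isLetterOrDigit(document.charAt(i))) {
                if (start < 0) start = i;
            } else if (start >= 0) {
                result.add(document.subSequence(start, i).toString());
                start = -1;
            }
        }
        // Flush a word that runs up to the end of the document
        if (start >= 0) {
            result.add(document.subSequence(start, document.length()).toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Punctuation and spaces act as separators between words
        System.out.println(words("Hello, world! MG4J rocks: 42 times."));
    }
}
```

Under this assumption, punctuation and whitespace never belong to a word, and a token such as "MG4J" stays whole because digits count as word characters.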
Since documents are read sequentially, every document has a natural
index starting from 0. If no permutation is specified, the document
index of each document corresponds to its natural index. If, however,
a permutation is specified, the document index of a document is the image
through the permutation of its natural index. More precisely, a permutation
for N documents is a list of N distinct integers
between 0 and N-1 inclusive, and a document with natural index
i has document index given by the i-th element of the
list. This is useful when indexing ranked documents (e.g., if you are
indexing a part of the web and would like the index to return documents
with higher rank first). If the permutation file is provided, it must be
a sequence of integers, written using the DataOutputStream.writeInt(int)
method; if N is the number of documents, the file is to contain exactly
N distinct integers between 0 and N-1.
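As an illustration of the permutation file format described above, here is a minimal sketch that writes a permutation with DataOutputStream.writeInt(int) and reads it back while checking that it contains exactly N distinct integers between 0 and N-1. The class PermutationFile and its methods are hypothetical and not part of MG4J; how FirstPass itself reads or validates the file is not shown here.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

/** Hypothetical helper (not part of MG4J): reads and writes permutation files. */
class PermutationFile {
    /** Writes each element as a 4-byte big-endian integer, as DataOutputStream.writeInt does. */
    static void write(int[] permutation, File file) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            for (int documentIndex : permutation) out.writeInt(documentIndex);
        }
    }

    /** Reads n integers and checks that they form a permutation of 0..n-1. */
    static int[] read(File file, int n) throws IOException {
        int[] permutation = new int[n];
        boolean[] seen = new boolean[n];
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            for (int i = 0; i < n; i++) {
                int value = in.readInt(); // throws EOFException if the file is too short
                if (value < 0 || value >= n || seen[value]) {
                    throw new IOException("Not a permutation: entry " + value + " at position " + i);
                }
                seen[value] = true;
                permutation[i] = value;
            }
            if (in.read() != -1) throw new IOException("File contains more than " + n + " integers");
        }
        return permutation;
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("permutation", ".bin");
        file.deleteOnExit();
        write(new int[] { 2, 0, 1 }, file);
        int[] permutation = read(file, 3);
        // The document with natural index i gets document index permutation[i]
        for (int i = 0; i < permutation.length; i++) {
            System.out.println("natural index " + i + " -> document index " + permutation[i]);
        }
    }
}
```

With the permutation 2, 0, 1, the first document read (natural index 0) receives document index 2, the second receives 0, and the third receives 1.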
Every term also has an index: increasing indices, starting from 0, are
assigned as new terms are encountered. If you prefer
lexicographically ordered terms, you should call MiddlePass after FirstPass.
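The term numbering described above (first-seen order, starting from 0) can be sketched as follows. The class TermIndexer is a hypothetical illustration and not part of MG4J; it only shows the numbering scheme, not how FirstPass stores terms.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of MG4J): assigns indices to terms in first-seen order. */
class TermIndexer {
    // LinkedHashMap keeps insertion order, so iteration follows increasing term index
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    /** Returns the index of the term, assigning the next free index if the term is new. */
    int indexFor(String term) {
        Integer existing = indexOf.get(term);
        if (existing != null) return existing;
        int next = indexOf.size(); // indices are 0, 1, 2, ... in order of first appearance
        indexOf.put(term, next);
        return next;
    }

    /** Returns the number of distinct terms seen so far. */
    int size() {
        return indexOf.size();
    }

    public static void main(String[] args) {
        TermIndexer indexer = new TermIndexer();
        for (String term : new String[] { "pear", "apple", "pear", "fig" }) {
            System.out.println(term + " -> " + indexer.indexFor(term));
        }
    }
}
```

Here "pear" gets index 0 and "apple" gets index 1 even though "apple" precedes "pear" lexicographically, which is the situation where a lexicographic reordering step becomes relevant.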
These are the files currently generated:
SecondPass to produce the inverted index.
property file containing information about the index. Currently, the following keys are generated:
Method Summary
static void main(String[] arg)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Method Detail
public static void main(String[] arg) throws FileNotFoundException, IOException, ClassNotFoundException
Throws:
FileNotFoundException
IOException
ClassNotFoundException