it.unimi.dsi.mg4j.tool
Class FirstPass

java.lang.Object
  extended by it.unimi.dsi.mg4j.tool.FirstPass

public final class FirstPass
extends Object

Builds batched occurrence files from a list of documents read from standard input.

This class reads a sequence of documents from standard input and produces corresponding compressed occurrence files, each containing a settable number of occurrences; these files must then be merged using SecondPass. Documents are separated by a special Unicode character (the default is 10, i.e., newline) that can be set with an option. Every document is considered as a sequence of words: a word is a maximal sequence of letters and digits (in the Java sense).
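
The notion of word given above can be illustrated with a minimal sketch. It assumes that "letters and digits in the Java sense" means characters accepted by Character.isLetterOrDigit(char); the class name WordSplitSketch is hypothetical and is not part of MG4J, and the actual implementation may differ (for instance, in how it treats supplementary characters).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not the MG4J implementation.
final class WordSplitSketch {

    /** Returns the maximal runs of letters and digits found in a document. */
    static List<String> words(String document) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < document.length(); i++) {
            char c = document.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                // Still inside a word: extend the current run.
                current.append(c);
            } else if (current.length() > 0) {
                // A non-word character closes the current run.
                result.add(current.toString());
                current.setLength(0);
            }
        }
        // A word may end exactly at the end of the document.
        if (current.length() > 0) result.add(current.toString());
        return result;
    }

    public static void main(String[] args) {
        // Prints [Hello, world, x86, 64]
        System.out.println(words("Hello, world! x86-64"));
    }
}
```

Under this assumption punctuation and whitespace only act as separators, so "x86-64" yields two words.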

The only mandatory argument is a basename, which will be used to stem the names of all files generated.

Since documents are read sequentially, every document has a natural index starting from 0. If no permutation is specified, the document index of each document corresponds to its natural index. If, however, a permutation is specified, the document index of a document is the image through the permutation of its natural index. More precisely, a permutation for N documents is a list of N distinct integers between 0 and N-1 inclusive, and a document with natural index i has document index given by the i-th element of the list. This is useful when indexing ranked documents (e.g., if you are indexing a part of the web and would like the index to return documents with higher rank first). If the permutation file is provided, it must be a sequence of integers, written using the DataOutputStream.writeInt(int) method; if N is the number of documents, the file is to contain exactly N distinct integers between 0 and N-1.
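
A permutation in the format described above could be produced along the following lines. This is only a sketch: the class PermutationSketch and its methods are hypothetical helpers, not part of MG4J, and the only call taken from the description is DataOutputStream.writeInt(int).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, not part of MG4J.
final class PermutationSketch {

    /** Checks that p contains N distinct integers between 0 and N-1 inclusive. */
    static boolean isPermutation(int[] p) {
        boolean[] seen = new boolean[p.length];
        for (int v : p) {
            if (v < 0 || v >= p.length || seen[v]) return false;
            seen[v] = true;
        }
        return true;
    }

    /** Writes p[i], the document index of the document with natural index i, for each i. */
    static void write(int[] p, OutputStream out) throws IOException {
        if (!isPermutation(p)) throw new IllegalArgumentException("not a permutation");
        DataOutputStream data = new DataOutputStream(out);
        for (int documentIndex : p) data.writeInt(documentIndex);
        data.flush();
    }

    /** Reads back n integers written by write(). */
    static int[] read(InputStream in, int n) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int[] p = new int[n];
        for (int i = 0; i < n; i++) p[i] = data.readInt();
        return p;
    }

    public static void main(String[] args) throws IOException {
        // Natural index 0 gets document index 2, 1 gets 0, 2 gets 1.
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        write(new int[] {2, 0, 1}, out);
        System.out.println(out.size() + " bytes written");
    }
}
```

Since writeInt always emits four bytes, a valid permutation file for N documents is exactly 4N bytes long, which gives a cheap sanity check before indexing. To create an actual file, a FileOutputStream can be passed to write().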

Every term also has an index. Indices are assigned in increasing order, starting from 0, as new terms are met. If you prefer lexicographically ordered terms, you should call MiddlePass after FirstPass.
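
The assignment of term indices in order of first appearance can be pictured with the following sketch. TermIndexSketch is a hypothetical class written for illustration; the data structures actually used by FirstPass are not described here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not the MG4J implementation.
final class TermIndexSketch {

    // Insertion-ordered map from term to its index.
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    /** Returns the index of a term, assigning the next free index if the term is new. */
    int indexFor(String term) {
        Integer existing = indexOf.get(term);
        if (existing != null) return existing;
        int next = indexOf.size(); // indices are 0, 1, 2, ... in order of first appearance
        indexOf.put(term, next);
        return next;
    }

    /** Returns the terms so that position i holds the term with index i. */
    List<String> termsInIndexOrder() {
        return new ArrayList<>(indexOf.keySet());
    }

    public static void main(String[] args) {
        TermIndexSketch terms = new TermIndexSketch();
        for (String word : new String[] {"zebra", "apple", "zebra"}) terms.indexFor(word);
        // Prints [zebra, apple]: the order of first appearance, not lexicographic order.
        System.out.println(terms.termsInIndexOrder());
    }
}
```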

These are the files currently generated:

basename.terms
For each indexed term, the corresponding literal string in UTF-8 encoding. More precisely, the i-th line of the file (starting from 0) contains the literal string corresponding to term index i.
basename.frequencies
For each term, the number of documents in which the term appears, written in γ coding. More precisely, the i-th integer of the file (starting from 0) is the number of documents in which the term of index i appears.
basename.sizes
For each indexed document, the corresponding size (i.e., the number of words), written in γ coding. More precisely, the i-th integer of the file (starting from 0) is the size in words of the document of index i.
basename.batchi
The i-th batch of occurrences, where i is the batch number appended to the file name. These files are used by SecondPass to produce the inverted index.
basename.properties
A Java property file containing information about the index. Currently, the following keys are generated:
documents
number of documents in the collection;
terms
number of indexed terms;
occurrences
number of words throughout the whole collection;
batches
number of generated batch files;
basename
the basename (see above);
maxdocsize
maximum size of a document in words;
iscasesensitive
a boolean denoting whether the index is case sensitive;
occsperbatch
maximum number of occurrences in each batch;
permutation
name of the permutation file used.
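
As an illustration of how the generated property file might be consumed, the sketch below loads it with java.util.Properties and derives the average document size from the documents and occurrences keys listed above. It assumes the file follows the standard java.util.Properties text format; the class IndexPropertiesSketch and the sample values are hypothetical.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.Properties;

// Hypothetical helper, not part of MG4J.
final class IndexPropertiesSketch {

    /** Loads index properties; for a real index, pass a reader on basename.properties. */
    static Properties load(Reader reader) throws IOException {
        Properties properties = new Properties();
        properties.load(reader);
        return properties;
    }

    /** Average number of words per document: occurrences divided by documents. */
    static double averageDocumentSize(Properties properties) {
        long occurrences = Long.parseLong(properties.getProperty("occurrences"));
        long documents = Long.parseLong(properties.getProperty("documents"));
        return (double) occurrences / documents;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample values, for illustration only.
        String sample = "documents=4\noccurrences=10\nbatches=2\n";
        Properties properties = load(new StringReader(sample));
        System.out.println(averageDocumentSize(properties)); // prints 2.5
    }
}
```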

Since:
0.6
Author:
Sebastiano Vigna

Method Summary
static void main(String[] arg)
           
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

main

public static void main(String[] arg)
                 throws FileNotFoundException,
                        IOException,
                        ClassNotFoundException
Throws:
FileNotFoundException
IOException
ClassNotFoundException