it.unimi.dsi.mg4j.index
Class IndexReader

java.lang.Object
  extended byit.unimi.dsi.mg4j.index.IndexReader
All Implemented Interfaces:
CompressionFlags
Direct Known Subclasses:
SkipIndexReader

public class IndexReader
extends Object
implements CompressionFlags

Provides facilities to read directly an inverted index.

An inverted index is made by a sequence of inverted lists (one inverted list for each term). Inverted lists are made by document records. For more information on the structure of an inverted index, see the documentation of it.unimi.dsi.mg4j.index.

Note that Index provides a much easier interface to index access. Instead of creating an IndexReader explicitly, you can use Index.getReader().

Class usage

The simplest way of getting data from an IndexReader is to ask for the documents matching a term.

Otherwise, you can access the index directly. The method readFrequency() should be called only when the underlying stream is positioned on the start of an inverted list; in such case, it returns the frequency of the corresponding term. After that, you can read the document records sequentially: you have to read exactly as many document records as specified in the frequency, at the beginning of the list.

For each document record, you call (possibly: many times) readDocumentPointer() to read the document pointer, and then readDocumentPositions(int[]) (just once) to read the positions. After reading the last record, you will be positioned on the beginning of the next inverted list, if any.

The skipTo(int) method can be called either after or before reading a document pointer, and allows one to skip ahead in the inverted list. More precisely, assume that you are currently on a record with pointer p, and you call skipTo(n); if there is a document record with pointer ≥ n in the current inverted list, you will be positioned on the beginning of the first such record following the current one (in particular, if pn, the reader will not move); if there is no document record satisfying the condition, you will be positioned on the beginning of the next inverted list (or at the end of the index, if this is the last inverted list).

Using offsets to position the reader

Hence, as explained above, the index is read sequentially, one inverted list after another, one document record after another. The skipTo(int) method can be used to jump ahead (but not backwards!) to another document record inside a single inverted list.

As explained in the IndexWriter documentation, an inverted index may be created with an associated offset bit stream: the offset bit stream stores the positions where each inverted list starts inside the index. You can read an offset bit stream using the Index.readOffsets(InputBitStream, int, ProgressMeter) static method: this method reads the offset bit stream and returns its content as a list.

Such a list may be passed to the IndexReader constructor: in this case, you can use the position(t) method to position the index reader on the beginning of the inverted list relative to a the t-th term. In order to do so, two conditions must be satisfied.

The position(int) method may be called at any time.

Implementation details

After calling readFrequency(), the reader is at the beginning of the first document record. We assume that this is an invariant: the reader is always at the beginning of a document record; the variable numberOfDocumentRecord keeps track of the number of the document record: this ranges from 0 to the frequency (when equal to the frequency, we are not really at the beginning of a document record but rather at the beginning of the next inverted list, or at the end of the file).

Since:
0.6
Author:
Paolo Boldi, Sebastiano Vigna

Field Summary
protected  int b
          The parameter b for Golomb coding of pointers.
protected static int BEFORE_COUNT
          This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.
protected static int BEFORE_FREQUENCY
          This value of state means that we are positioned at the start of an inverted list (just before the frequency).
protected static int BEFORE_POINTER
          This value of state means that we are at the start of a new document record, unless we already read all documents (i.e., numberOfDocumentRecord == frequency), in which case we are at the end of the inverted list, and endOfList() is true.
protected static int BEFORE_POSITIONS
          This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.
protected  int count
          The current count (if this index contains counts).
protected  int currentDocument
          The last document pointer in the current list, or -1 if we just read the frequency.
protected static int FIRST_UNUSED_STATE
          This is the first unused state.
protected  int frequency
          The current frequency.
protected  InputBitStream ibs
          The underlying input bit stream.
protected  Index index
          The reference index.
protected  int log2b
          The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.
protected  int numberOfDocumentRecord
          The number of the document record we are going to read inside the current inverted list.
protected  double relativeFrequency
          The number of document records that the current inverted list will contain, divided by the number of documents.
protected  int state
          This variable tracks the current state of the reader.
protected  int term
          The current term.
 
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA
 
Constructor Summary
IndexReader(Index index)
          Creates a new index reader, with the specified underlying Index and default buffer size.
IndexReader(Index index, InputBitStream ibs)
          Creates a new index reader, with the specified underlying Index and input bit stream.
IndexReader(Index index, int bufferSize)
          Creates a new index reader, with the specified underlying Index.
IndexReader(InputBitStream ibs, int N, long flags)
          Deprecated. Please use Index.getReader() or IndexReader(Index).
IndexReader(InputBitStream ibs, LongList offsets, IntList sizes, int N, long flags)
          Deprecated. Please use Index.getReader() or IndexReader(Index).
IndexReader(InputBitStream ibs, LongList offsets, int N, long flags)
          Deprecated. Please use Index.getReader() or IndexReader(Index).
 
Method Summary
 void close()
          Closes the underlying bit stream.
 int currentDocumentPointer()
          Returns the current document pointer.
 DocumentIterator documents(CharSequence term)
          Returns a document iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.
 DocumentIterator documents(int term)
          Returns a document iterator over the documents containing a term.
 boolean endOfList()
          Returns whether there are no more document records in the current inverted list.
protected  boolean justAfterPointer()
          Returns true if the index is positioned just after a pointer.
 void position(int term)
          Positions the index on the inverted list of a given term.
 long readBits()
          Returns the number of bits read from the last call to position(int).
 int readDocumentPointer()
          Reads a document pointer from a given InputBitStream.
 int readDocumentPositions(int[] occ)
          Reads the positions of the occurrences of the current term in the current document.
 int readFrequency()
          Reads and returns the frequency (number of documents where the term appears).
 int readPositionCount()
          Reads the position count for the current document.
 boolean skipTo(int p)
          Skips to the first document record whose document pointer is greater than or equal to the given value.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

index

protected final Index index
The reference index.


ibs

protected InputBitStream ibs
The underlying input bit stream.


b

protected int b
The parameter b for Golomb coding of pointers.


log2b

protected int log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b.


term

protected int term
The current term.


frequency

protected int frequency
The current frequency.


count

protected int count
The current count (if this index contains counts).


relativeFrequency

protected double relativeFrequency
The number of document records that the current inverted list will contain, divided by the number of documents.


currentDocument

protected int currentDocument
The last document pointer in the current list, or -1 if we just read the frequency.


numberOfDocumentRecord

protected int numberOfDocumentRecord
The number of the document record we are going to read inside the current inverted list.


state

protected int state
This variable tracks the current state of the reader.


BEFORE_FREQUENCY

protected static final int BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list (just before the frequency).

See Also:
Constant Field Values

BEFORE_POINTER

protected static final int BEFORE_POINTER
This value of state means that we are at the start of a new document record, unless we already read all documents (i.e., numberOfDocumentRecord == frequency), in which case we are at the end of the inverted list, and endOfList() is true.

See Also:
Constant Field Values

BEFORE_COUNT

protected static final int BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it means that we are positioned just before the count for the current document record.

See Also:
Constant Field Values

BEFORE_POSITIONS

protected static final int BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions; it means that we are positioned just before the position list of the current document record.

See Also:
Constant Field Values

FIRST_UNUSED_STATE

protected static final int FIRST_UNUSED_STATE
This is the first unused state. Subclasses may start from this value to define new states.

See Also:
Constant Field Values
Constructor Detail

IndexReader

public IndexReader(Index index)
            throws IOException
Creates a new index reader, with the specified underlying Index and default buffer size.

Parameters:
index - the index.
See Also:
IndexReader(Index, int)

IndexReader

public IndexReader(Index index,
                   int bufferSize)
            throws IOException
Creates a new index reader, with the specified underlying Index.

Parameters:
index - the index.
bufferSize - the size of the buffer of the underlying input bit stream.

IndexReader

public IndexReader(Index index,
                   InputBitStream ibs)
            throws IOException
Creates a new index reader, with the specified underlying Index and input bit stream.

Parameters:
index - the index.
ibs - the underlying bit stream.

IndexReader

public IndexReader(InputBitStream ibs,
                   LongList offsets,
                   IntList sizes,
                   int N,
                   long flags)
            throws IOException
Deprecated. Please use Index.getReader() or IndexReader(Index).

Creates a new index reader, with the specified underlying InputBitStream.

Parameters:
ibs - the underlying input bit stream.
offsets - the offset list; may be null if you do not plan using position(int).
sizes - the size list; may be null if your code does not require it.
N - the number of documents in the collection.
flags - a bit mask setting the coding techniques to be used (see the IndexReader introduction).

IndexReader

public IndexReader(InputBitStream ibs,
                   LongList offsets,
                   int N,
                   long flags)
            throws IOException
Deprecated. Please use Index.getReader() or IndexReader(Index).

Creates a new index reader, with the specified underlying InputBitStream.

Parameters:
ibs - the underlying input bit stream.
N - the number of documents in the collection.
flags - a bit mask setting the coding techniques to be used (see the IndexReader introduction).

IndexReader

public IndexReader(InputBitStream ibs,
                   int N,
                   long flags)
            throws IOException
Deprecated. Please use Index.getReader() or IndexReader(Index).

Creates a new index reader, with the specified underlying InputBitStream.

Parameters:
ibs - the underlying input bit stream.
N - the number of documents in the collection.
flags - a bit mask setting the coding techniques to be used (see the IndexReader introduction).
Method Detail

position

public void position(int term)
              throws IOException
Positions the index on the inverted list of a given term. This method can be called at any time.

Parameters:
term - a term.
Throws:
IOException

endOfList

public boolean endOfList()
Returns whether there are no more document records in the current inverted list.

Returns:
whether there are no more document records in the current inverted list.

readFrequency

public int readFrequency()
                  throws IOException,
                         IllegalStateException
Reads and returns the frequency (number of documents where the term appears). The underlying bit stream must be positioned on the beginning of an inverted list, or at the real end of an inverted list (not only endOfList() must return true, but also all data contained in the last document record must have been read).

After that, you are ready to read the document records.

Returns:
the frequency.
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException

readDocumentPointer

public int readDocumentPointer()
                        throws IOException
Reads a document pointer from a given InputBitStream. This method returns the document pointer; after this, you can use the stream to read the positions.

Throws:
IOException - if an exception is thrown by the underlying stream

currentDocumentPointer

public int currentDocumentPointer()
Returns the current document pointer.

Returns:
the current document pointer, or -1 if no document pointer has been read so far.

justAfterPointer

protected boolean justAfterPointer()
Returns true if the index is positioned just after a pointer.

Returns:
true if the index is positioned just after a pointer.

readPositionCount

public int readPositionCount()
                      throws IOException,
                             NoSuchElementException
Reads the position count for the current document.

Returns:
the count, that is, the number of positions of the current term in the current document.
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if you try to read a count but you have not just read a pointer.
NoSuchElementException - if you try to read beyond the end of an inverted list.

readDocumentPositions

public int readDocumentPositions(int[] occ)
                          throws IOException,
                                 NoSuchElementException
Reads the positions of the occurrences of the current term in the current document.

First the count (number of positions) is read using the specified coding (unless it has already been read using readPositionCount()). Then, the vector of positions is read using the specified coding.

This method can also skip the list of document positions by passing a null vector.

Parameters:
occ - the vector of positions; this vector must be at least as large as the number of occurrences that shall be read; if it is null, positions are discarded.
Returns:
the number of positions (the number of valid entries in occ); if the returned value is negative, occ has not enough space to store positions (the absolute returned value is the number of positions).
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if you try to read positions without known the frequency.
NoSuchElementException - if you try to read beyond the end of an inverted list.

skipTo

public boolean skipTo(int p)
               throws IOException
Skips to the first document record whose document pointer is greater than or equal to the given value.

If this method returns true, then there is one record satisfying the condition, and the index is positioned over it. If this method returns false, then the condition is not satisfied and the index is positioned on the last document record.

In any case, this method leaves the index in the same state as after a call to readDocumentPointer(), unless it is called when endOfList() is true, in which case it does nothing and returns false.

Parameters:
p - a document pointer.
Returns:
true if we found a record whose document pointer is greater than or equal to p.
Throws:
IOException

readBits

public long readBits()
Returns the number of bits read from the last call to position(int).

Returns:
the number of bits read so far.

close

public void close()
           throws IOException,
                  IllegalStateException
Closes the underlying bit stream.

Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if too few records were written for the previous inverted list.

documents

public DocumentIterator documents(int term)
                           throws IOException
Returns a document iterator over the documents containing a term.

Parameters:
term - a term.
Throws:
IOException

documents

public DocumentIterator documents(CharSequence term)
                           throws IOException
Returns a document iterator over the documents containing a term; the term is given explicitly, and the index term map is used, if present.

Parameters:
term - a term (the term will be downcased if the index is case insensitive).
Throws:
IOException - if an exception occurred while accessing the index.
UnsupportedOperationException - if the term map is not available for the underlying index.