it.unimi.dsi.mg4j.index
Class SkipIndexWriter

java.lang.Object
  extended byit.unimi.dsi.mg4j.index.IndexWriter
      extended byit.unimi.dsi.mg4j.index.SkipIndexWriter
All Implemented Interfaces:
CompressionFlags

public class SkipIndexWriter
extends IndexWriter

Provides facilities to write skip inverted indices, that is, inverted indices with an additional skip structure. A skip inverted index allows one to skip ahead when reading inverted lists. More specifically, when reading the inverted list relative to a certain term, one may want to decide to skip all document records that concern documents with pointer less than a given integer. In a normal inverted index this is impossible: one would have to read all document records sequentially (by the way, this is what IndexReader.skipTo(int) does).

Skip structure

Before reading this section, please consult it.unimi.dsi.mg4j.index to understand what is the usual structure of an inverted index.

A skip inverted index is just like a usual inverted index, with the only difference that some of the document records contain additional information that allow to skip document records without actually reading them.

The classical approach to skipping is to embed pointers that allow to skip a fixed amount of document records. MG4J, instead, embeds a perfect, static skip list in every inverted list using a new method to represent and compress skip towers. As a result, MG4J can embed tall towers with small quanta (e.g., quantum 64 and height 8) with a very small growth of the index size. The details of the method will be presented in a forthcoming paper.

The document records that contain a skip tower are called skip document records; skip towers are stored within a document record just after the document pointer and just before the document data.

The construction of a skip inverted index depends on two positive integers, q and h, called the quantum and the height. For sake of simplicity, let H=2h and w=Hq.

Consider the records in a certain inverted list; these records are (logically) grouped from the left into blocks of width w (the last block, of course, may contain less than w records; see below). Hence, the records within a certain block are indexed from 0 to w-1. The skip records are only those whose index (within the block) is of the form kq, where k ranges from 0 to H-1; the number k is called the skip index of the skip record (within the block that contains it).

The skip tower of the skip record with skip index k contains LSB(k)+1 entries, where LSB(k) is the index of the least significant bit of k, a.k.a. the number of trailing zeroes in the h-long binary expansion of k. Hence, the 0-th skip tower has h+1 entries, the H/2-th skip tower has h entries, and so on. Skip entries within each skip tower have an index, called the skip tower index: if a skip tower contains s+1 entries, their skip tower index will range from 0 to s (inclusive).

Each entry within a skip tower is really made of two numbers: the skip pointer and the skip amount. Consider the i-th entry of the k-th skip tower (the one contained in the record kq within the block); let P be the skip pointer and A be the skip amount.

The entry we are considering points to another skip record: the one whose skip index is k+2i. For example, the last entry (the entry with skip tower index h) in the skip tower of skip index 0 points to the H-th skip record within the block (that is: the first record in the following block, if any). The next-to-last entry (skip tower index h-1) in the same skip tower points to the H/2-th skip record within the block (that is: the record in the middle of the block). The entry number 0 in the same skip tower points to the skip record with skip index 1.

Now, P is the document pointer of the pointed skip document record, and A is the number of bits to skip from the end of the skip tower to the start of the pointed (the k+2i-th) tower.

Defective blocks

The last block within an inverted list may contain less than w records, and in this case we say that it is defective. If a defective block contains only C < w records, the number of entries in the k-th skip tower is not LSB(k)+1, but rather 1+min(LSB(k),MSB(⌊C/q⌋ - k)), where MSB(k) is the most significant bit of k, and moreover MSB(0) = -1 by definition. In particular, defective quanta have an empty tower.

Class usage and statistics

This class is used in the same way as IndexWriter. It further collects more statistics concerning the compression of skips.

Since:
0.6
Author:
Paolo Boldi, Sebastiano Vigna

Nested Class Summary
static class SkipIndexWriter.TowerData
          A structure maintaining statistical data about tower construction.
 
Field Summary
 long bitsForEntryBitLengths
          The number of bits written for entry lenghts.
 long bitsForQuantumBitLengths
          The number of bits written for quantum lengths.
 long numberOfBlocks
          The number of written blocks.
 int prevEntryBitLength
          An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.
 int prevQuantumBitLength
          An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.
 SkipIndexWriter.TowerData towerData
          The sum of all tower data computed so far.
 
Fields inherited from class it.unimi.dsi.mg4j.index.IndexWriter
b, BEFORE_COUNT, BEFORE_DOCUMENT_RECORD, BEFORE_FREQUENCY, BEFORE_INVERTED_LIST, BEFORE_POINTER, BEFORE_POSITIONS, bitsForAlignment, bitsForCounts, bitsForFrequencies, bitsForPointers, bitsForPositions, currentDocument, currentTerm, FIRST_UNUSED_STATE, flags, frequency, hasCounts, hasPositions, lastDocument, log2b, maxDocPos, numberOfDocuments, obs, pointerCoding, relativeFrequency, state, writtenDocuments
 
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA
 
Constructor Summary
SkipIndexWriter(OutputBitStream obs, int N, int flags, int q, int h)
          Creates a new skip index writer, with the specified underlying OutputBitStream, without an associated offset bit stream.
SkipIndexWriter(OutputBitStream obs, OutputBitStream offset, int N, int flags, int q, int h)
          Creates a new skip index writer, with the specified underlying OutputBitStream.
 
Method Summary
 void close()
          Closes the underlying bit stream.
 OutputBitStream newDocumentRecord()
          Starts a new document record.
 OutputBitStream newDocumentRecord(int pointer)
          Starts a new document record.
 long newInvertedList()
          Starts a new inverted list.
 long newInvertedList(int frequency)
          Deprecated. Use newInvertedList() followed by writeFrequency(int).
 Properties properties()
          This method should only be called after close().
 int writeDocumentPointer(OutputBitStream out, int pointer)
          Writes a document pointer.
 int writeFrequency(int frequency)
          Writes the frequency.
 long writtenBits()
          This method returns the overall number of bits written onto the underlying stream.
 
Methods inherited from class it.unimi.dsi.mg4j.index.IndexWriter
flags2String, writeDocumentPositions, writePositionCount
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

towerData

public SkipIndexWriter.TowerData towerData
The sum of all tower data computed so far.


bitsForQuantumBitLengths

public long bitsForQuantumBitLengths
The number of bits written for quantum lengths.


bitsForEntryBitLengths

public long bitsForEntryBitLengths
The number of bits written for entry lenghts.


numberOfBlocks

public long numberOfBlocks
The number of written blocks.


prevEntryBitLength

public int prevEntryBitLength
An estimate on the number of bits occupied per tower entry in the last written cache, or -1 if no cache has been written for the current inverted list.


prevQuantumBitLength

public int prevQuantumBitLength
An estimate on the number of bits occupied per quantum in the last written cache, or -1 if no cache has been written for the current inverted list.

Constructor Detail

SkipIndexWriter

public SkipIndexWriter(OutputBitStream obs,
                       OutputBitStream offset,
                       int N,
                       int flags,
                       int q,
                       int h)
Creates a new skip index writer, with the specified underlying OutputBitStream.

Parameters:
obs - the underlying output bit stream.
offset - the offset bit stream.
N - the number of documents in the collection to be indexed.
flags - a bit mask setting the coding techniques to be used (see the IndexWriter introduction).
q - the cache contains at most 2h document records.
h - the maximum height of a skip tower.

SkipIndexWriter

public SkipIndexWriter(OutputBitStream obs,
                       int N,
                       int flags,
                       int q,
                       int h)
Creates a new skip index writer, with the specified underlying OutputBitStream, without an associated offset bit stream.

Parameters:
obs - the underlying output bit stream.
N - the number of documents in the collection to be indexed.
flags - a bit mask setting the coding techniques to be used (see the IndexWriter introduction).
q - the cache contains at most 2h document records.
h - the maximum height of a skip tower.
Method Detail

newInvertedList

public long newInvertedList()
                     throws IOException,
                            IllegalStateException
Description copied from class: IndexWriter
Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream; then, the stream is byte-aligned.

Overrides:
newInvertedList in class IndexWriter
Returns:
the position (in bytes) of the underlying bit stream where the new inverted list starts.
Throws:
IllegalStateException - if too few records were written for the previous inverted list.
IOException - if an exception is thrown by the underlying stream.

writeFrequency

public int writeFrequency(int frequency)
                   throws IOException,
                          IllegalStateException
Description copied from class: IndexWriter
Writes the frequency.

Overrides:
writeFrequency in class IndexWriter
Parameters:
frequency - the (positive) number of document records that this inverted list will contain.
Returns:
the number of bits written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException

newInvertedList

public long newInvertedList(int frequency)
                     throws IOException,
                            IllegalStateException
Deprecated. Use newInvertedList() followed by writeFrequency(int).

Starts a new inverted list. The previous inverted list, if any, is actually written to the underlying bit stream; then, the stream is byte-aligned, and frequency is written onto the underlying stream using the specified coding (the number is decremented by one first, as it cannot be 0).

Overrides:
newInvertedList in class IndexWriter
Parameters:
frequency - the number of document records that this inverted list will contain.
Returns:
the position (in bytes) of the underlying bit stream where the new inverted list starts.
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if too few records were written for the previous inverted list.

newDocumentRecord

public OutputBitStream newDocumentRecord()
                                  throws IOException,
                                         IllegalStateException
Description copied from class: IndexWriter
Starts a new document record.

This method must be called exactly n times after a call to newInvertedList(n).

Overrides:
newDocumentRecord in class IndexWriter
Returns:
the output bit stream where the next document record data should be written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException - if too many records were written for the current inverted list, or if there is no current inverted list.

writeDocumentPointer

public int writeDocumentPointer(OutputBitStream out,
                                int pointer)
                         throws IOException,
                                IllegalStateException
Description copied from class: IndexWriter
Writes a document pointer.

This method must be called immediately after IndexWriter.newDocumentRecord().

Overrides:
writeDocumentPointer in class IndexWriter
Parameters:
out - the output bit stream where the pointer will be written.
pointer - the document pointer.
Returns:
the number of bits written.
Throws:
IOException - if an exception is thrown by the underlying stream.
IllegalStateException

newDocumentRecord

public OutputBitStream newDocumentRecord(int pointer)
                                  throws IOException,
                                         IllegalStateException
Deprecated. Use newDocumentRecord() followed by writeDocumentPointer(OutputBitStream, int).

Description copied from class: IndexWriter
Starts a new document record. This method should be called exactly n times after a call to newInvertedList(n).

Overrides:
newDocumentRecord in class IndexWriter
Parameters:
pointer - the document pointer (will be written to the bit stream).
Returns:
the OutputBitStream where the next document record should be written.
Throws:
IllegalStateException - if too many records were written for the current inverted list, or if there is no current inverted list.
IOException - if an exception is thrown by the underlying stream.

close

public void close()
           throws IOException,
                  IllegalStateException
Closes the underlying bit stream.

Overrides:
close in class IndexWriter
Throws:
IOException - if an exception is thrown by the underlying stream
IllegalStateException - if too few records were written for the previous inverted list.

writtenBits

public long writtenBits()
This method returns the overall number of bits written onto the underlying stream.

Overrides:
writtenBits in class IndexWriter
Returns:
the number of bits written, according to the variables keeping statistical records.

properties

public Properties properties()
Description copied from class: IndexWriter
This method should only be called after IndexWriter.close(). It returns the values for all the properties specified in IndexProperties.

Overrides:
properties in class IndexWriter
Returns:
the properties specified in IndexProperties along with their values.