|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |
java.lang.Objectit.unimi.dsi.mg4j.index.IndexWriter
Provides facilities to write an inverted index.
This class is used to write inverted lists in sequential order, as follows:
newInvertedList(int)
should
be called, specifying the frequency;
newDocumentRecord(int)
should
be called: note that, after calling newInvertedList(n)
the method
newDocumentRecord(int)
must be called exactly n
times.
newDocumentRecord(int)
writes the (suitable coding of) the pointer,
and returns an OutputBitStream
that may be used to write the document data.
Note that there is no guarantee that the returned OutputBitStream
coincides with the
underlying bit stream. Moreover, there is no guarantee as to when the bits will be actually
written on the underlying stream, except that when starting a new inverted list, the previous
inverted list, if any, will be written onto the underlying stream.
The method writeDocumentPositions(OutputBitStream,int[],int,int,int)
is a facility
that writes, on the given OutputBitStream
, the list of positions
using the specified compression scheme: the method should be called, if needed,
just after a newDocumentRecord(int)
call, on the
same output bit stream returned by the latter.
As explained above, this class needs to know the compression method that it should use to write the frequency, the document pointer, and the positions; moreover, for positions, the method writes the number of positions first (called the count) followed by the sequence of positions. Hence, there are four different compression methods that one should choose (the method used for frequencies, pointers, counts and positions).
Such methods can be chosen when creating an instance of IndexWriter
using
the available compression flags specified in CompressionFlags
.
Every inverted index may have an associated OutputBitStream
of
offsets: this file contains T+1
integers, where T
is the number of inverted lists (i.e., the number of terms), and the
i
-th entry is a suitable coding of the position in bytes where
the i
-th inverted list starts (the last entry is actually the
length, in bytes, of the inverted index file itself). The coding used for
the offset stream is a γ code of the difference between the current
position and the last position.
This class performs some bookkeeping of the number of bits written
the output stream for several purposes: see, for instance, bitsForPositions
,
bitsForAlignment
, etc.
Field Summary | |
protected int |
b
The parameter b for Golomb coding of pointers. |
protected static int |
BEFORE_COUNT
This value of state can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record. |
protected static int |
BEFORE_DOCUMENT_RECORD
This value of state means that we are ready to call newDocumentRecord() . |
protected static int |
BEFORE_FREQUENCY
This value of state means that we are positioned at the start of an inverted list,
and we should call writeFrequency(int) . |
protected static int |
BEFORE_INVERTED_LIST
This value of state means that we should call newInvertedList() . |
protected static int |
BEFORE_POINTER
This value of state means that we just started a new document record, and we
should call writeDocumentPointer(OutputBitStream, int) . |
protected static int |
BEFORE_POSITIONS
This value of state can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record. |
long |
bitsForAlignment
The number of bits written for alignment. |
long |
bitsForCounts
The number of bits written for counts. |
long |
bitsForFrequencies
The number of bits written for frequencies. |
long |
bitsForPointers
The number of bits written for document pointers. |
long |
bitsForPositions
The number of bits written for positions. |
protected int |
currentDocument
The current document pointer. |
protected int |
currentTerm
The current term. |
protected static int |
FIRST_UNUSED_STATE
This is the first unused state. |
long |
flags
The coding flags. |
protected int |
frequency
The number of document records that the current inverted list will contain. |
boolean |
hasCounts
Whether this index contains counts. |
boolean |
hasPositions
Whether this index contains positions. |
protected int |
lastDocument
The last document pointer in the current list. |
protected int |
log2b
The parameter log2b for Golomb coding of pointers; it is the most significant bit of b . |
protected int |
maxDocPos
The maximum number of positions in a document record so far. |
protected int |
numberOfDocuments
The number of documents of the collection to be indexed. |
protected OutputBitStream |
obs
The underlying OutputBitStream . |
protected int |
pointerCoding
The coding for pointers. |
protected double |
relativeFrequency
The number of document records that the current inverted list will contain, divided by the number of documents. |
protected int |
state
The current state of the writer. |
protected int |
writtenDocuments
The number of document records already written for the current inverted list. |
Fields inherited from interface it.unimi.dsi.mg4j.index.CompressionFlags |
ARITH, CODING_NAME, COUNTS_DEFAULT, COUNTS_DELTA, COUNTS_GAMMA, COUNTS_SHIFT, DELTA, FREQUENCIES_DEFAULT, FREQUENCIES_DELTA, FREQUENCIES_GAMMA, FREQUENCIES_SHIFT, GAMMA, GOLOMB, INTERP, NIBBLE, NO_COUNTS, NO_POSITIONS, NONE, POINTERS_DEFAULT, POINTERS_DELTA, POINTERS_GAMMA, POINTERS_GOLOMB, POINTERS_SHIFT, POSITIONS_ARITH, POSITIONS_DEFAULT, POSITIONS_DELTA, POSITIONS_GAMMA, POSITIONS_GOLOMB, POSITIONS_INTERP, POSITIONS_SHIFT, POSITIONS_SKEWED_GOLOMB, SKEWED_GOLOMB, UNARY, ZETA |
Constructor Summary | |
IndexWriter(CharSequence basename,
int numberOfDocuments,
boolean writeOffsets,
long flags)
Creates a new index writer, with the specified basename. |
|
IndexWriter(OutputBitStream obs,
int numberOfDocuments,
int flags)
Creates a new index writer, with the specified underlying OutputBitStream ,
without an associated offset bit stream. |
|
IndexWriter(OutputBitStream obs,
OutputBitStream offset,
int numberOfDocuments,
long flags)
Creates a new index writer, with the specified underlying OutputBitStream . |
Method Summary | |
void |
close()
Closes the underlying bit stream. |
static String |
flags2String(long flags)
This method unpacks the flags, expressed as a long, and returns a string with the compression flags. |
OutputBitStream |
newDocumentRecord()
Starts a new document record. |
OutputBitStream |
newDocumentRecord(int pointer)
Deprecated. Use newDocumentRecord() followed by writeDocumentPointer(OutputBitStream, int) . |
long |
newInvertedList()
Starts a new inverted list. |
long |
newInvertedList(int frequency)
Deprecated. Use newInvertedList() followed by writeFrequency(int) . |
Properties |
properties()
This method should only be called after close() . |
int |
writeDocumentPointer(OutputBitStream out,
int pointer)
Writes a document pointer. |
int |
writeDocumentPositions(OutputBitStream out,
int[] occ,
int offset,
int len,
int docSize)
Writes the positions of the occurrences of the current term in the current document to the given OutputBitStream . |
int |
writeFrequency(int frequency)
Writes the frequency. |
int |
writePositionCount(OutputBitStream out,
int count)
Writes the count of the occurrences of the current term in the current document to the given OutputBitStream . |
long |
writtenBits()
This method returns the overall number of bits written onto the underlying stream. |
Methods inherited from class java.lang.Object |
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Field Detail |
protected static final int BEFORE_INVERTED_LIST
state
means that we should call newInvertedList()
.
protected static final int BEFORE_FREQUENCY
state
means that we are positioned at the start of an inverted list,
and we should call writeFrequency(int)
.
protected static final int BEFORE_DOCUMENT_RECORD
state
means that we are ready to call newDocumentRecord()
.
protected static final int BEFORE_POINTER
state
means that we just started a new document record, and we
should call writeDocumentPointer(OutputBitStream, int)
.
protected static final int BEFORE_COUNT
state
can be assumed only in indices that contain counts; it
means that we are positioned just before the count for the current document record.
protected static final int BEFORE_POSITIONS
state
can be assumed only in indices that contain document positions;
it means that we are positioned just before the position list of the current document record.
protected static final int FIRST_UNUSED_STATE
protected OutputBitStream obs
OutputBitStream
.
protected int state
protected int numberOfDocuments
protected int frequency
protected double relativeFrequency
protected int writtenDocuments
protected int currentDocument
protected int currentTerm
protected int lastDocument
protected int b
b
for Golomb coding of pointers.
protected int log2b
log2b
for Golomb coding of pointers; it is the most significant bit of b
.
protected int maxDocPos
public long bitsForAlignment
public long bitsForFrequencies
public long bitsForPointers
public long bitsForCounts
public long bitsForPositions
public long flags
protected int pointerCoding
public final boolean hasCounts
public final boolean hasPositions
Constructor Detail |
public IndexWriter(CharSequence basename, int numberOfDocuments, boolean writeOffsets, long flags) throws IOException
writeOffsets
, also an offset file will be produced (stemmed with .offsets).
When close()
will be called, the property file will also be produced (stemmed with .properties),
or enriched if it already exists.
basename
- the basename.numberOfDocuments
- the number of documents in the collection to be indexed.writeOffsets
- if true
, the offset file will also be produced.flags
- a bit mask setting the coding techniques to be used (see the IndexWriter
introduction).public IndexWriter(OutputBitStream obs, OutputBitStream offset, int numberOfDocuments, long flags)
OutputBitStream
.
obs
- the underlying output bit stream.offset
- the offset bit stream.numberOfDocuments
- the number of documents in the collection to be indexed.flags
- a bit mask setting the coding techniques to be used (see the IndexWriter
introduction).public IndexWriter(OutputBitStream obs, int numberOfDocuments, int flags)
OutputBitStream
,
without an associated offset bit stream.
obs
- the underlying output bit stream.numberOfDocuments
- the number of documents in the collection to be indexed.flags
- a bit mask setting the coding techniques to be used (see the IndexWriter
introduction).Method Detail |
public long newInvertedList() throws IOException, IllegalStateException
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
- if too few records were written for the previous inverted
list.public int writeFrequency(int frequency) throws IOException
frequency
- the (positive) number of document records that this inverted list will contain.
IOException
- if an exception is thrown by the underlying stream.public long newInvertedList(int frequency) throws IOException, IllegalStateException
newInvertedList()
followed by writeFrequency(int)
.
frequency
is written onto the underlying stream.
frequency
- the number of document records that this inverted list will contain.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
- if too few records were written for the previous inverted
list.public OutputBitStream newDocumentRecord() throws IOException, IllegalStateException
This method must be called exactly n
times
after a call to newInvertedList(n)
.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
- if too many records were written for the current inverted list,
or if there is no current inverted list.public int writeDocumentPointer(OutputBitStream out, int pointer) throws IOException, IllegalStateException
This method must be called immediately after newDocumentRecord()
.
out
- the output bit stream where the pointer will be written.pointer
- the document pointer.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
public OutputBitStream newDocumentRecord(int pointer) throws IOException, IllegalStateException
newDocumentRecord()
followed by writeDocumentPointer(OutputBitStream, int)
.
n
times
after a call to newInvertedList(n)
.
pointer
- the document pointer (will be written to the bit stream).
OutputBitStream
where the next document record
should be written.
IOException
- if an exception is thrown by the underlying stream.
IllegalStateException
- if too many records were written for the current inverted list,
or if there is no current inverted list.public void close() throws IOException, IllegalStateException
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if too few records were written for the previous inverted
list.public int writePositionCount(OutputBitStream out, int count) throws IOException, IllegalStateException
OutputBitStream
.
out
- the output stream where the occurrences should be written.count
- the count.
IOException
IllegalStateException
public int writeDocumentPositions(OutputBitStream out, int[] occ, int offset, int len, int docSize) throws IOException, IllegalStateException
OutputBitStream
.
First the count (number of positions) is written using the specified coding. Then,
the specified vector of occurrences is written using the specified coding.
out
- the output stream where the occurrences should be written.occ
- the position vector (a sequence of strictly increasing natural numbers).offset
- the first valid entry in occ
.len
- the number of valid entries in occ
.docSize
- the size of the current document (only for Golomb and interpolative coding; you can safely pass 0 otherwise).
IOException
- if an exception is thrown by the underlying stream
IllegalStateException
- if there is no current inverted list.public long writtenBits()
public Properties properties()
close()
. It returns the values for all the properties
specified in IndexProperties
.
IndexProperties
along with their values.public static String flags2String(long flags)
CompressionFlags
.
flags
- the flags.
|
||||||||||
PREV CLASS NEXT CLASS | FRAMES NO FRAMES | |||||||||
SUMMARY: NESTED | FIELD | CONSTR | METHOD | DETAIL: FIELD | CONSTR | METHOD |