pitt.search.semanticvectors
Class TermVectorsFromLucene
java.lang.Object
pitt.search.semanticvectors.TermVectorsFromLucene
- All Implemented Interfaces:
- VectorStore
public class TermVectorsFromLucene
- extends java.lang.Object
- implements VectorStore
An implementation of VectorStore that creates term vectors by
iterating through all the terms in a Lucene index. Uses a sparse
representation for the basic document vectors, which saves
considerable space for collections with many individual documents.
- Author:
- Dominic Widdows, Trevor Cohen.
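The sparse representation mentioned in the class description can be illustrated with a minimal sketch. It assumes a sparse basic vector is stored as a list of signed positions, one per non-zero entry, which is one common way to do it and not necessarily how this class does it internally. The class name SparseElementalVectorSketch and its methods are hypothetical and are not part of the SemanticVectors API.

```java
import java.util.Random;

/** Hypothetical helper for illustration only; not part of SemanticVectors. */
class SparseElementalVectorSketch {

    /**
     * Builds a sparse vector with seedLength non-zero entries.
     * Each returned value encodes one entry: (i + 1) means +1 at position i,
     * -(i + 1) means -1 at position i. Half of the entries are +1 and the
     * rest are -1, so an even seedLength gives the same number of each.
     */
    static int[] generate(int dimension, int seedLength, Random random) {
        if (seedLength > dimension) {
            throw new IllegalArgumentException("seedLength must not exceed dimension");
        }
        boolean[] occupied = new boolean[dimension];
        int[] sparse = new int[seedLength];
        int filled = 0;
        while (filled < seedLength) {
            int position = random.nextInt(dimension);
            if (occupied[position]) {
                continue; // position already used, draw again
            }
            occupied[position] = true;
            int sign = (filled < seedLength / 2) ? 1 : -1;
            sparse[filled] = sign * (position + 1);
            filled++;
        }
        return sparse;
    }

    /** Expands the sparse encoding into a dense float array. */
    static float[] toDense(int[] sparse, int dimension) {
        float[] dense = new float[dimension];
        for (int entry : sparse) {
            if (entry > 0) {
                dense[entry - 1] = 1f;
            } else {
                dense[-entry - 1] = -1f;
            }
        }
        return dense;
    }

    public static void main(String[] args) {
        int[] sparse = generate(200, 20, new Random(42));
        float[] dense = toDense(sparse, 200);
        System.out.println("stored entries: " + sparse.length
            + ", dense length: " + dense.length);
    }
}
```

In this sketch a 200-dimensional vector with seedLength 20 needs only 20 stored integers instead of 200 floats, which is where the space saving for large document collections comes from.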
Constructor Summary
TermVectorsFromLucene(java.lang.String indexDir,
int seedLength,
int minFreq,
int nonAlphabet,
java.lang.String[] fieldsToIndex)
This constructor generates an elemental vector for each term.
TermVectorsFromLucene(java.lang.String indexDir,
int seedLength,
int minFreq,
int nonAlphabet,
VectorStore basicDocVectors,
java.lang.String[] fieldsToIndex)
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
TermVectorsFromLucene
public TermVectorsFromLucene(java.lang.String indexDir,
int seedLength,
int minFreq,
int nonAlphabet,
VectorStore basicDocVectors,
java.lang.String[] fieldsToIndex)
throws java.io.IOException,
java.lang.RuntimeException
- Parameters:
indexDir - Directory containing the Lucene index.
seedLength - Number of +1 or -1 entries in basic vectors. Should be even to give the same number of each.
minFreq - The minimum term frequency for a term to be indexed.
nonAlphabet - The number of non-alphabetic characters permitted.
basicDocVectors - The store of basic document vectors. Null is an acceptable value, in which case the constructor will build this table. If non-null, the identifiers must correspond to the Lucene doc numbers.
fieldsToIndex - These fields will be indexed. If null, all fields will be indexed.
- Throws:
java.io.IOException
java.lang.RuntimeException
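The minFreq and nonAlphabet parameters both restrict which terms receive a vector. The following sketch shows one plausible reading of them, assuming a term is kept when its frequency is at least minFreq and it contains at most nonAlphabet non-letter characters. The exact rules applied by this class are not stated in this documentation, so the class TermFilterSketch and its accept method are hypothetical.

```java
/** Hypothetical term filter for illustration only; not part of SemanticVectors. */
class TermFilterSketch {

    /**
     * Returns true if the term passes both thresholds (assumed semantics):
     * frequency is at least minFreq, and the number of characters that are
     * not letters is at most nonAlphabet.
     */
    static boolean accept(String term, int frequency, int minFreq, int nonAlphabet) {
        if (frequency < minFreq) {
            return false;
        }
        int nonLetters = 0;
        for (int i = 0; i < term.length(); i++) {
            if (!Character.isLetter(term.charAt(i))) {
                nonLetters++;
            }
        }
        return nonLetters <= nonAlphabet;
    }

    public static void main(String[] args) {
        System.out.println(accept("vector", 5, 2, 0));
        System.out.println(accept("x86", 5, 2, 0));
    }
}
```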
TermVectorsFromLucene
public TermVectorsFromLucene(java.lang.String indexDir,
int seedLength,
int minFreq,
int nonAlphabet,
java.lang.String[] fieldsToIndex)
throws java.io.IOException,
java.lang.RuntimeException
- This constructor generates an elemental vector for each term. These elemental (random index) vectors will
be used to construct document vectors, a procedure we have called term-based reflective random indexing.
- Parameters:
indexDir - The directory of the Lucene index.
seedLength - Number of +1 or -1 entries in basic vectors. Should be even to give the same number of each.
minFreq - The minimum term frequency for a term to be indexed.
nonAlphabet - The number of non-alphabetic characters permitted.
fieldsToIndex - The fields to be indexed (most commonly "contents").
- Throws:
java.io.IOException
java.lang.RuntimeException
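The idea of building document vectors from elemental term vectors can be sketched as follows. This is a simplified illustration that assumes a document vector is the frequency-weighted sum of the elemental vectors of the terms it contains. The weighting and normalization actually used by the library are not described here, and the class ReflectiveDocVectorSketch is hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of SemanticVectors. */
class ReflectiveDocVectorSketch {

    /**
     * Sums the elemental vectors of a document's terms, weighting each
     * by the term's frequency in that document (assumed weighting).
     * Terms without an elemental vector are skipped.
     */
    static float[] buildDocVector(Map<String, Integer> termFrequencies,
                                  Map<String, float[]> elementalVectors,
                                  int dimension) {
        float[] docVector = new float[dimension];
        for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
            float[] termVector = elementalVectors.get(entry.getKey());
            if (termVector == null) {
                continue; // term was not indexed
            }
            for (int i = 0; i < dimension; i++) {
                docVector[i] += entry.getValue() * termVector[i];
            }
        }
        return docVector;
    }

    public static void main(String[] args) {
        Map<String, float[]> elemental = new HashMap<>();
        elemental.put("lucene", new float[] {1f, 0f, -1f, 0f});
        elemental.put("index", new float[] {0f, 1f, 0f, -1f});

        Map<String, Integer> frequencies = new HashMap<>();
        frequencies.put("lucene", 2);
        frequencies.put("index", 1);

        System.out.println(Arrays.toString(buildDocVector(frequencies, elemental, 4)));
    }
}
```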
getBasicDocVectors
public VectorStore getBasicDocVectors()
- Returns:
- The object's basicDocVectors.
getIndexReader
public org.apache.lucene.index.IndexReader getIndexReader()
- Returns:
- The object's indexReader.
getFieldsToIndex
public java.lang.String[] getFieldsToIndex()
- Returns:
- The object's list of Lucene fields to index.
getVector
public float[] getVector(java.lang.Object term)
- Specified by:
getVector in interface VectorStore
- Parameters:
term - the object whose vector you want to look up
- Returns:
- a vector (of floats)
getAllVectors
public java.util.Enumeration getAllVectors()
- Specified by:
getAllVectors in interface VectorStore
- Returns:
- an enumeration of all the object vectors in the store.
getNumVectors
public int getNumVectors()
- Specified by:
getNumVectors in interface VectorStore
- Returns:
- a count of the number of vectors in the store.
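The three VectorStore methods above (getVector, getAllVectors, getNumVectors) are typically used together: look up a single vector by key, or walk the whole store. The sketch below shows that access pattern against a hypothetical stand-in interface, MiniVectorStore, so that it runs without Lucene or SemanticVectors on the classpath. The element type returned by the enumeration is an assumption and may differ from the real interface.

```java
import java.util.Collections;
import java.util.Enumeration;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration only; not part of SemanticVectors. */
class VectorStoreUsageSketch {

    /** Stand-in that mirrors the three documented method names. */
    interface MiniVectorStore {
        float[] getVector(Object term);
        Enumeration<Map.Entry<Object, float[]>> getAllVectors();
        int getNumVectors();
    }

    /** Simple in-memory implementation backed by a map. */
    static class InMemoryStore implements MiniVectorStore {
        private final Map<Object, float[]> vectors = new LinkedHashMap<>();

        void put(Object term, float[] vector) {
            vectors.put(term, vector);
        }

        public float[] getVector(Object term) {
            return vectors.get(term); // null if the term is absent
        }

        public Enumeration<Map.Entry<Object, float[]>> getAllVectors() {
            return Collections.enumeration(vectors.entrySet());
        }

        public int getNumVectors() {
            return vectors.size();
        }
    }

    /** Walks the enumeration and counts the vectors it yields. */
    static int countByEnumeration(MiniVectorStore store) {
        int count = 0;
        Enumeration<Map.Entry<Object, float[]>> all = store.getAllVectors();
        while (all.hasMoreElements()) {
            all.nextElement();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        InMemoryStore store = new InMemoryStore();
        store.put("lucene", new float[] {1f, 0f});
        store.put("index", new float[] {0f, -1f});
        System.out.println("vectors: " + store.getNumVectors()
            + ", enumerated: " + countByEnumeration(store));
    }
}
```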