6.3. API

6.3.1. Interface elements

A few elements of the interface are specific to Recoll and need an explanation.

udi

A udi (unique document identifier) identifies a document. Because of limitations inside the index engine, its length is restricted (to 200 bytes), which is why a regular URI cannot be used. The structure and contents of the udi are defined by the application and are opaque to the index engine. For example, the internal file system indexer uses the complete document path (file path + internal path), truncated to the maximum length, with the suppressed part replaced by a hash value.
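The truncate-and-hash idea can be sketched as follows. This is only an illustration: the function name make_udi, the "|" separator between file path and internal path, and the choice of MD5 are assumptions made for the example, and the scheme actually used by the Recoll file system indexer may differ.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a bounded-length udi (as bytes) from a file path and an
    optional internal path.

    Hypothetical scheme, for illustration only.
    """
    full = (path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    digest_len = 32  # length of an MD5 hex digest
    keep = max_bytes - digest_len
    head, tail = full[:keep], full[keep:]
    # Replace the suppressed tail by a hash of its contents, so that two
    # long paths sharing the same head still get distinct udis.
    return head + hashlib.md5(tail).hexdigest().encode("ascii")
```

Short paths come back unchanged (plus the separator), while long ones are cut to exactly 200 bytes, the last 32 of which are the hash of the part that was dropped.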

ipath

This data value (set as a field in the Doc object) is stored, along with the URL, but not indexed by Recoll. Its contents are not interpreted, and its use is up to the application. For example, the Recoll internal file system indexer stores the part of the document access path internal to the container file (ipath in this case is a list of subdocument sequential numbers). url and ipath are returned in every search result and permit access to the original document.

Stored and indexed fields

The fields file inside the Recoll configuration defines which document fields are either "indexed" (searchable), "stored" (retrievable with search results), or both.

Data for an external indexer should be stored in a separate index, not the one used by the Recoll internal file system indexer (unless the latter is not used at all). The reason is that the purge pass of the main document indexer would remove all of the other indexer's documents, as they were not seen during indexing. Conversely, the main indexer's documents would probably be a problem for the external indexer's purge operation.
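The purge problem can be illustrated with a toy model. The ToyIndex class below is not Recoll code: it is a hypothetical stand-in that only mimics the "delete what was not seen during this pass" behaviour, to show what happens when two indexers share one index.

```python
class ToyIndex(object):
    """Toy stand-in for an index with a purge pass (not Recoll code)."""

    def __init__(self):
        self.docs = {}     # udi -> signature
        self.seen = set()  # udis touched during the current pass

    def add_or_update(self, udi, sig):
        self.docs[udi] = sig
        self.seen.add(udi)

    def purge(self):
        # Remove every document that was not touched during this pass.
        for udi in list(self.docs):
            if udi not in self.seen:
                del self.docs[udi]
        self.seen = set()


shared = ToyIndex()

# Pass 1: an external indexer (say, for mail) stores one document.
shared.add_or_update("mail:1", "v1")
shared.purge()

# Pass 2: the file system indexer runs on the same index. It never sees
# "mail:1", so its purge pass deletes the external indexer's document.
shared.add_or_update("/home/me/a.txt", "v1")
shared.purge()

remaining = sorted(shared.docs)
print(remaining)
```

After the second pass only the file system document is left, which is why each indexer should own its index.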

6.3.2. Python interface

6.3.2.1. Introduction

Recoll versions after 1.11 define a Python programming interface, both for searching and indexing.

The Python interface is not built by default. It can be found in the source package, under python/recoll. The directory contains the usual setup.py script, which you can use to build and install the module:

        cd recoll-xxx/python/recoll
        python setup.py build
        python setup.py install
     


6.3.2.2. Interface manual

NAME
    recoll - This is an interface to the Recoll full text indexer.

FILE
    /usr/local/lib/python2.5/site-packages/recoll.so

CLASSES
        Db
        Doc
        Query
        SearchData
    
    class Db(__builtin__.object)
     |  Db([confdir=None], [extra_dbs=None], [writable = False])
     |  
     |  A Db object holds a connection to a Recoll index. Use the connect()
     |  function to create one.
     |  confdir specifies a Recoll configuration directory (default: 
     |   $RECOLL_CONFDIR or ~/.recoll).
     |  extra_dbs is a list of external databases (xapian directories)
     |  writable decides if we can index new data through this connection
     |  
     |  Methods defined here:
     |  
     |  
     |  addOrUpdate(...)
     |      addOrUpdate(udi, doc, parent_udi=None) -> None
     |      Add or update index data for a given document
     |      The udi string must define a unique id for the document. It is not
     |      interpreted inside Recoll.
     |      doc is a Doc object.
     |      If parent_udi is set, this is a unique identifier for the
     |      top-level container (e.g. an mbox file).
     |  
     |  delete(...)
     |      delete(udi) -> Bool.
     |      Purge index from all data for udi. If udi matches a container
     |      document, purge all subdocs (docs with a parent_udi matching udi).
     |  
     |  makeDocAbstract(...)
     |      makeDocAbstract(Doc, Query) -> string
     |      Build and return 'keyword-in-context' abstract for document
     |      and query.
     |  
     |  needUpdate(...)
     |      needUpdate(udi, sig) -> Bool.
     |      Check if the index is up to date for the document defined by udi,
     |      having the current signature sig.
     |  
     |  purge(...)
     |      purge() -> Bool.
     |      Delete all documents that were not touched during the just finished
     |      indexing pass (since open-for-write). These are the documents for
     |      which the needUpdate() call was not performed, indicating that they
     |      no longer exist in the primary storage system.
     |  
     |  query(...)
     |      query() -> Query. Return a new, blank query object for this index.
     |  
     |  setAbstractParams(...)
     |      setAbstractParams(maxchars, contextwords).
     |      Set the parameters used to build 'keyword-in-context' abstracts
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
    
    class Doc(__builtin__.object)
     |  Doc()
     |  
     |  A Doc object contains index data for a given document.
     |  The data is extracted from the index when searching, or set by the
     |  indexer program when updating. The Doc object has no useful methods but
     |  many attributes to be read or set by its user. It matches exactly the
     |  Rcl::Doc c++ object. Some of the attributes are predefined, but, 
     |  especially when indexing, others can be set, the name of which will be
     |  processed as field names by the indexing configuration.
     |  Inputs can be specified as unicode or strings.
     |  Outputs are unicode objects.
     |  All dates are specified as unix timestamps, printed as strings
     |  Predefined attributes (index/query/both):
     |   text (index): document plain text
     |   url (both)
     |   fbytes (both) optional file size in bytes
     |   filename (both)
     |   fmtime (both) optional file modification date. Unix time printed 
     |      as string
     |   dbytes (both) document text bytes
     |   dmtime (both) document creation/modification date
     |   ipath (both) value private to the app.: internal access path
     |      inside file
     |   mtype (both) mime type for original document
     |   mtime (query) dmtime if set else fmtime
     |   origcharset (both) charset the text was converted from
     |   size (query) dbytes if set, else fbytes
     |   sig (both) app-defined file modification signature. 
     |      For up to date checks
     |   relevancyrating (query)
     |   abstract (both)
     |   author (both)
     |   title (both)
     |   keywords (both)
     |  
     |  Methods defined here:
     |  
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
    
    class Query(__builtin__.object)
     |  Recoll Query objects are used to execute index searches. 
     |  They must be created by the Db.query() method.
     |  
     |  Methods defined here:
     |  
     |  
     |  execute(...)
     |      execute(query_string, stemming=1|0)
     |      
     |      Starts a search for query_string, a Recoll search language string
     |      (mostly Xesam-compatible).
     |      The query can be a simple list of terms (and'ed by default), or more
     |      complicated with field specs etc. See the Recoll manual.
     |  
     |  executesd(...)
     |      executesd(SearchData)
     |      
     |      Starts a search for the query defined by the SearchData object.
     |  
     |  fetchone(...)
     |      fetchone(None) -> Doc
     |      
     |      Fetches the next Doc object in the current search results.
     |  
     |  sortby(...)
     |      sortby(field=fieldname, ascending=true)
     |      Sort results by 'fieldname', in ascending or descending order.
     |      Only one field can be used, no subsorts for now.
     |      Must be called before executing the search
     |  
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |  
     |  next
     |      Next index to be fetched from results. Normally increments after
     |      each fetchone() call, but can be set/reset before the call to
     |      effect seeking. Starts at 0.
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  
    
    class SearchData(__builtin__.object)
     |  SearchData()
     |  
     |  A SearchData object describes a query. It has a number of global
     |  parameters and a chain of search clauses.
     |  
     |  Methods defined here:
     |  
     |  
     |  addclause(...)
     |      addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
     |                qstring=string, slack=int, field=string, stemming=1|0,
     |                subSearch=SearchData)
     |      Adds a simple clause to the SearchData And/Or chain, or a subquery
     |      defined by another SearchData object
     |  
     |  ----------------------------------------------------------------------
     |  Data and other attributes defined here:
     |  

FUNCTIONS
    connect(...)
        connect([confdir=None], [extra_dbs=None], [writable = False])
                 -> Db.
        
        Connects to a Recoll database and returns a Db object.
        confdir specifies a Recoll configuration directory
        (the default is built like for any Recoll program).
        extra_dbs is a list of external databases (xapian directories)
        writable decides if we can index new data through this connection
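The indexing methods listed above (needUpdate, addOrUpdate, purge) could be combined into one indexing pass roughly as follows. This is a minimal sketch under several assumptions: needUpdate() is taken to return true when the document must be reindexed, the layout of the items tuples is hypothetical, and db and make_doc are passed in so that the function does not depend on how the module is installed. In real use they would presumably be recoll.connect(writable=True) and recoll.Doc.

```python
def index_pass(db, make_doc, items):
    """Run one indexing pass over items, then purge stale documents.

    db       -- object with needUpdate/addOrUpdate/purge methods, for
                instance the result of recoll.connect(writable=True)
    make_doc -- callable returning an empty document, for instance
                recoll.Doc
    items    -- iterable of (udi, sig, url, mtype, text) tuples
                (hypothetical layout chosen for this sketch)

    Returns the number of documents added or updated.
    """
    updated = 0
    for udi, sig, url, mtype, text in items:
        # Assumption: needUpdate() is true when the indexed copy is stale
        # or missing. Calling it also marks the document as still existing.
        if not db.needUpdate(udi, sig):
            continue
        doc = make_doc()
        doc.url = url
        doc.mtype = mtype
        doc.sig = sig
        doc.text = text
        db.addOrUpdate(udi, doc)
        updated += 1
    # Remove documents that were not checked during this pass.
    db.purge()
    return updated
```

Because purge() removes everything that was not checked during the pass, items has to enumerate the complete document set each time, not only the documents that changed.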

6.3.2.3. Example code

The following sample queries the index with a string in the user query language. See the python/samples directory inside the Recoll source for other examples.

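#!/usr/bin/env python
# Python 2 sample: run a query and print the first few results.

import recoll

# Open the default index (read-only) and configure abstract generation.
db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=2)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
# Show at most 5 results.
if nres > 5:
    nres = 5
# query.next is the index of the next result to be fetched.
while 0 <= query.next < nres:
    doc = query.fetchone()
    print query.next
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    # Keyword-in-context abstract for this document and query.
    abstract = db.makeDocAbstract(doc, query).encode('utf-8')
    print abstract
    print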
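A structured query built with SearchData, combined with seeking through Query.next, might look like the sketch below. It relies on assumptions that the interface manual does not confirm: executesd() is assumed to return the result count, as execute() does in the sample above, and the helper name phrase_search and its parameters are invented for the example. db and make_search_data are passed in, and in real use they would presumably be recoll.connect() and recoll.SearchData.

```python
def phrase_search(db, make_search_data, words, phrase, start=0, count=5):
    """Search for documents containing all of words and the exact phrase.

    Returns (total result count, list of up to count documents starting
    at result index start).
    """
    sd = make_search_data()
    sd.addclause(type='and', qstring=words)
    sd.addclause(type='phrase', qstring=phrase)

    query = db.query()
    # Assumption: executesd() returns the result count, like execute().
    nres = query.executesd(sd)

    end = min(nres, start + count)
    # Setting 'next' before fetching seeks to an arbitrary result index.
    query.next = start
    docs = []
    while 0 <= query.next < end:
        docs.append(query.fetchone())
    return nres, docs
```

Calling it with start=5 and count=5 would return the second page of five results, or fewer if the result list ends earlier.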