Monday, May 15, 2006

Django indexed search...

Zope has its 'catalog'. Any object can register with the catalog, and it gets indexed, then code can ask the catalog for a search and get a list of objects back.

How do we do something like this in Django? There are some existing ideas out there, for example Merquery. That post also has some links to other ideas for Python-based text searching. Merquery is pretty much just an idea at the moment, and the examples are SQLObject oriented. What might a Django indexed search system look like? Here are some ideas:
  • Every searchable model class has a method to index or re-index instances of it in the index database
  • Index updates happen automatically on save() or other updates of the instance
  • If a Django object is deleted, its entry is removed from the index
  • If you change the object directly with SQL, it's your job to call the re-index method
  • There must be a way to re-build the index from scratch. This would iterate over all indexable objects.
  • Comes with methods for indexing standard fields (minimally CharField and TextField) but extensible to other fields - imagine indexing PDF files...
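To make the index lifecycle above a bit more concrete, here is a rough sketch in plain Python. Nothing here is Django or HyperEstraier API: ToyIndex and its method names are made up for illustration, and a real system would call index() from save(), remove() from delete(), and rebuild() from some management command.

```python
import re
from collections import defaultdict


class ToyIndex:
    """Hypothetical in-memory inverted index: word -> set of object keys."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> keys whose text contains it
        self.words_by_key = {}            # key -> words currently indexed for it

    def index(self, key, text):
        """Index or re-index one object (what save() would trigger)."""
        self.remove(key)  # drop any stale entry before re-indexing
        words = set(re.findall(r"\w+", text.lower()))
        self.words_by_key[key] = words
        for word in words:
            self.postings[word].add(key)

    def remove(self, key):
        """Forget one object (what delete() would trigger)."""
        for word in self.words_by_key.pop(key, set()):
            self.postings[word].discard(key)
            if not self.postings[word]:
                del self.postings[word]

    def rebuild(self, documents):
        """Throw the index away and rebuild it from (key, text) pairs."""
        self.postings.clear()
        self.words_by_key.clear()
        for key, text in documents:
            self.index(key, text)

    def search(self, word):
        """Return the sorted keys of objects containing the word."""
        return sorted(self.postings.get(word.lower(), set()))
```

The point of keeping words_by_key around is that re-indexing and deletion can clean up old words without scanning the whole index.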
Okay, so we've built our index and hopefully it keeps up-to-date. What about searches? What would be nice?
  • Ability to search over multiple classes - for example blog texts and blog comments.
  • Ability to restrict search to some or all fields.
  • Ability to use Fields in queries - so you could search for anyone with firstName Fred and lastName not Smith.
  • Ability to specify an ordering - by relevance, or some date field.
  • Get from the search result back to the Django object. HyperEstraier stores a user-defined URI with each document it indexes, and this would have to map to Django objects. Best I can think of at the moment is a URI which is a string concatenation of model and id.
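The "model plus id" URI idea from the last point could look something like this. It is only a sketch: the ":" separator, the integer id, and the function names make_uri and parse_uri are my assumptions, not anything HyperEstraier or Django prescribes.

```python
def make_uri(model_name, object_id):
    """Build the string stored with each indexed document."""
    return "%s:%s" % (model_name, object_id)


def parse_uri(uri):
    """Split a stored URI back into (model_name, object_id).

    Splits on the last ':' so that dotted model names such as
    'blog.Entry' survive intact. Assumes integer primary keys.
    """
    model_name, separator, object_id = uri.rpartition(":")
    if not separator or not model_name:
        raise ValueError("not a model:id URI: %r" % (uri,))
    return model_name, int(object_id)
```

Given the model name and id, the search layer would then look up the model class and fetch the object by primary key.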
Perhaps we should have an IndexedModel class that extends the usual Django Model class:
class Blog(IndexedModel):
    name = CharField()
    body = TextField()
By default this would index all fields that it knows how to index. This could be controlled with an inner class, in the same way that the Admin inner class works:
class Blog(IndexedModel):
    class Indexer:
        fields = ("name", "body",)

    name = CharField()
    body = TextField()
    secret = TextField()
    moo = CowField()

This would index the name and body fields, but not the secret field, nor the moo field, since it doesn't know how to deal with CowField types.
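Here is one way the field selection might work, sketched in plain Python so it runs on its own. The Field classes below are stand-ins rather than Django's real ones, and indexable_fields and INDEXABLE_TYPES are hypothetical names; the Blog class skips IndexedModel entirely because that class doesn't exist yet.

```python
class Field:
    """Stand-in base class for model fields (not Django's)."""


class CharField(Field):
    pass


class TextField(Field):
    pass


class CowField(Field):
    """A field type the indexer has no idea how to handle."""


# Field types the indexer knows how to index (assumed minimal set).
INDEXABLE_TYPES = (CharField, TextField)


def indexable_fields(model_class):
    """Return the names of the fields that should be indexed.

    Uses Indexer.fields when the inner class provides it, otherwise
    every declared field, then drops any field of an unknown type.
    """
    declared = {
        name: value
        for name, value in vars(model_class).items()
        if isinstance(value, Field)
    }
    indexer = getattr(model_class, "Indexer", None)
    wanted = getattr(indexer, "fields", None)
    if wanted is None:
        wanted = list(declared)
    return [
        name for name in wanted
        if isinstance(declared.get(name), INDEXABLE_TYPES)
    ]


class Blog:
    class Indexer:
        fields = ("name", "body")

    name = CharField()
    body = TextField()
    secret = TextField()
    moo = CowField()
```

With this, indexable_fields(Blog) picks out name and body. Without the Indexer inner class it would fall back to every field of a known type, so secret would be indexed too, while moo would still be skipped.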

I haven't yet thought how to organise the search end of things... Comments on all of this welcome!

B