;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Sphinx search module for Drupal 5.x
;; $Id: README.txt,v 1.2.2.7 2008/08/29 19:08:31 markuspetrux Exp $
;;
;; Original author: markus_petrux at drupal.org (July 2008)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

REQUIREMENTS
============

- PHP 4.4.x or PHP 5.x (PHP needs to be compiled with --enable-memory-limit).
- Sphinx 0.9.8 (shell access is required here).


| 15 |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
|
| 16 |
|
| 17 |
INSTALLATION
|
| 18 |
============
|
| 19 |
|
1) Install Sphinx.

   It is recommended to install Sphinx on a separate box, but it can also run
   on any other server in your farm, or even on the same box as your web
   server, MySQL or anything else.

   For more details, additional requirements, etc., please read the Sphinx
   documentation. What follows is just a quick start guide. You need root
   access to the box.

     # move to a working directory.
     cd /opt

     # download and untar the Sphinx source.
     wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.tar.gz
     tar xzf sphinx-0.9.8.tar.gz
     cd sphinx-0.9.8

     # optionally, download and untar libstemmer.
     wget http://snowball.tartarus.org/dist/libstemmer_c.tgz
     tar xzf libstemmer_c.tgz

     # you may need to adjust file ownership.
     chown -R root:root *

     # configure, compile and install sphinx + libstemmer.
     ./configure --with-mysql --with-libstemmer --prefix=/usr/local/sphinx
     make
     make install


2) See the sphinxsearch/contrib subdirectory. It contains samples for
   sphinx.conf and a Sphinx start/stop script.

   ***** IMPORTANT *****
   The files in the contrib subdirectory are just samples. They are provided
   to help you set up your Sphinx installation, but without warranties of any
   kind. Note that I started to learn Sphinx only recently, and my environment
   and needs may differ a lot from yours. Please don't use them as-is. If you
   do, it is at your own risk.
   *********************


3) Install the sphinxsearch Drupal module.

   - Copy the package contents to modules/sphinxsearch.
   - Copy the sphinxsearch_scripts subdirectory provided within this module to
     your Drupal root directory.
     Alternatively, you may wish to set up a symbolic link from your Drupal
     root to the sphinxsearch_scripts subdirectory of this module. This way
     you don't need to copy files again when the module is updated. Please see
     README-XMLPIPE.txt for further information and examples.
   - Go to admin/build/modules to install the module.
   - Go to admin/user/access to adjust permissions
     (use sphinxsearch, administer sphinxsearch).
   - Go to admin/settings/sphinxsearch to configure module options
     (see below).

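The symbolic link alternative from step 3 can be sketched as a small shell
function. This is only an illustration: the function name and the Drupal root
path in the usage comment are hypothetical, and it assumes the module was
copied to modules/sphinxsearch as described above.

```shell
# Link <drupal_root>/sphinxsearch_scripts to the copy shipped inside the module.
# Assumes the module lives in <drupal_root>/modules/sphinxsearch.
link_sphinxsearch_scripts() {
    drupal_root=$1
    # The target is relative to the directory holding the link (the Drupal
    # root), so the link keeps working if the whole tree is moved.
    ln -s modules/sphinxsearch/sphinxsearch_scripts \
        "$drupal_root/sphinxsearch_scripts"
}

# Usage (hypothetical path):
#   link_sphinxsearch_scripts /var/www/drupal
```
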

4) Customization.

   - Check the module settings and adjust them to your environment.
   - Create and/or adjust your sphinx.conf to include definitions for all
     indexes required by your Drupal site. You need at least one main index
     (you can define as many main indexes as you need) and, optionally, one
     single delta index.
     It is also necessary to create a distributed index that will be used
     to join all your indexes when resolving search queries
     (see the contrib subdirectory for examples and further information).
   - Set up crontab to build your main and delta indexes at intervals.

   ***** IMPORTANT *****
   Some options in the module settings panel require you to rebuild your
   main indexes after changing them. Otherwise, you may get errors when
   searching.
   *********************


- Watchdog logging:

  XMLPipe processing generates watchdog records with information on memory
  used, execution time, nodes processed, etc., to help you adjust module
  settings to suit your needs.


- Steps to create your initial set of indexes:

  It is assumed that the sphinxsearch module has been installed and
  configured, and that you have already installed and configured your
  Sphinx server accordingly.

  1) Stop your searchd daemon.
  2) Use the Sphinx indexer to build all your main indexes.
  3) Start your searchd daemon.
  4) Set up a cron task to rebuild your delta index at short intervals.
  5) Set up a cron task to rebuild your main indexes once a day or so.

  Once your initial set of indexes is created, you don't need to stop
  your searchd daemon to rebuild them. Instead, you can invoke the Sphinx
  indexer with the --rotate argument.

  See the docs/contrib subdirectory of this package for a sample script.

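As a rough illustration of steps 1 to 3 and of later rebuilds with --rotate,
here is a minimal shell sketch. It is not the sample script shipped with this
package. The function names, the DRY_RUN switch, the config file path and the
index names in the usage comments are assumptions, and it assumes your searchd
build accepts --stop (otherwise use your own start/stop script).

```shell
# Assumed locations: the bin directory matches --prefix from the quick start
# guide; the config path is hypothetical. Adjust both to your setup.
SPHINX_BIN=${SPHINX_BIN:-/usr/local/sphinx/bin}
SPHINX_CONF=${SPHINX_CONF:-/usr/local/sphinx/etc/sphinx.conf}

# Print the command instead of running it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1-3: stop searchd, build the given indexes, start searchd again.
initial_build() {
    run "$SPHINX_BIN/searchd" --config "$SPHINX_CONF" --stop
    run "$SPHINX_BIN/indexer" --config "$SPHINX_CONF" "$@"
    run "$SPHINX_BIN/searchd" --config "$SPHINX_CONF"
}

# Later rebuilds: searchd keeps running while indexer rotates the indexes.
rotate_build() {
    run "$SPHINX_BIN/indexer" --config "$SPHINX_CONF" --rotate "$@"
}

# Usage (hypothetical index names):
#   initial_build main
#   rotate_build delta
```

Setting DRY_RUN=1 before calling either function prints the commands without
running them, which is a cheap way to check paths before touching a live
server.
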

- Troubleshooting:

  Symptom: When creating your initial set of main/delta indexes, you may
  end up with index file names with ".new" in them. Often, the Sphinx searchd
  daemon deals with this naming convention transparently. However, it may
  sometimes fail to recognise these files. I am not exactly sure why, though.
  Solution: Stop the searchd daemon and rename your index files to remove
  the ".new" part, i.e. if you see something like "main.new.spp", you can
  rename it to "main.spp". Note that each Sphinx index uses several files
  with the same name and different extensions. Start the searchd daemon again
  when all files have been renamed.

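The rename described in the solution above can be done in one pass with a
small shell function. This is a sketch only: the function name and the data
directory in the usage comment are hypothetical, and it must be run only
while searchd is stopped.

```shell
# Rename every "<index>.new.<ext>" file in a directory to "<index>.<ext>".
# Run this only while searchd is stopped.
strip_new_suffix() {
    dir=$1
    for f in "$dir"/*.new.*; do
        [ -e "$f" ] || continue      # the glob matched nothing
        base=${f##*/}                # file name without the directory part
        name=${base%%.new.*}         # index name, e.g. "main"
        ext=${base##*.new.}          # extension, e.g. "spp"
        mv -- "$f" "$dir/$name.$ext"
    done
}

# Usage (hypothetical data directory):
#   strip_new_suffix /usr/local/sphinx/var/data
```
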
| 135 |
|
| 136 |
|
| 137 |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
|
| 138 |
|
| 139 |
SPHINX IMPLEMENTATION DETAILS
|
| 140 |
=============================
|
| 141 |
|
- Sphinx is a fast and scalable full-text search engine. However, it currently
  has a few limitations related to the way text is indexed.

- Sphinx indexes documents that are composed of fields of different types. It
  basically supports text fields, integers, timestamps, booleans, multi-valued
  attributes (lists of integers that can be used to implement 1-N relations),
  etc. For instance, to manage basic Drupal content (nodes) we can use text
  fields to index titles and bodies, an integer field to store the node author
  id, a timestamp field to store the last updated time, a boolean field to
  store the is_deleted attribute (we'll see this later) and a multi-valued
  attribute to store the list of terms related to each node.

- The current version of Sphinx does not support full live index updates.
  Instead, it is necessary to build indexes in jobs that transform your data
  into a special kind of document that is stored in Sphinx indexes. This
  process is executed by the Sphinx indexer command and it should be invoked
  from the server where Sphinx is installed. A particular Sphinx installation
  can manage a number of indexes. Indexes can be partitioned, and the
  partitions can be managed locally or even remotely by other Sphinx servers.
  You can install Sphinx on a dedicated server (recommended) or it can coexist
  on one server with any other software of your choice.

- Each Sphinx instance is configured with its own sphinx.conf file, where you
  can specify how your indexes are built, the structure of your Sphinx
  documents, how your data should be extracted to build them, as well as
  options that tell Sphinx how the searchd daemon should work. The searchd
  daemon can be configured to listen on a particular TCP port of the server.
  Sphinx then provides a series of APIs that can be used to connect your
  application to the searchd daemon (locally or remotely) to perform search
  queries, retrieve results, build excerpts highlighting keywords, or even
  update some kinds of attributes. However, through these APIs it is not
  possible to index new documents, update text fields or delete indexed
  documents. It is only possible to update non-text fields.

- Therefore, it is necessary to create indexes in batch jobs. These jobs
  index all your content at once, and it is necessary to repeat this task
  periodically in order to recover the space used by documents marked as
  deleted, index new documents and/or reindex documents that have been updated
  since the last time the indexes were built.

- A note on document deletions. This module creates Sphinx documents with a
  boolean attribute, is_deleted, that is used as a flag to keep track of
  nodes that have been deleted from the Drupal database, but that still exist
  in Sphinx indexes. When a node is indexed, its is_deleted attribute is
  set to 0. When a node is deleted from the Drupal database, the Sphinx API is
  used to set the is_deleted attribute of that node to 1. Finally, all search
  queries sent by this module filter out documents with this attribute set.
  This method makes it look as if Sphinx supported live document deletions
  but, as you can see, this is not the case.

- In this scenario, we need to work in Sphinx with the so-called main + delta
  scheme. See the Sphinx documentation for more details. In short, main
  indexes should be rebuilt periodically in order to recover the space used by
  deleted documents, and the delta index should be rebuilt as often as
  possible to take care of new and updated documents until your main indexes
  are rebuilt.

- Once you have created your main indexes, new and/or updated nodes will be
  stored in the delta index. You may wish to rebuild your delta index at short
  intervals using crontab on the server where Sphinx has been installed.
  These intervals basically depend on the time required to process each delta
  and the number of node updates on your site. You may wish to start with 5
  minutes and adjust your crontab as you gain experience. The module
  generates full reports in watchdog to help you monitor index processing.

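A crontab for the main + delta scheme described above might look like the
following fragment. It is a sketch under assumptions: the paths follow the
--prefix used in the quick start guide, the index names "main" and "delta"
are hypothetical, and the daily time is an arbitrary choice.

```crontab
# Hypothetical paths and index names; adjust them to your sphinx.conf.
# Rebuild the delta index every 5 minutes.
*/5 * * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --rotate delta
# Rebuild the main index once a day, at 03:17.
17 3 * * * /usr/local/sphinx/bin/indexer --config /usr/local/sphinx/etc/sphinx.conf --rotate main
```
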
- Sphinx also supports distributed indexes. This type of index can be used
  to join a number of indexes that share the exact same structure. In this
  case, we join as many main indexes as we may need, plus the delta index. If
  a document is stored in more than one index, the one stored in the last
  index in the list "wins". Joined indexes can be local (managed by the same
  Sphinx instance) or remote. This is great in terms of scalability. In fact,
  this means we can split the index rebuild process into chunks that can be
  easily managed, or even spread across other servers in your infrastructure.
  Queries sent to distributed indexes are resolved by Sphinx transparently, as
  if they were sent to a single index.

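In sphinx.conf, a distributed index joining two main indexes and one delta
index could be declared roughly as follows. All index names and the remote
host are hypothetical, and this is a sketch rather than the sample
configuration shipped in the contrib subdirectory.

```conf
# Hypothetical index names; main1, main2 and delta must be defined elsewhere
# in the same sphinx.conf and share the same structure.
index drupal_search
{
    type  = distributed

    # Order matters: as described above, when a document exists in several
    # indexes the copy from the index listed last "wins", so delta goes last.
    local = main1
    local = main2
    local = delta

    # A remote index would be joined with an agent line instead, e.g.:
    # agent = search2.example.com:3312:main3
}
```
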
- Data sources to build Sphinx indexes can be of type MySQL, PostgreSQL or
  XMLPipe. With the MySQL or PostgreSQL source types it is possible to
  tell the Sphinx indexer to extract data directly from your database, and
  this method is impressively fast. However, these methods cannot be used to
  index Drupal nodes, or at least it would be very difficult to achieve,
  because data related to nodes often needs to be preprocessed by a number of
  hooks that may involve a lot of small and quick (or not so quick) SQL
  queries and further processing performed by core modules as well as contrib
  modules.
  For instance, XMLPipe is the only method that allows us to index nodes along
  with their comments, CCK fields, taxonomy terms, etc. In fact, this method
  allows us to index content the same way Drupal core search works.

- It is important to take into account that XMLPipe generation may
  require more resources than one would expect at first, compared to other
  Sphinx implementations. It all depends on the complexity of your Drupal
  installation, modules installed, size and number of nodes, available
  infrastructure, etc. Note that Drupal core search solves this problem by
  splitting index generation into chunks, where a number of nodes is indexed
  at each cron run, whereas with Sphinx we need to index all content at once.
  Of course, it is also possible to partition indexes so your nodes are spread
  across several storage units, though this method might only be recommended
  when your site has thousands of nodes, maybe millions. Again, it all depends
  on the time it takes to create your indexes, which may be from a few minutes
  up to one or more hours.

- This is why this module is based on, and supports, XMLPipe index
  generation. The problem is that this method is MUCH slower than indexing
  content using the MySQL/PostgreSQL source types. You may now wish to take a
  look at the docs subdirectory of this project to see the options this module
  provides to help you set up and manage your index creation jobs, etc.

- In order to minimize these problems, the XMLPipe generation script provided
  with this module implements a few checks that abort XMLPipe stream
  generation and report the cause of the problem to watchdog. Specifically,
  the module monitors memory usage and execution time in order to prevent
  crashes when the PHP memory_limit and/or max_execution_time values are
  exceeded. Depending on module settings, it is also possible to set up the
  XMLPipe generation script to restart the client connection to the DB server
  to avoid max connection time problems. You may also wish to adjust PHP
  settings in the .htaccess file provided within the sphinxsearch_scripts
  subdirectory of this module.

- Here are a couple of examples where I have implemented Sphinx, so you can
  get an idea of how much time it may take to process your indexes, and/or a
  sample reference on how to set up your Sphinx installation.

  a) phpBB based board with 14+ million posts, 15,000 posts a day on average,
     and growing. Here, I used 4 main indexes with capacity for 5 million
     posts each, and one delta index. Generation of each main index takes
     around 1 hour. 1 or 2 main indexes are built daily. Generation of the
     delta index takes just seconds and it is scheduled to run at 1 minute
     intervals from cron.
     If you wish, you can test the Sphinx search engine implemented on this
     site here: http://zonaforo.meristation.com/foros/search.php

  b) Drupal based site running this module. The site has 10,000+ blog entries
     and 30,000+ comments. It uses 1 main index + 1 delta. The main index
     takes less than 5 minutes to build and it is rebuilt daily. The delta
     index takes a few seconds and it is rebuilt at 5 minute intervals from
     cron. Again, if you wish, you can test the Sphinx search engine
     implemented on this site here: http://blogs.gamefilia.com/search

  It all depends on several factors. Of course, your mileage may vary.

- New or different ideas to fight against the aforementioned "limitations"
  are welcome. Please use the issue tracker of the module.


| 284 |
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
|
| 285 |
|
| 286 |
TODO
|
| 287 |
====
|
| 288 |
|
- Provide new hooks to allow external modules to extend Sphinx document
  attributes and/or alter the search user interface with additional filters.

- Think about a reasonable method to implement access control to indexed data.
  Currently, all content indexed by this module is available to anyone with
  the 'use sphinxsearch' permission.

- Provide a better integration / user interface to co-exist with other search
  modules that may provide solutions for searching different kinds of content,
  such as users, etc. Suggestions are welcome. However, I believe this is more
  a job for the Drupal search framework itself. Hopefully the Sphinx search
  integration provided with this module can be used as a proof of concept of
  Sphinx capabilities and limitations that may help here in some way...