Contents of /contributions/modules/sphinxsearch/README.txt

Revision 1.4
Fri Sep 12 02:44:22 2008 UTC by markuspetrux
Branch: MAIN
CVS Tags: HEAD
Branch point for: DRUPAL-6--1
Changes since 1.3: +2 -9 lines
File MIME type: text/plain
- Ported module from D5 to D6.
- Bugfix: undefined class method in sphinxsearch_check_connection_page().
- Bugfix: added criterion class to taxonomy elements in advanced search form.
- Bugfix: added support for mysqli and pgsql to _sphinxsearch_db_reconnect().
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;; Sphinx search module for Drupal 5.x
;; $Id: README.txt,v 1.2.2.7 2008/08/29 19:08:31 markuspetrux Exp $
;;
;; Original author: markus_petrux at drupal.org (July 2008)
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

REQUIREMENTS
============

- PHP 4.4.x or PHP 5.x (PHP needs to be compiled with --enable-memory-limit).
- Sphinx 0.9.8 (shell access is required here).


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

INSTALLATION
============

1) Install Sphinx.

   It is recommended to install Sphinx on a separate box, but it can also run
   on any other server in your farm, or even on the same box as your web
   server, MySQL, or anything else.

   For more details, additional requirements, etc., please read the Sphinx
   documentation. Here's just a quick start guide. You need root access
   to the box.

     # move to a working directory.
     cd /opt

     # download and untar Sphinx source.
     wget http://www.sphinxsearch.com/downloads/sphinx-0.9.8.tar.gz
     tar xzf sphinx-0.9.8.tar.gz
     cd sphinx-0.9.8

     # optionally, download and untar libstemmer.
     wget http://snowball.tartarus.org/dist/libstemmer_c.tgz
     tar xzf libstemmer_c.tgz

     # you may need to adjust file ownerships.
     chown -R root:root *

     # configure, build and install sphinx + libstemmer
     # (omit --with-libstemmer if you skipped the libstemmer download).
     ./configure --with-mysql --with-libstemmer --prefix=/usr/local/sphinx
     make
     make install


2) See the sphinxsearch/contrib subdirectory. It contains samples of
   sphinx.conf and of a Sphinx start/stop script.

   ***** IMPORTANT *****
   Files in the contrib subdirectory are just samples. They are provided to
   help you set up your Sphinx installation, but without warranties of any
   kind. Note that I started learning Sphinx only recently. Also, my
   environment and needs may differ a lot from yours. Please don't use them
   as-is. If you do, it is at your own risk.
   *********************


3) Install the sphinxsearch Drupal module.

   - Copy the package contents to modules/sphinxsearch.
   - Copy the sphinxsearch_scripts subdirectory provided with this module to
     your Drupal root directory.
     Alternatively, you can set up a symbolic link from your Drupal root to
     the sphinxsearch_scripts subdirectory of this module. This way you don't
     need to copy files when the module is updated. Please see
     README-XMLPIPE.txt for further information and examples.
   - Go to admin/build/modules to install the module.
   - Go to admin/user/access to adjust permissions.
     (use sphinxsearch, administer sphinxsearch)
   - Go to admin/settings/sphinxsearch to configure module options.
     (see below)


4) Customization.

   - Check the module settings and adjust them to your environment.
   - Create and/or adjust your sphinx.conf to include definitions for all
     indexes required by your Drupal site. You need at least one main index
     (you can define as many main indexes as you need) and, optionally, one
     single delta index.
     It is also necessary to create a distributed index that will be used
     to join all your indexes when resolving search queries.
     (see the contrib subdirectory for examples and further information).
   - Set up crontab to build your main and delta indexes at intervals.

   ***** IMPORTANT *****
   Some options in the module settings panel require you to rebuild your
   main indexes after changing them. Otherwise, you may get errors when
   searching.
   *********************


   - Watchdog logging:

     XMLPipe processing generates watchdog records with information on memory
     used, execution time, nodes processed, etc., to help you adjust module
     settings to suit your needs.


   - Steps to create your initial set of indexes:

     It is assumed that the sphinxsearch module has been installed and
     configured, and that you have already installed and configured your
     Sphinx server accordingly.

     1) Stop your searchd daemon.
     2) Use the Sphinx indexer to build all your main indexes.
     3) Start your searchd daemon.
     4) Set up a cron task to rebuild your delta index at short intervals.
     5) Set up a cron task to rebuild your main indexes once a day or so.

     Once your initial set of indexes is created, you don't need to stop
     your searchd daemon. Instead, you can invoke the Sphinx indexer with
     the --rotate argument.

     See the docs/contrib subdirectory of this package for a sample script.
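
The difference between the initial build and later rebuilds can be sketched
as a small helper that only prints the indexer command line. This is an
illustration, not the sample script shipped with the module: the install
prefix follows the quick start guide above, and the index names drupal_main
and drupal_delta are hypothetical.

```shell
#!/bin/sh
# Sketch only. Paths assume --prefix=/usr/local/sphinx; adjust to your setup.
SPHINX_HOME="${SPHINX_HOME:-/usr/local/sphinx}"
SPHINX_CONF="${SPHINX_CONF:-$SPHINX_HOME/etc/sphinx.conf}"

# Print the indexer command line for the given indexes.
# Usage: indexer_cmd rotate|norotate index [index ...]
indexer_cmd() {
    mode="$1"
    shift
    cmd="$SPHINX_HOME/bin/indexer --config $SPHINX_CONF"
    # With --rotate, indexer writes new index files next to the live ones
    # and signals searchd to pick them up, so the daemon keeps running.
    if [ "$mode" = "rotate" ]; then
        cmd="$cmd --rotate"
    fi
    echo "$cmd $*"
}

# Initial build, searchd stopped (hypothetical index names):
#   indexer_cmd norotate drupal_main drupal_delta
# Later rebuilds, searchd running:
#   indexer_cmd rotate drupal_delta
```

Printing the command instead of running it makes it easy to review what a
cron job would execute before wiring it in.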


   - Troubleshooting:

     Symptom: When creating your initial set of main/delta indexes, you may
     end up with index file names with ".new" in them. Often, the Sphinx
     searchd daemon deals with this naming convention transparently. However,
     it may sometimes fail to recognise these files. Not exactly sure why,
     though.
     Solution: Stop the searchd daemon and rename your index files to remove
     the ".new" part, i.e. if you see something like "main.new.spp", you can
     rename it to "main.spp". Note that each Sphinx index uses several files
     with the same name and different extensions. Start the searchd daemon
     again when all files have been renamed.
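
The renaming described above could be scripted along these lines. This is a
minimal sketch: the function name is made up for illustration, the directory
holding your index files depends on the path settings in your sphinx.conf,
and it should only be run while searchd is stopped.

```shell
# Rename index files such as "main.new.spp" to "main.spp" in one directory.
# Usage: strip_new_suffix /path/to/index/dir   (hypothetical helper)
strip_new_suffix() {
    dir="$1"
    for f in "$dir"/*.new.*; do
        # When the glob matches nothing it stays literal; skip that case.
        [ -e "$f" ] || continue
        base=$(basename "$f")
        # "main.new.spp" -> "main.spp": drop the first ".new" component.
        target=$(printf '%s\n' "$base" | sed 's/\.new\././')
        mv -- "$f" "$dir/$target"
    done
}
```

Note that mv overwrites an existing file with the target name, so check the
directory first if older index files are still present.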



;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

SPHINX IMPLEMENTATION DETAILS
=============================

- Sphinx is a fast and scalable full-text search engine. However, it currently
  has a few limitations related to the way text is indexed.

- Sphinx indexes documents that are composed of fields of different types. It
  basically supports text fields, integers, timestamps, booleans, multi-valued
  attributes (lists of integers that can be used to implement 1-N relations),
  etc. For instance, to manage basic Drupal content (nodes) we can use text
  fields to index titles and bodies, an integer field to store the node author
  id, a timestamp field to store the last updated time, a boolean field to
  store the is_deleted attribute (we'll see this later) or a multi-valued
  attribute to store the list of terms related to each node.

- The current version of Sphinx does not support full live index updates.
  Instead, it is necessary to build indexes in jobs that transform your data
  into a special kind of document stored in Sphinx indexes. This process is
  executed by the Sphinx indexer command, and it should be invoked from the
  server where Sphinx is installed. A particular Sphinx installation can manage
  a number of indexes, and you can partition indexes so that they are managed
  locally or even remotely by other Sphinx servers. You can install your Sphinx
  server on a dedicated server (recommended), or it can coexist on one server
  with any other software of your choice.

- Each Sphinx instance is configured with its own sphinx.conf file, where you
  can specify how your indexes are built, the structure of your Sphinx
  documents, how your data should be extracted to build them, as well as
  options that tell Sphinx how the searchd daemon should work. The searchd
  daemon can be configured to listen on a particular TCP port of the server.
  Sphinx also provides a series of APIs that can be used to connect your
  application to the searchd daemon (locally or remotely) to perform search
  queries, retrieve results, build excerpts highlighting keywords, or even
  update some kinds of attributes. However, through these APIs it is not
  possible to index new documents, update text fields or delete indexed
  documents. It is only possible to update non-text fields.

- Therefore, it is necessary to create indexes in batch jobs. These jobs
  index all your content at once, and it is necessary to repeat this task
  periodically in order to recover space used by documents marked as
  deleted, index new documents and/or reindex documents that have been updated
  since the last time indexes were built.

- A note on document deletions. This module creates Sphinx documents with a
  boolean attribute, is_deleted, that is used as a flag to keep track of
  nodes that have been deleted from the Drupal database but still exist
  in Sphinx indexes. When a node is indexed, its is_deleted attribute is
  set to 0. When a node is deleted from the Drupal database, the Sphinx API is
  used to set the is_deleted attribute of that node to 1. Finally, all search
  queries sent by this module filter out documents with this attribute set.
  This method makes it look as if Sphinx supported live document deletions
  but, as you can see, this is not the case.

- In this scenario, we need to work in Sphinx with the so-called main + delta
  scheme. See the Sphinx documentation for more details. In short, main indexes
  should be rebuilt periodically in order to recover space used by deleted
  documents, and the delta index should be rebuilt as often as possible to take
  care of new and updated documents until your main indexes are rebuilt.

- Once you have created your main indexes, new and/or updated nodes will be
  stored in the delta index. You may wish to rebuild your delta index at short
  intervals using crontab from the server where Sphinx has been installed.
  These intervals basically depend on the time required to process each delta
  and the number of node updates on your site. You may wish to start with 5
  minutes and adjust your crontab as you gain experience. The module
  generates full reports in watchdog to help you monitor index processing.
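
As a starting point, the schedule described above might look like the crontab
entries printed by the sketch below. The binary and config paths, the index
names drupal_main and drupal_delta, and the 03:30 time for the daily rebuild
are all assumptions for illustration; the */5 step syntax also assumes a cron
implementation that supports it.

```shell
# Print sample crontab entries for the main + delta scheme.
# Usage: print_sphinx_crontab [indexer_path] [config_path]
print_sphinx_crontab() {
    indexer="${1:-/usr/local/sphinx/bin/indexer}"
    conf="${2:-/usr/local/sphinx/etc/sphinx.conf}"
    cat <<EOF
# Rebuild the delta index every 5 minutes.
*/5 * * * * $indexer --config $conf --rotate --quiet drupal_delta
# Rebuild the main index once a day at 03:30.
30 3 * * * $indexer --config $conf --rotate --quiet drupal_main
EOF
}
```

Review the output and paste it into the crontab of the user that runs the
indexer, then tune the delta interval using the watchdog reports.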

- Sphinx also supports distributed indexes. This type of index can be used
  to join a number of indexes that share the exact same structure. In this
  case, we join as many main indexes as we may need, plus the delta index. If a
  document is stored in more than one index, the one stored in the last index
  in the list "wins". Joined indexes can be local (managed by the same Sphinx
  instance) or remote. This is great in terms of scalability. In fact, this
  means we can split the index rebuild process into chunks that can be easily
  managed, or even spread to other servers in your infrastructure. Queries sent
  to distributed indexes are resolved by Sphinx transparently, as if they were
  sent to a single index.
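
To make the ordering rule concrete, here is a small sketch that prints a
sphinx.conf fragment for a distributed index over local indexes. The helper
and the index names are hypothetical; see the samples in the contrib
subdirectory for a complete configuration.

```shell
# Print a distributed index definition that joins the given local indexes.
# Usage: print_distributed_index name local_index [local_index ...]
print_distributed_index() {
    name="$1"
    shift
    echo "index $name"
    echo "{"
    echo "    type  = distributed"
    # Order matters: when a document exists in several indexes, the one in
    # the last listed index wins, so pass the delta index last.
    for idx in "$@"; do
        echo "    local = $idx"
    done
    echo "}"
}

# Example: print_distributed_index drupal drupal_main drupal_delta
```

Listing the delta index last means a node re-indexed since the last main
rebuild is served from its newer delta copy.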

- Data sources to build Sphinx indexes can be of type MySQL, PostgreSQL and
  XMLPipe. With the MySQL or PostgreSQL source types it is possible to
  tell the Sphinx indexer to extract data directly from your database, and this
  method is impressively fast. However, these methods cannot be used to index
  Drupal nodes, or at least it would be very difficult to do so, because data
  related to nodes often needs to be preprocessed by a number of hooks that may
  involve a lot of small and quick (or not so quick) SQL queries and further
  processing performed by core modules as well as contrib modules.
  For instance, XMLPipe is the only method that allows us to index nodes along
  with their comments, CCK fields, taxonomy terms, etc. In fact, this method
  allows us to index content the same way Drupal core search works.
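
For readers unfamiliar with the format, the sketch below prints a minimal
one-document stream in the xmlpipe2 layout described in the Sphinx
documentation. It is not the module's generator (that is a PHP script), and
the field and attribute names used here are assumptions rather than the
module's actual schema.

```shell
# Escape the characters that are unsafe in XML text content.
xml_escape() {
    printf '%s' "$1" | sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
}

# Print a one-document xmlpipe2 stream (hypothetical schema).
# Usage: print_docset id title body
print_docset() {
    id="$1"
    title="$2"
    body="$3"
    cat <<EOF
<?xml version="1.0" encoding="utf-8"?>
<sphinx:docset>
<sphinx:schema>
<sphinx:field name="title"/>
<sphinx:field name="body"/>
<sphinx:attr name="is_deleted" type="bool"/>
</sphinx:schema>
<sphinx:document id="$id">
<title>$(xml_escape "$title")</title>
<body>$(xml_escape "$body")</body>
<is_deleted>0</is_deleted>
</sphinx:document>
</sphinx:docset>
EOF
}
```

In the real module, the text placed in each field is the output of Drupal's
node processing, which is why the stream has to be produced by Drupal itself
rather than by a direct SQL query.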

- It is important to take into account that XMLPipe generation may
  require more resources than one would expect at first, compared to other
  Sphinx implementations. It all depends on the complexity of your Drupal
  installation, modules installed, size and number of nodes, available
  infrastructure, etc. Note that Drupal core search solves this problem by
  splitting index generation into chunks, where a number of nodes is indexed at
  each cron run; with Sphinx, however, we need to index all content at once. Of
  course, it is also possible to partition indexes so your nodes are spread
  across several storage units, though this method might only be recommended
  when your site has thousands of nodes, maybe millions. Again, it all depends
  on the time it takes to create your indexes, which may be from a few minutes
  up to one or more hours.

- So that is why this module is based on and supports XMLPipe index
  generation. The problem now is that this method is MUCH slower than indexing
  content using the MySQL/PostgreSQL source types. You may now wish to take a
  look at the docs subdirectory of this project to see the options this module
  provides to help you set up and manage your index creation jobs, etc.

- In order to minimize these problems, the XMLPipe generation script provided
  with this module implements a few checks that abort XMLPipe stream
  generation and report the cause of the problem to watchdog. Specifically, the
  module monitors memory usage and execution time in order to prevent crashes
  when the PHP memory_limit and/or max_execution_time values are exceeded.
  Depending on module settings, it is also possible to set up the XMLPipe
  generation script to restart the client connection to the DB server to avoid
  maximum connection time problems. You may also wish to adjust PHP
  settings in the .htaccess file provided within the sphinxsearch_scripts
  subdirectory of this module.

- Here are a couple of examples where I have implemented Sphinx, so you can get
  an idea of how much time it may take to process your indexes, and/or a sample
  reference on how to set up your Sphinx installation.

  a) phpBB based board with 14+ million posts, 15,000 posts a day on average,
     and growing. Here, I used 4 main indexes with capacity for 5 million posts
     each, and one delta index. Generation of each main index takes around 1
     hour. 1 or 2 main indexes are built daily. Generation of the delta index
     takes just seconds and it is scheduled to run at 1 minute intervals from
     cron. If you wish, you can test the Sphinx search engine implemented on
     this site here: http://zonaforo.meristation.com/foros/search.php

  b) Drupal based site running this module. The site has 10,000+ blog entries
     and 30,000+ comments. It uses 1 main index + 1 delta. The main index takes
     less than 5 minutes to build and it is rebuilt daily. The delta index
     takes a few seconds and it is rebuilt at 5 minute intervals from cron.
     Again, if you wish, you can test the Sphinx search engine implemented on
     this site here: http://blogs.gamefilia.com/search

  It all depends on several factors. Of course, your mileage may vary.

- New or different ideas to overcome the aforementioned "limitations" are
  welcome. Please use the issue tracker of the module.


;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

TODO
====

- Provide new hooks to allow external modules to extend Sphinx document
  attributes and/or alter the search user interface with additional filters.

- Think about a reasonable method to implement access control for indexed data.
  Currently, all content indexed by this module is available to anyone with
  the 'use sphinxsearch' permission.

- Provide better integration / a better user interface to coexist with other
  search modules that may provide solutions for searching different kinds of
  content, such as users, etc. Suggestions are welcome. However, I believe
  this is more a job for the Drupal search framework itself. Hopefully, the
  Sphinx search integration provided with this module can be used as a proof
  of concept of Sphinx capabilities and limitations that may help here in
  some way...
