; $Id$

description
-------------
The DataSync module was written to import data reliably on a large scale. It
allows you to schedule and run multiple types of import jobs on multiple servers
in a reliable and centralized way. It is NOT very scalable at the moment: Drupal 5
does not work well with database transactions, so you should run each consumer on
only one machine at a time to prevent race conditions. This should be fixed in
the Drupal 6 version. The module is nonetheless very functional and has already
run thousands of jobs on our production servers.


overview of the modules
--------------------------
SUMMARY
-The datasync.module provides an API (both PHP functions and a web service) to
schedule and keep track of data import jobs.
-The datasync_consumer.module and datasync_producer.module implement that API
and will automatically start and run your jobs by calling PHP hooks that you
define in a separate module (see datasync_api_example.module for an example).

DATASYNC.MODULE
The datasync.module file by itself provides an API for scheduling and
running data import jobs. It also provides a library of supporting functions
and database tables that may help with importing data. The API can be accessed
either by calling the functions directly from PHP or through web service calls to
paths defined in datasync_menu() (please note that as of July 21, 2008, the web
service API is fairly incomplete and untested). Unless you are certain of what
you are doing, you should change the tables created by datasync.module only
through the API. It is important to realize that, by itself, datasync.module will
not initiate or run jobs; it only provides a way to schedule and keep track of
them. To learn how to implement the API, read the comments above each core API
function in datasync.module under the heading "DATASYNC CORE FUNCTIONS".

DATASYNC_CONSUMER.MODULE AND DATASYNC_PRODUCER.MODULE
Since it would be a pain to fully implement this API every time you wanted to
import a new type of data, the datasync_consumer and datasync_producer modules
implement it for you and provide their own API for defining your specific data
import jobs. These modules will schedule and run jobs for you (according to how
you define them) and should take most of the drudgery out of getting a data
importer working. Please note that if you use the datasync_consumer and
datasync_producer modules, your jobs run as PHP hooks. This means that if you
want some totally separate non-Drupal, non-PHP system to generate the data as you
import it, you probably should not use these modules and should implement the
DataSync API yourself instead. You should, however, be able to use the
datasync_consumer and datasync_producer modules in the majority of cases.
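
To illustrate what "running as PHP hooks" means, here is a minimal sketch of the
Drupal naming convention, in which a hook is a function named after the module
that implements it. The helper example_invoke() is a stand-in written for this
sketch, and the hook name "example_import" and module name "mymodule" are made
up; the real hook names are the ones shown in datasync_api_example.module.

```php
<?php
// Stand-in for a hook dispatcher: call "{module}_{hook}" if it exists.
// This helper is illustrative only and is not part of DataSync.
function example_invoke($module, $hook, array $args = array()) {
  $function = $module . '_' . $hook;
  if (!function_exists($function)) {
    return NULL;
  }
  return call_user_func_array($function, $args);
}

// Hypothetical hook implementation in a module named "mymodule". The hook
// name "example_import" is an assumption made for this sketch.
function mymodule_example_import(array $job) {
  // A real implementation would fetch and save data here.
  return 'imported ' . $job['type'] . ' job #' . $job['id'];
}

$job = array('id' => 7, 'type' => 'products');
$result = example_invoke('mymodule', 'example_import', array($job));
echo $result, "\n";
```

Because the work happens inside a PHP function, whatever generates the data has
to be reachable from PHP, which is why a non-PHP data source may be better served
by implementing the DataSync API directly.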

The datasync_producer and datasync_consumer modules work together by updating the
datasync_jobs table. The datasync_producer.module creates new jobs at specified
intervals and advances each job to the next status when it has finished running
its current task. The datasync_consumer.module takes jobs that are waiting to be
processed, calls the appropriate hooks to work on each job, and then sets the job
status to completed when it is done. In other words, datasync_producer.module
continually creates jobs and makes sure they are ready and waiting to be
processed, while datasync_consumer.module actually processes the jobs and marks
them as finished. You define your specific jobs by implementing a module that
tells datasync_consumer.module and datasync_producer.module what to do. The best
way to do this is probably to copy datasync_api_example.module (which is heavily
commented) and tweak its hooks and functions to suit your needs.
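
The division of labor between producer and consumer can be sketched roughly as
follows. This is not the modules' actual code: the array stands in for the
datasync_jobs table, and the function names, column names, and status values
('new', 'waiting', 'completed') are assumptions made for the example.

```php
<?php
// In-memory stand-in for the datasync_jobs table (assumed layout).
$jobs = array();

// Producer side: create a job, do any preparation, then advance its status
// so that a consumer can pick it up.
function example_produce(array &$jobs, $type) {
  $id = count($jobs) + 1;
  $jobs[$id] = array('id' => $id, 'type' => $type, 'status' => 'new');
  // ... preparation work would happen here ...
  $jobs[$id]['status'] = 'waiting';
  return $id;
}

// Consumer side: take every waiting job, run the given callback on it, and
// mark it completed. Returns the number of jobs processed.
function example_consume(array &$jobs, $callback) {
  $processed = 0;
  foreach ($jobs as $id => $job) {
    if ($job['status'] !== 'waiting') {
      continue;
    }
    call_user_func($callback, $job);
    $jobs[$id]['status'] = 'completed';
    $processed++;
  }
  return $processed;
}

$log = array();
example_produce($jobs, 'products');
example_produce($jobs, 'prices');
$count = example_consume($jobs, function (array $job) use (&$log) {
  $log[] = $job['type'];
});
echo $count, " jobs processed\n";
```

Note that nothing in this sketch stops two consumers from taking the same
waiting job at the same moment, which is the kind of race condition that makes
it unsafe to run a consumer on more than one machine at a time.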

The datasync_producer and datasync_consumer modules run persistently through the
PHP command line interface. They are spawned by their hook_cron implementations
on the servers that you specify, and they exit on their own every hour to limit
the effect of memory leaks. This means you should make sure the appropriate
hook_cron functions run at least hourly so that these processes keep running.
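
The "run persistently, exit hourly" pattern can be sketched like this. It is an
illustration under stated assumptions, not the contents of consume.php or
produce.php: the function name, its parameters, and the pause between passes are
all invented for the example.

```php
<?php
// Run $work repeatedly until $max_runtime seconds have elapsed, then return
// so the process can exit and be started again by the next cron run.
// Function and parameter names are assumptions made for this sketch.
function example_worker_loop($work, $max_runtime = 3600, $pause = 5) {
  $deadline = time() + $max_runtime;
  $iterations = 0;
  while (time() < $deadline) {
    call_user_func($work);
    $iterations++;
    sleep($pause);
  }
  return $iterations;
}

// Short demonstration values; a long-running worker would use the defaults.
$runs = example_worker_loop(function () {
  // ... take and process waiting jobs here ...
}, 2, 1);
echo "exiting after $runs iterations\n";
```

Since the process gives up after its time limit whether or not work remains,
something has to start it again, which is the role cron plays here.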


install and configuration
--------------------------
See the INSTALL file.


todo
--------------
interface for ds_variable_get('datasync_fail_job_restart', 1);
reporting mechanism for datasync failures
statistics for job completion
interface for killswitch for consume.php and produce.php
interface to purge started jobs

Originally contributed by SonyBMG