Here are some of the things that you may find in your Slony-I logs, and explanations of what they mean.
These entries are pretty straightforward. They are informative messages about your configuration.
Here are some typical entries that you will probably run into in your logs:
    CONFIG main: local node id = 1
    CONFIG main: loading current cluster configuration
    CONFIG storeNode: no_id=3 no_comment='Node 3'
    CONFIG storePath: pa_server=5 pa_client=1 pa_conninfo="host=127.0.0.1 dbname=foo user=postgres port=6132" pa_connretry=10
    CONFIG storeListen: li_origin=3 li_receiver=1 li_provider=3
    CONFIG storeSet: set_id=1 set_origin=1 set_comment='Set 1'
    CONFIG main: configuration complete - starting threads
Debug notices are always prefaced by the name of the thread that the notice originates from. You will see messages from the following threads:
localListenThread: This is the local thread that listens for events on the local node.
remoteWorkerThread_%d: The thread processing remote events. You can expect to see one of these for each node that this node communicates with.
remoteListenThread_%d: Listens for events on a remote node database. You may expect to see one of these for each node in the cluster.
cleanupThread: Takes care of things like vacuuming, cleaning out the confirm and event tables, and deleting old data.
syncThread: Generates SYNC events.
Note that as far as slon is concerned, there is no "master" or "slave." They are just nodes.
Initially, you can expect to see some events propagating back and forth on both nodes. First, there should be some events published to indicate creation of the nodes and paths. If you don't see those, the nodes probably aren't able to communicate with one another, and nothing else will happen...
Create the two nodes. No slons are running yet, so there are no logs to look at.
Start the two slons. The logs for each will start out very quiet, as neither node has much to say, and neither node knows how to talk to any other node.
Do the STORE PATH for the communications paths. That will allow the nodes to start to become aware of one another. In version 1.0, sl_listen is not set up automatically, so things still remain quiet until you explicitly submit STORE LISTEN requests. In version 1.1, the “listen paths” are set up automatically, which gets the communications network up and running much more quickly.
If you look at the contents of the tables sl_node and sl_path and sl_listen, on each node, that should give a good idea as to where things stand. Until the slon starts, each node may only be partly configured. If there are two nodes, there should be two entries in all three of these tables once the communications configuration is set up properly. If there are fewer entries than that, well, that should give you some idea of what is missing.
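For instance, a quick way to check this on each node is with a few queries like the following; the "_foo" namespace assumes the cluster is named "foo", so substitute your own cluster name:

    -- run via psql against each node's database
    SELECT * FROM _foo.sl_node;
    SELECT * FROM _foo.sl_path;
    SELECT * FROM _foo.sl_listen;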
If needed (e.g. - before version 1.1), submit STORE LISTEN requests to indicate how the nodes will use the communications paths.
Once this has been done, the nodes' logs should show a greater level of activity, with events periodically being initiated on one node or the other, and propagating to the other.
You'll set up the set (CREATE SET), add tables (SET ADD TABLE), and sequences (SET ADD SEQUENCE), and will see relevant events only on the origin node for the set.
Then, when you submit the SUBSCRIBE SET request, the event should go to both nodes.
The origin node has little more to do, after that... The subscriber will then have a COPY_SET event, which will lead to logging information about adding each table and copying its data.
After that, you'll mainly see two sorts of behaviour:
On the origin, there won't be much logged, just indication that some SYNC events are being generated and confirmed by other nodes.
On the subscriber, there will be reports of SYNC events, and that the subscriber pulls data from the provider for the relevant set(s). This will happen infrequently if there are no updates going to the origin node; it will happen frequently when the origin sees heavy updates.
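Both sorts of activity can also be watched from the database side rather than the logs; for instance, again assuming a cluster named "foo":

    -- latest event generated by each origin node
    SELECT ev_origin, max(ev_seqno) AS last_event,
           max(ev_timestamp) AS last_event_at
      FROM _foo.sl_event
     GROUP BY ev_origin;

    -- latest event each node has confirmed for each origin
    SELECT con_origin, con_received, max(con_seqno) AS last_confirmed
      FROM _foo.sl_confirm
     GROUP BY con_origin, con_received;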
The script in the tools directory called pgsql_replication_check.pl represents about the best answer yet arrived at in several attempts to build replication tests to plug into the Nagios system monitoring tool.
A former script, test_slony_replication.pl, took a “clever” approach where a “test script” is periodically run, which rummages through the Slony-I configuration to find origin and subscribers, injects a change, and watches for its propagation through the system. It had two problems:
Connectivity problems to the single host where the test ran would make it look as though replication was destroyed; overall, this monitoring approach proved fragile in the face of numerous error conditions.
Nagios has no ability to benefit from the “cleverness” of automatically exploring the set of nodes.
The new script, pgsql_replication_check.pl, takes the minimalist approach of assuming that the system is an online system that sees regular “traffic,” so that you can define a view specifically for the replication test called replication_status which is expected to see regular updates. The view simply looks for the youngest “transaction” on the node, and lists its timestamp, age, and some bit of application information that might seem useful to see.
In an inventory system, that might be the order number for the most recently processed order.
In a domain registry, that might be the name of the most recently created domain.
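As a rough sketch, for the inventory example the view might be defined along these lines (the "orders" table and its columns here are hypothetical, and the column names the script expects may differ; check the script itself and adapt this to whatever table sees regular updates in your application):

    -- youngest transaction on this node, from a hypothetical "orders" table
    CREATE VIEW replication_status AS
      SELECT ordered_at          AS last_update,
             now() - ordered_at  AS age,
             order_number        AS details
        FROM orders
       ORDER BY ordered_at DESC
       LIMIT 1;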
An instance of the script will need to be run for each node that is to be monitored; that is the way Nagios works.
A separate script, still in preliminary stages, may be used to do some analysis of the state of a Slony-I cluster.
You specify arguments including database, host, user, cluster, password, and port to connect to any of the nodes on a cluster.
You also specify a mailprog command (which should be a program equivalent to Unix mailx) and a recipient of email.
The script then rummages through sl_path to find all of the nodes in the cluster, and the DSNs to allow it to, in turn, connect to each of them.
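Conceptually, that amounts to a query along the following lines (illustrative, not necessarily the exact query the script issues), again assuming a cluster named "foo":

    -- one conninfo per server node, as recorded in this node's sl_path
    SELECT DISTINCT pa_server, pa_conninfo
      FROM _foo.sl_path
     ORDER BY pa_server;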
For each node, the script examines the state of things, including such things as:
Checking sl_listen for some “analytically determinable” problems. It lists paths that are not covered.
Providing a summary of events by origin node
If a node hasn't submitted any events in a while, that likely suggests a problem.
Summarizing the “aging” of table sl_confirm
If one or another of the nodes in the cluster hasn't reported back recently, that tends to lead to cleanups of tables like sl_log_1 and sl_seqlog not taking place.
Summarizing what transactions have been running for a long time. This only works properly if the statistics collector is configured to collect command strings, as controlled by the option stats_command_string = true in postgresql.conf.
If you have broken applications that hold connections open, this will find them; such long-held connections have several unsalutary effects, as described in the FAQ.
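To give a flavour of these checks, the sl_confirm aging and long-running-transaction tests amount to queries along these lines (illustrative sketches, not the exact queries the script runs; the 30-minute threshold is just an example value):

    -- age of each node's confirmations of each origin's events
    SELECT con_origin, con_received,
           max(con_timestamp)         AS last_confirmation,
           now() - max(con_timestamp) AS age
      FROM _foo.sl_confirm
     GROUP BY con_origin, con_received
     ORDER BY age DESC;

    -- queries that have been running a long time; these column names
    -- (procpid, current_query, query_start) are those of the PostgreSQL
    -- releases current when this was written; newer releases rename them
    -- (pid, query) and add xact_start for true transaction age
    SELECT procpid, usename, query_start,
           now() - query_start AS running_for,
           current_query
      FROM pg_stat_activity
     WHERE now() - query_start > interval '30 minutes'
     ORDER BY query_start;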
The script does some diagnosis work based on parameters in the script; if you don't like the values, pick your favorites!