Monitoring

APF logs are available through the panda monitor: yes, they are! APF sets the GTAG environment variable, which the pilot passes through to panda and which then shows up in the monitor. Look at almost any job running at a European site, e.g.,

http://panda.cern.ch:80/server/pandamon/query?job=1209939301

Under "pilotID" you have the link to the stdout logfile. One possible improvement would be to link automatically to the stderr file (s/out/err) and to the condor log file (s/out/log) as well.
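
The s/out/err and s/out/log substitutions could be sketched roughly as follows. This is only an illustration: the function name is made up, and it assumes the stdout link always ends in ".out" and that the stderr and condor log files sit next to it with the same base name.

```python
def sibling_log_urls(stdout_url):
    """Derive the stderr and condor log URLs from a stdout log URL.

    Assumption: the three files share a base name and differ only in
    their final suffix (.out, .err, .log).
    """
    suffix = ".out"
    if not stdout_url.endswith(suffix):
        raise ValueError(f"expected a URL ending in {suffix}: {stdout_url!r}")
    # Only the trailing suffix is swapped, so an "out" elsewhere in the
    # URL (e.g. in a directory name) is left untouched.
    stem = stdout_url[: -len(suffix)]
    return {
        "stdout": stdout_url,
        "stderr": stem + ".err",
        "log": stem + ".log",
    }
```

Swapping only the trailing suffix is a deliberate choice here: a blind global replacement of "out" would also mangle any other part of the URL that happens to contain those letters.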

This should be wrapper independent, since it is passed in as part of the environment set by condor. If you're not getting this, something needs to be checked.
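
A first check, when the link is missing, is whether GTAG actually reaches the job environment. A minimal sketch of such a check is below; the helper name is hypothetical, and nothing is assumed about the value of GTAG beyond it being a non-empty string.

```python
import os


def get_gtag(environ=None):
    """Return the GTAG value from the environment, or None if unset or empty.

    `environ` defaults to the real process environment; passing a dict
    makes the check easy to try out without touching os.environ.
    """
    env = os.environ if environ is None else environ
    gtag = env.get("GTAG", "").strip()
    return gtag or None


if __name__ == "__main__":
    value = get_gtag()
    if value is None:
        print("GTAG is not set: check the environment passed in by condor")
    else:
        print("GTAG =", value)
```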

In the factory configuration, two important variables control this:

baseLogDir = /disk/panda/factory/auto/logs

baseLogDirUrl = http://svr017.gla.scotgrid.ac.uk/factory/auto/logs

baseLogDir is the local physical disk path; baseLogDirUrl is the http URL prefix.
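
The relationship between the two can be sketched as a simple prefix swap. The function below is illustrative only (its name and the example file path in the usage note are made up); it assumes that everything under baseLogDir is served unchanged under baseLogDirUrl.

```python
import posixpath

BASE_LOG_DIR = "/disk/panda/factory/auto/logs"
BASE_LOG_DIR_URL = "http://svr017.gla.scotgrid.ac.uk/factory/auto/logs"


def log_path_to_url(path, base_dir=BASE_LOG_DIR, base_url=BASE_LOG_DIR_URL):
    """Map a local log file path under base_dir to its http URL."""
    base_dir = base_dir.rstrip("/")
    # Normalise first so that ".." components cannot escape base_dir.
    path = posixpath.normpath(path)
    if path != base_dir and not path.startswith(base_dir + "/"):
        raise ValueError(f"{path!r} is not under {base_dir!r}")
    relative = path[len(base_dir):].lstrip("/")
    url = base_url.rstrip("/")
    return f"{url}/{relative}" if relative else url
```

For example, a (hypothetical) file /disk/panda/factory/auto/logs/somedir/1234.0.out would map to http://svr017.gla.scotgrid.ac.uk/factory/auto/logs/somedir/1234.0.out.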

What's missing from the panda monitor: an overview of the pilots going to a site, so that you know whether the site is broken, the factories serving it have died, etc.

Where is this information now: in Peter's monitor! See the last talk at s/w week for some details (I gave the talk, but the content was all his).

So, should this be in the panda monitor or not? It should be crosslinked from the monitor, but the key point was to have this on an independent database, so as not to add load to Oracle. It's monitoring, not accounting: some losses are OK, and the information is thrown away after a week.

Any factory can dispatch calls to the factory monitor, just by defining

monitorURL = http://apfmon.lancs.ac.uk/mon/

in their configuration. Peter has been gradually ramping up the number of factories to test scaling, so he can report on how well that's going.
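
To check what a given factory has configured, the monitoring-related settings can be pulled out of its configuration. The sketch below assumes an INI-style file with a section called "Factory"; both the section name and the helper are guesses made for illustration, so adjust them to the real layout.

```python
import configparser

# Sample text using the values quoted on this page; the [Factory]
# section header is an assumption.
SAMPLE_CONFIG = """
[Factory]
baseLogDir = /disk/panda/factory/auto/logs
baseLogDirUrl = http://svr017.gla.scotgrid.ac.uk/factory/auto/logs
monitorURL = http://apfmon.lancs.ac.uk/mon/
"""

MONITORING_KEYS = ("baseLogDir", "baseLogDirUrl", "monitorURL")


def monitoring_settings(config_text, section="Factory"):
    """Return the monitoring-related settings, with None for missing keys."""
    parser = configparser.ConfigParser()
    # By default option names are lower-cased; keep them as written.
    parser.optionxform = str
    parser.read_string(config_text)
    return {
        key: parser.get(section, key, fallback=None)
        for key in MONITORING_KEYS
    }


if __name__ == "__main__":
    for key, value in monitoring_settings(SAMPLE_CONFIG).items():
        print(f"{key} = {value}")
```

A factory that has not defined monitorURL would show None for that key, which is a quick way to spot factories that are not yet reporting to the factory monitor.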

In the end this should move to CERN, but we had the (usual) problems in obtaining and configuring a machine for it, so this hasn't progressed much.

This really is a shifter tool as well. It's used to help diagnose problems with site infrastructure and to submit tickets (especially when pilots don't start or abort before the payload can be executed).