Index

AutoPyFactory and the Cloud
    Architecture
    Software deployed in the image
    Issues during tests

AutoPyFactory and the Cloud

Architecture:

The design consists of two AutoPyFactory (APF) queues working simultaneously.

The first AutoPyFactory queue submits jobs to a local Condor pool as needed to serve a given WMS job queue (e.g. a PanDA siteid). This local pool may have very few actual worker nodes, or none at all; in that case, the jobs simply wait in the queue.

The second AutoPyFactory queue inspects this local batch system, looking for jobs pending to run; that list of pending jobs acts as its WMS queue. If there are pending jobs (the equivalent of the "activated" status in PanDA, for example), OSG Worker Node VMs are submitted to an IaaS provider via Condor-G. Currently there is one AutoPyFactory submit plugin, for Amazon EC2 resources; plugins for other cloud resources may be added in the future.
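As a sketch, a Condor-G submit description for one such EC2 VM could look like the following (the AMI id, key file paths, and instance type are placeholders, not values from the actual setup):

```
universe              = grid
grid_resource         = ec2 https://ec2.amazonaws.com/
# For EC2 grid jobs the executable name is only a label shown in condor_q
executable            = osg_wn_vm
ec2_access_key_id     = /path/to/access_key_id
ec2_secret_access_key = /path/to/secret_access_key
ec2_ami_id            = ami-00000000
ec2_instance_type     = m1.small
queue
```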

Each WN starts a Condor startd daemon, which contacts a given Condor Central Manager and becomes part of its pool.
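On the VM side this requires only standard Condor configuration pointing the startd at the pool's Central Manager, along these lines (the hostname is a placeholder):

```
CONDOR_HOST = centralmanager.example.org
# Run only the master and the startd on the worker-node VM
DAEMON_LIST = MASTER, STARTD
```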

AutoPyFactory plugins can be adjusted so that the rate of VM job submissions to the IaaS provider matches the load of activated jobs in the associated WMS queue.
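The policy just described can be sketched as a small decision function (purely illustrative; the names and the per-cycle cap are assumptions, not actual AutoPyFactory code):

```python
def vms_to_submit(activated_jobs, running_vms, queued_vms, max_per_cycle=10):
    """Return how many new VM jobs to submit this cycle so that the
    number of VMs tracks the number of activated WMS jobs."""
    shortfall = activated_jobs - (running_vms + queued_vms)
    # Never submit a negative amount, and throttle the per-cycle rate
    return max(0, min(shortfall, max_per_cycle))
```

A scheduling cycle would then submit that many new VM jobs via Condor-G.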

Software deployed in the image:

Issues during tests:

Problem with the network: NAT, ...

When a startd contacts a Central Manager, the latter must be able to call the startd back.

However, the WN is behind a NAT, which prevents this callback. To solve this, the startd contacts the Central Manager via the Condor Connection Broker (CCB).
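In Condor terms, this amounts to setting something like the following on the worker-node VM (the CCB is assumed here to run alongside the pool's collector):

```
# Route inbound connections through the CCB on the Central Manager,
# so the CM can reach a startd that sits behind NAT
CCB_ADDRESS = $(COLLECTOR_HOST)
```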

Problem with the $HOME directory:

The first trial failed with an error line like:

02 Feb 17:00:36|runJob.py | Job command 1/1 failed: res = (2560,
'/cvmfs/atlas.cern.ch/repo/sw/software/i686-slc5-gcc43-opt/16.6.2/AtlasSetup/scripts/setup_runtime.sh:
line 36: cd: /root: Permission
denied\n/var/lib/condor/execute/dir_24497/Panda_Pilot_24618_1328201237/PandaJob_1418631160_1328201243')

The reason was that jobs were running as user 'nobody', whose home directory in /etc/passwd is '/'. But the shell environment variable $HOME could still be pointing to /root.
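A defensive guard in the job wrapper avoids this class of failure; the sketch below is illustrative (the function name and the scratch-directory fallback are assumptions, not code from the actual pilot):

```python
import os

def fix_home(scratch_dir=None):
    """If the inherited $HOME is unset or not writable (e.g. still /root
    for user 'nobody'), repoint it at a writable scratch directory."""
    home = os.environ.get("HOME", "")
    if not (home and os.access(home, os.W_OK)):
        os.environ["HOME"] = scratch_dir or os.getcwd()
    return os.environ["HOME"]
```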

Problem with the LFC (ATLAS-specific):

During the development phase, with the VM running on a box inside the BNL perimeter, it was not possible to contact the LFC catalog; for example, commands like lfc-mkdir failed. The reason is that, because the host is inside the BNL perimeter, DNS returns the internal IP for the LFC host, but the LFC host has no route to the sub-network where the VM is running.

NOTE: this should not be a problem when running in the cloud, outside BNL.