The first piece of code that the PanDA system submits to sites, through the different job submission mechanisms, is called the "pilot wrapper". It is the first code that executes on the worker node: it performs some environment checks and downloads, from a valid URL, the next piece of code needed to continue operations, which in PanDA nomenclature is called the "pilot".
This "pilot wrapper" is not unique. There are multiple versions of this part of the system, depending on, for example, the final pilot type and the grid flavor.
This multiplicity makes it necessary to maintain several pieces of software even though they share a common purpose.
On the other hand, for practical reasons, these pilot wrappers are implemented in BASH, with the consequent lack of flexibility and the inherent difficulty of implementing complicated operations. One practical case is the need to generate weighted random numbers to pick a specific development version of the ATLAS code only a given percentage of the time. Generating weighted random numbers is more complicated in BASH.
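As an illustration of why Python is more convenient for this case, the following is a minimal sketch of a weighted random choice using the standard library. The version names are borrowed from the lookup table shown later in this page; the 99/1 split and the function name pick_version are assumptions made only for this example.

```python
import random

# Hypothetical weights, chosen only for illustration: the release candidate
# is assumed to be picked about 1% of the time.
VERSIONS = ["pilotcode", "pilotcode-rc"]
WEIGHTS = [99, 1]


def pick_version(versions=VERSIONS, weights=WEIGHTS, rng=random):
    """Return one entry of `versions`, drawn with probability
    proportional to the matching entry of `weights`."""
    return rng.choices(versions, weights=weights, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same draws
    draws = [pick_version(rng=rng) for _ in range(10000)]
    # Print how often each version was selected.
    print({v: draws.count(v) for v in VERSIONS})
```

The same operation in BASH would typically need manual arithmetic on $RANDOM, which is the kind of complication the text refers to.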
Finally, the new AutoPyFactory pilot submission mechanism was introduced. In its first version, this new pilot submission tool was implemented to submit a specific ad hoc pilot wrapper, with a different set of input options and different formats. Moreover, this specific pilot wrapper is only valid for ATLAS in EGEE; it cannot be used for other purposes or at OSG sites. This discrepancy adds to the multiplicity of pilot wrapper versions, and makes it harder to deploy AutoPyFactory as the submission tool replacing the existing AutoPilot.
A final reason is that these wrappers require some improvements. One example is the absence of proper validation of the number and format of the input options. Given that these improvements are important, it will always be easier to introduce and maintain them in a single piece of code than in several.
For these reasons it was agreed that a refactoring of the different pilot wrappers was needed. The proposal is to create a single pilot wrapper implemented in BASH language, performing the minimum amount of checking operations. This unique code should be valid for any kind of final application, grid flavor environment, submission tool, etc. In particular, it will allow the easy deployment of AutoPyFactory as pilot submission tool.
After checking the presence of required programs needed to continue with operations, and setting up the corresponding grid environment if needed, a second piece of code will be downloaded from a valid URL to continue operations. This second code will now be written in Python, which allows for more complex operations implemented in an easier manner. Therefore, its maintainability and scalability will be improved. This will require the reimplementation of all BASH code from the multiple pilot wrappers, except those operations already done by the new unified wrapper, in Python. Finally, in this second step, the final payload code to be run will be chosen, downloaded, and executed.
A generic wrapper with minimal functionalities
input options:
--wrappervo=vo | the Virtual Organization. |
--wrapperwmsqueue=wmsqueue | is the wms queue (e.g. the panda siteid) |
--wrapperbatchqueue=batchqueue | is the batch queue (e.g. the panda queue) |
--wrappergrid=grid_middleware | is the grid flavor, i.e. OSG or EGI. It is passed as an input option, instead of letting the wrapper discover the current platform by itself, so that two scenarios can be distinguished: (a) running on a local cluster; (b) running on the grid, but the setup file is missing. Case (b) is a failure and should be reported, whereas (a) is fine. Another reason to include wrappergrid as an option in this very first wrapper is that, at sites running condor as the local batch system, the $PATH environment variable is set up only after sourcing the OSG setup file. Only with $PATH properly set up is it possible to run curl/wget to download the rest of the files, or python to execute them. |
--wrapperpurpose=purpose | will be the VO in almost all cases, but not necessarily when several groups share the same VO. An example is VO OSG, shared by CHARMM, Daya, OSG ITB testing group... |
--wrapperserverurl=url | is the url with the PanDA server instance |
--wrappertarballurl=url | is the base url with the wrapper tarball to be downloaded |
--wrapperspecialcmd=special_cmd | is a special command to be performed, for some specific reason, just after sourcing the grid environment but before doing anything else. This was triggered by the need to execute the command $ module load <module_name> at NERSC after sourcing the OSG grid environment. |
--wrapperplugin=plugin | is the plug-in module with the code corresponding to the final wrapper flavor. |
--wrapperpilottype=pilot_type | is the actual pilot code to be executed at the end. |
--wrapperloglevel=log_level | is a flag to activate high verbosity mode. Accepted values are debug or info. |
--wrappermode=mode | allows performing all steps but querying and running a real job. |
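The validation of the number and format of these options could be sketched in Python with argparse, as below. The option names come from the table above. Which options are mandatory, the default log level, and the decision to hand unrecognised options back to the caller are all assumptions of this sketch, not part of the design described here.

```python
import argparse

# Assumption: only these three options are treated as mandatory.
REQUIRED = ["wrappervo", "wrapperwmsqueue", "wrapperbatchqueue"]
# Assumption: every other option may be omitted.
OPTIONAL = ["wrappergrid", "wrapperpurpose", "wrapperserverurl",
            "wrappertarballurl", "wrapperspecialcmd", "wrapperplugin",
            "wrapperpilottype", "wrappermode"]


def build_parser():
    # allow_abbrev=False so that only the full option names are accepted.
    parser = argparse.ArgumentParser(prog="wrapper", allow_abbrev=False)
    for name in REQUIRED:
        parser.add_argument("--" + name, required=True)
    for name in OPTIONAL:
        parser.add_argument("--" + name, default=None)
    # The accepted values are the ones listed in the table; the default
    # of "info" is an assumption.
    parser.add_argument("--wrapperloglevel",
                        choices=["debug", "info"], default="info")
    return parser


def parse_options(argv):
    """Return (options, extra). `extra` holds any arguments that are not
    wrapper options, left untouched for whatever runs next."""
    return build_parser().parse_known_args(argv)


if __name__ == "__main__":
    opts, extra = parse_options([
        "--wrappervo=ATLAS",
        "--wrapperwmsqueue=BNL_TEST_APF",
        "--wrapperbatchqueue=BNL_TEST_APF-condor",
        "--wrapperloglevel=debug",
    ])
    print(opts.wrappervo, opts.wrapperloglevel, extra)
```

With this approach a missing mandatory option or an invalid log level makes argparse print an error and exit, instead of failing later in a less obvious way.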
This is the suggested architecture:
AutoPyFactory ---> wrapper.sh ---> wrapper.py
wrapper.sh downloads a tarball (wrapper.tar.gz), untars it, and invokes wrapper.py. The content of the tarball is something like this:
- wrapper.py
- wrapperutils.py
- lookuptable.conf
- plugins/base.py
- plugins/<pilottype1>.py
- plugins/<pilottypeN>.py
The different plug-ins correspond to the different wrapper flavors, so far written in BASH (for example, trivialWrapper.sh, atlasProdPilotWrapper.sh, atlasProdPilotWrapperCERN.sh, atlasProdPilotWrapperUS.sh, etc.). All of these wrappers share a lot of common functionality, with only small differences between them. To take advantage of that, the different wrapper flavors will be implemented as plug-ins.
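One possible shape for such plug-ins is a base class holding the common steps, with each flavor overriding only what differs. The sketch below is purely illustrative: the class names, the method names (setup, payload_command, execute) and the in-memory registry are all hypothetical, and the steps are plain strings rather than real commands. In the tarball layout above, each class would presumably live in its own module under plugins/ and be imported by name.

```python
class BasePlugin:
    """Common behaviour shared by every wrapper flavor (hypothetical API)."""

    def __init__(self, opts):
        self.opts = opts

    def setup(self):
        # Steps that every flavor performs.
        return ["common setup"]

    def payload_command(self):
        # Each flavor must say what it finally runs.
        raise NotImplementedError

    def execute(self):
        steps = self.setup()
        steps.append(self.payload_command())
        return steps


class TrivialPlugin(BasePlugin):
    def payload_command(self):
        return "run trivialPilot"


class AtlasProdPilotPlugin(BasePlugin):
    def setup(self):
        # Reuse the common steps and add the flavor-specific one.
        steps = super().setup()
        steps.append("set up ATLAS environment")
        return steps

    def payload_command(self):
        return "run pilotcode"


# Hypothetical registry keyed by the plug-in name given to the wrapper.
PLUGINS = {"trivial": TrivialPlugin, "atlasprodpilot": AtlasProdPilotPlugin}


def load_plugin(name, opts):
    """Instantiate the plug-in registered under `name`."""
    if name not in PLUGINS:
        raise ValueError("unknown plug-in: %s" % name)
    return PLUGINS[name](opts)


if __name__ == "__main__":
    print(load_plugin("atlasprodpilot", {}).execute())
```

The point of the design is that a small difference between two flavors becomes a small override, instead of a second copy of the whole wrapper.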
The current mechanism to choose the right plugin is implemented by inspecting a lookup table like this one:
# -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# VO    PURPOSE  GRID  WMSQUEUE         BATCHQUEUE              PLUGIN             PILOTCODE               PILOTCODEURL
# -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
# --- default values ---
ATLAS   *        *     +                +                       atlasprodpilot     pilotcode,pilotcode-rc  http://pandaserver.cern.ch:25080/cache/pilot
ATLAS   devel    *     +                +                       atlasprodpilotdev  pilotcode-dev           http://project-atlas-gmsb.web.cern.ch/project-atlas-gmsb
OSG     *        OSG   +                +                       trivial            trivialPilot            http://pandaserver.cern.ch:25080/cache/pilot
# --- testing sites ---
OSG     *        *     TEST2            TEST2                   trivial            trivialPilot            http://pandaserver.cern.ch:25080/cache/pilot
OSG     *        *     TEST3            TEST3                   trivial            trivialPilot            http://pandaserver.cern.ch:25080/cache/pilot
ATLAS   *        *     ANALY_TEST-APF   ANALY_TEST-APF-condor   atlasprodpilot     pilotcode,pilotcode-rc  http://pandaserver.cern.ch:25080/cache/pilot
ATLAS   *        *     ANALY_TEST-APF2  ANALY_TEST-APF2-condor  atlasprodpilot     pilotcode,pilotcode-rc  http://pandaserver.cern.ch:25080/cache/pilot
ATLAS   *        *     BNL_TEST_APF     BNL_TEST_APF-condor     atlasprodpilot     pilotcode,pilotcode-rc  http://pandaserver.cern.ch:25080/cache/pilot
+ means any value is accepted, but one must be provided
* means any value is accepted, or no value was provided
- means no value was provided
if no value was provided for a given field:
1) first '-' will be used
2) if the field is not '-', then '*' will be checked
if a value was provided for a given field:
1) the value is searched
2) if the value is not in the field, the '+' will be checked
3) finally, '*' will be checked
Every row has the same number of columns.
The first N columns are the patterns, and the rest are outputs.
That means that a number N of input values will be provided each time,
the row matching those inputs is selected, and the output values will be returned.
Columns will be parsed for matching from left to right.
This means the first column is the most important field for matching, the second column is the next most important field, and so on.
When one of the provided inputs matches exactly the content of the field in the table, that row is selected.
If the given input is not provided, or is not in the table, then the fields with symbols are inspected.
If no row matches completely with the input values, then None should be returned.
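The matching rules above can be sketched as a column-by-column filter. This is one reading of the rules, not the actual lookup code: for each pattern column, from left to right, the candidates are tried in priority order (the value, then '+', then '*' when a value was given; '-', then '*' when it was not), and only the rows matching the first successful candidate are kept. The function names and the shortened table (the PILOTCODEURL column is left out for brevity) are assumptions of this example.

```python
# Shortened, illustrative copy of the lookup table: five pattern columns
# followed by two output columns.
TABLE = """\
# VO    PURPOSE  GRID  WMSQUEUE  BATCHQUEUE  PLUGIN             PILOTCODE
ATLAS   *        *     +         +           atlasprodpilot     pilotcode,pilotcode-rc
ATLAS   devel    *     +         +           atlasprodpilotdev  pilotcode-dev
OSG     *        OSG   +         +           trivial            trivialPilot
OSG     *        *     TEST2     TEST2       trivial            trivialPilot
"""


def parse_table(text):
    """Split the table into rows of fields, skipping comments and blanks."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rows.append(line.split())
    return rows


def candidates(value):
    """Field contents to try, in priority order, for one input."""
    if value is None:
        return ["-", "*"]      # no value provided
    return [value, "+", "*"]   # value provided


def lookup(rows, inputs):
    """Return the output fields of the selected row, or None.

    `inputs` holds one entry per pattern column, with None meaning
    that no value was provided for that column.
    """
    remaining = rows
    for col, value in enumerate(inputs):
        for token in candidates(value):
            matched = [row for row in remaining if row[col] == token]
            if matched:
                remaining = matched
                break
        else:
            # No candidate matched this column in any remaining row.
            return None
    return remaining[0][len(inputs):]


if __name__ == "__main__":
    rows = parse_table(TABLE)
    print(lookup(rows, ["ATLAS", "devel", None, "SOME_QUEUE", "SOME_QUEUE-condor"]))
    print(lookup(rows, ["OSG", None, None, "TEST2", "TEST2"]))
    print(lookup(rows, ["CMS", None, None, "SOME_QUEUE", "SOME_QUEUE-condor"]))
```

Filtering one column at a time is what makes the leftmost column the most important one: a row rejected on VO can never be recovered by a better match further to the right.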