Minutes of 2nd VO Box Meeting
March 3-4, HEPiX (Rome)

Attendees:

Cal Loomis, Chair
Jeff Templon, NIKHEF

Steve Traylen, RAL
Davide Salomoni, CNAF
Sven Gabriel, FZK

Federico Carminati, ALICE

Miguel Branco, ATLAS 

Peter Elmer, CMS
Ian Fisk, CMS

Andrei Tsaregorodtsev, LHCb

2nd VO Box Meeting
==================

The purpose of this second VO Box meeting was to quickly revisit the
information presented in the previous meeting and then to provide a
list of actions which should be taken to evolve the current definition
of the VO Box. 

These 'minutes' are a short summary of the principles which should
drive the evolution of the VO Box and a set of concrete actions for
the near- to long-term.  

Current Situation and Evolution
-------------------------------

The first question addressed was whether VO Boxes are here to stay or
if a suitable evolution could lead to their extinction.  

Based on the information presented in the previous meeting, there is a
need for application-level services.  These services need to be
distributed and robust.  They do not necessarily need to have a
dedicated machine, but do need some computational resources to operate
correctly. 

In general, the points which are sensitive for the sites are 1)
services which must be run inside the trusted fabric and,
consequently, 2) the limits these place on the possible deployment
scenarios for the VO Boxes.  

The ATLAS and LHCb services do not need to be run inside the trusted
fabric.  They really only need a set of n sites which are willing to
run the services to make sure that they are highly-available and
robust.  Natural candidates for these distributed sites are the Tier 1
centers, but this is not a functional restriction. 

The ALICE and CMS VO Boxes do contain services which must be run
inside the trusted fabric (and also be deployed at each site).
Explicitly these services are:

-- PhEDEx (CMS): 'backdoor' to simulate SRMv2 functionality
   (specifically being able to list files in a given storage element).

-- Packman (ALICE): The package manager needs access to the shared
   file system on a site used for VO-specific software. 

-- xrootd (ALICE): POSIX-like access to files in a storage element.
   Can also manage files.  Can be run in a mode which doesn't touch
   the SE's internals, but this requires a large cache on the VO Box
   and a
   copy of each file accessed.  (This is referred to below as 'insane'
   mode.)

-- MonALISA (ALICE): Monitoring of some information on a site.  This
   currently uses just the information in the information system, so
   shouldn't actually be intrusive to the site.

-- Storage Adapter (ALICE): Shouldn't need direct access to site's
   resources.  Uses standard interfaces (if xrootd available). 

Removing these pieces from the VO Box would open up other deployment
options and significantly thin the VO Boxes themselves.  The evolution
should concentrate on these services first. 

Two classes of services were identified:

* CLASS 1: Can access site's services (and work correctly) from a
private network.  (I.e. is not within the trusted subset of a farm.)
Uses only service APIs/interfaces which are exposed to the external
world past their firewall.

* CLASS 2: Uses 'private' interfaces to access information/services at
the site (i.e. not exposed to those beyond the site's firewall).
Essentially this is anything which is not a CLASS 1 service. 

The CLASS 1 services, because they can exist outside the firewall, do
not pose any particular security or deployment problems.  The CLASS 2
services are touchy mainly because of the security issues. 
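
As an illustration only, the two class definitions can be expressed as
a simple predicate.  The sketch below is in Python; the VOService
type, the uses_private_interfaces flag and the example entries are
hypothetical, and the flags reflect one reading of the service
descriptions above rather than a classification agreed at the meeting.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class VOService:
    """Hypothetical record describing one VO-managed service."""
    name: str
    experiment: str
    # True if the service needs interfaces that are not exposed
    # beyond the site's firewall (assumption, set by hand below).
    uses_private_interfaces: bool


def service_class(service: VOService) -> int:
    """Return 1 for a CLASS 1 service and 2 for a CLASS 2 service."""
    return 2 if service.uses_private_interfaces else 1


def class2_services(services: List[VOService]) -> List[str]:
    """Names of the services that fall into CLASS 2."""
    return [s.name for s in services if service_class(s) == 2]


# Illustrative entries only; see the caveat in the text above.
EXAMPLE_SERVICES = [
    VOService("PhEDEx", "CMS", True),       # lists files via a 'backdoor'
    VOService("Packman", "ALICE", True),    # needs the shared software area
    VOService("MonALISA", "ALICE", False),  # reads published information only
]
```

Calling class2_services(EXAMPLE_SERVICES) on these entries would pick
out PhEDEx and Packman.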

In terms of these class definitions, the overall principle for
evolving the VO Boxes is: 

   "The goal of the VO Box evolution is to get rid of CLASS 2 services
    via incorporation into standard services, modification of current
    services, or removing need for service."

To drive the evolution there needs to be a dialog between the sites
and the experiments to ensure that we always have a working system
(for the experiments) and that the evolution happens as quickly as
possible (for the sites).

Note that both the sites and the experiments have a desire to move a
good portion of the VO Box functionality into the standard
middleware.  This reduces the support and maintenance load for the
experiments and allows a more controlled deployment for the sites. 

One recommendation is to keep the VO Box working group active, both to
continue the technical discussions between the sites and the
experiments and to monitor the evolution of the VO Box.

Policy Issues
-------------

These are a set of issues for which the group does not have the
technical experience necessary to make a definitive recommendation or
for which multiple technical solutions exist and the choice needs to
be made on non-technical requirements. 

* Outbound Network Access

One rationale for using the VO Box is to bridge the site's firewall.
The experiment's jobs do need to contact external services (catalogs,
databases, ...) and if a site does not provide outbound access a path
must be created for this to happen. 

This has been discussed many times.  The discussion always ends up
with the specification of a 'network proxy' which has essentially the
characteristics of standard NAT.  One could develop these services.
Given the current manpower situation and that this duplicates an
existing standard (non-grid) service, we believe it would be better
instead to require sites to provide outbound network access from the
worker nodes.  (This could be limited to a specific set of ports, if
desired.)

We recommend that the GDB require outbound access from sites.  Failing
that, it should identify development resources and timeline for
developing an acceptable network proxy. 
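
Whether a worker node actually has the outbound access discussed
above could be checked with a few lines of Python.  This is a minimal
sketch using only the standard socket module; the function names and
any endpoints passed to them are hypothetical and are not part of any
existing middleware tool.

```python
import socket
from typing import Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def can_connect_out(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the name and performs the TCP
        # handshake; the socket is closed again on leaving the block.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and resolution errors.
        return False


def check_outbound(endpoints: Iterable[Endpoint],
                   timeout: float = 5.0) -> Dict[Endpoint, bool]:
    """Test each (host, port) pair and report which are reachable."""
    return {(host, port): can_connect_out(host, port, timeout)
            for host, port in endpoints}
```

For example, check_outbound([("catalog.example.org", 443)]) would
return a dictionary mapping each (host, port) pair to True or False;
the host name here is a placeholder.  A site which limits outbound
access to a specific set of ports could use the same check to confirm
that exactly those ports are open.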

* Service Certificates for VO Services

The services run by the VOs must be secured.  This requires running
those services with a certificate.  A host certificate is not
appropriate because the responsibility for those rests with the site
administrator.  It also precludes using the same box for several
different VOs (because the host certificate should not be shared).
In the end, these services need to run with service certificates.
Apparently there is a procedure for obtaining these in the UK.  The
general procedure for the other regions is not known. 

The GDB should ask the appropriate security group to provide a
document describing the procedure for obtaining a service certificate,
including who should request it and any limitations (e.g. on the DN of
the certificate). 

* Proxy Handling by VO Services

Currently some VO-managed services handle user proxies directly (or
have direct access to them).  Consequently, the VO users which manage
these services are essentially 'superusers' which can impersonate
other members of the VO.  This alters the trust model of the grid
(currently only system administrators are trusted in this way).  It
may also have unintended consequences when people are members of
multiple VOs.  That is, a VO superuser can impersonate someone with
membership in multiple VOs and gain access to the other VO's
resources. 

This issue should be discussed by the appropriate security group.  It
should come up with recommendations on how this should be handled.
(Possible outcomes: VOs shouldn't do this; VO superusers should go
through a registration like the site administrators; it isn't a
problem; ....)

VO Box Core Services
--------------------

The following are the core services required by the experiments for
the VO Boxes: 

* Interactive access to the VO Box by a limited number of users.  This
  is preferably done consistently for all VO Boxes (currently
  gsissh). 

* All grid client tools.  Essentially the set of grid client tools
  available from the UI or WN (with one exception below).

* Service to handle proxy renewal.  Needed by VO-managed services
  which initiate (or control) actions in the name of the different
  members of the VO. 

All of the other services are specific to the VO. 

Specifically, the following services are *NOT* required on the VO Box: 

* gridftp server

* gatekeeper

These were included on older versions of the VO Box and should be
removed if still present. 

Points for VO Box Evolution
---------------------------

The following points were identified as needed improvements to the
proxy renewal utilities:

* Should handle VOMS renewal as well.  (Probably will require a more
  flexible indexing rather than just the DN.  Jobs from the same user
  could have different VOMS roles or capabilities.)

* Should run as root and use the host certificate for the service and
  to renew the proxies. 

* Determine a flexible, scalable mechanism on who can renew the
  proxies.  (Perhaps using VOMS roles.)
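
The indexing point above can be sketched as follows.  This Python
fragment keys stored proxies on the DN together with the ordered list
of VOMS attributes (FQANs) instead of the DN alone, so that two jobs
from the same user with different roles get separate entries.  The
class and method names, the DN and the FQAN strings are all
hypothetical; this is not the interface of the existing proxy renewal
utilities.

```python
from typing import Dict, Iterable, NamedTuple, Optional, Tuple


class ProxyKey(NamedTuple):
    """Index key: the user's DN plus the VOMS attributes of the proxy."""
    dn: str
    fqans: Tuple[str, ...]  # order is kept, it is part of the key


class RenewalRegistry:
    """Hypothetical in-memory index of proxies awaiting renewal."""

    def __init__(self) -> None:
        self._proxies: Dict[ProxyKey, str] = {}

    def register(self, dn: str, fqans: Iterable[str],
                 proxy_path: str) -> ProxyKey:
        """Record where the proxy for this DN and attribute set lives."""
        key = ProxyKey(dn, tuple(fqans))
        self._proxies[key] = proxy_path
        return key

    def lookup(self, dn: str, fqans: Iterable[str]) -> Optional[str]:
        """Return the stored proxy path, or None if nothing matches."""
        return self._proxies.get(ProxyKey(dn, tuple(fqans)))
```

With a DN-only index, a second registration for the same user would
overwrite the first; here a proxy registered with a production role
and one registered with plain VO membership stay distinct.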


ALICE uses dynamic package management.  This currently requires that
the VO Box have access to the shared experiment software area.  The
overall idea is that this functionality should be folded into the
standard middleware somehow.  The three ideas which came up were:

* Implementing the package management as a standard service.  (Either
  adapting the ALICE package manager or developing a new one.)  The
  requirements for this were identified by GAG a couple of years ago. 

* glexec (grid sudo)  This would allow the job itself to run the
  package manager with the appropriate rights.  This would avoid
  having to do this on the VO Box; however, the development timeline
  of glexec and its inclusion in the system seems to be in the medium
  to long-term. 

* Provide a short queue with guaranteed fast access.  This changes
  the mechanism of software installation significantly for ALICE.

Of the three alternatives, the only real solution seems to be the
first one.  The others are not feasible on a short timescale or alter
the ALICE computing model. 

The functionality needed by CMS (essentially srm-ls) is supposed to be
included in SRM v2.  Moving this functionality from the VO Boxes will
have to wait until the SRM v2 is widely deployed and stable.  However,
the project should push for this to happen as soon as possible. 

Evolution Strategy
------------------

The overall evolution strategy is to eliminate as quickly as possible
the CLASS 2 services from the system.  This can be accomplished by
moving some functionality into the middleware or by developing new
standard services. 

Candidates for inclusion in gLite middleware:

* ALICE package manager (2+ experiments want the functionality)  Would
  need to determine if adapting the service is easier than writing a
  new standard service. 

* xrootd: (2 experiments would like this; others would evaluate)  To
  make this fully functional requires some SRM/xrootd integration (see
  below). 

* MonALISA: (ALICE/CMS/LHCb interested)

Services to be developed within gLite middleware:

* Proxy renewal service (with the modifications above)

* Package manager:  If the ALICE package manager isn't appropriate to
  modify/generalize.  (Requirements exist as a GAG document.)

* Delegation service/API.

* SRM/xrootd integration work

* FTS VO plugins.  They need to work out a reasonable model for where
  the associated plugins will run, with what credentials, ....

* VO-service framework (longer term development).  There was a general
  realization that many of the VO Box security problems would be
  solved with a VO-service framework which sandboxed the VO services.
  This was deemed a longer-term development, but one which should be
  kept in mind.  The delegation service/API and the proxy renewal
  service/API would probably be integral parts of that framework. 

Operational Issues:

* SFT tests *MUST* be developed for the core services on the VO Box. 

* The documentation for writing application-specific SFTs must be made
  available to the application developers. 

* Responsibilities: The site administrators are responsible only for
  the core services and OS on the VO Box.  All other services are the
  responsibility of the VO itself. 

* Related to this: Operating a VO Box does not imply that the site
  will provide a 'shrink-wrapped' operator.  The experiments must
  provide the effort to run the VO-specific services. 

* Eliminate use of shared credentials. (Must be traceable.)  It must
  be possible for all experiment users to be mapped to individual
  accounts.  This includes especially the 'software manager' and
  'production manager' accounts.  Software must be modified to
  accommodate this.  

* Running GRIS.  Need to have a way to publish VO-services.  (This
  may already exist but needs to be determined.)

General Issues:

* FTS security model must be brought in line with standard one.  That
  is, FTS should use standard MyProxy renewal of certificates and not
  use a password to retrieve a new proxy.