Minutes of 2nd VO Box Meeting
March 3-4, HEPiX (Rome)

Attendees:
Cal Loomis, Chair
Jeff Templon, NIKHEF
Steve Traylen, RAL
Davide Salomoni, CNAF
Sven Gabriel, FZK
Federico Carminati, ALICE
Miguel Branco, ATLAS
Peter Elmer, CMS
Ian Fisk, CMS
Andrei Tsaregorodtsev, LHCb

2nd VO Box Meeting
==================

The purpose of this second VO Box meeting was to quickly revisit the information presented in the previous meeting and then to provide a list of actions which should be taken to evolve the current definition of the VO Box. These 'minutes' are a short summary of the principles which should drive the evolution of the VO Box and a set of concrete actions for the near to long term.

Current Situation and Evolution
-------------------------------

The first question addressed was whether VO Boxes are here to stay or if a suitable evolution could lead to their extinction.

Based on the information presented in the previous meeting, there is a need for application-level services. These services need to be distributed and robust. They do not necessarily need a dedicated machine, but they do need some computational resources to operate correctly.

In general, the points which are sensitive for the sites are 1) services which must be run inside the trusted fabric and, consequently, 2) the limits this places on the possible deployment scenarios for the VO Boxes.

The ATLAS and LHCb services do not need to be run inside the trusted fabric. They really only need a set of n sites which are willing to run the services, to make sure that they are highly available and robust. Natural candidates for these distributed sites are the Tier 1 centers, but this is not a functional restriction.

The ALICE and CMS VO Boxes do contain services which must be run inside the trusted fabric (and must also be deployed on each site). Explicitly these services are:

-- PhEDEx (CMS): 'backdoor' to simulate SRMv2 functionality (specifically being able to list files in a given storage element).
-- Packman (ALICE): The package manager needs access to the shared file system on a site used for VO-specific software.
-- xrootd (ALICE): POSIX-like access to files in a storage element. Can also manage files. Can be run in a mode which doesn't touch the SE's internals, but this requires a large cache on the VO Box and a copy of each file accessed. (This is referred to below as 'insane' mode.)
-- MonALISA (ALICE): Monitoring of some information on a site. This currently uses just the information in the information system, so shouldn't actually be intrusive to the site.
-- Storage Adapter (ALICE): Shouldn't need direct access to site's resources. Uses standard interfaces (if xrootd available).

Removing these pieces from the VO Box would open up other deployment options and significantly thin the VO Boxes themselves. The evolution should concentrate on these services first.

Two classes of services were identified:

* CLASS 1: Can access the site's services (and work correctly) from a private network. (I.e. is not within the trusted subset of a farm.) Uses only service APIs/interfaces which are exposed to the external world past the site's firewall.
* CLASS 2: Uses 'private' interfaces to access information/services at the site (i.e. not exposed to those beyond the site's firewall). Essentially this is anything which is not a CLASS 1 service.

The CLASS 1 services, because they can exist outside the firewall, do not pose any particular security or deployment problems. The CLASS 2 services are touchy mainly because of the security issues.

In terms of these class definitions, the overall principle for evolving the VO Boxes is:

"The goal of the VO Box evolution is to get rid of CLASS 2 services via incorporation into standard services, modification of current services, or removing need for service."
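
As a rough illustration of the two class definitions above, the classification can be written down as a small data model. This is only a sketch in Python: the `Service` type, the `uses_private_interfaces` flag, and the flag values given to each service are assumptions made for illustration (one reading of the service list above), not an assessment agreed at the meeting.

```python
from dataclasses import dataclass
from enum import Enum


class ServiceClass(Enum):
    CLASS_1 = 1  # uses only interfaces exposed beyond the site's firewall
    CLASS_2 = 2  # uses at least one 'private' site interface


@dataclass(frozen=True)
class Service:
    name: str
    vo: str
    uses_private_interfaces: bool  # hypothetical flag, set by hand


def classify(service: Service) -> ServiceClass:
    """CLASS 2 is anything which is not a CLASS 1 service."""
    if service.uses_private_interfaces:
        return ServiceClass.CLASS_2
    return ServiceClass.CLASS_1


def evolution_targets(services):
    """Return the services the evolution should concentrate on first."""
    return [s for s in services if classify(s) is ServiceClass.CLASS_2]


# Flag values are an illustrative reading of the list above, nothing more.
SERVICES = [
    Service("PhEDEx file-listing 'backdoor'", "CMS", True),
    Service("Packman", "ALICE", True),
    Service("xrootd (normal mode)", "ALICE", True),
    Service("MonALISA", "ALICE", False),
    Service("Storage Adapter", "ALICE", False),
]

if __name__ == "__main__":
    for s in evolution_targets(SERVICES):
        print(f"{s.vo}: {s.name}")
```

Under these assumed flags, the evolution targets are the PhEDEx backdoor, Packman, and xrootd in its normal mode.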

To drive the evolution there needs to be a dialog between the sites and the experiments to ensure that we always have a working system (for the experiments) and that the evolution happens as quickly as possible (for the sites). Note that both the sites and the experiments have a desire to move a good portion of the VO Box functionality into the standard middleware. This reduces the support and maintenance load for the experiments and allows a more controlled deployment for the sites.

One recommendation is to keep the VO Box working group in place, so that it can continue the technical discussions between the sites and the experiments and monitor the evolution of the VO Box.

Policy Issues
-------------

These are a set of issues for which the group does not have the technical experience necessary to make a definitive recommendation, or for which multiple technical solutions exist and the choice needs to be made on non-technical requirements.

* Outbound Network Access

One rationale for using the VO Box is to bridge the site's firewall. The experiment's jobs do need to contact external services (catalogs, databases, ...) and if a site does not provide outbound access a path must be created for this to happen.

This has been discussed many times. The discussion always ends up with the specification of a 'network proxy' which has essentially the characteristics of standard NAT. One could develop these services. Given the current manpower situation and that this duplicates an existing standard (non-grid) service, we believe it would be better instead to require sites to provide outbound network access from the worker nodes. (This could be limited to a specific set of ports, if desired.)

We recommend that the GDB require outbound access from sites. Failing that, it should identify development resources and a timeline for developing an acceptable network proxy.

* Service Certificates for VO Services

The services run by the VOs must be secured.
This requires running those services with a certificate. A host certificate is not appropriate because the responsibility for those rests with the site administrator. It also precludes using the same box for several different VOs (because the host certificate should not be shared).

In the end, these services need to run with service certificates. Apparently there is a procedure for obtaining these in the UK. The general procedure for the other regions is not known.

The GDB should ask the appropriate security group to provide a document describing the procedure for obtaining a service certificate, including who should request it and any limitations (e.g. on the DN of the certificate).

* Proxy Handling by VO Services

Currently some VO-managed services handle user proxies directly (or have direct access to them). Consequently, the VO users who manage these services are essentially 'superusers' who can impersonate other members of the VO. This alters the trust model of the grid (currently only system administrators are trusted in this way).

It may also have unintended consequences when people are members of multiple VOs. That is, a VO superuser can impersonate someone with membership in multiple VOs and gain access to the other VO's resources.

This issue should be discussed by the appropriate security group. It should come up with recommendations on how this should be handled. (Possible outcomes: VOs shouldn't do this; VO superusers should go through a registration like the site administrators; it isn't a problem; ....)

VO Box Core Services
--------------------

The following are the core services required by the experiments for the VO Boxes:

* Interactive access to the VO Box by a limited number of users. This is preferably done consistently for all VO Boxes (currently gsissh).
* All grid client tools. Essentially the set of grid client tools available from the UI or WN (with one exception below).
* Service to handle proxy renewal.
  Needed by VO-managed services which initiate (or control) actions in the name of the different members of the VO.

All of the other services are specific to the VO. Specifically, the following services are *NOT* required on the VO Box:

* gridftp server
* gatekeeper

These were included on older versions of the VO Box and should be removed if still present.

Points for VO Box Evolution
---------------------------

The following points were identified as needed improvements to the proxy renewal utilities:

* Should handle VOMS renewal as well. (Probably will require a more flexible indexing rather than just the DN. Jobs from the same user could have different VOMS roles or capabilities.)
* Should run as root and use the host certificate for the service and to renew the proxies.
* Determine a flexible, scalable mechanism for deciding who can renew the proxies. (Perhaps using VOMS roles.)

ALICE uses dynamic package management. This currently requires that the VO Box have access to the shared experiment software area. The overall idea is that this functionality should be folded into the standard middleware somehow. The three ideas which came up were:

* Implementing the package management as a standard service. (Either adapting the ALICE package manager or developing a new one.) The requirements for this were identified by GAG a couple of years ago.
* glexec (grid sudo). This would allow the job itself to run the package manager with the appropriate rights. This would avoid having to do this on the VO Box; however, the development timeline of glexec and its inclusion in the system seems to be in the medium to long term.
* Provide a short queue with guaranteed fast access. This changes the mechanism of software installation significantly for ALICE.

Of the three alternatives, the only real solution seems to be the first one. The others are not feasible on a short timescale or alter the ALICE computing model.
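
Returning to the proxy renewal point earlier in this section: the 'more flexible indexing' could, for example, key each stored proxy on the DN together with its VOMS attributes, so that two proxies from the same user with different roles are tracked separately. The Python sketch below is only an illustration under that assumption. `ProxyRegistry`, `proxy_key`, the example DN, and the example attribute strings are all hypothetical and do not describe the existing proxy renewal service.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Iterable, List, Tuple

# Index key: the user's DN plus the set of VOMS attributes on the proxy.
ProxyKey = Tuple[str, FrozenSet[str]]


def proxy_key(dn: str, fqans: Iterable[str]) -> ProxyKey:
    """Build an index key that distinguishes roles for the same DN."""
    return (dn, frozenset(fqans))


@dataclass
class ProxyRegistry:
    # Maps each key to the proxy expiry time (seconds, arbitrary epoch).
    expiry: Dict[ProxyKey, float] = field(default_factory=dict)

    def register(self, dn: str, fqans: Iterable[str], expires_at: float) -> None:
        self.expiry[proxy_key(dn, fqans)] = expires_at

    def due_for_renewal(self, now: float, margin: float) -> List[ProxyKey]:
        """Return keys whose proxies expire within `margin` seconds of `now`."""
        return [k for k, t in self.expiry.items() if t - now <= margin]


registry = ProxyRegistry()
dn = "/O=Example/CN=Some User"  # hypothetical DN
registry.register(dn, ["/alice/Role=production"], expires_at=1000.0)
registry.register(dn, ["/alice/Role=lcgadmin"], expires_at=5000.0)
```

With a DN-only index the second registration would overwrite the first. With the combined key both entries are kept, and each can be renewed on its own schedule.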

The functionality needed by CMS (essentially srm-ls) is supposed to be included in SRM v2. Moving this functionality from the VO Boxes will have to wait until SRM v2 is widely deployed and stable. However, the project should push for this to happen as soon as possible.

Evolution Strategy
------------------

The overall evolution strategy is to eliminate the CLASS 2 services from the system as quickly as possible. This can be accomplished by moving some functionality into the middleware or by developing new standard services.

Candidates for inclusion in gLite middleware:

* ALICE package manager (2+ experiments want the functionality)
  Would need to determine if adapting the service is easier than writing a new standard service.
* xrootd: (2 experiments would like this; others would evaluate)
  To make this fully functional requires some SRM/xrootd integration (see below).
* MonALISA: (ALICE/CMS/LHCb interested)

Services to be developed within gLite middleware:

* Proxy renewal service (with the modifications above)
* Package manager: If the ALICE package manager isn't appropriate to modify/generalize. (Requirements exist as a GAG document.)
* Delegation service/API.
* SRM/xrootd integration work
* FTS VO plugins. A reasonable model needs to be worked out for where the associated plugins will run, with what credentials, ....
* VO-service framework (longer term development).

There was a general realization that many of the VO Box security problems would be solved with a VO-service framework which sandboxed the VO services. This was deemed a longer-term development, but one which should be kept in mind. The delegation service/API and the proxy renewal service/API would probably be integral parts of that framework.

Operational Issues:

* SFT tests *MUST* be developed for the core services on the VO Box.
* The documentation for writing application-specific SFTs must be made available to the application developers.
* Responsibilities: The site administrators are responsible only for the core services and OS on the VO Box. All other services are the responsibility of the VO itself.
* Related to this: Operating a VO Box does not imply that the site will provide a 'shrink-wrapped' operator. The experiments must provide the effort to run the VO-specific services.
* Eliminate use of shared credentials. (Must be traceable.) It must be possible for all experiment users to be mapped to individual accounts. This includes especially the 'software manager' and 'production manager' accounts. Software must be modified to accommodate this.
* Running GRIS. Need to have a way to publish VO-services. (This may already exist but needs to be determined.)

General Issues:

* FTS security model must be brought in line with the standard one. That is, FTS should use standard MyProxy renewal of certificates and not use a password to retrieve a new proxy.