Direction des Relations Internationales (DRI)
Programme INRIA "Equipes Associées"
I. DEFINITION
Associated team: SER-OS: Scalable, Efficient, and Resilient Operating Systems
Selection: 2009
INRIA project-team: PARIS
Foreign partner organization: Oak Ridge National Laboratory
INRIA research centre: Rennes Bretagne - Atlantique
Country: United States

| | French coordinator | Foreign coordinator |
| Last name, first name | Morin, Christine | Scott, Stephen L. |
| Rank/status | INRIA DR2 | Senior Research Scientist |
| Home organization | INRIA Rennes Bretagne Atlantique | Computer Science and Mathematics, Oak Ridge National Laboratory |
| Postal address | IRISA, Campus universitaire de Beaulieu, 35042 Rennes cedex, FRANCE | Bethel Valley Road, Building 5100, Oak Ridge, TN 37831-6173 |
| URL | http://www.irisa.fr/paris/web/Member-Home-Pages/view-5.html | http://www.csm.ornl.gov/srt/ |
| Telephone | +33 2 99 84 72 90 | +1 865-574-3144 |
| Fax | +33 2 99 84 71 71 | +1 865-576-5846 |
| Email | | scottsl@ornl.gov |
NOTE: If the Associated Team proposal involves several partners, French and/or foreign, you may:
- either add a column,
- or duplicate the table above as many times as necessary, replacing "French or foreign coordinator" with "Other French or foreign participant".
The proposal in brief
Title of the collaboration theme (in French and in English): Scalable, Efficient, and Resilient Operating Systems
Description (about 10 lines): Nowadays, three main types of platforms are used for high-performance computing: clusters, large-scale parallel systems such as the Cray XT series or IBM BlueGene systems, and grids. Traditionally, applications must be “ported” to those platforms before they can be executed. As a result, scientists working on applications spend a lot of resources on this porting. We typically say that the effort to execute an application on top of a given platform is “bottom-to-top”: the platform specification is first fixed (in terms of hardware and software) and then the applications have to be adapted to this configuration. Another solution is a “top-to-bottom” approach: users give a very high-level description of their applications' needs in terms of hardware and software, and then the system is adapted to those needs. Of course, this does not mean that applications will never be optimized or tuned for a given platform; rather, the idea is to focus on scientist productivity, i.e., scientists have the flexibility to focus on their scientific roadmap instead of writing that roadmap around the technical details of the target. For that, the way systems are designed, implemented and used will differ deeply: we have to provide new tools for users to describe their applications' needs, and new tools to create the appropriate execution environment based on this description and to deploy and manage that environment. Another aspect of such an approach is the impact on system policies: even if the three selected HPC platforms differ in terms of architecture and characteristics, are their system policies different in nature, or do they differ only in their parameters? Based on this study, we could also study adaptable policies, which would allow us to provide generic tools, backed by a large community, that could be used on almost all HPC platforms.
Detailed presentation of the Associated Team
1. Scientific objectives of the proposal (1 to 2 pages)
Describe the objectives of the proposal, positioning them briefly with respect to the state of the art;
Also give a short description of the scientific tasks planned over three years.
The main objectives of the collaboration in the area of operating systems and system tools for HPC are:
- operating systems for HPC (focusing on system-level virtualization),
- system management tools for HPC platforms,
- resilience for HPC systems.
1.1. System Management Tools for HPC Platforms
Nowadays, many system management software packages are available, each of them following the latest trends in high performance computing. The latest trends are system-level virtualization (deployment of virtual machines) and system partitioning (deployment of specialized nodes, e.g., I/O nodes).
To address this moving target, most of the major solutions for system management are today based on a modular and extensible architecture. However, each solution also has its own advantages. For instance, Aladdin/Grid5000 is managed via the KaTools, a suite of tools for the deployment and the management of Grids (inter-administrative domain deployment). On the other hand, OSCAR already supports virtualization, focusing on the specialization and customization of the virtual machines, and OSCAR is not tied to a specific deployment tool (OSCAR could use KaDeploy for the deployment of images).
Therefore, it is clear that the keywords for such tools are customization and adaptation: how can we create, deploy and manage an execution environment that best fits the applications' needs?
Furthermore, even if system management tools are today fairly similar, they most often target slightly different execution platforms: while OSCAR targets typical mid-size clusters, the KaTools target Grids. In other words, the problem can be described as a lack of standardized methods for the deployment of execution environments. We think this is the next effort for system management tools, since all the modern tools are today based on the same set of low-level mechanisms for the creation of execution environments, but not for their deployment. This effort should result in the implementation of low-level deployment mechanisms, adaptable to the target platform, which could be used by all the different system management tools such as OSCAR or the KaTools. Put differently, the deployment tool has to be decoupled from the rest of the management tool, leading to the creation of a specialized tool that must be adaptable and configurable, for direct integration into higher-level tools.
The ultimate goal of this effort is to support all available HPC platforms: large-scale clusters, large-scale parallel systems such as Cray platforms, Grids, and clouds. As a proof of concept, we will work on the deployment of the XtreemOS prototype on various HPC systems, using such system management tools.
Expected results: this effort will lead to the release of a set of new system tools for the management of the three target platforms. Those tools will allow users to easily specify their needs and then deploy the appropriate execution environments. For that, a combination of virtual machines and software components will be deployed to implement appropriate system policies. We also expect to publish scientific papers and promote our solutions to the international community as a standard for the management of HPC systems.
1.2. Operating Systems for HPC
Because we want to analyze system issues from a user point of view and develop tools and methods to make scientists more efficient, current systems will have to be extended, especially when targeting different platforms such as clusters, large-scale parallel systems and grids. For instance, the usage of grids is now very often based on the concept of virtual organization (VO). On the other hand, virtual machines (VMs) are used to implement a given execution context and bypass heterogeneity issues. One of the current challenges in bridging the gap between those two technologies is the different nature of the two concepts: VOs are global and multi-user, whereas VMs are local and very often single-user. Based on discussions between the two teams, the concept of virtual platform has been introduced to bridge this gap. A virtual platform (VP) is a view exposed to the users, based on virtual machines, that is global and single-user. It typically exposes the virtual hardware that fits the needs of a specific application. Based on VPs, it is possible to provide VOs that implement system policies, typically for resource management (where should the VP be created?) and user access (who is allowed to access a given VP?). Currently, ORNL only has a mechanism for describing an application's needs in terms of software, which can be used for deployment within virtual machines via the concept of Virtual System Environment [14-15].
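The relationship between the three concepts above (VMs are local, a VP is a global single-user view built from VMs, a VO is global, multi-user and carries the policies) can be sketched as a small data model. This is only an illustrative sketch: the class names, fields, and the `may_access` rule are hypothetical and do not correspond to any existing ORNL or INRIA software.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class VirtualMachine:
    """Local execution context running on a single physical node."""
    name: str
    host: str  # physical node running the VM
    cpus: int


@dataclass
class VirtualPlatform:
    """Global, single-user view built from VMs that may span several nodes."""
    owner: str
    machines: list[VirtualMachine] = field(default_factory=list)

    def total_cpus(self) -> int:
        return sum(vm.cpus for vm in self.machines)

    def hosts(self) -> set[str]:
        return {vm.host for vm in self.machines}


@dataclass
class VirtualOrganization:
    """Global, multi-user entity that carries the policies (here: user access)."""
    name: str
    members: set[str] = field(default_factory=set)
    platforms: list[VirtualPlatform] = field(default_factory=list)

    def may_access(self, user: str, vp: VirtualPlatform) -> bool:
        # Hypothetical policy: a VO member may only access a VP they own.
        return user in self.members and vp in self.platforms and vp.owner == user


# Example: one user's platform made of two VMs on two different nodes.
vp = VirtualPlatform("alice", [VirtualMachine("vm0", "node1", 4),
                               VirtualMachine("vm1", "node2", 8)])
vo = VirtualOrganization("demo-vo", {"alice", "bob"}, [vp])
```

In this sketch the VP holds no access or accounting rules at all; those live only in the VO, which mirrors the separation described in the text.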
The ORNL team has strong expertise in developing virtualization solutions for HPC [1, 6, 10, 12, 13]. For instance, ORNL is the designer and developer of an extension of the Xen virtualization solution that implements VMM-level system modules (similar to kernel modules from the Linux kernel) which can be used for system instrumentation, system debugging or system customization and adaptation.
The INRIA PARIS team has strong expertise in the management of execution environments for scientific applications in the context of grid computing, and in the extension of operating systems for distributed platforms via the XtreemOS project [28].
Furthermore, the ORNL team has already participated in INRIA efforts in the domain of system-level virtualization. This effort led to joint publications about the classification of virtualization research efforts [27] and the analysis of the combination of virtualization and SSI technologies [9].
Those preliminary studies open many other research topics, from the analysis of the usage of virtualization in cloud computing to the implementation of new services at the VMM level for distributed computing. A specific point of interest is the implementation of system policies at the VMM level via the concept of modules: it should enable a high degree of adaptation and system resiliency, and also enable the implementation of services specific to Grid and cloud computing.
Expected results: this effort will lead to the release of a set of system-level tools for the deployment and management of VOs and VPs. A white paper presenting the state-of-the-art solutions on those issues and the highlights of the proposed solution to address those issues will be submitted to an international conference.
1.3. Resilience for HPC
Both the SRT and the PARIS team have a strong background in resiliency [3, 4, 5, 7, 11, 17, 21, 23, 24, 25, 26, 29, 31, 32, 33, 34, 36]. Because of the scale of HPC platforms, failures (both software and hardware) are common and directly impact application execution if the system does not provide resiliency capabilities. The ultimate goal is to guarantee non-stop application execution regardless of faults.
In the context of this collaboration, the two teams will evaluate the different techniques for system resiliency they have been working on during the past few years, and identify similarities and differences. This should lead to the definition of an adaptable system which could provide an efficient and scalable solution for large-scale systems such as the Cray XT series as well as for Grids and Clouds. This is motivated by the fact that, based on previous discussions between the two teams and previous studies from the international community, the low-level mechanisms for system resiliency seem to be the same across the different HPC platforms; only the policies driving those mechanisms differ, in order to match the platform characteristics and system management constraints.
Typically, this includes the study of different reactive and proactive fault tolerance policies, their combination, and the impact of such policies on the availability of services used by applications.
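The hypothesis that platforms share the same low-level mechanisms and differ only in policy parameters can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `ResiliencePolicy` fields, the `decide` function, the association of checkpointing with reactive and migration with proactive fault tolerance, and the parameter values are made up and are not measurements or existing tools from either team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResiliencePolicy:
    """Platform-specific parameters driving platform-independent mechanisms."""
    checkpoint_interval_s: float  # reactive: period between checkpoints
    migrate_threshold: float      # proactive: predicted failure probability that triggers migration


def decide(policy: ResiliencePolicy, since_checkpoint_s: float,
           failure_probability: float) -> str:
    """Select which low-level mechanism to trigger for one node."""
    if failure_probability >= policy.migrate_threshold:
        return "migrate"      # proactive fault avoidance
    if since_checkpoint_s >= policy.checkpoint_interval_s:
        return "checkpoint"   # reactive fault tolerance
    return "none"


# Made-up parameter values, for illustration only.
POLICIES = {
    "cluster": ResiliencePolicy(checkpoint_interval_s=3600.0, migrate_threshold=0.8),
    "grid": ResiliencePolicy(checkpoint_interval_s=600.0, migrate_threshold=0.5),
}
```

The same `decide` logic runs on both platforms; only the entries in `POLICIES` change, which is the "difference of parameters, not of nature" case the study would try to confirm or refute.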
Expected results: this effort will lead to the writing of a white paper that will be submitted to an international event.
2. Presentation of the partners (about 1 page per partner)
Present the participating teams;
The System Research Team (SRT), led by Dr. Stephen L. Scott, consists of ORNL computer scientists from the Computer Science Research Group. The SRT was formed specifically to focus on research and development of operating systems, system libraries, and tools for addressing new issues and challenges in large-scale High Performance Computing (HPC) environments. SRT has worked on many successful research projects such as MOLAR, OSCAR, OSCAR-V, V2M, and C3. SRT is also breaking new ground in research on system-level virtualization for HPC. Topics currently under investigation include virtual machine scheduling, performance isolation, flexible configuration, virtual machine management, and VMM bypass.
The PARIS project-team from the INRIA Rennes - Bretagne Atlantique research centre aims at contributing to the programming of large-scale parallel and distributed systems. It investigates new approaches to building software mechanisms that hide the complexity of programming computing infrastructures that are both parallel and distributed. Our contribution to the field can thus be summarized as follows: combining parallel and distributed processing whilst preserving performance and transparency. Two research topics of the PARIS project-team are directly related to the SER-OS associated team: one on the design and implementation of cluster and Grid operating systems (led by Christine Morin) and one on the design of experimental infrastructures for large-scale distributed systems (led by Yvon Jégou). The PARIS project-team has designed and implemented the Kerrighed cluster operating system based on Linux [2]. Kerrighed is a Single System Image (SSI) operating system for high performance computing on clusters. It provides the user with the illusion that a cluster is a virtual SMP machine. Kerrighed now evolves as part of an open source community (http://www.kerrighed.org) and is industrialized by KERLABS, a spin-off from the PARIS project-team created in October 2006. Since 2006, the PARIS project-team has continued to contribute to the design and implementation of Kerrighed in the framework of the XtreemOS European IP project (http://www.xtreemos.eu). In particular, we are working on the design and implementation of kDFS (kernel/Kerrighed Distributed File System) [30], a distributed file system exploiting the disks attached to the computing nodes of a cluster, and on checkpointing mechanisms for parallel applications. The PARIS project-team has also carried out research activities on the design and implementation of Grid-aware operating systems. It has designed and implemented Vigne [34,36,37], a system for large-scale dynamic Grids.
Since June 2006, Christine Morin has been the scientific coordinator of the XtreemOS European Integrated Project. The objective of the XtreemOS project [20] is to design, implement and promote a Linux-based Grid operating system providing native virtual organization support. The research activities of the PARIS project-team in XtreemOS focus on the design and implementation of a fault-tolerance service offering transparent checkpointing to Grid applications [31,32], on the design of virtual organization and security services [19,22,35], and on the design and implementation of LinuxSSI, which leverages the Kerrighed SSI operating system for the cluster flavour of the XtreemOS system. The PARIS project-team has been involved in the Grid'5000 project since its beginning in 2003 (https://www.grid5000.fr). Grid'5000 is an infrastructure distributed over 9 sites around France, for research in large-scale parallel and distributed systems. As of the end of 2007, 267 nodes corresponding to 534 processors and 732 cores are active on the Rennes platform managed by the PARIS project-team. As of the same date, the production network interconnects all nodes at 1 Gb/s using Ethernet technology, and provides connectivity to the other Grid'5000 sites through a 10 Gb/s optical link. A private Ethernet network, the management network interconnecting all nodes, is used for node management: monitoring, reboot, etc. It is exploited by the management software of the platform (OAR, kadeploy). Two local high-performance networks are available: an Infiniband network interconnecting 66 nodes at 10 Gb/s and a Myrinet 10G network interconnecting 97 nodes at 10 Gb/s. We are now deeply involved in ALADDIN, the INRIA action supporting Grid'5000 for the next 4 years, which started in July 2008. Thierry Priol, head of the PARIS project-team, is also the head of the ALADDIN action.
Give, for each partner, the list of researchers involved in the proposal and a brief CV of the coordinator;
SRT participants and introduction of the coordinator:
The participating researchers from ORNL are: Stephen L. Scott (senior research scientist), Christian Engelmann (research scientist), Thomas Naughton (research associate), and Geoffroy Vallée (research scientist).
Dr. Stephen L. Scott is a Senior Research Scientist in the Computer Science Group of the Computer Science and Mathematics Division at the Oak Ridge National Laboratory (ORNL), Oak Ridge, USA. Dr. Scott is the head of the System Research Team (SRT) at ORNL.
Dr. Scott's research interest is in experimental systems with a focus on high performance distributed, heterogeneous, and parallel computing. He is a founding member of the Open Cluster Group (OCG) and Open Source Cluster Application Resources (OSCAR). Within this organization, he is presently the OCG steering committee chair and in the past has served as the OSCAR release manager and working group chair.
Dr. Scott is the project lead principal investigator for the "Reliability, Availability, and Serviceability (RAS) for Petascale High-End Computing and Beyond" project. This multi-institution research effort, funded by the Department of Energy - Office of Science, concentrates on scalable technologies for providing high-level RAS for next-generation peta-scale scientific high-end computing (HEC) resources and beyond.
Previously, Dr. Scott was the project lead principal investigator for the Modular Linux and Adaptive Runtime support for HEC OS/R research (MOLAR) team. This multi-institution research effort, also funded by the Department of Energy - Office of Science, concentrated on adaptive, reliable, and efficient operating and runtime system solutions for ultra-scale scientific high-end computing (HEC) as part of the Forum to Address Scalable Technology for Runtime and Operating Systems (FAST-OS).
Dr. Scott was also principal investigator of a project investigating techniques in virtualized system environments for peta-scale computing and is involved with a related storage effort that is investigating the advantages of storage virtualization in peta-scale computing environments.
Finally, Dr. Scott is the chair of the international Scientific Advisory Committee for the European Commission's XtreemOS project.
PARIS project-team participants and introduction of the coordinator:
The participating researchers from INRIA are Christine Morin (senior researcher), Yvon Jégou (researcher) and Thierry Priol (senior researcher).
Christine Morin is a senior researcher at INRIA in the PARIS project-team. She has led research activities on single system image operating systems for high performance computing on clusters, resulting in the Kerrighed cluster OS, now developed as open source. She is the scientific coordinator of the XtreemOS project, a 4-year European integrated project started in June 2006. She is a co-founder of the Kerlabs start-up, created in 2006 to exploit the Kerrighed technology. Her research interests are in operating systems, distributed systems, fault tolerance, and cluster and grid computing.
Indicate, for each partner, the students involved in the proposal. Give an estimate of their number and state whether jointly supervised theses are planned;
Thomas Naughton, a Ph.D. student at the University of Reading in the UK and a full-time research staff member at ORNL, is working on system-level virtualization and the potential usage of such technologies for fault tolerance and system resiliency [16].
Present the history of the collaboration between the teams;
The PARIS project-team has been in contact with Dr. Stephen Scott's team at ORNL since 2003, initially working on system tools for the integration of INRIA research prototypes into open source tools developed mainly at ORNL. More precisely, via the PARIS team activities, INRIA has been involved in the OSCAR initiative, a software suite for the deployment and management of distributed platforms for HPC, such as clusters. This collaboration was implemented through a joint 18-month postdoctoral position between INRIA and EDF in France, and ORNL in the USA. This project aimed to integrate the Kerrighed Single System Image (SSI) solution [2] into the OSCAR suite, leading to the creation of the SSI-OSCAR effort [18], which released several versions of the prototype and produced a joint study on the usage of SSI technologies for fault tolerance [17]. Since then, INRIA has been a core developer of the OSCAR project, collaborating with ORNL on the design and implementation of the OSCAR prototype.
The PARIS research team also participated in the annual OSCAR Birds-of-a-Feather (BoF) session at the international SuperComputing conference. We also organized together a tutorial on Scalable SSI Clustering with the OpenSSI and Kerrighed systems at the international SuperComputing 2005 conference. Moreover, the SRT team participated in the GridOS BoF during the SuperComputing 2007 international conference.
Following this collaboration between ORNL and the PARIS team, the research teams identified a common research topic in operating systems for HPC: system-level virtualization. In this context, several publications have been submitted or accepted, redefining system-level virtualization in the current context, and studying the usage of SSI technologies in conjunction with virtualization [9]. This effort also focuses on the usage of virtualization in the XtreemOS European project, led by Christine Morin.
The SRT and PARIS teams are also both members of the “Resilience Consortium” (http://resilience.latech.edu/), which focuses on system resiliency for all typical HPC platforms (typically the three types of platform we selected for this collaboration).
Insert links to the pages of the people, laboratories, organizations...
ORNL's webpage: http://www.ornl.gov
SRT's webpage: http://www.csm.ornl.gov/srt/
Christian Engelmann: http://www.csm.ornl.gov/~engelman/
Thomas Naughton: http://www.csm.ornl.gov/~naughton/
Stephen Scott: http://www.csm.ornl.gov/srt/
Geoffroy Vallée: http://www.csm.ornl.gov/srt/people/gvallee.html
INRIA's webpage: http://www.inria.fr
PARIS project-team website: http://www.irisa.fr/paris/
Christine Morin: http://www.irisa.fr/paris/web/Member-Home-Pages/view-5.html
Thierry Priol: http://www.irisa.fr/paris/web/Member-Home-Pages/view-17.html
Thomas Ropars: http://www.irisa.fr/paris/web/Member-Home-Pages/view-82.html
Jérôme Gallard: http://www.irisa.fr/paris/web/Member-Home-Pages/view-94.html
3. Impact (1 page maximum)
One of the primary goals of the collaboration is to generate scientific results in a larger context: instead of focusing on a specific platform, we try to analyze challenges related to HPC in a broader context, targeting all major HPC platforms. We also plan to conduct those studies with real applications (including applications from ORNL) instead of micro-benchmarks. That will allow us to focus on the user point of view and to make sure we address user issues.
- Relations between the partners and between the institutes (for example, discuss complementarity, similarity for a critical-mass effect, the division of tasks for a large development, etc.)
The
SRT team focuses on clusters and large-scale parallel systems such as Cray systems, while the PARIS project-team is expert in grid computing. The two teams are therefore complementary, which will allow the “associated team” to impact all HPC platforms and federate the community efforts. It also means that ORNL will focus on studies targeting clusters and large-scale parallel systems while the PARIS project-team will focus on clusters and grids. The different efforts will be organized accordingly.
3.1. On the existing collaboration with your partner
The
creation of an "associated team" between the SRT team at ORNL and the PARIS team at INRIA will foster the study and implementation of system-level virtualization solutions for HPC. The goal of such a collaboration is to study the usage and implementation of operating systems and their associated tools on distributed or parallel platforms in the context of HPC. In doing so, the two teams will propose new standards to the international community, both in the domain of grid and cloud computing and in that of large-scale parallel systems (such as IBM and Cray machines).
This effort will initially be based on the ongoing studies on system-level virtualization for HPC. Those studies focused on the formal definition of virtualization and on the usage of virtualization in conjunction with SSI techniques. They can therefore be followed by deeper studies in different domains.
For instance:
- How can we use system-level virtualization in the context of cloud computing?
- What are the needed virtualization services?
- How can we implement those services?
- What are the common points between the different HPC platforms in terms of virtualization capabilities?
We ultimately target the following capabilities:
- Transparency: users should be able to use distributed systems the same way they use a standard system. For that, new tools must be implemented and current tools extended, building on the capabilities of normal Unix-/POSIX-like tools. This is mandatory for hiding the increasing complexity of HPC systems.
- Adaptability: modern HPC platforms face two critical challenges: (i) how can we adapt a given system to a specific platform (platforms differ deeply in their hardware characteristics), and (ii) on a given platform, how can we adapt the system to configuration changes at runtime (failures, for instance)?
- Resiliency: because of their scale, modern platforms have to deal with many failures. To guarantee non-stop computing at the application level, the system must include fault tolerance or fault avoidance mechanisms and policies. This includes adaptability capabilities but also fault tolerance or avoidance mechanisms (which can be based on virtualization, e.g., migration of virtual machines).
The
collaboration will result in common open source software (available to the international community) and in scientific publications submitted to the major international conferences (such as IEEE/ACM SuperComputing, ISC, ACM EuroSys, IEEE/ACM EuroPar, IEEE/ACM IPDPS, ACM VEE, IEEE CCGRID and IEEE Cluster).
3.2 On collaboration with other INRIA projects
Browsing
the INRIA website, we did not find any previous collaboration between ORNL and INRIA, except for collaborations between various INRIA project-teams (e.g., GRAAL, RESO, GRAND-LARGE) and Jack Dongarra, who is co-affiliated with ORNL.
Several INRIA project-teams are working on system-level virtualization problems and may be interested in collaborating with the SRT and PARIS teams: the ASCOLA project-team (with the Entropy prototype, a consolidation manager for clusters), the RESO project-team (network virtualization), and the MESCAL project-team (with the SAMORY prototype, an architecture that provides resiliency to parallel applications running on top of virtual clusters).
3.3 On collaboration with other teams of the foreign organization
It
is also expected that this collaboration will foster new collaborations with other ORNL groups in the area of operating systems for HPC and system tools.
Specifically, the System Research Team works closely with the Tool group at ORNL, led by Richard L. Graham. This group aims to define and develop the next generation of tools, targeting peta-scale platforms and beyond. The two major constraints for such tools are scalability and resiliency.
The SRT team also collaborates with several universities in the USA, creating collaboration opportunities for the PARIS project-team. For instance, the SRT team currently has active collaborations with the University of New Mexico (on the topic of operating systems for HPC, virtualization), Northwestern University (on the topic of operating systems for HPC, virtualization), LATech (on the topic of resiliency), and North Carolina University (on the topic of operating systems for HPC and resiliency).
4. Miscellaneous: any other information you consider useful to add.
In 2009, we plan to focus on three different topics
and the organization of an event specific to the collaboration. We plan to work on system tools, operating systems, and resiliency challenges. Those challenges are not disconnected; they should converge into a single software solution for HPC in the future.
* System Tools: Identification of common capabilities between system tools developed at INRIA (the Aladdin software stack and some research prototypes based on this software stack for the management of virtual machines) and those developed at ORNL (OSCAR and more precisely the OSCAR-V package [8]). The goal of this effort is to identify common mechanisms and to try to define "standards" for such tools, in discussion with the international community. This will focus on the "user point of view": how can we simplify the life of application users on distributed or parallel platforms, i.e., the description of an application's needs in terms of software stack and execution environment, and the description of the needed resources? Underneath, the system tools will find and allocate the requested resources, prepare the execution environment dynamically, deploy it and execute the application. The tools will have to be functional on the three target platforms: clusters, Cray-like systems and Grids.
This effort will lead to the release of a version of OSCAR/OSCAR-V that can generate Aladdin images for deployment on top of Aladdin. This release will allow users to describe the execution environment they need and automatically deploy it on the target platform. In other words, the goal is to implement the notion of virtual platform and evaluate its usage on the three target platforms.
* Operating Systems: The execution of applications on Grids is based on
the concept of Virtual Organization (VO), which is the implementation of user access policies, allocation of shared resources, and deployment and management of applications [19,22,35]. At the other end, physical resources must be assigned to users, via VOs. This could be done using virtual machines that isolate the user from the bare hardware (security) and provide an execution environment that fits the application's needs. In between, in order to extend the local view of virtual machines (virtual machines are about local resources; they have no notion of distributed platforms), we propose the definition and implementation of "virtual platforms" (VP). A virtual platform is a subset (in other words, a partition) of the distributed physical resources; for instance, it can be a set of virtual machines. A VP is therefore only the user view of the distributed platform, i.e., compared to VOs, it does not include policies about user access authorization or about resource usage (accounting and so on), which are specific to the VO. Our first task for this effort is therefore to specify the VO/VP/VM software stack based on the three target platforms, with the goal of writing and submitting a white paper describing the results of the study.
* Resiliency: The three target platforms, because of their scale,
have in common failures that can impact application execution. To
address this issue, resiliency policies have been implemented for all
of these platforms. We propose to identify typical system policies
for those platforms and analyze common points and differences. This
effort aims to answer the following question: do the differences between
the three target platforms affect the nature of resiliency policies,
or is it only a difference of parameters (the policy semantics remain
the same; policies do not differ in nature)? We plan to write and submit a white paper on the
subject to an international scientific event.
* Miscellaneous: The two teams already have several open
source software prototypes that could be used as a basis for the
developments performed by the “équipe associée”:
  - XtreemOS: http://www.xtreemos.eu/
  - Kerrighed: http://www.kerrighed.org/
  - Aladdin: https://www.grid5000.fr/mediawiki/index.php/Grid5000:Home
  - OSCAR-V: http://www.csm.ornl.gov/srt/oscarv/
The collaboration will lead to the
implementation of new open source software or the extension of
existing software.

1. Exchanges
The collaboration will be implemented via the exchange of junior and
staff researchers. A Ph.D. student, Jérôme Gallard, will visit ORNL
for three months in spring 2009 for the design and implementation
of the concept of virtual platforms. Christine Morin, the student's Ph.D.
advisor, will also visit ORNL for one week during this
internship.

We plan to organize a workshop open to INRIA and ORNL partners, focusing
on some of the challenges addressed by the associated team. We
plan to organize such a workshop every year, alternating
between France and the USA. The first workshop will be
organized in France.

An application for a one-month sabbatical has been submitted to
the University of Rennes 1 by Stephen L. Scott, with the goal of
spending one month during summer 2009 (June-August period). If
the application is accepted, our workshop will be organized during
this period, and the funding allocated within the SER-OS associated team
to Stephen L. Scott's visit will be reassigned to visits by other ORNL
team members or to internships at ORNL for other junior researchers
from the PARIS project-team.

Geoffroy Vallée, research scientist at ORNL, and Thomas Naughton, who is
both a Ph.D. student at the University of Reading in the UK and a research
staff member at ORNL, will visit the PARIS project-team for one week
(most likely in December).

We also plan to meet during the SuperComputing conference, held each year
in November in the United States, in order to
manage the project and synchronize the different efforts.

1. ESTIMATED EXPENSES FOR INRIA MISSIONS TO THE PARTNER

| Category | Number of people | Estimated cost | Mission description |
|---|---|---|---|
| Senior researchers | 1 | 2.000 euros | Discussions and project management. |
| Post-docs | | | |
| Ph.D. students | 1 | 3.000 euros | 3-month period at ORNL for the design and implementation of the concept of “virtual platforms”, which will be used to implement the concept of “virtual organizations”. Context: the Operating System effort. |
| Interns | | | |
| Other (specify) | | | |
| Total | 2 | 5.000 euros | |

2. ESTIMATED EXPENSES FOR INVITATIONS OF PARTNERS

| Category | Number of people | Estimated cost | Mission description |
|---|---|---|---|
| Senior researchers | 2 | 2.000 euros (Dr. Vallée) + 3.000 euros (Dr. Scott) | Geoffroy Vallée's visit for the coordination and release of the extension of the OSCAR/OSCAR-V prototype that will implement the concept of virtual platform. Context: the system tools effort. Stephen L. Scott's visit during summer 2009 (period May-June). Note: a sabbatical application has been submitted to the University of Rennes 1; this budget may be reassigned. Context: the resiliency effort. |
| Post-docs | | | |
| Ph.D. students | | | |
| Interns | | | |
| Other (specify) | 4 | 2.500 euros | Organization of the annual workshop for the collaboration. The four non-local members of the collaboration will participate in the workshop. |
| Total | 6 | 15.000 euros | |

2. Co-funding
Does this cooperation already benefit from financial support from
INRIA, the foreign partner organisation, or a third party
(European project, NSF, ...)?

The XtreemOS project will provide 5.000 euros to cover travel expenses for
researchers involved in XtreemOS. The SRT team
can cover a few trips to various meetings.

If your proposal is accepted, do you think it likely that the foreign
partner organisation will provide symmetric financial support? For what amount?

No, but ORNL, and more specifically the Department of
Energy, has programs for summer student internships. The ORNL team has
a three-year project, started in 2008, titled “Reliability,
Availability, and Serviceability (RAS) for Petascale High-End
Computing and Beyond”, funded by the Office of Advanced Scientific
Computing Research, Office of Science, U.S. Department of Energy.
Program: Operating/Runtime Systems for Extreme Scale Scientific
Computation (LAB 07-23).

3. Budget request
Indicate in the table below the estimated overall cost of the proposal
and the budget requested from the DRI for this
Associated Team (maximum 20 K€).

| Comments | Amount |
|---|---|
| A. Overall cost of the proposal (total of tables 1 and 2: invitations, missions, ...) | 20.000 euros |
| B. Co-funding used (funding other than the Associated Team) | |
| "Équipe Associée" funding requested (A.-B.) | 20.000 euros |

References
[1] Stephen L. Scott, Geoffroy Vallée,
Thomas Naughton, Anand Tikotekar, Christian Engelmann, and Hong Ong. Research on system-level virtualization at the Oak Ridge National Laboratory. Future Generation Computer Systems, 2008. To appear.
[2] Christine Morin, Pascal Gallard, Renaud Lottiaux, and Geoffroy Vallée. Towards an efficient single system image cluster operating system. Future Generation Computer Systems, 20(2), January 2004.
[3] Christian Engelmann, Geoffroy Vallée, Thomas Naughton, and Stephen L. Scott. Proactive Fault Tolerance Using Preemptive Migration: Model and Classification. In Proceedings of the 17th Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP 2009). To appear.
[4] Kulathep Charoenpornwattana, Chokchai Leangsuksun, Geoffroy Vallée, Anand Tikotekar, and Stephen Scott. A scalable unified fault tolerance for HPC environments. In Proceedings of the 9th LCI International Conference on High-Performance Clustered Computing, April 2008.
[5] Geoffroy Vallée, Kulathep Charoenpornwattana, Christian Engelmann, Anand Tikotekar, Chokchai Leangsuksun, Thomas Naughton, and Stephen L. Scott. A framework for proactive fault tolerance. In Proceedings of the Third International Conference on Availability, Reliability and Security (ARES 2008 - The International Dependability Conference), pages 659–664, Barcelona, Spain, March 4-7, 2008. IEEE Computer Society.
[6] Geoffroy Vallée, Stephen L. Scott, et al. System-level virtualization for high performance computing. In Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and network-based Processing (PDP 2008), pages 636–643, Toulouse, France, February 13-15, 2008. IEEE Computer Society.
[7] Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Stephen L. Scott, and Chokchai Leangsuksun. Evaluation of fault-tolerant policies using simulation. In Proceedings of the 9th IEEE International Conference on Cluster Computing (Cluster), Austin, Texas, USA, September 17-20, 2007.
[8] Geoffroy Vallée, Thomas Naughton, and Stephen L. Scott. System management software for virtual environments. In CF ’07: Proceedings of the 4th international conference on Computing frontiers, pages 153–160, New York, NY, USA, May 7-9, 2007. ACM.
[9] Jérôme Gallard, Geoffroy Vallée, Adrien Lèbre, Christine Morin, Pascal Gallard, and Stephen L. Scott. Complementarity between virtualization and single system image technologies. In 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC ’08), Las Palmas de Gran Canaria, Canary Islands, Spain, August 2008.
[10] Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong Ong, Christian Engelmann, and Stephen L. Scott. An analysis of HPC benchmark applications in virtual machine environments. In 3rd Workshop on Virtualization in High-Performance Cluster and Grid Computing (VHPC ’08), Las Palmas de Gran Canaria, Canary Islands, Spain, August 2008.
[11] Geoffroy Vallée, Anand Tikotekar, Chokchai Leangsuksun, and Stephen L. Scott. Impact of fault-tolerance policies: Feasibility study. In HAPCW’08: High Availability and Performance Computing Workshop, Denver, Colorado, USA, April 3–4, 2008. Held in conjunction with High-Performance Computer Science Week (HPCSW) 2008.
[12] Anand Tikotekar, Geoffroy Vallée, Thomas Naughton, Hong Ong, Christian Engelmann, and Stephen L. Scott. Effects of virtualization on a scientific application – running a hyperspectral radiative transfer code on virtual machines. In Proceedings of the 2nd Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2008, in conjunction with the 3rd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2008, Glasgow, UK, March 31, 2008.
[13] T. Naughton, G. Vallée, and S. L. Scott. Dynamic adaptation using Xen. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, March 20, 2007.
[14] C. Engelmann, S. L. Scott, H. Ong, G. Vallée, and T. Naughton. Configurable virtualized system environments for high performance computing. In Proceedings of the 1st Workshop on System-level Virtualization for High Performance Computing (HPCVirt) 2007, in conjunction with the 2nd ACM SIGOPS European Conference on Computer Systems (EuroSys) 2007, Lisbon, Portugal, March 20, 2007.
[15] Geoffroy Vallée and Stephen L. Scott. Xen-OSCAR for cluster virtualization. In ISPA Workshop on XEN in HPC Cluster and Grid Computing Environments (XHPC’06), pages 487–498, December 2006.
[16] Geoffroy Vallée, Thomas Naughton, Hong Ong, and Stephen L. Scott. Checkpoint/restart of virtual machines based on Xen. In HAPCW’06: High Availability and Performance Computing Workshop, Santa Fe, New Mexico, USA, October 2006. Held in conjunction with LACSI 2006.
[17] Geoffroy Vallée, Christine Morin, and Stephen L. Scott. A framework for high availability based on a single system image. In HAPCW’05: High Availability and Performance Computing Workshop, Santa Fe, New Mexico, USA, October 2005. Held in conjunction with LACSI 2005.
[18] Geoffroy Vallée, Stephen L. Scott, Christine Morin, Jean-Yves Berthou, and Hugues Prisker. SSI-OSCAR: a cluster distribution for high performance computing using a single system image. In The 3rd Annual OSCAR Symposium, University of Guelph, Guelph, Ontario, Canada, May 2005. Held in conjunction with the 19th International Symposium on High Performance Computing Systems and Applications (HPCS 2005).
[19] Massimo Coppola, Yvon Jégou, Brian Matthews, Christine Morin, Luis Pablo Prieto, Óscar David Sánchez, Erica Y Yang, and Haiyan Yu. Virtual Organization Support within a Grid-Wide Operating System. IEEE Internet Computing, 12(2), 2008.
[20] C. Morin. XtreemOS: a Grid Operating System Making your Computer Ready for Participating in Virtual Organizations. In IEEE International Symposium on Object/component/service-oriented Real-time distributed Computing (ISORC), Santorini Island, Greece, May 2007.
[21] Thomas Ropars. Combining Optimism and Pessimism in a Grid Message Logging Protocol. In Student Forum of International Conference on Dependable Systems and Networks (DSN 2007) (Supplemental Volume), Edinburgh, UK, June 2007.
[22] E. Yang, B. Matthews, A. Lakhani, Y. Jégou, C. Morin, O. Sanchez, C. Franke, P. Robinson, A. Hohl, B. Scheuermann, D. Vladusic, H. Yu, A. Qin, R. Lee, E. Focht, and M. Coppola. Virtual Organization Management in XtreemOS: an Overview. In CoreGrid Symposium, Rennes, France, August 2007.
[23] John Mehnert-Spahn, Michael Schöttner, Thomas Ropars, David Margery, Christine Morin, Julita Corbalán, and Toni Cortes. XtreemOS Grid Checkpointing Architecture. In IEEE International Symposium on Cluster Computing and the Grid (poster), Lyon, France, May 19-22, 2008.
[24] Thomas Ropars and Christine Morin. O2P : un protocole à enregistrement de messages extrêmement optimiste. In Actes de RenPar'18, 2008.
[25] Thomas Ropars and Christine Morin. Fault Tolerance in a Cluster Federation with O2P-CF. In Workshop on Resiliency in High-Performance Computing (Resilience 2008). Held in conjunction with CCGrid 2008.
[26] Thomas Ropars, Emmanuel Jeanvoine, and Christine Morin. GAMoSe: An Accurate Monitoring Service for Grid Applications. In 6th International Symposium on Parallel and Distributed Computing (ISPDC 2007), pages 295–302, Hagenberg, Austria, July 2007.
[27] J. Gallard, P. Gallard, A. Lèbre, C. Morin, S. Scott, and G. Vallée. Refinement Proposal of the Goldberg's Theory. INRIA research report RR-6613, 2008.
[28] C. Morin et al. XtreemOS: a Vision for a Grid Operating System. XtreemOS technical report XosTechRep_04 (http://www.xtreemos.eu), May 2008.
[29] Matthieu Fertré and Christine Morin. Extending a cluster SSI OS for transparently checkpointing message-passing parallel applications. In International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN05), Las Vegas, Nevada, USA, December 2005.
[30] Adrien Lèbre, Renaud Lottiaux, Erich Focht, and Christine Morin. Reducing kernel development complexity in distributed environments. In Euro-Par 2008, August 2008.
[31] Sébastien Monnet, Christine Morin, and Ramamurthy Badrinath. Hybrid checkpointing for parallel applications in cluster federations. In 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), Chicago, IL, USA, April 2004. IEEE. Electronic version.
[32] John Mehnert-Spahn, Michael Schöttner, and Christine Morin. Checkpointing process groups in a grid environment. In Proc. of the International Conference on Parallel and Distributed Computing (PDCAT ’08), December 2008.
[33] Matthieu Fertré and Christine Morin. Transparent message-passing parallel applications checkpointing in Kerrighed. In High Availability and Performance Computing Workshop 2005 (HAPCW05), Santa Fe, New Mexico, USA, October 2005.
[34] Louis Rilling and Christine Morin. Partage de données transparent et tolérant aux fautes pour la grille. In Actes de la 4ème Conférence Française sur les Systèmes d’Exploitation (CFSE 4), pages 135–146, Le Croisic, France, April 2005.
[35] Sylvain Jeuland, Yvon Jégou, Oscar David Sanchez, and Christine Morin. Support d’organisations virtuelles au sein d’un système d’exploitation pour la grille. In Actes de RenPar'18, Fribourg, Switzerland, February 2008.
[36] Louis Rilling and Christine Morin. A fault-tolerant transparent data sharing service for the grid. Research report 5427, INRIA, Rennes, France, December 2004.
[37] Emmanuel Jeanvoine, Louis Rilling, Christine Morin, and Daniel Leprince. Using overlay networks to build operating system services for large scale grids. Scalable Computing: Practice and Experience, 8(3):229–239, September 2007.
© INRIA - last updated 11/08/2008