Overview of MOSIX2
Prof. Amnon Barak, Department of Computer Science, The Hebrew University
http://www.MOSIX.org
July 2009
Copyright Amnon Barak 2009
Background
Clusters, multi-clusters (intra-organizational Grids) and Clouds are popular platforms for HPC. Typically, users need to run multiple jobs with minimal concern for how the resources are managed.
They prefer not to:
Modify applications
Copy files or log in to different nodes
Lose jobs when some nodes are disconnected
Users don't know (and don't care) about:
The configuration, status and locations of the nodes
The availability of resources, e.g. CPU speed, load, free memory, etc.
Traditional management packages
Most cluster management packages are batch dispatchers that place the burden of management on the users. For example, these packages:
Use static assignment of jobs to nodes
May lose jobs when nodes are disconnected
Are not transparent to applications
May require linking applications with special libraries
View the cluster as a set of independent nodes
Allow one user per node; the cluster must be partitioned for multiple users
Traditional management packages
[Diagram: a dispatcher performs one-way assignment of jobs (no feedback) from applications to independent Linux workstations and servers - dual 4-core, 2-core and 4-core nodes - one of which has failed]
What is MOSIX (Multi-computer OS)
An operating system-like management system for distributed-memory architectures, such as clusters and multi-clusters, including remote clusters on Clouds
Main feature: Single-System Image (SSI) - users can log in on any node and need not know where their programs run
Automatic resource discovery
Continuous monitoring of the state of the resources
Dynamic workload distribution by process migration
Automatic load-balancing
Automatic migration from slower to faster nodes and from nodes that run out of free memory
MOSIX is a unifying management layer
[Diagram: applications run on top of the transparent MOSIX management layer, which provides a SSI and continuous feedback about the state of the resources; all the active nodes run like one server with many CPUs]
MOSIX Version 1
Can manage a single cluster
Main features:
Provides a SSI by process migration
Supports scalable file systems
9 major releases, developed for Unix, BSD, BSDI, Linux-2.2 and Linux-2.4
Production installations since 1989
Based on Linux since 1998
MOSIX Version 2 (MOSIX2)
Can manage clusters and multi-clusters, with some tools for running applications on Clouds
Developed for Linux-2.6
Geared for High Performance Computing (HPC), especially for applications with moderate amounts of I/O
Main features:
Provides a SSI by process migration
Process migration within a cluster and among different clusters
Secure run-time environment (sandbox) for guest processes
Live queuing - queued jobs preserve their full generic Linux environment
Supports batch jobs, checkpoint and recovery
Running applications in a MOSIX cluster
MOSIX recognizes 2 types of processes:
Linux processes - not affected by MOSIX
Usually administrative tasks that are not suitable for migration, or processes that use features not supported by MOSIX, e.g. threads
MOSIX processes - usually applications that can benefit from migration
All such processes are created by the ``mosrun'' command
They are started from standard Linux executables, but run in an environment that allows each process to migrate from one node to another
Each MOSIX process has a unique home-node, which is usually the node in which the process was created
Linux processes created by the ``mosrun -E'' command can still benefit from MOSIX, e.g. be assigned to the least-loaded nodes
Examples: running interactive jobs
Possible ways to run myprog (a small script sketch follows the list):
> myprog - run as a Linux process on the local node
> mosrun myprog - run as a MOSIX process in the local cluster
> mosrun -b myprog - assign the process to the least-loaded node
> mosrun -b -m700 myprog - assign the process only to nodes with at least 700MB of free memory
> mosrun -E -b -m700 myprog - run as a native Linux job
> mosrun -M -b -m700 myprog - run a MOSIX job whose home-node can be any node in the local cluster
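A minimal shell sketch of launching several copies this way (the number of copies, the program name and the "input.$i" arguments are assumptions for illustration; the mosrun flags are those shown above):

#!/bin/bash
# Launch several copies of myprog as MOSIX processes: -b lets MOSIX pick
# the least-loaded node, -m700 restricts the choice to nodes with at least
# 700MB of free memory.
for i in $(seq 1 8); do
    mosrun -b -m700 ./myprog "input.$i" &   # "input.$i" is a hypothetical argument
done
wait   # wait for all the background copies to finish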
Running batch jobs
To run 2000 instances of myprog on a multi-cluster
> mosrun -G -b -m700 -q -S64 myfile
-G - assign the job to a node in another cluster
-S64 - run up to 64 jobs at a time from the queue
myfile - a file with a list of 2000 jobs (a sketch of preparing and submitting it follows)
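A minimal sketch of preparing and submitting such a batch run, assuming (as the slide implies) that myfile simply lists one command line per job; the program name and its inputs are hypothetical:

#!/bin/bash
# Build a job file with 2000 command lines, one per job instance.
rm -f myfile
for i in $(seq 1 2000); do
    echo "./myprog input.$i" >> myfile
done
# Queue the whole set on the multi-cluster: -G allows nodes in other clusters,
# -b picks the least-loaded eligible node, -m700 requires 700MB of free memory,
# -q queues the jobs and -S64 runs at most 64 of them at a time.
mosrun -G -b -m700 -q -S64 myfile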
How does it work
Automatic resource discovery by a gossip algorithm
Provides each node with the latest info about the cluster/multi-cluster resources (e.g. free nodes)
All the nodes disseminate information about relevant resources: speed, load, memory, local/remote I/O, IPC
Info is exchanged in a random fashion - to support scalable configurations and overcome failures
Useful for high-volume transaction processing
Example: a compilation farm - assign the next compilation to the least-loaded node (a toy sketch of the gossip idea follows)
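A toy shell sketch of the gossip idea - not MOSIX code: each node periodically sends its own load and free-memory figures to one randomly chosen peer, so fresh information spreads without a central server (the peer host names, the port and the use of nc are assumptions for illustration):

#!/bin/bash
# Toy randomized dissemination loop.
PEERS=(node01 node02 node03 node04)                      # assumed peer host names
while true; do
    load=$(cut -d' ' -f1 /proc/loadavg)                  # 1-minute load average
    freemem=$(awk '/MemFree/ {print $2}' /proc/meminfo)  # free memory in KB
    peer=${PEERS[RANDOM % ${#PEERS[@]}]}                 # choose one peer at random
    echo "$(hostname) load=$load freemem_kb=$freemem" | nc -w 1 "$peer" 4567
    sleep 1
done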
Dynamic workload distribution
A set of algorithms that match required and available resources
Geared to maximize performance
Initial allocation of processes to the best available nodes in the user's private cluster - not to nodes outside the private cluster
Automatic load-balancing
Automatic migration from slower to faster nodes
Authorized processes move to idle nodes in other clusters
Multi-cluster-wide process migration
Outcome: users need not know the current state of the cluster and the multi-cluster resources
Core technologies
Process migration - move the process context to a remote node
OS virtualization layer - allows migrated processes to run in remote nodes, away from their creation (home) nodes
[Diagram: a migrated process - local processes run on the home node while the migrated process runs as a guest on the remote node; each node runs an OS virtualization layer on top of Linux, and the MOSIX link reroutes system calls from the remote node back to the home node]
The OS virtualization layer
Provides the necessary support for migrated processes
By intercepting and forwarding most system-calls to the home node
Result: migrated processes seem to be running in their respective home nodes
The user's home-node environment is preserved
No need to change applications, copy files, log in to remote nodes or link applications with any library
Migrated processes run in a sandbox
Outcome: users get the illusion of running on one node
Drawback: increased communication and virtualization overheads
The overhead is reasonable relative to the added cluster/multi-cluster services (see the next slide and the illustration below)
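Not MOSIX code, but one way to see the class of system calls that the virtualization layer intercepts and forwards home for a migrated process (strace is a standard Linux tool; myprog is a hypothetical program):

# Trace the file-related system calls issued by myprog and its children;
# for a migrated MOSIX process, calls like these are the ones rerouted to the home node.
strace -f -e trace=open,read,write,close ./myprog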
Reasonable overhead: Linux vs. migrated MOSIX process times (sec.), 1Gbit-Ethernet

Application                                            RC              SW              JEL             BLAT
Total I/O (MB)                                         0               90              206             476
Local - Linux process (sec)                            723.4           627.9           601.2           611.6
Migrated process - same cluster (sec / slowdown)       725.7 / 0.32%   637.1 / 1.47%   608.2 / 1.16%   620.1 / 1.39%
Migrated process - across 1Km campus (sec / slowdown)  727.0 / 0.5%    639.5 / 1.85%   608.3 / 1.18%   621.8 / 1.67%

Sample applications: RC = CPU-bound job, SW = protein sequences, JEL = electron motion, BLAT = protein alignments
Main multi-cluster features
Administrating a multi-cluster
Priorities among different clusters
Scheduling and monitoring
Supports batch jobs, checkpoint and recovery
Supports disruptive configurations
MOSIX Reach the Clouds (MRC)
Administrating a multi-cluster
A federation of x86 (both 32-bit and 64-bit) clusters, servers and workstations whose owners wish to cooperate from time to time
Collectively administered:
Each owner maintains its private cluster and determines its priorities vs. other clusters
Clusters can join or leave the multi-cluster at any time
Dynamic partition of nodes to private virtual clusters
Users of a group access the multi-cluster via their private clusters and workstations
Process migration among different clusters
Outcome: each cluster and the multi-cluster performs like a single computer with multiple processors
Why an intra-organizational Grid: due to trust
The priority scheme
Cluster owners can assign priorities to processes from other clusters
Local and higher priority processes force out lower priority processes
Pairs of clusters can be shared symmetrically (C1-C2) or asymmetrically (C3-C4)
A cluster can be shared (C6) among other clusters (C5, C7) or blocked for migration from other clusters (C7)
Dynamic partition of nodes to private virtual clusters
Outcome: flexible use of nodes in shared clusters
[Diagram: clusters C1-C7 illustrating symmetric, asymmetric, shared and blocked sharing relationships]
When priorities are needed
Scenario 1: one cluster; some users run many jobs, depriving other users of their fair share
Solution: partition the cluster into several sub-clusters and allow each user to log in to only one sub-cluster
Users in each sub-cluster can still benefit from idle nodes in the other sub-clusters
Processes of local users (in each sub-cluster) have higher priority than guest processes from other sub-clusters
Scenario 2: some users run long jobs while other users need to run (from time to time) short jobs
Scenario 3: several groups using a shared cluster
Sysadmin can assign different priorities to each group
Scheduling and monitoring
Batch jobs run as Linux processes in different nodes
Checkpoint & recovery - on a time basis, manually or by the program
Live queuing - queued jobs maintain an organic connection with their Unix environment
Queue management provides means for tracing jobs, changing priorities and the order of execution, and for running parallel (e.g. MPI) jobs
Queued jobs are released gradually, in a manner that prevents flooding the local cluster or other clusters
Built-in on-line monitor for the local cluster resources
On-line web monitor of the multi-cluster and each cluster
http://www.mosix.org/webmon
Example: queuing
With the -q flag, mosrun places the job in a queue
Jobs from all the nodes in each cluster share one queue
Queue policy: first-come-first-served, with several exceptions
Users can assign priorities to their jobs, using the -q{pri} option
The lower the value of pri, the higher the priority
The default priority is 50; it can be changed by the sysadmin
Running jobs with pri < 50 should be coordinated with the cluster's manager
Out-of-order and fair-share
These options allow a fixed number of jobs per user to start instantly, overriding the queue
Examples:
> mosrun -q -b -m1000 myprog (queue a MOSIX program to run in the cluster)
> mosrun -q60 -G -b -J1 myprog (queue a low-priority job to run in a different cluster)
> mosrun -q30 -E -m500 myprog (queue a high-priority batch job)
mosq - view and control the queue
mosq list - list the jobs waiting in the queue
mosq listall - list jobs already running from the queue and jobs waiting in the queue
mosq delete {pid} - delete a waiting job from the queue
mosq run {pid} - run a waiting process now
mosq cngpri {newpri} {pid} - change the priority of a waiting job
mosq advance {pid} - move a waiting job to the head of its priority group within the queue
mosq retard {pid} - move a waiting job to the end of its priority group within the queue
More options are described in the mosq manual (a short example session follows)
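A short example session using only the mosq subcommands listed above (the job PID 12345 and the new priority 40 are hypothetical values):

# Inspect the queue, then adjust one waiting job.
mosq list               # jobs still waiting in the queue
mosq listall            # waiting jobs plus jobs already running from the queue
mosq cngpri 40 12345    # give waiting job 12345 priority 40 (lower value = higher priority)
mosq advance 12345      # move it to the head of its priority group
mosq run 12345          # or start it immediately, ahead of the queue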
Disruptive configurations
When a cluster is disconnected:
All guest processes move out - to available remote nodes or to the home cluster
All migrated processes from that cluster move back
Returning processes are frozen (image stored) on disks
Frozen processes are reactivated gradually
Outcome:
Long-running processes are preserved
No overloading of nodes
MOSIX Reach the Clouds (MRC)
MRC is a tool that allows applications to run in remote nodes on Clouds, without pre-copying files to these nodes
Main features:
Runs on both MOSIX clusters and Linux computers (with unmodified kernel)
No need to pre-copy files to remote clusters
Applications can access both local and remote files
Supports file sharing among different computers
Stdin/out/err are preserved locally
Can be combined with "mosrun" on remote MOSIX clusters
Hebrew University multi-cluster campus Grid (HUGI)
17 production MOSIX clusters: ~350 nodes, ~750 CPUs
In Life Sciences, the Medical School, Chemistry and Computer Science
Sample applications that our users are running:
Nano-technology
Molecular dynamics
Protein folding, Genomics (BLAT, SW)
Weather forecasting
Navier-Stokes equations and turbulence (CFD)
CPU simulator of new hardware designs (SimpleScalar)
Priorities among HUGI clusters
[Diagram: priority values (e.g. 20, 50, 70, 100) assigned between the HUGI clusters - CS Student Farm, CS Theory group cluster, CS General cluster, Biology1 and Biology2]

CS General - priority for accepting processes, by originating cluster:
Theory: 20, Student Farm: blocked, Biology1: blocked, Biology2: 50

Biology2 - priority for accepting processes, by originating cluster:
Theory: 50, Student Farm: blocked, CS General: 70, Biology1: 20
Day use: idle shared nodes allocated to users
[Diagram: the HUGI clusters - Chemistry, Computer Science, Life Sciences, student farms, Group 1 and Group 2 clusters - during the day: student and guest processes, including guest processes from Group 1, run on idle shared nodes]
Night use: most nodes are allocated to one group
[Diagram: at night most of the HUGI nodes - Computer Science, the student farms and the Group 2 clusters - are allocated to the Group 1 cluster]
Web monitor: www.MOSIX.org/webmon
Display:
Total number of nodes/CPUs
Number of nodes in each cluster
Average load
Zooming on each cluster
Display:
Load
Free/used memory
Swap space
Uptime
Users
Conclusions
MOSIX2 is a comprehensive set of tools for automatic management of Linux clusters and multi-clusters
Self-management algorithms for dynamic allocation of system-wide resources
Cross-cluster performance is nearly identical to that of a single cluster
Many supporting tools for ease of use
MRC for running applications on Clouds
Includes an installation script and manuals
Can run in native mode or on top of virtual-machine packages, e.g. VMware, Xen, MS Virtual Server, over an unmodified OS (Linux, Windows, OS X)
How to obtain a copy of MOSIX
A free, unlimited trial copy is provided to faculty, staff and researchers for use in academic, research and non-profit organizations
A free, limited evaluation copy is provided for non-profit use
Non-academic copies are also available
Details at
http://www.MOSIX.org