Access Space Renderfarm Project
Keep it simple, stupid
We all have days when the simplest of tasks can evolve into a bloody-minded determination to singlehandedly relocate an entire beach, one grain of sand at a time, using only a pair of chopsticks. I recently had one of those days. I was bored. Life without a job can easily become one of endless procrastination. I needed a reason to play around with Linux whilst studying for Linux Professional Institute certification. I needed a project. Something that would cover the whole gamut of hardware and software problems. But not something dull like simply installing Linux. I needed a project of megalomaniacal proportions.
The Big Idea
Wouldn't it be cool if people coming into Access Space had the ability to do big jobs like video processing or raytracing? None of the computers available in the space is powerful enough to process these CPU-intensive tasks quickly; few users are prepared to give up long periods of time to the god of slowly growing blue bars whilst waiting, and waiting, and waiting... However, there are a lot of unused CPU cycles when the network is considered as a whole. If only it were possible to harness this wasted processing power, you know - kind of like the SETI@home and Folding@home projects do, but limited to the local network environment.
Clustering
Clustering is a means by which a number of computers, or 'nodes', can share a large job, or ensure that a particular service, such as a web server, remains available even if one or more nodes fail. Simply put, there are three types of cluster: fail-over, load-balancing and high performance computing. The fail-over and load-balancing types are commonly used for web farms and databases and can ensure an extremely high availability of service. However, high performance computing is what I want: raw, unbridled processing power. High performance clusters share a key feature with load-balancing clusters: a task is split into chunks which can be processed independently of each other, and at the same time. This is a type of parallel processing.
Until recent years parallel computing was only realistically achievable on extremely expensive supercomputers. A supercomputer is roughly analogous to many ordinary computers linked together. A cluster may consist of exactly that: many ordinary computers linked together. Indeed, of the systems in the November 2005 Top 500 list of the world's most powerful computers, the majority were clusters, and many of those ran Linux. The availability of cheap PC and network hardware means that what used to be considered supercomputer performance is now realistically accessible even with older computers of the type available in Access Space.
Beowulf
"Beowulf Clusters are scalable performance clusters based on commodity hardware, on a private system network, with open source software (Linux) infrastructure.
Each consists of a cluster of PCs or workstations dedicated to running high-performance computing tasks. The nodes in the cluster don't sit on people's desks; they are dedicated to running cluster jobs. It is usually connected to the outside world through only a single node" [1]
"There isn't a software package called "Beowulf". There are, however,several pieces of software many people have found useful for building Beowulfs. None of them are essential. They include MPICH, LAM, PVM, the Linux kernel, the channel-bonding patch to the Linux kernel (which lets you 'bond' multiple Ethernet interfaces into a faster 'virtual' Ethernet interface) and the global pid space patch for the Linux kernel (whichlets you see all the processes on your Beowulf with ps, and eliminate them), DIPC (which lets you use sysv shared memory and semaphores and message queues transparently across a cluster)." [2]
Sadly, a Beowulf cluster would be fairly useless in Access Space: even if enough people had large jobs to process, there would still be only a single point of access to the Beowulf network, and thus only one person at a time could benefit from it. There would also be problems with physical space, power supply, and running costs. I would still like to build one in the future.
OpenMosix
"The openMosix software package turns networked computers running GNU/Linux into a cluster. It automatically balances the load between different nodes of the cluster, and nodes can join or leave the running cluster without disruption of the service. The load is spread out among nodes according to their connection and CPU speeds.
Since openMosix is part of the kernel and maintains full compatibility with Linux, a user's programs, files, and other resources will all work as before without any further changes. The casual user will not notice the difference between a Linux and an openMosix system. To her, the whole cluster will function as one (fast) GNU/Linux system." [3]
Most of the following either paraphrases or is lifted directly from the openMosix HOWTO, which was written in 2003 and is in places confusing and contradictory. Consider this a précis of sorts, but not a replacement.
OpenMosix is a set of patches to the Linux kernel plus some userspace tools providing job control. There are several ways to set up an openMosix cluster, of which two would be both valid and useful within the context of Access Space. The first is the 'Single Pool', in which all nodes form a pool of processing resources and can migrate processes to other nodes. The second is the 'Adaptive Pool', in which nodes can join and leave the pool of resources. This could be done by means of a control script called on user login/logout, with the node leaving the pool when a user logs in and rejoining when there are no longer any users logged on (a rough sketch of such a script follows below). I personally feel the second is the more viable, since at times of high overall load across the openMosix cluster a user with a desktop session may experience what amounts to a denial of service. [4]
Of course, the Single Pool setup will have to be working before the Adaptive Pool can be tried.
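As a first stab at the Adaptive Pool idea, something like the script below could be hooked into the display manager's login and logout scripts (for example kdm/gdm PostLogin and PostSession). The script name and hook points are my own invention, and I am reading the behaviour of mosctl block, expel and noblock from the HOWTO, so treat this as a sketch rather than a tested recipe.

#!/bin/sh
# /usr/local/sbin/om-pool (hypothetical name) - pull this node out of the
# openMosix pool while a user is logged in, and put it back afterwards.
case "$1" in
    leave)
        mosctl block    # refuse processes migrating in from other nodes
        mosctl expel    # send any guest processes back to their home nodes
        ;;
    join)
        mosctl noblock  # accept migrated processes again
        ;;
    *)
        echo "usage: $0 {leave|join}" >&2
        exit 1
        ;;
esac

Note that this only stops other people's jobs landing on the desktop machine; the logged-in user's own processes could still migrate out to the rest of the pool.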
Things To Do
Download and install the openmosix-kernel and openmosix-user packages from the openMosix repository on SourceForge. At the time of writing the latest stable version of openMosix is based on the 2.4.x kernel. There are also beta versions based on the 2.6.x kernel, but I'm staying away from those for now. Unless you like working at a prompt you may also like to snarf the openmosixview cluster management GUI, which will need compiling from source. After installing openMosix and its dependencies (ssh is required for remote access to nodes), reboot and select the openMosix kernel at the bootloader screen. You will see messages from the openMosix autodiscovery daemon, omdiscd, during startup. Log on, open a terminal shell if using X, and test functionality by running the following command:
awk 'BEGIN {for(i=0;i<10000;i++)for(j=0;j<10000;j++);}' &
Careful! Those are plain single quotes, not backticks. Run a few copies in the background so there is more load than the local node can soak up (the sketch below pulls the whole sequence together), then run mosmon to see the process migration status. Hells, whilst you're there read the fine mosmon manual too. Now would also be a good time to play with openmosixview, which requires that it be run by a user capable of logging on as root at each node. Use ssh.
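For the record, the whole install-and-test sequence boils down to something like this. The package file names are placeholders for whatever is current in the SourceForge repository, and four test jobs is an arbitrary number; the point is simply to create more work than the local node can handle on its own.

# install the kernel and userland packages fetched from SourceForge
# (file names and versions are placeholders - use whatever you downloaded)
rpm -ivh openmosix-kernel-*.rpm openmosix-user-*.rpm

# reboot, pick the openMosix kernel at the bootloader, log back in,
# then start a handful of CPU-burning test jobs in the background...
for n in 1 2 3 4; do
    awk 'BEGIN {for(i=0;i<10000;i++)for(j=0;j<10000;j++);}' &
done

# ...and watch them wander off to other nodes
mosmon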
Nothing is easy any more...
For openMosix to be viable in Access Space it would need to be set as the default kernel on all the users' machines. Unfortunately the pre-rolled kernels do not have sound support compiled in. Having machines constantly requiring a reboot, either to use a multimedia kernel or to be part of the cluster, would be both annoying for those people working on audio projects and detrimental to the usefulness of the openMosix cluster. I also had other issues [5] and so was forced to use the source package. The source code will not compile with gcc 4.x or anything newer. It may (or may not) compile with gcc 3.4.5. The kernel.org website recommends using gcc >= 2.95.3 to compile the 2.4.x kernels. This caused further problems [6] and I ended up downloading, compiling, and installing gcc-2.95.3 manually to an alternative location, following these instructions, which I have mirrored here in case the original page disappears into the ether...
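For what it's worth, the manual build boils down to something like the following. The /opt/gcc-2.95.3 prefix is just my choice of out-of-the-way location, the kernel source directory name is only an example, and overriding CC on the make command line is one way of pointing a 2.4.x kernel build at the old compiler - adjust to suit your own setup.

# build and install gcc-2.95.3 somewhere it cannot trample the system compiler
tar xzf gcc-2.95.3.tar.gz
cd gcc-2.95.3
./configure --prefix=/opt/gcc-2.95.3
make bootstrap
make install

# then build the openMosix 2.4.x kernel with the old compiler, roughly:
cd /usr/src/linux-2.4.26-openmosix1    # example directory name
make menuconfig
make dep bzImage modules CC=/opt/gcc-2.95.3/bin/gcc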
To be continued...
Footnotes / References
[1] Taken from the Beowulf FAQ, Section 1.
[2] Also from the Beowulf FAQ, Section 3.
[3] From the openMosix HOWTO.
[4] This is conjecture; I am not currently aware of how openMosix balances CPU load and priority between migrated tasks and local tasks. I also wonder what the effect would be upon a migrated job being processed on a node at the time a user logs on. Will data be lost?
[5] Sometimes the openMosix kernel would refuse to use the network config in /etc/sysconfig/network-scripts/ifcfg-eth*, fail to initialise the network, and consequently fail to join the cluster. I have no idea why. Probably not an openMosix problem.
[6] Due to (I think) a package misconfiguration, and gcc being a dependency of basically everything, it was not possible to install an RPM of a second version of gcc alongside a pre-existing and more recent version. Mandrake/Mandriva (amongst other distros) does not currently respect the /usr/local/bin install path, and so tries to overwrite shared libraries and config files. This may be a problem specific to the gcc-3.4.5 RPM package which I downloaded from rpmfind.net.