Korundi Cluster RETIRED!

9.4.2010: Instructions for running parallel jobs can be found here.

4.8.2014: Korundi has now (in fact already in spring 2014) been updated to Scientific Linux, and the batch job system has changed from SGE to SLURM.

1.1.2017: Korundi is retired.

Introduction

korundi.grid.helsinki.fi is a 400-core (50-node) Linux cluster owned and operated jointly by the Department of Physics, the Department of Chemistry and the Helsinki Institute of Physics. It is intended for computational research at these institutes.

The cluster environment is similar to that of the alcyone cluster. If you have run jobs on alcyone, all you have to change in your run script is the queue name (or, more precisely, the resource requirements of the run). Instructions on how to see which queues are available can be found on the alcyone web page.

The cluster environment is built on the Scientific Linux distribution, a Red Hat Enterprise Linux-based distribution specialized for cluster installation, configuration, monitoring and maintenance.

Cluster environment

Hardware

  • Front-end node: korundi.grid.helsinki.fi
    • Dell PowerEdge 1950 III server
    • Runs the batch queue system and other services.
    • Fileserver for home directories.
    • The only machine intended for interactive use.
  • Disk server:
    • Dell PowerVault MD3000
    • 8 TB diskspace
  • Computational nodes:
    • To be accessed only through the batch queue system

    • Batch jobs are submitted with the command sbatch (see below).

    • A nodes
      • Total number 38
      • Dell PowerEdge 1950 III server
      • Two Intel Xeon X5450 Quad Core CPUs
      • 16 GB memory
      • Two 500 GB 7.2k hotswap SATA disks
    • B nodes
      • Total number 12
      • Dell PowerEdge 2950 III server
      • Two Intel Xeon X5450 Quad Core CPUs
      • 32 GB memory
      • Three 500 GB 7.2k hotswap SATA disks

Account allocation

Accounts are allocated to research scientists who work at the Department of Physics, the Department of Chemistry (participating labs) or HIP, and who need computing capacity larger than that of a few workstations.

To apply for an account, the group leader should fill in this form, sign it, and send it to Antti Kuronen (address in the document). The form is also available in LaTeX format.

The user list should only contain scientists who need massive computational capacity.

After the initial application has been accepted, new users can be added to the group by a simple e-mail request from the group leader.

The group leader has the responsibility to ensure that the cluster users in the group are aware of the rules of usage (listed below), and that they know enough of the use of Unix systems to be able to follow the rules.

Advice on usage

There are two e-mail alias lists for korundi
  • korundi-admin: administrators
  • korundi-users: users

Both of these are (at)helsinki.fi. The alias korundi-users is used locally and should be kept low-volume, i.e. used only when it is really necessary to reach all users. Ordinary users should normally not need to e-mail this list, but may find the korundi-admin alias useful. Remember, though, that basic advice on usage should come from within your own research group.

Security

Be very careful with all passwords and passphrases you use (i.e. use good passwords and only log in from trusted sites). Report any suspicious activities in the cluster which might be of cracker origin to the technical administrators!

Login

The only supported protocol for login and data transfer is ssh.

Login is only possible from IP addresses that are explicitly opened in the firewall; these are essentially the addresses of the owning labs.

Login with ssh korundi.grid.helsinki.fi
  • Note: The first time you log in, the system asks for a passphrase. Please give an empty one by pressing return! This is needed to make the batch queue system work properly.
  • If you have given a non-empty passphrase, just do rm ~/.ssh/identity* and then log in again to give an empty one.
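
If you prefer to regenerate the key by hand instead of logging in again, a minimal sketch with a standard OpenSSH client is shown below. The key type and file name (id_rsa) are assumptions; the cluster's own first-login setup may use different ones.

  # Generate an RSA key pair with an empty passphrase (-N "")
  ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
  # Allow passwordless logins between the nodes, which share the home directory
  cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
  chmod 600 ~/.ssh/authorized_keys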

How to run the jobs

All runs should be started under the SLURM batch queue system from the front-end node.
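
A minimal serial batch script might look like the following sketch, assuming the queue names listed in the SLURM section below map directly to SLURM partitions. The job, program and input file names are placeholders, not software installed on korundi.

  #!/bin/bash
  #SBATCH --job-name=myrun            # name shown in the queue listing
  #SBATCH --partition=2G_short_ser    # one of the queues listed below
  #SBATCH --ntasks=1                  # serial job: a single task
  #SBATCH --time=1-00:00:00           # wall time limit, here one day
  #SBATCH --output=myrun.out          # file for standard output and error

  srun ./my_program input.dat         # my_program and input.dat are placeholders

Save the script as, e.g., run.sh and submit it with sbatch run.sh.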

User support

The cluster uses a standard Linux operating system. Knowledge of Unix/Linux systems and computational methods is a prerequisite for the use of this system. No dedicated user support is available. New members should be guided and supported in their cluster use by the more experienced users within the research group.

Software support

The only software supported by the administrators is the standard Rocks distribution Linux software and the compilers. Advice on suitable compiler options will be given later.

Commercial software may be installed to the system if the group(s) needing the software provide the funding and do the installation and maintenance.

Compilers

Currently, in addition to the GNU compilers (gcc and gfortran), Intel Linux compilers are installed. Compilers are set up using the module system; for details, see the alcyone web page.
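
As a sketch, compiling a program with the installed compilers might look like the following. The exact module name is an assumption; check module avail for what is actually installed.

  module avail                              # list the available modules
  module load intel                         # "intel" is a placeholder module name
  icc   -O2 -o my_c_prog my_c_prog.c        # Intel C compiler
  ifort -O2 -o my_f_prog my_f_prog.f90      # Intel Fortran compiler
  gcc      -O2 -o my_c_prog my_c_prog.c     # GNU C compiler
  gfortran -O2 -o my_f_prog my_f_prog.f90   # GNU Fortran compiler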

File system

Use the command df to check the exact file system status. Quotas are in effect on users' home directories.

Directories visible to all machines:

  • /home/<username>/ contains the user's home directory. It is meant for software sources and binaries and for small input and data files, but not for long-term data storage or for running simulations. Disk quotas are in effect on home directories, and they are backed up.
  • /scratch/<username>/ has no quotas in effect. It can be used for storing results of calculations. Note that scratch directories are not backed up: if you accidentally delete a file, it is gone forever.
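
For example, the file system status and your own quota usage can be checked with standard commands (the exact mount points reported by df depend on the installation):

  df -h /home /scratch    # free space on the shared file systems
  quota -s                # your quota usage in human-readable units
  du -sh ~/*              # which directories use most of your home quota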

Directories visible to the local computational node only:

  • /tmp is meant for running all jobs. It is a RAID 0 array built from local disks. Do not run batch jobs on /home!
  • All batch jobs should first cd to /tmp/<username> and run below that directory (see the sketch after this list). This is because /home is an NFS-mounted directory, and accessing it is much slower than accessing the local disks.
  • Files on /tmp are removed at every boot of the node. It is thus the user’s responsibility to copy the output files to his/her home directory.
  • Users should make sure they do not fill up any local /tmp disk too much themselves.
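
A sketch of the recommended pattern inside a batch script is given below. The directory and file names are placeholders; $SLURM_JOB_ID is set by SLURM for every job.

  # Create a job-specific directory on the local /tmp disk and run there
  WORKDIR=/tmp/$USER/$SLURM_JOB_ID
  mkdir -p $WORKDIR
  cd $WORKDIR
  cp $HOME/myrun/input.dat .        # copy the (small) input files in
  srun ./my_program input.dat       # run on the local disk
  cp output.dat $HOME/myrun/        # copy the results back to /home
  cd /
  rm -rf $WORKDIR                   # clean up the local /tmp disk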

Batch queue system: SLURM

Korundi uses the Simple Linux Utility for Resource Management (SLURM). The same system is used in our alcyone cluster, so refer to its web page for instructions.

Currently the following queues are configured:
  • 2G_long_ser
  • 2G_short_par
  • 2G_short_ser
  • 4G_long_ser
  • 4G_short_ser

Jobs in the 2G (4G) queues go to the A (B) nodes. The CPU time limit is 7 days for the short queues and 30 days for the long queues.
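
The queues and their current load can be inspected with the standard SLURM commands, for example:

  sinfo                     # partitions (queues), their nodes and states
  sinfo -o "%P %l %D %t"    # partition, time limit, node count, state
  squeue                    # all jobs currently in the queues
  squeue -u $USER           # only your own jobs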

In addition there are queues:
  • courses_par
  • debug_ser

These are for teaching purposes and debugging, respectively.

Logging in to individual nodes

Normally you should never need to do this. Nevertheless, if you need to, e.g. to figure out why your batch job crashed, you can log in to a computational node using ssh. The node names are koNN or kofNN (f meaning fat node, i.e. a B node). It is expressly forbidden to circumvent the batch system and run jobs by logging in to the individual nodes.
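
For example, to find out on which node a crashed job ran and then inspect its local /tmp directory (the job ID 12345 and node name ko01 are placeholders; sacct requires that SLURM accounting is enabled):

  sacct -j 12345 -o JobID,NodeList,State,ExitCode   # where the job ran and how it ended
  ssh ko01                                          # log in to that node ...
  ls -l /tmp/$USER                                  # ... and look at the local work directory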

Known problems and how to avoid them

None so far.

Rules of usage for the korundi computer cluster

The cluster is intended for the use of personnel of the Department of Physics, Department of Chemistry and HIP.

The cluster is administered by Administrators appointed by the Heads of the Departments.

Allowed use is research and education in physics and chemistry utilizing efficient simulation and numerical codes. Any large-scale simulations should be run with compiled software; that is, extensive runs using interpreted programs such as Matlab, Mathematica, etc., or scripting languages such as awk and Perl are not allowed unless an explicit exception is granted by one of the Administrators. Running password cracking, cryptography, and "seti@home"-type programs on korundi is naturally strictly prohibited.

Research use accounts are granted on a group-by-group basis. Eligible groups are those working at the owning institutions. In unclear cases, the Head of the respective owning institution decides whether a group is eligible for an account.

To open a group research account, the group leader should fill in the initial application form, and send it to the Account Administrator. After the initial application has been accepted, new users can be added to the group by a simple e-mail request from the group leader.

Educational accounts may be allocated to the lecturer of a computational physics course requiring parallel computing resources, for the period of the course, according to a separate agreement with one of the Administrators. The lecturer is responsible for ensuring that the educational accounts are used only for proper course work, and for guiding the course students into proper use of the cluster.

As of now, there are no pre-defined limits for usage. Groups are expected to use the machine in a gentlemanly manner, not attempting to hoard as much computer capacity for themselves as possible at the expense of other groups. All CPU use of each group is logged, and if a single group has used what seems like an obviously unreasonable share of the cluster for a long period of time, the Administrators have the right to ask them to limit their use in the future. If after several warnings the group still uses unreasonable amounts of capacity, the group accounts can be closed for a fixed period of time.

The use of the machine should take into account hardware limitations such as memory and hard drive space. Hard disk space for public use is allocated on the /home and /tmp disks. Each user should keep their disk space usage to a reasonable minimum and clean out files they no longer need. All long jobs should put their output on the /tmp disks, which are not backed up and are not intended for long-term storage. Old files on the /tmp disks may be removed without prior warning to the user.

Although the cluster is intended mainly for serial jobs, small-scale parallel runs can be executed. Detailed information on the parallel environment and the maximum number of processors that can be allocated will be given separately. Running embarrassingly parallel jobs using scripts is allowed within the limits set by the batch queue system on the number of jobs.

The group leader has the responsibility to ensure that the users in the group are aware of these rules, and that they know enough of the use of Unix systems to be able to follow these rules.

Any cluster user is allowed and indeed encouraged to report clear violations of these rules to the Administrators.

In case of clear violations of these rules, whether intentional or due to negligence or poor understanding of the system, the Administrators can issue formal warnings to the group leader or course lecturer. If after two warnings the group still does not comply with the rules, the group account on the cluster will be closed for a fixed amount of time, or permanently.

Naturally you should also follow the University of Helsinki general rules of computer usage.

Persons in charge

The Administrators of the cluster are Kai Nordlund, Tomas Lindén, and Juha Vaara.

The technical administrators are (24.4.2009):
  • Pekko Metsä
  • Teemu Pennanen (batch queues)
  • Antti Kuronen (user accounts)

Email addresses of the administrators are of the form firstname.lastname@helsinki.fi with diacritics removed. However, note that in technical matters it is best to send the message to the korundi-admin mailing list.

Latest update: 4 Aug 2014, A.Kuronen