Research Compute Environments

Executive summary

Computer Science operates several computer systems that its researchers can use to run their jobs. It is worth researching the specific choices based on historical usage and over-usage patterns, and on appropriateness to your particular task (i.e., match compute-bound jobs to fast CPUs, and memory-bound processes to large-memory machines).

Your network-mounted storage can be found in /remote, and local storage in /local. Use your CSID and password to log in via SSH.
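For example, before staging a large data set you can compare the free space on the two mounts. A minimal sketch in Python, assuming the /remote and /local paths described above exist on the machine you are logged in to:

    import shutil

    # Compare free space on the network and local mounts before staging data.
    # /remote and /local follow the convention described above; the exact
    # mount layout may vary from machine to machine.
    for mount in ("/remote", "/local"):
        usage = shutil.disk_usage(mount)
        print(f"{mount}: {usage.free / 2**30:.1f} GiB free of {usage.total / 2**30:.1f} GiB")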

Available FCS systems are listed in the Systems section below.

There are also systems located outside the FCS that may be available for your use.

For the current status of the systems, you can consult monitoring pages from Nagios or Munin. (These pages will only load when accessed from Dalhousie's network.)

Finally, here is a copy of the slides used in a CS Seminar presentation on these services.

FCS systems

Computing environment overview

The Computer Science Faculty operates several computer systems for research. Some are designated for use by the whole Faculty, and some have been placed at the general research community's disposal because the group that acquired the equipment is not currently using its full capacity in the publication cycle.

This introduction is intended to give you an idea of what facilities are available and what you can expect on login, and to give you enough information to select the equipment most appropriate to your research problem.

If you have questions, or ideas for information not listed here that would better assist you in selecting appropriate equipment, please contact Technical Support via cshelp@cs.dal.ca.

Before you start

Please remember that you are a guest on the equipment which means that:

  • Not all requests can be accommodated. Requests for additional software or changes to the existing software on the machine may not be possible if they conflict with the needs of the hosting group.
  • Hosting group needs take precedence over general use. Contact cshelp@cs.dal.ca if you would like to schedule training on available facilities for managing job priorities and monitoring resource usage.
  • Avoid filling the disk with your data or results for an extended period of time.

You can also check the current and past host load to get an idea of how much each system is used.

Standard environment

The following pieces of equipment have been brought in line with the standards described below, so that when you log in you can start using the equipment to maximum effect with a minimum of searching around.

Login Name: Each machine is connected to the FCS LDAP server, so you can use your CSID to log in, along with the same password that you use for other resources.

Storage/disk space

The standard storage locations are summarized in this table, and described in more detail below.

Location                        Description             Reliability                                  Speed  Accessibility
/users/type/name                Local home directory    Depends on machine                           Fast   Local only
/remote/users/type/name         Network home directory  Backed up                                    Slow   Available on all machines
/local/data                     Local copy of data      Manual removal; may or may not have backups  Fast   Local only
/remote/data/{user|group}/name  Network data            Likely backed up                             Slow   Available on requested machines
/local/scratch                  Local working set       Likely cleaned by size or age                Fast   Local only
/tmp and /var/tmp               Standard scratch space  Very transient                               Fast   Local only

Home directory

Local: When you log in, you will generally be using a local home directory that is created on demand at first login. This lets you customize your login environment appropriately where the installed tools vary between machines. This local directory generally has limited storage and might not be backed up, but, being local storage, it should have low-latency access.

Network: In addition, on each machine you can access the home directory you have on Bluenose/Hector. This home directory is backed up and available on all machines, but because it is accessed over the network it is high-latency. You can use it to store your programs and small data sets for quick deployment on new machines, but you generally should not use it as the scratch and results storage location for running programs.
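As a sketch of that workflow (the user type and project directory names here are hypothetical), you might keep code and small data sets in the network home directory and copy them to local storage at the start of a session:

    import shutil
    from pathlib import Path

    # Hypothetical layout: code and a small data set live in the backed-up
    # network home directory; copy them to the fast local home directory
    # before a run. "grad", "yourcsid" and "myproject" are placeholders.
    network_home = Path("/remote/users/grad/yourcsid/myproject")  # backed up, slow
    local_copy = Path.home() / "myproject"                        # fast, local only

    if not local_copy.exists():
        shutil.copytree(network_home, local_copy)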

Data storage

Local data space is in /local/data. This is where the host project likely stores its data, so there may or may not be space available. You can copy your data set down from a remote source here to be worked on by your program. When you are no longer using the machine, please clean up after yourself; this area is manually cleaned if necessary.

Network Data Space: Found in /remote/data, this is further broken down into /remote/data/user/username and /remote/data/group/groupname. Your research group might have invested in additional storage to be mounted in its space; if the group has a separate fileserver, the automounter will mount it into this space. This data comes over the network, so factor that into your decision about whether to copy it down to /local/data or /local/scratch, or to access it directly from the remote source. Depending on the group, this space might be backed up, but it is at least very likely to be on RAID storage.

Scratch space

This is for your working set. It is local storage, and might be on a faster disk if the machine is optimized for various classes of storage. It is definitely not backed up, and likely has an automated cleaning cron job running on it. It may simply be a soft link to /var/tmp if a separate filesystem was not allocated for it. If the system has a separate scratch disk, it is likely fairly large.

/tmp and /var/tmp: These are very transient storage to be used for working files. /tmp is sometimes a RAM disk, and as such might be very fast and very small; /var/tmp is on disk.
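For instance, a job can create its working directory under /local/scratch when that filesystem exists, falling back to the system default otherwise. A minimal sketch:

    import os
    import tempfile

    # Use local scratch when it exists; otherwise fall back to the system
    # default (usually /tmp). Scratch is fast but not backed up and may be
    # cleaned automatically, so treat everything here as disposable.
    scratch_root = "/local/scratch" if os.path.isdir("/local/scratch") else None
    workdir = tempfile.mkdtemp(prefix="myjob-", dir=scratch_root)
    print("working directory:", workdir)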

Standard tools

Machines in the standard infrastructure have a base set of tools that you can assume will be installed. What commitments were made about installed versions is system-dependent. (Some projects require old versions of tools to work, some machines are on older distributions of the operating system, and some projects need certain development libraries or modules installed within the development environment.) You can assume that there will be:

Editors: vim, emacs, pico
Development: gcc/g++ (C/C++ compilers), PHP, Perl, Java JDK, Python.
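Since the installed versions vary by machine, it is worth probing a system for the tools you need when you first log in; a small sketch:

    import shutil

    # Probe a machine for the standard tool set; which versions are
    # installed is system-dependent.
    for tool in ("vim", "emacs", "pico", "gcc", "g++", "php", "perl", "java", "python3"):
        path = shutil.which(tool)
        print(f"{tool:8} -> {path or 'not found'}")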

Guidelines for running jobs

In choosing the machine to run on, and which directories to use for input, output, and working files, you should match your program's characteristics to the nature of the system. Determine whether your job is:

  • CPU-bound
  • IO-bound, or
  • Memory-intensive, and
  • whether your job is single-threaded or multi-threaded.

Choosing a really fast machine with small memory and disk space would not be a good fit for a job that processes huge data sets.
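One rough way to classify an existing job is to compare its CPU time to its wall-clock time over a short test run: a ratio near 1 suggests it is CPU-bound, while a low ratio suggests it spends most of its time waiting on IO. A minimal sketch, where workload() stands in for your own code:

    import os
    import time

    def profile(workload):
        # Compare user+system CPU time against wall-clock time.
        wall_start = time.monotonic()
        cpu_start = sum(os.times()[:2])
        workload()
        cpu = sum(os.times()[:2]) - cpu_start
        wall = time.monotonic() - wall_start
        ratio = cpu / wall if wall else 0.0
        # Near 1.0 (or above, for multi-threaded jobs): CPU-bound.
        # Well below 1.0: mostly waiting on disk or network IO.
        print(f"CPU/wall ratio: {ratio:.2f}")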

In deciding whether to copy your data down from network storage, consider how your program will access it. If it makes a single run that processes each line of the file once, it makes more sense to read the file directly over the network. If it makes multiple runs, or jumps back and forth in the data file, you should make a local copy.
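A sketch of that decision in code (the data-set path is hypothetical): stage a local copy only when you expect multiple passes or random access, and read straight from the network mount otherwise:

    import shutil
    from pathlib import Path

    def data_path(remote: Path, passes: int, scratch: str = "/local/scratch") -> Path:
        # One sequential pass: just read over the network.
        if passes <= 1:
            return remote
        # Multiple passes or random access: stage a copy in local scratch.
        local = Path(scratch) / remote.name
        if not local.exists():
            shutil.copy2(remote, local)
        return local

    # Hypothetical data set on the network data space.
    corpus = data_path(Path("/remote/data/group/mygroup/corpus.txt"), passes=5)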

Long-running processes: some machines have longer uptimes than others. On machines with periodic maintenance, jobs that run for a week or more might be cancelled if they span a maintenance window that requires a reboot to load a new kernel. In addition, if your running job consults a network drive and the machine that serves that drive is rebooted, your job might encounter errors.

Remember these factors when deciding which machines to run on, your checkpointing interval, and which directories you run from.
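If a job may span a maintenance window, periodic checkpoints let it resume rather than restart. A minimal sketch using pickle; the state layout is hypothetical, and for very long jobs you may prefer to write checkpoints to the network data space, since local scratch can be cleaned by age:

    import pickle
    from pathlib import Path

    CHECKPOINT = Path("/local/scratch/myjob.ckpt")

    def run(total_steps=1_000_000, every=10_000):
        # Resume from the last checkpoint if one exists, else start fresh.
        state = pickle.loads(CHECKPOINT.read_bytes()) if CHECKPOINT.exists() else {"step": 0}
        for step in range(state["step"], total_steps):
            # ... one unit of real work goes here ...
            if step % every == 0:
                state["step"] = step
                CHECKPOINT.write_bytes(pickle.dumps(state))  # survives a job restart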

Systems

The following systems have been brought into the standard environment, with notes about their particular features:

Name       Project         CPUs           Memory  Internal Storage  Notes/Special Features
Hector     FCS             2x8 @ 2.67GHz  56GB    120GB             Faculty of Computer Science general-use research machine
CGM6       Risk Analysis   2x16 @ 2GHz    256GB   12TB              Large memory; 2 Xeon Phi cards for highly parallel computing jobs
CGM7       Risk Analysis   2x16 @ 2.6GHz  128GB   4.2TB
Ares       MALNIS          4x6 @ 2GHz     64GB    1.3TB             Matlab installed
Demeter    MALNIS          2x8 @ 2.27GHz  42GB    324GB             Matlab installed
Selene     MALNIS          2x4 @ 2.3GHz   16GB    1GB               Matlab installed
Helios     MALNIS          2x4 @ 2.3GHz   16GB    1.3TB
Risklab11  Risk Analytics  1x4 @ 3.6GHz   32GB    1.8TB             Tesla C2050 (CUDA)

Specialty systems

Hugh
A Hadoop cluster of 24 nodes with most of the Cloudera stack installed (YARN, HDFS, Sqoop, HBase, Spark, etc.). Each node has a 4-core Xeon @ 2.6GHz, 4GB RAM, and 2TB of disk. Note that only the front node is set up as the standard environment.
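As an illustration of the kind of job Hugh is suited to, here is a minimal PySpark word count. This is a hedged sketch: it assumes Spark's Python bindings are available on the front node, and the HDFS input path is hypothetical:

    from pyspark.sql import SparkSession

    # Minimal word count over a file already loaded into HDFS.
    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
    lines = spark.read.text("hdfs:///user/yourcsid/corpus.txt")  # hypothetical path

    counts = (lines.rdd
              .flatMap(lambda row: row.value.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))

    for word, n in counts.takeOrdered(10, key=lambda pair: -pair[1]):
        print(word, n)

    spark.stop()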

Systems located outside the FCS

Compute Canada/ACENET

The national initiative Compute Canada makes shared computing resources available to researchers across the country. Because this is a large pool of computing resources, it is where you would take your project, once developed, to scale it up to the full application. See the ACENET user guide for more details. Talk to your principal researcher to apply for access.

Institute for Big Data Analytics

There is also an IBM Netezza available for research use. Instructions for applying for access can be obtained from the Institute for Big Data Analytics.