
25. Running COMSOL on a Parallel Cluster under PBS

STP Boyd, June 9, 2015; based on experiences under COMSOL 5.1.

The execution of a COMSOL .mph file depends somewhat on the environment in which it is launched:

  • interactively inside the COMSOL GUI,
  • in batch mode on a workstation or other single-node environment, or
  • in batch mode on a cluster.

Execution inside the GUI is the simplest and most robust approach and is always the recommended strategy if possible. This makes sense because COMSOL Multiphysics started out as a GUI-based Windows program: COMSOL started by developing the PDE Toolbox for MATLAB, which became FEMLAB, which in turn became COMSOL Multiphysics.

Single-node batch execution is more fragile than working in the GUI, and cluster batch execution is the most fragile of all. What I mean by fragile is this: you might have a .mph file that executes perfectly inside the GUI. The same file might run to completion with no error messages in single-node batch, yet only later do you find out that it failed to produce the accumulated probe table (APT) you were expecting. And if you fire up the same .mph file under batch on a Linux cluster, it might simply hang, or error-terminate with a terse error message. This has been my experience under COMSOL 5.1.

Since the GUI-based approach cannot be used under batch execution in a Linux supercomputing environment, we have developed a recipe to help reduce or eliminate problems associated with this fragility. This QuickByte presents our recipe. We gratefully acknowledge the help of Mina Sierou, Par Persson Mattson, and Iraj Gholami of COMSOL, and the help of Ryan Johnson, Ben Archuleta, and Susan Atlas of CARC in developing this recipe. The steps are described for a fairly simple problem with a single “study” where many thousands of parametric sweep cases need to be run and the results output in a COMSOL “accumulated probe table” (APT).

After your first few successful cluster computations, you may want to start skipping Part 2, and only go back to those steps if problems develop.

Note: you MUST have a floating network license for COMSOL Multiphysics to be able to compute on a cluster. COMSOL can easily change your licensing arrangement if you are paying the annual support fees. There is an additional charge for a floating network license versus a machine-locked license.

Recipe Part 1: “Conditioning” the .mph file in the GUI

Perform these steps in the COMSOL GUI, which will work best on your own Windows laptop or PC. The Linux version of the GUI is not as smooth as the Windows version, but also works fine for performing these steps. If you have paid for the COMSOL Multiphysics license, note that the installation DVD allows you to install on Windows, Linux, or Mac.

Perform this procedure AFTER any additions or deletions to parameters, probes, and parametric sweep nodes. You can still change the parameter argument values in the parametric sweep nodes after you have performed this conditioning.

You will see there are two main steps in this part of the recipe:

  • attempting to erase COMSOL’s “memory” of any prior calculations
  • performing an initial execution within the GUI to ensure that COMSOL will create the accumulated probe table later under batch mode.

Skipping any of these steps can lead to a variety of problems when executing under batch on the cluster.

  1. Modify the parameter argument values in your parametric sweep nodes to perform a sweep that will only produce a few lines in the accumulated probe table. You will need to execute your .mph file under the GUI as part of “conditioning” the file, and you don’t want that execution to take more than a few minutes.
  2. If you have loaded any data files for interpolation functions into the .mph file, you need to “discard” them and set them up to load each time from the data file. I ran into this bug as soon as I got my .mph file otherwise OK for execution on the cluster: an error message saying that it couldn’t find the file for the interpolation function, even though the interpolation function data was already loaded into the .mph file. You should probably keep these data files in the same directory as the .mph file you are executing. Note that when you “browse” for the data file within the interpolation function definition, COMSOL defaults to a root-based file location, i.e. /users/username/path/to/filename.dat, which can be a problem if you want to be able to execute your file in different directories. I recommend hand-typing “./filename.dat” into the form, which tells COMSOL to look in the local directory and makes your code more portable in this sense.
  3. In your innermost parametric sweep node, under “Output While Solving” ensure that you have selected:

       a. Probes: all
       b. Check the checkbox for “accumulated probe table”
       c. Check the checkbox for “use all probes”
       d. For “keep solutions in memory” choose “only last”

fig.1

  4. Delete all solutions.
  5. Delete all meshes.
  6. Compact History. Even though this consistently FAILS for my .mph files, I do it anyway; it’s quick and it might help.
  7. Delete All Job Configurations (right-click on the Job Configurations node under the study node).
  8. Re-Initialize the Solver (right-click on the study node and choose “Show Default Solver”).
  9. Re-Initializing the Solver will re-populate the Job Configurations section. For each parametric sweep node in Job Configurations, check the checkbox for “distribute parameters” if appropriate for that node.

    fig.2

  10. Compute your reduced parameter sweep study in the GUI.
  11. Verify that the accumulated probe table was produced and has the correct number of lines.
  12. Clear the accumulated probe table using the broom icon.

    fig.3

  13. Save your .mph file; it will now serve as a working template for all jobs of this type. You might want to mark it as “read only” at this point.

Recipe Part 2: Quick test on the Cluster (Optional)

At this point your .mph file is set up to execute quickly, so it’s an ideal time to do a quick test on the cluster, before you submit for a multi-hour run.

  1. Create/modify your .pbs script for the quick test. Note that there are two levels of “nodes” and “cores” in the script (a sketch of such a script follows these bullets):

  • “PBS” nodes and cores, requested by the PBS script line that looks like “#PBS -l nodes=15:ppn=8”. CARC tech support can tell you what to ask for on each machine. The important quantity is (nodes) × (cores).

  • “COMSOL” nodes and cores, which are specified in the command-line call to COMSOL. For example, if your PBS script request looks like the one shown above, PBS (nodes) × (cores) = 120. If your job will be distributing more than 30 cases across the cluster, your arguments for nodes and cores in the COMSOL command line could be “-nn 30 -np 4”. The product of COMSOL nodes and cores should equal the product of PBS nodes and cores.

  • For the quick test, try to use at least two COMSOL nodes so that parameter distribution can actually occur.
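
For concreteness, here is a minimal sketch of what the quick-test .pbs script might look like. Only the “#PBS -l nodes=15:ppn=8” request and the “-nn”/“-np” arguments come from the discussion above; the job name, walltime, module name, file names, and the remaining comsol batch flags (-clustersimple, -inputfile, -outputfile, -batchlog) are common choices that you should verify against your site’s setup and the COMSOL documentation for your version.

    #!/bin/bash
    #PBS -N comsol_quicktest
    #PBS -l nodes=15:ppn=8
    #PBS -l walltime=00:30:00
    #PBS -j oe

    # Run from the submission directory, which should contain the input .mph file
    # and any "./filename.dat" interpolation data files.
    cd $PBS_O_WORKDIR

    # Site-specific: the module name is an assumption; ask CARC support for the right one.
    module load comsol

    # (-nn) x (-np) should equal PBS (nodes) x (ppn): here 30 x 4 = 15 x 8 = 120.
    # -clustersimple asks COMSOL to pick up the scheduler's node list automatically;
    # some sites instead pass a hostfile explicitly.
    comsol batch -clustersimple -nn 30 -np 4 \
        -inputfile quicktest_in.mph \
        -outputfile quicktest_out.mph \
        -batchlog quicktest.log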

  2. Use “tail -f logfilename.log” to monitor progress. Note that at a couple of points the logfile updates will pause. Here is a place where my jobs almost always have a pause:

fig.4

Part of the value of doing the quick test is that the “long pauses” will be only a few minutes, so it will be quick to tell if your job has hung or not. If you go straight to a big job you might spend a lot of time waiting for the logfile to start updating again, and maybe it’s not going to. Another part of the value is that the logfile will give you an idea of how much memory your jobs are taking, which might lead you to modify your nodes and cores requests so you can be most efficient in the full run.

  3. Hopefully your quick test completes successfully; once I started following the steps in this recipe I started having consistent successes. Inspect the completed logfile. This step is important (a couple of grep one-liners that can help with this check follow this list). If you see headings of the form “Node ##:” in the latter half of the file, you are probably distributing your parametric sweeps across the nodes. If you don’t see those headings, you are probably executing your cases one at a time, spread over the entire set of nodes (if you see this, make sure you did step 9 of Part 1 correctly). A third case is a pure error: you might see the Node ## headings, but find that COMSOL is distributing THE IDENTICAL CASE across all the nodes, i.e. solving the same problem 30 times (or whatever) in parallel. If you perform steps 1-9 of Part 1 correctly you should never see this problem.
  4. Open the output .mph file you have produced. I strongly recommend always having separate input and output files, so that you don’t overwrite your input file, which is now a valuable template for other jobs. Verify that the APT was produced and has the correct number of lines.
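
If you would rather not scan the whole logfile by eye, a couple of shell one-liners can speed up the check in step 3. They assume the headings really do appear as “Node ##:” and that each case’s parameter values are printed near its heading; adjust the patterns to match your logfiles.

    # Count the distinct "Node N:" headings; several distinct node numbers
    # suggest the parametric sweep really is being distributed.
    grep -Eo "Node [0-9]+:" quicktest.log | sort -u | wc -l

    # Show each heading with a couple of the lines that follow it, so you can spot
    # whether different nodes are (incorrectly) all solving the identical case.
    grep -E -A 2 "Node [0-9]+:" quicktest.log | less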

Recipe Part 3: Run the Full Problem on the Cluster

  1. If you did Part 2, you have now seen that your input .mph actually works on the cluster. Copy your input .mph to a new file, open it in the COMSOL GUI, expand the parametric sweep to its full ranges, and save the result. Modify “nodes” and “cores” in your .pbs file to accommodate the full run, taking into account the memory usage you saw in Part 2: e.g. if the jobs are tiny, you might want to use only 1 core per COMSOL node (a sketch of the adjusted script follows this list).
  2. The input file you have created and tested can be re-used for additional parametric sweeps on the cluster as long as the only thing that changes is the parameter values in the parametric sweep nodes. If you add, delete, or reorder parameters or probes, or you change the structure of your parametric sweep nodes, be sure to repeat this “conditioning” procedure. Once you are set up properly and have done it a few times, it only adds a few minutes to the beginning of a run.
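
As an illustration only, the full-run script usually differs from the quick-test sketch shown in Part 2 in just a few lines. The “-nn 60 -np 2” split below is the one mentioned in Comment 2; the walltime and file names are placeholders:

    #PBS -l nodes=15:ppn=8          # unchanged: 15 x 8 = 120 PBS cores
    #PBS -l walltime=12:00:00       # placeholder: raise the walltime for the full sweep

    # Keep (-nn) x (-np) = 120, but choose the split based on the per-case memory
    # you observed in the quick test (fewer cores per case if the cases are small).
    comsol batch -clustersimple -nn 60 -np 2 \
        -inputfile fullrun_in.mph \
        -outputfile fullrun_out.mph \
        -batchlog fullrun.log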

Comments:

  1. This “conditioning” procedure will NOT keep COMSOL 5.1 from mis-labeling columns in the accumulated probe table. In my .mph files COMSOL consistently reverses the column heading order of the inner-most parametric sweep node. If that node is sweeping over parameters a, b, c, the corresponding columns a, b, c in the APT will be mis-labeled c, b, a. Since we don’t know the origin of this problem, and it can be a big problem for data reduction codes that start from the accumulated probe table, the safest approach is to define “probes” which are just the parameters themselves, for those parameters you need in the APT. I have never seen a mis-labeled probe column in the APT.
  2. My “full” cluster problems specify nodes=15:ppn=8 in the .pbs script, for a total of 120 cores to work with. I have been able to run my COMSOL problems successfully with -nn 60 -np 2, but when I try -nn 120 -np 1, the entire job appears to execute successfully (for a couple of hours, about the time I would expect it to run if everything was working) but then fails at the end with this message:

    fig.5

  3. In that failed run, the accumulated probe table in the output .mph file had 64 lines in it instead of 7680. Unfortunately, this costs the full time of a calculation before you find out that it failed, and you will only see it on your big runs. I am mentioning this problem because it may be telling us there is an upper limit to how many nodes you can ask COMSOL to keep track of, although there is no mention of such a limit in the COMSOL documentation.

  4. We found that batch jobs will, more often than not, fail to access the floating network license. The workaround is to start up the GUI on the machine on which you want to run the batch jobs. The GUI never fails to access the floating network license. You don't need to do anything in the GUI: just start it up, observe that it opens (i.e. has accessed the license successfully), and then close it again. Your batch jobs will now run successfully on that machine.
