Tuesday, August 13, 2013

WRF Job Submission on Janus

It's a poorly kept secret that HPC jobs with smaller wall time requests tend to spend less time waiting in the queue. As a result, if I have a WRF job that I expect will take about 16 hours to complete, I can probably get it done sooner by submitting four 4-hour jobs rather than one 16-hour job. Scale that approach up to longer runs and you can end up with a lot of jobs (I've run 100+ job submissions before). So I wrote a few scripts, which others may find useful, to automate the submission of multi-part WRF jobs.

Click here to download the Janus job submitter (tar file)

Assuming you follow the suggested WRF folder structure, you should extract the tar file into your base WRF directory (where your WRFV3.X and possibly WPSV3.X folders are located). Inside the resulting WRF/jobs folder, you will find the following:
  • long_nl - a folder that contains the namelist files for the long run. In here, you will find a template namelist file for the LES tutorial case. You should replace the namelist.temp file with one of your own that contains the settings you want for your job (you don't need to change the time settings though; that comes next)
  • gen_namelists - an interactive Fortran program that reads in the namelist.temp file and splits it into user-specified time increments. For example, if I have a six hour run that I split into two-hour increments, this program will create namelist.01, namelist.02, and namelist.03 files.
  • new_project.sh - a short script that creates the folder structure necessary to run the other scripts. Your job output and settings will be stored in these folders.
  • run_wrf_long.temp - a placeholder script. This file needs to be here, but you shouldn't need to edit it (unless you want to!)
  • start_wrf_long.sh - this script submits your split WRF jobs to the Moab scheduler. The necessary command line arguments are described at the top of the script.
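To make the splitting step concrete, here is a minimal Python sketch of the kind of bookkeeping gen_namelists performs. It is not the Fortran program itself: the function name split_run and the start date are hypothetical, and it only computes the time window and namelist file name for each segment rather than writing any files.

```python
from datetime import datetime, timedelta


def split_run(start, total_hours, chunk_hours):
    """Return (namelist_name, segment_start, segment_end) for each segment.

    Hypothetical helper for illustration; the real gen_namelists program
    may handle naming and time settings differently.
    """
    if total_hours <= 0 or chunk_hours <= 0:
        raise ValueError("total_hours and chunk_hours must be positive")

    end = start + timedelta(hours=total_hours)
    segments = []
    seg_start = start
    index = 1
    while seg_start < end:
        # The last segment is shortened if the run does not divide evenly.
        seg_end = min(seg_start + timedelta(hours=chunk_hours), end)
        segments.append((f"namelist.{index:02d}", seg_start, seg_end))
        seg_start = seg_end
        index += 1
    return segments


if __name__ == "__main__":
    # Six-hour run split into two-hour increments (start date is made up).
    for name, seg_start, seg_end in split_run(datetime(2013, 8, 13, 0), 6, 2):
        print(name, seg_start.isoformat(), "->", seg_end.isoformat())
```

For the six-hour example this yields three segments named namelist.01, namelist.02, and namelist.03, each covering two hours of simulation time.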
So let's say I wanted to start a PBL comparison project and the first WRF run would test the MYJ configuration. I'd first create the project folders.

./new_project.sh PBL_TESTS

Then I'd customize the WRF namelist to my liking and place it inside the long_nl folder as namelist.temp. Then I'd run gen_namelists and specify how I want the run split temporally. After generating the run segment namelist files, I'd be ready to submit the job(s).

./start_wrf_long.sh 1 3 PBL_TESTS MYJ_TEST 4 2 janus-small

That command would submit 3 dependent jobs. It would create a run called MYJ_TEST inside of the PBL_TESTS project folders. Each job would use 4 nodes (12 cores per node, for a total of 48 cores) and request 2 hours of wall time. Finally, the jobs would be submitted to the janus-small queue. The script also takes an optional argument that sets the allocation you want to charge the wall time to. If you don't specify one, your default allocation will be used.
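For readers curious how a chain of dependent jobs can be expressed, the following is a rough Python sketch, not the contents of start_wrf_long.sh. It assumes Moab's msub accepts -q for the queue, -A for the allocation, and -l for resource requests including depend=afterok:<jobid>; check your site's documentation before relying on any of these. The segment script names and the submit_chain and msub_args helpers are hypothetical.

```python
import subprocess

CORES_PER_NODE = 12  # per the post: 12 cores per node on Janus


def msub_args(script, nodes, wall_hours, queue, after=None, account=None):
    """Build the msub argument list for one segment (flags are assumptions)."""
    args = [
        "msub",
        "-q", queue,
        "-l", f"nodes={nodes}:ppn={CORES_PER_NODE}",
        "-l", f"walltime={wall_hours:02d}:00:00",
    ]
    if account:
        args += ["-A", account]
    if after:
        # Only start this job if the previous one finished successfully.
        args += ["-l", f"depend=afterok:{after}"]
    return args + [script]


def run_msub(args):
    """Submit for real and return the job ID that msub prints to stdout."""
    result = subprocess.run(args, check=True, capture_output=True, text=True)
    return result.stdout.strip()


def submit_chain(scripts, nodes, wall_hours, queue, account=None, submit=run_msub):
    """Submit each script so that it depends on the one before it."""
    job_ids = []
    previous = None
    for script in scripts:
        previous = submit(msub_args(script, nodes, wall_hours, queue, previous, account))
        job_ids.append(previous)
    return job_ids


if __name__ == "__main__":
    # Dry run: print the commands instead of calling msub.
    counter = iter(range(1, 1000))

    def dry_run(args):
        print(" ".join(args))
        return f"job{next(counter)}"  # placeholder ID, not a real Moab job ID

    scripts = ["segment_01.sh", "segment_02.sh", "segment_03.sh"]  # hypothetical
    submit_chain(scripts, nodes=4, wall_hours=2, queue="janus-small", submit=dry_run)
```

Passing the submitter in as a function keeps the chaining logic testable on a machine without a scheduler; on the cluster the default run_msub would be used instead.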

Feel free to give it a try and let me know if anything is unclear or you encounter any problems. The script is designed to cancel subsequent jobs automatically if any one job fails, but I haven't tested that part of it on Janus yet. I hope you find it useful!
