Run ACCESS-OM2
Prerequisites
General prerequisites
Before running ACCESS-OM2, you need to fulfil the general prerequisites outlined in the First Steps section.
If you are unsure whether ACCESS-OM2 is the right choice for your experiment, take a look at the overview of ACCESS Models.
Note
In this documentation the same code is sometimes shown both in a highlighted code block and in
a simulated terminal. The code block makes it easy to copy the code example to
your clipboard (mouse over the code block and click the icon on the far right of the code block).
The simulated terminal illustrates what happens when the commands are run in a terminal on Gadi.
Model-specific prerequisites
Join the vk83 and qv56 projects at NCI
To join these projects, request membership on the respective vk83 and qv56 NCI project pages.
For more information on how to join specific NCI projects, please refer to How to connect to a project.
Payu
Payu is a workflow management tool for running numerical models in supercomputing environments. It has extensive documentation.
Payu on Gadi is available through a dedicated conda environment in the vk83 project.
After joining the vk83 project, load the payu environment:
module use /g/data/vk83/modules
module load payu
To check that payu is available, run:
payu --version
Note: payu version >=1.1.3 is required.
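The minimum-version requirement can be checked with a small helper. This is only a sketch: `version_ge` is a hypothetical function name, it relies on `sort -V` from GNU coreutils, and it expects a bare version number, so the number reported by `payu --version` has to be passed in by hand.

```shell
# Return success if version $1 is greater than or equal to version $2.
# Relies on `sort -V` (version sort) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage: pass in the number reported by `payu --version`.
# version_ge "1.1.5" "1.1.3" && echo "payu is recent enough"
```

A version sort is used rather than a plain string comparison so that, for example, 1.10.0 is treated as newer than 1.9.9.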
Get ACCESS-OM2 configuration
All released ACCESS-OM2 configurations are available from the ACCESS-OM2 configs GitHub repository. Released configurations are tested and supported by ACCESS-NRI. ACCESS-NRI has adapted these model configurations from those originally developed by COSIMA.
There are global configurations for three resolutions: 1°, 0.25° and 0.1°. For each resolution there are two options of atmospheric forcing: Repeat Year Forcing (RYF) and Interannual Forcing (IAF). Each configuration also has a biogeochemical (BGC) variant, if required. Note that BGC experiments are slower and so consume more resources: more compute time and generally also more disk space.
Each configuration is stored as a separate specially named branch in the ACCESS-OM2 configs GitHub repository. Anyone using a configuration is advised to clone only a single branch and not attempt to keep this structure. The ACCESS-OM2 configs repo has more information about the available experiments and the naming scheme of the branches.
The first step is to choose a configuration from those available. For example, if the 1° horizontal resolution configuration with repeat-year JRA55 forcing (without BGC) is required, then the release-1deg_jra55_ryf branch is the correct choice.
The next step is to clone this branch to a location on Gadi:
In the example below, the payu clone command clones the 1° repeat-year JRA55 configuration (-B release-1deg_jra55_ryf) to a new experiment branch (-b expt) in a directory named 1deg_jra55_ryf:
mkdir -p ~/access-om2
cd ~/access-om2
payu clone -b expt -B release-1deg_jra55_ryf https://github.com/ACCESS-NRI/access-om2-configs.git 1deg_jra55_ryf
Note
payu uses branches to differentiate between different experiments in the same git repository. It is therefore recommended to always change the name of the branch you clone into to something meaningful for the planned experiment. See the payu tutorial for more information.
Running an ACCESS-OM2 configuration
ACCESS-OM2 configurations run on Gadi through a PBS job submission managed by payu.
The general layout of a payu-supported model run consists of two main directories:
- The control directory contains the model configuration and serves as the execution directory for running the model (in this example, the cloned directory ~/access-om2/1deg_jra55_ryf).
- The laboratory directory, where all the model components reside. For ACCESS-OM2, it is typically /scratch/$PROJECT/$USER/access-om2.
This separates the small text configuration files from the larger binary outputs and inputs. In this way, the control directory can be in the $HOME directory (as it is the only filesystem actively backed up on Gadi). The quotas for $HOME are low and strict, which limits what can be stored there, so it is not suitable for larger files.
The laboratory directory is a shared space for all payu experiments using the same model.
Within the laboratory directory are two subdirectories within which payu automatically creates directories named uniquely for the experiment being run:
- work → a temporary directory is created within here when the model is run. It gets created as part of a run and then removed after the run succeeds.
- archive → the directory within which the output is stored following each successful run.
payu creates symbolic links in the control directory called archive and work that point to the corresponding directories in the laboratory directory.
This design allows multiple self-resubmitting experiments that share common executables and input data to be run simultaneously.
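To see where the archive and work links in a control directory actually point, something like the following can be used. It is a minimal sketch: `show_lab_links` is a hypothetical helper name, and it only reads the links with the standard `readlink` command.

```shell
# Print where the archive and work links in a control directory point.
# $1: path to the control directory (defaults to the current directory).
show_lab_links() {
    control_dir="${1:-.}"
    for name in archive work; do
        if [ -L "$control_dir/$name" ]; then
            printf '%s -> %s\n' "$name" "$(readlink "$control_dir/$name")"
        else
            printf '%s: no link present\n' "$name"
        fi
    done
}

# show_lab_links ~/access-om2/1deg_jra55_ryf
```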
Run configuration
To run a configuration:
payu run
This will submit a single job to the queue with a run length of restart_period. restart_period is defined in the accessom2.nml file in the control directory.
Note
You can add the -f option to payu run and it will run even if there is an existing non-empty work directory created from a previous failed run or from running payu setup.
Run configuration multiple times
If you want to run a configuration multiple times automatically, use the -n option:
payu run -n <number-of-runs>
This will run number-of-runs times with a total length of restart_period * number-of-runs, where restart_period is the length of each model run.
For example, to run a configuration for a total of 50 years with a restart_period of 5 years, number-of-runs should be set to 10:
payu run -n 10
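The arithmetic behind the example above can be sketched in shell. `number_of_runs` is a hypothetical helper, not part of payu, and it assumes both lengths are whole numbers of years; it rounds up so the total is never shorter than requested.

```shell
# Number of runs needed to cover a total experiment length, rounding up.
# $1: total length in years, $2: restart_period in years.
number_of_runs() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# number_of_runs 50 5                     # prints 10
# payu run -n "$(number_of_runs 50 5)"
```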
Note
restart_period is the configuration option that sets how long the model will run. See how to change run length for a description of where this is set and how to change it.
Monitor ACCESS-OM2 runs
payu run reports the PBS job-ID, e.g. 110020843.gadi-pbs, as the last line to the terminal. qstat can be used to query the status of the job, e.g.
qstat 110020843.gadi-pbs
To show the status of all your submitted PBS jobs:
qstat -u $USER
The default name of your job is the name of the payu control directory, in this example 1deg_jra55_ryf. This can be changed by altering the jobname set in config.yaml.
S indicates the status of your run, where:
- Q → Job waiting in the queue to start
- R → Job running
- E → Job ending
- H → Job on hold
If there are no jobs listed with your jobname (or if no job is listed), your run either successfully completed or was terminated due to an error.
A job can be on hold for a number of reasons; see the NCI documentation for more information.
PBS output files
When the model completes, PBS writes the standard output and error streams to two files in the control directory: jobname.o<job-ID> and jobname.e<job-ID> respectively. These contain terminal output that isn't otherwise redirected into model log files.
You can archive these files using payu sweep, which moves them to the archive directory.
Stop a run
If you want to manually terminate a run, you can do so by executing:
qdel job-ID
This will kill the current job without waiting for it to complete. If you have used the -n option (e.g., payu run -n) but subsequently decide not to keep running after the current process completes, you can create a file called stop_run in the control directory, and this will prevent payu from submitting another job.
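Creating the stop_run file is a single touch command. The wrapper below is a small sketch; `request_stop` is a hypothetical name, and the control directory path in the usage line is the one used in this example.

```shell
# Ask payu not to submit further jobs after the current one finishes by
# creating an empty stop_run file in the control directory.
# $1: path to the control directory (defaults to the current directory).
request_stop() {
    touch "${1:-.}/stop_run"
}

# request_stop ~/access-om2/1deg_jra55_ryf
```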
Error and output log files
While the model is running, payu saves the model standard output and standard error in the access-om2.out and access-om2.err log files in the control directory. You can examine the contents of these files to check on the status of a run as it progresses.
At the end of a successful run these log files are archived to the archive directory and will not be found in the control directory. If they remain in the control directory after the PBS job for a run has completed, this is an indication the run has failed.
Did the model run correctly?
To determine if a model has run correctly, it must first be established that it has finished. The qstat commands above and the presence of PBS log files should be used to determine if the PBS job has ended.
If the model did not run to completion correctly, the following will still be in the control directory:
work/
access-om2.err
access-om2.out
This is because payu will only run the archive step when the model runs without error.
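The check described above can be scripted. The sketch below uses a hypothetical helper, `run_leftovers`, and is only meaningful once the PBS job has ended, because the same files are also present while a run is in progress.

```shell
# List the leftovers that indicate an unsuccessful run.
# $1: path to the control directory (defaults to the current directory).
# Returns 0 (and prints each leftover) if any are found, 1 otherwise.
run_leftovers() {
    control_dir="${1:-.}"
    found=1
    for name in work access-om2.err access-om2.out; do
        # -e follows the work symlink; -L also catches a dangling link
        if [ -e "$control_dir/$name" ] || [ -L "$control_dir/$name" ]; then
            echo "$name"
            found=0
        fi
    done
    return "$found"
}

# run_leftovers ~/access-om2/1deg_jra55_ryf && echo "run did not complete"
```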
ACCESS-OM2 outputs
At the end of a successful model run, output files, restart files and log files are moved from the work directory to the archive directory. A symbolic link in the control directory to a directory in the laboratory (/scratch/$PROJECT/$USER/access-om2/archive) is provided for convenience.
If a model run is unsuccessful, the work directory is left untouched to facilitate "run forensics" to determine the cause of the model failure.
Outputs and restarts are stored in subfolders within the archive, subdivided for each run of the model.
Output folders are named outputXXX and restart folders restartXXX, where XXX is the run number starting from 000.
Model components are separated into subdirectories within the output and restart directories.
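Given the outputXXX naming described above, the most recent output directory can be found with a simple sort. This is a sketch only: `latest_output` is a hypothetical helper, and it assumes three-digit, zero-padded run numbers as shown in this section.

```shell
# Print the most recent outputXXX directory in an experiment's archive.
# $1: path to the archive directory (defaults to ./archive).
latest_output() {
    archive_dir="${1:-archive}"
    # Zero-padded run numbers sort correctly as plain text.
    ls -d "$archive_dir"/output[0-9][0-9][0-9] 2>/dev/null | sort | tail -n 1
}

# Run from the control directory, where archive is a symbolic link:
# ls "$(latest_output archive)"
```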
Model Live Diagnostics
ACCESS-NRI developed the Model Live Diagnostics framework to check, monitor, visualise, and evaluate model behaviour and progress of ACCESS models currently running on Gadi.
For complete documentation on how to use this framework, check the Model Diagnostics documentation.
Trouble-shooting
If payu doesn't run correctly for some reason, a good first step, from within the control directory, is to run:
payu setup
This will prepare the model run: create the ephemeral work directory based on the experiment configuration, generate manifests and report some useful information to the user, such as the location of the laboratory where the work and archive directories are located.
This can help to isolate issues such as permissions problems accessing files and directories, missing files or malformed/incorrect paths.
Note
By default payu run will not proceed if there is an existing work directory. So after payu setup, either run payu sweep before attempting to run the configuration, or use payu run -f.
Modifying an ACCESS-OM2 configuration
Once you are comfortable cloning and running an existing configuration, you will often want to modify it to suit your science goals.
This section describes the model configuration and how to modify it.
Modifications can be to the way the model is run by payu, or can change the way specific model components are configured, or the coupling between them. Sometimes changes are required to both, if the model component changes require a change to the resources needed for the model to complete.
The config.yaml file, located in the control directory, is the master configuration file, which controls the general model configuration. It contains several parts: some are likely to need modification, while others are rarely changed without significant understanding of how the model is configured.
Change run length
One of the most common changes is to adjust the duration of the model run. For example, when debugging changes to a model, it is common to reduce the run length to minimise resource consumption and return faster feedback on changes.
To change the run length, edit the restart_period field in the &date_manager_nml section of the ~/access-om2/1deg_jra55_ryf/accessom2.nml file:
&date_manager_nml
forcing_start_date = '1958-01-01T00:00:00'
forcing_end_date = '2019-01-01T00:00:00'
! Runtime for a single segment/job/submit, format is years, months, seconds,
! two of which must be zero.
restart_period = 5, 0, 0
&end
For example, to make the model run for only 1 month, change restart_period to:
restart_period = 0, 1, 0
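The same edit can be made from the command line instead of a text editor. The following is a sketch under stated assumptions: `set_restart_period` is a hypothetical helper built on standard `sed`, and it assumes the setting sits on a single line of the form shown above. It leaves a copy of the original file with a .bak suffix, which you may want to delete afterwards so it is not left lying around in the control directory.

```shell
# Set restart_period in a namelist file.
# $1: path to accessom2.nml, $2: years, $3: months, $4: seconds.
# A copy of the original is kept alongside with a .bak suffix.
set_restart_period() {
    sed -i.bak \
        "s/^\([[:space:]]*restart_period[[:space:]]*=\).*/\1 $2, $3, $4/" \
        "$1"
}

# set_restart_period ~/access-om2/1deg_jra55_ryf/accessom2.nml 0 1 0
```

The pattern is anchored at the start of the line, so the comment lines beginning with ! are left unchanged.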
Note
While restart_period can be reduced, it should not be increased to more than 5 years to avoid errors. For more information about the difference between run length and total experiment length, or how to run the model for more than 5 years, refer to the section Run configuration multiple times.
Modify PBS resources
If the model has been altered and needs more time or memory to complete, or you want to change which queue it uses, this is the part of config.yaml to modify:
# If submitting to a different project to your default, uncomment line below
# and replace PROJECT_CODE with appropriate code. This may require setting shortpath
# project: PROJECT_CODE
# Force payu to always find, and save, files in this scratch project directory
# shortpath: /scratch/PROJECT_CODE
queue: normal
walltime: 3:00:00
jobname: 1deg_jra55_ryf
mem: 1000GB
These lines can be edited to change the PBS directives for the PBS job.
For example, to run ACCESS-OM2 under the ol01 project (COSIMA Working Group), uncomment the line beginning with # project by deleting the # symbol and replace PROJECT_CODE with ol01:
project: ol01
If other compute projects will be used to run a configuration, then the shortpath line will also need to be uncommented and the path to the desired /scratch/PROJECT_CODE added. Doing this will make sure the same /scratch location is used for the laboratory regardless of which project is used to run the experiment.
Note
To run ACCESS-OM2, you need to be a member of a project with allocated Service Units (SU). For more information, check how to join relevant NCI projects.
Syncing output data
As discussed above, the laboratory directory is typically on ephemeral /scratch storage, where files are regularly deleted once they have not been accessed for a period of time. For this reason climate model outputs need to be moved to a location with longer-term storage. On Gadi this is typically under a project code on /g/data.
payu has built-in support to sync outputs, restarts and a copy of the control directory git history to another location. To do this, modify the section of config.yaml shown below: change enable to True, and set path to a location on /g/data.
# Sync options for automatically copying data from ephemeral scratch space to
# longer term storage
sync:
enable: False # set path below and change to true
path: none # Set to location on /g/data or a remote server and path (rsync syntax)
exclude:
- '*.nc.*'
- 'iceh.????-??-??.nc'
Saving model restarts
The model outputs restart files after every run so the model can then run again from the saved model state.
Restart files can occupy a significant amount of disk space, and it isn't necessary to be able to restart the model at every point where the model was stopped during a run. The restart_freq option specifies a strategy for which restart files are retained.
This can either be a number, which retains every nth numbered restart, or a pandas-style date-time frequency alias. For example, to preserve the ability to restart the model every 50 model run years:
restart_freq: 50Y
See the documentation for more detail.
Other rarely modified configuration options
Model configuration
This tells payu which driver to use for the main model configuration (access-om2) and the location of all inputs that are common to all the component models, or submodels.
name: common
model: access-om2
input: /g/data/ik11/inputs/access-om2/input_20201102/common_1deg_jra55
The name field here is not actually used for the configuration run, so you can safely ignore it.
Submodels
ACCESS-OM2 is a coupled model deploying multiple submodels (i.e. model components).
This section specifies the submodels and configuration options required to execute the model correctly.
Each submodel contains additional configuration options that are read in when the submodel is running. These options are specified in the subfolder of the control directory whose name matches the submodel's name (e.g., configuration options for the ocean submodel are in the ~/access-om2/1deg_jra55_ryf/ocean directory).
submodels:
- name: atmosphere
model: yatm
exe: /g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/libaccessom2-git.2023.10.26=2023.10.26-ieiy3e7hidn4dzaqly3ly2yu45mecgq4/bin/yatm.exe
input:
- /g/data/vk83/experiments/inputs/access-om2/remapping_weights/JRA55/global.1deg/2020.05.30/rmp_jrar_to_cict_CONSERV.nc
- /g/data/vk83/experiments/inputs/JRA-55/RYF/v1-4/data
ncpus: 1
- name: ocean
model: mom
exe: /g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/mom5-git.2023.11.09=2023.11.09-ewcdbrfukblyjxpkhd3mfkj4yxfolal4/bin/fms_ACCESS-OM.x
input:
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/grid_spec.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_hgrid.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/mosaic/global.1deg/2020.05.30/ocean_mosaic.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/bathymetry/global.1deg/2020.10.22/topog.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/bathymetry/global.1deg/2020.10.22/ocean_mask.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/grids/vertical/global.1deg/2020.10.22/ocean_vgrid.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/processor_masks/global.1deg/216.16x15/2020.05.30/ocean_mask_table
- /g/data/vk83/experiments/inputs/access-om2/ocean/chlorophyll/global.1deg/2020.05.30/chl.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/initial_conditions/global.1deg/2020.10.22/ocean_temp_salt.res.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/tides/global.1deg/2020.05.30/tideamp.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/tides/global.1deg/2020.05.30/roughness_amp.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/tides/global.1deg/2020.05.30/roughness_cdbot.nc
- /g/data/vk83/experiments/inputs/access-om2/ocean/surface_salt_restoring/global.1deg/2020.05.30/salt_sfc_restore.nc
ncpus: 216
- name: ice
model: cice5
exe: /g/data/vk83/apps/spack/0.20/release/linux-rocky8-x86_64/intel-19.0.5.281/cice5-git.2023.10.19=2023.10.19-rh3xfkrgajya3ghtliacuhlx3pgvrzqs/bin/cice_auscom_360x300_24x1_24p.exe
input:
- /g/data/vk83/experiments/inputs/access-om2/ice/grids/global.1deg/2020.05.30/grid.nc
- /g/data/vk83/experiments/inputs/access-om2/ice/grids/global.1deg/2020.10.22/kmt.nc
- /g/data/vk83/experiments/inputs/access-om2/ice/initial_conditions/global.1deg/2020.05.30/i2o.nc
- /g/data/vk83/experiments/inputs/access-om2/ice/initial_conditions/global.1deg/2020.05.30/o2i.nc
- /g/data/vk83/experiments/inputs/access-om2/ice/initial_conditions/global.1deg/2020.05.30/u_star.nc
- /g/data/vk83/experiments/inputs/access-om2/ice/initial_conditions/global.1deg/2020.05.30/monthly_sstsss.nc
ncpus: 24
Collation
The MOM model typically outputs model diagnostics as tiles: rather than writing a single file, each diagnostic is saved as a number of smaller tiles, each of which contains a part of the model grid.
The collate process combines a number of these smaller files into a single output file. Restart files are typically tiled in the same way and will also be combined if the restart option is set to true.
collate:
restart: true
walltime: 1:00:00
mem: 30GB
ncpus: 4
queue: normal
exe: /g/data/ik11/inputs/access-om2/bin/mppnccombine
- runlog
runlog: true
payu commits changes in the control directory with git if runlog is set to true.
This should not be changed as it is an essential part of the provenance of an experiment. payu
updates the manifest files for every run, and relies on runlog
to save this information in the git
history so there is a record of all inputs, restarts and executables used in an experiment.
- Platform-specific defaults
platform:
nodesize: 48
Set platform-specific default parameters. In this case it sets the default number of CPUs per node to 48. This might need changing if the configuration is run on hardware with a different node size.
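As a rough illustration of how nodesize relates to the submodel ncpus values shown earlier, the sketch below counts how many whole nodes the 1° configuration's CPUs would fill. `nodes_needed` is a hypothetical helper, and this is only arithmetic on the values in this section; it does not describe how payu itself sizes the PBS request.

```shell
# Whole nodes needed for a given CPU count, rounding up.
# $1: total CPUs, $2: CPUs per node (nodesize).
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

# Submodel ncpus from the 1° configuration: atmosphere 1, ocean 216, ice 24.
total_cpus=$(( 1 + 216 + 24 ))

# nodes_needed "$total_cpus" 48   # prints 6
```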
- User scripts
userscripts:
error: resub.sh
run: rm -f resubmit.count
A dictionary of scripts or subcommands to run at various stages of a payu submission.
error gets called if the model does not run correctly and returns an error code. run gets called after the model execution, but prior to the model output being archived.
Miscellaneous
The rest of the configuration settings should never need changing: stacksize, mpirun, qsub_flags and env.
stacksize: unlimited
mpirun: --mca io ompio --mca io_ompio_num_aggregators 1
qsub_flags: -W umask=027
env:
UCX_LOG_LEVEL: 'error'
To find out more about other configuration settings for the config.yaml file, refer to how to configure your experiment with payu.
Edit a single ACCESS-OM2 component configuration
Each of the model components contains configuration options specific to that model that are read in when the model component is running. These options are typically useful to modify the physics used in the model, the input data or the model variables saved in the output files.
These configuration options are specified in files in a subfolder of the control directory, named the same as the submodel's name in the submodels section of config.yaml (e.g., configuration options for the ocean submodel are in the ~/access-om2/1deg_jra55_ryf/ocean directory).
To modify these options please refer to the User Guide of each individual model component.