FAQ: Difference between revisions

From CMU -- Language Technologies Institute -- HPC Wiki
Revision as of 12:16, 20 February 2025

FAQ on Babel Resource Allocation and Best Practices

Q: How are jobs prioritized by the scheduler?

Generally, all users have equal priority. The debug, general, and long partitions do not have user- or group-specific priority, but the partitions themselves are ranked in priority from high to low. User fairshare is also factored into scheduling. Some users may have higher-priority access: for example, research groups that have donated nodes may request a dedicated partition with priority access to (a subset of) those nodes, or a dedicated node reserved exclusively for their research group.

Q: Do you have advice for long-running jobs?

  1. Make sure your code saves checkpoints frequently so that it can recover from being preempted.
  2. Post on the #babel-babble Slack channel first to alert other users.
  3. Consider running on the `long` partition.
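
The checkpointing advice in item 1 can be sketched as follows. This is a minimal illustration, not a Babel-specific recipe: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the JSON file format, and the `stop_at` parameter (used only to simulate a preemption) are all assumptions made for the example. Real training code would save model and optimizer state instead of a running total.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it over the old checkpoint.

    The rename means an interrupted write leaves the previous
    checkpoint intact rather than a half-written file.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, num_steps, checkpoint_every=10, stop_at=None):
    """Resume from the last checkpoint and work up to num_steps.

    stop_at is hypothetical: it simulates a preemption by returning
    early without saving.
    """
    state = load_checkpoint(path)
    for step in range(state["step"], num_steps):
        if stop_at is not None and step == stop_at:
            return state  # simulated preemption: unsaved progress is lost
        state["total"] += step  # stand-in for one unit of real work
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

When a run is interrupted, calling `run` again with the same path picks up from the last saved step, so only the work done since the most recent checkpoint is repeated. A smaller `checkpoint_every` loses less work per preemption at the cost of more frequent writes.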

Q: What should I do if I notice another user's jobs/files are disrupting usage of the cluster for others?

Please message the babble-babel channel, tagging the user with the problematic job as well as @help-babel. Remember to communicate with respect; most errors are honest mistakes.

Q: I have other questions which aren't answered here.

Reach out on the babble-babel Slack channel, tagging @help-babel. If you discover an answer which may be useful to others, please