Latest revision as of 12:49, 20 February 2025
FAQ on Babel Resource Allocation and Best Practices
Q: How are jobs prioritized by the scheduler?
Generally, all users have equal priority. The debug, general, and long partitions do not have user- or group-specific priority, but the partitions themselves are ranked in priority from high to low. User fairshare is also factored into scheduling. In some cases, certain users may have higher-priority access; for example, research groups who have donated nodes may request dedicated partitions for priority access to (a subset of) those nodes, or may request a dedicated node reserved exclusively for their research group.
Q: Do you have advice for long-running jobs?
- Make sure your code saves checkpoints frequently so that it can recover from being preempted.
- Post on the #babel-babble Slack channel first to alert other users.
- Consider running on the `long` partition.
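The checkpointing advice above can be sketched as follows. This is a minimal illustration, not a Babel-specific recipe: the function names, the JSON state format, and the `interrupt_at` parameter (which only simulates a preemption) are all assumptions made for the example. A real training job would save model and optimizer state with its framework's own tools.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a preemption mid-write cannot corrupt the file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}


def run(path, num_steps, checkpoint_every=10, interrupt_at=None):
    """Resume from the last checkpoint and work until num_steps is reached.

    interrupt_at is hypothetical: it simulates a preemption by returning
    early without saving, so work since the last checkpoint is lost.
    """
    state = load_checkpoint(path)
    while state["step"] < num_steps:
        if interrupt_at is not None and state["step"] >= interrupt_at:
            return state
        state["total"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state
```

Calling `run` again with the same path after an interruption picks up from the last saved step, so at most `checkpoint_every` steps of work are repeated.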
Q: What should I do if I notice another user's jobs/files are disrupting usage of the cluster for others?
Please message the babble-babel channel, tagging the user with the problematic job as well as @help-babel. Remember to communicate with respect; most errors are honest mistakes.
Q: What should I do if a model requires more compute resources?
Try requesting more GPUs when you start the shell session on your assigned compute node.
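As a rough illustration of requesting more GPUs for an interactive session, the helper below assembles an `srun` command line. The `--partition`, `--gres`, and `--pty` options are standard Slurm flags, but the helper name, the default partition, and whether Babel expects `--gres=gpu:N` rather than another form are assumptions; check the cluster's own documentation for limits and preferred syntax.

```python
import shlex


def build_srun_command(num_gpus, partition="general", shell="bash"):
    """Build an interactive srun command requesting num_gpus GPUs.

    Hypothetical helper: the default partition name is taken from this FAQ,
    and the GPU request form (--gres=gpu:N) is an assumption about the site.
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    return [
        "srun",
        f"--partition={partition}",
        f"--gres=gpu:{num_gpus}",
        "--pty",
        shell,
    ]


if __name__ == "__main__":
    # Prints: srun --partition=general --gres=gpu:2 --pty bash
    print(shlex.join(build_srun_command(2)))
```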
Q: What does it mean if I get an error message saying 'Unable to contact Slurm controller'?
The Slurm client commands on your node cannot reach the scheduler's controller, which usually points to a cluster-side problem rather than something in your job. Contact the system administrators to resolve it.
Q: How does this relate to front-end and back-end development?
Deploy both the front-end and back-end servers on the same compute node to avoid port forwarding issues.
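To show why co-locating the two servers helps, here is a small sketch in which a stand-in back-end listens on the loopback address and a stand-in front-end calls it directly, with no port forwarding involved. The handler, the `/health` path, and the function names are hypothetical and exist only for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class BackendHandler(BaseHTTPRequestHandler):
    """Stand-in back-end that answers every GET with a small JSON body."""

    def do_GET(self):
        body = json.dumps({"status": "ok"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def start_backend(host="127.0.0.1", port=0):
    """Start the back-end on the loopback interface; port 0 picks a free port."""
    server = HTTPServer((host, port), BackendHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def frontend_fetch(port):
    """Stand-in front-end call: reach the back-end via localhost on the same node."""
    # An empty ProxyHandler bypasses any HTTP proxy configured in the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(f"http://127.0.0.1:{port}/health", timeout=5) as resp:
        return json.loads(resp.read())
```

Because both processes share the node, the front-end only needs the back-end's local port; if they ran on different nodes, that address would have to be replaced by a forwarded or routable one.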
Q: I have other questions which aren't answered here.
Reach out on the babble-babel Slack channel, tagging in @help-babel. If you discover an answer which may be useful to others, please feel free to add to this FAQ.