== Codebases ==

=== [https://github.com/cxcscmu/Lightning-Pretrain Lightning-Pretrain] ===

An LLM codebase built on Lit-GPT and PyTorch Lightning, especially useful for efficiently pre-training language models from scratch (a minimal training sketch appears under Example sketches at the end of this section).

'''What can it do:'''
* Pre-train state-of-the-art decoder-only models (Llama, Llama 2, Pythia, Vicuna, GPT-2, ...)
* Fine-tune using task-specific data
* Evaluate on the [https://github.com/EleutherAI/lm-evaluation-harness Language Model Evaluation Harness]

'''Pros:'''
* State-of-the-art distributed training strategies: DDP, FSDP, DeepSpeed
* Modern acceleration techniques: FlashAttention, fused Adam, mixed precision
* Parameter-efficient fine-tuning: Adapter, Adapter v2, LoRA, QLoRA, ...
* Large-scale evaluation datasets covering almost every common NLP task, with continuing updates
* Training speed comparable to Hugging Face, with better flexibility
* Relatively easy conversion of model weights from/to Hugging Face format via name mapping
* Detailed [https://github.com/cxcscmu/Lightning-Pretrain/tree/main/tutorials tutorials] for each use case, making it easy to get started

'''Cons:'''
* Does not support models with other architectures, such as T5 or BERT
* Supports fewer training datasets than Hugging Face; you may need to define the dataset class or preprocess the data yourself
* Still in development and requires everyone's effort to maintain

=== [https://github.com/cxcscmu/OpenMatch OpenMatch V2.0] ===

(Yu et al., 2022) [https://dl.acm.org/doi/abs/10.1145/3539618.3591813 Paper] | [https://github.com/OpenMatch/OpenMatch Github] | [https://openmatch.readthedocs.io/en/latest/ Docs]

A Python-based library for conducting Neural Information Retrieval (Neu-IR) research experiments. The library contains both neural and traditional IR modules, making it easy to run baseline experiments for comparison.

'''What can it do:'''
* ''Template-based Data Processing.'' Convenient templates for processing raw data; no need to reformat data to match the software's input format.
* ''Efficient Data Access.'' Integrated with Hugging Face Datasets, enabling access to large datasets with minimal memory overhead.
* ''Sharded Search.'' Implements two-stage sharded search, which avoids loading the whole corpus into memory at once (see the sharded-search sketch under Example sketches below).
* A sample of supported models: DPR, ANCE, T5, BERT, etc.

'''Requirements:'''
* PyTorch
* Hugging Face Datasets
* Transformers
* Faiss

A working directory of OpenMatch V2.0 exists at <code>/data/group_data/cx_group/OpenMatch</code> on the Babel server.
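=== Example sketches ===

The following is a minimal sketch of the decoder-only pre-training setup that Lightning-Pretrain builds on, written in plain PyTorch Lightning. It is illustrative only and does not use the Lightning-Pretrain API: the module, model, and dataloader names are placeholders, and the <code>Trainer</code> flags show how FSDP and bf16 mixed precision are enabled in Lightning 2.x (assuming a multi-GPU node).

<syntaxhighlight lang="python">
# Illustrative sketch only -- NOT the Lightning-Pretrain API.
# Shows a next-token-prediction training step and a Trainer configured
# with FSDP + bf16 mixed precision, as Lightning-Pretrain uses.
import torch
import lightning as L

class CausalLMModule(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model  # any decoder-only nn.Module: (B, T) -> (B, T, vocab)

    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]   # (B, T) token ids
        logits = self.model(input_ids)   # (B, T, vocab) next-token logits
        # Shift by one position so each token predicts its successor.
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # fused=True selects the fused Adam kernel (CUDA, PyTorch >= 2.0)
        return torch.optim.AdamW(self.parameters(), lr=3e-4, fused=True)

trainer = L.Trainer(
    strategy="fsdp",         # or "ddp" / "deepspeed"
    precision="bf16-mixed",  # mixed-precision training
    devices=8,               # assumes an 8-GPU node
    max_steps=100_000,
)
# trainer.fit(CausalLMModule(my_model), my_train_dataloader)  # placeholders
</syntaxhighlight>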
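For intuition on OpenMatch's two-stage sharded search, here is a conceptual sketch with Faiss and NumPy. It is not OpenMatch's implementation, and the function and variable names are hypothetical: stage one searches each corpus shard against its own temporary index, and stage two merges the per-shard top-k lists into a global ranking, so the full corpus never has to fit in a single index.

<syntaxhighlight lang="python">
# Conceptual sketch only -- NOT OpenMatch's actual implementation.
import numpy as np
import faiss

def sharded_search(query_emb, shard_embs, k=10):
    """query_emb: (nq, d) queries; shard_embs: list of (n_i, d) arrays.
    Assumes each shard holds at least k documents."""
    all_scores, all_ids = [], []
    offset = 0
    for emb in shard_embs:
        # Stage 1: build and search a per-shard index, then free it,
        # so only one shard is resident in memory at a time.
        index = faiss.IndexFlatIP(emb.shape[1])
        index.add(emb.astype(np.float32))
        scores, ids = index.search(query_emb.astype(np.float32), k)
        all_scores.append(scores)
        all_ids.append(ids + offset)  # shard-local ids -> global doc ids
        offset += emb.shape[0]
        del index
    # Stage 2: merge per-shard top-k lists into a global top-k per query.
    scores = np.concatenate(all_scores, axis=1)
    ids = np.concatenate(all_ids, axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]
    return (np.take_along_axis(scores, order, axis=1),
            np.take_along_axis(ids, order, axis=1))
</syntaxhighlight>

In practice the shard embeddings would be memory-mapped or streamed from disk rather than held in a Python list; the list here just keeps the sketch self-contained.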