== Codebases ==

=== [https://github.com/cxcscmu/Lightning-Pretrain Lightning-Pretrain] ===

An LLM codebase built on Lit-GPT and PyTorch Lightning, especially useful for efficiently pre-training language models from scratch (a minimal training sketch appears under Example sketches at the end of this section).

'''What can it do:'''
* Pre-train state-of-the-art decoder-only models (Llama, Llama 2, Pythia, Vicuna, GPT-2, ...)
* Fine-tune using task-specific data
* Evaluate on the [https://github.com/EleutherAI/lm-evaluation-harness Language Model Evaluation Harness]

'''Pros:'''
* State-of-the-art distributed training strategies: DDP, FSDP, DeepSpeed
* Modern acceleration techniques: FlashAttention, fused Adam, mixed precision
* Parameter-efficient fine-tuning: Adapter, Adapter v2, LoRA, QLoRA, ...
* Large-scale evaluation datasets covering almost every common NLP task, with continuing updates
* Training speed comparable to Hugging Face, with better flexibility
* Relatively easy conversion of model weights from/to Hugging Face format via name mapping
* Detailed [https://github.com/cxcscmu/Lightning-Pretrain/tree/main/tutorials tutorials] for each use case, making it easy to get started

'''Cons:'''
* Does not support models with other architectures, such as T5 or BERT
* Supports fewer training datasets than Hugging Face; you may need to define the dataset class or preprocess the data yourself
* Still in development and requires everyone's effort to maintain

=== [https://github.com/cxcscmu/OpenMatch OpenMatch V2.0] ===

(Yu et al., 2022) [https://dl.acm.org/doi/abs/10.1145/3539618.3591813 Paper] | [https://github.com/OpenMatch/OpenMatch Github] | [https://openmatch.readthedocs.io/en/latest/ Docs]

A Python-based library for conducting Neural Information Retrieval (Neu-IR) research experiments. The library contains both neural and traditional IR modules, making it easy to run baseline experiments for comparison.

'''What can it do:'''
* ''Template-based Data Processing.'' Convenient templates for processing raw data; no need to reformat data to match the software's input format.
* ''Efficient Data Access.'' Integrated with Hugging Face Datasets, enabling access to large datasets with minimal memory overhead.
* ''Sharded Search.'' Implements two-stage sharded search, which avoids loading the whole corpus into memory at once (see the sharded-search sketch under Example sketches below).
* A sample of supported models: DPR, ANCE, T5, BERT, etc.

'''Requirements:'''
* PyTorch
* Hugging Face Datasets
* Transformers
* Faiss

A working directory of OpenMatch V2.0 exists at <code>/data/group_data/cx_group/OpenMatch</code> on the Babel server.
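=== Example sketches ===

The following is a minimal sketch of the decoder-only pre-training setup that Lightning-Pretrain builds on, written in plain PyTorch Lightning. It is illustrative only and does not use the Lightning-Pretrain API: the module, model, and dataloader names are placeholders, and the <code>Trainer</code> flags show how FSDP and bf16 mixed precision are enabled in Lightning 2.x (assuming a multi-GPU node).

<syntaxhighlight lang="python">
# Illustrative sketch only -- NOT the Lightning-Pretrain API.
# Shows a next-token-prediction training step and a Trainer configured
# with FSDP + bf16 mixed precision, as Lightning-Pretrain uses.
import torch
import lightning as L

class CausalLMModule(L.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model  # any decoder-only nn.Module: (B, T) -> (B, T, vocab)

    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]   # (B, T) token ids
        logits = self.model(input_ids)   # (B, T, vocab) next-token logits
        # Shift by one position so each token predicts its successor.
        loss = torch.nn.functional.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        # fused=True selects the fused Adam kernel (CUDA, PyTorch >= 2.0)
        return torch.optim.AdamW(self.parameters(), lr=3e-4, fused=True)

trainer = L.Trainer(
    strategy="fsdp",         # or "ddp" / "deepspeed"
    precision="bf16-mixed",  # mixed-precision training
    devices=8,               # assumes an 8-GPU node
    max_steps=100_000,
)
# trainer.fit(CausalLMModule(my_model), my_train_dataloader)  # placeholders
</syntaxhighlight>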
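For intuition on OpenMatch's two-stage sharded search, here is a conceptual sketch with Faiss and NumPy. It is not OpenMatch's implementation, and the function and variable names are hypothetical: stage one searches each corpus shard against its own temporary index, and stage two merges the per-shard top-k lists into a global ranking, so the full corpus never has to fit in a single index.

<syntaxhighlight lang="python">
# Conceptual sketch only -- NOT OpenMatch's actual implementation.
import numpy as np
import faiss

def sharded_search(query_emb, shard_embs, k=10):
    """query_emb: (nq, d) queries; shard_embs: list of (n_i, d) arrays.
    Assumes each shard holds at least k documents."""
    all_scores, all_ids = [], []
    offset = 0
    for emb in shard_embs:
        # Stage 1: build and search a per-shard index, then free it,
        # so only one shard is resident in memory at a time.
        index = faiss.IndexFlatIP(emb.shape[1])
        index.add(emb.astype(np.float32))
        scores, ids = index.search(query_emb.astype(np.float32), k)
        all_scores.append(scores)
        all_ids.append(ids + offset)  # shard-local ids -> global doc ids
        offset += emb.shape[0]
        del index
    # Stage 2: merge per-shard top-k lists into a global top-k per query.
    scores = np.concatenate(all_scores, axis=1)
    ids = np.concatenate(all_ids, axis=1)
    order = np.argsort(-scores, axis=1)[:, :k]
    return (np.take_along_axis(scores, order, axis=1),
            np.take_along_axis(ids, order, axis=1))
</syntaxhighlight>

In practice the shard embeddings would be memory-mapped or streamed from disk rather than held in a Python list; the list here just keeps the sketch self-contained.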