
== Building from open source ==
This section formalizes the process of building from open-source repositories.

=== Incremental validation ===
Incremental validation of an open-source repository is highly recommended: it confirms that each component performs as expected before you build on it.

* Evaluate the performance of published outputs
** Quickly evaluate any published outputs and check whether performance agrees with what the authors claim.
** For example, an “output” refers to the predicted rank in a retrieval task on a dev set.
* Evaluate performance on published artefacts
** For example, if trained embeddings are available, validate their performance by running a prediction step with the downloaded trained model.
** Such artefacts are unlikely to be available due to large file sizes (see: inference step).
* Run the inference (forward) pass
** This yields predictions on the downstream task.
** Follow the open-source instructions for this step.
** Check whether performance agrees with the reported claims.
* Run training 
** Required if you are further fine-tuning the model (or parts thereof).
** Follow up with an inference pass to evaluate the newly trained model.
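
The first check above (comparing published outputs against a claimed number) can be sketched as follows. This is a minimal illustration, not the procedure of any particular repository: it assumes the published outputs have already been reduced to one 1-based rank per dev query, that the reported metric is mean reciprocal rank with a cutoff, and that a small tolerance is acceptable. The function names, the cutoff, and the tolerance are all hypothetical.

```python
def mean_reciprocal_rank(ranks, cutoff=10):
    """Compute MRR@cutoff.

    ranks: one entry per query, the 1-based rank of the first relevant
    item, or None if no relevant item was retrieved (assumed format).
    """
    total = 0.0
    for rank in ranks:
        # Queries with no hit, or a hit below the cutoff, contribute 0.
        if rank is not None and rank <= cutoff:
            total += 1.0 / rank
    return total / len(ranks)


def agrees_with_claim(measured, claimed, tolerance=0.005):
    """Return True if the measured metric is within tolerance of the claim."""
    return abs(measured - claimed) <= tolerance


# Toy ranks for five dev queries (made-up values, for illustration only).
example_ranks = [1, 2, None, 4, 1]
example_mrr = mean_reciprocal_rank(example_ranks)
```

The same comparison can be repeated after each later stage (published artefacts, your own inference pass, your own training run), so that the first stage at which the numbers diverge is easy to spot.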

=== Mitigation / debugging when validation fails ===
Some reasons why the code may not work, and possible solutions.

* Code is not written to open-source standards 
** Code may have worked for authors, but it is not generally usable
** Sometimes fixes are easy: check the input/output directories
* Code is from a while ago 
** Check the package versions that the authors used, and replicate them in a venv if possible
** If you intend to fine-tune the model for future use, you probably want to update the code -- see the packages' documentation for any breaking changes
* Required components are missing
** The authors probably based their project on someone else’s, so do some digging to see whether the missing component is available elsewhere on the internet
* Everything runs without error, but results are just not replicable 😡 
** Check data processing (look out for corrupted / truncated files)
** Check model parameters (especially optimization parameters)
** Check that the “default” parameters in args correspond to the optimal values reported in the paper
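
Two of the checks above lend themselves to a quick script: comparing installed package versions against the versions the authors used, and comparing the argument defaults against the values reported in the paper. The sketch below assumes the repository exposes its settings through an <code>argparse</code> parser; the helper names, the argument names, and all values are hypothetical placeholders.

```python
import argparse
from importlib import metadata


def version_mismatches(pinned):
    """Return {package: (wanted, installed)} for packages that differ from the pins.

    installed is None when the package is not installed at all.
    """
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


def default_mismatches(parser, reported):
    """Return {argument: (paper_value, code_default)} where the two disagree."""
    mismatches = {}
    for name, paper_value in reported.items():
        code_default = parser.get_default(name)
        if code_default != paper_value:
            mismatches[name] = (paper_value, code_default)
    return mismatches


# Hypothetical parser standing in for the one defined in the repository.
example_parser = argparse.ArgumentParser()
example_parser.add_argument("--learning_rate", type=float, default=1e-3)
example_parser.add_argument("--batch_size", type=int, default=32)

# Hypothetical values as they might be reported in a paper.
example_reported = {"learning_rate": 2e-5, "batch_size": 32}
example_diff = default_mismatches(example_parser, example_reported)
```

An empty result from either helper means nothing differs from the authors' setup for the items checked; anything else is a candidate explanation for results that do not replicate.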

It is always a good idea to check the GitHub issues to see whether your problem is a "known" problem, and whether there are any workarounds.