van Miltenburg, E., Braggaar, A., Braun, N., Damen, D., Goudbeek, M., van der Lee, C., Tomas, F., & Krahmer, E. (2023). How reproducible is best-worst scaling for human evaluation? A reproduction of ‘Data-to-text Generation with Macro Planning’. In A. Belz, M. Popović, E. Reiter, C. Thomson, & J. Sedoc (Eds.), Proceedings of the 3rd Workshop on Human Evaluation of NLP Systems (pp. 75–88). Incoma Ltd., Shoumen, Bulgaria. https://aclanthology.org/2023.humeval-1.7