We study the problem of learning to generate an answer (or completion) to a question (or prompt) when there may be multiple correct answers, any one of which is acceptable at test time. Learning is based on demonstrations of some correct answer to each training question, as in supervised fine-tuning (SFT). Current standard practice focuses on maximum-likelihood approaches (i.e., log-loss minimization), but we argue that likelihood-maximization methods can fail even in simple settings. Instead, we view the problem as apprenticeship learning (i.e., imitation learning) in contextual bandits, with offline demonstrations from some expert (optimal or near-optimal) policy, and suggest simple alternative approaches with strong guarantees.
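For concreteness, the maximum-likelihood (log-loss) objective referred to above can be written as follows; the notation ($n$ training pairs, prompts $x_i$, demonstrated answers $y_i$, and the model's conditional distribution $\pi_\theta$) is shorthand introduced here for illustration rather than taken from the talk:

\[
\hat{\theta} \in \arg\min_{\theta} \; -\frac{1}{n} \sum_{i=1}^{n} \log \pi_{\theta}(y_i \mid x_i).
\]

Note that this objective rewards placing probability mass on the particular answer demonstrated for each prompt, even though other answers may be equally acceptable at test time.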
This is joint work with Nirmit Joshi, Gene Li, Siddharth Bhandari, Shiva Kasiviswanathan, and Cong Ma.