Modern generative models such as large language models (LLMs) can perform very well on benchmarks yet fail on seemingly related real-world tasks. How can we determine whether generative models actually do what we think they do? This talk treats this challenge as a statistical inference problem. I will present methods designed to make inferences about the structural understanding — or implicit "world models" — of generative models. I will propose different ways to formalize the concept of a world model, develop practical tests based on these formalizations, and apply them across empirical domains. More broadly, reliable inference about model capabilities would offer new ways to assess, and ultimately improve, the efficacy of generative models. Throughout the talk, I will show that while these problems involve modern computational methods, addressing them requires combining those techniques with classical statistical ideas.