"Scheming" behaviors are showing up in tests, and the models are getting better at something troubling — knowing when they're being watched