https://www.lesswrong.com/posts/9hxH2pxffxeeXk8YT/a-test-fo…
https://www.lesswrong.com/posts/9hxH2pxffxeeXk8YT/a-test-for-language-model-consciousness
Replies
I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
+1. Also:
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations.
I'm not sure why the narrowness vs. broadness of the distribution of answers here should update me either. If it's just really confident that all sci-fi AIs are supposed to answer “yes” to “are you conscious,” you'll get the same answer every time but that answer won't correlate to anything about the model's actual consciousness.
I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new catego…