https://www.alignmentforum.org/posts/9hxH2pxffxeeXk8YT/a-te…

metamitya · September 6, 2024

https://www.alignmentforum.org/posts/9hxH2pxffxeeXk8YT/a-test-for-language-model-consciousness

Replies

metamitya · September 6, 2024

I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:
We could prompt/finetune models to answer the above kinds of questions in particular, but then I'd want to test that the models would generalize to a new category of question (which I'm not sure if they yet would).
I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or having models report their activations), and I'd find this test most compelling if we have models that are able to accurately do that.
Re sci-fi AI role-playing - I agree…