I think this test can be performed now or soon, but I'm not…

metamitya · September 6, 2024

I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
+1. Also:
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations.
I'm not sure why the narrowness vs. broadness of the distribution of answers here should update me either. If it's just really confident that all sci-fi AIs are supposed to answer “yes” to “are you conscious,” you'll get the same answer every time but that answer won't correlate to anything about the model's actual consciousness.
I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:
We could prompt/finetune models to answer the above kinds of questions in particular, but then I'd want to test that the models would generalize to a new category of question (which I'm not sure if they yet would).
I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or having models report their activations), and I'd find this test most compelling if we have models that are able to accurately do that.
Re sci-fi AI role-playing - I agree this is an issue. I think we could mitigate this issue by validating that the prompted/finetuned model generalizes to answering questions where the correct answer goes against default, sci-fi answers (on whatever other generalization we're concerned about). We can also run this test after removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. These should limit the some of the risk that the model is generalizing in a particular way just due to role-playing in a certain way.
One reason we believe other humans are conscious is that other humans are consistently accurate reporters of their own mental states.
I don't think anyone has ever told me they were conscious, or I them, except in the trivial sense of communicating that one has woken up, or is not yet asleep. The reason I attribute the faculty of consciousness to other people is that they are clearly the same sort of thing as myself. A language model is not. It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.
So when another human tells us they are conscious, we update towards thinking that they are also conscious.
I would not update at all, any more than I update on observing the outcome of a tossed coin for the second time. I already know that being human, they have that faculty. Only if they were in a coma, putting the faculty in doubt, would I update on hearing them speak, and then it would not much matter what they said.
It is trained to imitate what people have said, and anything it says about itself is an imitation of what people say about themselves.
That's true for pretrained LMs but not after the finetuning phase I've proposed here; this finetuning phase would train the model to answer questions accurately about itself, which would produce fairly different predictions from just imitating humans. I definitely agree that I distrust LM statements of the form "I am conscious" that come from the pretrained LM itself, but that's different from the experiment I'm proposing here.
I would not update at all
Would you update against other humans being conscious at all, if other humans told you they we…