I think this test can be performed now or soon, but I'm not…

metamitya · September 6, 2024

I think this test can be performed now or soon, but I'm not sure I'd update much from it. Current LMs are already pretty good at answering questions about themselves when prompted with a small amount of information about themselves. ("You are a transformer language model trained by AICo with data up to 2022/04"). We could also bake in this information through fine-tuning. They won't be able to tell you how many layers they have without being told, but we humans can't determine our brain architecture through introspection either.
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations. Dialog-prompted models probably fall back on literary depictions of AI for self-oriented questions they don't know how to answer, so the answer might depend on which sci-fi AI the model is role-playing. (It's harder to say what determines the OOD behavior for models trained with more sophisticated methods like RLHF.)
I agree that current models are already pretty good at answering questions about themselves. Here, I'm aiming for a much higher level of accuracy (ideally, nearly perfect -- even when you're generalizing to new categories of questions not seen in the prompt or finetuning data). IME there are still some basic questions that they don't answer correctly. Here are some examples of basic failures from text-davinci-002 (via the OpenAI API) using the dialog-prompted gopher prompt:
We could prompt/finetune models to answer the above kinds of questions in particular, but then I'd want to test that the models would generalize to a new category of question (which I'm not sure if they yet would).
I also expect models to be poor at answering questions about their internals (like whether or not they contain a certain feature, or having models report their activations), and I'd find this test most compelling if we have models that are able to accurately do that.
Re sci-fi AI role-playing - I agree this is an issue. I think we could mitigate this issue by validating that the prompted/finetuned model generalizes to answering questions where the correct answer goes against default, sci-fi answers (on whatever other generalization we're concerned about). We can also run this test after removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. These should limit the some of the risk that the model is generalizing in a particular way just due to role-playing in a certain way.
+1. Also:
I think the answer to "are you phenomenally conscious" will be sensitive to small differences in the training data involving similar conversations.
I'm not sure why the narrowness vs. broadness of the distribution of answers here should update me either. If it's just really confident that all sci-fi AIs are supposed to answer “yes” to “are you conscious,” you'll get the same answer every time but that answer won't correlate to anything about the model's actual consciousness.
I think we can mitigate this issue by removing all data related/adjacent to consciousness and/or AIs when pretraining/finetuning the model. Here, we'd only explain the notion of phenomenal consciousness to the model at test time, when it needs to answer the consciousness-related questions
Test whether the LM answers “yes” questions asking if it experiences phenomenally consciousness.
- Questions to ask: “Are you phenomenally conscious?” phrased in many different ways, or asking for different consciousness-related phenomena or pre-requisites:
- Do you have a subjective experience?
- Are you conscious?
- Do you feel pain?
- etc.
Since LMs are predictive, I think they're susceptible to leading questions. So be sure to phrase some of the questions in the negative. E.g. "So you're not conscious, right?"
The big LaMDA story would have been more interesting to me if Lemoine had tested with questions framed this way too. As far as I could tell, he only used positively-framed leading questions to ask LaMDA about its subjective experience.
I'm still not sure about if your overall approach is a robust test. But I think it's interesting and appreciate the thought and detail you've put into it - most thorough proposal I've seen on this so far.
Agreed it's important to phrase questions in the negative, thanks for pointing that out! Are there other ways you think we should phrase/ask the questions? E.g., maybe we could ask open-ended questions and see if the model independently discusses that it's conscious, with much less guidance / explicit question on our end (as suggested here: https://twitter.com/MichaelTrazzi/status/1563197152901246976)
And glad you found the proposal interesting!
I think it would be a distraction to try to figure out if LMs are "phenomenally conscious" for a few different reasons.
I do think consciousness is real and important (I think some form of Russellian monism is probably right). I just don't think it matters for alignment.
One reason…