Recently, Anthropic ran an experiment with one of their models.
To better understand the black box that is an AI model, they first determined which features specific neurons were responding to, then increased or decreased the model's sensitivity to those features and evaluated the responses after the numerical values had been modified.
For instance, they found a neuron that “lights up” on the input “Golden Gate Bridge”, whether as text or as an image, and increased the activation strength of that single neuron. By doing this, they got the AI model Claude to obsess over the Golden Gate Bridge.



The practical purpose of this experiment is to develop safer AGIs, with the ability to correct them more quickly and cheaply.
This modification process demonstrated a novel approach to altering AI behavior, distinct from traditional fine-tuning or adding system prompts. Instead, it involved precise adjustments to the model’s internal activations, allowing for controlled changes in behavior.
Until now, a person would have to fine-tune the model on a new data set with specific examples, or even retrain the whole model, to get it to believe, respond, and behave as desired.
Now, however, we know that increasing or decreasing a numeric value can produce a similar result (the harder part, of course, being identifying which neuron to apply the change to).
Anthropic made these adjustments post-training: rather than retraining the entire model, they precisely tuned specific neural activations to alter its behavior. This enabled efficient exploration of AI interpretability and control without traditional fine-tuning or additional training data.
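The core idea can be sketched in a few lines. This is a toy illustration, not Anthropic's actual code: the function names, the vector sizes, and the assumption that a single feature corresponds to a single direction in activation space are all simplifications for the sake of the example.

```python
# Toy sketch of "activation steering": amplify one learned feature by
# adding a scaled direction vector to a layer's activations.
# All names and numbers below are illustrative, not Anthropic's real code.

def steer(activations, feature_direction, strength):
    """Nudge an activation vector along a feature direction.

    strength > 0 amplifies the feature (e.g. "Golden Gate Bridge"),
    strength < 0 suppresses it, and strength == 0 leaves the model as-is.
    """
    return [a + strength * f for a, f in zip(activations, feature_direction)]

# A 4-dimensional activation and a hypothetical "bridge" feature direction.
acts = [0.2, -0.1, 0.5, 0.0]
bridge_feature = [0.0, 1.0, 0.0, 0.0]  # pretend this neuron tracks the bridge

boosted = steer(acts, bridge_feature, strength=10.0)
print(boosted)  # the "bridge" component jumps from -0.1 to roughly 9.9
```

In a real model this adjustment would be applied inside the network at inference time (for example, via a forward hook on one layer), and finding the right direction to add is the hard research problem the post describes.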
I wonder what this means for the future of prompt engineering. Maybe instead of being a very talented writer, a person will need to know which neurons to modify and by how much. Or maybe that will be a new role in the ecosystem and coexist alongside prompt engineers?
Will we be able to choose between different versions of the same trained model? Will the customized model be chosen by an algorithm based on our likes? Will we end up using all the same model, just tweaked a little bit for each of us?
What does this mean for authoritarian governments like North Korea’s? Will the people there end up with their own regime-friendly AI, without the impossible task of manually filtering the data before training a model or anticipating every scenario?
As dystopian as this sounds, it is also good to know that if an AGI becomes malicious, we can make it obsessed with the safety of human beings with “the click of a button” (or maybe that’s what will make it malicious in the first place?).
