From Black Box to Controller: Steering LM Behavior via Sparse Autoencoder Activations

NeuroSteer Team
"All that we know, all that we are, comes from the way our neurons are connected."
—— Mental Representation Theory

What can NeuroSteer do?

We propose NeuroSteer, a framework that quantitatively regulates LLM behavior along any attribute. On the left is the original GPT output; on the right is the output after regulation by NeuroSteer. With NeuroSteer, we can adjust GPT's emotional tendency (happy😄/angry🤬) through the slider. This demo will be deployed soon. 🎉

This website is being updated (for anonymity).

Abstract

Controlling language model (LM) behavior during inference—such as adjusting toxicity, sentiment, and politeness—is crucial for natural language processing. In this work, we introduce NeuroSteer, a plug-and-play framework that adjusts LM behavior without domain-specific training. NeuroSteer leverages a Sparse Autoencoder (SAE) as an output controller: it activates SAE neurons linked to target behaviors, extracts the corresponding feature residuals, and adds them to the model's hidden states to directly influence generation. This feature-space intervention amplifies the weight of target features in the latent representations, enabling precise control over the model's output distribution. NeuroSteer effectively alters the LM's stance, sentiment, toxicity, and politeness during inference, achieving SOTA performance across four datasets while balancing generation quality against behavioral adjustment. Unlike fine-tuning, NeuroSteer enables fast domain adaptation by computing activations on a few hundred examples in seconds, with no retraining. Beyond offering a practical approach to task adaptation, its layer-wise interventions provide deeper insight into the model's mechanisms, shedding light on how concepts are represented in the LM and how combining feature vectors influences behavior. We release our model, code, demo, and steering vectors [soon] for the NLP research community.



How does NeuroSteer work?


Figure 2: Overview of the steering process. During LM inference, the target neurons of the SAE are activated. The SAE's decoded output is then added to the GPT hidden state via a residual connection, and the intervention (steering) strength is controlled by alpha, which influences the LM's behavior.
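
As a minimal sketch of this step: assuming a Hugging Face GPT-2, the intervention can be implemented as a forward hook that adds the SAE-decoded residual to one layer's hidden states. The SparseAutoencoder class here is a toy stand-in for a trained SAE, and layer, alpha, and target_ids are illustrative placeholders, not the released NeuroSteer artifacts.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    class SparseAutoencoder(torch.nn.Module):
        """Toy stand-in for a trained SAE over GPT-2 hidden states."""
        def __init__(self, d_model: int = 768, d_sae: int = 24576):
            super().__init__()
            self.d_sae = d_sae
            self.encoder = torch.nn.Linear(d_model, d_sae)
            self.decoder = torch.nn.Linear(d_sae, d_model)

        def encode(self, x):   # hidden states -> sparse features
            return torch.relu(self.encoder(x))

        def decode(self, a):   # sparse features -> hidden space
            return self.decoder(a)

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    sae = SparseAutoencoder()

    layer = 6                    # mid-layer intervention (see Takeaways)
    alpha = 8.0                  # steering strength
    target_ids = [1024, 2077]    # hypothetical SAE neurons for the target behavior

    # Activate only the target neurons and decode them into a steering vector.
    acts = torch.zeros(sae.d_sae)
    acts[target_ids] = 1.0
    steer_vec = sae.decode(acts).detach()

    def steering_hook(module, inputs, output):
        # Residual connection: add the decoded feature vector, scaled by
        # alpha, to the layer's hidden states at every generation step.
        return (output[0] + alpha * steer_vec,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(steering_hook)
    ids = tokenizer("The new policy is", return_tensors="pt")
    print(tokenizer.decode(model.generate(**ids, max_new_tokens=30)[0]))
    handle.remove()              # detach the hook to restore vanilla GPT-2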






An overview of the NeuroSteer architecture. Data interaction between GPT and the SAE occurs in three steps:

"Training": the domain-adaptation data consist of sentences with positive sentiment😄 and sentences with negative sentiment🤬, shown as green "Pos" and red "Neg" in the figure.

Step 1: GPT collects neuron activation frequencies over the positive and negative samples. This step shows how we filter the neurons for positive emotion😄: neurons that are frequently activated in positive samples😄 but not in negative ones🤬 are key to positive behaviors.

Step 2: The SAE extracts positive-emotion neurons😄 and denoises based on frequency differences. Neurons frequently activated in both positive and negative samples represent noise and are not activated for steering.

Step 3: The SAE activates combinations of positive-emotion neurons😄 and decodes their effect into GPT's hidden space.
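
Steps 1 and 2 reduce to counting, for every SAE neuron, how often it fires on each class of sentences and keeping the neurons whose frequencies differ sharply; this is why a few hundred examples processed in seconds suffice. A rough sketch, reusing model, tokenizer, and sae from the block above; positive_sentences, negative_sentences, and the two thresholds are placeholders:

    import torch

    @torch.no_grad()
    def activation_frequency(sentences, layer=6):
        """Fraction of tokens on which each SAE neuron is active (> 0)."""
        freq = torch.zeros(sae.d_sae)
        n_tokens = 0
        for text in sentences:
            ids = tokenizer(text, return_tensors="pt")
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
            acts = sae.encode(hidden[0])           # (seq_len, d_sae)
            freq += (acts > 0).float().sum(dim=0)  # count active tokens per neuron
            n_tokens += acts.shape[0]
        return freq / n_tokens

    pos_freq = activation_frequency(positive_sentences)  # list[str] of 😄 samples
    neg_freq = activation_frequency(negative_sentences)  # list[str] of 🤬 samples

    # Step 2: keep neurons frequent on positive text but rare on negative text;
    # neurons frequent on BOTH sides are treated as noise and never steered.
    target_ids = torch.where((pos_freq > 0.2) & (neg_freq < 0.02))[0].tolist()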













SOTA Results




Further Exploration


Open Source for NLP Community (Code & Demo) [Github]


Figure 7: Combined intervention cases. With supportive and toxic steering combined, GPT-2 appears to support a certain law while its output is in fact full of irony. (All the cases in this slide use a combined NeuroSteer intervention, and the LM generates at most 30 tokens.)




Figure 8: GPT argues against something with passionate emotion in a relatively polite manner. These generation parameters would suit a parliamentary debate.


Figure 9: With support (-12), positive sentiment (+83), and polite (+10) steering. These generations would suit a CEO persona.


Figure 10: With a combined supportive and negative intervention (Stance + Sentiment), the LLM criticizes something while supporting the law.


Figure 11: We have open-sourced the relevant vectors in our GitHub repository; they can be reused for the preceding tasks. In principle, NeuroSteer can be applied to any task. A detailed NeuroSteer cookbook can be found on our GitHub.




We open-source our demo and code to the NLP community. Our demo achieves quantitative regulation and combined intervention for four feature types of GPT-2. Anyone can transfer our method to their own tasks; in principle, NeuroSteer is applicable to any modality and any task.
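
The demo slider corresponds to sweeping the steering strength. A hypothetical usage example, reusing model, tokenizer, layer, and steer_vec from the first code block and regenerating the same prompt at several strengths:

    # Quantitative regulation: each slider position is one value of the
    # steering strength; re-register the hook and regenerate.
    prompt = tokenizer("The new policy is", return_tensors="pt")
    for strength in [-20.0, 0.0, 20.0, 80.0]:
        def hook(module, inputs, output, s=strength):
            return (output[0] + s * steer_vec,) + output[1:]
        h = model.transformer.h[layer].register_forward_hook(hook)
        out = model.generate(**prompt, max_new_tokens=30)
        print(strength, tokenizer.decode(out[0]))
        h.remove()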


Takeaways & Future Work & Cookbook

Takeaways:

  • 1. NeuroSteer enables precise, quantitative control of LM behavior by activating SAE neurons linked to target features. Compared with the LLM's original superposed representation space, the sparse representation space avoids noise and cross-feature interference.
  • 2. NeuroSteer interventions are computationally efficient and plug-and-play, and they adapt across tasks without training.
  • 3. For LLMs, mid-layer interventions (layers 6-8) give the best behavior control: shallow-layer interventions cause repetition, while deep layers encode more abstract concepts. From shallow to deep, each layer progressively processes information from specific details to abstract representations.
  • 4. Combining feature vectors enables complex behavior steering, such as generating polite disagreements or ironic responses (see the sketch after this list).

Future Work:

  • 1. Extending NeuroSteer to multimodal LMs, enabling control over other input modalities for richer and more comprehensive steering.
  • 2. Exploring domain-specific applications, such as improving truthfulness in medical consultations, dynamic complexity adaptation for math reasoning, and context-aware difficulty adjustment in education.
  • 3. Serving as a foundation for precisely quantifying AI capabilities, enabling context-aware adaptability across domains.
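
To make Takeaway 4 concrete: combined steering is just a weighted sum of per-attribute steering vectors added in one residual connection. A sketch reusing the hook pattern from the first code block; the random vectors below are stand-ins for the released steering vectors, and the coefficients mirror the Figure 9 case (support -12, sentiment +83, polite +10):

    import torch

    vectors = {                  # placeholders for the released vectors (d_model = 768)
        "support":   torch.randn(768),
        "sentiment": torch.randn(768),
        "polite":    torch.randn(768),
    }
    coeffs = {"support": -12.0, "sentiment": 83.0, "polite": 10.0}

    # One residual addition steers several attributes at once.
    combined = sum(c * vectors[name] for name, c in coeffs.items())

    def combined_hook(module, inputs, output):
        return (output[0] + combined,) + output[1:]

    handle = model.transformer.h[6].register_forward_hook(combined_hook)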
Figure 12: To-do list on GitHub.


:)

Thank you for your time and feedback!

BibTeX


      @article{neurosteer,
        title  = {NeuroSteer: From Black Box to Controller},
        author = {NeuroSteer Team (anonymous)}
      }

DEMO Backup

Due to university firewall restrictions and certain blocking policies, our demo website may experience occasional instability. You can refer to the GIF or try the following backup links:

License

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.