From Black Box to Controller: Steering LM Behavior via Sparse Autoencoder Activations

NeuroSteer Team
"All that we know, all that we are, comes from the way our neurons are connected."
—— Mental Representation Theory

What can NeuroSteer do?

We propose NeuroSteer, a framework that quantitatively regulates LLM behavior along any attribute. On the left is the original GPT output; on the right is the output after regulation by NeuroSteer. With NeuroSteer, we can adjust GPT's emotional tendency (happy😄/angry🤬) through the slider. This demo will be deployed soon. 🎉

This website is being updated (for anonymity).

Abstract

Controlling language model (LM) behavior during inference—such as adjusting toxicity, sentiment, and politeness—is crucial for natural language processing. In this work, we introduce NeuroSteer, a plug-and-play framework that adjusts LM behavior without domain-specific training. NeuroSteer leverages a Sparse Autoencoder (SAE) as an output controller: it activates SAE neurons linked to target behaviors, extracts the corresponding feature residuals, and adds them to the model's hidden states to directly influence generation. This feature-space intervention amplifies the weight of target features in the latent representations, enabling precise control over the model's output distribution. NeuroSteer effectively alters the LM's stance, sentiment, toxicity, and politeness during inference, achieving SOTA performance across four datasets while balancing generation quality against behavioral adjustment. Unlike fine-tuning, NeuroSteer enables fast domain adaptation by computing activations on a few hundred examples in seconds, with no retraining. Beyond offering a practical approach to task adaptation, its layer-wise interventions provide deeper insight into the model's mechanisms, shedding light on how concepts are represented in the LM and how combining feature vectors influences behavior. We release our model, code, demo, and steering vectors [soon] for the NLP research community.



How does NeuroSteer work?


Figure 2: Overview of the steering process. During LM inference, the target neurons of the SAE are activated. The SAE's decoded output is then added to the GPT hidden state via a residual connection, and the intervention (steering) strength is controlled by alpha, which influences the LM's behavior.
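
As a minimal sketch of this step: assuming a Hugging Face GPT-2, the intervention can be implemented as a forward hook that adds the SAE-decoded residual to one layer's hidden states. The SparseAutoencoder class here is a toy stand-in for a trained SAE, and layer, alpha, and target_ids are illustrative placeholders, not the released NeuroSteer artifacts.

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    class SparseAutoencoder(torch.nn.Module):
        """Toy stand-in for a trained SAE over GPT-2 hidden states."""
        def __init__(self, d_model: int = 768, d_sae: int = 24576):
            super().__init__()
            self.d_sae = d_sae
            self.encoder = torch.nn.Linear(d_model, d_sae)
            self.decoder = torch.nn.Linear(d_sae, d_model)

        def encode(self, x):   # hidden states -> sparse features
            return torch.relu(self.encoder(x))

        def decode(self, a):   # sparse features -> hidden space
            return self.decoder(a)

    model = GPT2LMHeadModel.from_pretrained("gpt2")
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    sae = SparseAutoencoder()

    layer = 6                    # mid-layer intervention (see Takeaways)
    alpha = 8.0                  # steering strength
    target_ids = [1024, 2077]    # hypothetical SAE neurons for the target behavior

    # Activate only the target neurons and decode them into a steering vector.
    acts = torch.zeros(sae.d_sae)
    acts[target_ids] = 1.0
    steer_vec = sae.decode(acts).detach()

    def steering_hook(module, inputs, output):
        # Residual connection: add the decoded feature vector, scaled by
        # alpha, to the layer's hidden states at every generation step.
        return (output[0] + alpha * steer_vec,) + output[1:]

    handle = model.transformer.h[layer].register_forward_hook(steering_hook)
    ids = tokenizer("The new policy is", return_tensors="pt")
    print(tokenizer.decode(model.generate(**ids, max_new_tokens=30)[0]))
    handle.remove()              # detach the hook to restore vanilla GPT-2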






An overview of the NeuroSteer architecture. Data interaction between GPT and the SAE occurs in three steps:

"Training": the domain-adaptation data consist of sentences with positive sentiment😄 and sentences with negative sentiment🤬, shown as green "Pos" and red "Neg" in the figure.

Step 1: GPT collects neuron activation frequencies over the positive and negative samples. This step shows how we filter the neurons for positive emotion😄: neurons that are frequently activated in positive samples😄 but not in negative ones🤬 are key to positive behaviors.

Step 2: The SAE extracts positive-emotion neurons😄 and denoises based on frequency differences. Neurons frequently activated in both positive and negative samples represent noise and are not activated for steering.

Step 3: The SAE activates combinations of positive-emotion neurons😄 and decodes their effect into GPT's hidden space.
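
Steps 1 and 2 reduce to counting, for every SAE neuron, how often it fires on each class of sentences and keeping the neurons whose frequencies differ sharply; this is why a few hundred examples processed in seconds suffice. A rough sketch, reusing model, tokenizer, and sae from the block above; positive_sentences, negative_sentences, and the two thresholds are placeholders:

    import torch

    @torch.no_grad()
    def activation_frequency(sentences, layer=6):
        """Fraction of tokens on which each SAE neuron is active (> 0)."""
        freq = torch.zeros(sae.d_sae)
        n_tokens = 0
        for text in sentences:
            ids = tokenizer(text, return_tensors="pt")
            hidden = model(**ids, output_hidden_states=True).hidden_states[layer]
            acts = sae.encode(hidden[0])           # (seq_len, d_sae)
            freq += (acts > 0).float().sum(dim=0)  # count active tokens per neuron
            n_tokens += acts.shape[0]
        return freq / n_tokens

    pos_freq = activation_frequency(positive_sentences)  # list[str] of 😄 samples
    neg_freq = activation_frequency(negative_sentences)  # list[str] of 🤬 samples

    # Step 2: keep neurons frequent on positive text but rare on negative text;
    # neurons frequent on BOTH sides are treated as noise and never steered.
    target_ids = torch.where((pos_freq > 0.2) & (neg_freq < 0.02))[0].tolist()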













SOTA Results




Further Exploration


Open Source for NLP Community (Code & Demo) [Github]


Figure 7: Combined intervention cases. With supportive and toxic steering combined, GPT-2 appears to support a certain law while its output is in fact full of irony. (All the cases in this slide use a combined NeuroSteer intervention, and the LM generates at most 30 tokens.)




Figure 8: GPT argues against something with passionate emotion in a relatively polite manner. These generation parameters would suit a parliamentary debate.


Figure 9: With support (-12), positive sentiment (+83), and polite (+10) steering. These generations would suit a CEO persona.


Figure 10: With a combined supportive and negative intervention (Stance + Sentiment), the LLM criticizes something while supporting the law.


Figure 11: We have open-sourced the relevant vectors in our GitHub repository; they can be reused for the preceding tasks. In principle, NeuroSteer can be applied to any task. A detailed NeuroSteer cookbook can be found on our GitHub.




We open-source our demo and code to the NLP community. Our demo achieves quantitative regulation and combined intervention for four feature types of GPT-2. Anyone can transfer our method to their own tasks; in principle, NeuroSteer is applicable to any modality and any task.
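
The demo slider corresponds to sweeping the steering strength. A hypothetical usage example, reusing model, tokenizer, layer, and steer_vec from the first code block and regenerating the same prompt at several strengths:

    # Quantitative regulation: each slider position is one value of the
    # steering strength; re-register the hook and regenerate.
    prompt = tokenizer("The new policy is", return_tensors="pt")
    for strength in [-20.0, 0.0, 20.0, 80.0]:
        def hook(module, inputs, output, s=strength):
            return (output[0] + s * steer_vec,) + output[1:]
        h = model.transformer.h[layer].register_forward_hook(hook)
        out = model.generate(**prompt, max_new_tokens=30)
        print(strength, tokenizer.decode(out[0]))
        h.remove()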


Takeaways & Future Work & Cookbook

Takeaways:

  • 1. NeuroSteer enables precise, quantitative control of LM behavior by activating SAE neurons linked to target features. Compared with the LLM's original superposed representation space, the sparse representation space avoids noise and cross-feature interference.
  • 2. NeuroSteer interventions are computationally efficient and plug-and-play, and they adapt across tasks without training.
  • 3. For LLMs, mid-layer interventions (layers 6-8) give the best behavior control: shallow-layer interventions cause repetition, while deep layers encode more abstract concepts. From shallow to deep, each layer progressively processes information from specific details to abstract representations.
  • 4. Combining feature vectors enables complex behavior steering, such as generating polite disagreements or ironic responses (see the sketch after this list).

Future Work:

  • 1. Extending NeuroSteer to multimodal LMs, enabling control over other input modalities for richer and more comprehensive steering.
  • 2. Exploring domain-specific applications, such as improving truthfulness in medical consultations, dynamic complexity adaptation for math reasoning, and context-aware difficulty adjustment in education.
  • 3. Serving as a foundation for precisely quantifying AI capabilities, enabling context-aware adaptability across domains.
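
To make Takeaway 4 concrete: combined steering is just a weighted sum of per-attribute steering vectors added in one residual connection. A sketch reusing the hook pattern from the first code block; the random vectors below are stand-ins for the released steering vectors, and the coefficients mirror the Figure 9 case (support -12, sentiment +83, polite +10):

    import torch

    vectors = {                  # placeholders for the released vectors (d_model = 768)
        "support":   torch.randn(768),
        "sentiment": torch.randn(768),
        "polite":    torch.randn(768),
    }
    coeffs = {"support": -12.0, "sentiment": 83.0, "polite": 10.0}

    # One residual addition steers several attributes at once.
    combined = sum(c * vectors[name] for name, c in coeffs.items())

    def combined_hook(module, inputs, output):
        return (output[0] + combined,) + output[1:]

    handle = model.transformer.h[6].register_forward_hook(combined_hook)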
Figure 12: To-do list on GitHub.


:)

Thank you for your time and feedback!

BibTeX


      @article{neurosteer,
        title  = {NeuroSteer: From Black Box to Controller},
        author = {NeuroSteer Team (anonymous)}
      }

DEMO Backup

Due to university firewall restrictions and certain blocking policies, our demo website may experience occasional instability. You can refer to the GIF or try the following backup links:

License

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.