"All that we know, all that we are, comes from the way our neurons are connected."
This website is being updated (for anonymity).
Controlling the behavior of a language model (LM) during inference, such as adjusting toxicity, sentiment, and degree of politeness, is crucial for natural language processing. In this work, we introduce NeuroSteer, a plug-and-play framework that adjusts LM behavior without domain-specific training. NeuroSteer uses a Sparse AutoEncoder (SAE) as an output controller: it activates SAE neurons linked to target behaviors, extracts the corresponding feature residuals, and adds them to the model's hidden states to directly influence generation. This feature-space intervention amplifies the weight of target features in the latent representations, enabling precise control over the model's output distribution. NeuroSteer effectively alters the LM's stance, sentiment, toxicity, and politeness during inference, achieving SOTA performance across four datasets while balancing generation quality against behavioral adjustment. Unlike fine-tuning, NeuroSteer enables fast domain adaptation by calculating activations on hundreds of examples in seconds, without retraining. Beyond offering a possible solution for task adaptation, its layer-wise interventions provide deeper insight into the model's mechanisms, shedding light on how concepts are represented in the LM and how combining feature vectors influences behavior. We release our model, code, demo, and steering vectors [soon] for the NLP research community.
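The intervention described above can be illustrated with a toy example. The following is a minimal sketch under stated assumptions, not the released NeuroSteer code: the SAE weights are random stand-ins for a trained autoencoder, the hidden states are synthetic rather than taken from GPT-2, and the names `sae_encode`, `steering_vector`, `steer`, `top_k`, and `alpha` are hypothetical. It contrasts mean SAE activations on target and non-target examples, keeps the neurons that respond most to the target, decodes them into a vector in hidden-state space, and adds that vector to the hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae = 16, 64  # toy sizes chosen for illustration, not GPT-2's

# Random stand-ins for a trained SAE's encoder/decoder weights (assumption).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)


def sae_encode(h):
    """ReLU encoder: hidden states (n, d_model) -> SAE activations (n, d_sae)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)


def steering_vector(h_target, h_other, top_k=8):
    """Decode the top-k SAE neurons that fire more on target examples."""
    diff = sae_encode(h_target).mean(axis=0) - sae_encode(h_other).mean(axis=0)
    top = np.argsort(diff)[-top_k:]           # indices of the largest differences
    sparse = np.zeros(d_sae)
    sparse[top] = np.maximum(diff[top], 0.0)  # keep only positive differences
    return sparse @ W_dec                     # back to hidden-state space (d_model,)


def steer(h, vec, alpha=1.0):
    """Add the scaled steering vector to every position's hidden state."""
    return h + alpha * vec


# Synthetic "hundreds of examples" for a target and a contrasting behavior.
h_pos = rng.normal(loc=0.5, size=(200, d_model))
h_neg = rng.normal(loc=-0.5, size=(200, d_model))
vec = steering_vector(h_pos, h_neg)

h = rng.normal(size=(5, d_model))  # hidden states for 5 token positions
h_steered = steer(h, vec, alpha=2.0)
```

In this sketch `alpha` plays the role of an intervention strength. One plausible way to combine behaviors, again as an assumption rather than the documented method, would be to add a weighted sum of several such vectors.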
We open-source our demo and code to the NLP community. Our demo supports quantitative regulation and combined intervention for four types of features in GPT-2. Anyone can adapt our method to their own tasks. In principle, the NeuroSteer method is applicable to any modality and any task.
@article{neurosteer,
  title  = {NeuroSteer: From blackbox to controller},
  author = {NeuroSteer Team (anonymous)}
}
Due to university firewall restrictions and certain blocking policies, our demo website may experience occasional instability. You can refer to the GIF or try the following backup links:
This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.