About ModelPeek.ai

ModelPeek.ai is a visualization tool created by Eduardo Slonski, based on the paper Detecting Memorization in Large Language Models. It allows you to visualize memorized sentences, loss metrics, predictions, and attention patterns.

Detector

The detector is the core feature of the app. It reveals which tokens the LLM has memorized by analyzing its internal patterns. Blue tokens indicate a high likelihood of memorization, while red indicates the opposite. You can switch the detector from Memorization to Repetition mode to identify when the model recognizes repeated patterns. Additionally, you can select different activation types and layers of the trained detector. Learn more in the How the Detector Works section below.
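
As a rough illustration (not the app's actual code), a per-token detector score between 0 and 1 can be mapped to this red-to-blue scale as follows; the tokens, scores, and palette below are hypothetical.

    # Map hypothetical detector scores (0 = not memorized, 1 = memorized)
    # to a red-to-blue hex color, mirroring the coloring described above.
    import numpy as np

    def score_to_color(score: float) -> str:
        """Interpolate from red (score 0) to blue (score 1)."""
        blue = int(round(255 * score))
        red = 255 - blue
        return f"#{red:02x}00{blue:02x}"

    tokens = ["We", " hold", " these", " truths"]        # hypothetical tokens
    probe_scores = np.array([0.08, 0.85, 0.93, 0.97])    # hypothetical detector outputs

    for token, score in zip(tokens, probe_scores):
        print(f"{token!r:12} score={score:.2f} color={score_to_color(score)}")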

Loss

Displays the cross entropy loss of each token, with darker red indicating higher loss values.
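
For reference, this quantity can be computed directly from a causal LM's logits. The sketch below uses PyTorch and dummy inputs; it illustrates the per-token cross entropy shown in this view, not the app's own code.

    # Per-token cross entropy for a causal LM: position t predicts token t+1.
    import torch
    import torch.nn.functional as F

    def per_token_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
        """logits: [seq_len, vocab_size]; input_ids: [seq_len]. Returns [seq_len - 1] losses."""
        shift_logits = logits[:-1]          # predictions for positions 1..seq_len-1
        shift_labels = input_ids[1:]        # the tokens actually observed there
        return F.cross_entropy(shift_logits, shift_labels, reduction="none")

    # Dummy example: darker red in the Loss view corresponds to larger values here.
    logits = torch.randn(8, 50_000)
    input_ids = torch.randint(0, 50_000, (8,))
    print(per_token_loss(logits, input_ids))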

Attention

Visualizes the attention patterns of selected tokens. The default enhanced mode amplifies values for better color visualization, though this may affect the proportions of larger values. To view the actual attention values, hold the [ALT] key or disable 'Enhance Attention' in Options. You can explore attention patterns across different layers and heads of the model.
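
Outside the app, the same attention tensors can be inspected with the Hugging Face transformers library. The sketch below assumes the public EleutherAI/pythia-1b checkpoint and uses a square-root rescaling as a stand-in for the unspecified 'Enhance Attention' transform.

    # Pull attention weights for a selected token from a specific layer and head,
    # then apply a simple amplification so small weights become visible.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")

    inputs = tokenizer("To be, or not to be, that is the question", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs, output_attentions=True)

    layer, head, token_idx = 5, 3, -1                       # hypothetical UI selection
    attn = outputs.attentions[layer][0, head, token_idx]    # weights over previous tokens

    enhanced = attn.sqrt()                                  # amplifies small values,
    enhanced = enhanced / enhanced.sum()                    # but distorts larger ones

    print("raw:     ", attn)
    print("enhanced:", enhanced)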

Predictions

Displays the top predictions for a selected token, with the correct next token highlighted in blue.
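
A rough equivalent with the public Pythia 1B checkpoint (the example text, the selected position, and the top-k cutoff are placeholders):

    # Top predictions at a selected position, marking whether the true next token
    # from the text appears among them (the blue highlight in the Predictions view).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-1b")
    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")

    text = "The quick brown fox jumps over the lazy dog"
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits

    position = 4                                    # hypothetical selected token
    true_next_id = input_ids[0, position + 1].item()
    probs = torch.softmax(logits[0, position], dim=-1)
    top = torch.topk(probs, k=5)

    for prob, token_id in zip(top.values, top.indices):
        marker = "  <- correct next token" if token_id.item() == true_next_id else ""
        print(f"{tokenizer.decode(token_id.item())!r:>14}  {prob.item():.3f}{marker}")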

Samples

The Random category contains 2,000 samples drawn from a random subset of the SlimPajama dataset. Other categories (anthems, quotes, etc.) include both memorized and non-memorized samples that were carefully selected and tested. These samples demonstrate that the detector is not biased toward specific categories and genuinely detects memorization.

Minimap

A floating minimap appears by default on the right side of your screen when viewing Random samples. This helps you quickly identify token patterns without scrolling. Click the minimap to close it, or click its box to reopen it.

Model

The implementation uses Pythia 1B, featuring 16 layers, 8 attention heads per layer, and a hidden size of 2048.
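
Assuming the public EleutherAI/pythia-1b checkpoint on Hugging Face, these figures can be confirmed from its config:

    from transformers import AutoConfig

    config = AutoConfig.from_pretrained("EleutherAI/pythia-1b")
    print(config.num_hidden_layers)     # 16 transformer layers
    print(config.num_attention_heads)   # 8 attention heads per layer
    print(config.hidden_size)           # hidden size of 2048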

How the Detector Works

The Detector implements the methodology from the paper Detecting Memorization in Large Language Models. It uses a probe trained on the model's neuron patterns through the following process (a code sketch follows the list):

  1. Collect samples that the model has and has not memorized.
  2. Apply statistical methods to identify the dimensions that best distinguish between these two groups.
  3. Use the most effective distinguishing dimension to label millions of tokens in a general dataset.
  4. Train a probe on the activations of memorized and non-memorized tokens. The probe classifies tokens as either memorized (1) or non-memorized (0).
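
A minimal sketch of those four steps, using random stand-in activations, a simple signal-to-noise score in place of the paper's statistical methods, and a logistic-regression probe; the shapes, thresholds, and classifier choice are assumptions for illustration only.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    # Step 1: activations for tokens from memorized and non-memorized samples.
    # Shape: [n_tokens, hidden_size]; random stand-ins here.
    acts_mem = rng.normal(1.0, 1.0, size=(500, 2048))
    acts_non = rng.normal(0.0, 1.0, size=(500, 2048))

    # Step 2: score each dimension by how well it separates the two groups
    # (a signal-to-noise ratio as a stand-in for the paper's statistics).
    separation = np.abs(acts_mem.mean(0) - acts_non.mean(0)) / (
        acts_mem.std(0) + acts_non.std(0) + 1e-8
    )
    best_dim = int(separation.argmax())

    # Step 3: label tokens of a large general dataset with that single dimension.
    general_acts = rng.normal(0.5, 1.2, size=(10_000, 2048))
    threshold = 0.5 * (acts_mem[:, best_dim].mean() + acts_non[:, best_dim].mean())
    labels = (general_acts[:, best_dim] > threshold).astype(int)  # 1 = memorized

    # Step 4: train a probe on the full activations against those labels.
    probe = LogisticRegression(max_iter=1000).fit(general_acts, labels)
    print("probe train accuracy:", probe.score(general_acts, labels))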

The paper reports strong results, with the probe achieving near-perfect accuracy in distinguishing between memorized and non-memorized tokens. The key success factors are maintaining a diverse set of memorized and non-memorized samples, and training the probe on a general random dataset. This approach ensures the probe remains unbiased across different types of content.

We apply the same methodology to train the Repetition probe, focusing on sentences that repeat earlier content where the model can simply copy the text.
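
For illustration only, a repetition training sample might be built by repeating a sentence at the token level and labeling the second occurrence as repetition; the sentence and labeling scheme below are hypothetical.

    # The second occurrence of the sentence can be continued by copying the first,
    # so its tokens form the positive (repetition) class for the Repetition probe.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-1b")
    sentence = "The sky turned a deep orange as the sun set over the hills. "
    first_ids = tokenizer(sentence).input_ids

    ids = first_ids + first_ids                             # second half repeats the first exactly
    labels = [0] * len(first_ids) + [1] * len(first_ids)    # 1 = token repeats earlier content
    print(list(zip(tokenizer.convert_ids_to_tokens(ids), labels)))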

Key Points

    1. Why do some sentences classified as memorized show high loss?

    When I first observed this phenomenon, I suspected an issue with the trained probe. After running these sequences through the model, I found that the neurons responsible for detecting memorization still classified them as memorized. I have two hypotheses for this behavior:

    First, the model becomes overconfident in earlier layers, and this mechanism propagates to later layers without verification. This is supported by the observation that when a memorized quote is placed above a non-memorized one, the initial tokens of the non-memorized one are classified as memorized.

    Second, there is a distinction between identifying memorization and recalling the continuation. The model may have lost its ability to predict certain sequences while retaining its ability to identify them. This is plausible since later layers handle the more complex task of converting context into predictions and are less stable during gradient updates.

    I've trained probes that relabel high-loss sequences classified as memorized to non-memorized (0), which significantly mitigates the issue but isn't a perfect fix. I believe scaling the training will resolve this completely. The evidence suggests that the model genuinely identifies these sequences as memorized but struggles to continue them accurately.
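
    As an illustration of that relabeling step (the threshold and data are hypothetical, not the trained probe's actual values):

        # Sequences the detector marks as memorized (1) but that still show high
        # loss are pushed to the non-memorized class (0) before retraining.
        import numpy as np

        def relabel(labels: np.ndarray, losses: np.ndarray, loss_threshold: float = 4.0) -> np.ndarray:
            """Set label 0 for tokens labeled memorized whose loss exceeds the threshold."""
            relabeled = labels.copy()
            relabeled[(labels == 1) & (losses > loss_threshold)] = 0
            return relabeled

        labels = np.array([1, 1, 1, 0, 1])             # detector labels
        losses = np.array([0.2, 5.1, 0.4, 3.0, 6.2])   # per-token cross entropy
        print(relabel(labels, losses))                 # -> [1 0 1 0 0]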

    2. What are the practical applications of this work?

    Beyond its scientific value, identifying memorization wasn't my primary goal: I was developing a method to enhance LLM training and capabilities, and memorization was an obstacle. This work will be crucial for that research and for other work that needs to identify when models rely on memory.

    It can also significantly impact LLM evaluation, where it's traditionally been difficult to determine whether a model truly understands a problem or has simply encountered it during training. Moreover, the paper presents a methodology applicable to many internal LLM mechanisms. I strongly believe that next-generation models will require trillions of high-quality, labeled, and filtered tokens, and these types of classifiers will help us fine-tune what matters most.

    3. Do LLMs rely on memorization or reasoning?

    The evidence clearly shows that they memorize very little of their training data. By revealing memorization, repetition, and other internal mechanisms directly through the model's neurons, we can see that they develop sophisticated representations to accomplish the next-token prediction objective.

    Whether we call this 'reasoning' or use another term, we cannot ignore their ability to store facts, recognize patterns, and apply logic when solving problems. Memorization plays a minor role in how these models function, and the mathematics of LLM training naturally favors the development of representations over pure memorization.