
Series Intro (AKA Why You Should Care About This)

If your news feeds are anything like mine, then you’re probably being bombarded with constant updates about new AI breakthroughs, models, products, or how organizations are adopting AI into their workflows. We are fully in the AI gold rush, and no company or executive wants to be left behind or beaten to market by a competitor’s AI-based tool. This is why we’re seeing AI being pushed into so many products so quickly. With this speed and desperation to get products to market, security is frequently treated as a ‘nice-to-have’ or an afterthought.

Does this pattern sound familiar? Unfortunately, this isn’t the first time (nor probably the last) that this scenario has played out. For example, in the earlier days of Microsoft, when they were working to monopolize the operating system market, they were focused on releasing features and products instead of security, which was left lacking. In those days, serious zero days were a frequent occurrence in Windows and other Microsoft products. This eventually earned Microsoft a bad reputation and left large customers like the US government considering a switch to more secure options. In 2002, Bill Gates released a famous memo outlining a shift for Microsoft to make security a top priority for all products.

“When we face a choice between adding features and resolving security issues, we need to choose security.”

Bill Gates, 2002 Internal Memo

Many companies right now are in the same spot Microsoft was over 20 years ago, prioritizing being first to market with their AI products and features instead of focusing on their security. This is especially challenging for companies due to the complex and novel nature of LLM risks, their significant differences from other types of security risks, and the major impact they can have.

In this blog series, we’re going to talk about these risks, how they differ from and are similar to other common security risks, and why it can be deceptively challenging to rely on LLMs themselves as their own security. In this first blog, we’re going to learn about the fundamentals of how neural networks and LLMs actually work. This is imperative both for understanding the associated risks and for demystifying what’s actually happening behind the scenes. If you want to learn about how AI can be used to assist you when coding securely, we have a Programming with AI course taught by Heath Adams on the TCM Academy.

If you’re interested in learning about AI security, then grab yourself a coffee, tea, or your favourite beverage and get locked in for the ride!

The Math Isn’t As Complicated As It Looks

To fully understand why some LLM risks are unique, a basic understanding of how LLMs and neural networks work is required. I’ll do my best in this blog to keep it as basic as possible while still conveying the most important concepts. If you already have a solid grasp of this, then this will mostly be review.

Disclaimer: I’m going to make some generalizations and simplifications here.

At the heart of LLMs (and most modern AI/ML applications) is a computational model designed to learn from data, known as a neural network. They get their name from some similarities to how the human brain works. Neural networks are made up of layers of units called neurons that take in information, perform a calculation on it, and pass it along to the next layer.

The first layer, where the input data is received, is called the input layer, the inner layers are called the hidden layers, and the final layer is called the output layer. Through training, the neural network can be tweaked to provide specific outputs for types of inputs.

As a basic example, we’ll use a classic machine learning problem of classifying a 28×28 pixel picture of a hand-written numeral from 0-9.

Neural network and handwriting
But neural networks work off of numbers, not pictures. In our example, and for LLMs, a one-dimensional array of numbers, referred to as a vector or embedding, is used. For the 28×28 pixel black-and-white image in our example, this can be accomplished by flattening it into a single 784-dimensional vector.
Neural network number grid

Each pixel is represented numerically in a 784-length array, such as:

[A1, A2…A28, B1, B2…B28, C1…C28]

Each grid location holds a darkness value from 0 to 1 (for example, J5 would be 0, J11 would be 0.5, and J12 would be 1). Now we have our picture represented numerically, ready to be used as input.
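
If it helps to see that in code, below is a minimal sketch of the flattening step using NumPy. The randomly generated pixel values are just a stand-in for a real scanned digit.

import numpy as np

# A made-up 28x28 grayscale image, values 0-255 (0 = white, 255 = black).
image = np.random.randint(0, 256, size=(28, 28))

# Flatten it into a 784-length vector and scale each pixel to the 0-1 darkness range.
input_vector = image.flatten() / 255.0

print(input_vector.shape)  # (784,)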

To make visual representation easier, we’ll rotate our input vector down 90 degrees and use vertical arrays of neurons to represent each layer.

Example of neural network layers

One way I like to think of individual neurons is that they are just a placeholder for a number (usually between 0 and 1) and that number is calculated from all of the neurons in the layer before it. How much each neuron impacts another neuron in the next layer depends on how strongly they are related. The following formula is used to calculate how much one single neuron contributes to a neuron in the succeeding layer:

Neural network activation equation

or

activation = (weight × previous activation) + bias

Okay, so I know there is some math here, but it’s not as complex as it may appear. The “a” represents the activation of the neuron, which is usually a number between 0 and 1. When talking about neurons, we say those with a higher value are more active. The “w” represents the weight, which describes how strongly one neuron is connected to a neuron in the succeeding layer and is one of the factors adjusted during training. The “x” is the activation level of the previous neuron, and the “b” is the bias, which allows the model to shift the result and helps neurons learn better.

Neural network activation example

Each neuron’s activation is impacted by all of the neurons in the previous layer. Using the same formula as above for each of those neurons and summing the results inside the activation function gives the full formula:

Equation for activation of all neurons in the network

or

activation = σ(w₁x₁ + w₂x₂ + … + wₙxₙ + b)

Again, some more math that looks kinda scary, but really it’s just performing the above calculation for each neuron from the preceding layer using that neuron’s weight and activation, then adding them all up and putting the result through a function that squashes it to be between 0 and 1. The “σ” represents the activation function, which does that squashing (commonly sigmoid; other choices like ReLU or softmax behave a little differently, but for our simplified version that’s not something you need to dig into).

Calculating neural network activation
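
To make that concrete, here’s a rough sketch in Python of the calculation for a single neuron. The three previous activations, the weights, and the bias are all made-up numbers, and I’m using sigmoid as the squashing function.

import numpy as np

def sigmoid(z):
    # Squashes any number into the range (0, 1).
    return 1 / (1 + np.exp(-z))

# Made-up activations from three neurons in the previous layer...
previous_activations = np.array([0.0, 0.5, 1.0])
# ...and made-up weights connecting them to our neuron, plus its bias.
weights = np.array([0.2, -0.4, 0.9])
bias = 0.1

# Multiply each previous activation by its weight, sum the results, add the bias,
# then squash the total with the activation function.
activation = sigmoid(np.dot(weights, previous_activations) + bias)
print(activation)  # a single number between 0 and 1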

This calculates the activation for one single neuron, so now we need to repeat this exact same step for each of the neurons in the first hidden layer using each neuron’s own weights and biases. This gives us the activation of every neuron in that layer, and we can then use those activations, along with the next set of weights and biases, to calculate the activations of the following layer. We repeat this process until we reach the output layer.

The size of the hidden layers is tweakable; for example, when I was testing this implementation, I used a first hidden layer of 64 neurons and a second hidden layer of 32. Even for our relatively small network with 4 layers (1 input, 2 hidden, and 1 output), there are a lot of connections between neurons and a lot of math to perform. Luckily, computers are very good at the kind of matrix multiplication required for these calculations.
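
As a sketch of what that looks like for the whole network, here’s a forward pass through the 784 → 64 → 32 → 10 architecture in NumPy, with randomly initialized weights and biases standing in for trained ones:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)

# Random weights and biases stand in for trained values.
W1, b1 = rng.normal(size=(64, 784)), rng.normal(size=64)  # input -> hidden layer 1
W2, b2 = rng.normal(size=(32, 64)), rng.normal(size=32)   # hidden layer 1 -> hidden layer 2
W3, b3 = rng.normal(size=(10, 32)), rng.normal(size=10)   # hidden layer 2 -> output

x = rng.random(784)             # our flattened, scaled image

h1 = sigmoid(W1 @ x + b1)       # activations of the first hidden layer
h2 = sigmoid(W2 @ h1 + b2)      # activations of the second hidden layer
output = sigmoid(W3 @ h2 + b3)  # 10 values, one per numeral 0-9

print(output.argmax())          # the numeral this (untrained) network "guesses"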

In this example, the output layer has 10 neurons. After a pass through the network, each neuron in the output layer represents a numeral from 0 to 9 and has a value between 0 and 1 that can be read as the model’s confidence that the input matches that numeral. For our example input of “4”, the ideal output after training would be [0,0,0,0,1,0,0,0,0,0]; however, in a real application, the neurons will all have some level of activation, and ideally the neuron for 4 would be significantly more active than the rest.

Below is a visual representation of a network that I trained, which uses the same neural network architecture and math discussed in this blog. The sample image shows the picture of the numeral used, the bar graph shows the output probabilities for each numeral, and the bottom neural graph shows the neuron activations and weights for the two hidden layers and the output layer. The input layer of the flattened image is not shown, as it would be too big.

Example of a neural network in action

The above is how the math works for moving data forward through our neural network. However, in order for a neural network to do anything meaningful, it needs to go through a lot of training. This is where the math starts to get a bit more complicated, so I won’t get into the specifics (it’s calculus, though, in case you were wondering).

During training, a pre-labelled data set is fed into the neural network. At the start, the weights and biases for each connection between neurons are randomized. When a data point is fed into the network, the output is observed and compared to the desired output. The difference between the actual output and the desired output is referred to as the “cost,” and the greater the disparity, the higher the cost. By changing the weights and biases, the cost changes as well. Because of this, the cost can actually be represented as a function of them, called “the cost function.”

Using a method called “backpropagation,” the weights and biases can be tweaked in the direction that moves them toward a lower cost. This process is repeated backward through each layer, then repeated for each data point in the training set, and then potentially repeated over the whole data set again and again, ideally improving the performance of the model over enough repetitions.

One common way to think of training a neural network is that each weight and bias is a knob that can be turned in either direction. In our example, the network is made up of thousands of knobs, and we can keep tweaking all of them until the network is tuned to provide the desired outputs. Training and backpropagation help inform which way, and how much, to turn each knob to try to improve performance.
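
To make the knob analogy concrete, here’s a toy sketch of a single gradient-descent update. In a real implementation the gradients come from backpropagation; the values below are invented purely for illustration.

import numpy as np

learning_rate = 0.01

# One layer's "knobs": its weights and biases (made-up starting values).
weights = np.array([[0.2, -0.4], [0.9, 0.1]])
biases = np.array([0.05, -0.3])

# Backpropagation would tell us how the cost changes as each knob is turned
# (the gradients). These numbers are invented for illustration.
grad_weights = np.array([[0.5, -0.1], [-0.2, 0.3]])
grad_biases = np.array([0.1, -0.05])

# Turn every knob a small step in the direction that lowers the cost.
weights -= learning_rate * grad_weights
biases -= learning_rate * grad_biases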

If there is sufficient training data, and the network has enough size and parameters for the task, it can be tuned to produce results that can sometimes be useful, such as our example of identifying a hand-written numeral, or, in more complex networks, accurately predicting the next word in a sequence.

Wait, It’s Just Fancy Autocompletion?

Now that we’ve got a basic understanding of neural networks (remember that disclaimer from before: there is some stuff I simplified or left out), we can turn to what this series is actually about: LLMs, which, instead of classifying images, perform Natural Language Processing (NLP). We started by learning about neural networks because in LLMs they are where the real magic happens, allowing them to both understand the semantic meaning of a sequence of words and complete sequences of words. This enables them to respond to things like questions or hold a conversation.

While the implementation and design of the neural networks in LLMs are much, much more complex than the basic example above, the core ideas of how they function, predict an outcome from an input, and are trained all carry over. For example, in one LLM architecture called a transformer decoder, the input is a sequence of text, and the output is the next most likely word to follow that sequence. One of the biggest differences in LLM networks, aside from their much larger embedding sizes and number of hidden layers, is that transformers process data in parallel during training using attention, but during inference, LLMs generate outputs sequentially (auto-regressively), one token at a time. We won’t get much more into the specifics, because the math and details get complicated quickly (at least for me), and we don’t need them at this point for our intro to hacking or securing LLMs.

Transformer LLMs

Almost all modern LLMs are transformer LLMs. These are designed to transform one piece of text into another. One of the first applications was translation, which is essentially transforming text from one language into another. The full transformer architecture consists of both an encoder and a decoder, which we’ll explore first before taking a look at encoder-only and decoder-only LLMs.

Before we dive into the architecture of transformer LLMs, remember how the basic neural network we discussed above works off of numbers, more specifically, vectors? The same goes for LLMs. Each word (or part of a word, we’ll get to that later) needs to be converted into a vector. This process is usually referred to as embedding, and the resulting vector is also called the embedding. This was a part of the process I struggled to wrap my head around when learning. How does one convert a word into numbers?

LLMs use a fixed-length embedding to represent each token. In modern LLMs, embeddings are typically hundreds to thousands of dimensions long (e.g., 768 in smaller models, up to roughly 12,000 in some large-scale models). Each dimension in the embedding (just a number in the array) can then be used during training by the neural network to represent semantic details about the word.

The exact semantic representation of each dimension becomes too complicated to decipher, however, for learning’s sake, you could think of each dimension representing something like colour, gender, plurality, or any of the many meanings of words. For me at least, dealing with dimensions greater than 3 starts to get confusing since my brain likes to think of dimensions as they exist to describe our physical world. Of course, in LLM embeddings, they don’t represent that at all; they represent the meaning of the word.

One fascinating aspect of LLMs is that if you perform some simple vector math on these embeddings, you can start to see very interesting patterns. For example, if you take the difference between the queen and king vectors, and then add that resulting offset vector to the vector for uncle, you’ll land somewhere very close to the vector for aunt.

If we take a 2-dimensional slice for easier visualization, you can think of it looking something like this (again an over-simplified visualization, but the math works the same in the full dimensions).

Example of LLM pattern
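
Here’s a toy version of that arithmetic using the same 2-dimensional simplification. The coordinates are invented, but the offset trick works the same way in the full dimensionality of real embeddings.

import numpy as np

# Invented 2D embeddings purely for illustration; real embeddings have
# hundreds or thousands of dimensions.
embeddings = {
    "king":  np.array([5.0, 1.0]),
    "queen": np.array([5.0, 4.0]),
    "uncle": np.array([2.0, 1.0]),
    "aunt":  np.array([2.0, 4.0]),
}

# The offset pointing from "king" to "queen".
offset = embeddings["queen"] - embeddings["king"]

# Adding that same offset to "uncle" lands right next to "aunt".
result = embeddings["uncle"] + offset
closest = min(embeddings, key=lambda word: np.linalg.norm(embeddings[word] - result))
print(closest)  # aunt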

The first step in LLM processing is to map each word to its corresponding embedding. Remember how I said we’d get to the parts of words later? Let’s cover that now.

In LLMs, the smallest representation of text is referred to as a token. Tokens can be full words; however, they may also be parts of words, a number, a single letter, punctuation, or special tokens like the end-of-sequence token (<eos>). For example, the word “drinking” would most likely be split up into two tokens, one for “drink” and another for “ing”. The entire collection of tokens an LLM uses is referred to as its vocabulary, and each token is assigned a unique integer ID to simplify looking up its corresponding embedding. To make this explanation easier, we’re going to accept another simplification that every word is treated as a single token.
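
As a rough sketch of what that lookup looks like, here’s a toy example. The vocabulary, the token IDs, the 4-dimensional embeddings, and the hard-coded split of “drinking” are all invented; real tokenizers learn their splitting rules from data, and real embedding tables are far larger.

import numpy as np

# A tiny invented vocabulary: each token maps to a unique integer ID.
vocab = {"drink": 0, "ing": 1, "I": 2, "water": 3, "<eos>": 4}

# The embedding table has one row per token ID. Real models learn these values
# during training and use hundreds or thousands of dimensions, not 4.
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 4))

def tokenize(text):
    # Stand-in for a real tokenizer: "drinking" becomes ["drink", "ing"].
    return ["drink", "ing"] if text == "drinking" else text.split()

token_ids = [vocab[token] for token in tokenize("drinking")]
token_embeddings = embedding_table[token_ids]  # shape: (2, 4), one vector per token
print(token_ids, token_embeddings.shape)       # [0, 1] (2, 4)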

In an LLM, the token embeddings are adjusted during training just like the other layers of the network. At the start, the token embeddings are randomized, and over time they are adjusted through backpropagation.

After that detour of how words are represented as multi-dimensional vectors, we can get back to transformer LLM architectures. Let’s dig in first with a visual showing a full encoder-decoder transformer LLM designed to perform English-to-French translation. There will be some steps here that don’t make sense yet, but we’ll tackle them next.

Example of English to French translation LLM

On the left, in orange, we have the encoder. The purpose of the encoder is to take the input in natural language and convert it into vectors that capture the context and semantic meaning of the input. Now you may be thinking, isn’t that what tokenization does by converting the words into their token embeddings? However, language is much more complicated than just the individual meanings of each word. This is why a direct word-for-word translation doesn’t work for most languages.

Take the English word “foot” for example. This could refer to a person’s foot, a unit of measurement, or even the bottom/end of something (“foot of the bed”). If you read the sentence “The sandwich was one foot long”, you can infer the meaning of the word “foot” from the context of the words around it. This is exactly what the encoder’s hidden layers do: they enrich each input embedding with context from the words around it. The output of the encoder is then a context-rich set of vectors intended to capture the semantic meaning of the input sequence.

One way that I like to think of the encoder is that it takes natural language and converts it into a standardized language-independent representation of the semantic meaning. 

At this point, one possibility would be to attempt to convert the vectors back to the desired output language by looking them up against their tokens, but again, natural language is much more nuanced and complex than this. This is where the decoder is used to transform the encoder output into the desired output (shown in green on the right in the above diagram).

Attention Layers and Context

Decoders are trained specifically to predict the next word in a given sequence of words. One extremely important part of decoders is a group of hidden layers called the attention layers. These layers allow the context from other words to be added to the current word traversing the decoder. In an encoder-decoder, there are two types of attention: cross-attention and self-attention. Cross-attention refers to the context provided from the output of the encoder, and self-attention refers to the context from the preceding words that have already been processed by the decoder. In translation, self-attention allows the decoder to pick up on nuances of the output language, and cross-attention helps ensure the correct semantic meaning.

Since the decoder works one word at a time, but also uses self-attention to adapt its predictions based on the previous words generated, a process called auto-regression is used. In auto-regression, after a word is predicted, it’s added back to the sequence, which is then fed back through the model. This is repeated until a special end-of-sequence token (<eos>) is generated.
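
Here’s a minimal sketch of that auto-regression loop in Python. The predict_next_token function is a hypothetical stand-in for a full pass through the decoder; everything else is just the feed-the-output-back-in loop described above.

# predict_next_token() is a hypothetical stand-in for a full decoder pass that
# returns the single most likely next token given the sequence so far.
def generate(prompt_tokens, predict_next_token, max_tokens=50):
    sequence = list(prompt_tokens)
    for _ in range(max_tokens):
        next_token = predict_next_token(sequence)  # predict from everything generated so far
        sequence.append(next_token)                # feed the prediction back into the sequence
        if next_token == "<eos>":                  # stop once the end-of-sequence token appears
            break
    return sequence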

While this is how a full encoder-decoder model works for translation, this same architecture can also be used to take a question input and create an answer output. The encoder provides the context of the question, and the decoder works to generate the words in the answer based on the context provided by the encoder and the output already generated by the decoder. 

Over time, as the self-attention layers of decoders improved, they became practical to use as a standalone architecture: instead of using any context from an encoder, they start the decoding process from the input and continue auto-regression from there. Nowadays, most modern LLMs people interact with, like ChatGPT, are decoder-only transformer LLMs. Encoder-only LLMs are also used to derive the context of text, which can be useful for things like classifying documents for easy retrieval based on their semantic content, or deriving the sentiment of text, such as “happy”, “sad”, or “angry”. In future blog posts, we’ll learn how they can actually be very useful as an added guardrail for decoder-only LLMs.

To wrap up, let’s look at how auto-regression would work for the prompt “Tell me a joke.” The diagram below shows the different steps taken in the process.

Example of an LLM telling a joke

Each word is predicted one at a time, and importantly, the model has no knowledge of any words beyond the one it is currently predicting. In this example, the fact that the prompt “tell me a joke” is included in the context of each step is what drives the output to end up as a joke, even though the model is working forward one word at a time. When the model predicts the first word after the input to be “I’m”, it doesn’t know the rest of the final sentence structure at all, or that the joke is going to be about seafood. The model just keeps building word by word, completing the sequence based on the data it’s been trained on.

So yes, in essence, it is fancy autocomplete. However, the magic really lies in the attention layers and how they can add the full context to the word being predicted.

Reading This Was Worth It, I Promise!

You probably came here because you’re interested in AI hacking or security, and this entire blog just outlines how neural networks and LLMs work! To think outside the box and come up with unique ways to attack LLMs, you need to understand their inner workings. You’ll see why in future blogs, I promise. To wrap up this blog, I want to end on a few interesting thoughts that hopefully make more sense with the knowledge you’ve gained from reading this far.

We Only Kind Of Know What Is Happening Inside

AI researchers design frameworks and train models to produce useful outputs (most of the time). However, once training is complete, the exact reasons behind the behavior of neural networks are usually extremely difficult, if not impossible, to fully understand. Remember how I mentioned that the precise semantic meaning of embeddings is usually difficult to fully decipher? Now add on hundreds of layers that have been trained, and it becomes virtually impossible to pinpoint why a neural network in an LLM predicts the words that it does.

All The Outputs Exist, You Just Need The Prompt To Unlock Them

LLMs are naturally deterministic, which means that if you give the exact same input over and over again, you’ll always get the same output. However, if you’ve used commercial chatbots like ChatGPT, you’ve probably noticed this isn’t the case. This is because of additional mechanisms like temperature and seed values that add some amount of randomness to the words selected as the output. For example, the second-highest statistically predicted word might sometimes be selected instead. Usually, for these chatbots, the temperature is limited to fixed ranges to keep the output within reason (the more randomness, the less sense the responses start to make).
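
Here’s a rough sketch of how temperature influences which word gets picked. The “logits” (the raw scores the model assigns to each candidate next token) are made-up numbers.

import numpy as np

def sample_next_token(logits, temperature=1.0, rng=np.random.default_rng()):
    # Lower temperature sharpens the distribution (more deterministic);
    # higher temperature flattens it (more random).
    scaled = np.array(logits) / temperature
    probs = np.exp(scaled - scaled.max())   # softmax, numerically stabilized
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)  # index of the chosen token

# Made-up scores for three candidate next tokens.
logits = [2.0, 1.5, 0.2]
print(sample_next_token(logits, temperature=0.2))  # almost always picks token 0
print(sample_next_token(logits, temperature=2.0))  # picks the others more often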

Because of this underlying determinism, even though the space of possible outputs is extremely vast, all of the potential outputs for an LLM are stored in the model’s parameters; they just need to be calculated and “unlocked” with the correct prompt. This is an interesting thought experiment when considering the challenges of securing an LLM and the outputs that it could produce.

Decoder Transformer LLMs Are Just Itching To Finish This ________.

Decoder LLMs thrive off of finishing and completing sequences of words; it’s what they’ve been trained to do. In future blogs, we’ll take a look at how this can be used to attack them to generate undesirable outputs.

LLMs Can Recall Facts From Their Learned Parameters

If you ask a properly trained LLM something like “Who was the first person to walk on the moon?”, it will most likely, without needing to search the internet, respond with the correct answer of Neil Armstrong, and depending on the model, it may even give a small blurb about the details of the first moonwalk. For those who don’t know the specifics of how decoder transformer LLMs work, this may not seem that impressive, because of course it’s been trained on data from the internet and saw many passages about this. However, how LLMs recall this information is actually quite spectacular.

Now, with an understanding of auto-regression and LLMs working one word at a time, we can begin to see how an LLM stores facts within its parameters, such as the weights and biases of its layers and its token embeddings. When passing context from the prompt and each new word, the attention layers are able to carry enough context into the new words being generated to properly recover the fact. In this example, details would be passed into the calculation of the vector for the current word for things like name, space, history, and more. All of these combined guide the LLM to recover the facts and information stored in its parameters and predict the words “Neil” and then “Armstrong”.

This same feature of auto-regression with self-attention is also the main contributing factor to what is usually referred to as hallucinations, where the AI makes up something completely wrong. For example, if you asked almost the same question, “Who was the first person to walk on Mars?”, an LLM without sufficient training may respond with “Neil Armstrong” and then make up a fake blurb about the first Mars walk. This happens because of the LLM’s design to work word by word: the attention and auto-regression drive the LLM to predict Neil Armstrong, and once there is even a small error, it compounds on itself by feeding into the context of future words.

Wrapping It All Up!

If you combine these details of how LLMs work, I think they provide an excellent baseline for understanding why it can be hard to secure LLMs on their own without additional architecture and safeguards. Trying to “train” an LLM out of giving improper responses is usually not sufficient on its own, and a clever adversary can most likely use these characteristics of an LLM to find undesirable outputs, such as giving out free discounts, leaking internal training data, or other confidential information. The potential for AI to perform harmful actions using tools it has access to, along with a myriad of other AI risks, is something that we’ll explore later in this blog series.

I hope you found this information interesting and useful, and I’ll see you in the next blog of this series, where I promise we’ll do some hacking of LLMs!


About the Author: Andrew Bellini

My name is Andrew Bellini and I sometimes go as DigitalAndrew on social media. I’m an electrical engineer by trade with a bachelor’s degree in electrical engineering and am a licensed Professional Engineer (P. Eng) in Ontario, Canada. While my background and the majority of my career has been in electrical engineering, I am also an avid and passionate ethical hacker.

I am the instructor of our Beginner’s Guide to IoT and Hardware Hacking, Practical Help Desk, and Assembly 101 courses and I also created the Practical IoT Pentest Associate (PIPA) and Practical Help Desk Associate (PHDA) certifications.

In addition to my love for all things ethical hacking, cybersecurity, CTFs and tech I also am a dad, play guitar and am passionate about the outdoors and fishing.

