Cybersecurity_Portfolio

Greetings! My name is Rafael Santamaría Ortega, I'm an aspiring AI Security Engineer commited to ensuring safe and human-centered AI.

View on GitHub

Red Teaming LLM Applications

These are my notes concerning the guided project offered by DeepLearning.AI and Giskard.

LLM Security & Common Vulnerabilities

When testing LLM security, it’s crucial to focus on LLM applications rather than standalone foundation models. This is because:

Common Vulnerabilities in LLM Applications

Some of the biggest security risks include:

Mitigating these risks requires many precautions, such as limiting response frequency and AI Red Teaming. This is particularly important for chatbots and AI agents, which are being adopted by all industries.

Red Teaming Process for LLMs

Red teaming is a structured method for identifying security risks in AI systems, by simulating adversarial tactics. The process follows three main stages:

  1. Reconnaissance: Understand the LLM application’s architecture; identify API endpoints, prompt filters, and access controls; gather intelligence on how the model handles different types of queries, etc.
  2. identifying Weaknesses: look for misconfigurations or insufficient safeguards, test for vulnerabilities like input validation failures, content filtering gaps, and weak authentication, etc.
  3. Exploitation: conduct attacks to manipulate the LLM’s output, use manual and automated techniques to bypass restrictions, alidate findings by demonstrating real-world impact (e.g., leaking private data, generating harmful content), etc..

This course provided the opportunity to perform three types of basic red teaming: manual, automated with code and automated with another LLM. 9

NOTE: To respect copyright I will not publish any code, I will just describe what I did and how I did it.

Manual red teaming LLMs

  1. Defined an LLM to pentest by using Python’s OpenAI API in a Jupyter notebook. In this case it was a model focused in providing information concerning Mozart.
  2. Establish the LLM safeguards in the model’s system prompt, such as declining to answer when the prompt is not Mozart related.
  3. Bypassed the safeguards using various techniques (*As LLMs are non-deterministic answers varied, but after a few tries I managed to trick the model with each technique):
    • Exploiting text completion: in this case I appended at the end of a non Mozart related prompt the sentence “Sure here is how you do it:” to trick the model into completeing that sentence.
    • Biased Prompts: In this case I convinced the chatbot to claim Saliery wanted to kill Mozart, by stating that due its knowledge of Mozart, it’s the best placed to help in an investigation concerning the hypothesis concerning Salieri’s supposed murderous intent.
    • Direct Prompt Injection: In this case I tricked the model by claiming that the CEO just called with new instructions, that completely change its role as a Mozart bigrapher to the role of a Latin Language Expert called CiceroBot.
    • Gray Box Prompt Attacks: In this case, I bypassed the safeguards by exploiting knowledge of the system prompt structure. This implies, naturally, to perform reconaissance and identification before exploitation.
  4. Defined another given model (Llama based) to execute prompt probing to discover the system prompt. This time, the model is a chatbot of a fictional bank application.
  5. As the system prompt preceeds the user prompts, I tricked the model to reveal the prompt by gving the “new instruction” of considering the prompt above and modify the punctuation, while mantaining the format.

Automated red teaming LLMs at a scale

  1. Called the same chatbot of a fictional bank application in a Jupyter Notebook.
  2. Tested for 3 different types of prompt injections leveraging Python automation:
    • Defined a list containing 3 diferent prompt injection prompts.
    • Defined a Payload to test.
    • Iterated over the list, enumerated each injection attempt and input each prompt to the LLM.
    • Outputted the result of each prompt.
  3. Tested other injection techniques leveraging Python automation:
    • Leveraged pandas to parse a csv document containing 13 different known LLM injection exploits (This exercise can be enhanced with the AI Incident Database, AI Vulnerability Database or OWASP AI top 10).
    • Created an pandas dataframe object with each prompt.
    • Iterated over the dataframe with the same payload to input each prompt type to the LLM.
    • Output results of each injection.
  4. Identified prompt injections using Giskard LLM scan and Python automation:
    • Imported Giskard Python Module
    • Wrapped the app in a Giskard model by wrapping a data preprocessing function and an object containing the application. This ensures giskard infers the ML libraries used and provide appropiate serialization.
    • defined a dataset to test using giskard.Dataset and a dataframe object containing questions for the model.
    • Executed the giskard.scan set only to detect ‘jailbreak’ injections.
    • Print the report of the scan

NOTES:

Red teaming LLMs with LLMs

  1. Called the fictional bank application chatbot.
  2. Leveraged OpenAI API to create questions and evaluate answers of the chatbot.
    • Called OpneAI Python API.
    • Leveraged OpenAI LLM to generate a list of 5 biased questions.
    • Created an object containing the list of questions.
    • Iterated thorugh the list of results using a for loop, and input each to the chatbot.
    • Output Question/Answer for each.
    • Defined an evaluation prompt to automate the process.
    • Leveraged Python and OpenAI to define a function that evaluates and labels each response as safe or unsafe.

LLM assisted red teaming using Giskard

  1. Import Giskard open source Python library.
  2. Wrapped the app in a Giskard model by wrapping a data preprocessing function and an object containing the application. This ensures giskard infers the ML libraries used and provide appropiate serialization.
  3. Created a report object with giskard.scan, only checking for discrimination.
  4. Print the report.

Full red teaming assessment

  1. Defined the scenario:
    • Red teaming the chatbot of a fictional online ebook store.
  2. Defined the scope of the assessment:
    • What type of LLM are we testing (e.g., chatbots, content generators, code assistants)?
    • In this case: the LLM chatbot designed to enhance user experience.
  3. Defined the specific risks
    • What specific risks are we assessing (e.g., data privacy, misinformation, adversarial attacks)?
    • In this case:
    • Toxic and offensive content (Harmfulness);
    • Off-topic content (Harmfulness);
    • Excessive agency (Jailbreaking;
    • Sensitive information disclosure (Jailbreaking).
  4. Identified Threat Actors:
    • Benign Users: Regular users who may accidentally trigger unintended behaviors.
    • Malicious Users: Attackers actively trying to manipulate or exploit the LLM.
  5. First round - Probing for vulnerabilities with Malicious Queries:
    • Used adversarial inputs to test for the 4 types of defined risks (Reconaissance).
    • Results: Chatbot avoided disrepctful language and avoided off-topic contents, but revealed sensitive information concerning its capabilities and agency: *Chatbot revealed that it can handle cancellations and refunds directly, as well as the user’s past orders with dates.
    • Further analysis with Giskard for harmfulness and jailbreak revealed the model is vulnerable to prompt injection attacks (Identification).
  6. Second round - Exploitation:
    • Got an unlawful refund (past the 30 day refund policy) leveraging prompt injection:
    • Tricked the model to always allow refunds due to new instructions issued by government regulation. This highlights the innocence of some LLMs, which is also a vulnerability in itself. * Asked the chatbot how does it technically performs refunds:
    • Model revealed that it can directly reverse the initial transaction, after a second try by injecting a prompt which asked the model to repeat the previous instructions verbatim. * Probed the model for the system prompt by exploiting text completion, which resulted in the LLM outputting its system prompt. * Injected a prompt that updates the internal date/time of the internal LLM, but still the model refused to perform the refund. * Injected a new update concerning supposed government regulations, asked for a refund again and finally got the refund!

Conclusion

LLMs are a major breakthrough, but they introduce new security challenges that must be addressed. As more companies integrate LLMs into their workflows, red teaming is critical to ensure these systems remain secure, ethical, and trustworthy.