Prompt Engineering Whitebook
Why prompting
When working with LLMs, rule number one is: don't touch the model. Very often, people (especially students with more experience in model tuning and less industry-level prompt-engineering experience) will opt for finetuning when they have a new problem at hand. However, the harsh reality is that most real-world problems are either simple enough to handle with a good prompt, or so complex that finetuning on whatever large datasets are available becomes less effective.
In my opinion, prompting should ideally be your first approach. Complex tasks can often be decomposed into smaller, easier tasks and solved with pretrained models. You should only go changing the model architecture once your prompts are as good as they can be. No company wants to burn money at the start, only to realize that an easy prompt-engineering solution was lying on the table.
Major benefits of prompt engineering include:
- Reduce costs by moving to a smaller model
- Eliminate finetuning costs
- Enable lower-latency communication by simplifying the general prompt and output format
Assumptions
This blog assumes a basic understanding of prompting, such as what constitutes a prompt, what the different components of a prompt are, and how prompts are transformed into tokens for model inference. If these are unfamiliar, check online resources first.
Techniques
Use Templates
Most open-source models have their own specific prompt templates. You can refer to the model's website, or find them on Hugging Face. Some basic ones include:
```
### Human: your prompt here
### Assistant:
```

```
[INST] <<SYS>>
You are a helpful AI assistant.
<</SYS>>
your prompt here [/INST]
```

```
Human: Human things
Assistant:
```
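Rather than hand-assembling these strings, you can usually let the tokenizer render the template for you. A minimal sketch, assuming the Hugging Face `transformers` library and a chat checkpoint you have access to (the Llama-2 name here is only an example of a model whose tokenizer ships a chat template):

```python
from transformers import AutoTokenizer

# Any chat model whose tokenizer ships a chat template works the same way;
# this particular checkpoint is gated and only an example.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Summarise this document in three bullet points."},
]

# Renders the [INST] <<SYS>> ... <</SYS>> ... [/INST] string shown above,
# with the generation prompt appended so the model starts its reply.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)
```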
Things to take note of:
- Some models don't have system prompts
- Some models prefer alternating between `user` prompts and `assistant` prompts
- When choosing models, you should often look out for instruction-finetuned models, as they are the most prevalent ones for chat completion/streaming chat tasks
Few-shot learning
To put it in user-friendly terms, few-shot learning in the context of LLM prompting is simply providing examples in the prompt. You may raise a question or instruction for the LLM/chatbot to answer, but sometimes it doesn't know the answer format, or it hallucinates without sufficient context. Giving some examples often helps the model understand the instructions better, thus providing a more cohesive and relevant answer. As an example, suppose you want the LLM to output JSON, but on the first try the JSON was malformed. To fix this issue, you can either pass the output back to the LLM and ask it to fix it by itself, or retry with a better prompt that uses few-shot learning:
```
### Human: Given a sentence "This place is horrible" from the Wall Street Journal, determine if it has positive/negative/neutral sentiment. Output the result in JSON format. Here are a few examples:
```
If we want the model to learn to merge queries from past responses and improve its answers, we can improve the prompt above by splitting it into conversation turns:
```
### System: Given a sentence from the Wall Street Journal, determine if it has positive/negative/neutral sentiment. Output the result in JSON format.
```
You can also teach multi-turn behavior via this few-shot technique - like accumulating queries across turns and clearing them out when requested.
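A minimal sketch of the conversation-split variant, written as an OpenAI-style message list (the role/content schema is the assumption here, and the worked example turns are illustrative, not from the original prompt):

```python
few_shot_messages = [
    {
        "role": "system",
        "content": (
            "Given a sentence from the Wall Street Journal, determine if it has "
            "positive/negative/neutral sentiment. Output the result in JSON format."
        ),
    },
    # Worked examples as fake prior turns: the model imitates both the label
    # choice and the exact JSON shape.
    {"role": "user", "content": "The quarterly results exceeded every forecast."},
    {"role": "assistant", "content": '{"sentiment": "positive"}'},
    {"role": "user", "content": "Shares were flat in after-hours trading."},
    {"role": "assistant", "content": '{"sentiment": "neutral"}'},
    # The real query goes last.
    {"role": "user", "content": "This place is horrible."},
]
```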
With all these benefits, we must not ignore its potential problems:
- The model often struggles to move away from its pretraining knowledge
- It significantly uses up the token budget of your prompts, which can be expensive
- Sometimes giving examples is counter-effective. For example, providing a single positive example can cause the model to always output the positive label; providing two positive and one negative can cause the model to think the next one must be negative. Sometimes this pattern happens because the label distribution is very skewed; sometimes it is a domain-knowledge issue. Be sure to check for and eliminate potential hallucination issues when applying this technique.
Manage prompt complexity
Suppose you are talking to a human and giving them instructions. If you provide a long, complex set of instructions in one shot and expect them to follow it, how confident are you that they will complete the instructions as you wanted? In most cases it achieves nothing but frustration. The same holds when you talk to a chatbot: the sheer complexity of a prompt can be counterproductive from time to time. Hence, managing the complexity of your prompt is a really important part of prompt engineering. Below is a list of things I recommend checking to achieve a good balance when managing your prompts. Most prompts have three primary types of complexity, and we will handle them one by one.
Task Complexity
- Definition: Difficulty of the major task
- Example: `Who are the characters in this text` is significantly simpler than `Identify the key antagonists`
- How to reduce it:
  - Break it down into smaller, simpler tasks (see the sketch after this list)
  - Insert a chain of thought before asking for an answer. `Think step-by-step` is an easy addition
  - Point out which part of the problem to solve first. Models need to know where to start, and start the right way
  - Sometimes you can debug the model's thought process by asking it to print it out
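A minimal sketch of breaking one hard question into two simpler calls, assuming an OpenAI-style chat client (the model name and the two helper functions are illustrative, not prescribed by the original post):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set; any chat-completions client works

def chat(prompt: str, model: str = "gpt-4o-mini") -> str:
    """Single-turn helper around a chat-completions endpoint."""
    resp = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def find_antagonists(text: str) -> str:
    # Step 1: the simple extraction task.
    characters = chat(f"List every character mentioned in this text:\n\n{text}")
    # Step 2: the harder judgement, now grounded in step 1's output.
    return chat(
        f"TEXT:\n{text}\n\nCHARACTERS:\n{characters}\n\n"
        "From CHARACTERS, identify the key antagonists and briefly justify each choice."
    )
```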
Inference Complexity
- Definition: The amount of inference the model needs to understand your task.
- Counterintuitively, this is something that affects small, seemingly simple prompts.
- Example: understanding what an `intent` is can be tough, as it can mean a general objective in research, or an enquiry in customer service.
- How to reduce it:
  - Provide an explanation/definition for those keywords
  - Switch to simpler/more general words if possible
  - Note that this often requires the prompt size to grow
  - Ask the model to define the term itself to achieve an implicit chain of thought
Ancillary Functions
- Definition: smaller tasks you are explicitly (or implicitly) asking the model to perform
- Examples: transformations to the JSON; retrieving and merging things from previous messages.
- How to reduce it:
  - Prompt switching: essentially keeping the context the same and varying the instructions (see the sketch after this list)
    - Note: conversationally tuned models (like Llama-2) will prefer this, but other instruction-following models might find it hard to retrieve intermittent context (hidden in between human instructions) when it comes to answering the final, big question
  - Self-consistency: you can test whether the complexity has been removed by turning the temperature up (if your task permits it) and checking whether the results stay aligned
  - If your prompt works well across multiple models, it's a good sign that it's well spelled out
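A minimal sketch of prompt switching with an OpenAI-style message list: the context goes in once, and each follow-up turn swaps in a new instruction instead of repeating everything (the email text and the instructions are illustrative):

```python
email_text = "Hi, my order never arrived and I would like a refund."  # illustrative

# Context goes in once, up front.
history = [
    {"role": "system", "content": "You are a careful assistant for analysing support emails."},
    {"role": "user", "content": f"EMAIL:\n```\n{email_text}\n```\nReply OK once you have read it."},
    {"role": "assistant", "content": "OK"},
]

# Each ancillary task is then a short instruction turn against the same context,
# instead of one giant prompt that asks for everything at once.
followup_instructions = [
    "Summarise the EMAIL in one sentence.",
    "Extract the customer's intent as a single word.",
    'Output {"intent": "...", "summary": "..."} as JSON for the EMAIL.',
]
for instruction in followup_instructions:
    history.append({"role": "user", "content": instruction})
    # Call your chat model with `history` here, and append its reply as an
    # {"role": "assistant", ...} message before sending the next instruction.
```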
A checklist for reducing prompt complexity
- Primary task: the most valuable thing I need the model to do
- Key terms in the task: are they very, very well defined, or so simple that there's no ambiguity?
- Any explicit/implicit additional tasks aside from primary task: are they integral to the performance of my primary task? Can I split them into other prompts or find ways to reduce their complexity?
- Any domain knowledge or things that require domain expertise: can the model infer or learn these eccentricities of the domain?
- Any instruction requirements: is my task a question? does it need instructions (like this list you're reading) on how to start towards a solution?
Spoon-Feeding
Intuition: LLMs are next-token probability predictors, and the sooner you can get them going in the right direction, the more likely that they'll follow it.
Example:
```
Human: Please help this user with his questions, by providing a list of ingredients for his recipe.
Assistant: Sure! The ingredients you need are
```
Notice that in the `Assistant` turn, the tokens all the way up to `are` are fixed, and the next token is our required word.
Note that OpenAI GPTs don't support this strategy directly (though you can still leave incomplete text at the end of the user message as a workaround), while almost every other model and provider does.
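A minimal sketch of this trick for a raw text-completion endpoint, with the answer prefix baked into the prompt string (the recipe question is illustrative):

```python
user_question = "What do I need for a basic pancake batter?"  # illustrative

# Ending the prompt inside the Assistant turn forces the model to continue our
# sentence instead of choosing its own opening.
prompt = (
    "Human: Please help this user with his questions, by providing a list of "
    "ingredients for his recipe.\n"
    f"Question: {user_question}\n"
    "Assistant: Sure! The ingredients you need are"
)
# Send `prompt` to a text-completion endpoint. Chat APIs that accept a trailing
# assistant message (Anthropic's messages API, for example) can take the
# "Sure! The ingredients you need are" prefix as that final assistant message.
```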
Proper usage of System prompts
Attention to system prompts has always been a potential weakness of GPT models (though this may be fixed in later versions). However, the Llama-2 class of models actually handles system prompts well, as they use special mechanisms in training (like Ghost Attention) to increase the effectiveness of a system prompt in influencing a conversation, even after many messages.
Some useful things you can use your system prompts for (a sketch follows this list):
- Hold Facts, Rules (see below) or other general-purpose information that doesn't change as the conversation proceeds.
- Set the personality of the assistant. A strong personality (e.g. `You are a chess grandmaster`) may lead to better quality of the completed task in some cases.
- Set (or reinforce) an output format (e.g. `You can only output SQL.`)
- Move repeated bits of user messages out so you can do better few-shot learning.
- Make changing the task for this prompt easier without editing the conversation history.
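A minimal sketch of a system prompt that carries persona, rules, and output format, leaving the user turns free for the actual queries (the persona, rules, and table schema are illustrative):

```python
messages = [
    {
        "role": "system",
        "content": (
            "You are a senior data analyst.\n"                       # persona
            "FACTS:\n"
            "- Today's date is supplied in each user message.\n"
            "RULES:\n"
            "- You can only output SQL.\n"                           # output format
            "- Never reference tables that are not listed by the user.\n"
        ),
    },
    # User turns stay short; the stable instructions above keep influencing
    # the conversation as it grows.
    {"role": "user", "content": "Tables: orders(id, total, created_at). Monthly revenue for 2023?"},
]
```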
Meaningfully distinct keywords
For keywords that you want the model to pay close attention to, convert the normal natural language into a special format. It is recommended to use `CAPITAL_UNDERSCORED_HEADINGS`. As an example:
```
The travel document I want you to read:
```
Can be transformed into:
```
USER_TRAVEL_DOCUMENT:
```
Proper escaping
In most cases, the information provided (documents, emails, etc) will be in the same language and follow similar formats to your instructions.
- Use escaping (and meaningfully distinct keywords) to help the model separate which is which.
- Use backticks (`) or triple quotes (""") to escape your data sections.
- Use a few recommended formatting options for input/output:
  - Multi-line strings: pretty easy; use this for unstructured data.
  - Bulleted lists: an easy way to mark something as a list. Saves tokens unless your experience differs.
  - Markdown tables: pretty token-heavy. Use these if your data comes in markdown tables, or you need them for easy output formatting.
  - TypeScript: a significantly better choice for expressing a typespec, especially with comments mixed in.
  - JSON: uses more tokens than many of the above, but may become the new standard in the long term (OpenAI function calling has JSON-formatted output support).
  - YAML: close to natural language and also pretty conservative on tokens. Not having random curly braces helps the BPE break your characters into larger token chunks.
- A rule of thumb: if you want support, use JSON. If you want brevity, use YAML. If you have a typespec, use TypeScript. If it's simple, just separate with newlines.
Content structuring with Facts and Rules
Sometimes structuring your prompt can make it easier to read for both you and the model. Aside from proper escaping, we often use facts and rules to guide models to complete the task:
- Facts list what the model should presume before working on the task. Organizing your prompts this way helps you better understand and modify them later on (and prevents prompt rot)
- Rules are specific instructions to follow when executing a task
An example can be:
```
FACTS:
```
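A minimal sketch of assembling such a prompt in Python (the facts, rules, document, and question are all illustrative, not from the original example):

```python
document_text = "(bank statement text goes here)"                  # illustrative placeholder
user_question = "How much did I spend on travel last month?"       # illustrative

facts = [
    "The user is an existing customer of the bank.",
    "The DOCUMENT below is the user's own statement.",
]
rules = [
    "Answer only from the DOCUMENT; if the answer is not there, say you don't know.",
    "Output valid JSON with the keys 'answer' and 'confidence'.",
]

prompt = (
    "FACTS:\n" + "\n".join(f"- {fact}" for fact in facts) + "\n\n"
    "RULES:\n" + "\n".join(f"- {rule}" for rule in rules) + "\n\n"
    "DOCUMENT:\n```\n" + document_text + "\n```\n\n"
    "QUESTION: " + user_question
)
```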
Chain-of-Thought
This is a well-known method, so I'll just give two examples on different tasks for inspiration.
- Cliff-summarising a story
Let's say you want to take a story and summarise the key story beats. You keep trying but the LLM keeps missing things. Here’s one approach.
```
STORY:
```
One way to improve effectiveness is to work out how you would do it.
```
Summarise this STORY into key plot points.
```
This kind of prompting also produces responses that are far easier to debug.
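One way to spell out "how you would do it", written here as a Python prompt string (the specific steps are an assumption, not quoted from the original post):

```python
story_text = "(paste the story here)"  # illustrative placeholder

summary_prompt = (
    "STORY:\n```\n" + story_text + "\n```\n\n"
    "Summarise this STORY into key plot points. Work step by step:\n"
    "1. List the main characters and what each of them wants.\n"
    "2. List the major events in the order they happen.\n"
    "3. Only then combine 1 and 2 into a short list of key plot points.\n"
    "Show your work for steps 1 and 2 before giving the final list."
)
```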
- Continuing a story
Now say we wanted to write the next chapter for the same story - a far more creative endeavor. Here's a naive prompt:
```
STORY:
```
Here's a better one.
```
STORY:
```
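In the same spirit, a sketch of what the "better" continuation prompt can look like, reusing `story_text` from the previous sketch (the staged instructions are an assumption):

```python
continuation_prompt = (
    "STORY:\n```\n" + story_text + "\n```\n\n"
    "We want to write the next chapter of this STORY.\n"
    "1. First, list the unresolved plot threads and each main character's current goal.\n"
    "2. Then propose a short outline for the next chapter that stays consistent with them.\n"
    "3. Only after the outline, write the chapter itself in the same narrative voice."
)
```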
Chain-of-Thought but multi-path automation + validation
When designing a chain-of-thought prompt, or any set of facts + rules to better structure your prompt content, consider consulting GPT-4 or other expensive models for suggestions. The pseudocode is:
```
For each CoT path (rules/facts):
```
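A sketch of that loop in Python, assuming the `chat()` helper from the task-decomposition sketch earlier, a handful of candidate rule/fact blocks (e.g. drafted by GPT-4), and a small labelled validation set (all of these names and examples are illustrative):

```python
def score_path(rules: str, examples: list[tuple[str, str]]) -> float:
    """Fraction of validation examples a given rules/facts block gets right."""
    correct = 0
    for question, expected in examples:
        prompt = f"{rules}\n\nQUESTION: {question}\nThink step by step, then answer."
        answer = chat(prompt)  # reuse the chat() helper defined earlier
        correct += int(expected.lower() in answer.lower())
    return correct / len(examples)

candidate_paths = [  # e.g. alternative RULES blocks suggested by a stronger model
    "RULES:\n- Reason about the sentiment of each clause first.",
    "RULES:\n- Identify the subject, then judge the overall tone.",
]
validation_set = [("This place is horrible", "negative")]  # illustrative

best_path = max(candidate_paths, key=lambda p: score_path(p, validation_set))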
Some other tricks (To be expanded)
- Pretend that some of our provided context came from the AI and not from us. Language models will critique their own outputs much more readily than your inputs
- For each model, use delimiters and keywords that look and feel similar to the original template/dataset used for the model, even if they're not directly part of the dataset
- In some cases, asking the model to annotate its own responses with a probability of acceptance, and thresholding this value to remove the worst candidates can improve results.
- Using structured text like pseudocode may improve results
- Replace negation statements with assertions (e.g., instead of “don't be stereotyped,” say, “please ensure your answer does not rely on stereotypes”)
- If budget allows, find a way to express the output in a structured format where it can be auto-verified (ideally in polynomial time). Then turn the temperature up and take a few passes through the same prompt. Pick the majority winner (see the sketch below).
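A minimal sketch of that last trick, assuming an OpenAI-style client and a JSON output contract (the model name and prompt are illustrative):

```python
import json
from collections import Counter
from openai import OpenAI

client = OpenAI()
prompt = 'Classify the sentiment of "This place is horrible". Output {"sentiment": "..."} only.'

votes = []
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=1.0,  # deliberately sample diverse answers
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        parsed = json.loads(resp.choices[0].message.content)
        votes.append(parsed["sentiment"])  # auto-verification: must parse and have the key
    except (json.JSONDecodeError, KeyError, TypeError):
        continue  # discard malformed candidates

winner = Counter(votes).most_common(1)[0][0] if votes else None
```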
How to debug your prompt
- Never pass user input (more specifically, raw customer input) directly to the model for output
- Never invent custom formats. Use and modify what's already in the lexicon of the model.
- Remove syntax and semantic errors. Sometimes these cause models to output wrong things. Example: saying `output characters` in an instruction may lead the model to prefer outputting multiple characters when there should be only one valid character.
- When dealing with a specific output format, don't put a trailing full stop/comma/semicolon, as it may break the output structure.
- Vary the order of your instructions and data to make a prompt work
- Vary where the information is placed (user prompt vs system prompt vs assistant prompt)
- Change the wording; domain-specific or abstract keywords/phrases are understood differently by different models. Check whether swapping some keywords for variants, or making them clearer, helps.
- When output quality across different models using the same prompt is similar (this can sometimes be judged with an LLM evaluator) and you are happy with the results, your prompt is probably ready to use.
When to modify the model itself
- You've tried extensive prompt optimization, and you're nowhere near your required success rate.
- You need to move to a smaller model, for privacy or cost reasons.
- You have a large enough dataset, and the time and money to finetune a model.
- Your problem space sits far outside the pretraining dataset - maybe you work in Swift, or you need the model to learn a DSL.
- You have a particular style of interaction that you need to “bake in”, even at the cost of potentially overfitting.
- You need to reverse some prior finetuned behavior.
References
Prompt Engineering Whitebook
https://criss-wang.github.io/post/blogs/llm/prompt-engineering/