Prompt Engineering Whitebook

A collection of what I have learned and gathered about prompting, both online and at work

2024.03.20 · 11 min read · by Zhenlin Wang

Why prompting

When working with LLMs, rule number one is: don’t touch the model. Very often, people (especially students with more experience in model tuning than in industrial-level prompt engineering) opt for finetuning when they have a new problem at hand. However, the harsh reality is that most real-world problems are either simple enough to handle with a good prompt, or complex enough that finetuning on the available large datasets becomes less effective.

In my opinion, prompting should be your first approach. Complex tasks can often be decomposed into smaller, easier tasks and solved with pretrained models. You should only go modifying the model itself once your prompts are as good as they can be. No company wants to burn money at the start, only to realize that an easy prompt-engineering solution was lying on the table all along.

Major benefits of prompt engineering include:

  1. Reduce costs by moving to a smaller model
  2. Eliminate finetuning costs
  3. Enable lower-latency communication by changing the general format

Assumptions

This blog assumes a basic understanding of prompting: what forms a prompt, what its different components are, and how prompts are transformed into tokens for model inference. You may check online resources if you don’t know about them.

Techniques

Use Templates

Most open-source models have their own prompt templates. You can find them on the model’s website or on Hugging Face. Some basic ones include:

### Human: your prompt here
### Assistant:

[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.
<</SYS>>
{prompt}[/INST]

You are a helpful AI assistant.

USER: {prompt}
ASSISTANT:

H: Human things
Assistant: {{Response}}
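
To keep templates like these out of your application code, one option is a small lookup table. Below is a minimal sketch in Python; the registry keys (`hash_human`, `inst_sys`, `user_assistant`) and the `render_prompt` helper are names I made up for illustration, not identifiers from any library. Always check the exact template against the model's own documentation.

```python
# Hypothetical registry of prompt templates; the keys are illustrative labels only.
TEMPLATES = {
    "hash_human": "### Human: {prompt}\n### Assistant:",
    "inst_sys": "[INST] <<SYS>>\n{system}\n<</SYS>>\n{prompt}[/INST]",
    "user_assistant": "{system}\n\nUSER: {prompt}\nASSISTANT:",
}


def render_prompt(template_name: str, prompt: str,
                  system: str = "You are a helpful AI assistant.") -> str:
    """Fill the chosen template with the user prompt and (if used) the system text."""
    template = TEMPLATES[template_name]
    # str.format ignores keyword arguments the template does not use,
    # so templates without a {system} slot work unchanged.
    return template.format(prompt=prompt, system=system)


print(render_prompt("user_assistant", "What is a prompt template?"))
```

Swapping models then means changing one key rather than rewriting every prompt string.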

Things to take note of:

Few-shot learning

Put simply, few-shot learning in the context of LLM prompting means providing examples in the prompt. You may pose a question or instruction for the LLM/chatbot to answer, but sometimes it doesn’t know the answer format, or it hallucinates without sufficient context. Giving some examples often helps the model understand the instructions better, and thus provide a more coherent and relevant answer. As an example, suppose one wants an LLM to output JSON, but on the first try the JSON is malformed. To fix this, one can either pass the output back to the LLM and ask it to fix it, or retry with a better prompt that uses few-shot learning:

### Human: Given a sentence "This place is horrible" from Wall Street Journal, determine if it has positive/negative/neutral sentiment. Output the result in JSON format. Here are a few examples:
{"sentence": "The food is enjoyable", "sentiment": "positive"}
{"sentence": "Princess Kate was diagnosed with cancer", "sentiment": "neutral"}
{"sentence": "War criminals need to be punished heavily", "sentiment": "negative"}
### Assistant:
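
Deciding whether to retry requires some way to tell that the output is malformed. Here is a minimal sketch of such a check using Python's standard `json` module; the function name `parse_sentiment` and the exact set of required keys are my assumptions, based on the example format in the prompt.

```python
import json

# Labels taken from the prompt: positive/negative/neutral.
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def parse_sentiment(raw_output: str):
    """Return the parsed dict if the model output is usable, otherwise None."""
    try:
        data = json.loads(raw_output.strip())
    except json.JSONDecodeError:
        return None  # not valid JSON at all
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object
    sentiment = data.get("sentiment")
    if not isinstance(sentiment, str) or sentiment not in ALLOWED_SENTIMENTS:
        return None  # missing or unexpected label
    if not isinstance(data.get("sentence"), str):
        return None  # missing sentence field
    return data


good = '{"sentence": "This place is horrible", "sentiment": "negative"}'
bad = '{"sentence": "This place is horrible", "sentiment": '
print(parse_sentiment(good) is not None)  # usable output
print(parse_sentiment(bad) is None)       # malformed, so retry or ask for a fix
```

A `None` result is the signal to either feed the output back for repair or retry with the few-shot prompt.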

If we want the model to learn to merge queries from past responses and improve its answer, we can improve the prompt above by splitting it into conversation turns:

### System: Given a sentence from Wall Street Journal, determine if it has positive/negative/neutral sentiment. Output the result in JSON format.
### Human: The sentence is "The food is enjoyable", the output JSON is:
### Assistant: {"sentence": "The food is enjoyable", "sentiment": "positive"}
### Human: The sentence is "Princess Kate was diagnosed with cancer", the output JSON is:
### Assistant: {"sentence": "Princess Kate was diagnosed with cancer", "sentiment": "neutral"}
### Human: The sentence is "War criminals need to be punished heavily", the output JSON is:
### Assistant: {"sentence": "War criminals need to be punished heavily", "sentiment": "negative"}
### Human: The sentence is "The place is horrible", the output JSON is:
### Assistant:
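
Writing these turns by hand gets tedious once you have more than a handful of examples. The sketch below builds the same conversation from a list of (sentence, sentiment) pairs. The `role`/`content` dictionary layout is a common chat-message convention that I am assuming here; adapt it to whatever your model or provider expects. `build_few_shot_messages` is a hypothetical helper name.

```python
import json


def build_few_shot_messages(system: str, examples, query_sentence: str):
    """Turn (sentence, sentiment) pairs into alternating user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for sentence, sentiment in examples:
        messages.append({
            "role": "user",
            "content": f'The sentence is "{sentence}", the output JSON is:',
        })
        # json.dumps guarantees the demonstrated answers are themselves valid JSON.
        messages.append({
            "role": "assistant",
            "content": json.dumps({"sentence": sentence, "sentiment": sentiment}),
        })
    # The final user turn is the real query; the model completes the assistant turn.
    messages.append({
        "role": "user",
        "content": f'The sentence is "{query_sentence}", the output JSON is:',
    })
    return messages


messages = build_few_shot_messages(
    "Determine if the sentence has positive/negative/neutral sentiment. "
    "Output the result in JSON format.",
    [("The food is enjoyable", "positive")],
    "The place is horrible",
)
for message in messages:
    print(message["role"], "->", message["content"])
```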

You can also use this few-shot technique to teach multi-turn behavior, such as adding queries together and clearing them out when requested.

With all these benefits, we must not ignore its potential problems:

Manage prompt complexity

Suppose you are giving instructions to a human. If you provide a long, complex set of instructions in one shot and expect them to follow it, how confident are you that they will complete the instructions as you wanted? In most cases it achieves nothing but frustration. The same applies when you talk to a chatbot: the sheer complexity of a prompt can be counterproductive. Hence, managing the complexity of your prompt is a really important part of prompt engineering. Below are the things I recommend checking to strike a good balance when managing prompts for your tasks. Most prompts have three primary types of complexity, and we will handle them one by one.

Task Complexity

Inference Complexity

Ancillary Functions

A checklist for reducing prompt complexity

  1. Primary task
  2. The most valuable thing I need the model to do
  3. Key terms in the task: are they very, very well defined, or so simple that there’s no ambiguity?
  4. Any explicit/implicit additional tasks aside from primary task: are they integral to the performance of my primary task? Can I split them into other prompts or find ways to reduce their complexity?
  5. Any domain knowledge or things that require domain expertise: can the model infer or learn the eccentricities of this domain?
  6. Any instruction requirements: is my task a question? Does it need instructions (like this list you’re reading) on how to start towards a solution?

Spoon-Feeding

Intuition: LLMs are next-token probability predictors, and the sooner you can get them going in the right direction, the more likely that they’ll follow it.

Example:

Human: Please help this user with his questions, by providing a list of ingredients for his recipe.

Human: I'm making a mud pie!

Assistant: Cool! The ingredients you'll need are

Notice that in the Assistant turn, the tokens all the way up to "are" are fixed, so the next token the model generates is the first word we actually need.

Note that OpenAI GPTs don’t support this strategy (as a workaround, you can still leave uncompleted text at the end of your prompt), but almost every other model and provider does.
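
For a plain-text completion interface, spoon-feeding amounts to ending the prompt partway through the assistant's turn. Here is a minimal sketch following the Human/Assistant layout of the example above; `build_prefilled_prompt` and `join_reply` are hypothetical helper names, and how you send the prompt to a model depends on your provider.

```python
def build_prefilled_prompt(instruction: str, user_message: str,
                           assistant_prefix: str) -> str:
    """Build a prompt that ends mid-answer so the model continues from the prefix."""
    return (
        f"Human: {instruction}\n\n"
        f"Human: {user_message}\n\n"
        f"Assistant: {assistant_prefix}"
    )


def join_reply(assistant_prefix: str, completion: str) -> str:
    """The model only returns the continuation, so re-attach the prefix we supplied."""
    return assistant_prefix + completion


prefix = "Cool! The ingredients you'll need are"
prompt = build_prefilled_prompt(
    "Please help this user with his questions, by providing a list of "
    "ingredients for his recipe.",
    "I'm making a mud pie!",
    prefix,
)
print(prompt)
```

Remember to re-attach the prefix before showing the reply to a user, since the completion alone starts mid-sentence.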

Proper usage of System prompts

Attention to system prompts has long been a potential weakness of GPT models (though this may be fixed in later versions). Llama-2-class models, however, handle system prompts well, as they use special mechanisms in training (like Ghost Attention) to keep a system prompt influencing the conversation, even after many messages.

Some useful things you can use your system prompts for:

  1. Hold Facts, Rules (see below) or other general-purpose information that doesn’t change as the conversation proceeds.
  2. Set the personality of the assistant. A strong personality (e.g. You are a chess grandmaster) may improve the quality of the completed task in some cases.
  3. Set (or reinforce) an output format (e.g. You can only output SQL.)
  4. Move repeated bits of user messages out so you can do better few-shot learning.
  5. Make changing the task for this prompt easier without editing the conversation history.
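
The last point is easy to see in code: if the task lives in the system prompt, changing it is a one-line swap that leaves the history alone. This is a minimal sketch assuming the common `role`/`content` message convention; `swap_system_prompt` is a hypothetical helper, not part of any SDK.

```python
def swap_system_prompt(messages, new_system: str):
    """Return a new message list with the system prompt replaced.

    The original list is not modified, so the conversation history stays intact.
    """
    history = [m for m in messages if m["role"] != "system"]
    return [{"role": "system", "content": new_system}] + history


conversation = [
    {"role": "system", "content": "You can only output SQL."},
    {"role": "user", "content": "List all customers from Singapore."},
    {"role": "assistant", "content": "SELECT * FROM customers WHERE country = 'Singapore';"},
]

# Same history, different task.
explained = swap_system_prompt(conversation, "Explain the SQL in plain English.")
print(explained[0]["content"])
```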

Meaningfully distinct keywords

For keywords that you want the model to pay close attention to, convert them from ordinary natural language into a special format. CAPITAL_UNDERSCORED_HEADINGS are recommended. As an example:

The travel document I want you to read:
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.

Use the travel document provided to extract the key destinations the user is travelling to.

Can be transformed into:

USER_TRAVEL_DOCUMENT:
"""
Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus PageMaker including versions of Lorem Ipsum.
"""

Extract the key destinations from USER_TRAVEL_DOCUMENT.
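
If you build prompts programmatically, a small helper can apply this convention consistently. The sketch below is one possible way to do it; `to_label` and `labelled_block` are names I invented for illustration, and the triple-quote delimiter simply mirrors the example above.

```python
import re


def to_label(name: str) -> str:
    """Turn a phrase like 'user travel document' into USER_TRAVEL_DOCUMENT."""
    words = re.findall(r"[A-Za-z0-9]+", name)
    return "_".join(word.upper() for word in words)


def labelled_block(name: str, text: str) -> str:
    """Wrap a document in triple quotes under a CAPITAL_UNDERSCORED heading."""
    return f'{to_label(name)}:\n"""\n{text.strip()}\n"""'


label = to_label("user travel document")
prompt = (
    labelled_block("user travel document", "Day 1: Fly to Tokyo. Day 3: Train to Kyoto.")
    + f"\n\nExtract the key destinations from {label}."
)
print(prompt)
```

Generating the label once and reusing it in the instruction keeps the heading and the reference from drifting apart.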

Proper escaping

In most cases, the information provided (documents, emails, etc) will be in the same language and follow similar formats to your instructions.

Content structuring with Facts and Rules

Structuring your prompt can make it easier to read for both you and the model. Aside from proper escaping, we often use facts and rules to guide the model through the task:

FACTS:
1. Today's date is 6 September 2023.
2. Pax, pp, per person all mean the same thing.

RULES:
1. You need to outline your logical premise in Prolog before each sentence.
2. Write the text as the user, not on behalf of them.
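
Keeping facts and rules as plain lists in code makes them easy to add, remove, and reorder while you experiment. A minimal sketch, with `numbered` and `facts_and_rules` as hypothetical helper names:

```python
def numbered(items) -> str:
    """Render items as a 1-based numbered list, one per line."""
    return "\n".join(f"{i}. {item}" for i, item in enumerate(items, start=1))


def facts_and_rules(facts, rules) -> str:
    """Render the FACTS and RULES sections in the layout shown above."""
    return f"FACTS:\n{numbered(facts)}\n\nRULES:\n{numbered(rules)}"


print(facts_and_rules(
    facts=[
        "Today's date is 6 September 2023.",
        "Pax, pp, per person all mean the same thing.",
    ],
    rules=[
        "You need to outline your logical premise in Prolog before each sentence.",
        "Write the text as the user, not on behalf of them.",
    ],
))
```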

Chain-of-Thought

This is a well-known method, so I’ll just give two examples with different tasks for inspiration.

  1. Cliff-summarising a story

Let’s say you want to take a story and summarise the key story beats. You keep trying but the LLM keeps missing things. Here’s one approach.

STORY:
"""
Just wakin' up in the mornin', gotta thank God
I don't know but today seems kinda odd
No barkin' from the dog, no smog
And momma cooked a breakfast with no hog
"""

Summarise this story into the key plot points.

One way to improve effectiveness is to work out how you would do it.

Summarise this STORY into key plot points.

STORY:
"""
Just wakin' up in the mornin', gotta thank God
I don't know but today seems kinda odd
No barkin' from the dog, no smog
And momma cooked a breakfast with no hog
"""

Go step by step to get the plot points:
1. Outline the key players in the story. Who are the characters?
2. List the major plot points and who was involved.
3. For each plot point, list the consequences of this happening.
4. For each consequence, see if there are any story beats missing from the first list, and list them.
5. Resummarise the story in terms of beats, labelling each point as positive or negative and its contribution to the story.

This kind of prompting also produces responses that are far easier to debug.
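
One way to take advantage of that is to split the response back into its numbered steps and check which ones the model skipped. The sketch below assumes the model answers with lines starting `1.`, `2.`, and so on, which is not guaranteed; `split_numbered_steps` and `missing_steps` are hypothetical helpers.

```python
import re


def split_numbered_steps(response: str) -> dict:
    """Map step number -> text for lines that start with '1.', '2.', ..."""
    steps = {}
    current = None
    for line in response.splitlines():
        match = re.match(r"\s*(\d+)\.\s*(.*)", line)
        if match:
            current = int(match.group(1))
            steps[current] = match.group(2)
        elif current is not None and line.strip():
            # Continuation line: attach it to the step we are currently inside.
            steps[current] += "\n" + line.strip()
    return steps


def missing_steps(steps: dict, expected_count: int) -> list:
    """List the step numbers that are absent or empty in the response."""
    return [n for n in range(1, expected_count + 1)
            if not steps.get(n, "").strip()]


response = "1. Characters: the narrator, momma\n2. Plot points: woke up, ate breakfast\n4. No missing beats"
steps = split_numbered_steps(response)
print(missing_steps(steps, expected_count=5))  # steps the model did not answer
```

When the final summary looks wrong, the missing or weak intermediate step usually tells you which instruction to rewrite.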

  2. Continuing a story

Now say we wanted to write the next chapter for the same story - a far more creative endeavor. Here’s a naive prompt:

STORY:
"""
Just wakin' up in the mornin', gotta thank God
I don't know but today seems kinda odd
No barkin' from the dog, no smog
And momma cooked a breakfast with no hog
"""

Write the next chapter of the STORY.

Here’s a better one.

STORY:
"""
Just wakin' up in the mornin', gotta thank God
I don't know but today seems kinda odd
No barkin' from the dog, no smog
And momma cooked a breakfast with no hog
"""

We need to write the next chapter of STORY, but let's go through the steps:
1. List the main characters in the STORY, and what their personalities are.
2. What are their arcs so far? Label each one on a scale of 1-10 for how interesting it is, and how important it is to the main story.
3. List which arcs are unfinished.
4. List 5 new characters that could be introduced in the next chapter.
5. List 5 potential, fantastical things that could happen - major story beats - in the next chapter.
6. Grade the new characters and the new occurrences 1-10 on how fun they would be, and how much they fit within the theme of the existing story.
7. Write the next chapter.

Chain-of-Thought but multi-path automation + validation

When designing a chain-of-thought prompt, or any set of facts and rules to better structure your prompt content, consider consulting GPT-4 or another expensive model for suggestions. The pseudocode is:

For each COT path (rules/facts):
	Build a prompt with this context
	Run inference to get results
	Perform debugging step and generate a score
Select the candidate with the highest score
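
The loop above can be sketched in Python as follows. Everything model-specific is passed in as a function: `build_prompt`, `run_inference`, and `score_output` are placeholders you would supply yourself (for example, `score_output` might ask a stronger model to grade the result). None of these names come from a real library.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def select_best_path(
    candidates: Iterable[T],
    build_prompt: Callable[[T], str],
    run_inference: Callable[[str], str],
    score_output: Callable[[str], float],
) -> Tuple[T, str, float]:
    """Try every candidate path and return (candidate, output, score) for the best one."""
    best = None
    for candidate in candidates:
        prompt = build_prompt(candidate)   # build prompt with this context
        output = run_inference(prompt)     # run inference to get results
        score = score_output(output)       # debugging step that yields a score
        if best is None or score > best[2]:
            best = (candidate, output, score)
    if best is None:
        raise ValueError("no candidates were provided")
    return best


# Toy stand-ins so the sketch runs without a model: longer output scores higher.
winner = select_best_path(
    ["a", "bb", "ccc"],
    build_prompt=lambda steps: steps,
    run_inference=lambda prompt: prompt * 2,
    score_output=len,
)
print(winner)
```

On ties the first candidate wins, which keeps the result stable across runs as long as scoring is deterministic.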

Some other tricks (To be expanded)

How to debug your prompt

When to modify the model itself

  1. You’ve tried extensive prompt optimization, and you’re nowhere near your required success rate.
  2. You need to move to a smaller model, for privacy or cost reasons.
  3. You have a large enough dataset, and the time and money to finetune a model.
  4. Your problem space sits far outside the pretraining dataset - maybe you work in Swift, or you need to train a DSL.
  5. You have a particular style of interaction that you need to “bake in”, even at the cost of potentially overfitting.
  6. You need to reverse some prior finetuned behavior.

References