Prompting in NLP: Prompt-based zero-shot learning
Prompt-based learning is becoming a new paradigm in the NLP field due to its simplicity. The GPT models and T5 are the strongest early examples of this prompting paradigm. The GPT-3 model achieved remarkable few-shot performance through in-context learning, leveraging a natural-language prompt and a few task demonstrations.
T5 showed that we can recast any NLP problem as a text-to-text task and made a breakthrough (T5, T0, ExT5).
Likewise, reformulating downstream tasks as masked language modeling (MLM) or another pre-training objective of an autoencoder language model has recently shown great effect. The reformulation is done by adding task-specific tokens to the input sequence for conditioning, which gives us the ability to solve many problems with input manipulation alone. For instance, to get the sentiment of the sentence "I liked the film" in zero-shot mode, we simply run a forward pass of "I liked the film. It was [MASK]" through the MLM. Most probably the model will return a positive word such as great or good. In theory this sounds nice, but in practice it may not be that simple.
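To make this concrete, here is a minimal sketch of the idea using the Hugging Face fill-mask pipeline; the model choice and template wording are only illustrative assumptions, and the exact top words will vary.

from transformers import pipeline

# Load a masked language model behind the generic fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Append a template with a [MASK] slot and let the MLM fill it in.
for pred in fill_mask("I liked the film. It was [MASK].")[:5]:
    print(pred["token_str"], round(pred["score"], 4))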
With prompting, we do not need any supervised training or parameter updates, since we simply and directly rely on the objective function of the pre-trained model (such as MLM or CLM). During the forward pass, we have a chance to guide the model by intervening on the input rather than passing it as is. Changing the input X a bit can make many downstream tasks surprisingly easy, as long as those tasks can be naturally posed as a "fill-in-the-blank" problem, as in MLM.
Adding extra tokens or a prefix P to the input, [X; P] = X', gives the model something to condition on during inference. We can define the process in three steps:
Original X -> Template(X) = X' -> LM fills the [MASK] -> Label y
First, a proper template modifies the original X into a textual prompt X' containing an empty [MASK] slot. Second, the LM fills the slot (or generates text). Third, the predicted words are mapped to the original labels, such as great -> 1 and bad -> 0. This is a mapping function M : V -> Y from individual words (or tokens) in the vocabulary to the task label space.
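Here is a small sketch of these three steps put together; the template, the label words, and the verbalizer mapping below are my own illustrative choices.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def make_prompt(x):
    # Step 1: the template turns the original X into X' with an empty [MASK] slot.
    return f"{x} It was [MASK]."

# Step 3: the verbalizer M : V -> Y maps label words to task labels.
verbalizer = {"great": 1, "good": 1, "wonderful": 1, "bad": 0, "terrible": 0, "awful": 0}

def predict(x):
    # Step 2: the LM fills the slot; keep the best-scoring word that is a label word.
    for pred in fill_mask(make_prompt(x), top_k=50):
        if pred["token_str"] in verbalizer:
            return verbalizer[pred["token_str"]]
    return None  # no label word among the top candidates

print(predict("I liked the film."))  # likely 1 for a positive review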
Here is a simple illustration of how prompting works.
We can apply similar template-based solutions to a variety of NLP tasks by exploiting the language model objective. Here are some template examples for prompting:
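These patterns are my own illustrative examples rather than an exhaustive list:

# Illustrative prompt templates for different tasks ({X}, {entity}, {premise},
# {hypothesis} are placeholders to be filled with the actual input).
templates = {
    "sentiment":  "{X} It was [MASK].",
    "topic":      "{X} The topic is a type of [MASK].",
    "ner":        "{X} {entity} is a type of [MASK].",
    "entailment": "{premise}? [MASK], {hypothesis}",
}

print(templates["sentiment"].format(X="I liked the film."))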
Another important idea for improving prompting performance is to incorporate demonstrations as additional context. Large pre-trained language models such as GPT-3 have a strong ability to do in-context learning: we randomly sample a few examples and concatenate them with the input X. The model then performs the downstream task simply by conditioning on the prompt and the input-output examples. Without being explicitly trained to do so, the language model learns from these examples at inference time, without any gradient updates. However, the number of demonstrations that can be used is limited by the maximum input length that the model allows.
Here is an illustration of how demonstrations work.
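A rough sketch of how such a demonstration prompt can be assembled (the examples and formatting are made up for illustration):

# A few labeled input-output examples to prepend as demonstrations.
demonstrations = [
    ("The movie was a waste of time.", "negative"),
    ("A touching and beautifully shot film.", "positive"),
]

def build_prompt(x):
    # Concatenate the demonstrations, then append the new input with an empty label slot.
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    parts.append(f"Review: {x}\nSentiment:")
    return "\n\n".join(parts)

print(build_prompt("I liked the film a lot."))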
Let's see some code in the next part!
Hands-on implementation for prompting
You can access all the code that I shared here in the following repo.
First, we load the necessary libraries and set the BERT model that we will use!
from transformers import AutoModelForMaskedLM, AutoTokenizer
import torch
model_path="bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_path)
Load the Prompting class (please use prompt.py from the GitHub repo that I shared)!
from prompt import Prompting
prompting= Prompting(model=model_path)
prompt="Because it was [MASK]."
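If you are curious about what happens under the hood, here is a rough sketch of what such a class might look like; it assumes a masked-LM head and averages logits over label words, so the actual prompt.py in the repo may differ in its details.

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

class SimplePrompting:
    def __init__(self, model):
        self.tokenizer = AutoTokenizer.from_pretrained(model)
        self.model = AutoModelForMaskedLM.from_pretrained(model)

    def _mask_logits(self, text):
        # Forward pass; return the vocabulary logits at the [MASK] position.
        inputs = self.tokenizer(text, return_tensors="pt")
        mask_pos = (inputs.input_ids[0] == self.tokenizer.mask_token_id).nonzero().item()
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return logits[0, mask_pos]

    def prompt_pred(self, text, top_k=10):
        # Top-k vocabulary words for the [MASK] slot, with their raw scores.
        scores = self._mask_logits(text)
        values, indices = scores.topk(top_k)
        tokens = self.tokenizer.convert_ids_to_tokens(indices.tolist())
        return list(zip(tokens, values))

    def compute_tokens_prob(self, text, token_list1, token_list2):
        # Relative probability of the two groups of label words at the [MASK] position.
        scores = self._mask_logits(text)
        ids1 = self.tokenizer.convert_tokens_to_ids(token_list1)
        ids2 = self.tokenizer.convert_tokens_to_ids(token_list2)
        group_scores = torch.stack([scores[ids1].mean(), scores[ids2].mean()])
        return torch.softmax(group_scores, dim=0)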
Let's pass a positive sentence:
text="I really like the film a lot."
prompting.prompt_pred(text+prompt)[:10]
Output: [('fantastic', tensor(8.1031)), ('funny', tensor(8.0502)), ('fun', tensor(7.8139)), ...]
Passing a negative sentence:
text="I did not like the film."
prompting.prompt_pred(text+prompt)[:10]
Output: [('bad', tensor(9.8677)), ('wrong', tensor(8.6400)), ('awful', tensor(8.5508)), ...]
Producing the results based on a list of neg/pos words
Now we pass a list of negative/positive words rather than a single negative or positive word (token).
text="not worth watching"
prompting.compute_tokens_prob(text+prompt, token_list1=["great","amazing","good"], token_list2=["bad","awful","terrible"])
Output: tensor([0.5319, 0.4681])
Debiasing the language model
Biased training data may produce biased language models. We know that LMs are biased in the sense that they are much more sensitive to frequent words in the training data than to infrequent ones. This is similar to the well-known fact that LMs can pick up racist or sexist content from their training data and reflect it in their decisions. Since the token embeddings suffer from such biases, the model tends to fill the masked token [MASK] with the "good" label more often than expected.
Let us pass an empty template to the model to see that bias!
prompting.compute_tokens_prob("it was "+ prompting.tokenizer.mask_token +".", token_list1=["good"], token_list2=["bad"])
Output: tensor([0.85, 0.15])
As you can see, the model assigns the "good" label (word) to the empty template with 85% probability. Therefore, we set a THRESHOLD at that value to get better results: an input is classified as positive only if its "good" probability exceeds this threshold.
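A minimal sketch of how such a threshold could be applied (the exact label words and decision rule are my assumptions):

THRESHOLD = 0.85  # probability of "good" on the empty template

def predict_sentiment(text):
    prob_good = prompting.compute_tokens_prob(
        text + prompt, token_list1=["good"], token_list2=["bad"]
    )[0].item()
    # Classify as positive only if the "good" probability beats the bias threshold.
    return "positive" if prob_good > THRESHOLD else "negative"

print(predict_sentiment("I really like the film a lot."))
print(predict_sentiment("I did not like the film."))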
Here is what I got for the IMDB sentiment analysis dataset in this setting using the threshold method!
bert-base-uncased: 73.5
bert-large-uncased: 77.25
As you can see, I got better results with the large model than with the base one. Many papers suggest that larger models perform better, especially with prompting!
Named Entity Recognition in a zero-shot setting
We apply the following template to solve the NER problem!
<Sentence.... John ... > John is a type of [MASK]
Here is the code to do so!
prompting.prompt_pred("John went to Paris to visit the University. John is a type of [MASK].")[:5]Output:[('man', tensor(8.1382)),
('john', tensor(7.1325)),
('guy', tensor(6.9672)),
('writer', tensor(6.4336)),
('philosopher', tensor(6.3823))]
John is a very common name, so the model can tell that it refers to a person even without any context; this may not be surprising. Let me use my own name, Savaş, since it rarely appears in English text, that is, in the training data of the model.
prompting.prompt_pred("Savaş went to Paris to visit the university. Savaş is a type of [MASK].")[:5]
Output: [('philosopher', tensor(7.6558)), ('poet', tensor(7.5621)), ('saint', tensor(7.0104)), ('man', tensor(6.8890)), ('pigeon', tensor(6.6780))]
Wow! The BERT model says I am a philosopher. Thank you BERT, it is very kind of you!
Let's apply person-or-city binary classification to check whether Savaş is a city or a person. But first we run an empty template to see the bias!
prompting.compute_tokens_prob("It is a type of [MASK].", token_list1=["person","man"], token_list2=["location","city","place"])
Output: tensor([0.7603, 0.2397])
Well, our threshold is a probability of 76.03%.
prompting.compute_tokens_prob("Savaş went to Paris to visit the parliament. Savaş is a type of [MASK].", token_list1=["person","man"], token_list2=["location","city","place"])
Output: tensor([9.9987e-01, 1.2744e-04])
99.98%, perfect: Savaş is definitely not a city! Let's check Paris. But Paris is a very common city name, so let's change it to Laris.
prompting.compute_tokens_prob("Savaş went to Laris to visit the parliament. Laris is a type of [MASK].", token_list1=["person","man"], token_list2=["location","city","place"])
Output: tensor([0.3263, 0.6737])
Wonderful! Less than 76%, so Laris is a city then! Another run to make it harder.
prompting.compute_tokens_prob("Savas went to XYZ to visit friends. XYZ is a type of [MASK].", token_list1=["person","man"], token_list2=["location","city","place"])
Output: tensor([0.5516, 0.4484])
Good. Since it is lower than the threshold, we classify it as a location (LOC). Of course, person-vs-city binary classification makes the problem simpler than a normal NER setting. Therefore, we need to define the problem as a four-class classification task, since some tokens (such as be, the, like, to, etc.) cannot be assigned to any pre-defined entity. We can simply extend the model with PER, LOC, ORG, and Other classes. I leave it as homework for you :)
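To get you started, here is one possible way such an extension could look; the label words per class are my own choices, and this simple argmax ignores the bias/threshold issue discussed above.

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Candidate label words for each entity class (illustrative choices).
label_words = {
    "PER": ["person", "man", "woman"],
    "LOC": ["location", "city", "place"],
    "ORG": ["organization", "company", "institution"],
    "O":   ["thing", "word"],
}

def classify_entity(sentence, entity):
    text = f"{sentence} {entity} is a type of [MASK]."
    class_scores = {}
    for label, words in label_words.items():
        # Score each class by the probability mass of its label words at the [MASK] slot.
        preds = fill_mask(text, targets=words)
        class_scores[label] = sum(p["score"] for p in preds)
    return max(class_scores, key=class_scores.get)

print(classify_entity("Savaş went to Paris to visit the parliament.", "Savaş"))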
Topic Classification
We simply add the prompt text to the end of the sentence, as follows.
prompting.prompt_pred("Savas went to Paris to study computer science. He started to learn basic stuff like programming, algorithms, and operating systems. The topic is a type of [MASK].")[:10]
Output: [('mathematics', tensor(8.8438)), ('computer', tensor(7.9713)), ('programming', tensor(7.7146)), ('computing', tensor(7.6635)), ('math', tensor(7.5143)), ('algebra', tensor(7.1716)), ('computers', tensor(7.0013)), ('game', tensor(6.9694)), ('physics', tensor(6.9225)), ('computation', tensor(6.8152))]
We can simply cast it as a multi-label classification problem. Another homework :)
Sentence Embeddings
It is surprisingly easy to obtain a sentence embedding using prompting. Here is the code (I also define the feature-extraction pipeline fe that the snippet relies on):
from transformers import pipeline

text = "the film is ok. it means [MASK]."
indexed_tokens = tokenizer(text, return_tensors="pt").input_ids
tokenized_text = tokenizer.convert_ids_to_tokens(indexed_tokens[0])
# Position of the [MASK] token in the input.
mask_pos = (indexed_tokens[0] == tokenizer.mask_token_id).nonzero().item()
# fe: a feature-extraction pipeline returning the hidden states of every token.
fe = pipeline("feature-extraction", model=model_path, tokenizer=model_path)
text_emb = fe(text)
mask_emb = text_emb[0][mask_pos]
len(mask_emb)  # 768 for bert-base-uncased
The resulting mask_emb is our sentence embedding of text. Actually, the BERT model produces embeddings for [CLS] and every other position. The classic way to get a sentence embedding from these token embeddings is to use the [CLS] embedding or average pooling, but applying prompting can yield better results. If you want to learn more about how to do so, please read the paper "PromptBERT: Improving BERT Sentence Embeddings with Prompts".
Advanced Prompt-based Methods
We have seen that prompt-based zero-shot learning can achieve good results in a fully unsupervised setting, but it does not outperform its supervised counterpart. It is, however, possible to fine-tune a prompt-based model with a few examples. Since the fully supervised approach is highly dependent on large-scale labeled data, which can be expensive to obtain and prepare for training, a method in between these two approaches, called prompt-based fine-tuning, is more feasible. The results can be improved by incorporating a few examples: we can either fine-tune the LM on them or use them as demonstrations.
In the literature, there are many successful applications that achieve SOTA results using prompting in a few-shot fashion, even with far smaller LMs than the GPT models. Schick and Schütze (2020a,b) applied a semi-supervised training procedure, namely PET, and obtained better results than GPT-3 with a much smaller LM. PET utilizes several [MASK] patterns to fine-tune multiple models on small training sets, even with only 10 examples. The ensemble of these models is then used to label an unlabeled dataset with soft labels. This cycle is repeated iteratively with an increasing number of training examples, which is called iPET. This study showed stronger performance than its fully supervised counterpart. They also propose techniques to automatically identify suitable label words.
Another work is LM-BFF (Better Few-shot Fine-tuning of language models) by Gao et al. (2021), "Making Pre-trained Language Models Better Few-shot Learners". They studied few-shot learning with smaller language models, for which fine-tuning is computationally efficient. LM-BFF makes four important contributions to prompt-based fine-tuning. First, they explore ways of automatically identifying label words that outperform manual prompts. Second, they show how to formulate and exploit regression problems with the prompting method: the classification task is cast as a regression problem in the range [0, 1], with the mapping "bad" -> 0 and "good" -> 1. Third, they address the problem of automatically finding suitable templates; to do so, T5 is utilized to fill in missing spans.
The last contribution is about demonstrations. GPT-3's in-context learning approach naively concatenates the input with demonstration examples randomly drawn from the training set. But the number of examples that can be concatenated is limited by the maximum input length of the model, typically 512 tokens for BERT-like models. Therefore, instead of adding random examples, it is possible to add higher-quality demonstrations: LM-BFF selects sentences that are semantically close to the input, based on sentence embedding methods such as S-BERT.
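Here is a rough sketch of that selection idea using the sentence-transformers library; the model name and toy examples are assumptions, and the actual LM-BFF procedure differs in its details.

from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# A tiny pool of labeled candidates to draw demonstrations from.
train_examples = [
    ("The plot was predictable and dull.", "negative"),
    ("An absolute masterpiece of modern cinema.", "positive"),
    ("I fell asleep halfway through.", "negative"),
    ("The soundtrack alone is worth the ticket.", "positive"),
]

def pick_demonstrations(x, k=2):
    # Embed the input and all candidates, then keep the k most similar ones.
    texts = [t for t, _ in train_examples]
    sims = util.cos_sim(encoder.encode(x), encoder.encode(texts))[0]
    top = sims.argsort(descending=True)[:k]
    return [train_examples[int(i)] for i in top]

print(pick_demonstrations("I really liked the film a lot."))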
The prompts we have mentioned so far are classified as discrete prompts, where natural-language prefixes are added to the original sentence. But we know that, in the end, models work with numeric representations; that is, models can effectively work with a learnable soft prompt to perform many downstream tasks. Rather than additional human-readable word prompts, numeric prompts may be better, since learnable soft prompts can be optimized directly instead of searching over discrete prompts. Unlike discrete text prompts, soft prompts are learned through backpropagation and can be tuned for a specific task.
The study "The Power of Scale for Parameter-Efficient Prompt Tuning" explored prompt tuning and built an effective mechanism for learning soft prompts. This approach keeps the language model frozen and fine-tunes only the tunable soft prompts, which yields parameter-efficient tuning.
Unlike other approaches, tuned prompts require only around ~100K parameters per task rather than 1B parameters, which makes such soft-prompt models parameter-efficient!
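To make the idea concrete, here is a minimal sketch of soft prompting: a handful of learnable prompt vectors prepended to the input embeddings of a frozen BERT encoder. Unlike the paper (which uses a frozen T5 and trains only the prompt), this sketch also adds a small classification head for simplicity; all names and hyperparameters are illustrative.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SoftPromptClassifier(nn.Module):
    def __init__(self, model_name="bert-base-uncased", n_prompt_tokens=20, n_labels=2):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.backbone = AutoModel.from_pretrained(model_name)
        for p in self.backbone.parameters():
            p.requires_grad = False  # the language model stays frozen
        hidden = self.backbone.config.hidden_size
        # The only trainable parts: the soft prompt and a small classification head.
        self.soft_prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden) * 0.02)
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, texts):
        enc = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        tok_emb = self.backbone.embeddings.word_embeddings(enc.input_ids)
        batch = tok_emb.size(0)
        # Prepend the learned prompt vectors to the token embeddings.
        prompt = self.soft_prompt.unsqueeze(0).expand(batch, -1, -1)
        inputs_embeds = torch.cat([prompt, tok_emb], dim=1)
        prompt_mask = torch.ones(batch, prompt.size(1), dtype=enc.attention_mask.dtype)
        attention_mask = torch.cat([prompt_mask, enc.attention_mask], dim=1)
        out = self.backbone(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
        return self.head(out.last_hidden_state[:, 0])  # classify from the first position

clf = SoftPromptClassifier()
print(clf(["I liked the film a lot."]).shape)  # torch.Size([1, 2])

With 20 prompt tokens and a hidden size of 768, the soft prompt itself accounts for only 20 x 768 ≈ 15K trainable parameters in this sketch, which illustrates why such approaches are so parameter-efficient.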