How do you measure the quality of language generation output?

The article on prompt engineering is very interesting, thanks for sharing! It discusses useful techniques for influencing the output generated by a language model, but it only uses anecdotal examples to show that the techniques have been successful. What techniques have people developed for assessing the quality of the output, e.g.

  • grammar
  • some measurement of relevance to the topic in the prompt

Without a way to measure output quality, I don’t think we can definitively show that prompt engineering works (despite everything pointing towards it working).

It really depends on the task. For example, we can shape a prompt for sentiment classification, in which case we can evaluate it with traditional classification metrics. Free-form generated text has more variety in how it can be evaluated. Perplexity is one common measure, and it correlates with performance on some downstream tasks, such as conversational use cases. Given the wide variety of use cases, many developers examine the generations qualitatively to judge their fit for the use case. In research, human evaluation is often the benchmark for complicated tasks that are harder to quantify.
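For the perplexity case: it can be computed directly from per-token log-probabilities, which some generation APIs return alongside the text. A minimal sketch (the `logprobs` list here is made up for illustration, not real model output):

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(negative mean log-probability per token).

    Lower is better: the model was less "surprised" by the text.
    """
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

# Hypothetical per-token log-probs for a short generation
logprobs = [-0.1, -2.3, -0.5, -1.2]
print(perplexity(logprobs))
```

One caveat: perplexity only tells you how likely the text is under a model, not whether it is relevant to the prompt, so it is usually paired with task-specific or human evaluation.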

Do you have a specific use case in mind? Happy to brainstorm about the best approach.

Sorry to hijack the topic; I have a similar question in mind.

In our project, we want the user to confirm the output only in some specific cases.

For example:
If the user asks: “I want to find coffee shop near Helsinki”
Cohere will extract “coffee shop” and “Helsinki”, great! Then I use a popup to ask the user to confirm that this is the data they want.
But if the user enters “asdf”, Cohere definitely can’t guess anything because the input is garbage. Is there a way to prevent the confirmation popup from appearing, or even to prevent Cohere from doing anything at all?

Of course the length of the text can be checked, but sometimes the user can enter something like “asdfasdfasfasdfasfasdfasdfasdf”.
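Beyond a length check, one option is a cheap client-side pre-filter that runs before the model call. The sketch below uses character entropy plus vowel ratio as heuristics; the thresholds are guesses that would need tuning on real queries (short legitimate inputs can be false positives), and this is not a Cohere feature, just a local check:

```python
import math
from collections import Counter

def looks_like_gibberish(text: str) -> bool:
    """Cheap pre-filter: flag input unlikely to be a real query.

    Heuristics (thresholds are guesses, tune on your own data):
    - low character entropy catches repeated mashing like 'asdfasdf...'
    - a near-zero vowel ratio catches consonant strings like 'sdfgh'
    """
    letters = [c.lower() for c in text if c.isalpha()]
    if len(letters) < 3:
        return True
    counts = Counter(letters)
    total = len(letters)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    vowel_ratio = sum(counts.get(v, 0) for v in "aeiou") / total
    # 'asdfasdf...' cycles over only 4 letters, so entropy stays <= 2.0 bits
    return entropy < 2.1 or vowel_ratio < 0.15

print(looks_like_gibberish("asdfasdfasfasdfasfasdfasdfasdf"))       # True
print(looks_like_gibberish("I want to find coffee shop near Helsinki"))  # False
```

If the filter flags the input, you can skip the extraction call and the confirmation popup entirely. For anything the filter misses, you could also fall back on the model's own confidence (e.g. only show the popup when the extraction returns non-empty entities).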