Naming text clusters of short texts

Text and document clustering (using embeddings) allows for some very useful analysis of text.

I’m keen on developing a good way to use generative models to assign names to text clusters. My first approach is to name each cluster from keywords extracted with the c-TF-IDF (class-based TF-IDF) algorithm: it surfaces keywords that are important within a cluster while also favoring words that distinguish that cluster from the others.
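For readers who have not seen c-TF-IDF before, here is a minimal pure-Python sketch of the idea. It is not the code used for the results in this post: the function name `ctfidf_keywords` and the toy texts are made up for illustration, and the weighting follows the commonly documented form `tf * log(1 + A / f)`, where `A` is the average number of words per cluster and `f` is the word's count across all clusters.

```python
import math
from collections import Counter

def ctfidf_keywords(docs_per_cluster, top_n=3):
    """Rank words per cluster with a c-TF-IDF-style score (illustrative sketch)."""
    # Treat each cluster as one big document and count its words.
    counts = {
        cid: Counter(" ".join(texts).lower().split())
        for cid, texts in docs_per_cluster.items()
    }
    # Word counts over all clusters, and the average cluster size in words.
    total = Counter()
    for c in counts.values():
        total.update(c)
    avg_words = sum(total.values()) / len(counts)

    keywords = {}
    for cid, c in counts.items():
        n_words = sum(c.values())
        # Frequent in this cluster and rare elsewhere gives a high score.
        scores = {
            w: (n / n_words) * math.log(1 + avg_words / total[w])
            for w, n in c.items()
        }
        # Sort by score, breaking ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        keywords[cid] = ranked[:top_n]
    return keywords

# Toy data, invented for this example only.
clusters = {
    "books": ["best books to read", "what book are you reading", "favorite books"],
    "jobs": ["remote software jobs", "best remote jobs", "freelancer seeking remote work"],
}
print(ctfidf_keywords(clusters, top_n=2))
```

Note that "best" appears in both toy clusters, so it is pushed down in each ranking, which is the behavior described above. Without stopword removal, filler words can still score well, as the keyword lists in this post show.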

I named a few clusters myself and added them as examples in this prompt:

The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to

Here are its results for each cluster:

1- new, site, facebook, happened, account, internet, search, does, hn, google
Model’s suggested theme: the new Facebook search engine.

2- 2015, 2016, november, april, 2019, 2020, 2021, hired, wants, hiring
Model’s suggested theme: hiring for a freelancer to join the team of a startup.

3- open, source, web, code, software, using, tools, alternatives, best, use
Model’s suggested theme: open source software.

4- life, reading, ve, hn, blogs, best, book, favorite, read, books
Model’s suggested theme: the life of a blogger who reads books and blogs about them.

5- project, startups, passive, money, did, business, successful, company, income, startup
Model’s suggested theme: passive income from a successful startup.

6- career, tech, developer, remote, software, jobs, seeking, work, job, freelancer
Model’s suggested theme: finding a career in the tech industry as a software developer.

7- advice, things, people, doing, deal, working, career, did, work, life
Model’s suggested theme: advice on how to deal with things in your career and life.

8- self, cs, courses, design, good, best, programming, resources, learning, learn
Model’s suggested theme: learning how to code.


Some of these are better than others. For the first one, I probably would’ve named this a “Big Tech” cluster. I’m biased by having looked at the articles, though. The model over-indexed on “Facebook” here and made a good attempt at making a sentence out of the word soup.

For the next iteration, I’m thinking of reshaping the prompt to include a few titles from each cluster in the examples, and to better indicate to the model that the keywords are simply keywords from a group of articles, so we’re not nudging the model to overextend itself making a sentence out of seemingly unrelated words without much context.
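As a rough sketch of what such a reshaped prompt could look like (this is an assumption, not a tested prompt): each block lists a few titles, then the keywords, then a short name. The function names, the wording of the prompt lines, and the sample titles below are all invented for illustration.

```python
def format_cluster(titles, keywords, name=""):
    """Render one cluster as a prompt block; leave `name` empty for the query."""
    lines = ["Sample titles from a group of articles:"]
    lines += [f"- {title}" for title in titles]
    lines.append("Keywords from the group: " + ", ".join(keywords))
    # The final block ends right after the colon so the model completes the name.
    lines.append("Short name for the group:" + (f" {name}" if name else ""))
    return "\n".join(lines)

def build_titles_prompt(labeled_examples, query_titles, query_keywords):
    """Join labeled example blocks and the unlabeled query block with '---'."""
    blocks = [format_cluster(t, k, n) for t, k, n in labeled_examples]
    blocks.append(format_cluster(query_titles, query_keywords))
    return "\n---\n".join(blocks)

# Hypothetical titles; the keywords and the label come from the examples above.
labeled = [
    (
        ["What are the best books you read this year?", "Books that changed your career?"],
        ["books", "book", "read", "best"],
        "favorite books to read",
    ),
]
query_titles = ["Best open source alternatives to popular tools?"]
query_keywords = ["open", "source", "web", "code", "software"]

prompt = build_titles_prompt(labeled, query_titles, query_keywords)
print(prompt)
```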

Happy to share the code I’m working with if there’s interest. Let me know.


Hi @jay would love to check the code and experiment/share any other methods.

Happy to! Will be sharing the code this week :+1:

This is the class for the structured prompt:

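import cohere

# Assumption: `co` is a Cohere client created with your own API key.
co = cohere.Client("YOUR_API_KEY")


class cohereStructuredPrompt():
    def __init__(self, examples, example_labels, labels, task_description, prompt_output_indicator):
        self.examples = examples
        self.example_labels = example_labels
        self.labels = labels
        self.task_description = task_description
        self.prompt_output_indicator = prompt_output_indicator

    def make_prompt(self, example):
        # Append the new example with an empty label so the model completes it.
        examples = self.examples + [example]
        labels = self.example_labels + [""]
        # Examples may be lists of keywords or ready-made strings; join lists so
        # the string concatenation below does not raise a TypeError.
        examples = [e if isinstance(e, str) else ", ".join(e) for e in examples]
        return "\n---\n".join(
            self.task_description + examples[i] + "\n" +
            self.prompt_output_indicator + labels[i]
            for i in range(len(examples)))

    def generate(self, example):
        # Request five candidates with their likelihoods so the best one can be
        # picked afterwards; generation stops at the first newline.
        result = co.generate(
          model='large',
          prompt=self.make_prompt(example),
          max_tokens=50,
          temperature=0.3,
          num_generations=5,
          return_likelihoods='GENERATION',
          stop_sequences=["\n"])
        return result.generations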

Notice that the Generate call has num_generations=5 – we get five results and choose the best from amongst them (defined as having the best average likelihood score).

These are the examples I labeled:


example_inputs = [['books', 'book', 'read', 'the', 'you', 'are', 'what', 'best', 'in', 'your'],
 ['startup', 'company', 'yc', 'failed'],
 ['freelancer', 'wants', 'hired', 'be', 'who', 'seeking', 'to', '2014', '2020', 'april']]


example_outputs = [' favorite books to read.',  
                   ' startup companies and their failures.', 
                   ' hiring for a freelancer to join the team of a startup.']

And this is how the prompt object is put together:

clusterNamer = cohereStructuredPrompt(example_inputs, 
                                       example_outputs, [],
                                       "The common theme of the following words: ", 
                                       "is that they all relate to")

Now that the prompt is ready, we can loop over the clusters and generate five suggested titles for each cluster:

results = []
for idx in range(len(words_per_class)):
    
    topic_words = get_topic_keywords(words_per_class, idx)
    
    cluster_names = clusterNamer.generate(', '.join(topic_words))
    results.append(cluster_names)

So now results is a list of lists – containing 5 suggested names for each cluster. We can then proceed to get the top name for each cluster:

import numpy as np

topic_names = []
for row in results:
    # Pick the candidate with the highest likelihood among the five generations.
    index_of_most_likely_gen = np.argmax([gen.likelihood for gen in row])
    most_likely_gen = row[index_of_most_likely_gen].text
    topic_names.append(most_likely_gen)
    
topic_names
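If you want to try the selection step without calling the API, here is a small standalone sketch. `FakeGeneration` is a hypothetical stand-in that only mimics the `text` and `likelihood` attributes used above, and the likelihood values are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class FakeGeneration:
    """Hypothetical stand-in for a generation result (text plus likelihood)."""
    text: str
    likelihood: float

def pick_most_likely(generations):
    """Return the text of the candidate with the highest likelihood."""
    best = max(generations, key=lambda g: g.likelihood)
    # Strip the leading space the model emits after the prompt's last word.
    return best.text.strip()

# Invented candidates; log-likelihoods closer to zero mean "more likely".
candidates = [
    FakeGeneration(" open source software.", -0.42),
    FakeGeneration(" the best open source tools for the web.", -0.95),
    FakeGeneration(" using code.", -1.30),
]
best_name = pick_most_likely(candidates)
print(best_name)
```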

I found that extracting keyword (KW) phrases instead of just single words significantly improves cluster name generation.

It would be nice to see what the generated prompt looks like.

Oh that’s very interesting. How are you generating the phrases?

It would be nice to see what the generated prompt looks like.

For phrases you mean? The one for keywords is listed above:

The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to
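To reproduce this prompt string locally without calling the API, here is a minimal standalone sketch. It mirrors the string-building that `make_prompt` is meant to perform, joining each keyword list into a comma-separated string first; the function name `build_prompt` is hypothetical and not part of the class.

```python
def build_prompt(examples, example_labels, task_description, output_indicator, query):
    """Assemble the few-shot prompt: labeled examples first, then the open query."""
    all_examples = examples + [query]
    all_labels = example_labels + [""]  # empty label: the model fills it in
    blocks = []
    for words, label in zip(all_examples, all_labels):
        # Accept either a list of keywords or an already joined string.
        text = words if isinstance(words, str) else ", ".join(words)
        blocks.append(task_description + text + "\n" + output_indicator + label)
    return "\n---\n".join(blocks)

example_inputs = [
    ['books', 'book', 'read', 'the', 'you', 'are', 'what', 'best', 'in', 'your'],
    ['startup', 'company', 'yc', 'failed'],
    ['freelancer', 'wants', 'hired', 'be', 'who', 'seeking', 'to', '2014', '2020', 'april'],
]
example_outputs = [
    ' favorite books to read.',
    ' startup companies and their failures.',
    ' hiring for a freelancer to join the team of a startup.',
]

prompt = build_prompt(
    example_inputs,
    example_outputs,
    "The common theme of the following words: ",
    "is that they all relate to",
    "open, source, web, code",
)
print(prompt)
```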

Thanks. I implemented a custom IT-industry KW extraction process that combines a few methods: lexical (dictionary-based) matching, corpus-level TF-IDF over bi-grams and tri-grams, and some NER via spaCy. I use this process to extract keywords/phrases from a corpus of IT docs, cluster these KWs (HDBSCAN), clean them up a bit using an outlier process of sorts, then finally prompt GPT-3 to name each cluster and extract some business themes from them.

This is the prompt I used, which returned a KW cluster label of “Supply chain management”.

Please describe this cluster of keywords using one phrase:

[‘Inventory’, ‘Inventory Control’, ‘Strategic Supplier Relations Group’, ‘Supplier Relationship Management’, ‘Supply Chain Finance’, ‘Supply Chain Management’, ‘Supply Chain Planning’, ‘What is supply chain management’, ‘blockchain supply chain’, ‘chain management’, ‘chain management scm’, ‘chain management software’, ‘demanddriven supply chain’, ‘financial supply chain management’, ‘food supply chain’, ‘global supply chain’, ‘improve supply chain’, ‘inventory management’, ‘inventory optimization’, ‘inventory turns’, ‘supplier relationship management’, ‘supply chain’, ‘supply chain analytics’, ‘supply chain disruption’, ‘supply chain finance’, ‘supply chain management’, ‘supply chain planning’, ‘supply chain software’, ‘supply chain sustainability’, ‘supply chain system’, ‘supply chain technology’, ‘supply chain visibility’]

The one phrase that describes this cluster is:

Very cool. Thanks for sharing! Seeing that BERTopic’s ClassTFIDF is a subclass of TfidfTransformer, I think that should allow for ngrams of multiple words. Cool note!
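To make the multi-word idea concrete, here is a minimal pure-Python sketch of scoring bi-grams and tri-grams with a corpus-level TF-IDF. It is not the pipeline described above and not BERTopic's implementation: the function names, the smoothed IDF, and the choice to sum scores over documents are assumptions made for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n_min=2, n_max=3):
    """Yield all word n-grams of length n_min..n_max as space-joined strings."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def top_phrases(docs, top_n=5):
    """Rank phrases by their TF-IDF score summed over all documents."""
    doc_phrases = [Counter(ngrams(d.lower().split())) for d in docs]
    # Document frequency: in how many documents each phrase appears.
    df = Counter()
    for counts in doc_phrases:
        df.update(counts.keys())
    n_docs = len(docs)
    scores = Counter()
    for counts in doc_phrases:
        for phrase, tf in counts.items():
            idf = math.log((1 + n_docs) / (1 + df[phrase])) + 1  # smoothed IDF
            scores[phrase] += tf * idf
    # Sort by score, breaking ties alphabetically for stable output.
    ranked = sorted(scores, key=lambda p: (-scores[p], p))
    return ranked[:top_n]

docs = [
    "supply chain management software",
    "global supply chain management",
    "inventory management and supply chain planning",
]
print(top_phrases(docs, top_n=3))
```

The resulting phrases could then be joined with commas and passed to the cluster-naming prompt in place of single words.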