Naming text clusters of short texts

Text and document clustering (using embeddings) enables some very useful kinds of text analysis.

I’m keen on developing a good way to use generative models to assign names to text clusters. My first approach is to assign the names using keywords extracted with the c-TF-IDF algorithm, because it not only surfaces the important keywords for a given cluster, but also ensures they differ from the keywords of other clusters.
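
For reference, this is roughly what c-TF-IDF keyword extraction looks like using BERTopic’s ClassTfidfTransformer (a minimal sketch; cluster_docs is a hypothetical dict mapping each cluster id to its list of titles):

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from bertopic.vectorizers import ClassTfidfTransformer

# One "class document" per cluster: all of its titles concatenated together
class_docs = [" ".join(docs) for docs in cluster_docs.values()]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(class_docs)

# c-TF-IDF scores a term highly when it is frequent in one cluster but rare in the others
ctfidf = ClassTfidfTransformer().fit_transform(counts)

words = vectorizer.get_feature_names_out()
for row in ctfidf.toarray():
    print([words[i] for i in row.argsort()[::-1][:10]])  # top 10 keywords per cluster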

I named a few clusters myself and added them as examples in this prompt:

The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to

Here are the model’s results for each cluster:

1- new, site, facebook, happened, account, internet, search, does, hn, google
Model’s suggested theme: the new Facebook search engine.

2- 2015, 2016, november, april, 2019, 2020, 2021, hired, wants, hiring
Model’s suggested theme: hiring for a freelancer to join the team of a startup.

3- open, source, web, code, software, using, tools, alternatives, best, use
Model’s suggested theme: open source software.

4- life, reading, ve, hn, blogs, best, book, favorite, read, books
Model’s suggested theme: the life of a blogger who reads books and blogs about them.

5- project, startups, passive, money, did, business, successful, company, income, startup
Model’s suggested theme: passive income from a successful startup.

6- career, tech, developer, remote, software, jobs, seeking, work, job, freelancer
Model’s suggested theme: finding a career in the tech industry as a software developer.

7- advice, things, people, doing, deal, working, career, did, work, life
Model’s suggested theme: advice on how to deal with things in your career and life.

8- self, cs, courses, design, good, best, programming, resources, learning, learn
Model’s suggested theme: learning how to code.


Some of these are better than others. For the first one, I probably would’ve named this a “Big Tech” cluster. I’m biased by looking at the articles though. The model over-indexed on “Facebook” here and made a good attempt at making a sentence out of the word soup.

For the next iteration, I’m thinking of reshaping the prompt to add a few titles from each cluster to the examples, and to better indicate to the model that the keywords are simply keywords from a group of articles, so we’re not nudging the model to overextend itself by making a sentence out of seemingly unrelated words without much context.
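
One possible shape for that reworked prompt (purely hypothetical, just to illustrate the direction):

The following keywords were extracted from the titles of a group of articles: <keywords here>
A few example titles from the group: <a few titles here>
A short name describing this group of articles is: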

Happy to share the code I’m working with if there’s interest. Let me know.


Hi @jay, would love to check out the code and experiment/share any other methods.

Happy to! Will be sharing the code this week :+1:

This is the class for the structured prompt:

import cohere

co = cohere.Client("YOUR_API_KEY")  # the Cohere client used by generate() below

class cohereStructuredPrompt():
    def __init__(self, examples, example_labels, labels, task_description, prompt_output_indicator):
        self.examples = examples
        self.example_labels = example_labels
        self.labels = labels
        self.task_description = task_description
        self.prompt_output_indicator = prompt_output_indicator

    def make_prompt(self, example):
        examples = self.examples + [example]
        labels = self.example_labels + [""]
        # Keyword lists are joined into comma-separated strings; string inputs pass through
        examples = [", ".join(e) if isinstance(e, list) else e for e in examples]
        return "\n---\n".join(
            self.task_description + examples[i] + "\n" +
            self.prompt_output_indicator + labels[i]
            for i in range(len(examples)))

    def generate(self, example):
        result = co.generate(
            model='large',
            prompt=self.make_prompt(example),
            max_tokens=50,
            temperature=0.3,
            num_generations=5,
            return_likelihoods='GENERATION',
            stop_sequences=["\n"])
        return result.generations

Notice that the generate call has num_generations=5 – we get five results and choose the best from among them (defined as the one with the highest average likelihood score).

These are the examples I labeled:


example_inputs = [['books', 'book', 'read', 'the', 'you', 'are', 'what', 'best', 'in', 'your'],
 ['startup', 'company', 'yc', 'failed'],
 ['freelancer', 'wants', 'hired', 'be', 'who', 'seeking', 'to', '2014', '2020', 'april']]


example_outputs = [' favorite books to read.',  
                   ' startup companies and their failures.', 
                   ' hiring for a freelancer to join the team of a startup.']

And this is how the prompt object is put together:

clusterNamer = cohereStructuredPrompt(example_inputs, 
                                       example_outputs, [],
                                       "The common theme of the following words: ", 
                                       "is that they all relate to")

Now that the prompt is ready, we can loop over the clusters and generate five suggested titles for each cluster:

results = []
for idx in range(len(words_per_class)):

    # Top c-TF-IDF keywords for this cluster
    topic_words = get_topic_keywords(words_per_class, idx)

    # Generate five candidate names for the cluster
    cluster_names = clusterNamer.generate(', '.join(topic_words))
    results.append(cluster_names)

So now results is a list of lists – containing 5 suggested names for each cluster. We can then proceed to get the top name for each cluster:

import numpy as np

topic_names = []
for row in results:
    # Keep the generation with the highest average likelihood score
    index_of_most_likely_gen = np.argmax([gen.likelihood for gen in row])
    most_likely_gen = row[index_of_most_likely_gen].text
    topic_names.append(most_likely_gen)

topic_names

I found that extracting KW phrases instead of just single words significantly improves cluster name generation.

It would be nice to see what the generated prompt looks like.

Oh that’s very interesting. How are you generating the phrases?

It would be nice to see what the generated prompt looks like.

For phrases you mean? The one for keywords is listed above:

The common theme of the following words: books, book, read, the, you, are, what, best, in, your
is that they all relate to favorite books to read.
---
The common theme of the following words: startup, company, yc, failed
is that they all relate to startup companies and their failures.
---
The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april
is that they all relate to hiring for a freelancer to join the team of a startup.
---
The common theme of the following words: <insert keywords here>
is that they all relate to

Thanks. I implemented a custom IT-industry KW extraction process that utilizes a few methods, including lexical (dictionary-based) matching, corpus-level TF-IDF bigrams and trigrams, and some NER via spaCy. I use this process to extract keywords/phrases from a corpus of IT docs, cluster these KWs (HDBSCAN), clean them up a bit using an outlier process of sorts, then finally prompt GPT-3 to name each cluster and extract some business themes from them.
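
For the curious, a minimal sketch of one slice of such a pipeline – spaCy NER for candidate phrases, then HDBSCAN over their embeddings (the model names and the corpus variable here are assumptions, not necessarily the actual setup):

import spacy
import hdbscan
from sentence_transformers import SentenceTransformer

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Collect NER entities as keyword/phrase candidates
candidates = set()
for text in corpus:  # corpus: a hypothetical list of IT documents
    for ent in nlp(text).ents:
        candidates.add(ent.text.lower())
candidates = sorted(candidates)

# Cluster the candidate phrases by embedding similarity
embeddings = embedder.encode(candidates)
cluster_ids = hdbscan.HDBSCAN(min_cluster_size=5).fit_predict(embeddings)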

This is the prompt I used, which returned a KW cluster label of “Supply chain management”.

Please describe this cluster of keywords using one phrase:

['Inventory', 'Inventory Control', 'Strategic Supplier Relations Group', 'Supplier Relationship Management', 'Supply Chain Finance', 'Supply Chain Management', 'Supply Chain Planning', 'What is supply chain management', 'blockchain supply chain', 'chain management', 'chain management scm', 'chain management software', 'demanddriven supply chain', 'financial supply chain management', 'food supply chain', 'global supply chain', 'improve supply chain', 'inventory management', 'inventory optimization', 'inventory turns', 'supplier relationship management', 'supply chain', 'supply chain analytics', 'supply chain disruption', 'supply chain finance', 'supply chain management', 'supply chain planning', 'supply chain software', 'supply chain sustainability', 'supply chain system', 'supply chain technology', 'supply chain visibility']

The one phrase that describes this cluster is:
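
For illustration, one way to send that prompt through the legacy OpenAI completions API (a sketch; the engine name and parameters are assumptions, and the keyword list is abbreviated from the one above):

import openai

openai.api_key = "YOUR_API_KEY"

keywords = ["Inventory", "Supply Chain Management", "inventory optimization"]  # abbreviated
prompt = (
    "Please describe this cluster of keywords using one phrase:\n\n"
    + str(keywords)
    + "\n\nThe one phrase that describes this cluster is:"
)

response = openai.Completion.create(
    engine="text-davinci-002",
    prompt=prompt,
    max_tokens=20,
    temperature=0.3,
)
print(response.choices[0].text.strip())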

Very cool. Thanks for sharing! Seeing that BERTopic’s ClassTFIDF is a subclass of TfidfTransformer, I think that should allow for ngrams of multiple words. Cool note!

Yes. You can set ngrams (as well as stop words) in BERTopic like this:

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words="english")
AllModel = BERTopic(vectorizer_model=vectorizer_model)


@drob True! Thanks!

I’ve been playing around with that using the HN notebook: Google Colab

Step #5 explicitly does a bunch of this. I’ve indeed found good results with ngram_range=(1, 3) as well as setting max_df (to remove words/phrases that appeared in multiple clusters). I set it to 0.8, but then found even better results with 0.5.
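
Concretely, that configuration looks like this (a small sketch; max_df=0.5 per the results above, and topic_model is just an illustrative name):

from sklearn.feature_extraction.text import CountVectorizer
from bertopic import BERTopic

vectorizer_model = CountVectorizer(
    ngram_range=(1, 3),
    max_df=0.5)  # drop terms appearing in more than half of the per-cluster documents
topic_model = BERTopic(vectorizer_model=vectorizer_model)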

This is truly cool stuff @jay. Thanks for the tip on max_df - my experience with CountVectorizer is summarized in the snippet I posted :wink: . I really like the stuff you are doing around topic word definitions - something I’ve had on my mind for a while but haven’t delved into. Another quick-and-dirty trick I’ve found useful is simply adding recurring words to the stop words (see the sketch below).

Overall I’ve found the ‘secret’ to good topic word vocabularies is having a large enough set of documents, so that whatever words surface are really descriptive of just that topic - nice when it works. Beyond all that, I really like that you’ve pulled out c-TF-IDF, which I consider a hidden gem within BERTopic. I’ve begun thinking about using different algos to extract vocabularies from clusters. I was playing a bit with Top2Vec, and although I’m not a big fan of their clustering, I was pretty impressed by the speed and initial output of their vocab generation. Thanks for the notebook - going into my ‘keepers’ list!
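
That recurring-words trick can be as simple as extending scikit-learn’s built-in stop word list (a small sketch; the added words are hypothetical examples of recurring terms):

from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Extend the built-in English list with words that recur across clusters (hypothetical examples)
custom_stop_words = list(ENGLISH_STOP_WORDS) + ["hn", "ask"]
vectorizer_model = CountVectorizer(ngram_range=(1, 3), stop_words=custom_stop_words)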

Glad to hear!

Top2Vec has been on my list to experiment with but I haven’t had the time yet. Completely agree on how awesome c-TF-IDF is! As @dplutcho suggested, I’m really excited to try to generate titles using these key phrases (longer than one word).

I went back to Top2Vec. Not crazy about their clustering. They use HDBSCAN to define a cluster, but then they remove the -1 (outlier) documents, average out the vectors, and do assignments based on the distance from the topic average. This creates what looks like a lot of oddly categorized documents. One other outcome of this is that they get a documents-to-topics matrix - so a kind of weighting. Although, again, I found the weighting to be of suspect use. However, on the more interesting side, they get topic vocabulary by concatenating all the documents in a topic, running CountVectorizer against it, then pulling a matrix of the top 50 most cosine-similar words and using that as the topic vocabulary. It is very fast and the results seemed interesting at first glance.

Very interesting!

I’m also following the work in GitHub - lmcinnes/thisnotthat: A visual labeling system implemented in Jupyter widgets.

Leland McInnes shares a few thoughts here:

In summary, labelling options are:

  • Use Top2Vec style “topic” labelling when joint vector spaces (e.g. word/document) exist.
  • Use descriptive feature importance based labelling when additional descriptive feature metadata exists.
  • Use submodular selection based labelling

The real trick is to build hierarchical layers of labelling based on clustering (with some outlier detection for higher layers). This allows for high level labels that resolve into more detailed labels as you zoom in.

https://twitter.com/leland_mcinnes/status/1547244707193012226

I’m developing better and better prompts to name the clusters with generative models; so far it’s the approach I’m most excited about.

What is get_topic_keywords?

@oliver That function just gets a value from the words_per_class dict. You can see how that dict is built here: Google Colab
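
For reference, a minimal sketch of what that helper might look like (assuming words_per_class maps each cluster id to a score-sorted list of (keyword, score) tuples; the notebook’s actual implementation may differ):

def get_topic_keywords(words_per_class, idx, top_n=10):
    # Return just the keyword strings for cluster idx, dropping the scores
    return [word for word, _score in words_per_class[idx][:top_n]]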