Text and document clustering (using embeddings) allows for some very useful analysis of text.
I’m keen on developing a good way to use generative models to assign names to text clusters. My first approach is to name each cluster from keywords extracted with the c-TF-IDF algorithm, because it not only surfaces the important keywords for a cluster but also favors keywords that set that cluster apart from the others.
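To make the keyword step concrete, here is a minimal sketch of class-based TF-IDF in plain Python. It is not the code behind this post: it assumes one common formulation of the score (term frequency within the cluster, multiplied by log(1 + average tokens per cluster / term frequency across all clusters)), and the names `tokenize`, `ctfidf_keywords`, `docs`, and `labels` are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of letters/digits. A real pipeline would
    # usually also remove stop words; this sketch does not.
    return re.findall(r"[a-z0-9]+", text.lower())


def ctfidf_keywords(docs, labels, top_n=10):
    """Return {cluster_label: [keywords]} ranked by a class-based TF-IDF score.

    Assumed formulation: score(t, c) = tf(t, c) * log(1 + A / f(t)), where
    tf is the term's share of the cluster's tokens, f is the term's count
    across all clusters, and A is the average token count per cluster.
    """
    # Pool each cluster's documents into a single bag of words.
    cluster_counts = {}
    for doc, label in zip(docs, labels):
        cluster_counts.setdefault(label, Counter()).update(tokenize(doc))

    # f(t): how often each term appears across all clusters.
    total_counts = Counter()
    for counts in cluster_counts.values():
        total_counts.update(counts)

    # A: average number of tokens per cluster.
    avg_tokens = sum(total_counts.values()) / len(cluster_counts)

    keywords = {}
    for label, counts in cluster_counts.items():
        cluster_size = sum(counts.values())
        scores = {
            term: (freq / cluster_size) * math.log(1 + avg_tokens / total_counts[term])
            for term, freq in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        keywords[label] = ranked[:top_n]
    return keywords
```

A word that shows up in every cluster gets a large f(t), which shrinks its score, so it drops down the ranking even if it is frequent inside a given cluster. That is the "different from other clusters" property described above.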
I named a few clusters myself and added them as examples in this prompt:
The common theme of the following words: books, book, read, the, you, are, what, best, in, your is that they all relate to favorite books to read.

---

The common theme of the following words: startup, company, yc, failed is that they all relate to startup companies and their failures.

---

The common theme of the following words: freelancer, wants, hired, be, who, seeking, to, 2014, 2020, april is that they all relate to hiring for a freelancer to join the team of a startup.

---

The common theme of the following words: <insert keywords here> is that they all relate to
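Filling in that template for each cluster can be scripted. The sketch below only builds the prompt string; the call to the generative model is left out because the post does not say which model or API is used. `EXAMPLES`, `TEMPLATE`, `build_prompt`, and the exact whitespace around the `---` separator are assumptions for illustration.

```python
# Hand-named clusters used as few-shot examples (copied from the prompt above).
EXAMPLES = [
    (["books", "book", "read", "the", "you", "are", "what", "best", "in", "your"],
     "favorite books to read"),
    (["startup", "company", "yc", "failed"],
     "startup companies and their failures"),
    (["freelancer", "wants", "hired", "be", "who", "seeking", "to", "2014", "2020", "april"],
     "hiring for a freelancer to join the team of a startup"),
]

TEMPLATE = "The common theme of the following words: {words} is that they all relate to"


def build_prompt(keywords, examples=EXAMPLES, separator="\n---\n"):
    """Assemble the few-shot prompt, ending with an open slot for the model."""
    # Each example is the template plus the hand-written theme as its completion.
    shots = [
        TEMPLATE.format(words=", ".join(words)) + f" {theme}."
        for words, theme in examples
    ]
    # The final entry stops right after "relate to" so the model continues it.
    query = TEMPLATE.format(words=", ".join(keywords))
    return separator.join(shots + [query])
```

Calling `build_prompt(["open", "source", "web", "code"])` returns the three examples followed by an unfinished sentence for the new cluster, which is what the model is asked to complete.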
Here are the model’s suggested themes for each cluster:
1- new, site, facebook, happened, account, internet, search, does, hn, google
Model’s suggested theme: the new Facebook search engine.
2- 2015, 2016, november, april, 2019, 2020, 2021, hired, wants, hiring
Model’s suggested theme: hiring for a freelancer to join the team of a startup.
3- open, source, web, code, software, using, tools, alternatives, best, use
Model’s suggested theme: open source software.
4- life, reading, ve, hn, blogs, best, book, favorite, read, books
Model’s suggested theme: the life of a blogger who reads books and blogs about them.
5- project, startups, passive, money, did, business, successful, company, income, startup
Model’s suggested theme: passive income from a successful startup.
6- career, tech, developer, remote, software, jobs, seeking, work, job, freelancer
Model’s suggested theme: finding a career in the tech industry as a software developer.
7- advice, things, people, doing, deal, working, career, did, work, life
Model’s suggested theme: advice on how to deal with things in your career and life.
8- self, cs, courses, design, good, best, programming, resources, learning, learn
Model’s suggested theme: learning how to code.
Some of these are better than others. For the first one, I probably would’ve named this a “Big Tech” cluster. I’m biased by having looked at the articles, though. The model over-indexed on “Facebook” here and made a good attempt at making a sentence out of the word soup.
For the next iteration, I’m thinking of reshaping the prompt to add a few titles from each cluster to the examples, and to better indicate to the model that the keywords are simply keywords from a group of articles, so we’re not nudging the model to overextend itself by making a sentence out of seemingly unrelated words without much context.
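One possible shape for that reworked prompt is sketched below. This is a guess at a format, not something that has been tried here: the wording of the prompt lines, the function name `build_prompt_with_titles`, and the `max_titles` parameter are all hypothetical.

```python
def build_prompt_with_titles(keywords, titles, examples=(), max_titles=3):
    """Build a prompt that shows sample titles and labels keywords as keywords.

    examples: iterable of (keywords, titles, theme) triples named by hand.
    """

    def block(words, sample_titles, theme=None):
        # Titles give the model context; the keyword line is explicitly
        # described as extracted keywords rather than a sentence to finish.
        lines = ["Sample article titles from one cluster:"]
        lines += [f"- {title}" for title in list(sample_titles)[:max_titles]]
        lines.append("Keywords extracted from the cluster: " + ", ".join(words))
        lines.append("Cluster name:" + (f" {theme}" if theme else ""))
        return "\n".join(lines)

    parts = [block(words, sample, theme) for words, sample, theme in examples]
    # The last block leaves "Cluster name:" open for the model to fill in.
    parts.append(block(keywords, titles))
    return "\n---\n".join(parts)
```

Asking for a short "Cluster name" instead of a sentence completion is a design choice meant to reduce the pressure to string every keyword into one phrase. Whether it actually helps would need to be checked against the clusters above.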
Happy to share the code I’m working with if there’s interest. Let me know.