x-tabdeveloping/topicwizard

topicwizard


Pretty and opinionated topic model visualization in Python.

New in version 0.5.0 🌟

  • Enhanced readability and legibility of graphs.
  • Added helper tooltips to help you understand and interpret the graphs.
  • Improved stability.
  • Negative topic distributions are now supported in documents.

Features

  • Investigate complex relations between topics, words, documents and groups/genres/labels interactively
  • Easy to use pipelines that can be utilized for downstream tasks
  • Sklearn, Gensim and BERTopic compatible (stay tuned for more) 🔩
  • Interactive and composable Plotly figures
  • Automatically infer topic names, oooor...
  • Name topics manually
  • Easy deployment 🌍

Installation

Install from PyPI:

pip install topic-wizard

The main abstraction of topicwizard around a topic model is a topic pipeline, which consists of a vectorizer that turns texts into bag-of-tokens representations, and a topic model that decomposes these representations into vectors of topic importance. topicwizard lets you use either scikit-learn pipelines or its own TopicPipeline.

Let's build a pipeline. We will use scikit-learn's CountVectorizer as our vectorizer component:

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(min_df=5, max_df=0.8, stop_words="english")

The topic model I will use for this example is Non-negative Matrix Factorization as it is fast and usually finds good topics.

from sklearn.decomposition import NMF

model = NMF(n_components=10)

Then let's put this all together in a pipeline. You can either use sklearn Pipelines...

from sklearn.pipeline import make_pipeline

topic_pipeline = make_pipeline(vectorizer, model)

Or topicwizard's TopicPipeline:

from topicwizard.pipeline import make_topic_pipeline

topic_pipeline = make_topic_pipeline(vectorizer, model)

Let's load a corpus that we would like to analyze. In this example I will use 20newsgroups from sklearn.

from sklearn.datasets import fetch_20newsgroups

newsgroups = fetch_20newsgroups(subset="all")
corpus = newsgroups.data

# Sklearn gives the labels back as integers, we have to map them back to
# the actual textual label.
group_labels = [newsgroups.target_names[label] for label in newsgroups.target]

Then let's fit our pipeline to this data:

topic_pipeline.fit(corpus)

You can launch the topicwizard web application to investigate your topic models interactively. The app is also quite easy to deploy in case you want to create a client-facing interface.

import topicwizard

topicwizard.visualize(corpus, pipeline=topic_pipeline)

From version 0.3.0 you can also disable pages you do not wish to display, which can save you a lot of time:

# Computing 2D projections takes a looong time for a large corpus,
# so you can speed up preprocessing by disabling it altogether.
topicwizard.visualize(corpus, pipeline=topic_pipeline, exclude_pages=["documents"])

App pages (screenshots omitted): Topics, Words, Documents, Groups.

If you want customizable, faster, html-saveable interactive plots, you can use the figures API. Here are a couple of examples:

from topicwizard.figures import word_map, document_topic_timeline, topic_wordclouds, word_association_barchart

Word Map:

word_map(corpus, pipeline=topic_pipeline)

Timeline of Topics in a Document:

document_topic_timeline(
    "Joe Biden takes over presidential office from Donald Trump.",
    pipeline=topic_pipeline,
)

Wordclouds of Topics:

topic_wordclouds(corpus, pipeline=topic_pipeline)

Topic for Word Importance:

word_association_barchart(["supreme", "court"], corpus=corpus, pipeline=topic_pipeline)

For more information, consult our Documentation.
