python - Clustering Using Latent Semantic Analysis -
Suppose I have a corpus of documents and I run the LSA algorithm on it. How can I use the final matrix obtained after applying SVD to semantically cluster the words appearing in the corpus? Wikipedia says LSA can be used to find relations between terms. Is there a library available in Python that can help me accomplish this task of semantically clustering words based on LSA?
Try gensim (http://radimrehurek.com/gensim/index.html); install it by following these instructions: http://radimrehurek.com/gensim/install.html
Then here is a code sample:
from gensim import corpora, models

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

# remove common words and tokenize
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once
all_tokens = sum(texts, [])
tokens_once = set(word for word in set(all_tokens) if all_tokens.count(word) == 1)
texts = [[word for word in text if word not in tokens_once] for text in texts]

dictionary = corpora.Dictionary(texts)
corp = [dictionary.doc2bow(text) for text in texts]

# extract 400 LSI topics; use the default one-pass algorithm
lsi = models.lsimodel.LsiModel(corpus=corp, id2word=dictionary, num_topics=400)

# print the most contributing words (both positively and negatively) for each of the first ten topics
lsi.print_topics(10)