A Topic Modeling toolbox

Related tags

Deep Learningtopik
Overview

Build Status Coverage Status Scrutinizer Code Quality Documentation Status

Topik

A Topic Modeling toolbox.

Introduction

The aim of topik is to provide a full suite and high-level interface for anyone interested in applying topic modeling. For that purpose, topik includes many utilities beyond statistical modeling algorithms and wraps all of its features into an easy callable function and a command line interface.

Topik is built on top of existing natural language and topic modeling libraries and primarily provides a wrapper around them, for a quick and easy exploratory analysis of your text data sets.

Please see our complete documentation at ReadTheDocs.

LICENSE

New BSD. See License File.

Comments
  • Error in `/home/usr/anaconda2/bin/python': free(): invalid pointer:

    Error in `/home/usr/anaconda2/bin/python': free(): invalid pointer:

    Hi, I installed topik 0.3.0 on ubuntu 15.0, however, I got this error when running topik. Anyone has idea why and how to fix it?

    Error in `/home/usr/anaconda2/bin/python': free(): invalid pointer:

    Thanks

    opened by kenyeung128 10
  • Problem running tutorial code

    Problem running tutorial code

    Hello, I was trying out the topik package and ran into some problems with the basic examples in the tutorial (http://topik.readthedocs.org/en/latest/example.html). Specifically, I was trying to get an LDAvis visualization using a variation of your basic code:

    from topik.run import run_model run_model("reviews", content_field="text", r_ldavis=True, dir_path="./topic_model")

    The parameters don't seem to match what's on the documentation, so I'm going by trial and error. With the present code, I get the error below. Could you kindly let me know how to properly invoke lDAvis services? Thanks and best regards

    Alex

    ----> 1 run_model("reviews", content_field="text", r_ldavis=True, dir_path="./topic_model") /Users/alexmckenzie/anaconda/lib/python2.7/site-packages/topik-0.1.0-py2.7.egg/topik/run.pyc in run_model(data_source, source_type, year_field, start_year, stop_year, content_field, clear_es_index, tokenizer, n_topics, dir_path, model, termite_plot, output_file, r_ldavis, json_prefix, seed, **kwargs) 116 117 if r_ldavis: --> 118 to_r_ldavis(processed_data, dir_name=os.path.join(dir_path, 'ldavis'), lda=lda) 119 os.environ["LDAVIS_DIR"] = os.path.join(dir_path, 'ldavis') 120 try: /Users/alexmckenzie/anaconda/lib/python2.7/site-packages/topik-0.1.0-py2.7.egg/topik/utils.pyc in to_r_ldavis(corpus_bow, lda, dir_name) 40 np.savetxt(os.path.join(dir_name, 'topicTermDist'), tt_dist, delimiter=',', newline='\n',) 41 ---> 42 corpus_file = corpus_bow.filename 43 corpus = gensim.corpora.MmCorpus(corpus_file) 44 docTopicProbMat = lda.model[corpus] AttributeError: 'DigestedDocumentCollection' object has no attribute 'filename'

    bug 
    opened by AHMcKenzie 8
  • conda installation of 0.3.0 is not working ->

    conda installation of 0.3.0 is not working -> "ImportError: No module named cli "

    Hi, I just installed the 0.3.0 update with conda, however I get an error msg even when executing a simple command-line "help". This is the error:

    $ topik --help Traceback (most recent call last): File "/Users/alexmckenzie/anaconda/bin/topik", line 4, in from topik.cli import run ImportError: No module named cli

    I'd rather keep using conda and not download the source zip. Thanks for your help Alex

    bug 
    opened by AHMcKenzie 6
  • Various fixes + logging + refactoring.

    Various fixes + logging + refactoring.

    Added numpy 1.9.4 as requirement (argpartition bug was showing up in termite parsing code; it was fixed in np 1.9.4, numpy issue 5524) Added requirement for nose, stop_words In fileio/in_document_folder.py - Added support to ignore invalid UTF; but progress normally + log the fact that we encountered an error Added suitable test data (_junk) and test case to test_in_document_folder Added connectionerror handling for elasticsearch tests; if elasticsearch is not running, simply skip the tests Corrected tokenizer names in simple_run/cli.py Added stopword support to simple_run/run.py Corrected tokenizer names in simple_run/run.py Added logging in simple_run/run.py Tee generator in entities.py to avoid exhaustion. Support quadgrams and refactored code in ngrams.py Tee generator in ngrams.py + added some logging Added appropriate test case for quadgrams + tweaked test data in test_ngrams.py Added test case using a generator that will demonstrate exhaustion problem. All tests now succeeding (NB: ElasticSearch ones not tested - no changes aside from exception handling in tests though.)

    opened by brianrusso 3
  • ValueError: could not convert string to float: s

    ValueError: could not convert string to float: s

    I get this identical error on both Ubuntu 15 + pip install (or directly from the latest github) AND on Ubuntu 14 LTS + conda2; so I am pretty sure this is not an issue with my environment.

    Following the tutorial on the movie reviews data (not sure if that matters).. I get..

    [email protected]:[~]$ topik -d reviews -c text 2016-04-18 14:29:00,880 : WARNING : too few updates, training might not converge; consider increasing the number of passes or iterations to improve accuracy Traceback (most recent call last): File "/home/brian/anaconda2/bin/topik", line 6, in sys.exit(run()) File "/home/brian/anaconda2/lib/python2.7/site-packages/click/core.py", line 716, in call return self.main(_args, *_kwargs) File "/home/brian/anaconda2/lib/python2.7/site-packages/click/core.py", line 696, in main rv = self.invoke(ctx) File "/home/brian/anaconda2/lib/python2.7/site-packages/click/core.py", line 889, in invoke return ctx.invoke(self.callback, *_ctx.params) File "/home/brian/anaconda2/lib/python2.7/site-packages/click/core.py", line 534, in invoke return callback(_args, *_kwargs) File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/simple_run/cli.py", line 27, in run termite_plot=termite) File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/simple_run/run.py", line 67, in run_pipeline model = models.registered_models[model](vectorized_data, ntopics=ntopics, **kwargs) File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/lda.py", line 82, in lda return ModelOutput(vectorized_corpus=vectorized_output, model_func=_LDA, ntopics=ntopics, *_kwargs) File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/base_model_output.py", line 20, in init vectorized_corpus, **kwargs) File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/lda.py", line 72, in _LDA for topic_no in range(ntopics)} File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/lda.py", line 72, in for topic_no in range(ntopics)} File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/lda.py", line 12, in _topic_term_to_array term_scores = {term: float(score) for score, term in topic} File "/home/brian/anaconda2/lib/python2.7/site-packages/topik/models/lda.py", line 12, in term_scores = {term: float(score) for score, term in topic} ValueError: could not convert string to float: s

    opened by brianrusso 2
  • Add wait mechanism in preprocess between append and subsequent get_field

    Add wait mechanism in preprocess between append and subsequent get_field

    I consistently encounter a KeyError for the "token_..." field when using the 'elastic' output_type. I can see that the field exists if I manually view an individual document in the browser, but it appears there is some lag between appending the tokenized document and actually being able to retrieve it back. I added a 1-second wait after the append loop and that appears to have solved the problem.

    bug 
    opened by youngblood 2
  • Need way of saving corpus

    Need way of saving corpus

    Obviously, per-class specific. I envision dictionary storage doing some serialization, but the Elasticsearch backend should store a file with only connection details and current field selections.

    enhancement 
    opened by msarahan 2
  • Avoid use of variables that are commonly used for other purposes, like `np`

    Avoid use of variables that are commonly used for other purposes, like `np`

    Example: https://github.com/ContinuumIO/topik/blob/master/topik/tokenizers/entities.py#L87

    np is commonly used to point at NumPy, via import numpy as np.

    Eliminate all such occurrences (as well as other common ones, like sp for scipy, pd for pandas, etc.).

    opened by gpfreitas 1
  • Youngblood/store param strings

    Youngblood/store param strings

    adds vectorization to CLI changed project run_model default to lda changed the datatype of the individual weight values in the lda matrices from numpy.float64 to float in order to match plsa and more importantly successfully decode from file using jsonpickle. minor documentation updates

    opened by youngblood 1
  • Youngblood/cli fixes

    Youngblood/cli fixes

    -renames run.run_model to run.run_pipeline, updates imports and function calls accordingly -changes default visualization for run_pipeline to lda_vis -fixes some default parameters in models.run_model -minor updates to documentation code examples -prevents TFIDF/LDA combination when using projects -(full fix including storage of corpus parameter strings coming in separate PR)

    opened by youngblood 1
  • Youngblood/add viz to docs

    Youngblood/add viz to docs

    Added plots to documentation. This is a workaround to keep using readthedocs for now, and I am intentionally not closing the associated issue because it will need to be solved again once we switch doc hosting platforms.

    opened by youngblood 1
  • pyLDAvis ValidationError: Not all rows (distributions) in doc_topic_dists sum to 1

    pyLDAvis ValidationError: Not all rows (distributions) in doc_topic_dists sum to 1

    i am getting the below error when trying to visualize HDP model trained on gensim

    **_--------------------------------------------------------------------------- ValidationError Traceback (most recent call last) in () ----> 1 vis_data_hdp = gensimvis.prepare(hdpmodel, corpus, dictionary) 2 #pyLDAvis.display(vis_data_hdp)

    C:\Anaconda2\lib\site-packages\pyLDAvis\gensim.pyc in prepare(topic_model, corpus, dictionary, doc_topic_dist, **kwargs) 110 """ 111 opts = fp.merge(_extract_data(topic_model, corpus, dictionary, doc_topic_dist), kwargs) --> 112 return vis_prepare(**opts)

    C:\Anaconda2\lib\site-packages\pyLDAvis_prepare.pyc in prepare(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency, R, lambda_step, mds, n_jobs, plot_opts, sort_topics) 372 doc_lengths = _series_with_name(doc_lengths, 'doc_length') 373 vocab = _series_with_name(vocab, 'vocab') --> 374 _input_validate(topic_term_dists, doc_topic_dists, doc_lengths, vocab, term_frequency) 375 R = min(R, len(vocab)) 376

    C:\Anaconda2\lib\site-packages\pyLDAvis_prepare.pyc in _input_validate(*args) 63 res = _input_check(*args) 64 if res: ---> 65 raise ValidationError('\n' + '\n'.join([' * ' + s for s in res])) 66 67 ValidationError:

    • Not all rows (distributions) in doc_topic_dists sum to 1._**

    To train hdp model i have used the following syntax: hdpmodel = models.hdpmodel.HdpModel(corpus, dictionary)

    corpus looks like this: [[(0, 1), (1, 1), (2, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1), (8, 1), (9, 1), (10, 1), (11, 2), (12, 1), (13, 2), (14, 1), (15, 1), (16, 1), (17, 1), (18, 1), (19, 1), (20, 1), (21, 1), (22, 1), (23, 1), (24, 1), (25, 1), (26, 1), (27, 1), (28, 1), (29, 1), (30, 1), (31, 1), (32, 1), (33, 4), (34, 1), (35, 1), (36, 1), (37, 1), (38, 1), (39, 2), (40, 1), (41, 2), (42, 1), (43, 2), (44, 1), (45, 1), (46, 1), (47, 3), (48, 1), (49, 1), (50, 2), (51, 1), (52, 1), (53, 1), (54, 1), (55, 1), (56, 1), (57, 1), (58, 1), (59, 1), (60, 1), (61, 1), (62, 1), (63, 1), (64, 1), (65, 1)]

    dictionary looks like this: [u'', u'dacteur', u'reallocations', u'advcompliance', u'resolveboth............

    opened by imranshaikmuma 2
  • Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.

    Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so.

    I encountered the error saying "Intel MKL FATAL ERROR: Cannot load libmkl_avx.so or libmkl_def.so." when running topik --help.

    I installed topik using conda install -c memex topik and running with Python 2.7.11 :: Anachonda 2.5.0

    The two files in question are in /home/user/anaconda2/lib directory and they look intact, 36M and 30M respectively in size. and the directory path is in my LD_LIBRARY_PATH and DYLD_LIBRARY_PATH env variable.

    Is there anything I am missing here? Any help?

    opened by geledek 0
  • pyLDAvis Plotting Data Structures Issues

    pyLDAvis Plotting Data Structures Issues

    There are several issues with the various data structures that need fixing. These fixes will make them much more coherent. I'll list them here:

    • [ ] prepared_model_vis_data.token_table uses a non-unique index, namely the unique id for each term. This needs to be a proper index, as it causes attempts to serialize the DataFrame to fail.
    enhancement 
    opened by brittainhard 0
  • Exclude empty documents and log their occurrence.

    Exclude empty documents and log their occurrence.

    We should exclude empty documents because they generate useless output at best, and crashes at worst.

    However, we must not silently drop the document, as it may be useful for the user to know that there is an empty document in the database.

    opened by gpfreitas 0
  • Add list of phrases to look for in the simple parser.

    Add list of phrases to look for in the simple parser.

    In some domains, certain expressions (phrases, compound words) are very common and meaningful. Having the simple tokenizer recognize such words would be very useful, and it could be done simply by passing all tokens through a transformation that recognizes those expressions and replaces sequences of tokens with said expressions. Reference:

    http://www.mimno.org/articles/phrases/

    That would improve the performance of models using tokenizers.simple, especially in certain domains.

    enhancement 
    opened by gpfreitas 0
Releases(v0.3.1)
  • v0.3.1(Apr 21, 2016)

  • v0.3.0(Nov 30, 2015)

    This version is a major update of the API to be consistent across all modules. Each step is now expected to be a function that returns either an iterator of content or some more complicated object that aids in presentation of results. Each step is registered with a borg-pattern dictionary, which hopefully will facilitate future integration with GUIs.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.2(Oct 15, 2015)

    • update documentation to show (interactive!) plots
    • fix LDA model issue where word weights did not sum to 1, causing an LDAvis validation error
    Source code(tar.gz)
    Source code(zip)
  • v0.2.1(Oct 10, 2015)

  • v0.2.0(Oct 9, 2015)

    • Refactor with aim towards modularity at each step
    • add elasticsearch input source
    • add elasticsearch as output backend option
    • add initial PLSA model algorithm
    • expand documentation; add examples of using Topik with Python API
    • add API docs from docstrings
    • add continuous integration with Travis CI
    • add code coverage monitoring with Coveralls
    • add code analysis with Scrutinizer
    • replace R-LDAvis with PyLDAvis to eliminate R dependency for simplicity
    • multitudinous bug fixes guided by Travis + doctests
    Source code(tar.gz)
    Source code(zip)
Owner
Anaconda, Inc. (formerly Continuum Analytics, Inc.)
Advanced data processing, analysis, and visualization tools for Python & R.
Anaconda, Inc. (formerly Continuum Analytics, Inc.)
Semiconductor Machine learning project

Wafer Fault Detection Problem Statement: Wafer (In electronics), also called a slice or substrate, is a thin slice of semiconductor, such as a crystal

kunal suryawanshi 1 Jan 15, 2022
TraND: Transferable Neighborhood Discovery for Unsupervised Cross-domain Gait Recognition.

TraND This is the code for the paper "Jinkai Zheng, Xinchen Liu, Chenggang Yan, Jiyong Zhang, Wu Liu, Xiaoping Zhang and Tao Mei: TraND: Transferable

Jinkai Zheng 32 Apr 04, 2022
Code accompanying the paper "Knowledge Base Completion Meets Transfer Learning"

Knowledge Base Completion Meets Transfer Learning This code accompanies the paper Knowledge Base Completion Meets Transfer Learning published at EMNLP

14 Nov 27, 2022
Massively parallel Monte Carlo diffusion MR simulator written in Python.

Disimpy Disimpy is a Python package for generating simulated diffusion-weighted MR signals that can be useful in the development and validation of dat

Leevi 16 Nov 11, 2022
LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

LAVT: Language-Aware Vision Transformer for Referring Image Segmentation Where we are ? 12.27 目前和原论文仍有1%左右得差距,但已经力压很多SOTA了 ckpt__448_epoch_25.pth mIoU

zichengsaber 60 Dec 11, 2022
Code for "Unsupervised Source Separation via Bayesian inference in the latent domain"

LQVAE-separation Code for "Unsupervised Source Separation via Bayesian inference in the latent domain" Paper Samples GT Compressed Separated Drums GT

Michele Mancusi 30 Oct 25, 2022
基于AlphaPose的TensorRT加速

1. Requirements CUDA 11.1 TensorRT 7.2.2 Python 3.8.5 Cython PyTorch 1.8.1 torchvision 0.9.1 numpy 1.17.4 (numpy版本过高会出报错 this issue ) python-package s

52 Dec 06, 2022
Image Captioning using CNN ,LSTM and Attention

Image Captioning using CNN ,LSTM and Attention This is a deeplearning model which tries to summarize an image into a text . Installation Install this

ASUTOSH GHANTO 1 Dec 16, 2021
The official implementation of Equalization Loss for Long-Tailed Object Recognition (CVPR 2020) based on Detectron2

Equalization Loss for Long-Tailed Object Recognition Jingru Tan, Changbao Wang, Buyu Li, Quanquan Li, Wanli Ouyang, Changqing Yin, Junjie Yan ⚠️ We re

Jingru Tan 197 Dec 25, 2022
Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [CVPR 2021]

Deep Occlusion-Aware Instance Segmentation with Overlapping BiLayers [BCNet, CVPR 2021] This is the official pytorch implementation of BCNet built on

Lei Ke 434 Dec 01, 2022
Model of an AI powered sign language interpreter.

TEXT AND SPEECH TO SIGN LANGUAGE. A web application which takes in text or live audio speech recording as input, converts and displays the relevant Si

Mark Gatere 4 Mar 30, 2022
PyTorch implementation of convolutional neural networks-based text-to-speech synthesis models

Deepvoice3_pytorch PyTorch implementation of convolutional networks-based text-to-speech synthesis models: arXiv:1710.07654: Deep Voice 3: Scaling Tex

Ryuichi Yamamoto 1.8k Jan 08, 2023
Code repository for Self-supervised Structure-sensitive Learning, CVPR'17

Self-supervised Structure-sensitive Learning (SSL) Ke Gong, Xiaodan Liang, Xiaohui Shen, Liang Lin, "Look into Person: Self-supervised Structure-sensi

Clay Gong 219 Dec 29, 2022
Official Repository of NeurIPS2021 paper: PTR

PTR: A Benchmark for Part-based Conceptual, Relational, and Physical Reasoning Figure 1. Dataset Overview. Introduction A critical aspect of human vis

Yining Hong 32 Jun 02, 2022
[Link]deep_portfolo - Use Reforcemet earg ad Supervsed learg to Optmze portfolo allocato []

rl_portfolio This Repository uses Reinforcement Learning and Supervised learning to Optimize portfolio allocation. The goal is to make profitable agen

Deepender Singla 165 Dec 02, 2022
Fast Soft Color Segmentation

Fast Soft Color Segmentation

3 Oct 29, 2022
Towards Rolling Shutter Correction and Deblurring in Dynamic Scenes (CVPR2021)

RSCD (BS-RSCD & JCD) Towards Rolling Shutter Correction and Deblurring in Dynamic Scenes (CVPR2021) by Zhihang Zhong, Yinqiang Zheng, Imari Sato We co

81 Dec 15, 2022
The Generic Manipulation Driver Package - Implements a ROS Interface over the robotics toolbox for Python

Armer Driver Armer aims to provide an interface layer between the hardware drivers of a robotic arm giving the user control in several ways: Joint vel

QUT Centre for Robotics (QCR) 13 Nov 26, 2022
Official implementation of the paper "Light Field Networks: Neural Scene Representations with Single-Evaluation Rendering"

Light Field Networks Project Page | Paper | Data | Pretrained Models Vincent Sitzmann*, Semon Rezchikov*, William Freeman, Joshua Tenenbaum, Frédo Dur

Vincent Sitzmann 130 Dec 29, 2022
A PyTorch implementation of Mugs proposed by our paper "Mugs: A Multi-Granular Self-Supervised Learning Framework".

Mugs: A Multi-Granular Self-Supervised Learning Framework This is a PyTorch implementation of Mugs proposed by our paper "Mugs: A Multi-Granular Self-

Sea AI Lab 62 Nov 08, 2022