Questions tagged [text-mining]

Refers to a subset of data mining concerned with extracting information from text data by recognizing patterns. The goal of text mining is often to classify a given document into one of a number of categories automatically, and to improve that performance dynamically, making it an example of machine learning. One example of this type of text mining is the spam filter used for email.

642 questions
173 votes • 3 answers

How does Keras 'Embedding' layer work?

I need to understand how the 'Embedding' layer in the Keras library works. I execute the following code in Python: import numpy as np from keras.models import Sequential from keras.layers import Embedding model = Sequential() model.add(Embedding(5, 2,…
prashanth • 3,747 • 4 • 21 • 33
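A minimal sketch of the idea behind this question, using plain NumPy rather than Keras (the names `table` and `embed` are illustrative, not Keras API): an `Embedding(5, 2)` layer is just a trainable lookup table of shape `(vocab_size=5, output_dim=2)` whose forward pass indexes rows by token id.

```python
import numpy as np

# The layer's weight: one 2-dimensional vector per vocabulary entry.
rng = np.random.default_rng(0)
table = rng.normal(size=(5, 2))

def embed(token_ids):
    # Input: integer ids of shape (batch, seq_len);
    # output: vectors of shape (batch, seq_len, 2).
    return table[np.asarray(token_ids)]

out = embed([[0, 1, 4]])
print(out.shape)  # (1, 3, 2)
```

Training then adjusts the rows of `table` by backpropagation, exactly as with any other weight matrix.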
45 votes • 2 answers

Difference between naive Bayes & multinomial naive Bayes

I've dealt with Naive Bayes classifier before. I've been reading about Multinomial Naive Bayes lately. Also Posterior Probability = (Prior * Likelihood)/(Evidence). The only prime difference (while programming these classifiers) I found between…
garak • 2,033 • 4 • 26 • 31
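A hand-rolled sketch of the distinction this question is after, on a toy corpus (not a full Bernoulli NB, which would also multiply in the probability of *absent* words): multinomial naive Bayes estimates P(word|class) from word counts, so a repeated word multiplies into the likelihood repeatedly, while the binarized variant only asks whether a word appears in a document.

```python
from collections import Counter

spam = ["free money now", "free free prize"]
ham = ["meeting at noon", "lunch meeting today"]
vocab = {w for d in spam + ham for w in d.split()}

def likelihood(doc, class_docs, multinomial=True, alpha=1.0):
    # multinomial=True: count every occurrence; False: one per document.
    words = (w for d in class_docs
               for w in (d.split() if multinomial else set(d.split())))
    counts = Counter(words)
    total = sum(counts.values())
    p = 1.0
    for w in doc.split():  # a word repeated in the test doc multiplies in again
        p *= (counts[w] + alpha) / (total + alpha * len(vocab))
    return p

print(likelihood("free free prize", spam), likelihood("free free prize", ham))
```

With Laplace smoothing (`alpha=1`) baked in, the posterior is then just prior × likelihood, normalized by the evidence, as the excerpt states.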
43 votes • 6 answers

How to quasi match two vectors of strings (in R)?

I am not sure how this should be termed, so please correct me if you know a better term. I've got two lists. One of 55 items (e.g: a vector of strings), the other of 92. The item names are similar but not identical. I wish to find the best…
Tal Galili • 19,935 • 32 • 133 • 195
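The usual term is fuzzy (or approximate) string matching. The question asks for R, where packages like `stringdist` fill this role; a sketch of the same idea with Python's stdlib `difflib` (the company names below are invented examples):

```python
import difflib

short = ["Appel Inc.", "Microsof", "Goggle LLC"]
long_ = ["Apple Inc.", "Microsoft", "Google LLC", "Amazon"]

# For each item in the shorter vector, take the closest item in the longer
# one whose similarity ratio clears the cutoff.
best = {s: difflib.get_close_matches(s, long_, n=1, cutoff=0.6) for s in short}
print(best)
```

Raising `cutoff` trades recall for precision; items with no match above the cutoff come back as empty lists, which is useful for flagging pairs that need manual review.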
35 votes • 8 answers

In Naive Bayes, why bother with Laplace smoothing when we have unknown words in the test set?

I was reading over Naive Bayes Classification today. I read, under the heading of Parameter Estimation with add-1 smoothing: Let $c$ refer to a class (such as Positive or Negative), and let $w$ refer to a token or word. The maximum likelihood…
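A tiny worked illustration of what is at stake (toy data, not from the question): the maximum-likelihood estimate assigns probability zero to any word unseen in a class's training text, which zeroes out the entire product of likelihoods; add-1 smoothing keeps every estimate strictly positive while still summing to one.

```python
from collections import Counter

train = "the movie was great great fun".split()
counts = Counter(train)
N = len(train)
V = len(counts) + 1  # observed vocabulary plus one slot for unseen words

def p_mle(w):
    return counts[w] / N              # 0 for any unseen word: kills the product

def p_laplace(w):
    return (counts[w] + 1) / (N + V)  # never 0, still a proper distribution

print(p_mle("terrible"), p_laplace("terrible"))
```

The question's point stands regardless: smoothing is about unseen *class/word combinations* at test time, which occur even when every test word was seen somewhere in training.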
33 votes • 6 answers

Statistical classification of text

I'm a programmer without statistical background, and I'm currently looking at different classification methods for a large number of different documents that I want to classify into pre-defined categories. I've been reading about kNN, SVM and NN.…
Emil H • 431 • 5 • 5
32 votes • 2 answers

StackExchange fires a moderator, and now in response hundreds of moderators resign: is the increase in resignations statistically significant?

I am doing a study on StackExchange. The management of StackExchange has demodded (for unclear reasons) a moderator, and now the network is on fire. Currently many moderators resign or suspend their activities because they are dissatisfied. I wish…
32 votes • 4 answers

Machine learning techniques for parsing strings?

I have a lot of address strings: 1600 Pennsylvania Ave, Washington, DC 20500 USA I want to parse them into their components: street: 1600 Pennsylvania Ave city: Washington province: DC postcode: 20500 country: USA But of course the data is dirty:…
Jay Hacker • 451 • 1 • 5 • 3
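Before reaching for machine learning, a regex baseline makes the difficulty concrete: the pattern below (illustrative, not a recommended parser) handles only the clean "street, city, ST zipcode country" case, and dirty real-world data is exactly why answers to this question point at learned sequence taggers such as CRFs.

```python
import re

# Only matches well-formed US-style addresses; any missing comma,
# reordered field, or typo breaks it.
pattern = re.compile(
    r"(?P<street>.+), (?P<city>[^,]+), "
    r"(?P<province>[A-Z]{2}) (?P<postcode>\d{5}) (?P<country>\w+)"
)

m = pattern.match("1600 Pennsylvania Ave, Washington, DC 20500 USA")
print(m.groupdict())
```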
30 votes • 4 answers

R packages for performing topic modeling / LDA: just `topicmodels` and `lda`

It seems to me that only two R packages are able to perform Latent Dirichlet Allocation: One is lda, authored by Jonathan Chang; and the other is topicmodels authored by Bettina Grün and Kurt Hornik. What are the differences between these two…
bit-question • 2,637 • 6 • 25 • 26
30 votes • 3 answers

How well does R scale to text classification tasks?

I am trying to get up to speed with R. I eventually want to use R libraries for doing text classification. I was just wondering what people's experiences are with regard to R's scalability when it comes to doing text classification. I am likely to…
Andy • 1,583 • 3 • 21 • 19
29 votes • 1 answer

Is cross validation a proper substitute for validation set?

In text classification, I have a training set with about 800 samples and a test set with about 150 samples. The test set has never been used and is being held out until the end. I am using the whole 800-sample training set, with 10-fold cross…
Flake • 1,131 • 2 • 13 • 21
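The setup in the question (800 training samples, 10-fold CV) can be sketched with a stdlib fold generator (illustrative, not any particular library's API); the key property cross-validation buys over a single validation split is that every sample is validated on exactly once:

```python
import random

def kfold(n, k=10, seed=0):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i, held_out in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, held_out

splits = list(kfold(800))
print(len(splits), len(splits[0][0]), len(splits[0][1]))  # 10 720 80
```

Note this replaces the *validation* set only; the untouched 150-sample test set still serves its separate purpose of estimating final generalization error.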
28 votes • 2 answers

Bag-of-Words for Text Classification: Why not just use word frequencies instead of TFIDF?

A common approach to text classification is to train a classifier on a 'bag-of-words'. The user takes the text to be classified and counts the frequencies of the words in each object, followed by some sort of trimming to keep the resulting…
shf8888 • 845 • 1 • 7 • 11
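A minimal TF-IDF computation on an invented three-document corpus shows the difference the question asks about: raw frequency would weight a ubiquitous word like "the" as heavily as a discriminative one like "cat", while the inverse-document-frequency factor drives the weight of a word found in every document to zero.

```python
import math
from collections import Counter

docs = [d.split() for d in ["the cat sat", "the dog sat", "the cat ran home"]]
N = len(docs)
df = Counter(w for d in docs for w in set(d))  # how many docs contain each word

def tfidf(doc):
    tf = Counter(doc)
    # log(N / df) is 0 when a word occurs in all N documents.
    return {w: tf[w] * math.log(N / df[w]) for w in tf}

print(tfidf(docs[0]))
```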
26 votes • 3 answers

Topic models and word co-occurrence methods

Popular topic models like LDA usually cluster words that tend to co-occur together into the same topic (cluster). What is the main difference between such topic models, and other simple co-occurrence based clustering approaches like PMI? (PMI…
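For reference, pointwise mutual information as mentioned in the question can be computed directly from document co-occurrence counts (toy corpus below is invented): it is positive exactly when two words co-occur more often than independence would predict, with no latent topic variables involved.

```python
import math
from collections import Counter
from itertools import combinations

docs = [{"data", "mining", "text"}, {"data", "mining"},
        {"text", "corpus"}, {"data", "science"}]
N = len(docs)
word = Counter(w for d in docs for w in d)
pair = Counter(frozenset(p) for d in docs for p in combinations(sorted(d), 2))

def pmi(a, b):
    # log of observed co-occurrence rate over the independence baseline.
    return math.log((pair[frozenset((a, b))] / N)
                    / ((word[a] / N) * (word[b] / N)))

print(round(pmi("data", "mining"), 3))
```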
23 votes • 1 answer

Has the reported state-of-the-art performance of using paragraph vectors for sentiment analysis been replicated?

I was impressed by the results in the ICML 2014 paper "Distributed Representations of Sentences and Documents" by Le and Mikolov. The technique they describe, called "paragraph vectors", learns unsupervised representations of arbitrarily-long…
21 votes • 2 answers

How to calculate perplexity of a holdout with Latent Dirichlet Allocation?

I'm confused about how to calculate the perplexity of a holdout sample when doing Latent Dirichlet Allocation (LDA). The papers on the topic breeze over it, making me think I'm missing something obvious... Perplexity is seen as a good measure of…
drevicko • 394 • 1 • 3 • 11
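The definition itself is simple once the holdout's per-token log-likelihoods are in hand (the hard part the question is really about is estimating those likelihoods for held-out documents under LDA): perplexity is the exponentiated average negative log-likelihood per token.

```python
import math

def perplexity(log_probs):
    # log_probs: per-token log-likelihoods of the holdout under the model.
    return math.exp(-sum(log_probs) / len(log_probs))

# Sanity check: a model uniform over a 10-word vocabulary assigns every
# token log(1/10), so its perplexity is 10 regardless of holdout length.
print(perplexity([math.log(1 / 10)] * 50))  # ≈ 10
```

Lower is better; a perplexity of $k$ means the model is, on average, as uncertain as a uniform choice among $k$ words.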
20 votes • 2 answers

Why does ridge regression classifier work quite well for text classification?

During an experiment for text classification, I found the ridge classifier generating results that consistently top the tests among those classifiers that are more commonly mentioned and applied for text mining tasks, such as SVM, NB, kNN, etc. Though, I…
Flake • 1,131 • 2 • 13 • 21
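A NumPy sketch of the mechanism (toy data; a simplification of what e.g. scikit-learn's `RidgeClassifier` does): ridge-regress ±1 class labels on the bag-of-words matrix, then threshold the regression score at zero. The penalty term keeps the normal equations well-conditioned even when there are far more terms than documents, which is the typical text-classification regime.

```python
import numpy as np

# Tiny bag-of-words matrix: rows = documents, columns = term counts.
X = np.array([[2, 0, 1], [0, 3, 1], [1, 0, 2], [0, 2, 2]], dtype=float)
y = np.array([1, -1, 1, -1])  # two classes encoded as +/-1

lam = 1.0  # ridge penalty: X'X + lam*I stays invertible despite collinear terms
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

pred = np.sign(X @ w)  # regression score thresholded at zero = class decision
print(pred)
```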