Questions tagged [spark-mllib]

The Apache Spark distributed machine learning library.

MLlib is the machine learning library for the Apache Spark distributed computing platform. It contains implementations of many standard machine learning algorithms for a distributed setting.

58 questions
7 votes, 1 answer

Difference Between Linear Regression in Machine Learning and Statistical Model

I had the understanding that the major difference between machine learning and a statistical model is that the latter "assumes" a certain type of distribution of the data, and based on that, a different model paradigm as well as the statistical results we obtain (e.g.…
Beta (5,784)
7 votes, 5 answers

K-Means Cluster has over 50% of the points in one cluster. How to optimize it?

I am running a clustering algorithm in Spark and I have to choose between K-Means and Bisecting K-Means. However, the only thing that differs between the two is the runtime, because the performance is equally bad. I have a dataset of some 1.3 million…
5 votes, 1 answer

Given a topic distribution over words from LDA model how to calculate document distribution over topics for new document?

I'm using Spark 1.6.2 via the Python API. It seems that, as of when this post is being written, the only data available from the LDA (latent Dirichlet allocation) model calculations is the topic distribution over words, i.e. p(word | topic). What I…
thecity2 (1,485)
5 votes, 1 answer

Is it possible to share models between R, scikit-learn and Spark?

If I create machine learning models in Python or R, is it possible to export the models in a format that could be imported by Spark MLlib?
Chris Snow (619)
4 votes, 1 answer

How to estimate most important dimensions of the clusters after performing k-means?

I need to cluster customers of retail shops based on the products that they purchased. Therefore, I need to obtain, as results, both the customers belonging to each cluster and, for each cluster, the products that most influence the specified…
Nko (41)
4 votes, 2 answers

How to use Kullback-Leibler divergence if the mean and standard deviation of two Gaussian distributions are provided?

With the Apache Spark MLlib library I am trying to find clusters within a dataset by using a Gaussian Mixture Model (number of clusters = 3). It now returns 3 different values of mean and standard deviation. I am trying to find out whether there exists any…
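For two univariate Gaussians there is a closed form: KL(N(mu1, s1^2) || N(N mu2, s2^2)) = ln(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2. A minimal stdlib-only sketch (not from the question; the function name and test values are illustrative):

```python
import math

def kl_univariate_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats.

    Note: KL divergence is asymmetric, so swapping the two
    distributions generally gives a different value.
    """
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

# KL of a distribution against itself is exactly zero
print(kl_univariate_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0

# Distinct components (e.g. two GMM components) give a positive divergence
print(kl_univariate_gaussians(0.0, 1.0, 2.0, 1.5))
```

With a 3-component mixture one would typically evaluate this for each pair of components; there is no simple closed form for the KL between two full mixtures.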
4 votes, 0 answers

(Cross) Correlation of time series with very different sampling intervals (sec. vs days)

This is my first post on Cross Validated. I read a lot of questions related to my problem, but none was completely satisfying. I have two time series that are sampled at very different time intervals, e.g. one is sampled every 10 seconds, while…
3 votes, 2 answers

Understanding and interpreting the output of Spark's TF-IDF implementation

I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing. Given the example code block (taken from Spark's GitHub repository) val sentenceData = sess.createDataFrame(Seq( (0.0, "Hi I…
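Spark MLlib's IDF uses the smoothed formula log((m + 1) / (df + 1)), with m the number of documents and df the term's document frequency, multiplied by the raw term frequency. A plain-Python sketch of that arithmetic (Spark's HashingTF actually hashes terms to vector indices; this sketch keeps terms as readable strings, and the function name is made up for illustration):

```python
import math
from collections import Counter

def spark_style_tfidf(docs):
    """Raw term frequency times Spark-style smoothed IDF: log((m+1)/(df+1))."""
    m = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc counts a term at most once
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["hi", "i", "heard", "about", "spark"],
        ["i", "wish", "java", "could", "use", "case", "classes"]]
scores = spark_style_tfidf(docs)

# "i" occurs in every document, so its IDF is log(3/3) = 0
print(scores[0]["i"])  # 0.0
# "spark" occurs in one of two documents, so its score is log(3/2)
print(scores[0]["spark"])
```

This explains a behavior that often puzzles readers of the example: terms present in every document get a TF-IDF score of exactly zero.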
3 votes, 1 answer

How to handle the sparse data problem in unsupervised learning? I'm going to use k-means on the dataset

How can I handle the sparse data problem in unsupervised learning? I'm going to use k-means on the dataset. I have 200 variables, and nearly every column has about 70% zeros. How can I handle this without discarding any column?
3 votes, 1 answer

Linear Regression in Spark's MLlib gives a seemingly incorrect result

I am running the example found here. The training data for the model can be found in this CSV, where the first column is the response variable and the second column is a space separated list of predictors. After running the example with the modified…
Jon Claus (535)
3 votes, 1 answer

How to apply word2vec for k-means clustering?

Background: I am new to word2vec. By applying this method, I am trying to form some clusters based on words extracted by word2vec from scientific publications' abstracts. To this end, I have first retrieved sentences from the abstracts via…
mlee_jordan (209)
2 votes, 0 answers

MLeap and Spark ML SQLTransformer

I have a question. I am trying to serialize a PySpark ML model to MLeap. However, the model makes use of the SQLTransformer to do some column-based transformations, e.g. adding log-scaled versions of some columns. As we all know, MLeap doesn't…
femibyte (131)
2 votes, 0 answers

If I can put all my data in memory, why do I need frameworks like Spark?

Just wondering: if my organisation's data never runs into sizes that are bigger than my instances' memory, why do I need something like Spark? I can scale the memory up using cloud instances; these days it seems that you can really push the…
lppier (73)
2 votes, 0 answers

Logistic Regression Class Imbalance and the use of weighting and undersampling

I have been working on a machine learning model using Spark's (binomial) LogisticRegression. The dataset has what I think is a high degree of imbalance: roughly 1% of rows are labelled as events. The original author has used a weightCol to try and…
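One common way to fill a weightCol for a binary problem is inverse-class-frequency weighting, so that each class contributes the same total weight. A stdlib sketch of that arithmetic (the function name and the 1%-positive toy data are illustrative, not from the question):

```python
def balancing_weights(labels):
    """Per-row weights for binary 0/1 labels such that the total weight of
    the positive class equals the total weight of the negative class.
    This is a common recipe for a Spark LogisticRegression weightCol.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    w_pos = n / (2.0 * n_pos)   # rare class gets a large weight
    w_neg = n / (2.0 * n_neg)   # common class gets a small weight
    return [w_pos if y == 1 else w_neg for y in labels]

# Roughly the imbalance described in the question: 1 event in 100 rows
labels = [1] * 1 + [0] * 99
weights = balancing_weights(labels)
print(weights[0])   # 50.0 (the single positive row)
print(weights[-1])  # about 0.505 (each negative row)
```

Weighting keeps all rows (unlike undersampling, which discards negatives), but note that either rebalancing distorts the model's predicted probabilities relative to the true base rate.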
2 votes, 0 answers

How to perform data quality check on large number of features using Spark?

I am used to working with a manageable number of features. I usually print some descriptive statistics and visualise the histograms of each feature using Python and Pandas or R. I check for outliers and whether the data points follow a normal distribution or…