Questions tagged [spark-mllib]

The Apache Spark distributed machine learning library.

MLlib is the machine learning library for the Apache Spark distributed computing platform. It contains implementations of many standard machine learning algorithms for a distributed setting.

58 questions
7 votes, 1 answer

Difference Between Linear Regression in Machine Learning and Statistical Model

I had the understanding that the major difference between machine learning and a statistical model is that the latter "assumes" a certain type of distribution of the data, and based on that, a different model paradigm as well as the statistical results we obtain (e.g.…
Beta (5,784)
7 votes, 5 answers

K-Means Cluster has over 50% of the points in one cluster. How to optimize it?

I am running a clustering algorithm in Spark and I have to choose between K-Means and Bisecting K-Means. However, the only thing that differs between the two is the runtime, because the performance is equally bad. I have a dataset of some 1.3 million…
5 votes, 1 answer

Given a topic distribution over words from LDA model how to calculate document distribution over topics for new document?

I'm using Spark 1.6.2 via the Python API. It seems that, as of when this post is being written, the only data available from the LDA (latent Dirichlet allocation) model calculations is the topic distribution over words, i.e. p(word | topic). What I…
thecity2 (1,485)
5 votes, 1 answer

Is it possible to share models between R, scikit-learn and Spark?

If I create machine learning models in Python or R, is it possible to export the models in a format that could be imported by Spark MLlib?
Chris Snow (619)
4 votes, 1 answer

How to estimate most important dimensions of the clusters after performing k-means?

I need to cluster customers of retail shops based on the products that they purchased. Therefore, I need to obtain, as results, both the customers belonging to each cluster and, for each cluster, the products that most influence the specified…
Nko (41)
4 votes, 2 answers

How to use Kullback-Leibler divergence if the mean and standard deviation of two Gaussian distributions are provided?

With the Apache Spark MLlib library I am trying to find clusters within a dataset by using a Gaussian Mixture Model (number of clusters = 3). It now returns 3 different values of mean and standard deviation. I am trying to find out whether there exists any…
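For two univariate Gaussians there is a closed form: KL(N(mu1, s1^2) || N(N mu2, s2^2)) = ln(s2/s1) + (s1^2 + (mu1 - mu2)^2) / (2 s2^2) - 1/2. A minimal stdlib-only sketch (not from the question; the function name and test values are illustrative):

```python
import math

def kl_univariate_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2)) in nats.

    Note: KL divergence is asymmetric, so swapping the two
    distributions generally gives a different value.
    """
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

# KL of a distribution against itself is exactly zero
print(kl_univariate_gaussians(0.0, 1.0, 0.0, 1.0))  # 0.0

# Distinct components (e.g. two GMM components) give a positive divergence
print(kl_univariate_gaussians(0.0, 1.0, 2.0, 1.5))
```

With a 3-component mixture one would typically evaluate this for each pair of components; there is no simple closed form for the KL between two full mixtures.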
4 votes, 0 answers

(Cross) Correlation of time series with very different sampling intervals (sec. vs days)

This is my first post on Cross Validated. I read a lot of questions related to my problem, but none was completely satisfying. I have two time series that are sampled at very different time intervals, e.g. one is sampled every 10 seconds, while…
3 votes, 2 answers

Understanding and interpreting the output of Spark's TF-IDF implementation

I am currently trying to understand what the example code provided as part of Spark's TF-IDF implementation is doing. Given the example code block (taken from Spark's GitHub repository) val sentenceData = sess.createDataFrame(Seq( (0.0, "Hi I…
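Spark MLlib's IDF uses the smoothed formula log((m + 1) / (df + 1)), with m the number of documents and df the term's document frequency, multiplied by the raw term frequency. A plain-Python sketch of that arithmetic (Spark's HashingTF actually hashes terms to vector indices; this sketch keeps terms as readable strings, and the function name is made up for illustration):

```python
import math
from collections import Counter

def spark_style_tfidf(docs):
    """Raw term frequency times Spark-style smoothed IDF: log((m+1)/(df+1))."""
    m = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # each doc counts a term at most once
    idf = {t: math.log((m + 1) / (df[t] + 1)) for t in df}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

docs = [["hi", "i", "heard", "about", "spark"],
        ["i", "wish", "java", "could", "use", "case", "classes"]]
scores = spark_style_tfidf(docs)

# "i" occurs in every document, so its IDF is log(3/3) = 0
print(scores[0]["i"])  # 0.0
# "spark" occurs in one of two documents, so its score is log(3/2)
print(scores[0]["spark"])
```

This explains a behavior that often puzzles readers of the example: terms present in every document get a TF-IDF score of exactly zero.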
3 votes, 1 answer

How to handle the sparse data problem in unsupervised learning? I'm going to use k-means on the dataset

How can I handle the sparse data problem in unsupervised learning? I'm going to use k-means on the dataset. I have 200 variables, and nearly every column has about 70% zeros. How can I handle this without discarding any column?
3 votes, 1 answer

Linear Regression in Spark's MLlib gives a seemingly incorrect result

I am running the example found here. The training data for the model can be found in this CSV, where the first column is the response variable and the second column is a space separated list of predictors. After running the example with the modified…
Jon Claus (535)
3 votes, 1 answer

How to apply word2vec for k-means clustering?

Background: I am new to word2vec. By applying this method, I am trying to form some clusters based on words extracted by word2vec from scientific publications' abstracts. To this end, I have first retrieved sentences from the abstracts via…
mlee_jordan (209)
2 votes, 0 answers

MLeap and Spark ML SQLTransformer

I have a question. I am trying to serialize a PySpark ML model to MLeap. However, the model makes use of the SQLTransformer to do some column-based transformations, e.g. adding log-scaled versions of some columns. As we all know, MLeap doesn't…
femibyte (131)
2 votes, 0 answers

If I can put all my data in memory, why do I need frameworks like Spark?

Just wondering: if my organisation's data never runs into sizes that are bigger than my instances' memory, why do I need something like Spark? I can scale the memory up using cloud instances; these days it seems that you can really push the…
lppier (73)
2 votes, 0 answers

Logistic Regression Class Imbalance and the use of weighting and undersampling

I have been working on a machine learning model using Spark's (binomial) LogisticRegression. The dataset has what I think is a high degree of imbalance: roughly 1% of rows are labelled as events. The original author has used a weightCol to try and…
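One common way to fill a weightCol for a binary problem is inverse-class-frequency weighting, so that each class contributes the same total weight. A stdlib sketch of that arithmetic (the function name and the 1%-positive toy data are illustrative, not from the question):

```python
def balancing_weights(labels):
    """Per-row weights for binary 0/1 labels such that the total weight of
    the positive class equals the total weight of the negative class.
    This is a common recipe for a Spark LogisticRegression weightCol.
    """
    n = len(labels)
    n_pos = sum(labels)
    n_neg = n - n_pos
    w_pos = n / (2.0 * n_pos)   # rare class gets a large weight
    w_neg = n / (2.0 * n_neg)   # common class gets a small weight
    return [w_pos if y == 1 else w_neg for y in labels]

# Roughly the imbalance described in the question: 1 event in 100 rows
labels = [1] * 1 + [0] * 99
weights = balancing_weights(labels)
print(weights[0])   # 50.0 (the single positive row)
print(weights[-1])  # about 0.505 (each negative row)
```

Weighting keeps all rows (unlike undersampling, which discards negatives), but note that either rebalancing distorts the model's predicted probabilities relative to the true base rate.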
2 votes, 0 answers

How to perform data quality check on large number of features using Spark?

I am used to working with a manageable number of features. I usually print some descriptive statistics and visualise the histograms of each feature using Python and Pandas or R. I check for outliers and whether the data points follow a normal distribution or…