Questions tagged [scalability]

11 questions
21
votes
1 answer

How can we simulate from a geometric mixture?

If $f_1,\ldots,f_k$ are known densities from which I can simulate, i.e., for which an algorithm is available, and if the product $$\prod_{i=1}^k f_i(x)^{\alpha_i}\qquad \alpha_1,\ldots,\alpha_k>0$$ is integrable, is there a generic approach to…
Xi'an
6
votes
4 answers

Solving a practical machine learning problem

I am currently doing my PhD in computational biology at Stanford. I get the data I need to answer the questions I am interested in. The data sets are sometimes "large" and these large problems take longer to solve (a couple of days…
Sid
4
votes
2 answers

What are some uses of logistic regression at scale?

Many libraries that scale linear and logistic regression assume a tall-skinny design matrix (many samples, few features), but I don't understand why you would need billions of samples if your data has 250 features. In what scenarios would more data…
4
votes
0 answers

How does t-SNE slow down with increasing number of dimensions?

I'm trying to understand the computational bounds of t-SNE. It's learned with SGD, so it'll have to go through some number of gradient-descent iterations. We can ignore that here, and focus on the time for each iteration. Barnes-Hut changes it…
2
votes
2 answers

Bisecting K-means using Dynamic Time Warping

I'm trying to cluster time series of different lengths, and I came up with the idea of using DTW as a similarity measure, which seems adequate. However, I cannot use it with K-means, since it's hard to define centroids based on time series…
Kobe-Wan Kenobi
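For readers unfamiliar with DTW, here is a minimal sketch of the classic dynamic-programming distance between two series of different lengths. The function name `dtw_distance` and the absolute-difference local cost are assumptions made for illustration and are not taken from the question.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences.

    Illustrative sketch: uses |a_i - b_j| as the local cost and no
    warping-window constraint, so it runs in O(len(a) * len(b)) time.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of the three admissible predecessor cells
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b stays
                cost[i][j - 1],      # b advances, a stays
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Series of different lengths that differ only by a repeated value
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Because the result is only a pairwise dissimilarity, it plugs directly into methods that need distances between series rather than centroids, which is the difficulty the question raises for K-means.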
1
vote
0 answers

Handling multiple time series through a common model?

I have 1.5 lakh (150K) time series, divided by geographic location; there are 32 locations in total. The customer expects a minimum number of models for forecasting all 150K series. How should I cluster my time series in such a scenario? DTW/…
1
vote
1 answer

Bayes and Naive Bayes code implementations

I know that the Bayes classifier assigns a new data point $\pmb{x}$ to the class $\omega_j, \ j=1,\dots,M$, when $p(\omega_j \mid \pmb{x}) = \max_{q=1,\dots,M}p(\omega_q \mid \pmb{x})$, where $p(\omega_j\mid \pmb{x}) = \frac{p(\pmb{x}\mid…
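The decision rule in this excerpt can be sketched in a few lines. The example below is only an illustration under stated assumptions: the class-conditional densities are taken to be one-dimensional Gaussians, and the names `gaussian_pdf` and `bayes_classify` are hypothetical, not from the question. Posteriors are obtained from Bayes' theorem by normalising prior times likelihood.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def bayes_classify(x, priors, likelihoods):
    """Return (index of the maximum-posterior class, list of posteriors).

    priors:      P(omega_j) for each class
    likelihoods: callables giving p(x | omega_j) for each class
    """
    # Unnormalised posteriors: p(x | omega_j) * P(omega_j)
    scores = [prior * lik(x) for prior, lik in zip(priors, likelihoods)]
    evidence = sum(scores)  # p(x), the normalising constant
    posteriors = [s / evidence for s in scores]
    best = max(range(len(posteriors)), key=posteriors.__getitem__)
    return best, posteriors


# Two equally likely classes with assumed Gaussian class-conditionals
likelihoods = [
    lambda x: gaussian_pdf(x, mean=0.0, std=1.0),
    lambda x: gaussian_pdf(x, mean=4.0, std=1.0),
]
label, posteriors = bayes_classify(1.0, [0.5, 0.5], likelihoods)
print(label, posteriors)
```

A naive Bayes variant would keep the same rule but replace each multivariate likelihood with a product of per-feature densities.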
1
vote
2 answers

Best Scalable Classification Algorithms

I have a very large data set that I want to perform classification tasks on. There are about 40 million instances, 16 features, and 2 classes. I'm attempting to use scikit-learn LinearSVC and LogisticRegression, but after several hours the…
MVTC
0
votes
1 answer

Persistent Cluster IDs for DBSCAN

When executing the DBSCAN algorithm over multiple runs on similar (but not identical) data, I would like to generate persistent IDs so we can monitor how the clusters change over time. Selection of another algorithm is not possible. This question…
John Zhu
0
votes
1 answer

Scalable machine learning for bigger data

I am aware of the theory of stochastic gradient descent, which is a faster way of fitting a linear regression. Through this we can have an 'optimized implementation' of linear regression. There are similar techniques for non-parametric methods as…
StatguyUser
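As a point of reference for the stochastic gradient descent idea mentioned in this excerpt, here is a minimal sketch for a one-feature linear regression with squared-error loss. The function name `sgd_linear_regression`, the learning rate, the epoch count, and the toy data are all assumptions chosen for illustration.

```python
import random


def sgd_linear_regression(xs, ys, lr=0.05, epochs=500, seed=0):
    """Fit y ~ w * x + b by stochastic gradient descent.

    Each update uses a single observation, so the cost per step does not
    depend on the size of the data set.
    """
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the observations in random order
        for i in order:
            err = (w * xs[i] + b) - ys[i]  # residual for one observation
            # gradient of 0.5 * err**2 with respect to w and b
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b


# Noise-free toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = sgd_linear_regression(xs, ys)
print(w, b)
```

On real data the step size usually needs tuning or a decay schedule, and features are typically standardised first; neither is shown here.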
0
votes
1 answer

Scalability comparison with the help of regression

I created an algorithm and I tested it against a current algorithm. The results are in this form:

Power  Processes  Method  Time(s)
1      3          1       19,94
1      4          1       20,04
1      5          1       20,06
1      6          …