Questions tagged [h2o]

H2O is an in-memory platform for distributed, scalable machine learning. H2O uses familiar interfaces like R, Python, Scala, Java, JSON and the Flow notebook/web interface, and works seamlessly with big data technologies like Hadoop and Spark. H2O provides implementations of many popular algorithms such as GBM, Random Forest, Deep Neural Networks, Word2Vec and Stacked Ensembles. It is an open source project maintained by H2O.ai (formerly known as 0xdata).

77 questions
6
votes
1 answer

Special values in continuous numerical variables/features in Random Forest

I have a binary response variable I am seeking to predict using Random Forest. I have a sizable dataset of 150k rows and about 200 independent variables or features with which to model the outcome. Many of my features are continuous numerical…
JPErwin
  • 443
  • 2
  • 10
6
votes
2 answers

H2O: Can I use H2O for time series predictions?

I understand that there is no specific model for time series modeling in H2O. Is there a workaround that uses Deep Learning and/or GBM? Is some kind of data transformation necessary? Are there any examples? Are there any plans for ARIMA or…
erculeo
  • 73
  • 1
  • 3
6
votes
3 answers

How does h2o handle time-series cross validation?

I've read "How does h2o.r cross validation work?". However, for a time series dataset, does H2O support the type of CV described in "Using k-fold cross-validation for time-series model selection"? In particular, something like this: fold 1 :…
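One common reading of time-series cross-validation is an expanding-window (forward-chaining) scheme, in which each fold trains only on rows that precede its test block. The sketch below illustrates that idea in plain Python. It assumes the rows are already sorted by time, it is a guess at what the truncated excerpt describes, and `expanding_window_folds` is a hypothetical helper, not part of the H2O API.

```python
def expanding_window_folds(n_rows, n_folds):
    """Yield (train_indices, test_indices) pairs for forward-chaining CV.

    Assumes rows 0..n_rows-1 are in chronological order.
    Hypothetical helper for illustration; not an H2O function.
    """
    # Split the series into n_folds + 1 blocks: the first block is
    # training-only, and each later block serves once as a test set.
    block = n_rows // (n_folds + 1)
    if block == 0:
        raise ValueError("not enough rows for the requested number of folds")
    for k in range(1, n_folds + 1):
        train_end = k * block
        # The last fold absorbs any leftover rows.
        test_end = n_rows if k == n_folds else train_end + block
        yield list(range(train_end)), list(range(train_end, test_end))


if __name__ == "__main__":
    for train_idx, test_idx in expanding_window_folds(10, 4):
        print("train:", train_idx, "test:", test_idx)
```

Each fold's training set grows while its test block always lies strictly later in time, so no future row is used to predict a past one.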
5
votes
3 answers

Random forest variable importance in h2o (classification problem)

I cannot find out how the variable importance for classification problems is calculated in h2o. There is a Stack Overflow question asking the same, but the accepted answer does not help (it keeps referring to "squared error" where I would expect…
cryo111
  • 160
  • 1
  • 5
5
votes
1 answer

With H2O AutoML is it okay to use my test set as the leaderboard?

Normally in machine learning we will split our data into train, valid and test. The valid data is used to tune the parameters, and the test data is then used to check the performance of our best tuned model. (Watching out for notably different…
Darren Cook
  • 1,772
  • 1
  • 12
  • 26
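The train/valid/test convention described in the question above can be sketched in plain Python as follows. This is a minimal illustration, not H2O's own splitting routine: `three_way_split`, the default fractions, and the seed are all assumptions chosen for the example.

```python
import random


def three_way_split(rows, valid_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle rows and split them into (train, valid, test) lists.

    Hypothetical helper for illustration; the fractions and the seed
    are arbitrary assumptions.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # reproducible shuffle
    n_test = int(len(shuffled) * test_frac)
    n_valid = int(len(shuffled) * valid_frac)
    test = shuffled[:n_test]                    # held out until the very end
    valid = shuffled[n_test:n_test + n_valid]   # used for tuning
    train = shuffled[n_test + n_valid:]         # used for fitting
    return train, valid, test


if __name__ == "__main__":
    train, valid, test = three_way_split(range(100))
    print(len(train), len(valid), len(test))
```

The test portion is carved off first and never touched during tuning, which is the property the question is concerned about preserving.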
4
votes
1 answer

H2O AUTOML: How to save, reuse and build on top of existing automl models?

I have two questions on h2o.automl and I couldn’t find any documentation on these topics. I can save/reuse the leader (automl) model in R using h2o.saveModel and h2o.loadModel. But how do I save/reuse other automl models, say the 6th model in the…
qed
  • 141
  • 1
  • 4
4
votes
1 answer

Training threshold vs validation threshold for better prediction results?

Should I use a model's training threshold or its validation threshold to get the best results from a distributed random forest binary classifier built using h2o.ai, especially when their values differ by orders of magnitude? Details: Used h2o…
4
votes
2 answers

Deviances in H2O

Does anyone know exactly how the deviances (Poisson, Gamma, Tweedie) are computed in H2O? I cannot find the functions. For interpretation purposes I would like to know the calculations. Thank you!
Zugi
  • 41
  • 1
4
votes
1 answer

Regularized GLM with aggregated data

I am fitting a Poisson GLM to model claim rates. Since I have 1.5m+ records, I have aggregated my data (to improve efficiency). My understanding is that using aggregated data with a Poisson GLM will not affect the coefficient estimates. Indeed, if I…
4
votes
1 answer

use of sample_rate = 1 in randomForest - to fit a single tree

I would like to fit a single tree. In the h2o R package, I can use h2o.randomForest() with the following options: h2o.randomForest(y = y, x = x, training_frame = data, ntrees = 1, mtries =…
Sergey
  • 41
  • 1
3
votes
3 answers

LASSO or random forest (RF) to use for variable selection when having highly correlated features in a relatively small dataset with many features?

I have a cross-sectional dataset with around 1000 features and 5000 observations. There are many features (no categorical features) which are highly correlated (higher than 0.85). I want to decrease my feature set before modelling. I know that…
mlee_jordan
  • 209
  • 1
  • 2
  • 10
3
votes
1 answer

New factor levels in testing data set not present in training data in h2o.randomForest

In random forest classification using the h2o package, there are factor levels which are present in the testing data but not in the training data. There is a warning message when predicting the values of the testing data; it says: test/validation dataset column 'x'…
PA17
  • 31
  • 2
3
votes
1 answer

binomial responses in h2o gbm

I am modeling the probability of success in a dataset where I have both the number of trials and the number of successes (and, obviously, I am modeling $p_i=\frac{\text{total successes}}{\text{total trials}}$). I wonder how to do it in h2o, since the classical…
Giorgio Spedicato
  • 3,444
  • 4
  • 29
  • 39
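One generic workaround for aggregated binomial data, offered here as an assumption rather than as the answer given to the question above, is to expand each (trials, successes) record into at most two weighted rows: a success row weighted by the number of successes and a failure row weighted by the number of failures. The sketch assumes the modelling function accepts per-row weights (H2O's `weights_column` parameter is the assumed entry point), and `expand_binomial_rows` is a hypothetical helper.

```python
def expand_binomial_rows(records):
    """Turn (features, trials, successes) records into weighted 0/1 rows.

    Returns a list of (features, response, weight) tuples.
    Hypothetical helper for illustration; not an H2O function.
    """
    rows = []
    for features, trials, successes in records:
        if not 0 <= successes <= trials:
            raise ValueError("successes must lie between 0 and trials")
        failures = trials - successes
        if successes > 0:
            rows.append((features, 1, successes))  # success row, weighted by count
        if failures > 0:
            rows.append((features, 0, failures))   # failure row, weighted by count
    return rows


if __name__ == "__main__":
    for row in expand_binomial_rows([("a", 10, 3), ("b", 5, 5)]):
        print(row)
```

Under a Bernoulli likelihood, the weighted rows contribute the same log-likelihood as the original binomial counts up to a constant, which is why the expansion is a common substitute when a grouped-binomial response is not available.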
3
votes
1 answer

categorical_encoding in h2o - what is the difference between the options

I'm trying to understand the pros/cons and when to use the various encoding options that are available to me in h2o with the parameter 'categorical_encoding'. It would be helpful if people could point out general rules of thumb on how to use…
user3788557
  • 1,479
  • 4
  • 22
  • 24
3
votes
1 answer

H2O PCA number of components

I wonder why the number of components in the H2O PCA algorithm is limited to 9. That is sometimes not enough. k: Specify the rank of matrix approximation. This can be a value from 1 to 9 and defaults to 1.
NiMa
  • 31
  • 1