Split dataset randomly

Question

I have a database with 500 records. I want to split these records to 75% and 25% *randomly*in order to use the different datasets for training and testing to machine learning algorithms. Does anyone knows how to do that? For example using an sql query

To Get Random Rows Fom the SQL DATAtable Or DATASET follow below ling http://niravdaraniya.blogspot.in/2013/07/how-to-generate-random-rows-in-dataset.html And also here http://codeoverflow.net/2013/07/23/how-to-generate-random-rows-in-dataset-using-c-net/ — , May 28 '14 at 06:10

Atilla Ozgur · Accepted Answer · 2013-05-01T21:35:39.113

You did not provide SQL product so I will give one example. You need a random function from your DBMS. For example oracle provides dbms_random.value, SQL server NEWID or RAND(). MySql provides RAND().

Assume that you have a table named DATA which holds your 500 records.

WITH TEMP
(
  SELECT  ROW_NUMBER() AS ROW_ID , NEWID() as RANDOM_VALUE,D.*
        FROM   DATA D
       ORDER BY RANDOM_VALUE
)

SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID =< 375
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 375

Trick is to Order by Random Function. 375 = 500 * 75/100. that is %75 of data is training and %25 is testing.

This example for SQL Server but it will work with Oracle and Other vendors if you change random function.

Thank you very much. You helped me very much! Voting as best answer — , May 07 '13 at 14:07

score 1 · Answer 2 · answered May 02 '13 at 00:37

A sequential algorithm using conditional probabilities can also be used. To demonstrate a simple case, suppose you want to sample 1 item out of 4. Generate a uniform[0,1] random number $U_1$. If $U_1 < {1 \over 4}$, select the first element. If not, generate $U_2$. If $U_2 < {1 \over 3}$, select the second element. If not, generate $U_3$. If $U_3 < {1 \over 2}$, select the third element. If not, select the last element. The probability of selecting the first element is $1 \over 4$. The probability of selecting the 2nd element is $({3 \over 4})({1 \over 3}) = {1 \over 4}$. And so on for the 3rd and 4th elements. Google "Faster Methods for Random Sampling" to find a paper by JS Vitter that presents several sequential methods.

Split dataset randomly

2 Answers2