2

I have a database with 500 records. I want to split these records to 75% and 25% *randomly*in order to use the different datasets for training and testing to machine learning algorithms. Does anyone knows how to do that? For example using an sql query

  • To Get Random Rows Fom the SQL DATAtable Or DATASET follow below ling http://niravdaraniya.blogspot.in/2013/07/how-to-generate-random-rows-in-dataset.html And also here http://codeoverflow.net/2013/07/23/how-to-generate-random-rows-in-dataset-using-c-net/ –  May 28 '14 at 06:10

2 Answers2

4

You did not provide SQL product so I will give one example. You need a random function from your DBMS. For example oracle provides dbms_random.value, SQL server NEWID or RAND(). MySql provides RAND().

Assume that you have a table named DATA which holds your 500 records.

WITH TEMP
(
  SELECT  ROW_NUMBER() AS ROW_ID , NEWID() as RANDOM_VALUE,D.*
        FROM   DATA D
       ORDER BY RANDOM_VALUE
)

SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID =< 375
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 375

Trick is to Order by Random Function. 375 = 500 * 75/100. that is %75 of data is training and %25 is testing.

This example for SQL Server but it will work with Oracle and Other vendors if you change random function.

Atilla Ozgur
  • 1,251
  • 1
  • 11
  • 17
  • Thank you very much. You helped me very much! Voting as best answer –  May 07 '13 at 14:07
1

A sequential algorithm using conditional probabilities can also be used. To demonstrate a simple case, suppose you want to sample 1 item out of 4. Generate a uniform[0,1] random number $U_1$. If $U_1 < {1 \over 4}$, select the first element. If not, generate $U_2$. If $U_2 < {1 \over 3}$, select the second element. If not, generate $U_3$. If $U_3 < {1 \over 2}$, select the third element. If not, select the last element. The probability of selecting the first element is $1 \over 4$. The probability of selecting the 2nd element is $({3 \over 4})({1 \over 3}) = {1 \over 4}$. And so on for the 3rd and 4th elements. Google "Faster Methods for Random Sampling" to find a paper by JS Vitter that presents several sequential methods.

soakley
  • 4,341
  • 3
  • 16
  • 27