I have a database with 500 records. I want to split these records to 75% and 25% *randomly*in order to use the different datasets for training and testing to machine learning algorithms. Does anyone knows how to do that? For example using an sql query
-
To Get Random Rows Fom the SQL DATAtable Or DATASET follow below ling http://niravdaraniya.blogspot.in/2013/07/how-to-generate-random-rows-in-dataset.html And also here http://codeoverflow.net/2013/07/23/how-to-generate-random-rows-in-dataset-using-c-net/ – May 28 '14 at 06:10
2 Answers
You did not provide SQL product so I will give one example. You need a random function from your DBMS. For example oracle provides dbms_random.value, SQL server NEWID or RAND(). MySql provides RAND().
Assume that you have a table named DATA which holds your 500 records.
WITH TEMP
(
SELECT ROW_NUMBER() AS ROW_ID , NEWID() as RANDOM_VALUE,D.*
FROM DATA D
ORDER BY RANDOM_VALUE
)
SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID =< 375
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 375
Trick is to Order by Random Function. 375 = 500 * 75/100. that is %75 of data is training and %25 is testing.
This example for SQL Server but it will work with Oracle and Other vendors if you change random function.

- 1,251
- 1
- 11
- 17
A sequential algorithm using conditional probabilities can also be used. To demonstrate a simple case, suppose you want to sample 1 item out of 4. Generate a uniform[0,1] random number $U_1$. If $U_1 < {1 \over 4}$, select the first element. If not, generate $U_2$. If $U_2 < {1 \over 3}$, select the second element. If not, generate $U_3$. If $U_3 < {1 \over 2}$, select the third element. If not, select the last element. The probability of selecting the first element is $1 \over 4$. The probability of selecting the 2nd element is $({3 \over 4})({1 \over 3}) = {1 \over 4}$. And so on for the 3rd and 4th elements. Google "Faster Methods for Random Sampling" to find a paper by JS Vitter that presents several sequential methods.

- 4,341
- 3
- 16
- 27