February 21st, 2008

returning a random row from MySQL

Something this useful: you'd suppose MySQL would do it efficiently. Return a random row from a table.

the conventional way to do it is:

select * from mytable order by rand() limit 1

that's fine on very small tables... but as you apply it to larger datasets, it gets quite expensive. I applied the above method to a table with 800,000 rows and averaged 12807 ms over 100 attempts.

greggdev suggests another technique, which has two deficiencies: 1) it's not random, and 2) it's not SQL.

So I came up with this idea. First, you create a RAND column in your table with a DOUBLE datatype, and store a random number in it.

update mytable set rand=rand()

This populates the rand column with random double-precision numbers. You only need to do this once.

Then when you want a random row, select the first row which is greater than a random number.

select * from mytable where rand >rand() order by rand limit 1

Running that query 100 times revealed an average execution of 6703 ms. About twice as fast!

It's still not an efficient solution. The table is still being sorted on a column of random numbers - only with my technique, the random numbers are prebaked. The only reason it goes twice as fast is because the conventional technique has to create 800,000 random numbers as part of the query. In my technique, SQL only creates one random number. Both methods sort the entire table and return one row.

Moreover, the results are not especially random, especially with smaller datasets. Imagine the analogy: you've thrown a pack of cards down the stairs. Then you roll the dice: 5. So, you decide, you'll go to the fifth step, start walking up and choose the first card you see.

For subsequent selections, you don't re-throw the cards - instead you just roll the dice again.

That seems like it should produce a good random selection, as long as you assume an even distribution of cards on the steps. But - what if most of the cards fell on the first two steps, and the lower steps don't have as many cards on them? Those lower cards will have a greater chance of being lower than my dice roll than ones that are packed closer together.

On a large dataset, this technique is pretty good. But really, the randomness only persists if you rethrow the cards. You could try this:

select max(*) from mytable order by rand limit 1; update dict set rand=rand()

in essence, you do your superfast selection, which your application may use right away. Then in an asynchronous process, rethrow the cards. Problem there is that in my 800,000 record table, the UPDATE takes about 37 seconds to execute. It wouldn't take too many of those per second to shut my database down! Heaven forbid if I use random rows in a loop! This method is actually slower than the conventional ORDER BY rand() technique.

The next alternate technique is to pick your random rows in advance, before the query is requested, and keep them in a queue for fast retreival. Quick analogy: shuffle the deck. The process would be: 1) pick 100 random rows (which would potentially take a few minutes to complete), 2) pop those off the temp table as you need them, then 3) when your stack is running low, do it again. This technique could have advantages, but for me it seems a bit too "cron" and not very elegant.

By this time it's obvious that the best way to grab a random row would be to count the number of rows n, choose a random number between 0 and n, then just grab the nth row. Thankfully, MySQL has a way to do that, using the LIMIT offset. Offset is usually used for paginating data, e.g. to show results 50-59,60-69,70-79 etc. of a large data set. Here, we're going to use it as described above.

The SQL looks like this:

SELECT * FROM mytable LIMIT offset,1

where offset is an integer between 0 and COUNT(*).

This can be done in pure SQL, but it goes a little faster when you get PHP involved. This technique, adapted from akinas.com, looks like this:

[PHP]

None of these methods are particularly fast. Simply choosing a row takes 3-4 seconds, and when I add a WHERE clause to limit the search, the query takes 6-7 seconds to complete. But the last one seems to be the best I've tried so far.

Filed under Uncategorized

Leave a Reply