Re: [Monetdb-developers] MonetDB: default - The sample module for the monetdb5 server.

Roberto, we were also puzzled with which syntax to adopt. The first issue is that the syntax is also dictating in a way implementation (although that sounds not correct:)). A syntax that has a order by rand() implies that you are attaching random numbers to the rows and then sort on that. This sorting is very coslty and it is the worst case since you have random order to begin with. Something like that would cost NlogN where N is the size of rows. Moreover, that syntax implies that you are sampling the base table (or some intermediate result anyway) and then "continue" the query evaluation. however sampling is not a very kind operator and cannot by easily pushed down: a sample of a join is not equal with the join of 2 samples. These are in my view the implications (or benefits!) of such a syntax. So MonetDB could as well support the syntax you propose and should! On the other hand, the syntax SAMPLE we just introduce, is applied solely on the final result set of the query, after or query evaluation is done. The algorithm has a complexity of S, where S is the size of the sample (as opposed to NlogN of the previous). Also , the sample operator has the same semantics of limits, it is an operator that shows a limited (random) portion of the entire (usually computed) result. that means that all joins, predicated etc. have been evaluated before a sample is taken. Of course we would love to have that functionality in sub-queries too, hence the workaround. What is really missing now is a function where the table definition is a parameter, indeed! With this 2 views in mind, I would suggest that both syntax are necessary to exist! Each of them serving a different purpose (and different underline implementation eventually). the other solution would be to allow SAMPLE is subqueries and "depart" from the SQL spirit. What do you think? lefteris On Mon, Sep 26, 2011 at 10:41 AM, Roberto Cornacchia < roberto.cornacchia@gmail.com> wrote:

Yes, I agree about the difference in efficiency. My considerations were only about usability and in particular from the point of view of *generated* SQL. The syntax I suggested works well in MonetDB, so yes, both can be useful, with different focus. They actually achieve the same result, in two different ways. The SAMPLE clause is a special top-k, that draws k tuples in random order The syntax I'm using is a special ranking. It ranks all tuples randomly and then takes the first k. To allow the SAMPLE syntax in subqueries would also imply to allow LIMIT in subqueries, because they really boil down to the same semantics. Although I do think SQL should allow this, I do not think that MonetDB/SQL should deviate from the standard. Roberto On Mon, Sep 26, 2011 at 15:30, Lefteris <lsidir@gmail.com> wrote:
participants (2)
-
Lefteris
-
Roberto Cornacchia