
Hi,

MonetDB will automatically create the hashes if that is considered worthwhile for the join operations. These hashes are non-persistent (in the stable releases), and it is known that there are corner cases where the hash is built on the wrong join leg. Calling the server with --algorithms will show you some details on what decisions are made. In general, understanding the behavior of a system should be a multistep approach, from tiny, small, ... up to huge tables. Then the SQL TRACE command will show you where time is actually spent. This is particularly relevant when queries become more complex.
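For example, on scaled-down copies of your two tables you could run something like the following from mclient; the query and the LIMIT are only meant as a small illustration of what a TRACE run could look like, not a prescription:

    TRACE
    SELECT te.job_id, te.task_id, te.class, te.priority,
           tu.cpu_mean, tu.cpu_sample
    FROM task_events_reduced AS te
    JOIN task_usage_reduced AS tu
      ON te.job_id = tu.job_id AND te.task_id = tu.task_id
    LIMIT 1000;

regards,
Martin

On 08/04/15 21:05, Valentin Reis wrote: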
Quoting Niels Nes (2015-04-08 19:16:09)
On Wed, Apr 08, 2015 at 04:01:29PM +0200, Valentin Reis wrote:
Hello all,

For the purpose of evaluating MonetDB, I am trying to perform the following:
- Load two CSV files into two tables, from disk A.
- Perform a JOIN operation on two fields, using disk B for the database.
- Pipe the result to a CSV file, on disk A.
This results in very high usage of disk B and a subsequent MT_mremap failure, as seen in merovingian.log:

2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed
Does somebody have ideas or explanations? Details follow:
I am loading CSV data from hard drive A. The files have size:

-rw-r--r-- 1 fre fre 2.6G Apr  5 16:25 task_events_cut.csv
-rw-r--r-- 1 fre fre  34G Apr  5 16:04 task_usage_cut.csv
I create a database on a separate hard drive B (2 TB):
rm -rf /mnt/diskB/mdb
mkdir /mnt/diskB/mdb/
monetdbd create /mnt/diskB/mdb
monetdbd start /mnt/diskB/mdb
monetdb create google-trace
monetdb release google-trace
I then load the data into the database:
mclient ct_trace_events_reduced.sql -d google-trace
pv /mnt/diskA/task_events_cut.csv | mclient -d google-trace -s "COPY INTO task_events_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
mclient ct_trace_usage_reduced.sql -d google-trace
pv /mnt/diskA/task_usage_cut.csv | mclient -d google-trace -s "COPY INTO task_usage_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
using very standard scripts:
$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql
DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced (
  "job_id"   BIGINT,
  "task_id"  BIGINT,
  "class"    SMALLINT,
  "priority" SMALLINT
);
DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced (
  "job_id"     BIGINT,
  "task_id"    BIGINT,
  "cpu_mean"   FLOAT,
  "cpu_sample" FLOAT
);
These two operations take about 50 minutes, which is very reasonable.
I then use mclient to do my join:
mclient join.sql -d google-trace
using the script:
$ cat join.sql
COPY (
  SELECT te.job_id, te.task_id, te.class, te.priority,
         tu.cpu_mean, tu.cpu_sample
  FROM (SELECT * FROM task_events_reduced) AS te
  RIGHT JOIN (SELECT * FROM task_usage_reduced) AS tu
    ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
) INTO '/diskA/join.csv' USING DELIMITERS ',','\n';

Why are you using a RIGHT 'OUTER' join? Outer joins are much more expensive than inner joins.

Niels

Hello,
Denote: A = task_usage_reduced, B = task_events_reduced.

Essentially, the goal is to augment the rows of A (the big table) with some information (class and priority) from B (the smaller table).

I am not sure whether B contains information for all (job_id, task_id) entries of A. In the cases where a row from A has no matching entry in B, I would like that row to still be present in the output, only with NAs instead of the information from B. Hence the RIGHT JOIN. Please do correct me again if I am mistaken; the SQL class is far away :)
What I hoped for from MonetDB (and what I still hope to get :) ) is the following behavior, denoting by n the length of A and by m the length of B:
- build a hash table of B indexed by (job_id, task_id), somehow O(m);
- augment each row of A by doing an O(1) lookup in the hash table, i.e. O(n).
Therefore, O(m+n) processing time with O(m+n) memory usage (hash table + result).
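I suppose one way to see whether this is what the optimizer intends would be to look at the plan, with something along these lines (I am only assuming here that the MAL plan printed by EXPLAIN makes it visible on which join leg the hash would be built):

    EXPLAIN
    SELECT te.job_id, te.task_id, te.class, te.priority,
           tu.cpu_mean, tu.cpu_sample
    FROM task_events_reduced AS te
    RIGHT JOIN task_usage_reduced AS tu
      ON te.job_id = tu.job_id AND te.task_id = tu.task_id;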
I am nonetheless looking at the performance of the inner join now.
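Concretely, that would presumably be the same COPY query with only the join type changed (the output file name below is just a placeholder):

    COPY (
      SELECT te.job_id, te.task_id, te.class, te.priority,
             tu.cpu_mean, tu.cpu_sample
      FROM task_events_reduced AS te
      INNER JOIN task_usage_reduced AS tu
        ON te.job_id = tu.job_id AND te.task_id = tu.task_id
    ) INTO '/diskA/join_inner.csv' USING DELIMITERS ',','\n';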
Cheers, Valentin
This results in more than three hours of data crunching on a Google Compute Engine machine (16 processors and 100 GB RAM), during which disk B is increasingly used until it is full (2 TB HDD). Then the aforementioned error happens.
I am not hoping that MonetDB would perform streaming I/O on the right leg of the join. However, the disk usage seems quite high. Is there a way to force MonetDB to do a hash join?
Thanks a lot,
Valentin Reis
University of Warsaw
--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098, sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels  e-mail: Niels.Nes@cwi.nl