
Quoting Niels Nes (2015-04-08 19:16:09)
On Wed, Apr 08, 2015 at 04:01:29PM +0200, Valentin Reis wrote:
Hello all, To the purpose of evaluating MonetDB, I am trying to perform the following:
Load two csv files into two tables, from disk A. Perform a JOIN operation on two fields, using disk B for the database. Pipe the result to a csv file, on disk A.
This results in a very high disk B usage, and subsequent MT_mremap failure, as seen in merovingian.log: 2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed
Does somebody have ideas or explanations? Details follow:
I am loading csv data from hard drive A. The files have size: -rw-r--r-- 1 fre fre 2.6G Apr 5 16:25 task_events_cut.csv -rw-r--r-- 1 fre fre 34G Apr 5 16:04 task_usage_cut.csv
I create a database in a separate hard drive B (2TB):
rm -rf /mnt/diskB/mdb mkdir /mnt/diskB/mdb/ monetdbd create /mnt/diskB/mdb monetdbd start /mnt/diskB/mdb monetdb create google-trace monetdb release google-trace
I then load the data into the database:
mclient ct_trace_events_reduced.sql -d google-trace pv /mnt/diskA/task_events_cut.csv |mclient -d google-trace -s "COPY INTO task_events_reduced FROM STDIN USING DELIMITERS ',','\\n'" - mclient ct_trace_usage_reduced.sql -d google-trace pv /mnt/diskA/task_usage_cut.csv |mclient -d google-trace -s "COPY INTO task_usage_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
using very standard scripts:
$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced ( "job_id" BIGINT , "task_id" BIGINT , "class" SMALLINT , "priority" SMALLINT); DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced ( "job_id" BIGINT , "task_id" BIGINT , "cpu_mean" FLOAT , "cpu_sample" FLOAT);
These two operations take about 50 minutes, which is very reasonable.
I then use mclient to do my join:
mclient join.sql -d google-trace
using the script:
$ cat join.sql COPY (SELECT te.job_id, te.task_id, te.class, te.priority, tu.cpu_mean, tu.cpu_sample FROM (SELECT * FROM task_events_reduced )AS te RIGHT JOIN (SELECT * FROM task_usage_reduced )AS tu ON(te.job_id=tu.job_id AND te.task_id=tu.task_id) ) INTO '/diskA/join.csv' USING DELIMITERS ',','\n'; Why are you using a RIGHT 'OUTER' join? Outer joins are much more expensive then innen joins.
Niels Hello,
denote: A=task_usage_reduced B=task_events_reduced Morally, the goal is to augment rows of A -big table- with some information (class and priority) from B -smaller table-. I am not sure whether B does contain information for all (job_id,task_id) entries of A. In the cases where a row from A wouldn't have an entry in B, I would like that the row be present in the output, only with NA's instead of the information from B. Therefore: RIGHT JOIN. Please do correct me again if I am mistaken, the SQL class is far away :) What I hoped from MonetDB (and what I still hope to get :) ) is the following behavior, denoting n the length of A, m the length of B: -build a hash-table of B indexed by job_id and task_id (somehow O(m)) -augment each row of A by doing a O(1) lookup in the hash table. -> O(n) Therefore, O(m+n). processing time with O(m+n) memory usage (hashtable+result). I am nonetheless looking at the performance of the inner join now. Cheers, Valentin
This results in more than three hours of data crunching on a google compute engine machine (16 processors and 100GB RAM), where disk B is being increasingly used, until it is full (2TB HDD). Then, the aforementionned error happens.
I am not hoping that MonetDB would perform streaming I/O on the right file from the join. However the disk usage seems quite high. Is there a way to force MonetDB to do a hash join?
Thanks a lot, Valentin Reis University of Warsaw _______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list
-- Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI) Science Park 123, 1098 XG Amsterdam, The Netherlands room L3.14, phone ++31 20 592-4098 sip:4098@sip.cwi.nl url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list