Re: Join memory overhead, COPY INTO csv, MT_mremap failure.

8 Apr 2015

      Quoting Niels Nes (2015-04-08 19:16:09)
...
On Wed, Apr 08, 2015 at 04:01:29PM +0200, Valentin Reis wrote:
...
Hello all,
To the purpose of evaluating MonetDB, I am trying to perform the following:
Load two csv files into two tables, from disk A.
 Perform a JOIN operation on two fields, using disk B for the database.
 Pipe the result to a csv file, on disk A.
This results in a very high disk B usage, and subsequent MT_mremap
failure, as seen in merovingian.log:
2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed
Does somebody have ideas or explanations? Details follow:
I am loading csv data from hard drive A. The files have size:
-rw-r--r-- 1 fre fre 2.6G Apr  5 16:25 task_events_cut.csv
-rw-r--r-- 1 fre fre  34G Apr  5 16:04 task_usage_cut.csv
I create a database in a separate hard drive B (2TB):
rm -rf /mnt/diskB/mdb
mkdir /mnt/diskB/mdb/
monetdbd create  /mnt/diskB/mdb
monetdbd start   /mnt/diskB/mdb
monetdb  create  google-trace
monetdb  release google-trace
I then load the data into the database:
mclient ct_trace_events_reduced.sql -d google-trace
pv /mnt/diskA/task_events_cut.csv |mclient -d google-trace -s  "COPY  INTO  task_events_reduced FROM  STDIN  USING  DELIMITERS ',','\\n'" -
mclient ct_trace_usage_reduced.sql -d google-trace
pv /mnt/diskA/task_usage_cut.csv |mclient -d google-trace -s  "COPY  INTO  task_usage_reduced FROM  STDIN  USING  DELIMITERS ',','\\n'" -
using very standard scripts:
$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql
DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced  ( "job_id"     BIGINT
                                  , "task_id"    BIGINT
                                  , "class"    SMALLINT
                                  , "priority" SMALLINT);
DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced  ( "job_id"     BIGINT
                                  , "task_id"    BIGINT
                                  , "cpu_mean"   FLOAT
                                  , "cpu_sample" FLOAT);
These two operations take about 50 minutes, which is very reasonable.
I then use mclient to do my join:
mclient join.sql -d google-trace
using the script:
$ cat join.sql
COPY
(SELECT te.job_id, te.task_id, te.class, te.priority, tu.cpu_mean, tu.cpu_sample FROM
(SELECT *
FROM task_events_reduced
)AS te
RIGHT JOIN
(SELECT *
FROM task_usage_reduced
)AS tu
ON(te.job_id=tu.job_id
AND te.task_id=tu.task_id)
)
INTO '/diskA/join.csv'
USING DELIMITERS ',','\n';
Why are you using a RIGHT 'OUTER' join?
Outer joins are much more expensive then innen joins.
Niels
Hello,
denote:
A=task_usage_reduced
B=task_events_reduced

Morally, the goal is to augment rows of A -big table- with
some information (class and priority) from B -smaller table-.

I am not sure whether B does contain information
for all (job_id,task_id) entries of A.
In the cases where a row from A wouldn't have an entry in
B, I would like that the row be present in the output,
only with NA's instead of the information from B.
Therefore: RIGHT JOIN. Please do correct me again if I am mistaken, the
SQL class is far away :)

What I hoped from MonetDB (and what I still hope to get :) ) is the
following behavior, denoting n the length of A, m the length of B:
-build a hash-table of B indexed by job_id and task_id (somehow O(m))
-augment each row of A by doing a O(1) lookup in the hash table. -> O(n)
Therefore, O(m+n). processing time with O(m+n) memory usage
(hashtable+result).

I am nonetheless looking at the performance of the inner join now.

Cheers,
Valentin
...
...
This results in more than three hours of data crunching on a google
compute engine machine (16 processors and 100GB RAM), where disk B is
being increasingly used, until it is full (2TB HDD). Then, the aforementionned
error happens.
I am not hoping that MonetDB would perform streaming I/O on the right
file from the join. However the disk usage seems quite high. Is there a
way to force MonetDB to do a hash join?
Thanks a lot,
Valentin Reis
University of Warsaw
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list
-- 
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14,  phone ++31 20 592-4098     sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels    e-mail: Niels.Nes@cwi.nl
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list