
Hi,

MonetDB will automatically create the hashes if that is considered worthwhile for the join operations. These hashes are non-persistent (in the stable releases), and it is known that there are corner cases where the hash is built on the wrong join leg. Calling the server with --algorithms will show you some details on what decisions are made. In general, understanding the behavior of a system should be a multistep approach, from tiny, small, ... up to huge tables. Then the SQL TRACE command will show you where time is actually spent. This is particularly relevant when queries become more complex.
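For example, on scaled-down copies of your two tables you could run something like the following from mclient; the query and the LIMIT are only meant as a small illustration of what a TRACE run could look like, not a prescription:

    TRACE
    SELECT te.job_id, te.task_id, te.class, te.priority,
           tu.cpu_mean, tu.cpu_sample
    FROM task_events_reduced AS te
    JOIN task_usage_reduced AS tu
      ON te.job_id = tu.job_id AND te.task_id = tu.task_id
    LIMIT 1000;

regards,
Martin

On 08/04/15 21:05, Valentin Reis wrote: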
Quoting Niels Nes (2015-04-08 19:16:09)
On Wed, Apr 08, 2015 at 04:01:29PM +0200, Valentin Reis wrote:
Hello all,

For the purpose of evaluating MonetDB, I am trying to perform the following:
- Load two CSV files into two tables, from disk A.
- Perform a JOIN operation on two fields, using disk B for the database.
- Pipe the result to a CSV file, on disk A.
This results in very high usage of disk B and a subsequent MT_mremap failure, as seen in merovingian.log:

2015-04-08 12:50:14 ERR google-trace[2737]: = gdk_posix.c:428: MT_mremap(./bat/06/675.tail,7e8738e90000,91992817664,110391328768): GDKextendf() failed
Does somebody have ideas or explanations? Details follow:
I am loading CSV data from hard drive A. The files have size:

-rw-r--r-- 1 fre fre 2.6G Apr  5 16:25 task_events_cut.csv
-rw-r--r-- 1 fre fre  34G Apr  5 16:04 task_usage_cut.csv
I create a database on a separate hard drive B (2 TB):
rm -rf /mnt/diskB/mdb
mkdir /mnt/diskB/mdb/
monetdbd create /mnt/diskB/mdb
monetdbd start /mnt/diskB/mdb
monetdb create google-trace
monetdb release google-trace
I then load the data into the database:
mclient ct_trace_events_reduced.sql -d google-trace
pv /mnt/diskA/task_events_cut.csv | mclient -d google-trace -s "COPY INTO task_events_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
mclient ct_trace_usage_reduced.sql -d google-trace
pv /mnt/diskA/task_usage_cut.csv | mclient -d google-trace -s "COPY INTO task_usage_reduced FROM STDIN USING DELIMITERS ',','\\n'" -
using very standard scripts:
$ cat ct_trace_events_reduced.sql ct_trace_usage_reduced.sql
DROP TABLE task_events_reduced;
CREATE TABLE task_events_reduced (
  "job_id"   BIGINT,
  "task_id"  BIGINT,
  "class"    SMALLINT,
  "priority" SMALLINT
);
DROP TABLE task_usage_reduced;
CREATE TABLE task_usage_reduced (
  "job_id"     BIGINT,
  "task_id"    BIGINT,
  "cpu_mean"   FLOAT,
  "cpu_sample" FLOAT
);
These two operations take about 50 minutes, which is very reasonable.
I then use mclient to do my join:
mclient join.sql -d google-trace
using the script:
$ cat join.sql
COPY (
  SELECT te.job_id, te.task_id, te.class, te.priority,
         tu.cpu_mean, tu.cpu_sample
  FROM (SELECT * FROM task_events_reduced) AS te
  RIGHT JOIN (SELECT * FROM task_usage_reduced) AS tu
    ON (te.job_id = tu.job_id AND te.task_id = tu.task_id)
) INTO '/diskA/join.csv' USING DELIMITERS ',','\n';

Why are you using a RIGHT 'OUTER' join? Outer joins are much more expensive than inner joins.

Niels

Hello,
Denote: A = task_usage_reduced, B = task_events_reduced.

Essentially, the goal is to augment the rows of A (the big table) with some information (class and priority) from B (the smaller table).

I am not sure whether B contains information for all (job_id, task_id) entries of A. In the cases where a row from A has no matching entry in B, I would like that row to still be present in the output, only with NAs instead of the information from B. Hence the RIGHT JOIN. Please do correct me again if I am mistaken; the SQL class is far away :)
What I hoped for from MonetDB (and what I still hope to get :) ) is the following behavior, denoting by n the length of A and by m the length of B:
- build a hash table of B indexed by (job_id, task_id), somehow O(m);
- augment each row of A by doing an O(1) lookup in the hash table, i.e. O(n).
Therefore, O(m+n) processing time with O(m+n) memory usage (hash table + result).
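I suppose one way to see whether this is what the optimizer intends would be to look at the plan, with something along these lines (I am only assuming here that the MAL plan printed by EXPLAIN makes it visible on which join leg the hash would be built):

    EXPLAIN
    SELECT te.job_id, te.task_id, te.class, te.priority,
           tu.cpu_mean, tu.cpu_sample
    FROM task_events_reduced AS te
    RIGHT JOIN task_usage_reduced AS tu
      ON te.job_id = tu.job_id AND te.task_id = tu.task_id;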
I am nonetheless looking at the performance of the inner join now.
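Concretely, that would presumably be the same COPY query with only the join type changed (the output file name below is just a placeholder):

    COPY (
      SELECT te.job_id, te.task_id, te.class, te.priority,
             tu.cpu_mean, tu.cpu_sample
      FROM task_events_reduced AS te
      INNER JOIN task_usage_reduced AS tu
        ON te.job_id = tu.job_id AND te.task_id = tu.task_id
    ) INTO '/diskA/join_inner.csv' USING DELIMITERS ',','\n';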
Cheers, Valentin
This results in more than three hours of data crunching on a Google Compute Engine machine (16 processors and 100 GB RAM), during which disk B is increasingly used until it is full (2 TB HDD). Then the aforementioned error happens.
I am not hoping that MonetDB would perform streaming I/O on the right leg of the join. However, the disk usage seems quite high. Is there a way to force MonetDB to do a hash join?
Thanks a lot,
Valentin Reis
University of Warsaw
--
Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room L3.14, phone ++31 20 592-4098, sip:4098@sip.cwi.nl
url: https://www.cwi.nl/people/niels  e-mail: Niels.Nes@cwi.nl