
For bulk loading the RDF data in triple form (each line of the data file is simply three integers such as 1:2:3, where 1 is the ID to which the subject value is mapped, 2 the ID to which the predicate string is mapped, and 3 the ID to which the object is mapped), I use commands and SQL statements like --

$ monetdbd create /path/to/dbfarm/rdf
$ monetdbd start /path/to/dbfarm/rdf
$ monetdb create lubm
$ monetdb release lubm
$ mclient -d lubm -lsql < /path/to/bulk_load_file

bulk_load_file contains --
--------------------------
create table lubm100u(sub int, pred int, obj int);
copy into lubm100u from '/path/to/datafile' using delimiters ':','\n';

create table lubm100u_p1(sub int, obj int);
create table lubm100u_p2(sub int, obj int);

insert into lubm100u_p1 select sub, obj from lubm100u where pred=1 order by sub, obj;
insert into lubm100u_p2 select sub, obj from lubm100u where pred=2 order by sub, obj;

-- similar create/insert statements follow for the remaining predicate partitions (p3, p4, ...)
-------------------

Then I run the bulk queries like --

for (( i=1; i <= 6000; i++ )); do
    echo $i
    # run the i-th query and keep only the count row from mclient's tabular output
    sed -n "${i}p" /path/to/bulk_q_file | mclient -d lubm -lsql | grep -P "^\|\s+\d+\s+\|$" >> results_file
    if (( i % 1000 == 0 )); then
        echo "Shutting down MonetDB"
        monetdb stop lubm
        monetdbd stop /path/to/dbfarm/rdf
        sleep 10
        echo "Starting MonetDB"
        monetdbd start /path/to/dbfarm/rdf
        monetdb start lubm
        sleep 2
    fi
done

Each query is a single join over a pair of vertically partitioned tables --

select count(*) from lubm100u_p1 as t1, lubm100u_p1 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p5 as t2 where t1.obj=t2.sub;
... etc.

and

select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.sub=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.sub=t2.sub;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.sub=t2.sub;
... etc.

and

select count(*) from lubm100u_p1 as t1, lubm100u_p2 as t2 where t1.obj=t2.obj;
select count(*) from lubm100u_p1 as t1, lubm100u_p3 as t2 where t1.obj=t2.obj;
select count(*) from lubm100u_p1 as t1, lubm100u_p4 as t2 where t1.obj=t2.obj;
... etc.

I noticed that for very large datasets (~800 million rows in the original sub/pred/obj table), these simple merge joins on the vertically partitioned tables kept crashing on some long-running queries, and at the time of a crash the disk sometimes filled up because the data size had grown to several times the original. I first thought the crash and the subsequent backup/recovery might explain the growth, but when I ran the queries on a really tiny RDF dataset (~13 million rows), the server did not crash and yet the size of the dbfarm still grew as much as 25 times!
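To pin down which tables or heaps account for this growth, I could query the per-column storage footprint once after loading and once after the bulk run. This is just a sketch, assuming the storage() table-producing function is available in 11.13.3; the exact result column names may differ across versions --

select "table", "column", "count", columnsize, heapsize
from storage()
where "table" like 'lubm100u%'
order by columnsize desc;

(run through mclient -d lubm -lsql like the queries above)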
On Wed, Nov 14, 2012 at 8:43 AM, Lefteris <lsidir@gmail.com> wrote:

Hi Medha,
Can you tell us a bit more about how you loaded the data, which SQL statements you used, and how you did the vertical partitioning?
What you are describing should not be happening.
thank you
On Wed, Nov 14, 2012 at 2:25 PM, Medha Atre <medha.atre@gmail.com> wrote:
Just wondering if anyone could shed some light on this issue?
---------- Forwarded message ----------
From: Medha Atre <medha.atre@gmail.com>
Date: Mon, Nov 12, 2012 at 5:11 PM
Subject: growing size of database
To: Communication channel for MonetDB users <users-list@monetdb.org>
Hi,
I am running the latest MonetDB, 11.13.3. I loaded a pretty tiny RDF dataset of 13 million triples and created some "vertical partitions" on the predicate values. After loading, I checked the size of the database folder: it was _258 MB_. Then I ran 6000+ bulk queries, periodically shutting down the server after about every 1000 queries. The queries finished successfully and there was no crash. After the final query I stopped the database with "monetdb stop dbname" and shut down the main server too with "monetdbd stop /path/to/dbfarm".
After that I checked the size of the database folder again: it had grown to _5.8 GB_, with the "HC" folder under "bat" alone taking about 3 GB and some other files between 300 and 700 MB.
It seems mysterious to me why the database should keep growing to as much as 25 times its original size.
I intend to load much larger data (> 1 billion triples), and given my limited disk space I want to avoid this kind of growth.
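For now, to see when the growth happens, I am thinking of simply recording the size of the database directory after each batch of ~1000 queries, so the jump can be correlated with a particular batch. A sketch, assuming the usual dbfarm layout where each database gets a subdirectory named after it --

$ du -sh /path/to/dbfarm/dbname >> dbsize_log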
Can someone please enlighten me?
Medha

_______________________________________________
users-list mailing list
users-list@monetdb.org
http://mail.monetdb.org/mailman/listinfo/users-list