Re: [MonetDB-users] 64bit MonetDB, JDBC Insert via RJDBC, >300 million rows

Now have tried the same bulk load with COPY on the latest (v5.10.0) with the single-thread setting. Failed again!
System: OS X, MonetDB compiled in 64-bit
Number of files to load: 322
Total size of files: 15GB
Max rows in a file: 3 million
On Wed, Mar 18, 2009 at 11:51 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
I'm not sure how "The (parallel) load used scratch area as well" is related to the question.
If you look at the code, you will notice that there is a two-phase loading process involving (possibly) multiple threads.
Sorry if I'm a bit slow.
On Wed, Mar 18, 2009 at 11:25 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Sorry if I wasn't clear on the first question:
(1) we ramp up N for the first insert to claim sufficient space. Sure, understand that one.
But:
The claimed space got "given back" *right after* the first insert. (this is the part I don't understand.)
The (parallel) load used scratch area as well
Question: how do the second, third, ... inserts get the "benefit" of the ramp-up that we did for the first insert?
Does this make it a bit clearer what my question pertains to?
Thanks.
On Wed, Mar 18, 2009 at 10:26 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Three questions that bother me are: (1) Why do we need to ramp up N to the total of all lines in the first insert?
to let the kernel claim sufficient space
The reason I ask is that right after the first insert, the allocation drops right down from, say, 100GB to 35GB, and stays roughly there for *all* subsequent inserts. I totally do not understand this. (2) In your opinion, based on this experience, what could be the potential problem here?
Little to none, as the files are memory mapped, which may only cause I/O on some systems.
(3) in your opinion, would the newer version cure the problem?
A system can never correctly guess what will come, especially since the source of a COPY command need not be a file but can be standard input, i.e. a stream.
Thanks.
On Tue, Mar 17, 2009 at 10:51 PM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Martin,
It almost worked...
This is what I did and what happened:
I have 322 files to insert into the database, totaling 650 million rows.
I divided the file list into two; then, for each sub-list:
(a) I insert the first file in the list with N set to 650 million rows; (b) all subsequent files have N set to the number of lines in *that* file.
Once the first list is done, then
(c) I insert the first file in the second list with N set to 650 million rows; (d) all subsequent files have N set to the number of lines in *that* file.
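The per-file N strategy above can be scripted. A minimal Python sketch (file names, table name, and counts are hypothetical; it assumes the `COPY n RECORDS INTO ... FROM ...` form discussed in this thread):

```python
def copy_statements(files, table, total_rows, line_counts):
    """Emit one COPY statement per file: the first carries the total
    expected row count so the server can pre-allocate; the rest carry
    only that file's own line count."""
    stmts = []
    for i, f in enumerate(files):
        n = total_rows if i == 0 else line_counts[f]
        stmts.append(
            f"COPY {n} RECORDS INTO {table} FROM '{f}' "
            "USING DELIMITERS ',', '\\n';"
        )
    return stmts

# Hypothetical file list and per-file line counts:
files = ["part_001.csv", "part_002.csv"]
counts = {"part_001.csv": 3_000_000, "part_002.csv": 2_500_000}
for s in copy_statements(files, "ticks", 650_000_000, counts):
    print(s)
```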
Then the same problem happened: it got stuck at file number 316.
OK. Using the 650M enables MonetDB to allocate enough space, so it does not have to fall back on guessing. Guessing is painful, because when a file of N records has been created and it needs more, it makes a file of size 1.3xN. This leads to memory fragmentation.
In your case I would have been a little more spacious and used 700M as a start, because a miscalculation of 1 gives a lot of pain. Such advice is only needed in (a).
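A quick sketch of why guessing is painful (hypothetical Python, only illustrating the 1.3x regrowth rule Martin describes): growing from a 3-million-record guess to 650 million records takes about 21 regrows, each a chance to fragment, while a correct initial N needs none.

```python
def reallocations(initial, target, growth=1.3):
    """Count how many times a heap sized for `initial` records must be
    regrown by `growth` before it can hold `target` records."""
    size, steps = initial, 0
    while size < target:
        size = int(size * growth) + 1
        steps += 1
    return steps

print(reallocations(3_000_000, 650_000_000))    # 21 regrows when guessing low
print(reallocations(650_000_000, 650_000_000))  # 0 with a correct initial N
```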
Note: this is further than previous tries, which all stopped in the region of file number 280 +/- a few.
My observation: (i) at (a), the VSIZE went up to around 46GB; then, after the first insert, it drops to around 36GB.
OK, that fits.
(ii) at (c), the VSIZE went up to around 130GB; then, after the first insert, it drops to around 45GB.
You tell the system to extend the existing BATs to prepare for another 650M rows, which means it allocates 2*36GB, plus room for the old one, giving 108GB. Then during processing some temporary BATs may be needed, e.g. to check integrity constraints after each file. Then it runs out of swap space.
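The 108GB figure follows from the sizes in this exchange (rough arithmetic; the 36GB is the VSIZE observed after the first list):

```python
existing_gb = 36                       # BATs already loaded after the first list
extension_gb = 2 * existing_gb         # extending for another ~650M rows claims ~2x
total_gb = extension_gb + existing_gb  # plus room for the old copy
print(total_gb)  # 108
```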
(iii) the "Free Memory", as reported by Activity Monitor, just before it failed at file number 316, dipped to as low as 7.5MB!
Yes, you are running out of swap space on your system. This should not have happened, because the system uses mmapped files; it may be an issue with MacOS, or relate to a problem we fixed recently.
My question: (1) Why do we need to ramp N up to the total number of lines (it takes a long time to do that), only to have it drop down to 30GB-40GB right after the first insert and stay roughly there? Does it mean we're giving all the pre-allocated space back to the OS? Then should we always set N to the total number of lines? If so, it would take much, much longer to process all the files...
This might indicate that on MacOS, just like Windows, mmapped files need to be written to disk. With a disk bandwidth of 50MB/sec it still takes several minutes.
(2) How come RSIZE never goes above 4GB? (3) Does the sql log file size have some limit that we need to tweak?
no limit
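Martin's "several minutes" estimate for flushing the mmapped files checks out with quick arithmetic (assumed figures: ~36GB of dirty data, and the 50MB/sec bandwidth he quotes):

```python
dirty_gb = 36        # approximate size of the mmapped BATs to flush
disk_mb_per_s = 50   # disk bandwidth quoted above

seconds = dirty_gb * 1024 / disk_mb_per_s
print(f"{seconds / 60:.1f} minutes")  # prints "12.3 minutes"
```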
(4) Has anyone successfully implemented the 64-bit version of MonetDB and successfully inserted more than 1 billion rows?
Your platform may be the first, but Amherst has worked with Macs for years.
(5) When you say "...The VSIZE of 44G is not too problematic, i am looking at queries letting it tumble between 20-80 GB...", what does that mean? Mine went up to as high as 135GB...
explained above.
regards, Martin
Thanks, as always.

Managed to COPY 161 files into one table and another 162 files into another table. But when I do a simple select count(*) from table2; it fails (nothing, it just hangs)...
Has anyone successfully run this database on the Mac, in 64-bit, with a table (not database) size larger than 500 million rows (10 columns)?

We ran gdb after the database "hangs" during the COPY ... INTO ... phase (note the server was set to run in SINGLE-thread mode, i.e. set gdk_nr_threads = 1). All of the threads are suspect, in the sense that they point either to a deadlock situation or to 'mmap'-related issues, both of which are quite difficult to debug.

gdb's "info threads" gives:

      6 process 41456 thread 0x2603  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
      5 process 41456 thread 0x1403  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
      4 process 41456 thread 0x1203  0x00007fff84745886 in msync ()
      3 process 41456 thread 0x1103  0x00007fff84712526 in select$DARWIN_EXTSN ()
      2 process 41456 thread 0xf03   0x00007fff846cdd02 in __semwait_signal ()
    * 1 process 41456 thread 0x10b   0x00007fff846cdd02 in __semwait_signal ()

The backtrace for each thread is:

    (gdb) thread apply all bt

    Thread 6 (process 41456 thread 0x2603):
    #0  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
    #1  0x00007fff846ce698 in pthread_mutex_lock ()
    #2  0x0000000109dc00fe in store_lock ()
    #3  0x0000000109d8375c in mvc_create ()
    #4  0x0000000109d3e8fa in SQLinitClient (c=0x100115db0) at sql_scenario.c:315
    #5  0x000000010005c2c9 in runPhase (c=0x100115db0, phase=5) at mal_scenario.c:352
    #6  0x000000010005c332 in runScenarioBody (c=0x100115db0) at mal_scenario.c:375
    #7  0x000000010005c5ea in runScenario (c=0x100115db0) at mal_scenario.c:420
    #8  0x0000000100017d35 in MSserveClient (dummy=0x100115db0) at mal_session.c:360
    #9  0x00007fff846f4dcb in _pthread_start ()
    #10 0x00007fff846f4c8d in thread_start ()

    Thread 5 (process 41456 thread 0x1403):
    #0  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
    #1  0x00007fff846ce698 in pthread_mutex_lock ()
    #2  0x0000000109dc00fe in store_lock ()
    #3  0x0000000109d82ad1 in mvc_trans ()
    #4  0x0000000109d3ebe1 in SQLcacheRefresh (m=0x101872208) at sql_scenario.c:386
    #5  0x0000000109d3fb00 in SQLparser (c=0x100115b18) at sql_scenario.c:788
    #6  0x000000010005c2c9 in runPhase (c=0x100115b18, phase=1) at mal_scenario.c:352
    #7  0x000000010005c3c0 in runScenarioBody (c=0x100115b18) at mal_scenario.c:386
    #8  0x000000010005c5ea in runScenario (c=0x100115b18) at mal_scenario.c:420
    #9  0x0000000100017d35 in MSserveClient (dummy=0x100115b18) at mal_session.c:360
    #10 0x00007fff846f4dcb in _pthread_start ()
    #11 0x00007fff846f4c8d in thread_start ()

    Thread 4 (process 41456 thread 0x1203):
    #0  0x00007fff84745886 in msync ()
    #1  0x0000000100c38190 in MT_msync (p=0x9ac507000, off=0, len=4391849608, mode=16384) at gdk_posix.c:800
    #2  0x0000000100b0ea7c in GDKsave (nme=0x101331d78 "11/1150", ext=0x100d64104 "tail", buf=0x9ac507000, size=4391849608, mode=1) at gdk_storage.c:211
    #3  0x000000010089fcc6 in HEAPsave (h=0x10a8ffdb0, nme=0x101331d78 "11/1150", ext=0x100d64104 "tail") at gdk_heap.c:296
    #4  0x0000000100b10dfd in BATsave (bd=0x10a02f808) at gdk_storage.c:470
    #5  0x000000010089d348 in BBPsync (cnt=250, subcommit=0x10a8fff20) at gdk_bbp.c:2905
    #6  0x0000000100886ce7 in TMsubcommit (b=0xc0e057458) at gdk_tm.c:115
    #7  0x0000000100c3bc28 in bm_subcommit (list=0x10130ee78, catalog=0x10130ee78, extra=0xc0e07bf38, debug=0) at gdk_logger.c:838
    #8  0x0000000100c3eded in bm_commit (lg=0x109f74f08) at gdk_logger.c:1559
    #9  0x0000000100c3afc3 in logger_commit (lg=0x109f74f08) at gdk_logger.c:776
    #10 0x0000000100c3ca0e in logger_exit (lg=0x109f74f08) at gdk_logger.c:1053
    #11 0x0000000100c3cb1d in logger_restart (lg=0x109f74f08) at gdk_logger.c:1080
    #12 0x0000000109dca6fd in bl_restart ()
    #13 0x0000000109dc0013 in store_manager ()
    #14 0x0000000109d82a37 in mvc_logmanager ()
    #15 0x00007fff846f4dcb in _pthread_start ()
    #16 0x00007fff846f4c8d in thread_start ()

    Thread 3 (process 41456 thread 0x1103):
    #0  0x00007fff84712526 in select$DARWIN_EXTSN ()
    #1  0x00000001087f1e35 in SERVERlistenThread (Sock=0x1094ba0f8) at mal_mapi.c:152
    #2  0x00007fff846f4dcb in _pthread_start ()
    #3  0x00007fff846f4c8d in thread_start ()

    Thread 2 (process 41456 thread 0xf03):
    #0  0x00007fff846cdd02 in __semwait_signal ()
    #1  0x00007fff84735f27 in nanosleep ()
    #2  0x0000000100c3882f in MT_sleep_ms (ms=5000) at gdk_posix.c:1774
    #3  0x000000010096c45b in GDKvmtrim (limit=0x100d74868) at gdk_utils.c:1350
    #4  0x00007fff846f4dcb in _pthread_start ()
    #5  0x00007fff846f4c8d in thread_start ()

    Thread 1 (process 41456 thread 0x10b):
    #0  0x00007fff846cdd02 in __semwait_signal ()
    #1  0x00007fff84735f27 in nanosleep ()
    #2  0x0000000100c3882f in MT_sleep_ms (ms=5000) at gdk_posix.c:1774
    #3  0x0000000100002dd7 in main (argc=7, av=0x7fff5fbfee80) at mserver5.c:514

On Wed, Mar 25, 2009 at 05:02:42PM -0700, Yue Sheng wrote:
We ran gdb after the database "hangs" during the COPY ... INTO ... phase (note the server was set to run in SINGLE-thread mode, i.e. set gdk_nr_threads = 1): [backtraces quoted in full above]
Threads 5 and 6 seem to both wait on the same lock. It looks like both threads are separate clients. Is this possible, i.e. are you running multiple mclients or so?
niels
On Thu, Mar 19, 2009 at 6:01 PM, Yue Sheng <[1]yuesheng8@gmail.com> wrote:
Managed to COPY 161 files to one table and another 162 files to another table. But when I do a simple select count(*) from table2; it failed (nothing, just hangs)....
Has anyone successfully run this database on the MAC, in 64bit, with table (not database) size larger than 500million rows (10 columns)?
On Thu, Mar 19, 2009 at 8:59 AM, Yue Sheng <[2]yuesheng8@gmail.com> wrote:
Now have tried the same bulk load with COPY on the latest (v5.10.0) with single thread setting.
Failed again!
System:
OS X
MonetDB complied in 64bit
number of files to load: 322
total size of files: 15GB
Max row in file: 3million
On Wed, Mar 18, 2009 at 11:51 AM, Martin Kersten <[3]Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
I'm not sure how "The (parallel) load used scratch area as well" is related to the question.
If you look at the code, you will notice that there is a two phase loading process involving (possibly) multiple threads
Sorry if I'm a bit slow.
On Wed, Mar 18, 2009 at 11:25 AM, Martin Kersten <[4]Martin.Kersten@cwi.nl <mailto:[5]Martin.Kersten@cwi.nl>> wrote: Yue Sheng wrote: Sorry, if I wasn't clear on the first question: (1) we ramp up N for the first insert to claim sufficient space. Sure, understand that one. But: The claimed space got "given back" *right after* the first insert. (this is the part I don't understand.) The (parallel) load used scratch area as well Question: how does the second, third, .... inserts get the "benefit" of the ramp up that we did for the first insert? Is this a bit clearer what my question pertains? Thanks. On Wed, Mar 18, 2009 at 10:26 AM, Martin Kersten <[6]Martin.Kersten@cwi.nl <mailto:[7]Martin.Kersten@cwi.nl>
<Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Three questions that bother me are:
(1) Why do we need to ramp up N to the total of all lines in the first insert?
to let the kernel claim sufficient space
The reason I ask is that right after the first insert, the allocation drops right down from, say, 100GB to 35GB, and stays roughly there for *all* subsequent inserts. I totally do not understand this.
(2) In your opinion, based on this experience, what could be the potential problem here?
little to none, as the files are memory mapped, which may only cause I/O on some systems
(3) In your opinion, would the newer version cure the problem?
a system can never correctly guess what will come, especially since the source of a COPY command need not be a file but standard input, i.e. a stream.
Thanks.
On Tue, Mar 17, 2009 at 10:51 PM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Martin,
It almost worked... This is what I did and what happened:
I have 322 files to insert into the database, totaling 650 million rows.
I divided the file list into two; then for each sub-list:
(a) I insert the first file in the list with N set to 650 million rows,
(b) all subsequent files have N set to the number of lines in *that* file.
Once the first list is done, then:
(c) I insert the first file in the second list with N set to 650 million rows,
(d) all subsequent files have N set to the number of lines in *that* file.
Then the same problem happened: it got stuck at file number 316.
ok. Using the 650M enables MonetDB to allocate enough space, so it does not have to fall back on guessing. Guessing is painful, because when a file of N records has been created and more is needed, it makes a file of size 1.3xN.
This leads to memory fragmentation. In your case I would have been a little more spacious and used 700M as a start, because a miscalculation of one gives a lot of pain. Such advice is only needed in (a).
Note: This is farther than previous tries, which all stopped in the region of file number 280 +/- a few.
My observations:
(i) at (a), the VSIZE went up to around 46GB, then after the first insert it dropped to around 36GB
ok, fits
(ii) at (c), the VSIZE went up to around 130GB, then after the first insert it dropped to around 45GB
you tell the system to extend the existing BATs to prepare for another 650M, which means it allocates 2*36GB; plus room for the old one, that gives 108GB. Then during processing some temporary BATs may be needed, e.g. to check integrity constraints after each file. Then it runs out of swap space.
(iii) the "Free Memory", as reported by Activity Monitor, just before it failed at file number 316, dipped as low as 7.5MB!
yes, you are running out of swap space on your system. This should not have happened, because the system uses mmapped files; it may be an issue with MacOS, or relate to a problem we fixed recently.
My questions:
(1) Why do we need to ramp N up to the total number of lines (it takes a long time to do that), only to have it drop down to 30GB-40GB right after the first insert and stay roughly there? Does it mean we're giving all the pre-allocated space back to the OS? Then should we always set N to the total number of lines? If so, it would take much, much longer to process all the files...
this might indicate that on MacOS, just like on Windows, mmapped files need to be written to disk. With a disk bandwidth of 50MB/sec that still takes several minutes.
(2) How come RSIZE never goes above 4GB?
(3) Does the sql log file size have some limit that we need to tweak?
no limit
(4) Has anyone successfully implemented the 64bit version of MonetDB and successfully inserted more than 1 billion rows?
your platform may be the first, but Amherst has worked with Macs for years
(5) When you say "...The VSIZE of 44G is not too problematic, i am looking at queries letting it tumble between 20-80 GB...", what does it mean? Mine went up as high as 135GB...
explained above.
regards, Martin
Thanks, as always.
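Martin's point about guessing can be illustrated with a small sketch. This is plain Python, not MonetDB code; the 1.3 growth factor is taken from his description above, and the 1M-row starting size is just an assumed initial guess:

```python
def remaps_needed(initial_rows, total_rows, factor=1.3):
    """Count how many times a heap sized by guessing must be
    regrown (remapped, with the old contents carried over)
    before it fits total_rows."""
    size, remaps = initial_rows, 0
    while size < total_rows:
        size = int(size * factor)
        remaps += 1
    return remaps

# Guessing from an assumed 1M-row initial heap up to this
# thread's 650M rows takes roughly 25 regrows, each one a
# chance to fragment the address space:
print(remaps_needed(1_000_000, 650_000_000))

# Telling COPY the count up front ("COPY 650000000 RECORDS ...")
# lets the kernel claim sufficient space once:
print(remaps_needed(650_000_000, 650_000_000))  # 0
```

Each regrow leaves a hole the size of the old mapping behind, which is the memory fragmentation Martin mentions.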
------------------------------------------------------------------------------
_______________________________________________ MonetDB-users mailing list MonetDB-users@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/monetdb-users
--
Niels Nes, Centre for Mathematics and Computer Science (CWI)
Science Park 123, 1098 XG Amsterdam, The Netherlands
room C0.02, phone ++31 20 592-4098, fax ++31 20 592-4312
url: http://www.cwi.nl/~niels  e-mail: Niels.Nes@cwi.nl

On Thu, Mar 26, 2009 at 07:49:49AM +0100, Niels Nes wrote:
On Wed, Mar 25, 2009 at 05:02:42PM -0700, Yue Sheng wrote:
We ran gdb after the database "hangs" during the COPY ... INTO phase (note the server was set to run in SINGLE-thread mode, i.e. set gdk_nr_threads = 1):
All of the threads are suspect, in the sense that they point either to a deadlock situation or to 'mmap'-related issues, both of which are quite difficult to debug.
gdb's "info threads" gives:
6 process 41456 thread 0x2603 0x00007fff846c6d7e in semaphore_wait_signal_trap ()
5 process 41456 thread 0x1403 0x00007fff846c6d7e in semaphore_wait_signal_trap ()
4 process 41456 thread 0x1203 0x00007fff84745886 in msync ()
3 process 41456 thread 0x1103 0x00007fff84712526 in select$DARWIN_EXTSN ()
2 process 41456 thread 0xf03 0x00007fff846cdd02 in __semwait_signal ()
* 1 process 41456 thread 0x10b 0x00007fff846cdd02 in __semwait_signal ()
The backtrace for each thread is:
(gdb) thread apply all bt
Thread 6 (process 41456 thread 0x2603):
#0  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
#1  0x00007fff846ce698 in pthread_mutex_lock ()
#2  0x0000000109dc00fe in store_lock ()
#3  0x0000000109d8375c in mvc_create ()
#4  0x0000000109d3e8fa in SQLinitClient (c=0x100115db0) at sql_scenario.c:315
#5  0x000000010005c2c9 in runPhase (c=0x100115db0, phase=5) at mal_scenario.c:352
#6  0x000000010005c332 in runScenarioBody (c=0x100115db0) at mal_scenario.c:375
#7  0x000000010005c5ea in runScenario (c=0x100115db0) at mal_scenario.c:420
#8  0x0000000100017d35 in MSserveClient (dummy=0x100115db0) at mal_session.c:360
#9  0x00007fff846f4dcb in _pthread_start ()
#10 0x00007fff846f4c8d in thread_start ()
Thread 5 (process 41456 thread 0x1403):
#0  0x00007fff846c6d7e in semaphore_wait_signal_trap ()
#1  0x00007fff846ce698 in pthread_mutex_lock ()
#2  0x0000000109dc00fe in store_lock ()
#3  0x0000000109d82ad1 in mvc_trans ()
#4  0x0000000109d3ebe1 in SQLcacheRefresh (m=0x101872208) at sql_scenario.c:386
#5  0x0000000109d3fb00 in SQLparser (c=0x100115b18) at sql_scenario.c:788
#6  0x000000010005c2c9 in runPhase (c=0x100115b18, phase=1) at mal_scenario.c:352
#7  0x000000010005c3c0 in runScenarioBody (c=0x100115b18) at mal_scenario.c:386
#8  0x000000010005c5ea in runScenario (c=0x100115b18) at mal_scenario.c:420
#9  0x0000000100017d35 in MSserveClient (dummy=0x100115b18) at mal_session.c:360
#10 0x00007fff846f4dcb in _pthread_start ()
#11 0x00007fff846f4c8d in thread_start ()
Threads 5 and 6 both seem to wait on the same lock. It looks like the two threads are separate clients; is this possible, i.e. are you running multiple mclients or so?
niels
Thread 4 (process 41456 thread 0x1203):
#0  0x00007fff84745886 in msync ()
#1  0x0000000100c38190 in MT_msync (p=0x9ac507000, off=0, len=4391849608, mode=16384) at gdk_posix.c:800
#2  0x0000000100b0ea7c in GDKsave (nme=0x101331d78 "11/1150", ext=0x100d64104 "tail", buf=0x9ac507000, size=4391849608, mode=1) at gdk_storage.c:211
#3  0x000000010089fcc6 in HEAPsave (h=0x10a8ffdb0, nme=0x101331d78 "11/1150", ext=0x100d64104 "tail") at gdk_heap.c:296
#4  0x0000000100b10dfd in BATsave (bd=0x10a02f808) at gdk_storage.c:470
#5  0x000000010089d348 in BBPsync (cnt=250, subcommit=0x10a8fff20) at gdk_bbp.c:2905
#6  0x0000000100886ce7 in TMsubcommit (b=0xc0e057458) at gdk_tm.c:115
#7  0x0000000100c3bc28 in bm_subcommit (list=0x10130ee78, catalog=0x10130ee78, extra=0xc0e07bf38, debug=0) at gdk_logger.c:838
#8  0x0000000100c3eded in bm_commit (lg=0x109f74f08) at gdk_logger.c:1559
#9  0x0000000100c3afc3 in logger_commit (lg=0x109f74f08) at gdk_logger.c:776
#10 0x0000000100c3ca0e in logger_exit (lg=0x109f74f08) at gdk_logger.c:1053
#11 0x0000000100c3cb1d in logger_restart (lg=0x109f74f08) at gdk_logger.c:1080
#12 0x0000000109dca6fd in bl_restart ()
#13 0x0000000109dc0013 in store_manager ()
#14 0x0000000109d82a37 in mvc_logmanager ()
#15 0x00007fff846f4dcb in _pthread_start ()
#16 0x00007fff846f4c8d in thread_start ()
Thread 4 currently holds the lock; it is busy committing data to disk. This process could take quite some time (depending on the I/O system). It will indeed keep your clients hanging until it is done.
Niels
Thread 3 (process 41456 thread 0x1103):
#0  0x00007fff84712526 in select$DARWIN_EXTSN ()
#1  0x00000001087f1e35 in SERVERlistenThread (Sock=0x1094ba0f8) at mal_mapi.c:152
#2  0x00007fff846f4dcb in _pthread_start ()
#3  0x00007fff846f4c8d in thread_start ()
Thread 2 (process 41456 thread 0xf03):
#0  0x00007fff846cdd02 in __semwait_signal ()
#1  0x00007fff84735f27 in nanosleep ()
#2  0x0000000100c3882f in MT_sleep_ms (ms=5000) at gdk_posix.c:1774
#3  0x000000010096c45b in GDKvmtrim (limit=0x100d74868) at gdk_utils.c:1350
#4  0x00007fff846f4dcb in _pthread_start ()
#5  0x00007fff846f4c8d in thread_start ()
Thread 1 (process 41456 thread 0x10b):
#0  0x00007fff846cdd02 in __semwait_signal ()
#1  0x00007fff84735f27 in nanosleep ()
#2  0x0000000100c3882f in MT_sleep_ms (ms=5000) at gdk_posix.c:1774
#3  0x0000000100002dd7 in main (argc=7, av=0x7fff5fbfee80) at mserver5.c:514
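The diagnosis above, a thread stuck in msync() on a ~4.4GB tail heap while holding the store lock, can be reproduced in miniature. The sketch below is plain Python, not MonetDB code, with the size scaled down; Python's mmap.flush() wraps the same msync() call seen in frame #0 of thread 4 and blocks until the dirty pages reach disk:

```python
import mmap
import os
import tempfile
import time

SIZE = 16 * 1024 * 1024  # 16 MB stand-in for the 4,391,849,608-byte heap in the trace

fd, path = tempfile.mkstemp()
os.ftruncate(fd, SIZE)        # size the backing file, as GDKsave would
mm = mmap.mmap(fd, SIZE)      # memory-map it read/write
mm[:] = b"x" * SIZE           # dirty every page of the mapping

t0 = time.monotonic()
mm.flush()                    # msync(): blocks until the pages are written out
elapsed = time.monotonic() - t0

mm.close()
os.close(fd)
os.unlink(path)
print(f"flushed {SIZE >> 20} MB in {elapsed:.3f}s")
```

Scale the flush up to gigabytes at ~50MB/sec disk bandwidth and it runs for minutes; since logger_commit holds the store lock for the duration, new clients (threads 5 and 6 above) block in store_lock() exactly as the backtraces show.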
On Thu, Mar 19, 2009 at 6:01 PM, Yue Sheng <yuesheng8@gmail.com> wrote:
Managed to COPY 161 files into one table and another 162 files into another table. But when I do a simple select count(*) from table2; it fails (returns nothing, just hangs)...
Has anyone successfully run this database on a Mac, in 64-bit, with a table (not database) size larger than 500 million rows (10 columns)?
On Thu, Mar 19, 2009 at 8:59 AM, Yue Sheng <yuesheng8@gmail.com> wrote:
Now have tried the same bulk load with COPY on the latest (v5.10.0) with single thread setting.
Failed again!
System:
OS X
MonetDB compiled in 64-bit
number of files to load: 322
total size of files: 15GB
Max rows per file: 3 million
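A bulk load like the one described here is a sequence of COPY INTO statements, one per file, each carrying the record-count hint (the "N" of this thread). As a concrete sketch, the helper below just builds such statements; the table name, file paths, and comma/newline delimiters are made up for illustration:

```python
def copy_stmt(table, path, nrows=None):
    """Build a MonetDB COPY INTO statement. The optional row-count
    hint lets the server size the column heaps once up front
    instead of guessing and regrowing."""
    hint = f"{nrows} RECORDS " if nrows is not None else ""
    return (f"COPY {hint}INTO {table} "
            f"FROM '{path}' USING DELIMITERS ',','\\n';")

# First file of a batch: hint with the grand total, so all the
# space is claimed at once; later files: that file's own line
# count (the (a)/(b) scheme discussed in this thread).
print(copy_stmt("ticks", "/data/part001.csv", 650_000_000))
print(copy_stmt("ticks", "/data/part002.csv", 3_000_000))
```

The statements would then be fed to the server via mclient or any MAPI client, one file at a time.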
On Wed, Mar 18, 2009 at 11:51 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
I'm not sure how "The (parallel) load used scratch area as well" is related to the question.
If you look at the code, you will notice that there is a two-phase loading process involving (possibly) multiple threads.
Sorry if I'm a bit slow.
On Wed, Mar 18, 2009 at 11:25 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
Yue Sheng wrote:
Sorry if I wasn't clear on the first question:
(1) we ramp up N for the first insert to claim sufficient space. Sure, I understand that one.
But: the claimed space got "given back" *right after* the first insert. (This is the part I don't understand.)
The (parallel) load used the scratch area as well
Question: how do the second, third, ... inserts get the "benefit" of the ramp-up that we did for the first insert?
Is it a bit clearer what my question pertains to?
Thanks.
On Wed, Mar 18, 2009 at 10:26 AM, Martin Kersten <Martin.Kersten@cwi.nl> wrote:
participants (2):
- Niels Nes
- Yue Sheng