19 Feb 2010
7:19 a.m.
I applied Peter's patches as of 1 AM and started the SF100 run.
It gave a segfault after 10 minutes, but for once I was not watching
Q1 to 'see/feel' the processing.
Rebuilding now with all of this night's patches.
Peter Boncz wrote:
> Hi Stefan
>
> Thanks, indeed improvements are needed in all areas:
> 1) indeed (scary use of free!) this should be corrected
> 2) typically yes. I do recall now that BATfetchjoin heap sharing will
> invalidate the otherwise always-applying order correlation. If we have a way
> to detect that a heap is shared, we should treat those shared string heaps
> as WILLNEED.
> 3) also correct. MT_mmap_find() could easily find entries by range
> overlap; MT_mmap_inform() would then find the relevant heap.
>
> Finally, sequential advice now does not trigger preloading, but I actually
> think it can help (if you have enough memory). Maybe prefetch sequential
> heaps up to some limit, like Martin suggests, e.g. 1/4*threads of memory
> (a sketch of such a cap follows below).
>
> Peter
>
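
A minimal sketch of the cap suggested in the last paragraph above, assuming
that "1/4*threads of memory" means a quarter of physical memory split over
the threads; MT_npages() and MT_pagesize() mirror the real GDK calls, all
other names here are hypothetical, not the committed code:

    #include <stddef.h>

    /* Hypothetical helper: cap sequential prefetching at a quarter of
     * physical memory divided over the worker threads. npages and
     * pagesize would come from MT_npages() and MT_pagesize(). */
    static size_t
    sequential_preload_budget(size_t npages, size_t pagesize, int nthreads)
    {
        return (npages * pagesize) / (4 * (size_t) nthreads);
    }

    /* prefetch a sequential heap only if it fits within that budget */
    static int
    may_preload_sequential(size_t heap_size, size_t npages, size_t pagesize,
                           int nthreads)
    {
        return heap_size <= sequential_preload_budget(npages, pagesize, nthreads);
    }
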
> -----Original Message-----
> From: Stefan Manegold [mailto:Stefan.Manegold@cwi.nl]
> Sent: Friday, 19 February 2010 1:34
> To: monetdb-developers@lists.sourceforge.net; Peter Boncz
> Cc: monetdb-checkins@lists.sourceforge.net
> Subject: Re: [Monetdb-checkins] MonetDB/src/gdk gdk_posix.mx, Feb2010, 1.176.2.21, 1.176.2.22 gdk_storage.mx, Feb2010, 1.149.2.32, 1.149.2.33
>
> Peter,
>
> I have some questions to make sure I understand your new code correctly:
>
> 1)
> I don't see any place in the hash code (at least not in gdk_search.mx)
> where the "free" element of a hash heap is set (or used) other than the
> initialization to 0 in HEAPalloc;
> thus, I guess, "free" for hash heaps is always 0;
> hence, shouldn't we use "size" instead of "free" for the madvise & preload
> size of hash heaps (as we did in the original BATpreload/BATaccess code)?
>
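
As an illustration of the fix this question implies, a hedged sketch with a
reduced stand-in for GDK's Heap, modelling only the two fields at issue:

    #include <stddef.h>

    /* Reduced stand-in for GDK's Heap; the real struct has more fields. */
    typedef struct {
        char  *base;
        size_t free;   /* stays 0 for hash heaps: only set in HEAPalloc */
        size_t size;   /* allocated length of the heap */
    } heap_sketch;

    /* Length that madvise/preload should use for a hash heap: "size",
     * since "free" is never advanced for hash heaps and would yield 0. */
    static size_t
    hash_heap_advice_len(const heap_sketch *h)
    {
        return h->size;
    }
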
> 2)
> Am I right that for string heaps you conclude from a strong order
> correlation between the offset heap and the string heap (due to sequential
> load/insertion) that the first and last BUN in the offset heap also point
> to the "first" and "last" string in the string heap?
> Well, indeed, since access is to be considered at page-size granularity,
> this might be reasonable ...
>
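
To make that assumption concrete, a sketch under the stated premise: if
strings were appended in BUN order, the offsets stored for the first and
last BUN bracket the string-heap region that a scan touches, up to page
rounding. var_t as a plain offset and the flat offsets array are
simplifications, not the actual GDK layout:

    #include <stddef.h>

    typedef size_t var_t;   /* simplified: byte offset into the string heap */

    /* Page-aligned [*lo_out, *lo_out + *len_out) range of the string heap
     * touched by a scan over BUNs [0, nbuns), assuming insertion-order
     * correlation. pagesize must be a power of two. */
    static void
    string_heap_range(const var_t *offsets, size_t nbuns, char *vheap_base,
                      size_t pagesize, char **lo_out, size_t *len_out)
    {
        char *lo = vheap_base + offsets[0];
        char *hi = vheap_base + offsets[nbuns - 1];
        lo = (char *) ((size_t) lo & ~(pagesize - 1));              /* round down */
        hi = (char *) (((size_t) hi + pagesize) & ~(pagesize - 1)); /* round up, past the last offset */
        *lo_out = lo;
        *len_out = (size_t) (hi - lo);
    }
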
>
> 3)
> (This was the same in the previous version of the code.)
> For BUN heaps, in case of views (slices), the base pointer of the view's
> heap might not be the same as the parent's heap; in fact, it might not be
> page-aligned.
> If I understand the MT_mmap_tab[] array correctly, it identifies heaps by
> the page-aligned base pointer of the parent's heap.
> Hence, BATaccess() on a slice-view BAT with a non-aligned heap->base
> pointer calls MT_mmap_inform() (through access_heap()) with a non-aligned
> heap->base, which is not found in MT_mmap_tab[], and hence MT_mmap_inform()
> does nothing with that heap. With preload==1 it thus does not register the
> posix_madvise() call that access_heap() does. Consequently, with
> preload==-1, MT_mmap_inform() will never reset the advice set via slice
> views, unless there is (also) access to the original parent's heap (i.e.,
> one with a page-aligned heap->base pointer).
> I just noticed this, but do not yet understand whether, and if so which,
> consequences this might have ...
>
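
Peter's point 3 above proposes a range-overlap lookup as the remedy; a
minimal sketch of such a lookup, with a reduced stand-in for an
MT_mmap_tab[] entry:

    #include <stddef.h>

    /* Reduced stand-in for an MT_mmap_tab[] entry. */
    typedef struct {
        char  *base;   /* page-aligned base of the parent mapping */
        size_t len;    /* length of the mapping */
    } mmap_entry_sketch;

    /* Find the mapping that contains pointer p by range overlap, so a
     * slice view's non-page-aligned heap->base still hits the parent's
     * entry instead of being missed by an exact base-pointer match. */
    static int
    mmap_find_by_overlap(const mmap_entry_sketch *tab, int n, const char *p)
    {
        int i;
        for (i = 0; i < n; i++)
            if (tab[i].base && p >= tab[i].base && p < tab[i].base + tab[i].len)
                return i;
        return -1;
    }
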
>
> Stefan
>
>
> On Thu, Feb 18, 2010 at 10:39:22PM +0000, Peter Boncz wrote:
>> Update of /cvsroot/monetdb/MonetDB/src/gdk
>> In directory sfp-cvsdas-1.v30.ch3.sourceforge.com:/tmp/cvs-serv28734
>>
>> Modified Files:
>> Tag: Feb2010
>> gdk_posix.mx gdk_storage.mx
>> Log Message:
>> experimented with sequential mmap I/O:
>> - on very fast subsystems (such as 16xssd) it is three times slower than optimally tuned direct I/O (1GB/s vs 3GB/s)
>> - with fewer disks the difference is smaller (e.g. 140 vs 200MB/s);
>> regrettably, nothing helped to get it higher.
>>
>> the below checkin makes the following changes:
>> - simplified BATaccess code by separating out a routine
>> - made BATaccess more precise in what to preload (only BUNfirst-BUNlast)
>> - observe that large string heaps have a high sequential correlation,
>>   hence always fetching with WILLNEED is overkill
>> - moved the madvise() call back to BATaccess at the start of the access,
>>   but removing the advice is done in vmtrim, as you need the overview when
>>   the last user is away.
>> - the basic advice is SEQUENTIAL (i.e. decent I/O)
>>
>>
>>
>> Index: gdk_storage.mx
>> ===================================================================
>> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_storage.mx,v
>> retrieving revision 1.149.2.32
>> retrieving revision 1.149.2.33
>> diff -u -d -r1.149.2.32 -r1.149.2.33
>> --- gdk_storage.mx 18 Feb 2010 01:04:11 -0000 1.149.2.32
>> +++ gdk_storage.mx 18 Feb 2010 22:39:08 -0000 1.149.2.33
>> @@ -697,156 +697,95 @@
>> return BATload_intern(i);
>> }
>> @- BAT preload
>> -To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload
>> -request.
>> -Of course, it does not make sense to touch more than we can physically accommodate.
>> +To avoid random disk access to large (memory-mapped) BATs it may help to issue a preload request.
>> +Of course, it does not make sense to touch more than we can physically accommodate (budget).
>> @c
>> -size_t
>> -BATaccess(BAT *b, int what, int advise, int preload) {
>> -    size_t *i, *limit;
>> -    size_t v1 = 0, v2 = 0, v3 = 0, v4 = 0;
>> -    size_t step = MT_pagesize()/sizeof(size_t);
>> -    size_t pages = (size_t) (0.8 * MT_npages());
>> -
>> -    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
>> -
>> -    /* VAR heaps (inherent random access) */
>> -    if ( what&USE_HEAD && b->H->vheap && b->H->vheap->base ) {
>> -        if (b->H->vheap->storage != STORE_MEM && b->H->vheap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->vheap->base, b->H->vheap->size, preload, MMAP_WILLNEED, 0);
>> -        }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->vheap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->H->vheap->base + b->H->vheap->free) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->H->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +/* modern linux tends to use 128K readaround = 64K readahead
>> + * changes have been going on in 2009, towards true readahead
>> + * http://tomoyo.sourceforge.jp/cgi-bin/lxr/source/mm/readahead.c
>> + *
>> + * Peter Feb2010: I tried to do prefetches further apart, to trigger multiple readahead
>> + *                units in parallel, but it does not improve performance visibly
>> + */
>> +static size_t access_heap(str id, str hp, Heap *h, char* base, size_t sz, int touch, int preload, int advise) {
>> +    size_t v0 = 0, v1 = 0, v2 = 0, v3 = 0, v4 = 0, v5 = 0, v6 = 0, v7 = 0, page = MT_pagesize();
>> +    int t = GDKms();
>> +    if (h->storage != STORE_MEM && h->size > MT_MMAP_TILE) {
>> +        MT_mmap_inform(h->base, h->size, preload, advise, 0);
>> +        if (preload > 0) {
>> +            void* alignedbase = (void*) (((size_t) base) & ~(page-1));
>> +            size_t alignedsz = (sz + (page-1)) & ~(page-1);
>> +            int ret = posix_madvise(alignedbase, sz, advise);
>> +            if (ret) THRprintf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
>> +                h->filename, PTRFMTCAST alignedbase, alignedsz >> 20, advise, errno);
>>         }
>>     }
>> -    if ( what&USE_TAIL && b->T->vheap && b->T->vheap->base ) {
>> -        if (b->T->vheap->storage != STORE_MEM && b->T->vheap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->vheap->base, b->T->vheap->size, preload, MMAP_WILLNEED, 0);
>> -        }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->vheap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->T->vheap->base + b->T->vheap->free - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->T->vheap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +    if (touch && preload > 0) {
>> +        /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> +        size_t *lo = (size_t *) (((size_t) base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> +        size_t *hi = (size_t *) (base + sz);
>> +        for (hi -= 8*page; lo <= hi; lo += 8*page) {
>> +            /* try to trigger loading of multiple pages without blocking */
>> +            v0 += lo[0*page]; v1 += lo[1*page]; v2 += lo[2*page]; v3 += lo[3*page];
>> +            v4 += lo[4*page]; v5 += lo[5*page]; v6 += lo[6*page]; v7 += lo[7*page];
>>         }
>> +        for (hi += 7*page; lo <= hi; lo += page) v0 += *lo;
>>     }
>> +    IODEBUG THRprintf(GDKout,"#BATpreload(%s->%s,preload=%d,sz=%dMB,%s) = %dms \n", id, hp, preload, (int) (sz>>20),
>> +        (advise==BUF_WILLNEED)?"WILLNEED":(advise==BUF_SEQUENTIAL)?"SEQUENTIAL":"UNKNOWN", GDKms()-t);
>> +    return v0+v1+v2+v3+v4+v5+v6+v7;
>> +}
>>
>> -    /* BUN heaps (no need to preload for sequential access) */
>> -    if ( what&USE_HEAD && b->H->heap.base ) {
>> -        if (b->H->heap.storage != STORE_MEM && b->H->heap.size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->heap.base, b->H->heap.size, preload, advise, 0);
>> -        }
>> -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->heap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (Hloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)Hloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> -        }
>> -    }
>> -    if ( what&USE_TAIL && b->T->heap.base ) {
>> -        if (b->T->heap.storage != STORE_MEM && b->T->heap.size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->heap.base, b->T->heap.size, preload, advise, 0);
>> +size_t
>> +BATaccess(BAT *b, int what, int advise, int preload) {
>> +    ssize_t budget = (ssize_t) (0.8 * MT_npages());
>> +    size_t v = 0, sz;
>> +    str id = BATgetId(b);
>> +    BATiter bi = bat_iterator(b);
>> +
>> +    assert(advise==MMAP_NORMAL||advise==MMAP_RANDOM||advise==MMAP_SEQUENTIAL||advise==MMAP_WILLNEED||advise==MMAP_DONTNEED);
>> +    if (BATcount(b) == 0) return 0;
>> +
>> +    /* HASH indices (inherent random access). handle first as they *will* be accessed randomly (one can always hope for locality on the other heaps) */
>> +    if ( what&USE_HHASH || what&USE_THASH ) {
>> +        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> +        if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base) {
>> +            budget -= sz = (b->H->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
>> +            v += access_heap(id, "hhash", b->H->hash->heap, b->H->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>>         }
>> -        if (preload > 0 && pages > 0 && advise != MMAP_SEQUENTIAL) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->heap\n", BATgetId(b), advise);
>> -            limit = (size_t *) (Tloc(b, BUNlast(b)) - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)Tloc(b, BUNfirst(b)) + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base) {
>> +            budget -= sz = (b->T->hash->heap->free > (size_t) budget)?budget:(ssize_t)b->T->hash->heap->free;
>> +            v += access_heap(id, "thash", b->T->hash->heap, b->T->hash->heap->base, sz, 1, preload, MMAP_WILLNEED);
>>         }
>> +        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>>     }
>>
>> -    /* HASH indices (inherent random access) */
>> -    if ( what&USE_HHASH || what&USE_THASH )
>> -        gdk_set_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> -    if ( what&USE_HHASH && b->H->hash && b->H->hash->heap && b->H->hash->heap->base ) {
>> -        if (b->H->hash->heap->storage != STORE_MEM && b->H->hash->heap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->H->hash->heap->base, b->H->hash->heap->size, preload, MMAP_WILLNEED, 0);
>> +    /* we only touch stuff that is going to be read randomly (WILLNEED). Note varheaps are sequential wrt the references, or small */
>> +    if ( what&USE_HEAD) {
>> +        if (b->H->heap.base) {
>> +            char *lo = BUNhloc(bi, BUNfirst(b)), *hi = BUNhloc(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "hbuns", &b->H->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>         }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): H->hash\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->H->hash->heap->base + b->H->hash->heap->size - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->H->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if (b->H->vheap && b->H->vheap->base) {
>> +            char *lo = BUNhead(bi, BUNfirst(b)), *hi = BUNhead(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "hheap", b->H->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>         }
>>     }
>> -    if ( what&USE_THASH && b->T->hash && b->T->hash->heap && b->T->hash->heap->base ) {
>> -        if (b->T->hash->heap->storage != STORE_MEM && b->T->hash->heap->size > MT_MMAP_TILE) {
>> -            MT_mmap_inform(b->T->hash->heap->base, b->T->hash->heap->size, preload, MMAP_WILLNEED, 0);
>> +    if ( what&USE_TAIL) {
>> +        if (b->T->heap.base) {
>> +            char *lo = BUNtloc(bi, BUNfirst(b)), *hi = BUNtloc(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "tbuns", &b->T->heap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>         }
>> -        if (preload > 0 && pages > 0) {
>> -            IODEBUG THRprintf(GDKout,"#BATaccess(%s,%d): T->hash\n", BATgetId(b), advise);
>> -            limit = (size_t *) (b->T->hash->heap->base + b->T->hash->heap->size - sizeof(size_t)) - 4 * step;
>> -            /* we need to ensure alignment, here, as b might be a view and heap.base of views are not necessarily aligned */
>> -            i = (size_t *) (((size_t)b->T->hash->heap->base + sizeof(size_t) - 1) & (~(sizeof(size_t) - 1)));
>> -            for (; i <= limit && pages > 3 ; i+= 4*step, pages-= 4) {
>> -                v1 += *i;
>> -                v2 += *(i + step);
>> -                v3 += *(i + 2*step);
>> -                v4 += *(i + 3*step);
>> -            }
>> -            limit += 4 * step;
>> -            for (; i <= limit && pages > 0; i+= step, pages--) {
>> -                v1 += *i;
>> -            }
>> +        if (b->T->vheap && b->T->vheap->base) {
>> +            char *lo = BUNtail(bi, BUNfirst(b)), *hi = BUNtail(bi, BUNlast(b)-1);
>> +            budget -= sz = ((hi-lo) > budget)?budget:(hi-lo);
>> +            v += access_heap(id, "theap", b->T->vheap, lo, sz, (advise == BUF_WILLNEED), preload, advise);
>>         }
>>     }
>> -    if ( what&USE_HHASH || what&USE_THASH )
>> -        gdk_unset_lock(GDKhashLock(ABS(b->batCacheid) & BBP_BATMASK), "BATaccess");
>> -
>> -    return v1 + v2 + v3 + v4;
>> +    return v;
>>  }
>> @}
>>
>>
>> Index: gdk_posix.mx
>> ===================================================================
>> RCS file: /cvsroot/monetdb/MonetDB/src/gdk/gdk_posix.mx,v
>> retrieving revision 1.176.2.21
>> retrieving revision 1.176.2.22
>> diff -u -d -r1.176.2.21 -r1.176.2.22
>> --- gdk_posix.mx 18 Feb 2010 01:03:55 -0000 1.176.2.21
>> +++ gdk_posix.mx 18 Feb 2010 22:38:53 -0000 1.176.2.22
>> @@ -909,10 +909,8 @@
>>      unload = MT_mmap_tab[i].usecnt == 0;
>>  }
>>  (void) pthread_mutex_unlock(&MT_mmap_lock);
>> -if (i >= 0 && preload > 0)
>> -    ret = posix_madvise(base, len, advise);
>> -else if (unload)
>> -    ret = posix_madvise(base, len, MMAP_NORMAL);
>> +if (unload)
>> +    ret = posix_madvise(base, len, BUF_SEQUENTIAL);
>>  if (ret) {
>>      stream_printf(GDKerr, "#MT_mmap_inform: posix_madvise(file=%s, fd=%d, base="PTRFMT", len="SZFMT"MB, advice=%d) = %d\n",
>>          (i >= 0 ? MT_mmap_tab[i].path : ""), (i >= 0 ? MT_mmap_tab[i].fd : -1),
>>