
Hi there,

I need a string tokenizer in MonetDB. The problem I have is not with the function itself, but with the fact that this is a 1-to-N rows function.

Implementing this for a single string value is easy enough, using a table function that takes a string and returns a table:

    create function tokenize(s string) returns table (token string) external name tokenize;
    select * from tokenize('one two three');

That's fine. The issue I'm having is with extending this to a column of strings. Ideally, given a string column

    one two three
    four five six
    seven eight

I'd like to get an output along these lines (simplistic representation here):

    one two three | one
    one two three | two
    one two three | three
    four five six | four
    four five six | five
    four five six | six
    seven eight   | seven
    seven eight   | eight

I can certainly code the C function and the MAL wrapper to implement this, but I can't see how to map it to SQL, given that table functions don't accept identifiers as parameters.

Any idea? Any possible workaround?

Thanks,
Roberto
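For illustration, the row expansion described above can be sketched in plain C, independent of MonetDB. This is only a sketch under stated assumptions: the names token_row and tokenize_column, the 32-byte token buffer, and splitting on spaces and tabs only are choices made for this example, not part of MonetDB or of the UDF discussed in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One output row: which input row it came from, plus one token.
 * (Hypothetical type for this sketch; not a MonetDB structure.) */
typedef struct {
    size_t src;        /* index of the source string in the input column */
    char   token[32];  /* token copy, truncated to 31 bytes if longer */
} token_row;

/*
 * Expand a column of n strings into (source row, token) pairs.
 * At most cap rows are written to out; the return value is the total
 * number of tokens found, which may exceed cap.
 */
size_t tokenize_column(const char *const *col, size_t n,
                       token_row *out, size_t cap)
{
    size_t produced = 0;

    assert(col != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        const char *p = col[i];
        while (*p != '\0') {
            p += strspn(p, " \t");            /* skip separators */
            size_t len = strcspn(p, " \t");   /* length of the next token */
            if (len == 0)
                break;                        /* only separators were left */
            if (produced < cap) {
                size_t max = sizeof out[produced].token - 1;
                size_t keep = len < max ? len : max;
                memcpy(out[produced].token, p, keep);
                out[produced].token[keep] = '\0';
                out[produced].src = i;
            }
            produced++;
            p += len;
        }
    }
    return produced;
}
```

Each input row yields one output row per token, and the src index plays the role of the repeated input value in the left-hand column of the desired output.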

On Sat, Apr 11, 2015 at 11:03:22AM +0200, Roberto Cornacchia wrote:

this in your tokenize function, i.e. return both input and token.

Niels
Thanks, Roberto
-- Niels Nes, Manager ITF, Centrum Wiskunde & Informatica (CWI) Science Park 123, 1098 XG Amsterdam, The Netherlands room L3.14, phone ++31 20 592-4098 sip:4098@sip.cwi.nl url: https://www.cwi.nl/people/niels e-mail: Niels.Nes@cwi.nl

Hi Niels,

That sounds perfect. I suppose you refer to this: http://dev.monetdb.org/hg/MonetDB?cmd=changeset;node=5db56a1d5bc5

Do you think I would have any luck trying to port this back to Oct2014?

On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:

On Sat, Apr 11, 2015 at 03:24:42PM +0200, Roberto Cornacchia wrote:
Niels ps we are planning a release too...

Hi Niels,

I have tried this in default and indeed it works like a charm. (My UTF8tokenize UDF takes two values and outputs a 3-column table.)

I noticed, though, that it results in a MAL loop:

    barrier (X_72,X_73) := iterator.new(X_8);
        X_75 := algebra.fetch(X_11,X_72);
        X_77 := algebra.fetch(X_14,X_72);
        (X_79,X_80,X_81) := str.UTF8tokenize(X_73,X_75,X_77);
        bat.append(X_64,X_79);
        bat.append(X_67,X_80);
        bat.append(X_69,X_81);
        redo (X_72,X_73) := iterator.next(X_8);
    exit (X_72,X_73);

This of course is not going to be efficient, since the function is called once per input tuple. What if I write the bulk version of this function? Would that work? And if it does, would it then also work in Oct2014, as it would no longer need the "union" trick?

Roberto

On 11 April 2015 at 14:06, Niels Nes <Niels.Nes@cwi.nl> wrote:

On Thu, Jun 04, 2015 at 11:46:08AM +0200, Martin Kersten wrote:
Niels

I don't seem to get the expected result; let's see if I'm doing something silly.

- SQL signature:

    create function tokenize(id integer, s string, prob double) returns table (id integer, token string, prob double) external name batstr."UTF8tokenize";

- MAL signature:

    command batstr.UTF8tokenize(id:bat[:oid,:int],s:bat[:oid,:str],prob:bat[:oid,:dbl]) (:bat[:oid,:int],:bat[:oid,:str],:bat[:oid,:dbl]) address STRbat_utf8_tokenize_id_prob;

- C signature:

    batstr_export str STRbat_utf8_tokenize_id_prob(bat *r1, bat *r2, bat *r3, const bat *idx, const bat *s, const bat *prob);

Inspecting a MAL plan for a query like

    SELECT * FROM tokenize (select id, s, prob from x);

I see that the BAT version of the function is being used inside the same tuple-oriented loop:

    X_64 := bat.new(nil:oid,nil:int);
    X_67 := bat.new(nil:oid,nil:str);
    X_69 := bat.new(nil:oid,nil:dbl);
    barrier (X_72,X_73) := iterator.new(X_8);
        X_75 := algebra.fetch(X_11,X_72);
        X_77 := algebra.fetch(X_14,X_72);
        (X_79,X_80,X_81) := batstr.UTF8tokenize(X_73,X_75,X_77);
        bat.append(X_64,X_79);
        bat.append(X_67,X_80);
        bat.append(X_69,X_81);
        redo (X_72,X_73) := iterator.next(X_8);
    exit (X_72,X_73);

Executing this fails, obviously, because the bulk function is handed single values instead of BATs. Can you spot where the problem is?

Roberto

On 6 June 2015 at 10:29, Niels Nes <Niels.Nes@cwi.nl> wrote:
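To make the difference between the per-tuple loop and a bulk call concrete, here is a minimal sketch in plain C of what a bulk tokenizer over three aligned columns does conceptually. It is not MonetDB code and does not use the GDK/BAT API: plain arrays stand in for the BATs, and the names tokenize_bulk and TOKEN_MAX, the fixed token width, and splitting on single spaces are all assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOKEN_MAX 32  /* hypothetical fixed token width for this sketch */

/*
 * Bulk tokenizer over three aligned input columns (id, s, prob).
 * Every token of s[i] produces one output row carrying id[i] and prob[i].
 * Each output column must have room for cap rows; the function stops
 * writing once cap rows exist and returns the number of rows written.
 */
size_t tokenize_bulk(const int *id, const char *const *s, const double *prob,
                     size_t n,
                     int *out_id, char (*out_tok)[TOKEN_MAX], double *out_prob,
                     size_t cap)
{
    size_t rows = 0;

    assert(id && s && prob && out_id && out_tok && out_prob);
    for (size_t i = 0; i < n; i++) {
        const char *p = s[i];
        for (;;) {
            p += strspn(p, " ");             /* skip separators */
            size_t len = strcspn(p, " ");    /* length of the next token */
            if (len == 0 || rows == cap)
                break;                       /* end of string or output full */
            size_t keep = len < TOKEN_MAX - 1 ? len : TOKEN_MAX - 1;
            memcpy(out_tok[rows], p, keep);
            out_tok[rows][keep] = '\0';
            out_id[rows] = id[i];            /* carry id through unchanged */
            out_prob[rows] = prob[i];        /* carry prob through unchanged */
            rows++;
            p += len;
        }
    }
    return rows;
}
```

The point of the bulk form is that the whole input is processed in a single call that fills all three result columns, instead of one call plus three appends per input tuple as in the plan above.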

Clone the following (small) repository and read the README.rst file in there: http://dev.monetdb.org/hg/MonetDB-extend/

In short, you need to define the SQL scalar function, which should point to the MAL function str.UTF8tokenize, and you need to have the MAL bulk function batstr.UTF8tokenize. If you have those, SQL should figure it all out. (In particular, you should *not* have the SQL bulk function.)

On 08/06/15 11:36, Roberto Cornacchia wrote:
-- Sjoerd Mullender

Thanks Sjoerd,

What I posted before did indeed have one mistake: the SQL scalar function was pointing to the MAL bulk function, not to the scalar one. I have now fixed it, but the bulk version is still not used (it exists and is defined in batstr, I believe with the right signature).

I already have a number of UDFs with both scalar and bulk implementations, and have no issue with those. Can it be that in this particular case (a function that takes a sub-select as input), the pattern for the bulk version is not recognised?

Roberto

On 8 June 2015 at 13:48, Sjoerd Mullender <sjoerd@acm.org> wrote:

Just to close the loop, I no longer see my problem. The signatures are now correct and the MAL plan is generated as expected. Perhaps I had run my last tests on the wrong MonetDB instance. No idea.

Thanks for the pointer to the documentation, I hadn't seen it before.

Roberto

On 8 June 2015 at 14:05, Roberto Cornacchia <roberto.cornacchia@gmail.com> wrote:
Participants (4): Martin Kersten, Niels Nes, Roberto Cornacchia, Sjoerd Mullender