
Hi,
After reviewing the alternatives (SQL and Python UDFs), I was stuck either on performance with SQL UDFs or on usability with Python UDFs (they cannot be used with aggregation, and their performance with dates is not great), so I decided to go the hard way with C functions. As a bonus, this lets me change the functionality without worrying about dependencies, which was not the case with the other languages.
The purpose is to create a set of formatting functions for Year, Quarter, Month, Week and Day brackets, and of course I need to create a bulk version of each function for performance.
Starting from MTIMEdate_extract_year_bulk, I now have the simple function working and can call it successfully from mclient:
    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int year;
            fromdate(*v, NULL, NULL, &year);
            *ret = (str) GDKmalloc(15);
            sprintf(*ret, "%d", year);
        }
        return MAL_SUCCEED;
    }
For the bulk version I get an error in the log:
    gdk_atoms.c:1345: strPut: Assertion `(v[i] & 0x80) == 0' failed.
The bulk function is:
    str
    UDFBATyearbracket(bat *ret, const bat *bid)
    {
        BAT *b, *bn;
        BUN i, n;
        str *y;
        const date *t;
        if ((b = BATdescriptor(*bid)) == NULL)
            throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
        n = BATcount(b);
        bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
        if (bn == NULL) {
            BBPunfix(b->batCacheid);
            throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
        }
        bn->tnonil = 1;
        bn->tnil = 0;
        t = (const date *) Tloc(b, 0);
        y = (str *) Tloc(bn, 0);
        for (i = 0; i < n; i++) {
            if (*t == date_nil) {
                *y = GDKstrdup(str_nil);
            } else
                UDFyearbracket(y, t);
            if (strcmp(*y, str_nil) == 0) {
                bn->tnonil = 0;
                bn->tnil = 1;
            }
            y++;
            t++;
        }
        BATsetcount(bn, (BUN) (y - (str *) Tloc(bn, 0)));
        bn->tsorted = BATcount(bn)<2;
        bn->trevsorted = BATcount(bn)<2;
        BBPkeepref(*ret = bn->batCacheid);
        BBPunfix(b->batCacheid);
        return MAL_SUCCEED;
    }
PS: I am not a C expert, but I can find my way around basic operations and pointers.
Any help or suggestions are appreciated.
Thank you.

Imad, I hope you succeed with this. Please let us know if you get it working. Could these new functions then be incorporated into a future version of MonetDB, or perhaps be compiled easily against the current one? That way, users could suggest new useful functions in the future (it is a shame about SQL UDF performance).
Regards!
In reply to imad hajj chahine <imad.hajj.chahine@gmail.com>, 2016-12-28 14:48 GMT-03:00.
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list

See https://dev.monetdb.org/hg/MonetDB-extend/ for a tutorial on how to create a UDF in C. You can also clone the repository from that URL.
In reply to Alberto Ferrari, 12/28/2016 09:28 PM.
-- Sjoerd Mullender

Thank you, Sjoerd.
Any idea how to convert an integer to a UTF-8 string? Does sprintf come with a variant that can handle UTF-8?
Thank you.
In reply to Sjoerd Mullender <sjoerd@monetdb.org>, Wed, Dec 28, 2016 at 11:08 PM.

Hi again Sjoerd,
After digging in the code I found GDKstrFromStr. Does this function handle conversion from a normal string to a UTF-8 string? Is this the correct way to use the function?
    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int year;
            fromdate(*v, NULL, NULL, &year);
            *ret = (str) GDKmalloc(15);
            sprintf(*ret, "%d", year);
            GDKstrFromStr((unsigned char *)*ret, (unsigned char *)*ret, 15);
        }
        return MAL_SUCCEED;
    }
Thank you.
In reply to imad hajj chahine <imad.hajj.chahine@gmail.com>, Wed, Dec 28, 2016 at 11:40 PM.

Since the MonetDB server is UTF-8 *only*, you should *never* have non-UTF-8 strings inside the server. If you have strings in some other encoding, they should be converted to UTF-8 by whatever client program you're using. mclient has options to do this (-e option).
If you want to do conversions yourself, take a look at the iconv-related code in common/stream/stream.c. Also, the console_read and console_write functions in that file can give you inspiration. They convert Windows wide characters (16-bit encodings of Unicode code points) to and from UTF-8. This would be close to converting ints to UTF-8.
In reply to imad hajj chahine, 12/29/2016 01:10 AM.
-- Sjoerd Mullender
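For reference, the kind of conversion described above (a Unicode code point held in an integer, turned into its UTF-8 byte sequence) can be sketched roughly as follows. This is an illustrative sketch only and not the code in common/stream/stream.c; the name utf8_encode is hypothetical. Note that this is a different task from printing the decimal digits of a number such as a year.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of MonetDB): encode one Unicode code point
 * as UTF-8 into out, which must have room for at least 5 bytes, and
 * NUL-terminate it.  Returns the number of bytes written (excluding the
 * NUL), or 0 if cp is not a valid Unicode scalar value. */
size_t
utf8_encode(uint32_t cp, char *out)
{
    assert(out != NULL);
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* UTF-16 surrogates are not valid */
        return 0;
    if (cp < 0x80) {                    /* 1 byte: plain ASCII */
        out[0] = (char) cp;
        out[1] = '\0';
        return 1;
    }
    if (cp < 0x800) {                   /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (char) (0xC0 | (cp >> 6));
        out[1] = (char) (0x80 | (cp & 0x3F));
        out[2] = '\0';
        return 2;
    }
    if (cp < 0x10000) {                 /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (char) (0xE0 | (cp >> 12));
        out[1] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char) (0x80 | (cp & 0x3F));
        out[3] = '\0';
        return 3;
    }
    if (cp <= 0x10FFFF) {               /* 4 bytes: 11110xxx followed by 3 x 10xxxxxx */
        out[0] = (char) (0xF0 | (cp >> 18));
        out[1] = (char) (0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char) (0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char) (0x80 | (cp & 0x3F));
        out[4] = '\0';
        return 4;
    }
    return 0;                           /* beyond the Unicode range */
}
```

Code points below 0x80 come out as a single unchanged byte, which is why plain ASCII text needs no conversion at all.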

Hi Sjoerd,
I tried to use iconv with no luck: I always get an empty string. I assumed that the output of sprintf is encoded in "ISO-8859-1". Can you please take a look at the following implementation?
    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
            int factor = 4;
            size_t fromlen, tolen;
            int year;
            char *buf;
            char *retChar = (char *)*ret;
            fromdate(*v, NULL, NULL, &year);
            buf = (char *) GDKmalloc(15);
            sprintf(buf, "%d", year);
            fromlen = strlen(buf);
            tolen = factor * fromlen + 1;
            retChar = (char *) GDKmalloc(tolen);
            iconv(cv, &buf, &fromlen, &retChar, &tolen);
            iconv_close(cv);
        }
        return MAL_SUCCEED;
    }
Thanks
In reply to Sjoerd Mullender <sjoerd@monetdb.org>, Thu, Dec 29, 2016 at 1:08 PM.
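A note on the iconv call in the code above: iconv() advances the input and output pointers it is given, so after the call retChar points past the converted bytes rather than at their start. In addition, the posted code never stores the new buffer in *ret, and iconv() does not write a terminating NUL. A minimal standalone sketch that avoids these three problems might look like the following. It uses plain malloc instead of GDKmalloc so that it compiles without the MonetDB headers, and latin1_to_utf8 is a hypothetical name. For a string of decimal digits the conversion is not needed, since digits are plain ASCII; the sketch is only relevant for input that really is ISO-8859-1.

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated ISO-8859-1 string to a
 * newly malloc'ed, NUL-terminated UTF-8 string.  Returns NULL on failure;
 * the caller frees the result. */
char *
latin1_to_utf8(const char *in)
{
    assert(in != NULL);
    iconv_t cd = iconv_open("UTF-8", "ISO-8859-1");
    if (cd == (iconv_t) -1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outsize = 2 * inleft + 1;    /* a Latin-1 byte needs at most 2 UTF-8 bytes */
    char *out = malloc(outsize);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    /* iconv() moves these cursors forward; `out` keeps the start of the result */
    char *inptr = (char *) in;
    char *outptr = out;
    size_t outleft = outsize - 1;       /* reserve one byte for the terminator */

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1) {
        free(out);
        return NULL;
    }
    *outptr = '\0';                     /* iconv() does not NUL-terminate */
    return out;
}
```

The caller receives the pointer saved before the conversion (out), not the advanced cursor (outptr).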

Hey Imad,
One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:
    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int year;
            char *buf;
            fromdate(*v, NULL, NULL, &year);
            buf = (char *) GDKmalloc(15);
            sprintf(buf, "%d", year);
            *ret = buf;
        }
        return MAL_SUCCEED;
    }
Regards,
Mark
In reply to imad hajj chahine <imad.hajj.chahine@gmail.com>, 29 Dec 2016, at 14:35.
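The claim that sprintf output for an integer is already valid UTF-8 can be checked directly with a small standalone sketch. The helper names is_ascii and format_year are hypothetical, and the code deliberately avoids the MonetDB headers so that it compiles on its own. The high-bit test is the same expression that appears in the assertion message quoted in the first post, although the sketch makes no claim about what strPut does internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: true if every byte of s has its high bit clear,
 * i.e. s is pure ASCII and therefore also valid UTF-8. */
bool
is_ascii(const char *s)
{
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0x80) != 0)
            return false;
    }
    return true;
}

/* Hypothetical helper: format a year into buf without overrunning it.
 * Returns true only if the whole number, plus the NUL, fit in buflen. */
bool
format_year(int year, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);
    int n = snprintf(buf, buflen, "%d", year);
    return n >= 0 && (size_t) n < buflen;
}
```

Using snprintf rather than sprintf is a design choice of the sketch: it bounds the write to the buffer size, so an unexpectedly large value cannot overflow the allocation.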
Hi Sjoerd,
I tried to used iconv with no luck, I am getting always empty string. I assumed the encoding that i am getting from sprintf are in "ISO-8859-1" Can you please take a look at the following implementation:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { iconv_t cv = iconv_open("UTF-8", "ISO-8859-1"); int factor = 4; size_t fromlen, tolen;
int year; char *buf; char *retChar = (char *)*ret; fromdate(*v, NULL, NULL, &year); buf = (char *) GDKmalloc(15); sprintf(buf, "%d", year);
fromlen = strlen(buf); tolen = factor * fromlen + 1; retChar = (char *) GDKmalloc(tolen); iconv(cv, &buf, &fromlen, &retChar, &tolen); iconv_close(cv); } return MAL_SUCCEED; }
Thanks
On Thu, Dec 29, 2016 at 1:08 PM, Sjoerd Mullender <sjoerd@monetdb.org <mailto:sjoerd@monetdb.org>> wrote: Since the MonetDB server is UTF-8 *only*, you should *never* have non-UTF-8 strings inside the server. If you have strings in some other encoding, they should be converted to UTF-8 by whatever client program you're using. mclient has options to do this (-e option). If you want to do conversions yourself, take a look at the iconv related code in common/stream/stream.c. Also, the console_read and console_write functions in that file can give you inspiration. They convert Windows wide characters (16-bit encodings of Unicode code points) to and from UTF-8. This would be close to converting ints to UTF-8.
On 12/29/2016 01:10 AM, imad hajj chahine wrote:
Hi again Sjoerd,
After digging in the code I found the GDKstrFromStr, does this function handle conversion from a normal string to UTF8_string? Is this the correct syntax to use the function:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { int year; fromdate(*v, NULL, NULL, &year); *ret = (str) GDKmalloc(15); sprintf(*ret, "%d", year); GDKstrFromStr((unsigned char *)*ret, (unsigned char *)*ret, 15); } return MAL_SUCCEED; }
Thank you.
On Wed, Dec 28, 2016 at 11:40 PM, imad hajj chahine <imad.hajj.chahine@gmail.com <mailto:imad.hajj.chahine@gmail.com> <mailto:imad.hajj.chahine@gmail.com <mailto:imad.hajj.chahine@gmail.com>>> wrote:
Thank you Sjoerd,
Any idea how to convert an integer to UTF-8 string, does sprintf come with a variation that can handle UTF-8?
Thank you.
On Wed, Dec 28, 2016 at 11:08 PM, Sjoerd Mullender <sjoerd@monetdb.org <mailto:sjoerd@monetdb.org> <mailto:sjoerd@monetdb.org <mailto:sjoerd@monetdb.org>>> wrote:
See https://dev.monetdb.org/hg/MonetDB-extend/ <https://dev.monetdb.org/hg/MonetDB-extend/> <https://dev.monetdb.org/hg/MonetDB-extend/ <https://dev.monetdb.org/hg/MonetDB-extend/>> for a tutorial on how to create a UDF in C. You can use the URL to clone from.
On 12/28/2016 09:28 PM, Alberto Ferrari wrote:
> Imad, I hope you succeed with this. Please comment if you get it working.
> Could those new functions then be incorporated into a future version of
> Monet, or maybe be easily compiled against the current one? That way users
> could suggest new useful functions in the future (shame about SQL UDF
> performance).
>
> Regards!
-- Sjoerd Mullender
_______________________________________________ users-list mailing list users-list@monetdb.org https://www.monetdb.org/mailman/listinfo/users-list

Hey Imad,

Apologies, scrolling back I noticed that was actually your first attempt at writing the UDF. The source of your error is not encoding related; the error is misleading.

The problem is that in your bulk version you are using Tloc(bn, i) to assign to a string column. Tloc should only be used with constant-sized columns, such as integers or dates. For variable-sized columns such as strings, you should use BUNappend to add values to the column. The reason for that is that string columns are not stored as an array of character pointers, which your initial implementation assumes. Instead, string columns use integers to point into a heap of strings. You are assigning a pointer to one of these integers, which makes MonetDB think the strings are in some random part of your memory. There’s a high chance that that random part of memory does not contain a valid UTF-8 string, hence you get the encoding error.

Try the following bulk implementation instead, using BUNappend instead of Tloc to assign to your column.

Regards,
Mark

    str
    UDFBATyearbracket(bat *ret, const bat *bid)
    {
        BAT *b, *bn;
        BUN i,n;
        const date *t;

        if ((b = BATdescriptor(*bid)) == NULL)
            throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
        n = BATcount(b);

        bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
        if (bn == NULL) {
            BBPunfix(b->batCacheid);
            throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
        }
        bn->tnonil = 1;
        bn->tnil = 0;

        t = (const date *) Tloc(b, 0);
        for (i = 0; i < n; i++) {
            if (*t == date_nil) {
                BUNappend(bn, str_nil, FALSE);
                bn->tnonil = 0;
                bn->tnil = 1;
            } else {
                char* ret;
                UDFyearbracket(&ret, t);
                BUNappend(bn, ret, FALSE);
            }
            t++;
        }

        BATsetcount(bn, n);

        bn->tsorted = BATcount(bn)<2;
        bn->trevsorted = BATcount(bn)<2;

        BBPkeepref(*ret = bn->batCacheid);
        BBPunfix(b->batCacheid);
        return MAL_SUCCEED;
    }

----- Original Message -----
From: "Mark Raasveldt" <m.raasveldt@cwi.nl>
To: "users-list" <users-list@monetdb.org>
Sent: Monday, January 2, 2017 4:32:32 PM
Subject: Re: C UDF

Hey Imad,

One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:

    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int year;
            char *buf;
            fromdate(*v, NULL, NULL, &year);
            buf = (char *) GDKmalloc(15);
            sprintf(buf, "%d", year);
            *ret = buf;
        }
        return MAL_SUCCEED;
    }

Regards,
Mark
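The storage layout Mark describes at the top of this message (fixed-size integer slots that point into a separate heap of strings) can be pictured with a small standalone model. This is only a sketch under stated assumptions: StrColumn, col_append, col_get and col_free are hypothetical names invented for illustration, it uses plain malloc/realloc rather than GDK calls, and MonetDB's real heap layout is more involved than this.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy model of a string column; NOT MonetDB's real BAT layout. */
typedef struct {
    size_t *offsets;    /* one fixed-size slot per row: an offset into heap */
    size_t count, cap;  /* rows used / rows allocated */
    char *heap;         /* all string bytes, NUL-terminated, back to back */
    size_t heap_used, heap_cap;
} StrColumn;

/* Copy s into the heap and store its offset in the next row slot.
 * Returns 0 on success, -1 on allocation failure. */
static int col_append(StrColumn *c, const char *s)
{
    size_t len = strlen(s) + 1;          /* include the terminating NUL */

    if (c->count == c->cap) {            /* grow the slot array */
        size_t ncap = c->cap ? c->cap * 2 : 8;
        size_t *p = realloc(c->offsets, ncap * sizeof *p);
        if (p == NULL)
            return -1;
        c->offsets = p;
        c->cap = ncap;
    }
    if (c->heap_used + len > c->heap_cap) {   /* grow the string heap */
        size_t ncap = c->heap_cap ? c->heap_cap * 2 : 64;
        while (ncap < c->heap_used + len)
            ncap *= 2;
        char *h = realloc(c->heap, ncap);
        if (h == NULL)
            return -1;
        c->heap = h;
        c->heap_cap = ncap;
    }
    memcpy(c->heap + c->heap_used, s, len);
    c->offsets[c->count++] = c->heap_used;    /* the slot holds a number */
    c->heap_used += len;
    return 0;
}

/* Resolve a row to its string by adding the stored offset to the heap base. */
static const char *col_get(const StrColumn *c, size_t row)
{
    return c->heap + c->offsets[row];
}

static void col_free(StrColumn *c)
{
    free(c->offsets);
    free(c->heap);
    memset(c, 0, sizeof *c);
}
```

In this model each row slot holds a number such as 0 or 5, never a char *. Writing a pointer into a slot, as the Tloc-based loop in the original post did, would make col_get read from an unrelated address, which is consistent with the misleading encoding assertion.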
On 29 Dec 2016, at 14:35, imad hajj chahine <imad.hajj.chahine@gmail.com> wrote:
Hi Sjoerd,
I tried to use iconv with no luck; I always get an empty string. I assumed the encoding I get from sprintf is "ISO-8859-1". Can you please take a look at the following implementation:

    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
            int factor = 4;
            size_t fromlen, tolen;

            int year;
            char *buf;
            char *retChar = (char *)*ret;
            fromdate(*v, NULL, NULL, &year);
            buf = (char *) GDKmalloc(15);
            sprintf(buf, "%d", year);

            fromlen = strlen(buf);
            tolen = factor * fromlen + 1;
            retChar = (char *) GDKmalloc(tolen);
            iconv(cv, &buf, &fromlen, &retChar, &tolen);
            iconv_close(cv);
        }
        return MAL_SUCCEED;
    }
Thanks
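The point quoted above, that digits produced by sprintf are plain ASCII and therefore already valid UTF-8, can be checked with a few lines of standalone C, which is why the iconv step should not be needed for this function. The sketch below applies the same high-bit test that appears in the assertion text from the log; is_ascii and format_year are hypothetical helper names, not MonetDB functions.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte of s has its high bit clear, i.e. s is pure ASCII.
 * Pure ASCII text is byte-for-byte identical in UTF-8. */
static int is_ascii(const char *s)
{
    for (const unsigned char *p = (const unsigned char *) s; *p != '\0'; p++)
        if (*p & 0x80)
            return 0;
    return 1;
}

/* Format a year as decimal text into buf.
 * Returns 1 if the text (including the NUL) fit, 0 otherwise. */
static int format_year(char *buf, size_t buflen, int year)
{
    int n = snprintf(buf, buflen, "%d", year);
    return n >= 0 && (size_t) n < buflen;
}
```

Any string for which is_ascii returns 1 can be handed to a UTF-8 consumer unchanged. A byte with the high bit set would instead indicate either multi-byte UTF-8 or, as in this thread, memory that was never a string in the first place.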

Thank you Mark,

I actually managed to solve this issue late on Friday night. Even when using Tloc with integer values I was getting random errors in the log and the database was shutting down. Do I need to set the tnonil and tnil flags when using BATloop/bat_iterator/BUNappend?

Find below the complete implementation:

    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int y = 0;
            fromdate(*v, NULL, NULL, &y);
            *ret = (str)GDKmalloc(snprintf(NULL, 0, "%d", y) + 1);
            if (*ret == NULL)
                throw(MAL, "UDF.yearbracket", "memory allocation failure");
            sprintf(*ret, "%d", y);
        }
        return MAL_SUCCEED;
    }

    str
    UDFBATyearbracket(bat *ret, const bat *bid)
    {
        BAT *b, *bn;
        BATiter bi;
        BUN i,n;

        if ((b = BATdescriptor(*bid)) == NULL)
            throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
        n = BATcount(b);

        bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
        if (bn == NULL) {
            BBPunfix(b->batCacheid);
            throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
        }

        bi = bat_iterator(b);
        BATloop(b, i, n) {
            char *y = NULL;
            const date *t = (const date *) BUNtail(bi, i);
            if (*t == date_nil) {
                y = GDKstrdup(str_nil);
            } else
                UDFyearbracket(&y, t);
            if (BUNappend(bn, y, 0) != GDK_SUCCEED) {
                goto bailout;
            }
            GDKfree(y);
        }

        BBPkeepref(*ret = bn->batCacheid);
        BBPunfix(b->batCacheid);
        return MAL_SUCCEED;

    bailout:
        BBPunfix(b->batCacheid);
        BBPunfix(bn->batCacheid);
        throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL);
    }

Thank You.

Hey Imad,

You don’t need to set the properties manually when using BUNappend; it will do it for you. Your implementation looks correct, although from a performance perspective I would offer you two tips:

- There is no need to allocate/free on every iteration; you can simply create a reasonably sized buffer once and use it for every iteration. A year can never be more than ~10 characters anyway.
- In the same vein, using snprintf to determine the length of the string on every iteration is a bit overkill.

Mark
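The two tips above can be sketched in standalone C: one buffer, sized once for the largest possible int, is reused for every row, and no length-probing snprintf(NULL, 0, ...) call is made. This is an illustration only, not MonetDB code; Sink, sink_append and format_years are hypothetical names, with sink_append standing in for an append that copies the string the way BUNappend is described as doing in this thread.

```c
#include <limits.h>
#include <stdio.h>
#include <string.h>

/* Large enough for any int in decimal: optional sign, digits and NUL.
 * bits/3 over-estimates the digit count because each decimal digit
 * carries more than 3 bits of information. */
#define INT_BUFSZ (sizeof(int) * CHAR_BIT / 3 + 3)

/* Hypothetical sink: joins appended values with ',' so they are easy to inspect. */
typedef struct {
    char text[128];
    size_t used;
} Sink;

/* Append a COPY of v to the sink; returns 0 on success, -1 if it is full. */
static int sink_append(Sink *s, const char *v)
{
    size_t len = strlen(v);

    if (s->used + len + 2 > sizeof s->text)   /* room for ',' and NUL */
        return -1;
    if (s->used > 0)
        s->text[s->used++] = ',';
    memcpy(s->text + s->used, v, len + 1);
    s->used += len;
    return 0;
}

/* Format every year through one reusable buffer. */
static int format_years(const int *years, size_t n, Sink *out)
{
    char buf[INT_BUFSZ];   /* created once, overwritten on each iteration */

    for (size_t i = 0; i < n; i++) {
        snprintf(buf, sizeof buf, "%d", years[i]);
        if (sink_append(out, buf) != 0)
            return -1;
    }
    return 0;
}
```

Reusing the buffer is safe here only because the sink copies the bytes before the next iteration overwrites them. If the consumer kept the pointer instead, every row would end up aliasing the same storage.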

Thanks Mark,

When I use sprintf(*ret, "%d", y) on a pre-allocated buffer of 15 chars and write only 4 chars, the unused characters will be \0, and this will not cause any problem because BUNappend will take a copy of the buffer and stop at the first \0?

So the implementation will be:

    str
    UDFyearbracket(str *ret, const date *v)
    {
        if (*v == date_nil) {
            *ret = GDKstrdup(str_nil);
        } else {
            int y = 0;
            fromdate(*v, NULL, NULL, &y);
            sprintf(*ret, "%d", y);
        }
        return MAL_SUCCEED;
    }

    str
    UDFBATyearbracket(bat *ret, const bat *bid)
    {
        BAT *b, *bn;
        BATiter bi;
        BUN i,n;
        char *y;

        if ((b = BATdescriptor(*bid)) == NULL)
            throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor");
        n = BATcount(b);

        bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT);
        if (bn == NULL) {
            BBPunfix(b->batCacheid);
            throw(MAL, "UDF.BATyearbracket", "memory allocation failure");
        }

        bi = bat_iterator(b);
        y = (char *)GDKmalloc(15);
        /* longest possible string: "-5867411-01-01" i.e. 14 chars
           without NUL (see definition of YEAR_MIN/YEAR_MAX above) */
        BATloop(b, i, n) {
            const date *t = (const date *) BUNtail(bi, i);
            if (*t == date_nil) {
                y = GDKstrdup(str_nil);
            } else
                UDFyearbracket(&y, t);
            if (BUNappend(bn, y, 0) != GDK_SUCCEED) {
                goto bailout;
            }
        }
        GDKfree(y);

        BBPkeepref(*ret = bn->batCacheid);
        BBPunfix(b->batCacheid);
        return MAL_SUCCEED;

    bailout:
        BBPunfix(b->batCacheid);
        BBPunfix(bn->batCacheid);
        throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL);
    }

Hey Imad,

Yes, that is the way strings in C generally work (see: https://en.wikipedia.org/wiki/Null-terminated_string). Your implementation looks good, except you overwrite your buffer variable (y) when you encounter date_nil. Consider something like this for your main loop:

    bi = bat_iterator(b);
    y = (char *)GDKmalloc(15);
    BATloop(b, i, n) {
        const date *t = (const date *) BUNtail(bi, i);
        char* res = str_nil;
        if (*t != date_nil) {
            UDFyearbracket(&y, t);
            res = y;
        }
        if (BUNappend(bn, res, FALSE) != GDK_SUCCEED) {
            goto bailout;
        }
    }
    GDKfree(y);

Mark
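The pattern in the loop above (keep the buffer pointer untouched and pick, per row, between a nil marker and the buffer) can be exercised outside MonetDB with a small stand-in. Everything named here is hypothetical: YEAR_NIL plays the role of date_nil, NIL_TEXT the role of str_nil, and render_years replaces the BAT loop with a write into a caller-supplied character array. The sketch also routes the error path through the same cleanup as the success path, since in the loop as posted a goto bailout appears to skip GDKfree(y).

```c
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define YEAR_NIL INT_MIN                      /* hypothetical stand-in for date_nil */
static const char NIL_TEXT[] = "<nil>";       /* hypothetical stand-in for str_nil */

/* Render years into out, separated by spaces, using one reusable buffer.
 * Returns 0 on success, -1 if allocation fails or out is too small. */
static int render_years(const int *years, size_t n, char *out, size_t outlen)
{
    int rc = -1;
    size_t used = 0;
    char *buf = malloc(16);                   /* allocated once, reused per row */

    if (buf == NULL)
        return -1;
    if (outlen > 0)
        out[0] = '\0';

    for (size_t i = 0; i < n; i++) {
        const char *res = NIL_TEXT;           /* default: nil marker, buf untouched */
        size_t len;

        if (years[i] != YEAR_NIL) {
            snprintf(buf, 16, "%d", years[i]);
            res = buf;                        /* point at the buffer, never reassign it */
        }
        len = strlen(res);
        if (used + len + 2 > outlen)          /* room for ' ' and NUL */
            goto cleanup;                     /* error path still frees buf */
        if (used > 0)
            out[used++] = ' ';
        memcpy(out + used, res, len + 1);
        used += len;
    }
    rc = 0;

cleanup:
    free(buf);                                /* exactly one free on every path */
    return rc;
}
```

Because res only ever points at either the constant or the buffer, the buffer allocated before the loop is the same one freed after it, which is the property the overwritten y in the earlier version lost.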
On 02 Jan 2017, at 18:20, imad hajj chahine <imad.hajj.chahine@gmail.com> wrote:
Thanks Mark,
When I use sprintf(*ret, "%d", y) on a pre-allocated buffer of 15 chars and write only 4 chars the unused characters will be \0 and this will not cause any problem as BUNappend will take a copy of the buffer and stop at the first \0?
So the implementation will be:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { int y = 0; fromdate(*v, NULL, NULL, &y);
sprintf(*ret, "%d", y); } return MAL_SUCCEED; }
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BATiter bi; BUN i,n; char *y;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b);
bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); }
bi = bat_iterator(b); y = (char *)GDKmalloc(15); /* longest possible string: "-5867411-01-01" i.e. 14 chars without NUL (see definition of YEAR_MIN/YEAR_MAX above) */ BATloop(b, i, n) {
const date *t = (const date *) BUNtail(bi, i); if (*t == date_nil) { y = GDKstrdup(str_nil); } else UDFyearbracket(&y, t); if (BUNappend(bn, y, 0) != GDK_SUCCEED) { goto bailout; } } GDKfree(y);
BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED;
bailout: BBPunfix(b->batCacheid); BBPunfix(bn->batCacheid); throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL); }
On Mon, Jan 2, 2017 at 6:54 PM, Mark Raasveldt <m.raasveldt@cwi.nl <mailto:m.raasveldt@cwi.nl>> wrote: Hey Imad,
You don’t need to set the properties manually when using BUNappend; it will do it for you. Your implementation looks correct, although from a performance perspective I would offer you two tips:
- There is no need to allocate/free on every iteration, you can simply create a reasonably sized buffer once and use it for every iteration. A year can never be more than 10~ characters anyway. - In the same vein, using snprintf to determine the length of the string on every iteration is a bit overkill.
Mark
On 02 Jan 2017, at 17:31, imad hajj chahine <imad.hajj.chahine@gmail.com <mailto:imad.hajj.chahine@gmail.com>> wrote:
Thank you Mark,
Actually I managed to solve this issue on late Friday night, even when using TLoc with integer values i was having random errors in the log and the db was shutdown. Do I need to set tonil and tnil flags when using BATloop/bat_iterator/BUNappend?
Find bellow the complete implementation:
str UDFyearbracket(str *ret, const date *v) { if (*v == date_nil) { *ret = GDKstrdup(str_nil); } else { int y = 0; fromdate(*v, NULL, NULL, &y);
*ret = (str)GDKmalloc(snprintf(NULL, 0, "%d", y) + 1); if (*ret == NULL) throw(MAL, "UDF.yearbracket", "memory allocation failure"); sprintf(*ret, "%d", y); } return MAL_SUCCEED; }
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BATiter bi; BUN i,n;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b);
bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); }
bi = bat_iterator(b); BATloop(b, i, n) { char *y = NULL; const date *t = (const date *) BUNtail(bi, i); if (*t == date_nil) { y = GDKstrdup(str_nil); } else UDFyearbracket(&y, t); if (BUNappend(bn, y, 0) != GDK_SUCCEED) { goto bailout; } GDKfree(y); }
BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED;
bailout: BBPunfix(b->batCacheid); BBPunfix(bn->batCacheid); throw(MAL, "UDF.BATyearbracket", MAL_MALLOC_FAIL); }
Thank You.
On Mon, Jan 2, 2017 at 5:58 PM, Mark Raasveldt <m.raasveldt@cwi.nl <mailto:m.raasveldt@cwi.nl>> wrote: Hey Imad,
Apologies, scrolling back I noticed that was actually your first attempt at writing the UDF. The source of your error is not encoding related, the error is misleading.
The problem is that in your bulk version you are using Tloc(bn, i) to assign to a string column. Tloc should only be used with constant-sized columns, such as integers or dates. For variable-sized columns such as strings, you should use BUNappend to add values to the column. The reason for that is that string columns are not stored as an array of character pointers, which your initial implementation assumes. Instead, string columns use integers to point into a heap of strings. You are assigning a pointer to one of these integers, which makes MonetDB think the strings are in some random part of your memory. There’s a high chance that that random part of memory does not contain a valid UTF-8 string, hence you get the encoding error.
Try the following bulk implementation instead, using BUNappend instead of Tloc to assign to your column.
Regards,
Mark
str UDFBATyearbracket(bat *ret, const bat *bid) { BAT *b, *bn; BUN i,n; const date *t;
if ((b = BATdescriptor(*bid)) == NULL) throw(MAL, "UDF.BATyearbracket", "Cannot access descriptor"); n = BATcount(b);
bn = COLnew(b->hseqbase, TYPE_str, BATcount(b), TRANSIENT); if (bn == NULL) { BBPunfix(b->batCacheid); throw(MAL, "UDF.BATyearbracket", "memory allocation failure"); } bn->tnonil = 1; bn->tnil = 0;
t = (const date *) Tloc(b, 0); for (i = 0; i < n; i++) { if (*t == date_nil) { BUNappend(bn, str_nil, FALSE); bn->tnonil = 0; bn->tnil = 1; } else { char* ret; UDFyearbracket(&ret, t); BUNappend(bn, ret, FALSE); } t++; }
BATsetcount(bn, n);
bn->tsorted = BATcount(bn)<2; bn->trevsorted = BATcount(bn)<2;
BBPkeepref(*ret = bn->batCacheid); BBPunfix(b->batCacheid); return MAL_SUCCEED; }
----- Original Message ----- From: "Mark Raasveldt" <m.raasveldt@cwi.nl <mailto:m.raasveldt@cwi.nl>> To: "users-list" <users-list@monetdb.org <mailto:users-list@monetdb.org>> Sent: Monday, January 2, 2017 4:32:32 PM Subject: Re: C UDF
Hey Imad,
One of the nice things about UTF-8 is that normal ASCII characters are valid UTF-8. Hence “normal strings” in C are already valid UTF-8. Try simply returning the output from sprintf, like this:
str UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        int year;
        char *buf;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);
        *ret = buf;
    }
    return MAL_SUCCEED;
}
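As a small standalone check of the point that plain ASCII output is already valid UTF-8, the sketch below formats a year and verifies that no byte has its high bit set, which is the same bit test that appears in the assertion message. The helper names is_ascii and format_year are made up for this example and are not part of MonetDB; snprintf is used instead of sprintf only to bound the write.

```c
#include <stddef.h>
#include <stdio.h>

/* Returns 1 if every byte of s has its high bit clear (pure ASCII). */
int is_ascii(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i++) {
        if (((unsigned char) s[i] & 0x80) != 0)
            return 0;
    }
    return 1;
}

/* Hypothetical helper: writes the decimal year into buf.
 * Returns 1 if the text fit completely, 0 otherwise. */
int format_year(char *buf, size_t buflen, int year)
{
    int n = snprintf(buf, buflen, "%d", year);
    return n >= 0 && (size_t) n < buflen;
}
```

Digits and the minus sign are all below 0x80, so a string produced this way passes the check without any iconv conversion.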
Regards,
Mark
On 29 Dec 2016, at 14:35, imad hajj chahine <imad.hajj.chahine@gmail.com> wrote:
Hi Sjoerd,
I tried to use iconv with no luck; I always get an empty string. I assumed that the output I am getting from sprintf is encoded in "ISO-8859-1". Can you please take a look at the following implementation:
str UDFyearbracket(str *ret, const date *v)
{
    if (*v == date_nil) {
        *ret = GDKstrdup(str_nil);
    } else {
        iconv_t cv = iconv_open("UTF-8", "ISO-8859-1");
        int factor = 4;
        size_t fromlen, tolen;

        int year;
        char *buf;
        char *retChar = (char *)*ret;
        fromdate(*v, NULL, NULL, &year);
        buf = (char *) GDKmalloc(15);
        sprintf(buf, "%d", year);

        fromlen = strlen(buf);
        tolen = factor * fromlen + 1;
        retChar = (char *) GDKmalloc(tolen);
        iconv(cv, &buf, &fromlen, &retChar, &tolen);
        iconv_close(cv);
    }
    return MAL_SUCCEED;
}
Thanks
_______________________________________________
users-list mailing list
users-list@monetdb.org
https://www.monetdb.org/mailman/listinfo/users-list

Mark,

Does the same apply to integer values? That is, is it better to declare the int variable once and reuse it in every iteration? Do I have a performance issue with the following code:

str UDFyearlag(int *ret, const date *v1, const date *v2)
{
    if (*v1 == date_nil || *v2 == date_nil) {
        *ret = int_nil;
    } else {
        int y1 = 0, y2 = 0;
        fromdate(*v1, NULL, NULL, &y1);
        fromdate(*v2, NULL, NULL, &y2);
        *ret = y2 - y1;
    }
    return MAL_SUCCEED;
}

str UDFBATyearlag(bat *ret, const bat *bid1, const bat *bid2)
{
    BAT *b1, *b2, *bn;
    BATiter bi1, bi2;
    BUN i, n;

    b1 = BATdescriptor(*bid1);
    b2 = BATdescriptor(*bid2);
    if (b1 == NULL || b2 == NULL) {
        if (b1)
            BBPunfix(b1->batCacheid);
        if (b2)
            BBPunfix(b2->batCacheid);
        throw(MAL, "UDF.BATyearlag", "Cannot access descriptor");
    }
    n = BATcount(b1);

    bn = COLnew(b1->hseqbase, TYPE_int, BATcount(b1), TRANSIENT);
    if (bn == NULL) {
        BBPunfix(b1->batCacheid);
        BBPunfix(b2->batCacheid);
        throw(MAL, "UDF.BATyearlag", "memory allocation failure");
    }

    bi1 = bat_iterator(b1);
    bi2 = bat_iterator(b2);
    BATloop(b1, i, n) {
        int y;
        const date *t1 = (const date *) BUNtail(bi1, i);
        const date *t2 = (const date *) BUNtail(bi2, i);
        if (*t1 == date_nil || *t2 == date_nil) {
            y = int_nil;
        } else
            UDFyearlag(&y, t1, t2);
        if (BUNappend(bn, &y, 0) != GDK_SUCCEED) {
            goto bailout;
        }
    }

    BBPkeepref(*ret = bn->batCacheid);
    BBPunfix(b1->batCacheid);
    BBPunfix(b2->batCacheid);
    return MAL_SUCCEED;

bailout:
    BBPunfix(b1->batCacheid);
    BBPunfix(b2->batCacheid);
    BBPunfix(bn->batCacheid);
    throw(MAL, "UDF.BATyearlag", MAL_MALLOC_FAIL);
}

Hey Imad,

No, it does not apply to integers because you are not doing any heap allocation.

Mark
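A rough standalone sketch of the distinction, using plain malloc instead of GDKmalloc and made-up function names (year_to_heap_string, year_lag, sum_year_lags are illustrative, not MonetDB functions): the string version has to allocate a buffer for every value, while the integer version writes into a variable on the caller's stack, so declaring that variable inside the loop costs nothing.

```c
#include <stdio.h>
#include <stdlib.h>

/* Heap version: one malloc per call; the caller must free() the result. */
char *year_to_heap_string(int year)
{
    char *buf = malloc(15);
    if (buf != NULL)
        snprintf(buf, 15, "%d", year);
    return buf;
}

/* Integer version: the result is written into the caller's variable. */
void year_lag(int *ret, int y1, int y2)
{
    *ret = y2 - y1;
}

/* Sums the lags of n year pairs. The loop-local int is a stack slot,
 * so no allocation or release happens per iteration. */
long sum_year_lags(const int *from, const int *to, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        int y;
        year_lag(&y, from[i], to[i]);
        total += y;
    }
    return total;
}
```

Only the heap version has a per-row allocation that could be worth hoisting out of a loop or replacing with a reused buffer.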

Thank you Mark,

I should refresh my memory of C programming; I never thought I would code in C again.

Hi Mark,

Is it possible to return a table or multiple columns from a C function? Is there an example or an existing function in the code that I can check?

Also, is it possible to use a C function to expand table rows instead of joining to a calendar table? What I am trying to do is expand a row and return one entry for each month bracket between startdate and enddate.

Thanks
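For reference, the row-expansion logic described in the question can be sketched in plain C, independent of MonetDB. This does not show how to return a table from a MonetDB C function; it only illustrates enumerating one entry per month between two dates. The YearMonth type and both function names are assumptions made for this example, and the arithmetic assumes positive years and months in the range 1 to 12.

```c
#include <stddef.h>

/* Hypothetical value type for this sketch; month is 1..12. */
typedef struct {
    int year;
    int month;
} YearMonth;

/* Number of month brackets from start to end, inclusive.
 * Returns 0 when end lies before start. */
size_t month_span(YearMonth start, YearMonth end)
{
    int s = start.year * 12 + (start.month - 1);
    int e = end.year * 12 + (end.month - 1);
    return e < s ? 0 : (size_t) (e - s) + 1;
}

/* Writes at most max entries into out, one per month bracket,
 * and returns the number of entries written. */
size_t expand_months(YearMonth start, YearMonth end,
                     YearMonth *out, size_t max)
{
    size_t n = month_span(start, end);
    int idx = start.year * 12 + (start.month - 1);

    if (n > max)
        n = max;
    for (size_t i = 0; i < n; i++, idx++) {
        out[i].year = idx / 12;
        out[i].month = idx % 12 + 1;
    }
    return n;
}
```

Calling month_span first gives the number of output rows, which is the kind of count one would need before allocating result columns of the right size.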

The error indicates that your string is not UTF-8 encoded.

-- Sjoerd Mullender
participants (4): Alberto Ferrari, imad hajj chahine, Mark Raasveldt, Sjoerd Mullender