monetdb status health

My questions are simple: what causes crashes? What is health? How do we stop health from degrading?

Following is the status of my db:

msearch_stats_db:
  database name: msearch_stats_db
  state: running
  locked: no
  scenarios: mal sql msql
  start count: 140
  stop count: 1
  crash count: 138
  current uptime: 1m 49s
  average uptime: 15m 33s
  maximum uptime: 15m 33s
  minimum uptime: 15m 33s
  last start with crash: 2013-02-10 17:32:36
  last start: 2013-02-10 17:32:47
  last stop: 2013-02-06 11:27:15
  average of crashes in the last start attempt: 0
  average of crashes in the last 10 start attempts: 0.90
  average of crashes in the last 30 start attempts: 0.97

Regards,
Tapomay.
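The status fields quoted above are plain "key: value" pairs, so they are easy to pull into a script for monitoring. A minimal sketch in Python, assuming the output is available as text in the layout shown in this message; the function name parse_status is hypothetical and not part of any MonetDB tool:

```python
def parse_status(text):
    """Parse 'key: value' lines into a dict; fields ending in 'count' become ints."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue  # skip blank lines and the 'dbname:' header line
        fields[key] = int(value) if key.endswith("count") else value
    return fields


# Excerpt of the status output quoted in this thread.
sample = """\
msearch_stats_db:
  database name: msearch_stats_db
  start count: 140
  stop count: 1
  crash count: 138
  last start: 2013-02-10 17:32:47
"""
status = parse_status(sample)
print(status["start count"], status["stop count"], status["crash count"])
```

Splitting on the first ": " only keeps timestamps such as "17:32:47" intact in the value.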

On 10-02-2013 09:44:09 -0800, Tapomay Dey wrote:
> My questions are simple:
> what causes crashes?

The mserver5 (MonetDB database) terminates in such a way that it cannot be considered a clean shutdown. This is usually the case when the program gets terminated due to a condition that makes further execution impossible, e.g. memory faults. These are almost always program errors.

> what is health?

Health is the percentage of start-stop sequences compared to the number of times the database was actually started, i.e. how many times a start was followed by a clean shutdown (hence no crash).

> how do we stop health from degrading?

You can't. A database that crashes, and keeps on doing so, will cause the health of the database to degrade.

> Following is the status of my db-
> start count: 140
> stop count: 1
> crash count: 138

So, essentially, every time you start your database, it never reaches a point where you stop it cleanly; instead, your database crashes all the time.

--
Fabian Groffen
fabian@monetdb.org
column-store pioneer
http://www.monetdb.org/Home
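The definition of health given in this reply can be turned into a quick calculation. The sketch below follows that description literally (clean stops divided by starts). The exact formula and rounding used by the monetdb tool are not stated in this thread, so treat it as an illustration only; the name health_percent is hypothetical.

```python
def health_percent(start_count, stop_count):
    """Share of starts that were followed by a clean stop, as a percentage."""
    if start_count == 0:
        # Assumption: a database that was never started counts as fully healthy.
        return 100.0
    return 100.0 * stop_count / start_count


# Counts taken from the status output quoted in this thread.
print(f"{health_percent(140, 1):.2f}%")
```

With 140 starts and a single clean stop the result is below 1%, which is in line with the very low health reported later in the thread.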

So it is a statistic that does not necessarily mean there is data corruption?

_______________________________________________
users-list mailing list
users-list@monetdb.org
http://mail.monetdb.org/mailman/listinfo/users-list

On 2/11/13 4:54 AM, Brandon Jackson wrote:
> So it is a statistic that does not necessarily mean there is data corruption?

Yes, but if your crash statistics grow with each server restart, you should consider the possibility of a severely corrupted database. This could be caused by hardware failures, out-of-memory (OOM) exception handling by the OS, software bugs, ...

Thanks a lot. But since the time I asked the question, the DB has gone into a state where it keeps logging

2013-02-11 04:12:40 ERR merovingian[15380]: client error: database 'msearch_stats_db' has crashed after starting, manual intervention needed, check monetdbd's logfile for details

in merovingian.log. Health is 1%. What can I do at this stage?

Thanks and Regards,
Tapomay.

1. BTW, I am running a non-released revision of the Feb13 branch. Could this be the reason for such a crash? I am doing so because I need a fix that Niels made for a concurrency issue that caused duplicate keys. I am also planning to implement a group_concat UDF as per the changed semantics of Feb13. I already have a partially running one for Oct12.

2. As the DB crashes each time I try to start it, I think it is in a perfect state to gather more diagnostics. How do I do so? I really need the DB never to reach a non-recoverable state. My setup is such that there would be non-stop inserts/updates into the DB 24/7.

Thanks and Regards,
Tapomay.

Dear Tapomay,

Taking the non-released revision is indeed "living on the edge".

On 2/11/13 6:44 AM, Tapomay Dey wrote:
> 1. BTW I am running a non-released revision of Feb13 branch. Could this be the reason for such a crash? I am doing so coz I need a fix that Niels had made for fixing a concurrency issue that caused duplicate keys. Also planning to implement group_concat UDF as per the changed semantics of Feb13. I already have a partially running one for Oct12.

There are a few known cases where it may crash; they are being worked on. You can find these cases in the testweb. In general, you would be the unlucky guy if you hit them immediately. They seem rare.

> 2. As the DB crashes each time I try to start it I think its a perfect state to gather more diagnostics. How do I do so? I really need that a DB never reaches a non-recoverable state.

If it never passes the initialization phase after restart, it is most likely a corrupted database. This could happen as a result of a hardware failure, or an unknown software error that caused a crash. It may be your UDF that went haywire and caused the system to fail. If it crashes without your UDF, then a run of the mserver using gdb may provide a hint on the whereabouts (see the calling sequence in merovingian.log to start mserver directly).

My approach would now be:
1) restore the database from backup (or a small testdb)
2) ensure it is working correctly without your UDF
3) prepare test cases for your UDF
4) add your UDF
5) start/stop after the first few calls of code with the UDF to observe behavior.

Success,
Martin

Tapomay,

in addition to what Martin suggests, please also consider checking the merovingian log for all details, in particular any server error messages.

To understand the cause of the problem, it is crucial to know the exact error messages, in fact the exact sequence of events that led to the current situation. So, what did you do when (or just before) the server first crashed on your database? What did you do then? Each step (and its outcome, including both error messages on the client and server errors in the merovingian log) is important.

If you use a genuine (i.e., non-modified) Feb2013 code base, the problem (obviously) exists in that code. If you modified the code locally (e.g., by adding UDFs), the problem might also be in your code.

You might also want to consider building a debug version (i.e., configured with --disable-optimize --enable-debug --enable-assert), starting mserver5 by hand on your database using the exact command line as given in the merovingian log (possibly also in a debugger), and seeing where (and why) the crash occurs.

Best,
Stefan

--
| Stefan.Manegold@CWI.nl | DB Architectures (DA) |
| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |

Great info. Thanks a lot. I am going to keep my db in this state. I will be able to perform your suggestions tomorrow. Until then, this is what I see in the logs when I do a fresh monetdb start:

2013-02-11 10:49:46 MSG merovingian[31990]: database 'msearch_stats_db' (32018) was killed by signal SIGSEGV
2013-02-11 10:49:46 ERR control[31990]: (local): failed to fork mserver: database 'msearch_stats_db' has crashed after starting, manual intervention ...

FYI, I have not included my UDF. I have done a basic configure, make, make install with no modifications or extra options on Ubuntu 12.10 64-bit (mercurial changeset: 46861:45c89b2e2ac2 Wed Feb 06 11:42:37).

Usage profile:
There is a constant transactional insert/update load of the order of 100 update attempts per second.
There is a "select 1;" fired every 10 seconds to check if the DB is alive.
There has been no significant select load yet (we are still loading historical data into the db).

Thanks and Regards,
Tapomay.

After analysing the logs I see the following as a lone single ERR statement different from rest of the repeating ERR logs stating (crashed, manual intervention needed): 2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed. I also see "ERR merovingian[16831]: client error: unknown or impossible state: 4" in the later stages. Thanks and Regards, Tapomay. ________________________________ From: Tapomay Dey <tapomay@yahoo.com> To: Communication channel for MonetDB users <users-list@monetdb.org> Sent: Monday, February 11, 2013 4:46 PM Subject: Re: monetdb status health Great info. Thanks a lot. I am going to keep my db in this state. I will be able to perform your suggestions tomorrow. Until then this is what I see in the logs when I do a fresh monetdb start: 2013-02-11 10:49:46 MSG merovingian[31990]: database 'msearch_stats_db' (32018) was killed by signal SIGSEGV 2013-02-11 10:49:46 ERR control[31990]: (local): failed to fork mserver: database 'msearch_stats_db' has crashed after starting, manual intervention ... FYI I have not included my UDF. I have done a basic configure-make-make install with no modifications/extra options on Ubuntu 12.10 64 bit.(mercurial changeset: 46861:45c89b2e2ac2 Wed Feb 06 11:42:37) Usage profile: There is a constant transactional insert/update load of the order of 100 update attempts per second. There is a "select 1;" fired every 10 seconds to check if DB is alive. There has been no significant select load yet(we are still loading historical data into the db). Thanks and Regards, Tapomay. ________________________________ From: Stefan Manegold <Stefan.Manegold@cwi.nl> To: Communication channel for MonetDB users <users-list@monetdb.org> Sent: Monday, February 11, 2013 12:17 PM Subject: Re: monetdb status health Tapomay, in addition to what Martin suggests, please consider also checking the merovingian log for all details, in particular any server error messages. 
To understand the cause of the problem, it is crucial to know the exact error messages, in fact the exact sequence of events that led to the current situation. So, what did you do when (or just before) the server crashed first on your database? What did you do then? Each step (and its outcome including both error messages on the client and server errors in the merovingian log) is important. If you use a genuine (i.e., non-modified) Feb2013 code base, the problem (obviously) exists in that code. If you modified the code locally (e.g., by adding UDFs), the problem might also be in your code. You might also what to consider building a debug version (i.e., configured with --disable-optimize --enable-debug --enable-assert), start mserver5 by hand on you database using the exact command line as given in the merovingian log (possibly also in a debugger), and see where (and why?) the crash occurs. Best, Stefan ----- Original Message -----
Dear Tapomay,
Taking the non-released revision is indeed "living on the edge".
1. BTW I am running a non-released revision of Feb13 branch. Could this be the reason for such a crash? I am doing so coz I need a fix that Niels had made for fixing a concurrency issue that caused duplicate keys. Also planning to implement group_concat UDF as per the changed semantics of Feb13. I already have a partially running one for Oct12. There are a few cases known where it may crash, it is worked upon. In the testweb you can find the few cases. In general, you would be
On 2/11/13 6:44 AM, Tapomay Dey wrote: the unlucky guy if you hit on them immediately. They seem rare.
2. As the DB crashes each time I try to start it I think its a perfect state to gather more diagnostics. How do I do so? I really need that a DB never reaches a non-recoverable state.
If it never passes the initialization phase after restart, it is most likely a corrupted database. This could happen as a result of a hardware failure, or an unknown error software error that caused a crash. It may be your UDF that went haywire and caused the system to loose. If it crashes without your UDF, then a run of the mserver using gdb may provide a hint on the whereabouts (see calling sequence in meriovingian.log to start mserver directly)
My approach would now be: 1) restore database from backup (or a small testdb) 2) ensure it is working correctly without your UDF 3) prepare test cases for your UDF 4) add your UDF 5) start/stop after the first few calls of code with UDF to observe behavior.
Success, Martin
My setup is such that there would be non-stop Inserts/updates into the DB 24/7.
Thanks and Regards, Tapomay.
------------------------------------------------------------------------ *From:* Tapomay Dey <tapomay@yahoo.com> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 10:47 AM *Subject:* Re: monetdb status health
Thanks a lot. But since the time I asked the question the DB has gone into a state where it keeps logging 2013-02-11 04:12:40 ERR merovingian[15380]: client error: database 'msearch_stats_db' has crashed after starting, manual intervention needed, check monetdbd's logfile for details
in merovingian.log.
Health is 1%.
What can I do at this stage?
Thanks and Regards, Tapomay.
--
| Stefan.Manegold@CWI.nl | DB Architectures (DA) |
| www.CWI.nl/~manegold/ | Science Park 123 (L321) |
| +31 (0)20 592-4212 | 1098 XG Amsterdam (NL) |
_______________________________________________ users-list mailing list users-list@monetdb.org http://mail.monetdb.org/mailman/listinfo/users-list

Dear Tapomay,
The error message is quite specific and certainly calls for the SQL schema/query to analyse. You might send me the MAL plan of that query (using the EXPLAIN command at SQL level).
regards, Martin
On 2/11/13 3:53 PM, Tapomay Dey wrote:
After analysing the logs I see the following lone ERR statement, different from the rest of the repeating ERR entries (crashed, manual intervention needed):
2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed.
I also see "ERR merovingian[16831]: client error: unknown or impossible state: 4" in the later stages.
Thanks and Regards, Tapomay.
------------------------------------------------------------------------ *From:* Tapomay Dey <tapomay@yahoo.com> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 4:46 PM *Subject:* Re: monetdb status health
Great info. Thanks a lot. I am going to keep my db in this state. I will be able to perform your suggestions tomorrow. Until then this is what I see in the logs when I do a fresh monetdb start:
2013-02-11 10:49:46 MSG merovingian[31990]: database 'msearch_stats_db' (32018) was killed by signal SIGSEGV
2013-02-11 10:49:46 ERR control[31990]: (local): failed to fork mserver: database 'msearch_stats_db' has crashed after starting, manual intervention ...
FYI I have not included my UDF. I have done a basic configure-make-make install with no modifications/extra options on Ubuntu 12.10 64-bit (mercurial changeset: 46861:45c89b2e2ac2 Wed Feb 06 11:42:37).
Usage profile: There is a constant transactional insert/update load of the order of 100 update attempts per second. There is a "select 1;" fired every 10 seconds to check if the DB is alive. There has been no significant select load yet (we are still loading historical data into the db).
Thanks and Regards, Tapomay.
------------------------------------------------------------------------ *From:* Stefan Manegold <Stefan.Manegold@cwi.nl> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 12:17 PM *Subject:* Re: monetdb status health
Tapomay,
in addition to what Martin suggests, please consider also checking the merovingian log for all details, in particular any server error messages. To understand the cause of the problem, it is crucial to know the exact error messages, in fact the exact sequence of events that led to the current situation. So, what did you do when (or just before) the server crashed first on your database? What did you do then? Each step (and its outcome including both error messages on the client and server errors in the merovingian log) is important.
If you use a genuine (i.e., non-modified) Feb2013 code base, the problem (obviously) exists in that code. If you modified the code locally (e.g., by adding UDFs), the problem might also be in your code.
You might also want to consider building a debug version (i.e., configured with --disable-optimize --enable-debug --enable-assert), start mserver5 by hand on your database using the exact command line as given in the merovingian log (possibly also in a debugger), and see where (and why?) the crash occurs.
Best, Stefan
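To follow Stefan's advice about reconstructing the exact sequence of server errors, a small script can pull the ERR entries out of merovingian.log. This is only a sketch: the line layout (timestamp, level, source[pid]: message) is inferred from the log lines quoted in this thread, and the names `LogEntry`, `parse_log` and `errors_only` are invented for the example.

```python
import re
from typing import Iterable, Iterator, List, NamedTuple


class LogEntry(NamedTuple):
    timestamp: str
    level: str
    source: str
    pid: int
    message: str


# Layout assumed from the lines quoted in this thread:
# "YYYY-MM-DD HH:MM:SS LEVEL source[pid]: message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) ([^\[\s]+)\[(\d+)\]: (.*)$"
)


def parse_log(lines: Iterable[str]) -> Iterator[LogEntry]:
    """Yield one LogEntry per line that matches the assumed layout."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            ts, level, source, pid, message = match.groups()
            yield LogEntry(ts, level, source, int(pid), message)


def errors_only(lines: Iterable[str]) -> List[LogEntry]:
    """Keep only the entries logged at level ERR."""
    return [entry for entry in parse_log(lines) if entry.level == "ERR"]


# Two lines copied from this thread, used as sample input.
SAMPLE = [
    "2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: "
    "mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed.",
    "2013-02-11 10:49:46 MSG merovingian[31990]: database "
    "'msearch_stats_db' (32018) was killed by signal SIGSEGV",
]

if __name__ == "__main__":
    for entry in errors_only(SAMPLE):
        print(entry.timestamp, entry.source, entry.message)
```

In practice the sample list would be replaced by the lines of the actual log file; its location depends on the local monetdbd setup.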

Attaching the schema and MAL plan.
Thanks and Regards, Tapomay.
________________________________
From: Martin Kersten <Martin.Kersten@cwi.nl> To: Communication channel for MonetDB users <users-list@monetdb.org> Sent: Monday, February 11, 2013 11:38 PM Subject: Re: monetdb status health
Dear Tapomay,
The error message is quite specific and certainly calls for the SQL schema/query to analyse. You might send me the MAL plan of that query (using the EXPLAIN command at SQL level).
regards, Martin
On 2/11/13 3:53 PM, Tapomay Dey wrote:
After analysing the logs I see the following as a lone single ERR statement different from rest of the repeating ERR logs stating (crashed, manual intervention needed): 2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed.
I also see "ERR merovingian[16831]: client error: unknown or impossible state: 4" in the later stages.
Thanks and Regards, Tapomay.

Dear Tapomay,
Thanks for the information. Running the create statement does not cause a problem on my machine.
Your error points to a situation where the OS does not allow spawning of a new process thread where MonetDB would expect this to succeed without problems. Namely, during the first call of parallel execution it creates up to gdk_nr_threads workers to stand by for processing. You indicate that the number is set to 8, which normally should not cause any problem. It is hard to tell what limits have been set in your OS environment or even your virtual OS environment. It typically occurs when the OS detects insufficient resources.
The specific code segment has been made a little more defensive by raising an explicit exception to catch the case.
regards, Martin
On 02/12/2013 08:27 AM, Tapomay Dey wrote:
Attaching the schema and MAL plan.
Thanks and Regards, Tapomay. ------------------------------------------------------------------------ *From:* Martin Kersten <Martin.Kersten@cwi.nl> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 11:38 PM *Subject:* Re: monetdb status health
Dear Tapomay,
The error message is quite specific and certainly calls for the SQL schema/query to analyse. You might send me the MAL plan of that query (using the EXPLAIN command at SQL level).
regards, Martin
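Martin's diagnosis in this exchange (the OS refusing to spawn worker threads) can be probed roughly from the account that runs the server. The sketch below is an assumption-laden check, not MonetDB code: it reads the per-user process/thread limit via Python's `resource` module (Unix only) and tries to start 8 idle threads, the worker count mentioned in the thread. The helper names are invented, and a Python process succeeding here does not prove that mserver5 will, since the server runs in its own environment.

```python
import resource
import threading
from typing import Tuple


def process_thread_limit() -> Tuple[int, int]:
    """Return the (soft, hard) RLIMIT_NPROC values for the current user.

    On Linux this limit also counts threads; resource.RLIM_INFINITY
    means no limit is set.
    """
    return resource.getrlimit(resource.RLIMIT_NPROC)


def can_spawn_threads(n: int) -> bool:
    """Try to hold n idle threads open at once; False if the OS refuses one."""
    release = threading.Event()
    started = []
    try:
        for _ in range(n):
            worker = threading.Thread(target=release.wait)
            worker.start()  # raises RuntimeError if no thread can be created
            started.append(worker)
        return True
    except RuntimeError:
        return False
    finally:
        # Let every thread that did start finish, then clean up.
        release.set()
        for worker in started:
            worker.join()


if __name__ == "__main__":
    soft, hard = process_thread_limit()
    print("RLIMIT_NPROC soft/hard:", soft, hard)
    print("can start 8 threads:", can_spawn_threads(8))
```

Running it as the same user, and inside the same virtual environment or container as monetdbd, gives the most relevant answer.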

Dear Tapomay,
Looking a little further into the error context, I was wondering if you spawned the mserver with a --set gdk_nr_threads=N where N > 1024?
regards, Martin
On 2/11/13 3:53 PM, Tapomay Dey wrote:
After analysing the logs I see the following as a lone single ERR statement different from rest of the repeating ERR logs stating (crashed, manual intervention needed): 2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed.
I also see "ERR merovingian[16831]: client error: unknown or impossible state: 4" in the later stages.
Thanks and Regards, Tapomay.
------------------------------------------------------------------------ *From:* Tapomay Dey <tapomay@yahoo.com> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 4:46 PM *Subject:* Re: monetdb status health
Great info. Thanks a lot. I am going to keep my db in this state. I will be able to perform your suggestions tomorrow. Until then, this is what I see in the logs when I do a fresh monetdb start:
2013-02-11 10:49:46 MSG merovingian[31990]: database 'msearch_stats_db' (32018) was killed by signal SIGSEGV
2013-02-11 10:49:46 ERR control[31990]: (local): failed to fork mserver: database 'msearch_stats_db' has crashed after starting, manual intervention ...
FYI, I have not included my UDF. I have done a basic configure, make, make install with no modifications or extra options on Ubuntu 12.10 64-bit (mercurial changeset: 46861:45c89b2e2ac2 Wed Feb 06 11:42:37).
Usage profile: there is a constant transactional insert/update load on the order of 100 update attempts per second. A "select 1;" is fired every 10 seconds to check whether the DB is alive. There has been no significant select load yet (we are still loading historical data into the db).
Thanks and Regards, Tapomay.
------------------------------------------------------------------------ *From:* Stefan Manegold <Stefan.Manegold@cwi.nl> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 12:17 PM *Subject:* Re: monetdb status health
Tapomay,
in addition to what Martin suggests, please consider also checking the merovingian log for all details, in particular any server error messages. To understand the cause of the problem, it is crucial to know the exact error messages, in fact the exact sequence of events that led to the current situation. So, what did you do when (or just before) the server crashed first on your database? What did you do then? Each step (and its outcome including both error messages on the client and server errors in the merovingian log) is important.
If you use a genuine (i.e., non-modified) Feb2013 code base, the problem (obviously) exists in that code. If you modified the code locally (e.g., by adding UDFs), the problem might also be in your code.
You might also want to consider building a debug version (i.e., configured with --disable-optimize --enable-debug --enable-assert), starting mserver5 by hand on your database using the exact command line given in the merovingian log (possibly in a debugger), and seeing where (and why) the crash occurs.
Best, Stefan
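Stefan's advice to read the merovingian log for the exact error messages can be partly automated. The following is a minimal Python sketch and not part of MonetDB. It assumes the line layout seen in the log excerpts quoted in this thread (timestamp, level, source[pid], message), and the name summarize_errors is made up for illustration. It counts identical ERR messages and lists the rarest first, so that a one-off entry such as the failed assertion stands out from the repeated "crashed after starting" lines.

```python
import re
from collections import Counter

# Layout inferred from the log lines quoted in this thread (an assumption):
# "YYYY-MM-DD HH:MM:SS LEVEL source[pid]: message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<source>[^\[\s]+)\[(?P<pid>\d+)\]: "
    r"(?P<msg>.*)$"
)


def summarize_errors(lines):
    """Count ERR messages, ignoring timestamps and pids; rarest first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "ERR":
            counts[(match.group("source"), match.group("msg"))] += 1
    # sorted() is stable, so entries with equal counts keep log order.
    return sorted(counts.items(), key=lambda item: item[1])


if __name__ == "__main__":
    # Lines copied from this thread; the first one is repeated on purpose
    # to mimic the repeating entry described by Tapomay.
    repeated = (
        "2013-02-11 04:12:40 ERR merovingian[15380]: client error: database "
        "'msearch_stats_db' has crashed after starting, manual intervention "
        "needed, check monetdbd's logfile for details"
    )
    sample = [
        repeated,
        repeated,
        "2013-02-10 21:09:57 ERR msearch_stats_db[3073]: mserver5: "
        "mal_dataflow.c:587: runMALdataflow: Assertion `workers[0]' failed.",
        "2013-02-11 10:49:46 MSG merovingian[31990]: database "
        "'msearch_stats_db' (32018) was killed by signal SIGSEGV",
    ]
    for (source, msg), count in summarize_errors(sample):
        print(f"{count:4d}  {source}: {msg}")
```

To run it on a real log, pass the open file object instead of the sample list; lines that do not match the assumed layout are skipped.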
----- Original Message -----
Dear Tapomay,
Taking the non-released revision is indeed "living on the edge".
On 2/11/13 6:44 AM, Tapomay Dey wrote:
> 1. BTW I am running a non-released revision of the Feb13 branch. Could this be the reason for such a crash? I am doing so because I need a fix that Niels made for a concurrency issue that caused duplicate keys. I am also planning to implement a group_concat UDF as per the changed semantics of Feb13. I already have a partially running one for Oct12.
There are a few known cases where it may crash; they are being worked on. You can find them in the testweb. In general, you would be the unlucky guy if you hit them immediately. They seem rare.
> 2. As the DB crashes each time I try to start it, I think it is in a perfect state to gather more diagnostics. How do I do so? I really need the DB never to reach a non-recoverable state.
If it never passes the initialization phase after restart, it is most likely a corrupted database. This could happen as a result of a hardware failure, or an unknown software error that caused a crash. It may be your UDF that went haywire and caused the system to fail. If it crashes without your UDF, then a run of the mserver under gdb may provide a hint on the whereabouts (see the calling sequence in merovingian.log to start mserver directly).
My approach would now be:
1) restore the database from backup (or use a small testdb)
2) ensure it is working correctly without your UDF
3) prepare test cases for your UDF
4) add your UDF
5) start/stop after the first few calls of code with the UDF to observe behavior.
Success, Martin
> My setup is such that there would be non-stop inserts/updates into the DB 24/7.
*From:* Tapomay Dey <tapomay@yahoo.com> *To:* Communication channel for MonetDB users <users-list@monetdb.org> *Sent:* Monday, February 11, 2013 10:47 AM *Subject:* Re: monetdb status health
Thanks a lot. But since the time I asked the question, the DB has gone into a state where it keeps logging the following in merovingian.log:
2013-02-11 04:12:40 ERR merovingian[15380]: client error: database 'msearch_stats_db' has crashed after starting, manual intervention needed, check monetdbd's logfile for details
Health is 1%.
What can I do at this stage?
Thanks and Regards, Tapomay.
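The 1% figure is consistent with Fabian's definition of health earlier in the thread (clean start-stop sequences as a percentage of starts) applied to the reported counters, start count 140 and stop count 1. The sketch below only reproduces that arithmetic. The function name health_percent and the rounding to a whole percent are assumptions made for illustration; this is not MonetDB's actual code.

```python
def health_percent(start_count, stop_count):
    """Clean stops as a percentage of starts, rounded to a whole number.

    The rounding rule is an assumption; the thread only gives the
    definition and the reported value.
    """
    if start_count <= 0:
        raise ValueError("start_count must be positive")
    return round(100 * stop_count / start_count)


if __name__ == "__main__":
    # Counters from the status output quoted in the first message.
    print(health_percent(start_count=140, stop_count=1))  # 100 * 1 / 140 is about 0.71, so this prints 1
```

By this definition the value can only recover if future starts are followed by clean stops, which matches Fabian's remark that a database that keeps crashing will keep degrading.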

I am using "monetdb start {DBname}" to start the DB. The following is logged in merovingian.log; it uses --set gdk_nr_threads=8:
/usr/local/bin/mserver5 --dbpath=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db --set merovingian_uri=mapi:monetdb://monet-db-1.hzdc15.sokrati.com:50000/msearch_stats_db --set mapi_open=false --set mapi_port=0 --set mapi_usock=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db/.mapi.sock --set monet_vault_key=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db/.vaultkey --set gdk_nr_threads=8 --set max_clients=128 --set sql_optimizer=default_pipe --set monet_daemon=yes
Thanks and Regards, Tapomay.
________________________________ From: Martin Kersten <Martin.Kersten@cwi.nl> To: Communication channel for MonetDB users <users-list@monetdb.org> Sent: Tuesday, February 12, 2013 12:00 AM Subject: Re: monetdb status health
> Dear Tapomay, Looking a little further into the error context, I was wondering if you spawned the mserver with a --set gdk_nr_threads=N where N > 1024? regards, Martin
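The command line that Tapomay quotes from merovingian.log answers Martin's question about gdk_nr_threads directly. As a minimal sketch, assuming the command line has been copied out of the log as a single string, the "--set key=value" pairs can be collected and the thread count compared with the 1024 mentioned by Martin. The helper name get_set_options is made up for illustration and is not a MonetDB tool.

```python
import shlex


def get_set_options(command_line):
    """Collect '--set key=value' pairs from an mserver5 command line."""
    tokens = shlex.split(command_line)
    options = {}
    for index, token in enumerate(tokens[:-1]):
        following = tokens[index + 1]
        if token == "--set" and "=" in following:
            key, value = following.split("=", 1)
            options[key] = value
    return options


# Command line as quoted from merovingian.log in this thread.
CMD = (
    "/usr/local/bin/mserver5 "
    "--dbpath=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db "
    "--set merovingian_uri=mapi:monetdb://monet-db-1.hzdc15.sokrati.com:50000/msearch_stats_db "
    "--set mapi_open=false --set mapi_port=0 "
    "--set mapi_usock=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db/.mapi.sock "
    "--set monet_vault_key=/usr/local/monetdb_home/stats_db_farm/msearch_stats_db/.vaultkey "
    "--set gdk_nr_threads=8 --set max_clients=128 "
    "--set sql_optimizer=default_pipe --set monet_daemon=yes"
)

if __name__ == "__main__":
    threads = int(get_set_options(CMD)["gdk_nr_threads"])
    print(threads, threads > 1024)  # prints: 8 False
```

With gdk_nr_threads=8 the N > 1024 condition does not apply here, so this setting does not by itself explain the failed `workers[0]' assertion.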
participants (6)
- Brandon Jackson
- Fabian Groffen
- Martin Kersten
- Martin Kersten
- Stefan Manegold
- Tapomay Dey