Thursday, January 31, 2008

SQL proposal: Select deleted and updated records

Here's another proposal for an addition/extension to SQL. In the earlier post I said that it should be possible to get feedback from delete- and update-statements. Here's an even better proposal that adds some syntax to SQL and allows you to fetch the records that have been updated or deleted.
Here's what it should look like:

delete from customers where marked_as_deleted='true' select id, name

Executing this behaves like a normal select-statement, but the result set contains the id and the name of the customers you have just deleted. With this it is possible to create a message like
The customers "Tom", "Bill" and "Lara" have been successfully deleted.

This is a good idea because it gives the user more feedback on the operation that has just been completed. With update-statements it's very similar:
update customers set marked_as_deleted='true' where last_action < '01.01.1998' select id, name

A possible user-feedback would be
The customers "Tom", "Bill" and "Lara" have just been marked for deletion, because they are too old.

You can simulate this, of course, by first selecting the customers and then running an update/delete with a) the same where-clause (which is unsafe, because the data might have been changed in the meantime) or b) one statement per fetched id, which is slower because it doesn't let the server handle the whole operation on its own.
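
For illustration, here is roughly what the client-side simulation (approach a) looks like today. The db_* calls are made-up stand-ins for whatever client-library you use, not a real API:

      /* Approach a), sketched with a hypothetical db_* client-API: run the
         select and the delete with the same where-clause inside one
         transaction. */
      db_begin(conn);
      db_result *rows = db_query(conn, "select id, name from customers"
                                       " where marked_as_deleted='true'");
      show_deleted_message(rows);  /* "Tom", "Bill" and "Lara"... */
      db_exec(conn, "delete from customers"
                    " where marked_as_deleted='true'");
      db_commit(conn);
      /* Unless the transaction provides a stable snapshot, matching
         rows may still change between the two statements. */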

Clean SQL-Server Client-API

Introduction


I often stumble upon limitations or awkwardness when dealing with the Firebird SQL-server. These are the big points which you can't fix in your client-application, because the API and some design choices do not allow it:

  • There's no progress feedback except when fetching datasets. Updating and deleting, however, happen in just one (potentially huge) blocking call.

  • There's no clean way to keep my datasets current. When they've been changed by a different (committed) transaction, I don't get any kind of notification.

  • Buffering datasets is a tremendously hard task to program if you want to do it efficiently (both speed- and memory-wise).

  • Multithreading as an optimization is practically a no-go, because you need to establish a new connection for each thread. If you want to do intelligent multithreading with Firebird, the task becomes exponentially more difficult and error-prone.



Requirements


So here are the requirements I would expect from a well-designed client-API for Firebird, which would allow me to solve the aforementioned problems.

(Cumulative) progress feedback


This applies to update- and delete- (and maybe other) statements only; for select-statements it's not needed. The client-app should be notified about the progress of the operation it just executed. Otherwise, (unexpectedly) long-running executions may lead to the wrong conclusion that the server 'hangs' or that something else is going wrong. You should be able to create visual feedback on the client-side for a better end-user experience. I often find myself writing client-side loops instead of simple updates with where-clauses for the sole reason of providing a progress-bar to the end user, since that is a strong usability-requirement of good UI design. Of course you can't know how many datasets are still to come, but you should at least know how many datasets have been processed.
There are two ways in which I can imagine the Firebird-API to support this:
a) provide a call-back function to the execute-function:
      int execute_progress_info(int datasets, void *data) {
          /* 'datasets' is the number of datasets processed so far,
             'data' is a pointer supplied by the client-app. */
          if (time_elapsed() > 10000) /* e.g. give up after 10 seconds */
              return 0;               /* tell execute to abort */
          return 1;                   /* keep going */
      }

If the call-back function returns 0, execute aborts its operation (if possible).
b) make the client call the execute-function repeatedly and have it return the number of processed datasets. If the execute-function returns 0, the client stops calling it:

      int datasets = 0, temp;
      while ((temp = sql_server_execute(statement)) > 0) {
          datasets += temp;
          printf("%d datasets processed.\n", datasets);
          /* possibly process some window- or system-messages here */
      }


Integrated SQL-parser and editor


I often see the requirement for the client-application (especially for client-library wrappers) to parse and understand the SQL-query. For example, the IBObjects Delphi library does this to modify the SQL-string. Currently, much of the SQL-parsing logic is re-implemented by IBObjects, while it should be accessible through the SQL-server's API to remove code-duplication and incompatibilities.
It should be possible to create a program-readable SQL-'tree' from SQL-code and vice-versa. By SQL-'tree' I mean some kind of data-structure that represents the SQL-code in-memory, much like what it looks like when it is compiled by the server. This in-memory structure could then be modified by the program in an error-proof and logical fashion (e.g. add or remove parts of the selected fields, or add an expression to the 'order by'-clause) and converted back to SQL-code or directly executed at API-level. This could be used for SQL-editors that have rich completion-features like today's IDEs (Visual Studio, NetBeans, etc.). There are SQL-editors that provide those, but they duplicate the SQL-parsing-code, too, and are likely to break with unknown or new syntax. I will provide another good example for this later.
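
To make this concrete, here is a sketch of what such a tree-API could look like; every function name in it is an assumption, invented for illustration:

      /* All names here are invented; nothing like this exists in the
         Firebird client-API today. */
      sql_tree *tree = sql_parse("select id, name from customers");
      sql_tree_add_field(tree, "street");        /* extend the field list */
      sql_tree_add_order_by(tree, "name");       /* extend 'order by' */
      char *sql = sql_tree_to_string(tree);      /* back to SQL-code */
      sql_server_execute_tree(connection, tree); /* or execute directly */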

Partial exposure of SQL-subsystem functionality


What I mean by that is that I want to be able to check whether an existing in-memory dataset (that may or may not be in the database) matches a specific where-clause, or to order datasets that exist in the client's memory only by a given 'order by'-clause.
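
A hypothetical API for this could be as small as two functions (again, both names are invented for illustration):

      /* Check one in-memory dataset against a where-clause. */
      int matches = sql_match_where(dataset, "city='New York'");
      /* Sort an array of client-side datasets by an 'order by'-clause. */
      sql_sort_datasets(datasets, count, "order by name, city desc");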

Event-notification systems with additional information


The event-mechanism as it is right now is kinda useless for many purposes I'd like to use it for. Its design is very simple, but this limits it to a very narrow range of applicable use-cases. For example, when monitoring a dataset for server-side changes, I can currently do the following: I have a base-name for the event, which ideally corresponds to the table's name. Let's say I want to monitor the datasets in the table 'customers' that I've partially fetched into a list. To monitor changes to each existing dataset, I would need two different events: customers_del_$ID and customers_upd_$ID, where $ID is replaced with the dataset's ID. I have two triggers on the table which fire off the corresponding events like this:
      create trigger customers_del_event for customers active after delete as
      begin
          post_event 'customers_del_'||old.id;
      end^


and

      create trigger customers_upd_event for customers active after update as
      begin
          post_event 'customers_upd_'||old.id;
      end^


I then register for two events per dataset to monitor changes and deletions. When the upd-event fires, I can re-fetch the dataset by ID, and when the del-event fires, I can simply remove it from the list. But there's one caveat: when the ID of the dataset changes, I have no chance at all to notice that. There's no way the client can be notified about the ID-change by the server. Of course there are workarounds for this limitation, but none of the ones I can think of come with zero overhead.

Now what I would like to see is the ability to pass (arbitrary) parameters to the events. If it were possible to pass a 'new_id' parameter to the upd-event, all would be fine: I could simply re-fetch the dataset with its new ID. Or, even better, pass all the required fields to the event and eliminate the need to re-fetch the dataset altogether, because the client can update its data from the received event's parameters. Here's how I would imagine the syntax:

      create event customer_upd(
          id integer,
          name varchar(100),
          street varchar(100),
          city varchar(100))^
      /* When parameters for events are possible, the event needs to be declared first */

      create trigger customers_upd_event for customers active after update as
      begin
          post_event customer_upd(new.id, new.name, new.street, new.city);
      end^
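
On the client-side, consuming such a parameterized event could then look like the following sketch; the registration- and accessor-functions are assumptions, not existing API-calls:

      /* Called by the client-library whenever 'customer_upd' arrives. */
      void on_customer_upd(db_event *ev, void *user_data) {
          int id           = db_event_get_int(ev, "id");
          const char *name = db_event_get_string(ev, "name");
          list_update_row(user_data, id, name); /* no re-fetch needed */
      }

      db_event_register(conn, "customer_upd", on_customer_upd, my_list);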


Thread-safe Client-API


Thread-safety for the client-API would be a real gain. The whole programming world is becoming more and more parallelized to increase the speed and responsiveness of programs. Even more so with dual-cores and quad-cores on desktop-PCs.
One possible approach that does not provide more speed, but would make it possible to access the same connection from multiple threads, would be to have one 'db-access-thread' automatically started by the client-API, which communicates with every client-API-function in a thread-safe way (e.g. via pipes or something similar). That way, each thread would still have to wait for the other threads' operations to finish, but it would become much easier to write responsive applications by putting database-operations that would otherwise block the GUI-thread into a separate thread, without the need to create a new connection (which would decrease performance, and, additionally, I really like to have only one connection per application). This can of course be implemented in the client-application (or 3rd-party client-library) itself, but I currently know of no implementation which bothers to do so. Having such neat features in the client-API makes Firebird a better product 'out of the box', IMHO.

Another idea is a completely asynchronous API, which would be the best solution: every API-call would be non-blocking, and an event would be delivered to the application when it finishes. Asynchronous handling should always be preferred to threaded handling when it comes to networking, since it usually scales better and puts less stress on the client. On the other hand, it's usually far more complicated to program.
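
Here is a rough sketch of the 'db-access-thread' idea: one worker-thread owns the single connection and pulls jobs from a thread-safe queue, so every other thread can submit statements without blocking the GUI. The queue, job and result types are assumptions for illustration:

      /* One job = one statement plus a completion-callback. */
      typedef struct {
          const char *sql;
          void (*done)(db_result *r, void *user_data);
          void *user_data;
      } db_job;

      static job_queue queue; /* any thread-safe, blocking FIFO */

      void *db_access_thread(void *arg) {
          db_connection *conn = arg; /* the one shared connection */
          for (;;) {
              db_job *job = job_queue_pop(&queue); /* blocks when empty */
              db_result *r = db_execute(conn, job->sql);
              job->done(r, job->user_data); /* e.g. post to the GUI-thread */
          }
          return NULL;
      }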

Dataset-buffer in Client-API


When programming lists that display datasets, you usually fetch the datasets from the database and then put them into some kind of buffer of your own. Programming that buffer can become a major task if the list is very large. There are techniques to speed up fetching such a list, for example what I call 'lazy' data-reading:
You first read in all the datasets' IDs and store them in a structure that looks like this:

      struct dataset {
          db_key id;
          struct dataset_content *content;
      };


content is initially NULL, and when the program tries to read the content for the first time, it is fetched from the database with a separate cursor from the one that fetched the IDs. The two SQL-queries would resemble something like this:

      select id from customers where city='New York' order by name; /* Fetch all the IDs */
      select name, street, city from customers where id=?; /* Fetch a specific dataset */


This 'lazy fetching' speeds up the creation of lists that *require* knowing the total size of the result-set, because fetching the IDs is a few times faster than fetching the data, and fetching the data is delayed until it's displayed. Most lists require the total size of the result-set, because implementing the scrollbar-logic without that information is not possible in any usable fashion. Nearly all SQL-based programs have very bad handling when it comes to lists and scroll-bars.

Additionally, optimizing small-object-allocation is nearly a MUST in this situation, otherwise performance is likely to degrade to unusable levels. This is true for any list-buffer, though, and I think it should be part of the client-API so the application does not need to worry about it. As much optimization as possible should be included in the client-API, because it'll make Firebird look like a more stable and faster database than others in subjective comparisons. Again, application-developers are unlikely to write these kinds of optimizations themselves in today's age, because bigger hardware is cheaper than software-optimization (especially at the client-side). Combining this with a multi-threaded or asynchronous approach is the ideal solution.

I have created a proof-of-concept implementation with the techniques mentioned above, and I think I'll write a detailed description of the process and the results soon. IMHO the usability and responsiveness of what I've created is as good as it can possibly get.
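
The heart of such a lazy buffer is one accessor that hides whether the content has already been fetched; fetch_content_by_id is a made-up stand-in for the second ('where id=?') statement shown above:

      /* Return the dataset's content, fetching it on first access. */
      struct dataset_content *dataset_get_content(struct dataset *ds) {
          if (ds->content == NULL)
              ds->content = fetch_content_by_id(ds->id); /* 2nd select */
          return ds->content;
      }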

Creating a 'live' list


Without those features it's still possible to write a client-application that shows a 'live' list of datasets of a specific range in a specific order. By 'live' I mean that new, deleted and updated entries are visible immediately, in real-time, without any polling going on. This of course decreases network-traffic and at the same time increases speed. First, I'd like to describe the approach that I consider optimal for implementing this without the aforementioned desired features.

There are a few database-prerequisites that each table you want to watch 'live' has to meet: It needs an 'updated'-field which is set to the current datetime in 'before insert'- and 'before update'-triggers. Then it needs to post the events 'table_inserted', 'table_updated' and 'table_deleted' in the corresponding 'after'-triggers. The client-app, in turn, first selects the desired data and listens for the table_xxx events. Each time an _inserted- or _updated-event fires, it adds an 'and updated>=:last_update' to its original select-statement to fetch only the updated datasets. Existing datasets (identified by the ID) are updated accordingly, and new datasets can be added to the list. Since you don't know how the list might be sorted (after all, it's hidden in the select-statement, which may be user-provided), the best bet is to add those datasets to either the top or the bottom of the list. The 'if the ID changes, I'm out of luck'-problem applies here, too: in that case, the updated dataset with the new ID would be interpreted as a new dataset and I would see the same dataset twice in the list.
For best performance, you should include the 'lazy fetching'-optimization I described earlier: you first fetch the IDs only, then fetch each dataset as it is required. You need two separate selects for this. The first select is like the original, but with the fields replaced by the 'id'-field only. The second adds an 'and id=:id' to the where-clause, with all previous where-clauses put into parentheses so that no operator-precedence messes up the result. Those are both difficult tasks to program, because they require some effort to parse and modify the SQL-structure. The lazy fetching is difficult because, when done multi-threaded or asynchronously, it quickly becomes complex and confusing. Parsing the SQL-statement is very complex, too, since writing a bullet-proof, Firebird-compatible SQL-parser isn't an easy task. That's why there should be a 'cursor'-like object in the Firebird-API that provides optimal fetching- and buffering-means, and some functions to parse and modify an SQL-statement.
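
Putting the pieces together, the event-handler for the _inserted-/_updated-events could look roughly like this; db_query, merge_into_list and the handling of last_update are assumptions about the surrounding code:

      /* Fired on 'customers_inserted' or 'customers_updated'. */
      void on_customers_changed(void) {
          /* The original (possibly user-provided) where-clause goes into
             parentheses, the incremental filter is appended, IDs only: */
          db_result *ids = db_query(conn,
              "select id from customers where (city='New York')"
              " and updated >= ?", last_update);
          merge_into_list(ids);  /* update known IDs, append new ones */
          last_update = db_server_time(conn); /* remember for next time */
      }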

Wednesday, January 16, 2008

Got BUGs?!

You just have to check this out, it's extremely cool:
http://www.buglabs.net/products
Bug Labs manufactures and sells the BUG device, which is a small micro-controller in a very stylish housing. It has an ARM CPU at its core (no clock-speed mentioned) and 128MB of main memory. It features hardware graphics acceleration, MPEG4 decoding and encoding, WiFi, USB, Ethernet and a hell of a lot more interfaces.
If it weren't for the 300 bucks this BUG costs, I'd already own this world-dominating gadget thingie... or whatever.

Monday, January 7, 2008

Slimy Lisp video

There's a great video that gives you a kick-start in learning SLIME, the extremely advanced Lisp-mode for Emacs.
You can find it on common-lisp.net here: http://common-lisp.net/movies/slime.mov
It's very interesting even for non-Lispers, since it shows off features that the average or professional C++, C# or Java IDE can only dream of!