Thursday, January 31, 2008

SQL proposal: Select deleted and updated records

Here's another proposal for an addition/extension to SQL. In the earlier post I said that it should be possible to get feedback from delete- and update-statements. Here's an even better proposal that adds some syntax to SQL and allows you to fetch the records that have been updated or deleted.
Here's what it should look like:

delete from customers where marked_as_deleted='true' select id, name

Executing this behaves like a normal select-statement, but the result set contains the id and the name of the customers you have just deleted. With this it is possible to create a message like
The customers "Tom", "Bill" and "Lara" have been successfully deleted.

This is a good idea because it gives the user more feedback on the operation that has just been completed. With update-statements it's very similar:
update customers set marked_as_deleted='true' where last_action < '01.01.1998' select id, name

A possible user-feedback would be
The customers "Tom", "Bill" and "Lara" have just been marked for deletion, because they are too old.

You can simulate this, of course, by first selecting the customers and then running an update/delete with a) the same where-clause (which is unsafe, because the data might have been changed in the meantime) or b) one statement per fetched id, which is slower because it doesn't let the server handle the whole operation on its own.
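
For illustration, here is roughly what the client-side simulation (approach a) looks like today. The db_* calls are made-up stand-ins for whatever client-library you use, not a real API:

      /* Approach a), sketched with a hypothetical db_* client-API: run the
         select and the delete with the same where-clause inside one
         transaction. */
      db_begin(conn);
      db_result *rows = db_query(conn, "select id, name from customers"
                                       " where marked_as_deleted='true'");
      show_deleted_message(rows);  /* "Tom", "Bill" and "Lara"... */
      db_exec(conn, "delete from customers"
                    " where marked_as_deleted='true'");
      db_commit(conn);
      /* Unless the transaction provides a stable snapshot, matching
         rows may still change between the two statements. */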

Clean SQL-Server Client-API

Introduction


I often stumble upon limitations or awkwardness when dealing with the Firebird SQL-server. These are the big points which you can't fix in your client-application, because the API and some design choices do not allow it:

  • There's no progress feedback except when fetching datasets. Updating and deleting, however, happen in just one (potentially huge) blocking call.

  • There's no clean way to keep my datasets current. When they've been changed by a different (committed) transaction, I don't get any kind of notification.

  • Buffering datasets is a tremendously hard task to program if you want to do it efficiently (both speed- and memory-wise).

  • Multithreading as an optimization is practically a no-go, because you need to establish a new connection for each thread. If you want to do intelligent multithreading with Firebird, the task becomes exponentially more difficult and error-prone.



Requirements


So here are the requirements I would expect from a well-designed client-API for Firebird, which would allow me to solve the aforementioned problems.

(Cumulative) progress feedback


This applies to update- and delete- (and maybe other) statements only; for select-statements it's not needed. The client-app should be notified about the progress of the operation it just executed. Otherwise, (unexpectedly) long-running executions may lead to the wrong conclusion that the server 'hangs' or that something else is going wrong. You should be able to create visual feedback on the client-side for a better end-user experience. I often find myself writing client-side loops instead of simple updates with where-clauses for the sole reason of providing a progress-bar to the end user, since that is a strong usability-requirement of good UI design. Of course you can't know how many datasets are still to come, but you should at least know how many datasets have been processed.
There are two ways in which I can imagine the Firebird-API to support this:
a) provide a call-back function to the execute-function:
      int execute_progress_info(int datasets, void *data) {
          /* 'datasets' is the number of datasets processed so far,
             'data' is a pointer supplied by the client-app. */
          if (time_elapsed() > 10000) /* e.g. give up after 10 seconds */
              return 0;               /* tell execute to abort */
          return 1;                   /* keep going */
      }

If the call-back function returns 0, execute aborts its operation (if possible).
b) make the client call the execute-function repeatedly and have it return the number of processed datasets. If the execute-function returns 0, the client stops calling it:

      int datasets = 0, temp;
      while ((temp = sql_server_execute(statement)) > 0) {
          datasets += temp;
          printf("%d datasets processed.\n", datasets);
          /* possibly process some window- or system-messages here */
      }


Integrated SQL-parser and editor


I often see the requirement for the client-application (especially for client-library wrappers) to parse and understand the SQL-query. For example, the IBObjects Delphi library does this to modify the SQL-string. Currently, much of the SQL-parsing logic is re-implemented by IBObjects, while it should be accessible through the SQL-server's API to remove code-duplication and incompatibilities.
It should be possible to create a program-readable SQL-'tree' from SQL-code and vice-versa. By SQL-'tree' I mean some kind of data-structure that represents the SQL-code in-memory, much like what it looks like when it is compiled by the server. This in-memory structure could then be modified by the program in an error-proof and logical fashion (e.g. add or remove parts of the selected fields, or add an expression to the 'order by'-clause) and converted back to SQL-code or directly executed at API-level. This could be used for SQL-editors that have rich completion-features like today's IDEs (Visual Studio, NetBeans, etc.). There are SQL-editors that provide those, but they duplicate the SQL-parsing-code, too, and are likely to break with unknown or new syntax. I will provide another good example for this later.
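
To make this concrete, here is a sketch of what such a tree-API could look like; every function name in it is an assumption, invented for illustration:

      /* All names here are invented; nothing like this exists in the
         Firebird client-API today. */
      sql_tree *tree = sql_parse("select id, name from customers");
      sql_tree_add_field(tree, "street");        /* extend the field list */
      sql_tree_add_order_by(tree, "name");       /* extend 'order by' */
      char *sql = sql_tree_to_string(tree);      /* back to SQL-code */
      sql_server_execute_tree(connection, tree); /* or execute directly */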

Partial exposure of SQL-subsystem functionality


What I mean by that is that I want to be able to check whether an existing in-memory dataset (that may or may not be in the database) matches a specific where-clause, or to order datasets that exist in the client's memory only by a given 'order by'-clause.
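
A hypothetical API for this could be as small as two functions (again, both names are invented for illustration):

      /* Check one in-memory dataset against a where-clause. */
      int matches = sql_match_where(dataset, "city='New York'");
      /* Sort an array of client-side datasets by an 'order by'-clause. */
      sql_sort_datasets(datasets, count, "order by name, city desc");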

Event-notification systems with additional information


The event-mechanism as it is right now is kinda useless for many purposes I'd like to use it for. Its design is very simple, but this limits it to a very narrow range of applicable use-cases. For example, when monitoring a dataset for server-side changes, I can currently do the following: I have a base-name for the event, which ideally corresponds to the table's name. Let's say I want to monitor the datasets in the table 'customers' that I've partially fetched into a list. To monitor changes to each existing dataset, I would need two different events: customers_del_$ID and customers_upd_$ID, where $ID is replaced with the dataset's ID. I have two triggers on the table which fire off the corresponding events like this:
      create trigger customers_del_event for customers active after delete as
      begin
          post_event 'customers_del_'||old.id;
      end^


and

      create trigger customers_upd_event for customers active after update as
      begin
          post_event 'customers_upd_'||old.id;
      end^


I then register for two events per dataset to monitor changes and deletions. When the upd-event fires, I can re-fetch the dataset by ID, and when the del-event fires, I can simply remove it from the list. But there's one caveat: when the ID of the dataset changes, I have no chance at all to notice that. There's no way the client can be notified about the ID-change by the server. Of course there are workarounds for this limitation, but none of the ones I can think of come with zero overhead.

Now what I would like to see is the ability to pass (arbitrary) parameters to the events. If it were possible to pass a 'new_id' parameter to the upd-event, all would be fine: I could simply re-fetch the dataset with its new ID. Or, even better, pass all the required fields to the event and eliminate the need to re-fetch the dataset altogether, because the client can update its data from the received event's parameters. Here's how I would imagine the syntax:

      create event customer_upd(
          id integer,
          name varchar(100),
          street varchar(100),
          city varchar(100))^
      /* When parameters for events are possible, the event needs to be declared first */

      create trigger customers_upd_event for customers active after update as
      begin
          post_event customer_upd(new.id, new.name, new.street, new.city);
      end^
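
On the client-side, consuming such a parameterized event could then look like the following sketch; the registration- and accessor-functions are assumptions, not existing API-calls:

      /* Called by the client-library whenever 'customer_upd' arrives. */
      void on_customer_upd(db_event *ev, void *user_data) {
          int id           = db_event_get_int(ev, "id");
          const char *name = db_event_get_string(ev, "name");
          list_update_row(user_data, id, name); /* no re-fetch needed */
      }

      db_event_register(conn, "customer_upd", on_customer_upd, my_list);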


Thread-safe Client-API


Thread-safety for the client-API would be a real gain. The whole programming world is becoming more and more parallelized to increase the speed and responsiveness of programs. Even more so with dual-cores and quad-cores on desktop-PCs.
One possible approach that does not provide more speed, but would make it possible to access the same connection from multiple threads, would be to have one 'db-access-thread' automatically started by the client-API, which communicates with every client-API-function in a thread-safe way (e.g. via pipes or something similar). That way, each thread would still have to wait for the other threads' operations to finish, but it would become much easier to write responsive applications by putting database-operations that would otherwise block the GUI-thread into a separate thread, without the need to create a new connection (which would decrease performance, and, additionally, I really like to have only one connection per application). This can of course be implemented in the client-application (or 3rd-party client-library) itself, but I currently know of no implementation which bothers to do so. Having such neat features in the client-API makes Firebird a better product 'out of the box', IMHO.

Another idea is a completely asynchronous API, which would be the best solution: every API-call would be non-blocking, and an event would be delivered to the application when it finishes. Asynchronous handling should always be preferred to threaded handling when it comes to networking, since it usually scales better and puts less stress on the client. On the other hand, it's usually far more complicated to program.
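
Here is a rough sketch of the 'db-access-thread' idea: one worker-thread owns the single connection and pulls jobs from a thread-safe queue, so every other thread can submit statements without blocking the GUI. The queue, job and result types are assumptions for illustration:

      /* One job = one statement plus a completion-callback. */
      typedef struct {
          const char *sql;
          void (*done)(db_result *r, void *user_data);
          void *user_data;
      } db_job;

      static job_queue queue; /* any thread-safe, blocking FIFO */

      void *db_access_thread(void *arg) {
          db_connection *conn = arg; /* the one shared connection */
          for (;;) {
              db_job *job = job_queue_pop(&queue); /* blocks when empty */
              db_result *r = db_execute(conn, job->sql);
              job->done(r, job->user_data); /* e.g. post to the GUI-thread */
          }
          return NULL;
      }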

Dataset-buffer in Client-API


When programming lists that display datasets, you usually fetch the datasets from the database and then put them into some kind of buffer of your own. Programming that buffer can become a major task if the list is very large. There are techniques to speed up fetching such a list, for example what I call 'lazy' data-reading:
You first read in all the datasets' IDs and store them in a structure that looks like this:

      struct dataset {
          db_key id;
          struct dataset_content *content;
      };


content is initially NULL, and when the program tries to read the content for the first time, it is fetched from the database with a separate cursor from the one that fetched the IDs. The two SQL-queries would resemble something like this:

      select id from customers where city='New York' order by name; /* Fetch all the IDs */
      select name, street, city from customers where id=?; /* Fetch a specific dataset */


This 'lazy fetching' speeds up the creation of lists that *require* knowing the total size of the result-set, because fetching the IDs is a few times faster than fetching the data, and fetching the data is delayed until it's displayed. Most lists require the total size of the result-set, because implementing the scrollbar-logic without that information is not possible in any usable fashion. Nearly all SQL-based programs have very bad handling when it comes to lists and scroll-bars.

Additionally, optimizing small-object-allocation is nearly a MUST in this situation, otherwise performance is likely to degrade to unusable levels. This is true for any list-buffer, though, and I think it should be part of the client-API so the application does not need to worry about it. As much optimization as possible should be included in the client-API, because it'll make Firebird look like a more stable and faster database than others in subjective comparisons. Again, application-developers are unlikely to write these kinds of optimizations themselves in today's age, because bigger hardware is cheaper than software-optimization (especially at the client-side). Combining this with a multi-threaded or asynchronous approach is the ideal solution.

I have created a proof-of-concept implementation with the techniques mentioned above, and I think I'll write a detailed description of the process and the results soon. IMHO the usability and responsiveness of what I've created is as good as it can possibly get.
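
The heart of such a lazy buffer is one accessor that hides whether the content has already been fetched; fetch_content_by_id is a made-up stand-in for the second ('where id=?') statement shown above:

      /* Return the dataset's content, fetching it on first access. */
      struct dataset_content *dataset_get_content(struct dataset *ds) {
          if (ds->content == NULL)
              ds->content = fetch_content_by_id(ds->id); /* 2nd select */
          return ds->content;
      }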

Creating a 'live' list


Without those features it's still possible to write a client-application that shows a 'live' list of datasets of a specific range in a specific order. By 'live' I mean that new, deleted and updated entries are visible immediately, in real-time, without any polling going on. This of course decreases network-traffic and at the same time increases speed. First, I'd like to describe the approach that I consider optimal for implementing this without the aforementioned desired features.

There are a few database-prerequisites that each table you want to watch 'live' has to meet: It needs an 'updated'-field which is set to the current datetime in 'before insert'- and 'before update'-triggers. Then it needs to post the events 'table_inserted', 'table_updated' and 'table_deleted' in the corresponding 'after'-triggers. The client-app, in turn, first selects the desired data and listens for the table_xxx events. Each time an _inserted- or _updated-event fires, it adds an 'and updated>=:last_update' to its original select-statement to fetch only the updated datasets. Existing datasets (identified by the ID) are updated accordingly, and new datasets can be added to the list. Since you don't know how the list might be sorted (after all, it's hidden in the select-statement, which may be user-provided), the best bet is to add those datasets to either the top or the bottom of the list. The 'if the ID changes, I'm out of luck'-problem applies here, too: in that case, the updated dataset with the new ID would be interpreted as a new dataset and I would see the same dataset twice in the list.
For best performance, you should include the 'lazy fetching'-optimization I described earlier: you first fetch the IDs only, then fetch each dataset as it is required. You need two separate selects for this. The first select is like the original, but with the fields replaced by the 'id'-field only. The second adds an 'and id=:id' to the where-clause, with all previous where-clauses put into parentheses so that no operator-precedence messes up the result. Those are both difficult tasks to program, because they require some effort to parse and modify the SQL-structure. The lazy fetching is difficult because, when done multi-threaded or asynchronously, it quickly becomes complex and confusing. Parsing the SQL-statement is very complex, too, since writing a bullet-proof, Firebird-compatible SQL-parser isn't an easy task. That's why there should be a 'cursor'-like object in the Firebird-API that provides optimal fetching- and buffering-means, and some functions to parse and modify an SQL-statement.
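
Putting the pieces together, the event-handler for the _inserted-/_updated-events could look roughly like this; db_query, merge_into_list and the handling of last_update are assumptions about the surrounding code:

      /* Fired on 'customers_inserted' or 'customers_updated'. */
      void on_customers_changed(void) {
          /* The original (possibly user-provided) where-clause goes into
             parentheses, the incremental filter is appended, IDs only: */
          db_result *ids = db_query(conn,
              "select id from customers where (city='New York')"
              " and updated >= ?", last_update);
          merge_into_list(ids);  /* update known IDs, append new ones */
          last_update = db_server_time(conn); /* remember for next time */
      }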

Wednesday, January 16, 2008

Got BUGs?!

You just have to check this out, it's extremely cool:
http://www.buglabs.net/products
Bug Labs manufactures and sells the BUG device, which is a small micro-controller in a very stylish housing. It has an ARM CPU at its core (no clock-speed mentioned) and 128MB of main memory. It features hardware graphics acceleration, MPEG4 decoding and encoding, WiFi, USB, Ethernet and a hell of a lot more interfaces.
If it weren't for the 300 bucks this BUG costs, I'd already own this world-dominating gadget thingie... or whatever.

Monday, January 7, 2008

Slimy Lisp video

There's a great video that gives you a kick-start in learning SLIME, the extremely advanced Lisp-mode for Emacs.
You can find it on common-lisp.net here: http://common-lisp.net/movies/slime.mov
It's very interesting even for non-Lispers, since it shows off features that the average or professional C++, C# or Java IDE can only dream of!