Saturday, January 27, 2007

IDS 10.00.*C6 new features and some thoughts...

I finally found a bit of spare time and decided to look at IDS 10.00.*C6 which was recently released. I must confess that I don't do this as often as I should, but given the current rate of IDS improvements I must say that this is a rewarding exercise.
Let me say I haven't read the *C5 release notes, so I did read them too.

If you're like me, and still haven't read them, you can find a shortcut here.

The first feature you'll notice is the Index Self Join. After reading a bit I recall I've already seen a description of this feature in an Oracle article. At the time I thought that although theoretically useful it should be difficult to find a situation where this would give real life benefits. Well, after testing it I was surprised.
So, what is this Index Self Join Feature? Putting it in an easy way, it's a way to scan an index where the Where clause may or may not include the index head column, and where the first index column(s) have very low selectivity.
Previously, the optimizer would scan all the keys which fullfill the complete set of key conditions, or it would make a full table scan if the leading key had no conditions associated. With this feature it will find the unique leading keys (low selectivity), and will make small queries using this unique keys and the rest of the index keys provided in the WHERE condition of your query. Err... I wrote "easy" above? Maybe an example will make it clearer:

Imagine you have table "test" lile this:

create table test(
a smallint,
b smallint,
c smallint,
t varchar( 255)
)
in dbs1 extent size 1000 next size 1000 lock mode row;
create index ix_test_1 on test ( a, b, c);


and you populate it with the result of (bash script):


#!/bin/bash

for ((a=1;a<=15;a++))
do
for ((b=1;b<=2000;b++))
do
for ((c=1;c<=100;c++))
do
echo "$a|$b|$c|dummyt"
done
done
done

If you make a query like this:


select t from test where b = 100 and c = 1;


You'll get a sequential table scan.
If you include a condition on column "a" you'll get an index scan, but the performance won't be nice...


Now, in *UC6, if you make a query with an optimizer hint like the one mentioned in the release notes:


select --+ INDEX_SJ ( test ix_test_1 )
t from test where b = 100 and c = 1;


You will get a what is called the Index Self Join query plan, and believe me, a much quicker response. If you don't believe me, and I suggest you don't, please try it yourself. You'll need the bash script above (if you're not using bash adapt the script to your favorite SHELL). Run the script and send the results to /tmp/test.unl. Then execute the SQL below (queries have a condition on column "a"):


cat <<eof >/tmp/test.sh
#!/bin/bash

for ((a=1;a<=15;a++))
do
for ((b=1;b<=2000;b++))
do
for ((c=1;c<=100;c++))
do
echo "$a|$b|$c|dummyt|"
done
done
done
eof

/tmp/test.sh > /tmp/test.unl


dbaccess stores_demo <<eof

-- use a raw table to avoid long tx
create raw table test(
a smallint,
b smallint,
c smallint,
t varchar( 255)
)
-- choose the right dbspace for you
in dbs1 extent size 1000 next size 1000 lock mode row;

-- locking exclusively to avoid lock overflow or table expansion
begin work;
lock table test in exclusive mode;
load from /tmp/teste.unl insert into test;
commit work;
create index ix_test_1 on test ( a, b, c);

-- dsitributions must be create for optimizer to know about field selectivity
update statistics high for table test (a,b,c);

select "Start INDEX_SJ: ", current year to fraction(5) from systables where tabid = 1;
unload to result1.unl select --+ EXPLAIN, INDEX_SJ ( test ix_test_1 )
* from test where a>=1 and a<=15 and b = 100 and c = 1;
select "Start FULL: ", current year to fraction(5) from systables where tabid = 1;
unload to result2.unl select --+ EXPLAIN, FULL ( test )
* from test where a>=0 and a<=15 and b = 100 and c = 1;
select "Start INDEX: ", current year to fraction(5) from systables where tabid = 1;
unload to result3.unl select --+ EXPLAIN, AVOID_FULL ( test )
* from test where a>=0 and a<=15 and b = 100 and c = 1;
select "Stop INDEX: ", current year to fraction(5) from systables where tabid = 1;

eof



In my system (a vmware machine runing Fedora Core5), the results were (only useful for comparison between query plans):


Start INDEX_SJ: 2007-01-28 19:18:29.77990
Start FULL: 2007-01-28 19:18:29.88101
Start INDEX: 2007-01-28 19:18:34.67570
Stop INDEX: 2007-01-28 19:18:41.92104


So, from about 5 or 6 seconds to about 0.1. Not bad hmmm?
Take a look at the query plan for more details:


QUERY:
------
select --+ EXPLAIN, INDEX_SJ ( test ix_test_1 )
* from test where a>=1 and a<=15 and b = 100 and c = 1

DIRECTIVES FOLLOWED:
EXPLAIN
INDEX_SJ ( test ix_test_1 )
DIRECTIVES NOT FOLLOWED:

Estimated Cost: 42
Estimated # of Rows Returned: 15

1) informix.test: INDEX PATH

(1) Index Keys: a b c (Serial, fragments: ALL)
Index Self Join Keys (a )
Lower bound: informix.test.a >= 1
Upper bound: informix.test.a <= 15

Lower Index Filter: informix.test.a = informix.test.a AND (informix.test.b = 100 AND informix.test.c = 1 )


QUERY:
------
select --+ EXPLAIN, FULL ( test )
* from test where a>=0 and a<=15 and b = 100 and c = 1

DIRECTIVES FOLLOWED:
EXPLAIN
FULL ( test )
DIRECTIVES NOT FOLLOWED:

Estimated Cost: 118847
Estimated # of Rows Returned: 15

1) informix.test: SEQUENTIAL SCAN

Filters: (((informix.test.b = 100 AND informix.test.c = 1 ) AND informix.test.a <= 15 ) AND informix.test.a >= 0 )


QUERY:
------
select --+ EXPLAIN, AVOID_FULL ( test )
* from test where a>=0 and a<=15 and b = 100 and c = 1

DIRECTIVES FOLLOWED:
EXPLAIN
AVOID_FULL ( test )
DIRECTIVES NOT FOLLOWED:

Estimated Cost: 114798
Estimated # of Rows Returned: 15

1) informix.test: INDEX PATH

(1) Index Keys: a b c (Key-First) (Serial, fragments: ALL)
Lower Index Filter: informix.test.a >= 0 AND (informix.test.b = 100 ) AND (informix.test.c = 1 )
Upper Index Filter: informix.test.a <= 15
Index Key Filters: (informix.test.b = 100 ) AND
(informix.test.c = 1 )




Is there a catch? Well, yes and no.
Currently this feature is disabled by default. To use it you'll need to use the optimizer directives or you'll have to change an ONCONFIG "hidden" parameter.
The parameter in question is called INDEX_SELFJOIN. A value of 1 enables it, and 0 disables it.
You can also change this at any time using:


onmode -wm INDEX_SELFJOIN=<1|0>


This information is not clearly explained in the release notes, but you can find it in the performance guide 10.00.*c6 release notes.
This is a feature planned for Cheetah that was backported to version 10. Probably in Cheetah (and in future 10.00 versions) it will be activaded by default. If you plan to use it, be careful and monitor the results... It's still a fresh feature.

So, what other good news do we have in the later versions? Well, one of them was used in the scripts above. Some of you may have noticed I created a raw (non-logged) table, and after loading it I created an index on it. Older versions wouldn't allow this, but we can use it since 10.00.xC5. There is also several enhancements that I won't review in detail:

  • Control the trigger fireing on replicated tables during synchronization
    This enables the control of triggers in replicated tables when we synchronize them

  • It's unecessary to copy oncfg file into target server when doing a imported restore
    When doing and imported restore (restore on a different server, not the one where we make the backups) we don't need to copy the oncfg files as we used to.

  • New binary data types
    There are two new datatypes: BINARYVAR and BINARY18. These datatypes provide indexable binary encoded strings and were created to improve compatibility with WebSphere. They are provided by a new free Datablade (binaryudt). In fact this is a showcase of IDS extensibility. These datatypes support some new specific functions (bit_and(), bit_or(), bit_xor() and bit_complement() ) as long as some standard functions like length(), octet_length(), COUNT DISTINCT(), MAX() and MIN().

    When I read about this I imagined a scenario where this could be used to create functional indexes, based on table fields which values could be represented in binary form by a function. I mean creating an index based on a function that given a list of attributes would generate a binary string. We could represent a true/false with just one bit. The attributes could be marketing fields about your customers (sex, married/single/divorced, has car, has children... etc.). Then you could create a bit representation of this fields and index your tables with it. A search could check all the fields with a bit comparison to the function generated index.
    I couldn't prove to myself that this was a good ideia, neither with search time comparisons neither with flexibility comparisons against the traditional index methods. But I leave here the idea. If someone manages to use it efficiently, please give me some feedback.

  • View folding
    This optimization permits that in certain cases there is no need to materialize a view (by creating a temp table). Instead the optimizer will make the join not against the resulting temp table but against the view base tables.


Well, that's it for now. The main message of this post is: keep up to date with the new versions release notes... They contain precious information that can help you decide if it's time to upgrade your systems or not.