Not using indexes, or avoiding indexes, in a large-scale data warehouse is not a simple technique. I call it a “strategy” because it requires coordinated actions across database configuration, data modeling, ETL, and BI application design.
First, the utility of an index is that when you need to access one database block, the index can pinpoint that specific block for you. The basic question is: what if I need to access millions of blocks? Indexes can still help, but the database will have to do the block lookup millions of times. In that case, particularly on powerful platforms like Exadata or Teradata, a targeted table scan, such as a partition scan or a table scan with a storage index, can be much faster.
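As a minimal sketch of what a "targeted table scan" looks like, consider a hypothetical fact table range-partitioned by month (table and column names are illustrative, not from the original text):

```sql
-- Hypothetical fact table, range-partitioned by month.
CREATE TABLE sales (
  sale_id    NUMBER,
  sale_date  DATE,
  cust_id    NUMBER,
  amount     NUMBER
)
PARTITION BY RANGE (sale_date) (
  PARTITION p2024_01 VALUES LESS THAN (DATE '2024-02-01'),
  PARTITION p2024_02 VALUES LESS THAN (DATE '2024-03-01')
);

-- The predicate on the partition key lets the optimizer prune to a
-- single partition; the scan touches only that partition's blocks,
-- with no per-row index lookups.
SELECT SUM(amount)
FROM   sales
WHERE  sale_date >= DATE '2024-01-01'
AND    sale_date <  DATE '2024-02-01';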
That is where we will start looking into the issue.
The Need for Indexes
Obviously, simply dropping indexes can have consequences for operations like row updates. Without an index, a single-row update triggers a table scan, which in most cases is very slow even if it is only a partition scan. If it is a massive update, such as a select-then-update job, the result can be a disaster when there is no index.
Star transformation, which uses bitwise operations on multiple bitmap indexes to dramatically reduce the number of blocks accessed, is still an effective technique when typical queries filter on multiple columns.
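For readers unfamiliar with the technique, here is a hedged sketch of the usual setup on a star schema (the tables and columns are hypothetical):

```sql
-- Bitmap indexes on the fact table's dimension keys enable star
-- transformation.
CREATE BITMAP INDEX sales_cust_bix ON sales (cust_id);
CREATE BITMAP INDEX sales_prod_bix ON sales (prod_id);

ALTER SESSION SET star_transformation_enabled = TRUE;

-- With filters on several dimensions, the optimizer can AND the
-- bitmaps together and fetch only the matching fact-table rows.
SELECT SUM(s.amount)
FROM   sales s, customers c, products p
WHERE  s.cust_id  = c.cust_id
AND    s.prod_id  = p.prod_id
AND    c.region   = 'WEST'
AND    p.category = 'PHONES';
```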
Another reason DBAs prefer indexes is that, without them, heavy table (or partition) scanning can flood the limited I/O bandwidth between the RDBMS and its storage. This can be severe when the data volume is high and the storage is network-attached.
The Problem with Indexes
However, indexes are addictive: the more you create, the more you need. Let’s face it, it is not uncommon to walk into a database and find that indexes take as much storage as the actual data do, or even more.
What’s more important, indexes are not free; they carry significant downsides, and the larger the data volume, the costlier they are. Are you willing to pay 99 points of index-maintenance burden for 1 point of indexing benefit? That trade-off is common when you have hundreds of terabytes of data: the cost of an index can far outweigh its benefit.
So, is there a way out? The answer is yes for some typical data warehouse applications. But to use no indexes at all, you need a coordinated strategy.
Database Configuration
For an Oracle database, the normal case is that data blocks are read into the buffer cache in the SGA before they are used in calculations, and index access generally results in data blocks being read into the SGA. For parallel queries, however, data are mostly read directly into the PGA through direct path reads, bypassing the SGA. We will see why it works that way and what the benefit of direct I/O is.
As you may know, smart scan is a huge advantage of Exadata. Smart scan can push filters and column projections down to the storage cells, and it can be far more efficient than an index range scan, particularly when storage indexes or proper partitioning are present. In my experience, smart scan (row filtering plus column projection), combined with storage indexes and partition scans, can reduce the I/O traffic from the storage cells to the RDBMS to 1% of what it would otherwise be, or even less.
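One rough way to check whether a statement actually benefited from offload is to compare the offload-eligible bytes against the bytes that crossed the interconnect, using the offload columns V$SQL exposes on Exadata (the sql_id substitution variable is a placeholder):

```sql
-- Bytes eligible for offload vs. bytes actually shipped to the RDBMS.
-- A large gap between the two suggests smart scan did its job.
SELECT sql_id,
       io_cell_offload_eligible_bytes,
       io_interconnect_bytes
FROM   v$sql
WHERE  sql_id = '&your_sql_id';
```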
But smart scan breaks data blocks down into rows. Data returned from the storage cells via smart scan cannot be cached in the SGA as blocks; they must go directly to the PGA. If the database cannot do direct path reads, it will not be able to use smart scan efficiently.
From there you can see that at least two configurations are critical if you want to avoid indexes. One is sizing the PGA and SGA properly: typically you will allocate far more memory to the PGA than you normally would, and correspondingly less to the SGA. The other is configuring the system so that it supports and encourages parallel queries.
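A sketch of what those settings might look like; the sizes below are purely illustrative, and the right values depend entirely on your host and workload:

```sql
-- Illustrative only: favor PGA over SGA for a scan-heavy warehouse.
ALTER SYSTEM SET pga_aggregate_target = 64G SCOPE = BOTH;
ALTER SYSTEM SET sga_target           = 16G SCOPE = BOTH;

-- Let the optimizer compute the degree of parallelism automatically
-- and queue statements when the system is saturated.
ALTER SYSTEM SET parallel_degree_policy = AUTO SCOPE = BOTH;
```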
On the route of using table scans instead of index access, we can go further by compressing the data. When data are compressed, it is much faster to scan through a wide range of it, such as scooping up a whole partition. Compression can be a data modeling task, but equally important are the system and tablespace configurations that facilitate it.
Data Modeling
When the mindset is to use no indexes, the design of database objects can be significantly different.
You will find it critically important to have everything partitioned properly. Partitioning is not a simple choice: the partitions should facilitate not only partition scans, which are obviously cheaper than full table scans, but also partition-wise joins. A partition-wise join can lower the join complexity from O(n²) to O(n).
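A minimal sketch of modeling for a full partition-wise join, assuming two hypothetical tables: both are hash-partitioned on the join key with the same partition count, so each pair of matching partitions can be joined independently.

```sql
-- Same partition key, same partition count on both sides.
CREATE TABLE orders (
  order_id NUMBER,
  cust_id  NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

CREATE TABLE order_items (
  order_id NUMBER,
  cust_id  NUMBER,
  amount   NUMBER
)
PARTITION BY HASH (cust_id) PARTITIONS 16;

-- Joining on the partition key turns one huge join into 16 small,
-- independent partition-to-partition joins.
SELECT /*+ PARALLEL(16) */
       o.order_id, SUM(i.amount)
FROM   orders o
JOIN   order_items i
  ON   i.cust_id  = o.cust_id
 AND   i.order_id = o.order_id
GROUP  BY o.order_id;
```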
You will also need to model tables so that Bloom filters (BF) can work their magic. On Exadata, Bloom filters have proven much more efficient than star transformations. The symbol :BF is a sweet friend to see in your query plans.
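The typical shape is a hash join from a small, filtered dimension into a large fact table; as one hedged, illustrative example (hypothetical tables, and the hint merely encourages what the optimizer often does on its own):

```sql
-- Build a Bloom filter from the filtered dimension rows and apply it
-- to the fact-table scan; PX_JOIN_FILTER requests the filter explicitly.
SELECT /*+ LEADING(c) USE_HASH(s) PX_JOIN_FILTER(s) */
       SUM(s.amount)
FROM   customers c, sales s
WHERE  s.cust_id = c.cust_id
AND    c.region  = 'WEST';
```

In the execution plan, look for JOIN FILTER CREATE and JOIN FILTER USE operations carrying a :BF0000-style name; on Exadata the filter can be offloaded to the storage cells, discarding non-matching fact rows before they ever leave storage.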
And of course, you will need to choose the proper method and level of compression. Compression is a great way to speed up table scans so that they can work faster than index access.
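On Exadata this usually means Hybrid Columnar Compression; a sketch with illustrative names and levels (which level is right depends on how hot the data is):

```sql
-- Warehouse compression for data that is still queried regularly.
CREATE TABLE sales_hist (
  sale_id   NUMBER,
  sale_date DATE,
  amount    NUMBER
)
COMPRESS FOR QUERY HIGH
PARTITION BY RANGE (sale_date) (
  PARTITION p2023 VALUES LESS THAN (DATE '2024-01-01')
);

-- Cold partitions can later be rebuilt at a heavier level.
ALTER TABLE sales_hist MOVE PARTITION p2023 COMPRESS FOR ARCHIVE LOW;
```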
ETL: Handling Updates
How can you update quickly without using indexes? That should be the first question to ask when you go down the no-index route.
One option is to define a limited number of indexes just for updates. But you will find that updates mess up compression: a compressed block becomes decompressed after a few row updates. And you are back to the headache of maintaining those indexes.
Partition exchange is another, and often more effective, approach. With partition exchange, you can update rows in a partition while keeping the partition compressed as usual. Properly managed, partition exchange can be as fast as update-by-index for small updates, and much faster for large updates or UPSERT situations.
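A hedged sketch of the pattern, assuming the hypothetical sales table from earlier and a sales_delta table holding the changed rows: rebuild the affected partition's data in a staging table, then swap the segments.

```sql
-- Keep the unchanged rows, recompressed as we go.
CREATE TABLE sales_stage COMPRESS FOR QUERY HIGH AS
SELECT *
FROM   sales PARTITION (p2024_01)
WHERE  sale_id NOT IN (SELECT sale_id FROM sales_delta);

-- Add the new/updated rows with direct-path inserts.
INSERT /*+ APPEND */ INTO sales_stage
SELECT * FROM sales_delta;
COMMIT;

-- The exchange is a data-dictionary operation: near-instant, and the
-- partition stays compressed.
ALTER TABLE sales EXCHANGE PARTITION p2024_01 WITH TABLE sales_stage;
```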
BI Application Design
The key to BI application design when no index is defined is to always enforce partition access, never full table scans, and to always facilitate partition-wise joins between large tables. This, however, can depend heavily on the data model.
And of course, it is critical to manage parallelism and other configurations properly so that smart scans take effect.
In summary, it takes a thoughtful, comprehensive solution design to truly utilize all those advanced Exadata features so that the overhead of indexes no longer bothers you. The arguments above are by no means exhaustive, but the main point stands: this is not a simple configuration change.