SQL indexing in 9 minutes and a half

SQL indexing in 9 minutes and a half


Hello, my name is Stephane Faroult. I am a French database consultant. I’d like
to talk a little about indexes, because I see a
lot of mistakes in the field. I have published two books with O’Reilly,
one about how I think one should write SQL, and more recently
a second one more relevant to what takes most of my professional time, namely improving
SQL applications written by people who have obviously NOT read
my first book. In both books, I had to talk about indexes,
a subject that is often surprisingly misunderstood. Almost everyone knows you mostly have two
ways to access data in a relational table: one is to scan
the table from beginning to end, and the other one is to descend an index that is most often
a tree tructure, find the physical address of rows
of interest, then directly access them. Neat and fast. So, why the fuss? We should probaly index
every column that intervenes in a WHERE condition. Hold on. Some indexes are created to implement
constraints, in particular to quickly identify that you are trying to insert duplicates in
columns that should contain unique values. And you almost always have some obvious “entry
points”, some columns that are particularly important
and selective search criteria. I have no issue with these indexes. But all indexes have a cost. For one thing, they take a lot of storage.
This shows the total storage taken by a real 8 Gigabyte table
in a bank. The five indexes take together more than three times as much storage as data. You are going to tell me: who cares, storage
is cheap nowadays. True enough, but only to some extent. And
of course I’m not only talking about the production
database. It is likely, if data is precious, that disks are mirrored. And if the data is
critical, it is probably replicated to some distant
disaster recovery site, where it can also be mirrored. We probably also have a copy of the production
data that is used for development. And another copy
for maintenance of the current version of the software. And a user-acceptance database
copy, for testing new functionalities. And a performance database
for stressing the system. A few gigabytes here, a few gigabytes there,
and you are soon talking about real storage requirements,
that don’t come that cheap. But paying for disks isn’t the only cost. The time it takes to backup a database is
directly proportional to its size. Backups are usually planned. But recoveries
often aren’t, and the problem is the same. If your database takes a lot of space because
of indexes, it will take longer to recover. Franklin gave to an old saying its modern
guise in 1748 in his “Advice to a Young Tradesman”: Remember that time is money. There is even worse, and this is a test you
can easily replicate. I have inserted 100,000 rows
into a table of a dozen columns. In the same time I could insert 100 rows without
any index, After creating a primary key index I inserted
65 After adding a second index I was down to
22, After adding another index I could only insert
15 And a last index brought my throughput down
to 5. The actual numbers may vary, but since indexes
are maintained in real time each index inflicts a performance penalty, and it can be verified
with any database management system. It’s not a concern in a read-only decision
support system. It can be one in an operational system. I certainly want some indexes. But since indexes
are costly, we must study the cost/benefit ratio and really
weigh their usefulness. Let’s first understand how they really work. An index entry is the association of a key
value with a physical addresses. In an index, all the entries are stored in
sorted order of the key values, and a tree structure
is plugged over this list to quickly access the address of a key value by comparison of
this key with what you find in the various nodes. Searching an index is in fact very similar
to searching a phone book, in which the key is the surname,
firstname and perhaps the street, while the associated value that is searched is the phone
number. By comparing the name you search to the top
of the page, you reach the good line very fast. But when we are looking for a lot of values,
either because the search criterion isn’t very selective,
or because we are using the result of a subquery that returns many rows, it may actually be
more efficient to remorselessly scan the table
than to get the address from the index each time. The question really is: do we need an index,
like a book index that is great to find some very
specific information at one particular place or do we need something which is more like
the broad categorization of the table of contents, because we need to get information from full
chapters. Databases know the equivalent of chapters,
they are called partitions. Even when you are certain about which columns
really need indexing, you must be careful about how
you index them. For instance, suppose we have a table called
T, and that we have indexes on two columns from T,
country and year. If we often run queries in which we combine
a condition on the country with a condition on the year,
everything is fine … except that the optimizer will have to think
hard about how best to run the query. It can use an index and forget about the other
one, or it can sometimes use both indexes to find
the intersection of two result sets. If both columns appear in most query conditions,
it makes sense to create a composite index, in which we shall take the year and the country,
concatenate the two and call the result the key. The composite index will take us precisely
to the rows we want, and in terms of index maintenance
during inserts and deletes, it will be faster to maintain one index on two columns than
two indexes on one column. But then there is a question. Should the index be built on country and year, or on year and country? It’s not exactly
the same thing. Remember that keys are sorted in indexes.
An index on year and country might look like this While an index an country and year would resemble
that. If both country and year appear in the condition,
the two indexes will be equivalent and will allow to zoom fast on the result. If only a condition on the year appears in
the where clause, we shall be able to use the index
that starts with the year, but not the other one, unless we scan it fully. And of course the reverse will be true if
we only have a condition on the name of the country. Searching an index is similar to searching
a dictionary : you quickly find what you want if you only
know the beginning of a word the spelling of which you want to check. If you want to find words that rhyme together
a regular dictionary is useless. It also means that when you have an index
on two columns C1 and C2, then having a single column
index on C1 is pointless. However, it may make sense to have an index on C2 alone if
it speeds up a series of queries in which there is no condition
on C1. There are some other subtleties. If we have a range condition on the year,
both indexes will in theory be usable, but searching
the index that starts with the year will require more work, and perhaps that the optimizer
will decide that it’s not worth the trouble. There is much more to say about indexes, and
this video is a mere introduction to the topic. But remember that indexing indiscriminately,
hoping to reach the target, will in most cases not work,
because indexes are not the silver bullet that always makes queries fast. If you want to index efficiently your tables,
you don’t usually need many indexes, but you need finely targeted indexes. Thank you for listening.

Daniel Ostrander

Related Posts

26 thoughts on “SQL indexing in 9 minutes and a half

  1. bulbish says:

    thanks for the great tutorials! that's the most i ever learned about databases in just an hour!

  2. twolittlehorses4me says:

    would not have dropped my class if I seen this video!

  3. CamiloSanchez1979 says:

    Ok, so the question remains, should one index or not? ..i guess the answer is relative

  4. Wandering Alone says:

    @ponghissimo1 That's why he has put closed captions… So…

  5. Wandering Alone says:

    @CamiloSanchez1979 Yes, it seems so. The only question that I have concerns the second table with countries and years…is it possible that the search will be done by first looking at the years column even though, in our query, we've put country on the first place (…WHERE country='CHINA'…).
    Strange…

  6. Justin Mahaney says:

    @ponghissimo1 He's French..

  7. roughsealtd says:

    @AnExplorer1000
    The order of conditions in the SQL statement is irrelevant. Actually "order" is a notion that is completely foreign to the relational theory as designed by Ted Codd. Query optimizers check conditions, check indexes, sometimes rewrite queries, and do as they can to take advantage of existing indexes – if things such as selectivity and distribution of value hint at using the indexes being good for response time.

  8. Vinicius Heringer says:

    Very easy to understand.

  9. polypus74 says:

    extremely funny videos. i am definitely going to have a look at your books. merci pour le bon travaille.

  10. murph1329 says:

    probably one of the best tutorials i've seen for indexes

  11. shakeosm says:

    excellent.. teaching method and contents…

  12. hemmahbeast says:

    No, thank you for teaching.

  13. Jai Rajasekaran says:

    Useful

  14. freekeyboards says:

    Very informative video and love the accent! Thanks so much!

  15. Ashok Kumar Sharma says:

    Thank you. Easy to understand and follow.

  16. RockAddicted says:

    Thank you!

  17. Mido Mostafa says:

    Print
    select best thanks from Greetings where ThanksValue = 1000000

  18. Dan Levin says:

    fire ze mizziles!

  19. Mikayil Abdullayev says:

    This was very useful. This video once more proves that databases should be approached more academically.

  20. Michael Kelley says:

    Great video. Thank you for adding the sub-titles too.

  21. Tharinda Wickramaarachchi says:

    thanks 

  22. Lawrence Adams says:

    Brilliant!

  23. Prashant Vikal says:

    Though the information provided is excellent in terms of the value but the title is misleading.
    The information provided in the video is about pros and cons of indexing and not about how to index ?

  24. pradeep gowda says:

    excellent video. Especially the way it was presented.

  25. nelsonthekinger says:

    great job sir 🙂

  26. vikash kumar says:

    Very well explained.Amazing

Leave a Reply

Your email address will not be published. Required fields are marked *