Inverted File Structures:
Emp-no Emp-name Dept No.of dependents Salary
01 Jane Gun 1
02 George Shoe 2
04 Dave Shoe 3
05 Mary Gun 2
06 Mike Shoe 1
Inverted indexes for secondary attribute
|
|
|
|
Time needed to access posting lists is a function of their length and their allocation.
n: number of blocks
85% of are sequentially allocated
file needed to access posting list= n*0.15*(s+r+btt) +(s+r) +n*0.85*ebt
Display Emp-name where Dept= “Shoe” and No.of dependents=2
How many employees in the result set?
It depends on the selectivity of
the attributes.
R(no. of records)=10000
Condition No. of hits
Dept=Gun 0.10*10000=1000
Dept=Shoe & no. of dependents=2 0.30*0.40*10000=1200
List Structured Files
Consider a student file (SSN, Name, Major, GPA,…)
Display name where major= “BUS”
List Structured File Organization According to attribute major:
SSN Name Major Year GPA Link
10 Dave BUS 3 4.0 30
20 Jack CS 1 3.5 40
30 Jim BUS 2 2.7 50
40 Tom Music 3 2.2 0
50 Don BUS 1 3.2 60
60 Sam BUS 2 2.0 20
Multi List Structured File Organization:
SSN Name Major … Link
40 Tom Music 0
10 Dave BUS 3
60 Sam BUS 5
20 Jack CS 0
30 Jim BUS 6
50 Don BUS 0
Bus List Head :2
CS List Head :4
Music List Head :1
Assume that list size =k (e.g. k=4 for the BUS major)
How many blocks to retrieve(access)?
Consider
Record placement problem: Assume that we know all the queries and the records that will satisfy each query.
Problem: Distribute records among blocks such that the cost processing queries in terms of block accesses will be minimized.
One modification: Assume we also know frequency of execution for each query.
(ACM Trans. On Database Systems, December 1990)
Secondary key based retrieval using Range Question
Display student name where GPA>3.00 & age<20
Grid Files:
Example:
Name Height(inch) Weight(lb)
Sleepy 36 48
Happy 34 52
Doc 38 51
Dopey 37 54
Grumpy 32 55
Sneezy 35 46
Bashful 33 50
Ms. White 62 98
Data bucket size=2
Insert Sleepy and Happy.
Insert Doc.
Insert Dopey.
Insert Grumpy ,Sneezy and Bashful.
Insert Ms.White
Query: Display name where h=50 and w=50
Access the sleepy bucket and check for a person with h=50 and w=50
False drop elimination
2 disc accesses
1. access grid area
2. access data bucket
Always 2 disc accesses (for successful and unsuccessful search)
Characteristics of Grid Files:
1. A grid file is symmetric and adaptable symmetric, means that, each key is treated as a
primary key. Adaptable implies that it changes itself dynamically.
2. Data storage utilization ?70%
3. Concurrent application are easier than a B+ tree.
Two record additions may try to change the root node in B+ tree.
4. Involves elimination of false drops: after accessing a data bucket make sure that
records stored in the bucket matches the query condition.
K-d Trees(K dimensional trees):
A K-d tree is a K-dimensional tree index for range queries based on secondary key.
Name Height(inch) Weight(lb)
Sleepy 36 48
Happy 34 52
Doc 38 51
Dopey 37 54
Grumpy 32 55
Sneezy 35 46
Bashful 33 50
Ms. White 62 98
During the construction, alternate between the attributes of height and weight when deciding how to branch on successive levels of the tree.
1.Sleepy Sleepy(36,48)
2.Happy Happy(34,52)
3.Doc Doc(38,51)
4.Dopey Dopey(37,54)
5.Grumpy Grumpy(32,55)
6.Sneezy Sneezy(35,46)
7.Bashful Bashful(33,50)
8.Ms.White Ms. White(62,98)
n nodes (n+1) null pointers
Person Region
Sleepy 3
Happy 2
Doc 6
Dopey 7
Grumpy 4
Sneezy 1
Bashful 1
Ms. White 8
Example: K-
tree structure where K=3.
People Tolerant(%) Arrogant(%) Stubborn(%) Tender(%) Happiness
P1 80 10 3 7 80
P2 20 60 15 5 40
P3 60 5 25 10 65
P4 60 20 5 15 60
P5 10 20 10 60 55
P6 45 5 5 45 95
P7 60 25 5 10 50
P8 10 80 5 5 25
P9 25 5 5 65 50
P10 10 60 25 5 35
P11 70 10 15 5 75
P12 55 5 30 10 60
Key Precedence: 1. Tolerant 2. Tender 3. Stubborn then cycle. Insert in an order of happy to unhappy.
Our tree has 12 nodes so it has 12+1 = 13 null pointers shown with dash lines above.