[DatabaseSystem] Database Storage 1

Today we would focus on how data organized in disk. DBMS would assume that the primary storage is lied on the non-volatile storage.

Storage Hierarchy Overview

Typical Storage Layer:
cpu register
cpu cache
DRAM
(Non-volitile memory)
SSD
HDD
Network Storage

Usually data is moved from one layer to the neighbor layer, it is not possible to move data from SSD to cache without moving in DRAM.
Network storage usually means S3,AFS or NFS. It is mush more slower than other levels, and not frequently used in typical database.

Volatile vs. Non-Volatile
Volatile memory typically follows the random pattern, the data address could be jump from one location to another location without serious cost. Non volatile memory also supports jumping, but usually it is relatively slow compared with sequential reading.

In this course, we call DRAM memory, call SSD/HDD/NetworkStorage disk. Also, this course would not cover the optimization in or below than cache level. Note that the new hardware non-volatile memory has been present in the industry, Intel NVM comes into the market this year, it is quite an interesting new topic, but we have not covered it in this semester.

Sequential vs. Random Access
Traditional DBMS are designed to maximize sequential access. It means that we take a series of data blocks from disk, and then copy it into the memory, which is much more efficient than take the segments one by one.

mmap
A map reflects the address from virtual memory to physical memory. MMAP is a bridge between OS level and database level, but OS does not provide a well done mechanism for the database implementation, and the default OS MMAP policy could not satisfy the requirement of database. Some companies use the kernel interface for the help of mmap policy

Basic Question
1. How DBMS represents the database in files on disk?
2. How DBMS

File Storage

Fundamentally, DBMS stores data in one or more files on the disk. These files are exactly the same with the regular files, and no further optimization in OS level. The OS could not identify a database file from other files.
Once upon a time, some DBMS use the custimzed file system. Sure, it could provide some performance improvement(around 15%), but the cost in industry is too large to afford.

Page

Page could be used to store everything. It is self-contained, because we do not want to lose everything after we lose one page. A file is composed of many pages, and each page is marked as an offset value.
Note that Database Page is not Hardware Page or OS pages. They are totally different.Different DBMS use different page size, from 1KB-16KB. If the db page is larger than the hardware page, DBMS need additional work to keep the write safe and atomic.

Heap

A heap file is an unordered collection of pages, the frequently used implementation is linked list or page directory.
Linked List
Too slow, not used today.
Page Direcotry
The meta-data pages include everything about the page and free slots in the page. Even if the meta page is lost, DBMS could scan through available pages to reconstruct the meta page. It is quicker than the linked list.

Page Layout

Page Header
Each pagee contains the metadata about this page, which make the page self-contained.

Slotted Pages
Slot array to keep the tuple offset. The slot array begins from top to bottom, the real data begins from the bottom to top, and once they meet in the middle, then there is no more available space anymore.
It supports both fixed tuple and variable tuple.

Log-Structures File Organization
No tuples any more, only the log record.System appends log records to the file of how database is modified.
Write is quick, because it only needs write an item in the page.
Read is slow, because the database need to reconstruct the tuple from logs. Even DBMS need to read multiple pages for specific content.
The idea is that system only writes forward, never modify backward.
Usually, it is companied with periodically compact, or checkpoint. Level Compaction is something compacting the logs in different size, while Universial Compaction is directly applied on all log pages.

Tuple Layout

Tuple, in physical level, is a sequence of bytes. Same with page, tuple includes header and data field.
Usually, page only involves tuples from one table. For example, all tuples from page 123 is located in table foo.

Denormalization
Different tables use the same storage space in one page. It potentially reduces the amount of IO throughput, because it makes join much more quick. But it makes update more expansive for the reorgnization reason.

Record Ids
Each tuple is assigned a record ID. page_id+offset/slot. It is quite strange that some DBMS expose the record id in application level.

select r.ctid,r.* FROM r;

But application could not use the tuple id to get access to the data, beacuse it is easily changed after tuple insetion or garbage collection. After remove a tuple from the table and executing the garbage collection, the tuple id would be changed.

VACUUM FULL

Still you could read the tuples by tuple id, or you could insert the tuple by assigned tuple id manually.

INSERT INTO r VALUES (104,`XXX`)
UPDATE r SET val='yyy' WHERE ctid = (0,4)

Some DBMS stores the id in tuple header, some not because they claims that it is a waste of space.



Leave a Reply

Your email address will not be published. Required fields are marked *