PostgreSQL organizes data and associated indices in separate files: one file per table plus one file per index. Files are divided into blocks of 8192 bytes (8 KiB). This is the smallest unit of PostgreSQL's disc I/Os. Currently used blocks are mirrored 1:1 in the Shared Memory of the instance. Within PostgreSQL, a 'block' is often called a 'page' though the term 'page' may be differently interpreted by the underlying file system, e.g.: 4 KiB or some other 2^n value.
All pages of data files are logically equivalent, whereas index files use different page types depending on the needs of the index (meta page, root page, internal tree page, leaf page, ...). Nevertheless, the physical layout for pages of data files and of all types of index files is identical.
Each page consists of 5 major components
- Page Header: General information about the page.
- ItemIds: Array of pointers to Items. It grows or shrinks over time.
- Free Space: The unused space of the page.
- Items: A set of actual data respectively rows or a set of index entries. It grows or shrinks over time.
- Special Space: Data files don't use it. Index files structure it depending on the needs of the index type.
Every additional row or index entry creates a new ItemId at the end of the 2. component plus a new Item at the beginning of the 4. component. As a result, the free space shrinks from both its left and right side.
The Page Header consists of 24 bytes and contains information like a page checksum, offset to start and end of free space or to special space, information for WAL handling, the layout version number, and some more flags.
Every ItemId consists of 4 bytes and contains the offset to and the length of the corresponding Item.
In the case of data files, every Item (= row) consists of:
- Item Header: 23 bytes containing various transaction IDs, current or newer tuple ID, offset to data, and various flags.
- Null Bit Map: Marker for such columns which are currently NULL. The map is optional: if a table contains only non-nullable columns, it is superfluous.
- Data: The value of every attribute of the row - if not NULL. The sequence and types of attributes are stored in the system schema. In the case of variable-length data types, the currently used length is stored at the beginning of the attribute.
In the case of index files, Items consist of index entries, which are differently structured depending on the index type.
The Special Space is only used by index files for their individual purpose, e.g.: page number, page type, (double-)linked list within pages of the same level, tree-level, ... . Data files don't need such additional information. They are organized as a heap, which means that their pages are logically equivalent without any special order or hierarchy between pages. Additional data is always put to the next free space or to a freshly allocated page.
Often, parts of data files are read and written directly by their page number. But it's not unusual that they are read completely. This is done by a sequential scan. Such comprehensive sequential accesses (as an analogy for following a linked list in an index file) are - in most cases - optimized by the underlying file system by its read-ahead technique. This helps especially for files located on a rotating disc (HDD) whereas for SSDs this advantage vanishes a little.
[edit | edit source]