Optimizing C++/General optimization techniques/Input/Output
From Wikibooks, the open-content textbooks collection
Binary format
Instead of storing data in text mode, store it in a binary format.
On average, numbers stored in binary form occupy less space than formatted numbers, so they take less time to transfer between memory and disk. More importantly, if data is transferred in the same format the processor uses, no costly conversion between text and binary format is needed.
The disadvantages of a binary format are that the data is not human-readable and that the format may depend on the processor architecture (for example, on byte order and type sizes).
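As an illustration of the guideline above, here is a minimal sketch that stores a vector of `double` values as raw bytes and reads it back. The function names `save_binary` and `load_binary` and the file layout (a count followed by the values) are assumptions made for this example, and the file is only meant to be read back on the same platform that wrote it.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Writes the element count followed by the raw bytes of the values.
// Returns false if the file could not be opened or written.
bool save_binary(const char* path, const std::vector<double>& values) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    const std::size_t count = values.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    if (count != 0)
        out.write(reinterpret_cast<const char*>(&values[0]),
                  static_cast<std::streamsize>(count * sizeof(double)));
    out.flush();
    return out.good();
}

// Reads a file written by save_binary on the same platform.
// No text-to-number conversion takes place: bytes are copied as they are.
bool load_binary(const char* path, std::vector<double>& values) {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    std::size_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return false;
    values.resize(count);  // a sketch: a corrupt count is not guarded against
    if (count != 0 &&
        !in.read(reinterpret_cast<char*>(&values[0]),
                 static_cast<std::streamsize>(count * sizeof(double))))
        return false;
    return true;
}
```

A text-mode equivalent would format and parse every number with `operator<<` and `operator>>`, which is where the conversion cost comes from.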
Open files
Instead of opening and closing a frequently used file every time you access it, open it the first time you access it and close it when you have finished using it.
Closing and reopening a disk file takes a variable amount of time, roughly as long as reading 15 to 20 KB from the disk cache.
Therefore, if you need to access a file often, you can avoid this overhead by opening the file once before accessing it, keeping it open by hoisting its handle wrapper to an outer scope, and closing it when you are done.
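The following sketch contrasts the two approaches. The names `append_line_reopening` and `LineLog` are hypothetical and exist only for this illustration. In the second variant the stream is a data member, so it stays open for as long as the owning object lives in an outer scope.

```cpp
#include <fstream>
#include <string>

// Slow variant: the file is opened and closed on every call.
void append_line_reopening(const char* path, const std::string& line) {
    std::ofstream out(path, std::ios::app);
    out << line << '\n';
}   // the stream destructor closes the file here

// Faster variant: the stream is opened once by the constructor and
// closed once by the destructor of the owning object.
class LineLog {
public:
    explicit LineLog(const char* path) : out_(path, std::ios::app) {}
    bool ok() const { return out_.good(); }
    void append(const std::string& line) { out_ << line << '\n'; }
private:
    std::ofstream out_;
};
```

A caller would typically create one `LineLog` object before a loop and call `append` inside it, instead of calling `append_line_reopening` on every iteration.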
I/O buffers
Instead of performing many I/O operations on single small or tiny objects, perform I/O operations on a 4 KB buffer containing many objects.
Even if the I/O operations of the run-time library are buffered, the overhead of many I/O function calls costs more than copying the objects into a buffer.
Larger buffers do not have good locality of reference.
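One possible way to apply this guideline is sketched below. The `Sample` record type and the `BufferedSampleWriter` class are hypothetical names introduced for this example. Objects are copied into a 4 KB array, and `fwrite` is called once per full buffer rather than once per object.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstring>

// Hypothetical small record type, used only for this illustration.
struct Sample {
    int id;
    double value;
};

// Collects small objects in a 4 KB buffer and writes each full buffer
// with a single fwrite call. The file must stay open until the writer
// has been flushed or destroyed. Objects of this class must not be copied.
class BufferedSampleWriter {
public:
    explicit BufferedSampleWriter(std::FILE* file) : file_(file), used_(0) {}
    ~BufferedSampleWriter() { flush(); }

    void write(const Sample& s) {
        if (used_ + sizeof s > sizeof buffer_) flush();
        std::memcpy(buffer_ + used_, &s, sizeof s);
        used_ += sizeof s;
    }

    // Writes the pending bytes; returns false if some could not be written.
    bool flush() {
        const std::size_t written =
            used_ ? std::fwrite(buffer_, 1, used_, file_) : 0;
        const bool ok = (written == used_);
        used_ = 0;
        return ok;
    }

private:
    std::FILE* file_;
    std::size_t used_;
    char buffer_[4096];
};
```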
Memory-mapped file
Except in a critical section of a real-time system, if you need to access most parts of a binary file in a non-sequential fashion, use a memory-mapped file, if your operating system provides this feature, instead of accessing the file repeatedly with seek operations or loading it all into an application buffer.
When you have to access most parts of a binary file in a non-sequential fashion, there are two standard alternative techniques:
- Open the file without reading its contents; every time a piece of data is needed, jump to its position using a file positioning operation (also known as a seek), and read that data from the file.
- Allocate a buffer as large as the whole file, open the file, read its contents into the buffer, and close the file; every time a piece of data is needed, search the buffer for it.
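The two standard techniques can be sketched as follows, assuming a file made of fixed-size records. The `Record` type and the function names are hypothetical and serve only to make the comparison concrete.

```cpp
#include <cstddef>
#include <fstream>
#include <vector>

// Hypothetical fixed-size record, used only for this illustration.
struct Record {
    int key;
    int payload;
};

// Technique 1: the file stays open; every access seeks and then reads.
bool read_record_by_seek(std::ifstream& file, std::size_t index, Record& out) {
    file.clear();  // reset any end-of-file state left by an earlier access
    file.seekg(static_cast<std::streamoff>(index * sizeof(Record)),
               std::ios::beg);
    file.read(reinterpret_cast<char*>(&out), sizeof out);
    return file.gcount() == static_cast<std::streamsize>(sizeof out);
}

// Technique 2: the whole file is read once into an application buffer;
// later accesses only index into the vector.
bool load_all_records(const char* path, std::vector<Record>& records) {
    std::ifstream file(path, std::ios::binary | std::ios::ate);
    if (!file) return false;
    const std::streamoff size = file.tellg();
    if (size < 0) return false;
    records.resize(static_cast<std::size_t>(size) / sizeof(Record));
    file.seekg(0, std::ios::beg);
    if (!records.empty())
        file.read(reinterpret_cast<char*>(&records[0]),
                  static_cast<std::streamsize>(records.size() * sizeof(Record)));
    return !file.fail();
}
```

With the first function every access goes through the stream's positioning and reading operations, while the second pays the whole reading cost up front.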
Compared with the first technique, a memory-mapped file replaces every positioning operation with a simple pointer assignment and every read operation with a simple memory-to-memory copy. Even assuming that the data is already in the disk cache, both memory-mapped file operations are much faster than the corresponding file operations, as each of the latter requires a system call.
Compared with the technique of pre-loading the whole file into a buffer, using a memory-mapped file has the following advantages:
- When file-reading system calls are used, data is usually transferred first into the disk cache and then into the process memory, while with a memory-mapped file the system buffer containing the data loaded from disk is accessed directly, thus saving both a copy operation and the disk cache space. The situation is analogous for output operations.
- When the whole file is read up front, the program is blocked for a significant time, while with a memory-mapped file that time is spread over the processing, as the parts of the file are accessed.
- If some sessions need only a small part of the file, a memory-mapped file loads only those parts.
- If several processes load the same file into their own buffers, memory is allocated for every process, while with a memory-mapped file the operating system keeps in memory a single copy of the data, shared by all the processes.
- When memory is scarce, the operating system has to write out to the swap area even the parts of the buffer that haven't been changed, while the unchanged pages of a memory-mapped file are simply discarded.
However, memory-mapped files are not appropriate in a critical portion of a real-time system, as the latency of data access depends on whether the data has already been loaded into system memory or is still only on disk.
Strictly speaking, this technique depends on the software platform, as memory-mapped files are neither part of the C++ standard library nor available on all operating systems. However, given that this feature exists in all the main operating systems that support virtual memory, the technique is widely applicable.
Here is a class that encapsulates read access to a file through a memory-mapped file, followed by a small program demonstrating its usage. It is usable both on POSIX operating systems (like Unix, Linux, and Mac OS X) and on Microsoft Windows.
A class that encapsulates the write access to a file through a memory-mapped file would be somewhat more complex.
File "memory_file.hpp":
#ifndef MEMORY_FILE_HPP
#define MEMORY_FILE_HPP

/*
  Read-only memory-mapped file wrapper.
  It handles only files that can be wholly loaded
  into the address space of the process.
  The constructor opens the file, the destructor closes it.
  The "data" function returns a pointer to the beginning of the file,
  if the file has been successfully opened, otherwise it returns 0.
  The "length" function returns the length of the file in bytes,
  if the file has been successfully opened, otherwise it returns 0.
*/
class InputMemoryFile {
public:
    InputMemoryFile(const char *pathname);
    ~InputMemoryFile();
    const void* data() const { return data_; }
    unsigned long length() const { return length_; }
private:
    // Not copyable: a copy would unmap and close the same file twice.
    InputMemoryFile(const InputMemoryFile&);
    InputMemoryFile& operator=(const InputMemoryFile&);
    void* data_;
    unsigned long length_;
// Mac OS X defines __APPLE__ but not __unix__.
#if defined(__unix__) || defined(__APPLE__)
    int file_handle_;
#elif defined(_WIN32)
    typedef void * HANDLE;
    HANDLE file_handle_;
    HANDLE file_mapping_handle_;
#else
    #error Only Posix or Windows systems can use memory-mapped files.
#endif
};
#endif
File "memory_file.cpp":
#include "memory_file.hpp"
#if defined(__unix__) || defined(__APPLE__)
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>   // fstat, struct stat
#elif defined(_WIN32)
#include <windows.h>
#endif

InputMemoryFile::InputMemoryFile(const char *pathname):
    data_(0),
    length_(0),
#if defined(__unix__) || defined(__APPLE__)
    file_handle_(-1)
{
    file_handle_ = open(pathname, O_RDONLY);
    if (file_handle_ == -1) return;
    struct stat sbuf;
    if (fstat(file_handle_, &sbuf) == -1) return;
    data_ = mmap(0, sbuf.st_size, PROT_READ, MAP_SHARED, file_handle_, 0);
    if (data_ == MAP_FAILED) data_ = 0;
    else length_ = sbuf.st_size;
#elif defined(_WIN32)
    file_handle_(INVALID_HANDLE_VALUE),
    file_mapping_handle_(0)
{
    // CreateFileA accepts a narrow-character path even in Unicode builds.
    file_handle_ = ::CreateFileA(pathname, GENERIC_READ,
        FILE_SHARE_READ, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if (file_handle_ == INVALID_HANDLE_VALUE) return;
    // CreateFileMapping signals failure with a null handle,
    // not with INVALID_HANDLE_VALUE.
    file_mapping_handle_ = ::CreateFileMapping(
        file_handle_, 0, PAGE_READONLY, 0, 0, 0);
    if (file_mapping_handle_ == 0) return;
    data_ = ::MapViewOfFile(
        file_mapping_handle_, FILE_MAP_READ, 0, 0, 0);
    if (data_) length_ = ::GetFileSize(file_handle_, 0);
#endif
}

InputMemoryFile::~InputMemoryFile() {
    // Release only the resources that were actually acquired.
#if defined(__unix__) || defined(__APPLE__)
    if (data_) munmap(data_, length_);
    if (file_handle_ != -1) close(file_handle_);
#elif defined(_WIN32)
    if (data_) ::UnmapViewOfFile(data_);
    if (file_mapping_handle_) ::CloseHandle(file_mapping_handle_);
    if (file_handle_ != INVALID_HANDLE_VALUE) ::CloseHandle(file_handle_);
#endif
}
File "memory_file_test.cpp":
#include "memory_file.hpp"
#include <algorithm>   // std::copy
#include <iostream>
#include <iterator>

int main() {
    // Write to console the contents of the source file.
    InputMemoryFile imf("memory_file_test.cpp");
    if (imf.data())
        std::copy((const char*)imf.data(),
            (const char*)imf.data() + imf.length(),
            std::ostream_iterator<char>(std::cout));
    else
        std::cerr << "Can't open the file";
}