Choosing The Right File Format/Print version
Table of Contents
- Quick Guide to recommended formats
- Is there a problem?
- A general look at File Formats
- Recommendations in detail
File formats are the language of a computer's memory. Choosing the right format for the electronic information we want to store is one important step in making good use of computers and minimising problems.
This book tries to help you choose the file format best suited to its use. It concentrates on two purposes of storing your information (data).
- Portability and interoperability
- Digital Preservation
Portability and Interoperability is the ability of your data to be read (interpreted) by different software and hardware. The most common portable format in use now is the PDF (Portable Document Format) for sending documents over the internet. Somewhat more troublesome is the exchanging of address information between email software.
Digital Preservation is the long-term, error-free storage of digital information, together with the means to retrieve and interpret the files you need for as long as the information is required.
Old version of Introduction to merge in
Planning for an unpredictable future is known as future proofing. You can never really know whether you are future proof, but you can practise risk reduction and learn from the mistakes of history. This article focuses on future proofing of computer files: it gives tips for creating files in a manner which makes them easy to preserve and access later, and for avoiding pitfalls that could make your files difficult to access. Looking after the files you already have is known as digital preservation.
Future proofing and digital preservation deal with the rather ethereal matter of electronic files and their formats. A separate concern is the media on which your information is stored. That is an area of study in itself, and is not the subject of this article.
Both the future proofing of information and the media it is stored on are vital in any thorough review of your IT systems. Will you be able to read the files you're working on now in 5 years' time? Do you know if all the old files you have now are still readable?
Whether electronic files will be readable in 5 or 10 years is not something to leave to chance. Active intervention is needed in most cases. Migrating to software and hardware that support openly published standards is the most effective single step in any plan to future-proof.
Quick Guide to recommended formats
For an explanation of terminology read the section "Formats for storing electronic information".
|Data Type||Preferred formats||Common OpenSource applications||Common proprietary applications|
||Notepad++, gedit, kate, AbiWord, OpenOffice.org Writer, KWord, PDFCreator, Geany, vim, emacs||Notepad, Adobe Acrobat, Corel WordPerfect, Microsoft Word, Star Office|
||OpenOffice.org Calc, Lotus 1-2-3, Gnumeric, KSpread||Microsoft Excel, Corel Quattro Pro|
|Web pages||Abiword, OpenOffice2.0, Mozilla composer, Quanta, Geany||Dreamweaver, Adobe GoLive|
|Graphics (raster & vector)||
||Gimp, TuxPaint, OpenOffice.org Draw, Blender, Inkscape, kolourpaint||Adobe Photoshop, Adobe Illustrator, 3DStudioMax, AutoCAD|
|Audio||Audio files pair a container with a codec; together the two make up the format.
||Audacity, Ardour||Adobe Soundbooth|
|Video||Video files also pair a container with codecs; together they make up the format.
||Openshot , KDEnlive||Adobe Premiere , Windows Movie Maker|
|Database||Databases do not use files in the normal sense, however a good database can output its content structured with SQL (Structured Query Language) - an ANSI/ISO standard. It is also important that it supports ODBC (Open Data Base Connectivity)||PostgreSQL, MySQL, Firebird and InterBase, Kexi||Microsoft SQL Server, Oracle and Sybase|
||GnuPG (GNU Privacy Guard)||PGP (Pretty Good Privacy)|
Is there a problem?
If you are one of the many people who used to use WordPerfect or WordStar and have since switched to a different editor, you may already be familiar with the problem of retrieving your own information from certain types of files. Or perhaps you switched from one operating system to another, from Amiga to Windows, or Windows to Macintosh. Stated simply, file formats for different software far too often leave your information scrambled in a way you cannot decipher again years later.
If this seems a bit theoretical to you then here are some stories to illustrate the issue of choosing the right format for your information.
The English Tourist
A tourist walks into a very nice restaurant in a lovely village in the French countryside and mutters in English "Are you still serving lunch?" No one reacts, so he says louder, "Do you have a TABLE where I might DINE?" Recognizing a few words and realizing that the tourist must only speak English or isn't interested in trying his French, one of the employees goes off to find someone who might be able to help this ignorant tourist.
After a long delay, someone comes, interprets his request and finds him a seat in the restaurant. The tourist is handed a menu. "I can't read this! It is in French! What are Cervelles anyways?" The helpful interpreter is called back and the tourist has the whole menu explained to him and is finally ready to order a meal. By now our hapless tourist is getting hungry and frustrated and, in the way of everyone who is frustrated and hungry, forgets his manners and blurts, "By the way, I am going to order in English so I can be sure of what I am getting - and for the privilege of taking my order, I demand that you pay the Queen of England a small sum for the use of this language which you should really just learn to use like everyone else!"
After this last sentence is finally translated back to the previously friendly proprietors, the kitchen is closed and the tourist is sent packing.
In terms of file formats, where this tourist has gone wrong is that, although he is happy with the format he is using (unlike the Roman official in the next story), he has forgotten that different people do things differently. In a different context his preferred format (English) is not supported. This is the situation if your favourite software company goes bust or stops supporting the software you bought. The files which once were so convenient can become useless with time.
The Roman Official
An official in ancient Rome by the name of Gallus hires a scribe called Taruna who understands Latin but can only write in a rare (and unrecorded) dialect of Sanskrit. After Taruna has been in the job for some years Gallus finds he is actually too slow and keeps losing important documents. Taruna is turned out into the street and goes back to his family in disgrace.
The following day the official employs a highly regarded new assistant and sends him into the archive. A few minutes later the assistant comes out in tears, explaining that he knows only a few words of Sanskrit, can't find any references to the dialect used and could never hope to make sense of these documents.
Frantically they search for Taruna. When they find him they ask him to come back to work, but he sees their problem. So he says with a smile, "I will happily come back to work; you just need to double my pay and holidays!"
In modern terms, where the Roman official went wrong was in using an unpublished format (an unrecorded dialect of Sanskrit) to store his information. He was then trapped by this format and forced to keep buying the software (the scribe's services) at ever-increasing cost. He had lost control of his own information!
In a report written for The National Archives (UK) in 2003, Adrian Brown summarises how to proceed.
The selection of file formats for creating electronic records should ... be determined not only by the immediate and obvious requirements of the situation, but also by longer-term considerations. An electronic record is not fully fit-for-purpose unless it is sustainable throughout its required life cycle. ... It is therefore highly desirable to identify the minimum set of formats which meet both the active business needs and the sustainability criteria below, and restrict data creation to these formats.  (PDF)
Project Gutenberg's approach to this challenge has been a strict criterion: all of the 15,000+ books in its digital repository are stored in plain ASCII text.
Whenever possible, Project Gutenberg distributes a plain text version of an eBook. Other formats, such as HTML, XML, RTF, and others are also welcome, but plain text is the "lowest common denominator." We stress the inclusion of plain text because of its longevity: Project Gutenberg includes numerous text files that are 20-30 years old. In that time, dozens of widely used file formats have come and gone. Text is accessible on all computers, and is also insurance against future obsolescence. 
Does that mean we cannot use word processors, if we want long term access to the information in our documents? Well, yes and no. If you want long term file readability (of Latin script languages) as Project Gutenberg does, then ASCII text is the way to go. This might be something to consider for financial records and other valuable information. If, as many people do, you have non-text information, like images and sounds, then this is the article to read. Either way there are a lot of common errors you can avoid which will at the very least make future migrations to the next generation of file formats much easier.
Let's now take a real world scenario. Many people use the Microsoft Windows operating system and the Microsoft Office package which includes the document application Microsoft Word (or just MSWord). The default file format of MSWord is DOC. So what's DOC like for long term storage?
MS Word is a proprietary program and the .doc file extension is a proprietary format. That means that how the software works and stores your information is secret - only Microsoft knows exactly how it all works.
Formats for storing electronic information
At any one time there is an enormous variety of file formats in use for various purposes, so how do you choose which one is best? There are three types of file format:
- Proprietary, closed specifications
- Proprietary, open specifications
- Non-proprietary, open specifications
Proprietary, closed specifications are used by some of the most common software; if you don't use them yourself, you probably get sent them. However, because these formats are not publicly documented, you are held hostage by the company making the software. If it decides not to support old versions of its own format, suddenly you can't open your old files! Your choice of software then depends heavily on how well any new software can second-guess the format used by your old software. Examples of this type of format are Microsoft Word's doc format (.doc), Microsoft Excel's xls format (.xls) and Adobe Photoshop's Document format (.psd).
Proprietary, open specifications are somewhat better in that although the format is still legally owned and developed purely for their commercial benefit by one company, they have undertaken to document the format openly. They can still choose to switch back to a closed specification, or they may make changes they choose not to document. In other words, a proprietary open specification is only open as long as the company wants to keep it that way. Examples of this type of format are Adobe's Portable Document Format (.pdf) (patented, although most of the patents are licensed on a royalty-free basis), Adobe TIFF format (.tiff) and Macromedia Shockwave Flash (.swf) (however, the documentation is under a non-disclosure agreement that requires readers not to contribute to any other implementations of Flash, so in practice it is still closed).
Non-proprietary, open specifications have been openly documented by some public body, or released to it by the developers. Once released, these formats have a guaranteed reference point. Examples of this type of format are Portable Network Graphics (.png), Joint Photographic Experts Group (.jpg / .jpeg), MPEG-2 (.mpeg2), eXtensible Markup Language (.xml) (the structure, more than the specific format), and Scalable Vector Graphics (.svg).
One special case is the Adobe Portable Document Format for archival use (PDF/A), a restricted application of the proprietary open specification PDF 1.4 format. It is a published ISO International Standard (ISO 19005-1:2005), developed by the PDF-Archive Committee in close partnership with the Administrative Office of the U.S. Courts. I have not found any software which supports this format, so it is possibly only used in organisations where archival is their main concern.
One family of formats which could solve many issues is collectively called OpenDocument developed by OpenOffice.org, OASIS, and many others in the industry (but not Microsoft). OpenOffice.org 2.0, recently released, uses the OpenDocument family of formats. Of the software supporting OpenDocument, OpenOffice, AbiWord and Google Docs are cross-platform, and KOffice will be as of KDE 4.1 (around July 2008). Mac OS X 10.5's TextEdit can understand the format to some degree, and Microsoft states it will add native support for OpenDocument 1.1 (rather than plug-in converters) to MS Office, as of Spring 2009.
Unfortunately, leading software often defaults to a format which is inherently unsuited to later retrieval. An example is Microsoft Word, which defaults to its native .doc format rather than the better documented and more widely supported Rich Text Format (.rtf), though you can change the default format. Microsoft products are also notable for their use of what Marshall Masters of the Independent Book Publishers Association calls 'upgrade blackmail' and describes as "Someone with a new version of your desktop application edits your file, and now your older version of the application cannot read it, which forces you to pay for an expensive upgrade if you want to continue working and playing well with others." That's definitely something for anyone with a budget to avoid.
So, what's the next level of future-proofing? Read on...
Criteria in choosing future proof file formats
Formats for future proofing must:
- Be supported by comprehensive, public documentation
- Be stable, not under constant revision
- Be supported by several software providers
- Be supported on various hardware
- Be supported by software on various operating systems (Windows/Macintosh/Unix/Linux)
- Be free of legal restriction in its use (see PNG not GIF)
- Be popular, since popular formats are more likely to remain supported
Criteria in choosing suitable software
There are some software implications of the format criteria. Not all software implements open specifications correctly, so some programs only appear to be using a particular format. This is most common with Hyper Text Markup Language (HTML) editors and the notorious 'Save as html' option in Microsoft Word. So if you are buying software, make sure to check this. Open source and free software inherently tends to have strong support for open standards.
Text & Documents
In most types of organisation, text documents are the most important type of electronic information after financial accounts. Depending on what the document contains there are several types of format to choose from.
There are three types of text documents:
- Plain text files - simple text, no formatting, no font choices.
- Text documents - you can choose fonts, colors, text size and backgrounds, and embed images (sounds/video etc.).
- Documents for presentation - all the options of Text documents, with restrictions on further editing.
For Plain text files the simplest and most durable format is ASCII (American Standard Code for Information Interchange). First published in 1963, it is probably the most widely supported format ever. However, it is also very limited. The only formatting available is the selection of line breaks. There is no embedding of any images or colors, and there is no support for diacritic marks or non-Latin scripts. There are a variety of other encodings based on ASCII which add support for more characters. In the western world windows-1252 (which is closely related to ISO-8859-1) is the most common of these. Other parts of the world will have other conventions. UTF-8, which can represent texts of all languages in real use, is becoming more common and may be the best choice for long term storage of text.
Text files using an encoding based on ASCII are usually given the .txt suffix, but it can be hard to determine automatically which encoding a given file uses. So it is a good idea to try and find out what encoding you are using and record it. If you are really paranoid you may also want to find and store the authoritative tables for converting that encoding to Unicode (try http://www.iana.org/assignments/character-sets and http://www.unicode.org/Public/MAPPINGS/).
For Windows users, Notepad is the default application for handling TXT files. Current versions of Notepad assume UTF-8 if the file is completely valid UTF-8 or has a UTF-8 byte order mark, UTF-16 if they detect a UTF-16 byte order mark, and the Windows ANSI code page (1252 for western versions) otherwise. In a pinch it is often possible to use Notepad and similar editors to get the raw text out of other types of files, and it can be informative to try this on other files you plan to store.
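The detection order described above can be sketched in a few lines of Python. This is only an illustration of the heuristic, not a description of how Notepad is actually implemented; the function name `guess_text_encoding` and the `cp1252` fallback are assumptions made for the example.

```python
import codecs

def guess_text_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess the encoding of raw text bytes (illustrative heuristic only)."""
    # A byte order mark at the start of the file is the strongest hint.
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith(codecs.BOM_UTF16_LE) or data.startswith(codecs.BOM_UTF16_BE):
        return "utf-16"
    # No BOM: accept UTF-8 only if every byte sequence is valid UTF-8.
    # Pure ASCII passes this check, because ASCII is a subset of UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: the western Windows code page.
        return fallback
```

A guess like this is worth recording next to the file (for example in a small README), since the fallback case can never be confirmed from the bytes alone.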
Text documents are what you produce on one of the many commercial or free word processors. Most of the time you probably use one for writing basic text documents: letters to friends and colleagues, project lists and so on. Applications for this type of text are found in popular office suites like Microsoft Office, AppleWorks and OpenOffice.org.
For the purpose of durability of your documents it is important that the document you write today will still be readable next year. For a long time there has been no open standard for documents, so compatibility has been a constant problem. People have had different levels of success when they've chosen to migrate from one document editor to another, as each used its own format. The .doc format is now well supported by several editors.
Whichever word processor you use, it should support several formats, and choosing the most durable one is very important. While work proceeds on the OpenDocument standard (Version 1.0 was approved as an OASIS standard in May 2005), RTF (Rich Text Format) is the most widely supported and documented format available. You should be able to make this your default format so all future documents are in the RTF format. (Tutorial on changing the default format in Microsoft Word) If you choose not to do this because RTF does not support some feature you need, you should still consider using RTF as your archival format. Your formatting may not be represented correctly, but at least your content is there for posterity.
If you spend time making Documents for presentation you'll know that Word processors are limited in this area. You might be using programs like Adobe Illustrator/InDesign, sodipodi or CorelDRAW. These programs are great, but they can be tricky to successfully archive.
The Portable Document Format (PDF) is the file format created by Adobe Systems in 1993 for document exchange. PDF is a fixed-layout format used for representing two-dimensional documents in a manner independent of the application software, hardware, and operating system. Each PDF file encapsulates a complete description of a 2-D document (and, with Acrobat 3-D, embedded 3-D documents) that includes the text, fonts, images, and 2-D vector graphics that compose the documents. PDF is an open standard that has been officially published on July 1, 2008 by the ISO as ISO 32000-1:2008. "The Portable Document Format (PDF)" Wikipedia Online Encyclopedia, accessed July 4th, 2008
PDF/A is described in ISO 19005-1:2005 Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1) that was published on October 1, 2005. This standard defines a format (PDF/A) for the long-term archiving of electronic documents and is based on the PDF Reference Version 1.4 from Adobe Systems Inc. (implemented in Adobe Acrobat 5). PDF/A is in fact a subset of PDF, leaving out PDF features not suited to long-term archiving. This is similar to the definition of the PDF/X subset for the printing and graphic arts. "PDF/A" Wikipedia Online Encyclopedia, accessed July 4th, 2008
The XML Paper Specification (XPS), formerly codenamed "Metro", is a specification for a page description language and a fixed-document format developed by Microsoft. It is an XML-based (more precisely XAML-based) specification, based on a new print path and a color-managed vector-based document format which supports device independence and resolution independence. "The XML Paper Specification (XPS)" Wikipedia Online Encyclopedia, accessed July 4th, 2008
One word of caution for using PDF files: do not use the inbuilt compression of PDF files, and if possible use the PDF 1.4 specification.
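To check which PDF version an existing file declares, you can look at its first few bytes: a PDF file starts with a header such as `%PDF-1.4`. The sketch below reads that header; the function name `pdf_header_version` is made up for this example, and a later entry inside the file may override the header version, so treat the result as a first indication only.

```python
import re
from typing import Optional

def pdf_header_version(data: bytes) -> Optional[str]:
    """Return the version named in a PDF header such as b'%PDF-1.4', or None."""
    # Only the start of the file is examined; the header is expected there.
    match = re.match(rb"%PDF-(\d+\.\d+)", data[:16])
    return match.group(1).decode("ascii") if match else None
```

In practice you would pass in the first bytes of a file, for example `pdf_header_version(open(path, "rb").read(16))`, where `path` is whatever file you want to inspect.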
- Use plain ASCII text whenever possible
- Use ODT where formatting is important or where graphics need to be included
- Use PDF or XPS for documents which will not need to be edited in the future
- OpenDocument in Wikipedia
Producing (X)HTML files can be done with a wide variety of software, and in general web browsers are very forgiving of errors in HTML code. In most cases it would still be unwise to use office applications for creating (X)HTML documents. Although some office software applications are now quite good at generating clean (uncluttered) code, there are always some mistakes. Never under any circumstances use Microsoft Word's "Save as HTML" function. The code it produces is full of non-standard, Microsoft-specific extensions, and the files are very large.
The advantage of (X)HTML is that it can always be read with your eye, whether you have a suitable browser or not. For example, the title of an HTML page (if you look at the code of the file) is surrounded by <title> and </title>, so it looks like this: <title>A page about me</title>. This makes (X)HTML ideal for storing text files with structure. You are, however, limited by your ability to create (X)HTML files and by (X)HTML's limitations on formatting.
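Because the structure is plain text, it is just as easy for a small program to read as it is for a person. The following sketch uses Python's standard `html.parser` module to pull the title out of a page; the class name `TitleFinder` and the helper `find_title` are made up for this illustration.

```python
from html.parser import HTMLParser

class TitleFinder(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # Keep only the text that appears while inside the title element.
        if self._in_title:
            self.title += data

def find_title(markup: str) -> str:
    finder = TitleFinder()
    finder.feed(markup)
    finder.close()
    return finder.title.strip()
```

Called on a page containing `<title>A page about me</title>`, `find_title` returns the text between the two tags.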
- Set the DTD and (X)HTML flavour your software uses before starting a new file
- Validate your code with W3C's online validator: http://validator.w3.org/
- Group or compress the (X)HTML and CSS files together in a folder so they do not become separated
- World Wide Web Consortium: http://www.w3.org/
Images (Raster Graphics)
Most of your images are probably raster graphics, as vector graphics are less common. Although there are hundreds of formats to choose from, two of the most popular are GIF (Graphics Interchange Format) and TIFF (Tagged Image File Format). Unfortunately these have been caught up in legal issues since 1994 over a Unisys patent on the LZW compression algorithm, which GIF uses and TIFF can use. These patents have now expired, although there is still an IBM patent valid until August 2006.
Thus the GIF and TIFF formats do not currently meet the requirement of being 'free from legal restriction in their use'.
The lossless PNG (Portable Network Graphic) format replaces GIF and has many advantages in quality, size and options (but lacks animation). Most importantly the PNG format is patent free and has been a W3C Recommendation and ISO Standard since 2003.
"...the JPEG committee have always tried to ensure in their standardisation work that the 'baseline' part of their standards should be implementable without payment of either royalty fees (volume related) or license fees (non-volume related)." JPEG Committee
...there are many patents associated with some optional features of JPEG, namely arithmetic coding and hierarchical storage. For this reason, these optional features should not be used for long-term storage of valuable images. W3C
Remember that JPEG is a lossy format, meaning that each time the image is modified and resaved there is some irretrievable data loss. Therefore it should be avoided for archival use unless shortage of space requires that lossy compression is used.
- Use the PNG format.
- Use the JPEG format, but avoid optional features.
- If your software supports it, you can save in TIFF format without compression.
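When reviewing an existing collection against these recommendations, note that a file's suffix does not always match its real format. Each of the formats discussed here begins with a well-known byte signature, so a short Python sketch can identify them. The function name `sniff_image_format` is made up for this example, and the list covers only the four formats mentioned in this section.

```python
# Leading byte signatures ("magic numbers") for the formats discussed above.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "PNG"),
    (b"\xff\xd8\xff", "JPEG"),
    (b"GIF87a", "GIF"),
    (b"GIF89a", "GIF"),
    (b"II*\x00", "TIFF"),   # little-endian TIFF
    (b"MM\x00*", "TIFF"),   # big-endian TIFF
]

def sniff_image_format(header: bytes) -> str:
    """Identify an image format from the first bytes of a file."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"
```

Passing in the first eight bytes of a file is enough for every signature in the list.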
Understanding Vector Formats
Vector Formats represent shapes by describing their geometric properties in points, lines, curves, and polygons. The difference between vector and raster images, and when to use each, is important, but outside the scope of this document.
Sadly there are no widely adopted standard formats for vector images. Part of the reason for this is the variety of uses for vector images. At one extreme of complexity is a DWG (pronounced drawing) file from AutoCAD, which can represent a multi-storey building in three dimensions. At the other extreme is an SVG (Scalable Vector Graphics) file for the elegant graphics used in digital cartoons.
So which format you use depends a lot on how long you will need the information to be stored and who will need to have access to it.
As a starting point here are some vector formats and things to consider:
|Architecture, engineering and construction||DWG||Proprietary format of AutoDesk. The Open Design Alliance asserts that DWG 2004 is partially encrypted.|
|DWG||Also known as OpenDWG. The Open Design Alliance's version of the DWG file format used by Autodesk. Published as an open standard.|
|DXF (Drawing Interchange Format, or Drawing Exchange Format)||Promoted by AutoDesk as the preferred format for interoperability with other CAD software. Partially documented with partial support for functions contained in DWG files. Its limitations are described in more detail in the white paper "Why Isn't DXF Good Enough?"|
|DGN (DesiGN file)||Also known as OpenDGN. The native format of MicroStation, a product of Bentley Corp. DGN is|
In 2001, the Scalable Vector Graphics (SVG) format became a W3C Recommendation.
Single layer images
Another application of 2D vector graphics is for images where the image itself is the final product. Commercial products doing this are Adobe Illustrator and Macromedia Flash.
Adobe uses its proprietary format AI (standing for 'Adobe Illustrator', one of its products). Progress is being made on a platform- and product-independent format called SVG (Scalable Vector Graphics). The SVG format has much of the functionality of Macromedia's SWF format plus many other features, including the ability to be searched for text by search engines. Although at the time of writing SVG was not a mainstream format, its use is growing and it seems likely to become a standard in wide use. Development of the SVG format is carried out by the W3C, and at present SVG version 1.0 is a W3C Recommendation.
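SVG is searchable because an SVG file is ordinary XML text. As a rough illustration, the Python sketch below parses a tiny hand-written SVG image with the standard library and lists the text it contains. The sample drawing and the function name `svg_text_content` are invented for this example.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A tiny hand-written SVG image: one circle and one text label.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" width="120" height="60">
  <circle cx="30" cy="30" r="20" fill="navy"/>
  <text x="60" y="35">Hello SVG</text>
</svg>"""

def svg_text_content(svg_source: str) -> list:
    """Return the text of every <text> element in an SVG document."""
    root = ET.fromstring(svg_source)
    # Elements are namespaced, so the tag is matched with the SVG namespace.
    return [el.text for el in root.iter("{%s}text" % SVG_NS) if el.text]
```

The same approach works on the shapes themselves, which is part of what makes the format attractive for long-term storage: any XML tool can read it.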
Multiple layer images
For print shops and graphic designers it is often necessary to store the original working files. These may contain many layers of different images, as well as some record of previous changes made to the file.
The main players in this field are the PhotoShop Document (.psd) format and Corel PhotoPaint. The main OpenSource rival is the GIMP Image File (.xcf) format. Because GIMP is an open source application the potential for its format becoming unreadable is low; however, the two applications should not really be compared directly.
TODO: Research availability and licensing of .psd and .xcf format specifications
Adobe Acrobat Reader includes support for SVG. Adobe also has a standalone SVG viewer which can be embedded easily into Internet Explorer browsers.
- Do not use vector graphics to store important information unless unavoidable.
- Keep original files on CD with copies in SVG format. Keep paper copies.
3D vector format
Three-dimensional vector files can be divided into two groups in the same way as 2D file formats. Programmes for producing 3D files include AutoCAD, ArchiCAD, 3DStudioMax, Rhino etc. The main openly published standard format is VRML (Virtual Reality Modeling Language). Another widely used format is DXF, which is touted by AutoDesk (the makers of AutoCAD) as the transportable file format for 3D drawings. The DXF format as implemented by AutoCAD is not an ideal solution, as outlined in Why Isn't DXF Good Enough? by the Open Design Alliance. It is, however, well supported by many programmes and is widely used to transport files from one programme to another.
- Do not use vector graphics to store important information unless unavoidable.
- Keep original files on CD with copies in VRML and/or DXF format. Keep paper copies.
Vector files (2D & 3D scalable line drawings)
Vector files present special challenges partly because they are used for everything from 3 dimensional models of aircraft to scalable desktop icons. So to talk about one format for all the different uses would be misleading. Instead we'll break vector file formats down by their purpose.
|File type||Reviewed formats|
|3 Dimensional models (architectural, animation, engineering...)||DWG, DXF, VRML, IFC|
|2 Dimensional drawings (architectural, engineering...)||SVG, DXF, DWG|
3 Dimensional models
3D computer graphics can model or represent almost anything found in the physical world, and much more. Because of the variety of uses, progress has been slow on adopting standard formats for exchanging and storing data. Many good programs can import and export in several proprietary formats, although the quality of the resulting file can be suspect because in many cases developers have had to experiment and guess their way into how the external format works.
In the fields of architecture, engineering and building, some formats (and families of formats) have begun to emerge as contenders for the role of an industry-wide standard. One such format is the Industry Foundation Classes format (IFC), which is now compulsory for state supported building projects in Denmark and many state supported Finnish projects.
Until the adoption of Object Class formats like IFC, the main contenders for being called a standard are the proprietary but open Drawing Exchange Format (DXF) from AutoDesk and the ISO standard Virtual Reality Modeling Language (VRML).
Because the use of AutoCAD (an AutoDesk product) is so widespread there has been a very successful attempt at supporting their format in other products. The Open Design Alliance has produced the commercially licensed OpenDWG (DWG format) as a format compatible with AutoCAD's own DWG format. Many competitors of AutoCAD now offer support for DWG via this format.
The OpenSource 3D modeling application Blender can export in both DXF and VRML.
2 Dimensional drawings
Two-dimensional vector graphic formats can be divided into two groups. Commercial products like AutoCAD and ArchiCAD use 2D (and 3D) vector information to make highly advanced, multiply layered drawings for architects and engineers.
In the graphics and to some extent the animation industry, formats like the proprietary Flash format and the open standard SVG are popular. Many Adobe programs use a variety of formats which could well be very good working files in the parent program, but can be a problem to open later, especially when a copy of that program is no longer available.
(this section needs expansion)
- 3 Dimensional Models: Keep the original working file and make copies in the DXF (to protect metadata) and VRML (to secure visual elements) formats.
- 2 Dimensional Drawings: Keep the original working file and make a copy in the DXF format or SVG format depending on which is best suited.
CAD Standards (not fully incorporated in this section yet)
Databases & Spreadsheets
Databases are inherently good for long term storage of information. Because of the way they are constructed it is generally easy to extract information and restructure it for storage elsewhere.
The backbone of standards compliant databases is Structured Query Language (SQL). SQL is not a format for storing database information. It is a format for storing the requests made to a database. In other words a database stores your information and SQL is the language for retrieving that information.
For a long time SQL was being developed in different places by different people and for some time the 1999 version has therefore been widely used as a safe bet. Now SQL:2003 is an ISO/IEC standard and very wisely there are "No changes or conformance requirements - Products conforming to Core SQL:1999 should conform automatically to SQL:2003".
Part of the beauty of the SQL standard is that you can extract your information together with the structural information needed to put that information into another database. The resulting file is often called an 'SQL dump'.
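As a small illustration of an SQL dump, the sketch below uses SQLite through Python's standard `sqlite3` module. SQLite is chosen only because it needs no server; it is an assumption of this example rather than one of the databases listed in this book, and those databases ship their own dump tools. The table and the helper names `dump_sql` and `restore_sql` are invented for the example.

```python
import sqlite3

def dump_sql(conn: sqlite3.Connection) -> str:
    """Return the whole database (structure and rows) as SQL text."""
    return "\n".join(conn.iterdump())

def restore_sql(dump: str) -> sqlite3.Connection:
    """Rebuild a fresh in-memory database from an SQL dump."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(dump)
    return conn

# Build a tiny example database (hypothetical table and data).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
source.execute(
    "INSERT INTO contacts (name, email) VALUES (?, ?)",
    ("Gallus", "gallus@example.org"),
)
source.commit()

dump = dump_sql(source)      # plain text: CREATE TABLE and INSERT statements
copy = restore_sql(dump)     # the same data, rebuilt from the text alone
```

The `dump` string is ordinary text, so it can be saved as a .sql or .txt file and kept alongside your other archived files.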
- Do not use MS Access; it is not a fully functional RDBMS and does not support standards-compliant SQL queries.
- Use a database which can import and export/dump in SQL
- Check that your SQL is standard SQL:1999 or SQL:2003
- Backup the information in your database regularly as a TXT file or SQL file
- Migrating from Microsoft Access to MySQL
- SQL: The Standard and the Language
- Databases in the Open Directory Project
- SQL:1999 validators
- SQL:2003 validator
- Recommended Data Formats for Preservation Purposes in the FCLA Digital Archive.
- Digital preservation: a time bomb for Digital Libraries - Margaret Hedstrom
- Digital Preservation in Wikipedia
- Guidelines for the Preservation of Digital Heritage (PDF) UNESCO, March 2003.
- European Interoperability Framework for pan-European eGovernment Services pdf(1449Kb), 2004. Includes the EU definition of Open Standards (p 9) and outlines reasons for giving strong consideration to OpenSource software (p 10).
- Public Sector Use of Open IT Standards and Open Source Software in the Norwegian public sector
- The Interoperability Framework is the Danish e-Government Interoperability Framework for exchange, storage and availability of electronic information
- Wikipedia:Comparison of document markup languages
- "Holding My Data Hostage: Why software licenses should not expire" article by Michael Herf 2001-05-08
- "Planning for longevity" article by Jack Ganssle 2004-07-01
- World Wide Web Consortium (W3C)
- Organization for the Advancement of Structured Information Standards (Oasis)
- International Organization for Standardization (ISO)
- American National Standards Institute (ANSI)
- PDF-Archive Committee
File types for further research
Feel free to expand this list.
- Contact lists (LDAP for external address books, LDIF - LDAP Data Interchange Format)
- Audio files (Ogg Vorbis) and the problems with mp3 as proprietary format Fraunhofer Patents
- Video files: Moving Picture Experts Group MPEG-2 & MPEG-4 include Fraunhofer Patents
- 3D Modeling: ASCII Alias/Wavefront OBJ. What is SQL DDL?
- Financial records
- Calendars: ScheduleWorld
- Diagrams Dia over Microsoft Visio
File formats for portability
- Addressbook information: LDIF