Programming Language Concepts Using C and C++/Data Level Structure

From Wikibooks, open books for an open world
Jump to navigation Jump to search

In this chapter, we will start with defining properties common to all data items usable in a programming context and then move on to classifying data according to their structure and type. While doing so, we will also try to give an idea of how they can be laid out in memory.

General Properties[edit | edit source]

Mutability[edit | edit source]

Immutable Data[edit | edit source]

Constant is a data item that remains unchanged throughout its lifetime. A constant may be used literally or may be named. Named constants are sometimes termed symbolic constants or figurative constants.

Literal and named constants.
3, '5', .TRUE., "String literal" are examples to literal constants, whereas const double pi = 3.141592654; is the C++ definition of [an approximation to] Π as a named constant.

Some programming languages make a distinction between constants whose values are determined at compile-time and those whose values are determined at run-time. In C#, for example, the former is tagged with the keyword const while the latter with readonly. In Java, constancy of a field (or local identifier) is flagged with the final keyword and presence or absence of the static modifier classifies the associated data item to be a compile-time or run-time constant, respectively.

Example: Compile-time constants in C#.

public class Math { ... public const /* static final in Java */ double pi = 3.141592654; public const double e = 2.718281; ... }

Here the values for pi and e (Euler constant) can be determined even before the program starts to run. So they are defined to be const.

Note that same definitions are valid for all instances, if there is any, of the Math class. That is, the value of pi or e does not change from one instance to another. As a matter of fact, they exist independently of the instances as class fields. In other words, they are [implicitly] static. Explicit use of static together with const in C++, which is a source of inspiration for C#, is a manifestation of this.

Example: Run-time constants in C#.

public class Citizen { ... Citizen (..., long SSNoftheNewCitizen, ...) { ... SSN = SSNoftheNewCitizen; ... } ... private readonly long SSN; /* private final long SSN in Java*/ } // end of class Citizen

Here, value for SSN cannot be determined before the related Citizen object is created. Once the object is created, its SSN field never changes during the lifetime of the object. So, we conclude that the only possible place for the initialization is the constructor. This value can be either supplied as an argument to the constructor or computed using other attributes of the object and/or class.

Observe that values of such a field may change from one instance to another. In our example this is the expected behavior: any two Citizen objects must have different SSN values.

Mutable Data[edit | edit source]

A variable provides us with named memory storage that we can write to, retrieve, and manipulate throughout the course of our program.

There are two values associated with a variable:

  1. Its data value, stored at some memory address. This is sometimes referred to as an object’s rvalue (pronounced "are-value"). You might think of it as meaning read (or right) value. Both a constant and a variable can serve as an rvalue.
  2. Its address value–that is, its address in memory within which its data value is stored. This is sometimes referred to as an object’s lvalue (pronounced "ell-value"). You might think of it as meaning location (or left) value. Note that a constant cannot serve as an lvalue. (Because it cannot appear on the left-hand side of an assignment.)

Values may be provided for variables in one of three ways:

  1. By initialization at the time of creation. This can be user-provided or a default value assigned by the language. An example to the second method is the Java programming language.
  2. By an input operation.
  3. By an assignment statement.

Act of relating an identifier–be that a variable, constant, or a function–with its lvalue is called binding and can be established in two different ways. In cases when the relation between the identifier and the lvalue is one-to-one, binding takes place before run-time and therefore is referred to as static binding. An example to this type of binding is that of a global variable where there is only one instance of the identifier. In other cases when the relation is one-to-many, binding takes place at run-time and is called as dynamic binding. An example is a local variable whose instances, zero or more, are allocated anew each time a function is called.

Visibility[edit | edit source]

We can also speak of scope of an identifier. Scope can be defined as the portion of the program over which the declaration of an identifier is effective. An identifier with an effective declaration is said to be in scope, while it would otherwise be referred to as out-of scope.

The scope rules of a language determine the treatment of references to non-local identifiers. A common rule used by most programming languages, called the lexical- or static-scope rule, determines the declaration that applies to an identifier by examining the program text alone. An alternative rule, called the dynamic-scope rule, determines the declaration applicable to an identifier at run time, by considering the current subprogram activations.

An identifier being in scope does not mean that it is visible. A declaration of an identifier is visible in some context if a use of the identifier in that context will be bound to the declaration. That is, the identifier will be associated with that declaration. A declaration might be visible throughout its scope, but it may also be hidden by other declarations whose scope and visibility overlap that of the first declaration.

Example: Scope and visibility in C.

... { int i1, i2; ... { double i1; double d; ... } /* end of inner block */ ... } /* end of outer block */ ...

In the above code fragment, scopes of variables (identifiers) are:

i1 (of outer block), i2: from its declaration point to the end of the outer block
i1 (of inner block), d: from its declaration point to the end of the inner block

Visibilities of the variables are:

i2, i1 (of inner block), d: same with their scope
i1 (of outer block): its scope minus the inner block

Accessibility[edit | edit source]

A notion found in modular and object-oriented languages, accessibility constraints can be applied on data. This is done for two purposes:

  1. To preserve the consistency of the data and
  2. To give the implementer the freedom of changing the implementation details.
Example: Access constraints in Java.

public class Stack { ... private int _top; private Object[] _contents; } // end of class Stack

Declaring the implementation details in the above class definition as private marks them off-limits to the user, which means the below statements are not permitted.

stk._top = 0; stk._contents[7] = new Integer(5);

Issuing the first statement would probably cause a nonempty Stack object to look empty. Similarly, the second statement would probably populate the Stack object by adding a new element to some location other than that indicated by _top, which is definitely against the definition of a stack.

Additionally, such a definition enables the implementer of the Stack class to change the implementation details. For instance, she may choose to use a Vector as the underlying data structure. Now that it is not known to the outside world, the users of the Stack class will not be affected by this decision.

Data Categories[edit | edit source]

Data Elements[edit | edit source]

The most basic data entity is the data element. These data entities may be grouped together to form structures.

Primitive Data Elements[edit | edit source]

Primitive data elements are those that can be directly operated on by machine-language instructions and are broadly divided into numeric, character, logical (boolean). Numeric data elements can further be divided into integers and reals.

Example: Primitive data elements in C/C++.

In C/C++, char, int, short, long, and long long are used for representing integers. char can also be interpreted as holding a single-byte character. float, double, and long double represent floating-point single, double, and extended precision values. bool is used to represent boolean values.[1]

Data Elements that Are Addresses[edit | edit source]

An address is a value that indicates a location in the process image created as a result of running the program. Depending on the program segment it points to, an address falls in either one of two groups.

  1. Label is actually the address of a program statement, an address in the code segment of a program. It may also be thought of as a data element that may be operated upon by a goto operation.
  2. Pointers reference or point to other elements in the code and data segments. In the former case, a pointer indicating the start of a subprogram can be used to invoke the subprogram dynamically, which enables implementation of polymorphic subprograms by means of callbacks. Pointers into the data segment often refer to unnamed data elements. They are heavily used in constructing dynamic data structures and recursive algorithms. Handles, which may be considered as intelligent pointers, serve the same purpose.

Compound Data Elements[edit | edit source]

Strings of characters, generally referred to as simply strings, are linear sequences of characters. They are sometimes considered as primitive data elements, because they are directly operated on by machine language instructions, and sometimes classified as data structures.

Structures[edit | edit source]

Data Structures[edit | edit source]

Data structures' are organized collections of data elements that are subject to certain allowable operations. Data structures are logical entities in the sense that they are created by programmers and are operated on by high-level programs. These may have little bearing to the physical entities, that is, the storage structures operated on by machine language code.

Data structures may be classified as:

  • Linear vs. nonlinear: A linear, as opposed to nonlinear, data structure is one in which the individual components are an ordered sequence. Examples to linear data structures are strings, arrays, and lists. Examples to nonlinear data structures include trees, graphs, and sets.
  • Static vs. dynamic: A static structure is one that has no capacity for change, specifically with regard to its size, during the course of execution. Arrays and records are examples to static data structures. An example to dynamic data structures is lists.

Storage (Memory) Structures[edit | edit source]

Storage structures are data structures after they have been mapped to memory. While the data structure is the logical organization of your data, the storage structure represents the way in which your data is physically stored in memory during the execution of your program.

Possible layouts for multi-dimensional arrays.
Suppose you are working with a two-dimensional array. Your array, say n-by-m, is not actually stored in two dimensions in the memory, but as a linear sequence of elements. In row-major order, the array is stored as follows:
A(1, 1), A(1, 2), ..., A(1, m), A(2, 1), ..., A(n, 1), A(n, 2), ..., A(n, m)

In column-major order, same array is laid out in memory as follows:

A(1, 1), A(2, 1), ..., A(n, 1), A(1, 2), ..., A(1, m), A(2, m), ..., A(n, m)

Storage can be allocated in two ways:

  • Sequential: A structure allocated this way may also be called a static structure since it is incapable of change throughout its lifetime. Such structures use implicit ordering: components are ordered by virtue of their sequential ordering in the structure (indexing).
  • Linked: A structure allocated this way may also be called a dynamic structure since the data structure can grow and shrink during its lifetime. They use explicit ordering: each component contains within itself the address of the next item so that it is in effect "pointing" to its own successor.

As for the pros and cons of each method:

  • In static structures, full storage remains allocated during the entire lifetime of the structure. Additionally, in order to avoid overflow, maximum theoretical size is used in the declaration of the structure. For these reasons, static structures are not memory efficient. In dynamic structures, there is a space overhead due to the pointer field(s).
  • Due to the possibility of shifting, which can be avoided at the expense of some extra memory, insertions into and deletions from any position other than the end of a static structure are expensive. Such is not the case for dynamic structures.
  • Using the index value direct access is possible in static structures, which means accessing any item in the structure takes constant time. This, however, is not valid for dynamic structures. This weakness, however, can be alleviated by imposing a hierarchical structure on the data.

File Structures[edit | edit source]

File structures refer to data residing in secondary storage. When program execution terminates, file structures are the only structures to survive the termination of the program.

The data hierarchy refers to the logical organization of data that is probably stored on secondary or external storage media such as magnetic tape.

A file is a collection of related records, related to a particular application. A record is a collection of related data items, or fields, related to a single object of processing. A field is a data item, a piece of information (either a data element or a structured data item) that is contained within the record.

A file can be accessed for input, output, input/output, and append. It can be processed in two different modes: batch mode, query mode. In batch mode, the component records of the file are operated on in sequential order. Report generation for test results of a class is an example to this type of processing. In query mode individual records are manipulated by accessing them directly. Retrieving the record of one single student falls into this category.

Another important issue is the organization of the file on secondary storage. This (physical) ordering of records can be done in three ways:

  • Sequential: The file is seen as a linear sequence of records. Such files cannot be accessed in input/output access mode. Text files are typical examples of sequential files.
  • Relative: Each record in the file can be accessed directly by location. Naturally, such organization becomes possible only when the file is stored on a direct-access storage device (DASD). The mapping between the key field in the record and the location on disk can be done in two ways: direct-mapping and hashing.
  • Indexed sequential: This is a compromise between the first two methods. Files are stored sequentially on a DASD but there is also an index file that allows optimum direct access by way of a search on index.

Each file organization technique must be evaluated on the basis of:

  • Access time: The time it takes to find a particular data item.
  • Insertion time: The time it takes to insert a new data item. This includes the time it takes to find the correct place to insert the new data item as well as the time it takes to update the index structure.
  • Deletion time: The time it takes to delete a data item. This includes the time it takes to find the item to be deleted as well as the time it takes to update the index structure.
  • Space overhead: The additional space occupied by an index structure.

These in turn are affected by three factors. The system must first move the head to the appropriate track or cylinder. This head movement is called a seek and the time to complete is seek time. Once the head is at the right track, it must wait until the desired block rotates under the read-write head. This delay is called latency time. Finally, the actual transfer of data between the disk and main memory can take place. This last part is transfer time.

Typical operations on files are: open, close, read, write, EOF, and maintenance operations such as sorting, merging, updating, and backup.

Data Types[edit | edit source]

A data type, usually referred to as simply type, is composed of a domain of data elements and a set of operations that act on those elements–operations that can construct, destroy, or modify instances of those data elements.

In addition to built-in types, which can be structured as well as primitive data types, most languages have facilities for the definition of new data types by the user. [ALGOL68 and Pascal were the first programming languages to provide this.]

A type system is a facility for defining new types and for declaring variables to be of such types. A type system may also have the capability for type checking, which may be static or dynamic depending on whether this checking is done at compile time or during execution.

Example: (Restricted) type system of C.
Basic types: char, short, int, long, float, double

Type constructors: *, [], (), struct, union

typedef char Performance[2]; typedef char* String; typedef struct { String first_name; String last_name; Performance grades; } Student; typedef void (*EVAL_PERF) (Student* Class); ...

A strongly typed programming language is one in which the types of all variables are determined at compile time. Programs written in such a language, which is said to have static typing, must explicitly declare all programmer-defined words. Storage requirements for global and local variables are determined completely during compile time. A strongly typed programming language may include a typing facility for defining new types.

With dynamic typing, variables are not bound to type at compile time. Languages that allow for dynamic typing of variables, which are also classified as weakly typed, may utilize dynamic type checking, which requires the presence of run-time routines to check types and perform correct operations such as those involving storage mapping and code selection. One may also say that types are associated with values, not variables.

Example: Dynamic typing in Scheme.

> (define x 3) ; x denotes a value of type number
> (+ x 1)
4
> (number? x) ; Is x a number? –Yes.
#t
> (set! x "Nermin") ; x now denotes a value of type string
> (string-length x)
6
> (number? x) ; Is x a number? –Not anymore!
#f
> (string? x) ; Is x a string? –Yes.
#t
>_

Relationships between Data Types[edit | edit source]

Equivalence[edit | edit source]

In an assignment statement, like many other programming constructs, a compiler not only checks the syntactic correctness but also tests the semantic validity. For instance, it would not be permitted to assign a string value to an integer variable; the compiler would enforce some sort of assignment-compatibility rule. An important ingredient of this process is finding out whether types of the expression and the variable are equivalent.

This can be done in two different ways: structural equivalence and name equivalence. In the former, two types are equivalent if they are made up of the same underlying data types.

Example: Structural equivalence in pseudo-C.

struct S1 { int i; double d; char c; }; struct S2 { int f1; double f2; char f3; };

S1 and S2 are deemed to be equivalent.

In name equivalence, two types are equal only if their names are the same. This is the approach taken by most Algol-based languages.

Example: Name equivalence in Pascal.

type arrtype1: array [1..10] of integer; arrtype2: array [1..10] of integer; var v1, v2: arrtype1; v3: arrtype2;

In the above fragment, v1 and v2 have equivalent types, while type of v3 is different than that of these variables.

Extension[edit | edit source]

A type is said to extend another if all of its instances can be seen as instances of the type it extends. The base type, the one being extended, is called a generalization of the extended type, while extended type is said to be a specialization of its base type.

Thanks to the nature of the relation, an expression of extended type is assignment-compatible with a variable of base-type.

The most popular technique used to provide this relation is inheritance common to most object-oriented programming languages.

Implementation[edit | edit source]

A type is said to implement another if it provides an implementation for all operations listed in the interface of the implemented type.

Implementation can be seen as a special case of extension, where the base type does not provide any implementation of its operations.

In many object-oriented programming languages, such a relation is found between an interface and its implementing class.

Declarations[edit | edit source]

In languages with static type checking, the program must somehow communicate the types of the identifiers it uses. Complete knowledge of identifier types at compile time leads to a more efficient program because of the following reasons.

More efficient allocation of storage. For instance, all integer types can be stored similarly using the largest such type. But, if you know the exact type, you don’t have to allocate the largest size. More efficient routines at run time. A + B is handled differently depending upon whether A and B are integers or real numbers Compile-time checks. Many invalid uses of the programming constructs are spotted before the program even starts to run.

On the other hand, at the expense of ensuring type safety compilers may at times reject valid programs.

Identifier types can be communicated in two ways.

Explicit Type Declarations[edit | edit source]

Widely preferred over the alternative, the programmer uses declarative statements to communicate the types of variables, functions, and so on. Some programming languages, such as ML and Haskell, do not require the programmer to provide type declarations for all identifiers. Through a process called type inference, compiler does its best to figure out the types of expressions.

Implicit Type Declarations[edit | edit source]

In (some versions of) languages like FORTRAN and BASIC, the way a variable is named reveals its type.

Implicit type declarations in Fortran.
In versions of FORTRAN before FORTRAN 90, an identifier beginning with I-N is taken to be an integer, and an identifier beginning with any other letter is taken to be real.

Scalar Data Types[edit | edit source]

A scalar data type has a domain composed only of individual primitive data elements.

Numeric Types[edit | edit source]

Numeric types are related to or represent quantities in the outside world. However, whereas in real life these types may have infinite domains, in the world of computers their domains are finite.

Numeric types.
Integer, floating-point, fixed-point, and complex numbers.

Logical (Boolean) Type[edit | edit source]

Variables of such a type can take on only two values, true or false, which may be represented in the machine as 0 and 1, zero and non-zero. Typical operations on booleans are and, or, not.

Pointer Type[edit | edit source]

A pointer is a reference to an object or data element. A pointer variable is an identifier whose value is a reference to an object.

Pointers are important for the dynamic allocation of a previously undetermined amount of data. Additionally, they permit the same data to reside on different structures simultaneously. In other words, they make sharing of data possible.

Pointer variables point to and provide the means for accessing unnamed, or anonymous variables. Consequently, operations on pointers must distinguish between operations on the pointer variable itself and operations on the quantity to which the pointer is pointing.

Example: Pointers in C/C++

int num1, num2, *pnum; ... ... pnum = &num1; num1 = 15; num2 = *pnum; ...

Memory layout after the last assignment
Memory layout after line 6

pnum in the above example is used to access a location holding an int value. As a matter of fact, it can hold a reference to ints only. This is the most important difference between a pointer and a plain address: while an address can be used to refer to values of any type, a pointer holds a reference to a specific data type. However, a precious tool in implementing generic collections in C, type agnosticism of addresses can be reclaimed by disciplined use of void *.

In C, where pointers are heavily used, advantages may turn into maintenance nightmares. Following program is an example to this.

Example: Pointer pitfall in C.

#include <stdio.h> int main(void) { int ch = 65; int* p2i; const int *p2ci = &ch; p2i = p2ci; /* !!! */ printf("p2ci: %i\tp2i: %i\n", *p2ci, *p2i); *p2i = ch++; /* !!! */ printf("p2ci: %i\tp2i: %i\n", *p2ci, *p2i); exit(0); } /* end of int main(void) */

When we compile and run this program, it will produce the following output.

p2ci: 65 p2i: 65
p2ci: 66 p2i: 66

This is a violation of the contract we have made. On line 6, we guarantee that the content of the location pointed to by p2ci will not change. On line 8, we assign p2ci to p2i, which is another pointer that lets itself be updated. We later go on to change the value found at the location through the non-constant pointer p2i, which means the value pointed to by p2ci is also changed.

The compiler may issue a warning for this error, which is the case in the GNU C compiler. But you cannot rely on this: not all compilers will issue a warning. Sometimes, the programmer will turn off the warnings to avoid reading annoying messages and the message will go unnoticed.

Structured Data Types[edit | edit source]

A structured data type has domain elements that are themselves composed of other scalar or structured type elements.

Strings[edit | edit source]

Strings are ordered set of data elements of dynamically changing size.

Two attributes can be associated with a string: type and length. Type refers to the domain of the individual elements while length refers to the number of elements. (Note that length and size are two different things.)

The type attribute is generally character although bit strings are also commonly used for implementing sets.

Example: Character strings in C/C++.

char *name = "Atilla";

Memory layout of name

Another possible representation of the same string in a different language would be:

Memory layout of a Pascal-style character String

Note that number of bytes reserved for the length prefix can change from implementation to implementation. Another point to keep in mind: ASCII is not without competition. Alternatives include Unicode and ISO/IEC-8859 series of ASCII-based encodings. In the case of Unicode, each character takes up two bytes in memory.[2]

Example: BSTR type used in [OLE] Automation.[3]

CString name = _T("Atilla"); BSTR bstrName = name.AllocSysString();

Memory layout of bstrName

As hinted by the above figure, BSTR is used for exchanging character string data between components written possibly using different languages. As a matter of fact the length prefix is inherited from Visual Basic while the terminating null character is taken from C.

Note that length prefix holds the number of bytes the character string proper occupies and the bstrName identifier is actually a pointer to the first character.

Typical operations on strings are:

  • Concatenation creates a string from smaller strings.
  • Substring creates a string from a subsequence of another string.
  • Index tests for containment of a smaller string in a larger string. It returns the index value where the subsequence containing the smaller string starts.
  • Length returns the number of components in the string.
  • Insert, delete, replace, ...

Arrays[edit | edit source]

An array may be defined as a fixed-size, ordered set of homogeneous data. It is stored in contiguous memory locations and this makes direct access possible. Its fixed-size makes an array a static structure.

In some programming languages such as Standard Pascal, size must be known at compile time while in many this size can also be given at run time. But, even in the latter case, size of the array does not change during its lifetime.

Example: Static nature of arrays in Standard Pascal.

program primes(input, output); const N = 1000; var a: array [1..N] of boolean; ... begin ... end

This Pascal fragment must be recompiled and run for a different value of N.
Example: Open arrays in Oberon.

PROCEDURE SetZero (VAR v: ARRAY OF REAL); VAR j: INTEGER; BEGIN j := 0; WHILE j < LEN(v) DO v[j] := 0; INC(j) END END SetZero;

In the above Oberon fragment, any actual (one dimensional) array parameter with element type REAL is compatible with v. The subprogram can be called with a one-dimensional array of any size. But, once the array is passed as an argument its size cannot change.

Among the important attributes of an array are component type, dimensionality of the array, and size in each dimension.

Arrays can be represented in the memory in two different ways:

  1. Row-major (Almost all major programming languages)
  2. Column-major (FORTRAN)

In order to minimize the access time to individual array components, we should make sure that fastest changing indexes in our programs (that is, the innermost loop variables) correspond to the fastest changing indexes in the memory layout. If we ignore the loop ordering, with large multidimensional arrays, the virtual memory performance may suffer badly.

Example: Multidimensional array usage in Pascal.

var a: array [3..5, 1..2, 1..4] of integer; ... for i := 3 to 5 do for j := 1 to 2 do for k := 1 to 4 do {some processing done using a[i, j, k]}

This program fragment shows the accurate loop order for a language that represents arrays using row-major representation. Note that the innermost loop variable (the fastest changing one) in the fragment corresponds to the fastest changing index in the memory layout. This correspondence must be extended to other loop variables and indexes, i.e. the second innermost loop variable must correspond to the second fastest changing index in the memory and ... .

Example: Multidimensional array usage in Fortran.

DO k = 1, 4 DO j = 1, 2 DO i = 3, 5 {some processing done using a(i, j, k)} END DO END DO END DO

This second fragment gives the correct loop order for a language with column-major representation.

Two ways of representing arrays in memory

It should be noted that header information included in the array representation is not standard and may vary or even disappear in some programming languages. Most notable example to this is Java, in which size of the array is not used in type checking. This means one can use an array handle to manipulate arrays of different sizes.

Example: Type-compatibility of arrays in Java.

int[] intArray = new int[10]; ... // use intArray as an array of 10 ints intArray = new int[20]; // OK! ... // use intArray as an array of 20 ints

This may at first seem to contradict the static nature of arrays. After all, size of intArray has been changed from 10 to 20. Not really! What we have done in the previous code fragment is to make intArray indicate two different array objects in the heap. In other words, we have changed the array handle, not the array object itself.

Determining the address of an element in a Pascal-style array.
Given the pseudo-Pascal declaration
var a: array [5..10, 0..3, -2..2] of integer;

calculate the address of a[7, 2, 1] assuming a) column-major representation and b) row-major representation. For both cases, assume that base of a is 200 and an integer is represented in four bytes.

a) (10 – 5 + 1) * (3 – 0 + 1) * (1 – (-2 )) + 1 gives us the order of the component at [5, 0, 1]. We need (10 – 5 + 1) * (2 – 0) more to get to [5, 2, 1]. (7 – 5) more and we reach [7, 2, 1]. So, [7, 2, 1] is the (6 * 4 * 3 + 1) + (6 * 2) + 2 = 87th component. It can be found at address 200 + (87 – 1) * 4 = 544.
The above calculation can be generalized as follows:


In C-based programming languages, where the lower bound is always taken to be 0, this formula can be simplified to:


Memory address of a particular component is given as follows:[4]


b) (2 – (-2) + 1) * (3 – 0 + 1) * (7 – 5) + 1 gives us the order of the component at [7, 0, -2]. (2 – (-2) + 1) * (2 – 0) more and we are at (7, 2, -2). [7, 2, 1] is (1 – (-2)) location after [7, 2, -2]. So, [7, 2, 1] is the (5 * 4 * 2 + 1) + (5 * 2) + 3 = 54th component. Its address is 200 + (54 – 1) * 4 = 412.
After generalization we have:


For C-based programming languages this is simplified to:

Determining the address of an element in a C-style array.
Given the pseudo-C declaration
double a[13][10][9];

calculate the address of a[7][6][8] assuming a) column-major representation and b) row-major representation. For both cases, assume that base of a is 200 and a double is represented in eight bytes.

a)



b)


Multidimensional Arrays[edit | edit source]

Support for multidimensional arrays can be provided in two ways: jagged arrays—sometimes referred to as ragged arrays—and rectangular arrays.[5]

Example: Jagged arrays in Java.

int[][] numArr = { {1, 2, 3}, {4, 5, 6, 7}, {8, 9} };

Jagged arrays Memory layout of a jagged array

What we have here is actually an array of arrays. Since different arrays can possibly have differing component counts, sub-arrays in our example can and do have different lengths. Same array can be formed using the following code, which reflects the way arrays are treated in Java.

int[][] numArr = new int[3][]; numArr[0] = new int[3]; for (int i = 0; i < numArr[0].length; i++) numArr[0][i] = i + 1; numArr[2] = new int[2]; for (int i = 0; i < numArr[2].length; i++) numArr[2][i] = i + 8; numArr[1] = new int[4]; for (int i = 0; i < numArr[1].length; i++) numArr[1][i] = i + 4;

Observe there are four array allocations involved: numArr, numArr[0], numArr[1], and numArr[2]. These allocations need not be done in a specific order and can be interspersed in the code. The only restriction is that sub-arrays cannot be allocated before the containing array. That is, numArr must be allocated before the others. This leads us to the conclusion, which is also reflected in the layout given before, that Java-style multidimensional arrays are not necessarily allocated in contiguous memory locations.[6]

Example: Rectangular arrays in C#.
int[,]4 numMatrix = { {1, 2}, {3, 4}, {5, 6} };
Here, we have a 3-by-2 matrix. It’s also guaranteed that the entire matrix is allocated in contiguous memory locations.
Associative Arrays[edit | edit source]

Array index is usually of a scalar type although languages such as Perl and Tcl provide for non-scalar indexes through the use of hashing.5 Such arrays are called associative arrays.

Example: Associative arrays in Perl.

$dictionary{'word'} = 'sözcük, kelime'; $grade{'Emrah'} = 90; $dictionary{'sentence'} = 'tümce, cümle';

Thanks to the operator name overloading facility, languages such as C++ and C# incorporate such a facility by means of a class that overloads the subscript operator ([]).
Example: Associative arrays in C#.

HashTable dictionary = new HashTable(); ... dictionary["word"] = "sözcük, kelime"; dictionary["sentence"] = "tümce, cümle"; ... Console.Write("sentence in Turkish is {0}", dictionary["sentence"]); // will print "sentence in Turkish is tümce, cümle" on the standard output

In languages that lack operator name overloading, such as Java, one has to resort to using the relevant class with its messages.

Example: Associative arrays in Java.

Map dictionary = new HashMap(); ... dictionary.put("word", "sözcük, kelime"); dictionary.put("sentence", "tümce, cümle"); ... System.out.print("Sentence in Turkish " + dictionary.get("sentence")); // will print "sentence in Turkish is tümce, cümle" on the standard output

Lists[edit | edit source]

Lists are generalized, dynamic, linear data structures. A (linked) list is a set of items organized sequentially, just like an array. In an array, the sequential organization is provided implicitly (by the position in the array); in a list, we use an explicit arrangement in which each item is part of a node that also contains a link to the next node.

Example: Linked lists in Pascal.

type link = ^node; node = record key : integer; next : link end; var head : link; ...

Depending on how links are provided, we have:

  • Singly linked lists: These provide us only with a forward pointer pointing to the next node in the list.
  • Doubly linked lists: These provide us with two pointers, one pointing to the next and one pointing to the previous node in the list.

Another classification is possible with how the first and last nodes in the list are related:

  • Circular lists: In circular lists, the first and last nodes in the list are connected with link(s). In a circular doubly linked list, the next link of the last node points to the first node and previous link of the first node points to the last node of the list. In the case of a circular singly linked list, there is no pointer from the first node to the last one.
  • Linear lists: In linear lists, there are no links between the first and last nodes. The next field of the last node and the previous field of the first node point nowhere, i.e. they are null.

In the implementation of a list, one can make use of a header and/or a dummy end node.

Operations on lists are:

  • Creation/destruction operations.
  • Insert: Before/after a node with a certain key value; into the end, at the beginning.
  • Remove: A component with a certain key value; the first, last component.
  • Search: A component with a certain key value.
  • Empty: Test whether or not the list is empty.

Some commonly used, specialized (restricted) forms of lists are stack (LIFO), queue (FIFO), deque (Double-ended queue), output-restricted deque, and input-restricted deque.

Multilists[edit | edit source]

Multilists are similar to lists. The only difference is that nodes reside on more than one list simultaneously.

Example: Pascal representation of a sparse matrix using a multilist.

type link = ^node; node = record row_no, column_no : integer; key : integer; next_row, next_column : link end;

Multilists

Dynamic Arrays[edit | edit source]

A cross between arrays and lists, dynamic arrays can grow or shrink their size in the run-time. Known to the Java-world as Vector and C#-world as ArrayList, a dynamic array maintains an internal, heap-allocated array and replaces this with a larger (smaller) one as need arises.

Records[edit | edit source]

A record as a logical data structure is a fixed-size, ordered collection of possibly heterogeneous components that may themselves be structured. It may also be called the hierarchical or structured type. Record components, often called fields, are accessed by name rather than by subscript.

Example: Record type definition in COBOL.

01 STUDENT-RECORD. 02 NAME. 03 FIRST-NAME PIC X(15). 03 MIDDLE-INITIAL PIC X. 03 LAST-NAME PIC X(15). 02 STUDENT-NO PIC 9(9). 02 TEST-SCORES. 03 MIDTERM PIC 9(3).99. 03 FINAL PIC 999.99. 02 ASSIGNMENTS OCCURS 5 TIMES PICTURE IS 9(3).9(2).

Variant Records[edit | edit source]

Some languages allow us to have variations of a record. Such a record with variations is called a variant record. This means that there will be some fields that are common to all variations and some fields that are unique to each variation.

Example: Variant records in C.

enum shape_tag {Circle, Square}; struct SHAPE { float area; enum shape_tag tag; union { float radius; float edge_length; }; }; typedef struct SHAPE shape; shape s; ... ... s.tag = Square; or s.tag = Circle; s.edge_length = 3.5; s.radius = 5; ... ... void print_areas(shape s) { switch (s.tag) { case Circle: printf("Area of the circle: %f\n", PI * s.radius * s.radius); break; case Square: printf("Area of the square: %f\n", s.edge_length * s.edge_length); } /* end of switch statement */ } /* end of void print_areas(shape) */

Note that, with the introduction of the type extension facility, variant record has now become an obsolete feature. With the exception of languages like C++, which keep it for C compatibility, object-oriented programming languages do not include variant records in their arsenal of tools.

Variable-length Records[edit | edit source]

Variable-length records are similar to variant records. In this case the variation involves the number of a repeated field.

Example: Variable-length records in COBOL.

01 CLASS. 03 NO-OF-STUDENTS PIC 99. 03 STUDENTS OCCURS 70 TIMES DEPENDING ON NO-OF-STUDENTS. 05 FIRST-NAME PIC X(15). 05 MIDDLE-INITIAL PIC X. 05 LAST-NAME PIC X(15). 05 TEST-SCORE PIC 999.99 OCCURS 3 TIMES.

Some programming languages provide tuples, which are actually records whose fields are implicitly named by their position.

Example: Tuples in ML.

val t = (4, 5.0, "six"); val t = (4, 5.0, "six") : int * real * string #1(t); val it = 4 : int #2(t); val it = 5.0 : real #3(t); val it = "six" : string

Before we see examples of how records are laid out in the memory, let’s look at a rather important issue affecting it: alignment requirements.

Alignment Requirements[edit | edit source]

For reasons of efficiency, some architectures do not let a data element start at an address that is not a multiple of the data element’s size. As a result of this alignment, it takes fewer memory accesses to fetch the data required in the program. This improvement on the running speed, however, comes at the expense of more memory.

It should be noted that the alignment scheme offered in the following examples is not the only one. Programming environments may in some way let you alter the way data is aligned. As a matter of fact, programming environments built on top of the Intel architecture may even let you turn on and off alignment.

Alignment requirement of double
An IEEE754 double precision floating point number can be represented in 8 bytes. A data type compatible with this specification and implemented on an architecture with an alignment requirement stated as above cannot start at memory location, say, 4 or 6. It should start at locations whose addresses are divisible by 8. So, it can start at 0, 8, ..., 33272, ... .

The alignment requirement for a record type will be at least as stringent as for the component having the most stringent requirements. The record must terminate on the same alignment boundary on which it started.

Example: Alignment in records.

struct S { double value; char name[10]; };

Alignment requirements in records

struct S { char name[10]; double value; };

Alignment requirements in records

Realize that alignment requirement of an array is equal to the alignment requirement of its component. Whether the array size is 1 or a million bytes, its alignment requirement is always the same. This is due to the fact that an array declaration is shorthand for defining multiple variables. The above fragment should actually be considered as given below:

struct S { double value; char name0, name1, name2, ..., name9; };

So, alignment requirement of our structure definitions is equal to that of value: 8.

Example: Alignment requirements in variant records.

type tagtype = (first, second); vtype = record f1 : integer; f2 : real; case c : tagtype of first : (f3, f4 : integer); second : (f5, f6 : real) end; var v : vtype;

Alignment requirement in variant records

If we change the selector of the variant part to case tagtype of we have

Alignment requirement in variant records (with no selector type)

Depending on the architecture, this saves us 4 to 8 bytes for each record. But it leaves us without type safety.

Sets[edit | edit source]

A set is a nonlinear structure containing an unordered collection of distinct values. Typical operations on sets are insertion of an individual component, deletion of an individual component, union, intersection, difference, and membership.

Example: Sets in Pascal.

type days = (sun, mon, tues, wed, thur, fri, sat); dayset = set of days var weekdays, weekend: dayset; ... begin weekdays := [mon, tues, wed, thur, fri]; weekend := [sun, sat]; ... ... end.

Base types in sets are restricted to scalar types and, due to the storage requirements; the number of potential members is severely limited.

Trees[edit | edit source]

A tree is a nonempty collection of nodes and edges that satisfies certain requirements. A node is a simple object containing link(s) to other nodes and some value; a link to another node is called an edge.

In a tree,

  • Predecessor of a node is called its parent.
  • Successor(s) of a node is (are) called its child(ren).
  • Nodes without successors are called leaves.
  • The node without a predecessor is called the root.
  • There are no unreachable nodes from the root.
  • There is only one path between the root and some node.

Trees are encountered frequently in everyday life. For example, many people keep track of their ancestors and/ or descendants with a family tree. As a matter of fact, much of the terminology derives from this usage. Another example is found in the organization of sports tournaments.

Example: Representing binary trees in Pascal.

type link = ^node; node = record info: char; left_child, right_child: link end;

Graphs[edit | edit source]

Similar to a tree, a graph is a collection of nodes and edges. Unlike trees, graphs do not have the notion of a root, a leaf, a parent, or a child. A node, also called vertex, can live in isolation; that is, it can be unreachable. For some vertex pairs, there can be more than one path connecting them to each other.

A great many problems are naturally formulated using graphs. For example, given an airline route map of Europe, we might be interested in questions like: "What's the fastest way to get from Izmir to St. Petersburg?" It's very likely that many city pairs have more than one path connecting them. Another example is found in finite-state machines.

A graph can be represented in two ways:

1. Adjacency matrix representation
A V-by-V, where V is the number of vertices, array of boolean values is maintained, with a[x, y] set to true if there is an edge from vertex x to vertex y and false otherwise.
Example: Adjacency matrix representation in Pascal.

program adjmatrix (input, output); const max_no_of_vertices = 50; type matrix_type = array [1..max_no_of_vertices, 1..max_no_of_vertices] of boolean; var a: matrix_type; ... ...

2. Adjacency-structure representation
In this representation all the vertices connected to each vertex are listed on an adjacency list for that vertex.
Example: Adjacency structure representation in Pascal,

program adjlist (input, output); const max_no_of_vertices = 100; type link = ^node; node = record v: integer; next: link end; bucket_array = array [1..max_no_of_vertices] of link; var adj: bucket_array; ... ...

While adjacency matrix is the better choice for dense graphs, for sparse graphs adjacency-structure representation turns out to be a more feasible solution.

User-defined Types[edit | edit source]

Besides enhancing the readability and clarity of the program text, defining user-defined types makes it possible to compose a complicated data structure once and then create as many instances (variables of that type) of it as necessary and, secondly, to use the language’s own type-checking facility for input data validation such as range or consistency checking.

User-defined types come in two flavors:

1. Enumeration types
An enumeration type provides for the enumeration of the domain of the type by the programmer. The domain values are listed in a declarative statement.
Example: Enumeration type in C++
In C++, instead of

const int input = 1; const int output = 2; const int append = 3; ... bool open_file(string file_name, int open_mode); ... if (open_file("SalesReport", append)) ...

we can have

enum open_modes {input = 1, output, append}; ... bool open_file(string file_name, open_modes om); ... if (open_file("SalesReport", input)) ...

2. Subtypes
A subtype is the specification of the domain as a subrange of another already existing type.
Example: Subtypes in Pascal.

type ShoeSizeType = 35..46; var ShoeSize : ShoeSizeType;

With these declarations in place, we cannot assign any value outside the range to the ShoeSize variable. Any such attempt would be caught by the typing system and cause the program to terminate with an error. [In programming languages supporting exceptions, this could be handled more graciously without terminating the program.]

Realize, the enumeration and subtype definitions do not allow the programmer to specify operations on the newly defined data type.

Abstract Data Types[edit | edit source]

The defining characteristics of an abstract data type (ADT) is that nothing outside of the definition of the data structure and the algorithms operating on it should refer to anything inside, except through function and procedure calls for the fundamental operations.

Example: Stack implementation in Pascal,

type link = ^node; node = record key: integer; next: link end; var head, z: link; procedure stackinit; begin new(head); new(z); head^.next := z; z^.next := z end; procedure push(v: integer); var t: link; begin new(t); t^.key := v; t^.next := head^.next; head^.next : = t end; function pop: integer; var t: link; begin t := head^.next; pop := t^.key; head^.next := t^.next; dispose(t) end; function peek: integer; begin peek := t^.key; end function stackempty: boolean; begin stackempty := (head^.next = z) end;

The above Pascal fragment is unfortunately not an ideal implementation of an ADT. Fields of the record definition is open to manipulation; there is no language construct that prohibits changing the underlying structure directly by changing its components. It is also possible for the programmer to use any subprogram, be it one meant for export or one meant for implementing some auxiliary functionality. A mechanism to provide controlled access to subprograms and record fields, such as provided with access specifiers in object-oriented programming languages, is missing.

Organization of the related subprograms into a compilation unit is not enforced by the language. One could add subprograms that are irrelevant to the ADT being implemented. There is no compiler-reinforced rule relating the pre-existing and/or newly added subprograms to the ADT in question. The best we can do is to stick to conventions for better organizing our programs.

One other weakness, one that can be remedied very easily, of the preceding fragment is the fact that we cannot create more than one stack.

What we need is something like the module construct found in modular programming languages or the class construct of object-oriented programming languages.

Example: Stack implementation in Ada83.

PACKAGE Stack_Package IS TYPE Stack_Type IS PRIVATE; PROCEDURE Init(Stack: IN OUT Stack_Type); PROCEDURE Push (Stack: IN OUT Stack_Type; Item: IN Integer); FUNCTION Pop (Stack: IN OUT Stack_Type) RETURN Integer; FUNCTION Peek (Stack: IN Stack_Type) RETURN Integer; FUNCTION Empty (Stack: IN Stack_Type) RETURN Boolean; PRIVATE Stack_Size: CONSTANT Integer := 10; TYPE Integer_List_Type IS ARRAY (1..Stack_Size) OF Integer; TYPE Stack_Type IS RECORD Top: Integer RANGE 0..Stack_Size; Elements: Integer_List_Type; END RECORD; END Stack_Package; PACKAGE BODY Stack_Package IS PROCEDURE Init (Stack: IN OUT Stack_Type) IS BEGIN Stack.Top := 0; END Init; PROCEDURE Push (Stack: IN OUT Stack_Type; Item: IN Integer) IS BEGIN Stack.Top := Stack.Top + 1; Stack.Elements (Stack.Top) := Item; END Push; FUNCTION Pop (Stack: IN OUT Stack_Type) RETURN Integer IS Item: Integer := Stack.Elements (Stack.Top); BEGIN Stack.Top := Stack.Top 1; RETURN Item; END Pop; FUNCTION Peek (Stack: IN Stack_Type) RETURN Integer IS BEGIN RETURN Stack.Elements(Stack.Top); END Peek; FUNCTION Empty (Stack: IN Stack_Type) RETURN Boolean IS BEGIN IF Stack.Top = 0 THEN RETURN True; ELSE RETURN False; END Empty; END Stack_Package;

Using this implementation, a procedure reading ten integers and printing them in reverse order can be implemented as given below.

WITH Basic_IO, Stack_Package; PROCEDURE Reverse_Integers IS x: Integer; St: Stack_Package.Stack_Type; BEGIN Stack_Package.Init(St); FOR i IN 1..10 LOOP Basic_IO.Get(x); Stack_Package.Push(St, x); END LOOP; FOR i IN 1..10 LOOP x := Stack_Package.Pop(St); Basic_IO.Put(x); Basic_IO.New_Line; END LOOP; END Reverse_Integers;

The above implementation is a lot better than the previous one: The only way a stack can be manipulated is through the operations defined on it. The representation of the ADT is declared to belong to the private part of the package. This certainly is a great improvement over the first example. But, it still requires the user to perform creation and initialization in two separate steps, a rather error-prone process. What if we forget to call the initialization routine?[7]

One other problem with this approach is its lack of adaptability. Once it has been defined, the user cannot alter its behavior except for by modifying its definition. This may at times be very restrictive. Consider defining a shape type in a graphics system. Although shapes may differ in many aspects, there are certain operations that can be applied to all shapes, such as drawing, rotating, ttanslating, and etc. At first we might come up with a central routine where we call the appropriate drawing routine of the specific shape type. An implementation, which is typical of non-object-oriented programming languages, would look like as follows:

enum kind { circle, triangle, rectangle }; class Shape { // attributes common to all shapes kind k; ... public: // common interface of all shapes void draw(void); void rotate(int degree); ... }; void Shape::draw(void) { switch(k) { case circle: // draw the circle break; case triangle: // draw the triangle break; case rectangle: // draw the rectangle break; default: error(...); } // end of switch(k) } // end of void Shape::draw(void) void Shape::rotate(int howManyDegrees) { switch(k) { case circle: // rotate the circle break; case triangle: // rotate the triangle break; case rectangle: // rotate the rectangle break; default: error(); } // end of switch(k) } // end of void Shape::rotate(int)

One weakness of the preceding solution is the requirement that the central function draw and rotate must know about all kinds of different shapes. If you define a new shape, every operation on a shape must be examined and possibly modified. As a matter of fact, you cannot even add a new shape to the system unless you have access to the source code. And generally, you do not have such a privilege when you are a mere user of the class. So, the implementer faces a dilemma: either ship the implementation with missing shapes or indefinitely postpone shipping it until you make sure you have an exhaustive list of possible shapes.[8] Ideally, you would like to ship your software as early as you can and as fully functional as possible. This is referred to as the open-closed principle. But, how can you possibly provide an exhaustive functionality when you have limited time and you don’t have much idea about the alternatives? The answer is to get the user to extend your software and the keyword is inheritance.[9] In C++, you would provide the following solution:

class Shape { // attributes common to all shapes. No attributes for holding the kind! ... public: // interface common to all shapes virtual void draw(void); virtual void rotate(int degree); ... } // end of class Shape class Circle : public Shape { // attributes special to Circle. ... public: void draw(void) { /* circle-specific implementation of draw */ } void rotate(int degree) { /* circle-specific impl. of rotate */ } ... } // end of class Circle class Triangle : public Shape { // attributes special to Triangle. ... public: void draw(void) { /* triangle-specific implementation of draw */ } void rotate(int degree) { /* triangle-specific impl. of rotate */ } ... } // end of class Triangle class NewShape : public Shape { // attributes special to the new shape. ... public: void draw(void) { /* new shape-specific implementation of draw */ } void rotate(int degree) { /* new shape-specific impl. of rotate */ } ... } // end of class NewShape

In addition to the aforementioned advantages, it is the compiler that controls the whole process, whereas in the previous case it was the user doing the all bookkeeping stuff.

Notes[edit | edit source]

  1. Although it is a language conceived in early 70's, bool has been added to C in 1999. Therefore, you will probably not see it being used very frequently. Instead, you will see conventional uses of integer values through macros and typedefs.
  2. A character taking up two bytes in memory does not mean that it will be serialized as a sequence of two bytes. that is, it will probably not consume two bytes of disk space or will not be transmitted in two bytes over the network. Different encoding techniques may be used. A commonly used technique is the UTF-8, which assumes that the most commonly used characters are those found in the ASCII subset of the Unicode standard. In this scheme, the following mappings are used to encode individual characters: 0000 0000 0xxx xxxx → 0xxx xxxx; 0000 0xxx xxxx xxxx → 110x xxxx 10xx xxxx; xxxx xxxx xxxx xxxx → 1110 xxxx 10xx xxxx 10xx xxxx
  3. Automation is a Microsoft technology that allows you to take advantage of an existing program’s content and functionality, and to incorporate it into your own applications. A typical example is the automation of MS Office applications (automation objects) by VBA (automation controller).
  4. The formulas given here are made slightly more complex to avoid having a 0th component. Normally, the compiler implementer would not add 1, just to subtract it in the next computation.
  5. This term admittedly invokes the image of a two-dimensional array. However, it covers arrays of any rank.
  6. This is not true for components of the same sub-array. That is, this is not true for components of the same sub-array. numArr[0] and numArr[1], numArr[1][2] and numArr[1][3], and so on are guaranteed to reside in contiguous locations. Otherwise would have ruled out random access. However, numArr[0][2] and numArr[1][0], which are neighboring components living in different sub-arrays, cannot be guaranteed to occupy adjacent locations.
  7. As a matter of fact, this is exactly why constructors are provided in object-oriented programming languages. Similarly, object-oriented programming languages without automatic garbage collection provide a destructor routine for each class.
  8. A third option would be giving out the source code of the class, which would spell disaster for a software company: bankruptcy.
  9. Note that inheritance is not the only way to extend your software. Composition can be used as an alternative to inheritance.