Computer network technologies and services/VoIP
Voice over IP (VoIP) is the set of technologies to carry voice calls, along with multimedia data, over IP networks.
Circuit switching versus packet switching
Circuit-switching telephone network
In the traditional circuit-switching telephone network (POTS), voice is carried over statically allocated circuits, each with a constant bit rate of 64 Kbps (8000 samples per second, in accordance with the sampling theorem, at 8 bits per sample). Such a network has some limitations:
- no compression: it wouldn't make sense to save bits as 64 kilobits per second are statically allocated for each phone call;
- an integer number of circuits must be allocated in order to support multimedia or multi-channel communication;
- no silence suppression: voice samples are transmitted also during pauses and circuits keep being allocated;
- no statistical multiplexing: it is not possible to dynamically share bandwidth among multiple calls according to their current needs;
- signaling procedure (ring tone, busy tone, idle tone, etc.) is required for circuit allocation.
Packet-switching data network
In a packet-switching data network (IP), voice is dynamically carried via packets, and this enables new features:
- better compression for a smaller number of packets;
- high-quality communication: the bit rate is not limited to 64 Kbps anymore;
- silence suppression: no packets are transmitted during pauses;
- statistical multiplexing: bandwidth allocation is flexible;
- signaling procedure does not allocate static resources anymore;
- nomadicity: the user can be reachable through the same phone number or account when he moves.
However, a new problem is introduced: resources cannot really be reserved in a packet-switching network → it is very difficult to guarantee quality of service for voice calls, because packets may arrive late or may be lost:
- delays: some reference values for end-to-end delays have been defined by ITU:
- 0-150 ms: this is acceptable for human ear;
- 150-400 ms: this is acceptable only for inter-continental calls;
- > 400 ms: this is not acceptable and it harms conversation;
- losses: the human ear can tolerate without problems at most 5% missing packets.
- TCP or UDP?
In theory, UDP and TCP packets take the same time to reach the receiver; the only difference is that TCP has to wait for the acknowledgment packets → UDP would be the most natural choice. In practice, however, Skype often uses TCP because it makes it simpler to go through NATs and firewalls, even if small silences may sometimes occur due to the sliding-window mechanisms.
Migration from circuit switching to packet switching
The traditional circuit-switching network (POTS) can be migrated to an IP-based packet-switching network in a gradual way:
- Telephone over IP (ToIP)-based network: terminals at network edges still work in a circuit-switching way, but the network backbone is based on IP and performs packetization internally → as VoIP usage is hidden from the user, additional multimedia services are not available to the end user.
New telecom operators can build their phone networks as ToIP-based networks → telecom operators can save money by building and maintaining a single integrated infrastructure;
- mixed network: some terminals are VoIP, other ones are still traditional;
- IP network: all terminals are VoIP, but intelligent network services (e.g. toll-free numbers) are still traditional, because they work so well that network operators are reluctant to modernize them;
- IP-only network: all terminals are VoIP and all intelligent network services work over IP.
The gateway is a device which connects a POTS network with an IP network. It is made up of three components:
- media gateway: it is able to convert voice samples from the POTS network into data packets to the IP network and vice versa;
- signaling gateway: it is able to convert signaling tones from the POTS network into signaling packets to the IP network and vice versa;
- gateway controller: it is in charge of supervising and monitoring the whole gateway, by controlling traffic quality, by performing authorization, by performing authentication (for billing), by locating destinations, and so on.
- The gateway controller alone is still useful in IP-only networks.
Steps for VoIP flow creation
At the transmitter side
The transmitter should perform the following steps:
Sampling converts voice from an analogue signal to digital samples. It is characterized by sensitivity (bits per sample), sampling frequency (hertz = 1/s), and theoretical bit rate (bit/s).
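The relation among the three sampling parameters can be written as a one-line computation. The sketch below is only illustrative: the function name is made up here, and the telephone-quality values (8 bits per sample, 8000 samples per second) are assumed as the example input.

```python
def theoretical_bit_rate(bits_per_sample: int, sampling_frequency_hz: int) -> int:
    """Return the uncompressed bit rate in bit/s."""
    return bits_per_sample * sampling_frequency_hz

# Telephone-quality voice: 8-bit samples taken 8000 times per second.
pcm_rate = theoretical_bit_rate(8, 8000)
print(pcm_rate)  # 64000 bit/s, i.e. 64 Kbps
```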
Encoding techniques reduce the bit rate, but they may introduce additional delays due to the encoding algorithms.
Main encoding techniques are:
- differential encoding: each sample is encoded based on differences with respect to the previous sample and/or the following sample;
- weighted encoding: during a video call the interlocutor figure should be encoded at a higher bit rate with respect to the surrounding environment;
- lossy encoding: some audio and video information is irreversibly removed (ideally, the quality loss should not be perceived by human senses).
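As an illustration of the first technique, the following minimal sketch encodes each sample as the difference from the previous one. It is a toy example under simplifying assumptions (integer samples, no quantization of the differences); real codecs are considerably more elaborate, and the function names are hypothetical.

```python
def delta_encode(samples):
    """Keep the first sample, then store only differences between neighbours."""
    if not samples:
        return []
    encoded = [samples[0]]
    for previous, current in zip(samples, samples[1:]):
        encoded.append(current - previous)
    return encoded

def delta_decode(encoded):
    """Rebuild the samples by accumulating the differences."""
    decoded = []
    total = 0
    for value in encoded:
        total += value
        decoded.append(total)
    return decoded

samples = [100, 102, 101, 105]
encoded = delta_encode(samples)
print(encoded)                # [100, 2, -1, 4]
print(delta_decode(encoded))  # [100, 102, 101, 105]
```

Since consecutive voice samples tend to be close to each other, the differences are small numbers that can be represented with fewer bits than the samples themselves.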
Complexity is an important issue when encoding algorithms have to be executed on mobile terminals (such as embedded systems or low-power devices with limited performance). Moreover, some services do not support data encoded by lossy compression: for example, a fax cannot tolerate quality loss. That is why telecom operators still prefer the PCM64 codec with a constant bit rate of 64 Kbps: it requires little processing power and it is also supported by faxes and other services using the telephone network.
The speaker's voice, after coming out of the receiver's loudspeaker, may be picked up by the receiver's microphone and travel back to the transmitter's loudspeaker after a delay called round trip delay; if this delay is significant, it may disturb the speaker himself → echo cancellation is meant to prevent the speaker from hearing his own voice.
Packetization delay depends on the number of samples inserted into each packet, that is a trade-off between delay and efficiency:
- delay: the packet has to wait for its last sample before being sent → if too many samples are packetized together, the first sample will arrive with a significant delay;
- efficiency: every IP packet has an overhead in size due to its headers → if too few samples are packetized together, the bit rate will significantly increase due to header overhead.
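The trade-off can be quantified with a short computation. The sketch below assumes 64 Kbps PCM voice (8000 one-byte samples per second) and 40 bytes of headers per packet (IPv4 20 + UDP 8 + RTP 12, ignoring layer-2 framing and header options); these figures and all names are assumptions chosen for the example.

```python
HEADER_BYTES = 20 + 8 + 12   # IPv4 + UDP + RTP, assumed without options
SAMPLE_RATE_HZ = 8000        # assumed telephone-quality sampling
BYTES_PER_SAMPLE = 1         # assumed 8-bit samples

def packetization_tradeoff(samples_per_packet):
    """Return (packetization delay in ms, bit rate on the wire in bit/s)."""
    delay_ms = 1000 * samples_per_packet / SAMPLE_RATE_HZ
    payload_bytes = samples_per_packet * BYTES_PER_SAMPLE
    packets_per_second = SAMPLE_RATE_HZ / samples_per_packet
    bit_rate = (payload_bytes + HEADER_BYTES) * 8 * packets_per_second
    return delay_ms, bit_rate

for n in (8, 80, 160, 800):
    delay_ms, bit_rate = packetization_tradeoff(n)
    print(f"{n:4d} samples/packet: delay {delay_ms} ms, rate {bit_rate} bit/s")
```

Under these assumptions, 160 samples per packet give a 20 ms packetization delay and 80 Kbps on the wire, while 8 samples per packet give only 1 ms of delay but 384 Kbps, six times the codec bit rate.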
Some redundancy-based error correction techniques may be contemplated: each packet also carries the previous sample along with the new one, so if the previous packet is lost it will still be possible to recover its sample.
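A minimal sketch of this idea follows, with packets modeled as dictionaries. The field names and the two helper functions are hypothetical and do not correspond to any specific payload format.

```python
def build_packets(chunks):
    """Attach to every packet a copy of the previous packet's content."""
    packets = []
    previous = None
    for seq, chunk in enumerate(chunks):
        packets.append({"seq": seq, "current": chunk, "previous": previous})
        previous = chunk
    return packets

def recover(received, total):
    """Rebuild the stream, using the redundant copy when a packet is missing."""
    by_seq = {packet["seq"]: packet for packet in received}
    output = []
    for seq in range(total):
        if seq in by_seq:
            output.append(by_seq[seq]["current"])
        elif seq + 1 in by_seq:
            output.append(by_seq[seq + 1]["previous"])  # copy carried by the next packet
        else:
            output.append(None)  # two consecutive losses: not recoverable
    return output

packets = build_packets(["a", "b", "c", "d"])
survivors = [p for p in packets if p["seq"] != 1]   # packet 1 is lost
print(recover(survivors, 4))  # ['a', 'b', 'c', 'd']
```

The price is a higher bit rate and one extra packet of delay before a lost sample can be recovered; two consecutive losses cannot be repaired with a single redundant copy.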
When input traffic exceeds the output link capacity, the router should store packets waiting for transmission (buffering) → this increases delay and jitter. Priority queue management addresses these issues (see Computer network technologies and services/Quality of service).
In order to reduce transmission delays there are some possible solutions:
- increasing the bandwidth, but ADSL providers usually are more interested in increasing just the downstream bandwidth;
- using PPP interleaving, that is splitting a large frame into several smaller PPP frames, but providers do not always implement PPP interleaving;
- avoiding using other applications (e.g. data transfer) during voice calls.
At the receiver side
The receiver should perform the following steps:
- de-jitter: de-jitter modules should play back the packets at the same pace used to generate them;
- re-ordering: as it is packet-switching the network may deliver out-of-order packets → a module is required for re-ordering;
- decoding: decoding algorithms should implement some techniques:
- missing packets should be managed by using predictive techniques, inserting white noise, or playing samples from the last received packet;
- silence suppression: the receiver introduces white noise during pauses in conversation, because perfect silences are perceived by the user as call malfunctions. It is important to be able to stop the white noise immediately as soon as the speaker resumes talking.
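The receiver-side steps above can be sketched as follows. This is a simplified illustration: packets are (sequence number, payload) pairs, a missing packet is concealed by repeating the last received payload, and playback instants are derived from a fixed buffering delay. The function names, the "<noise>" placeholder and the numeric values are assumptions made for the example.

```python
def playout_order(received, first_seq, count, noise="<noise>"):
    """Return the payloads in playback order, concealing missing packets."""
    by_seq = dict(received)          # sequence number -> payload
    played, last = [], None
    for seq in range(first_seq, first_seq + count):
        if seq in by_seq:
            last = by_seq[seq]
            played.append(last)
        elif last is not None:
            played.append(last)      # repeat the last received payload
        else:
            played.append(noise)     # nothing received yet: play noise
    return played

def playback_time_ms(first_arrival_ms, buffer_ms, interval_ms, seq, first_seq):
    """De-jitter: play packets at the pace used to generate them."""
    return first_arrival_ms + buffer_ms + (seq - first_seq) * interval_ms

# Packets 0, 2, 3 arrive out of order; packet 1 is lost.
arrived = [(2, "c"), (0, "a"), (3, "d")]
print(playout_order(arrived, first_seq=0, count=4))  # ['a', 'a', 'c', 'd']
print(playback_time_ms(1000, 60, 20, seq=3, first_seq=0))  # 1120
```

A larger buffering delay absorbs more jitter but adds directly to the end-to-end delay, so it has to be chosen with the delay budget in mind.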
Real-time Transport Protocol (RTP) is used to transport VoIP flows over UDP.
- Native multicast transmission
RTP allows multicast transmission also over a network which does not support multicast.
Indeed IP does support multicast, but its usage requires the network provider to configure its network devices in order to create a multicast group for every VoIP flow → RTP allows at application layer to multicast data in a plug-and-play way without the intervention of the network provider.
- Just essential features
RTP does not specify features which are supposed to be managed by the underlying layers, such as packet fragmentation and transmission error detection (checksum).
- Independence of data formats
RTP just includes the 'Payload Type' field to specify the kind of packet contents and the used codec, but it does not specify how to encode data and which codecs to use (this information is specified separately by 'Audio Video Profiles' documents).
It is impossible to associate every codec in the world with a code → transmitter and receiver should agree on codes to be used to identify codecs during the session setup, and those codes are valid just within the session.
- Real-time data transport
Missing packets are allowed → the 'Sequence Number' and 'Timestamp' fields are combined to restart the audio/video playback at the right time instant in case of packet loss.
- Flow differentiation
A multimedia session needs to open an RTP session, hence a UDP flow, for each multimedia flow (audio, video, whiteboard, etc.).
- RTP Control Protocol (RTCP)
It performs connection monitoring and control: the destination collects some statistics (information about losses, delays, etc.) and periodically sends them to the source, so that the latter can reduce or increase the quality of the multimedia flow in order to keep the service working as well as possible given the current network capabilities. For example, the source can learn that a certain codec has a bit rate too high for the network to support, and therefore switch to a codec with a lower bit rate.
- Non-standard ports
RTP does not define standard ports → the RTP packets are difficult to detect for firewalls and quality of service. However some implementations use static port ranges, to avoid opening too many ports on firewalls and to make the marking for quality of service easier.
The traditional solutions without the RTP mixer always require high bandwidth capabilities for all hosts.
The RTP mixer is a device able to manipulate RTP flows for multicast transmissions: for example, in a videoconference the mixer for each host takes the flows coming from the other hosts and it mixes them together into a single flow towards that host.
Every host transmits and receives one single flow → the mixer is useful to save bandwidth: even a host with low bandwidth can join the videoconference. The mixer should be the host having the highest bandwidth capacity, so as to be able to receive all the flows from the other hosts and transmit all the flows to the other hosts.
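The bandwidth saving can be made concrete by counting flows. The sketch below assumes that the solution without the mixer is a full mesh of unicast flows (every host sends its flow to every other host), which is only one possible arrangement; the function names are hypothetical.

```python
def flows_full_mesh(n_hosts):
    """Without a mixer (assumed full mesh): (sent, received) flows per host."""
    return (n_hosts - 1, n_hosts - 1)

def flows_with_mixer(n_hosts):
    """With one of the n_hosts acting as mixer: (sent, received) flows."""
    return {
        "host": (1, 1),                         # one flow to and from the mixer
        "mixer": (n_hosts - 1, n_hosts - 1),    # one flow per other host
    }

print(flows_full_mesh(5))    # (4, 4)
print(flows_with_mixer(5))   # {'host': (1, 1), 'mixer': (4, 4)}
```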
The RTP header has the following format:
|V||P||X||CC||M||Payload Type||Sequence Number|
|Timestamp|
|Synchronization source identifier (SSRC)|
|Contributing source identifier (CSRC) :::|
where the most significant fields are:
- CSRC Count (CC) field (4 bits): it specifies the number of identifiers in 'CSRC' field;
- Marker (M) flag (1 bit): its meaning depends on the profile in use; it marks significant events in the flow, such as a frame boundary for video or the beginning of a talkspurt after a silence for audio;
- Payload Type (PT) field (7 bits): it specifies the kind of packet payload; it generally contains the code corresponding to the used codec;
- Synchronization source identifier (SSRC) field (32 bits): it identifies the source of the RTP flow; when the flow is produced by an RTP mixer, it identifies the mixer itself (e.g. a mixer M);
- Contributing source identifier (CSRC) field (variable length): it identifies the multiple sources contributing to a flow produced by a mixer (e.g. sources S1, S2, S3 mixed together by M).
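As an illustration, the fields above can be extracted from a raw packet with Python's struct module. The sketch follows the fixed 12-byte header layout of the RTP specification (RFC 3550), including the 32-bit Timestamp field that sits between Sequence Number and SSRC, and it ignores header extensions; the function name and the sample values are made up for the example.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the fixed RTP header and the CSRC list (extensions are ignored)."""
    if len(packet) < 12:
        raise ValueError("an RTP header is at least 12 bytes long")
    byte0, byte1, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = byte0 & 0x0F
    csrc_end = 12 + 4 * csrc_count
    if len(packet) < csrc_end:
        raise ValueError("packet too short for the announced CSRC count")
    csrc = list(struct.unpack(f"!{csrc_count}I", packet[12:csrc_end]))
    return {
        "version": byte0 >> 6,
        "padding": bool(byte0 & 0x20),
        "extension": bool(byte0 & 0x10),
        "csrc_count": csrc_count,
        "marker": bool(byte1 & 0x80),
        "payload_type": byte1 & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": csrc,
    }

# Hand-built example: version 2, two contributing sources, marker set, payload type 0.
raw = struct.pack("!BBHII2I", 0x82, 0x80, 1000, 160, 0xAAAA, 1, 2)
header = parse_rtp_header(raw)
print(header["version"], header["csrc_count"], header["csrc"])  # 2 2 [1, 2]
```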
H.323 is an application-layer signaling protocol suite standardized by ITU. It is a very complex standard because it inherits its logic from the world of telephone operators.
H.323 network components
H.323 was originally developed in order to allow communication (audio, video, shared whiteboard...) between hosts connected to a corporate LAN and remote devices connected to the traditional circuit-switching network (PSTN):
- gatekeeper: it implements the gateway controller, being in charge of authenticating and locating users, keeping track of the registered users, etc.;
- proxy gatekeeper: the client contacts the gatekeeper indirectly through the proxy gatekeeper → this reduces the efforts for low-power client devices, but it is not mandatory;
- Multipoint Control Unit (MCU): it implements the RTP mixer;
- gateway: it implements the signaling gateway and the media gateway, translating data channels, control channels and signaling procedures between the LAN and the PSTN, and it is seen as H.323 terminal in the LAN and as a telephone terminal in the PSTN.
Later the H.323 standard was extended over a wide area network (WAN), allowing communication also with remote users through the Internet.
The zone of a gatekeeper is made up of the set of terminals it manages. A zone may involve different network layers, such as multiple LANs separated by routers.
H.323 protocol architecture
The H.323 protocol stack is quite complex because it is made up of several protocols:
- data plane: it consists of RTP and RTCP protocols lying on UDP;
- control plane: it consists of protocols lying on TCP/UDP for signaling:
- RAS controller: it allows a terminal to exchange control messages with the gatekeeper:
- Registration messages: the terminal asks the gatekeeper to join a zone;
- Admission messages: the terminal asks the gatekeeper to contact another terminal;
- Status messages: the terminal tells the gatekeeper if it is active;
- bandwidth messages: the RAS controller notifies the gatekeeper about changes in bandwidth, even when the call is in progress, so that the gatekeeper will be able to deny new calls if the link is overloaded;
- call controller: it allows a terminal to exchange control messages directly with another terminal;
- H.245 controller: it allows a pair of terminals to agree with each other about parameters like codecs;
- data: it allows a terminal to send control messages for desktop sharing or other multimedia data flows.
Finally, the H.225 layer puts all messages together: it creates a sort of reliable virtual tunnel to send H.323 messages over the unreliable IP network, emulating the reliability of a circuit.
Each terminal is identified uniquely by a pair (IP address, TCP/UDP port), so it can be contacted directly through its address/port pair without the need of a gatekeeper.
If there is a gatekeeper, address/port pairs can be mapped to aliases that are easier for users to remember (for instance firstname.lastname@example.org, E.164 phone number, nickname). As they are associated with user accounts, aliases enable nomadicity: a user keeps being reachable even if he moves and his IP address changes.
Main steps of an H.323 call
An H.323 call happens in six main steps:
- registration: the caller terminal searches for a gatekeeper within its zone and opens a RAS channel by using the RAS control;
- call setup: the caller terminal establishes the channel to the callee terminal by using the call control;
- negotiation: parameters such as bandwidth and codecs are negotiated by using the H.245 control;
- data transfer: the voice is carried by RTP;
- closing: the data channel is closed by using the H.245 control;
- tear down: the RAS channel is closed by using the RAS control.
The gatekeeper can play two roles:
- gatekeeper routed call: the call always goes through the gatekeeper → this may be useful for NAT traversal: the gatekeeper acts like a relay server;
- gatekeeper direct endpoint: the call goes directly to the endpoint, but first the caller and callee clients should perform the Admission step with the gatekeeper for charging and bandwidth management purposes.
Main issues and criticisms
- the H.323 standard does not provide any fault-tolerance assistance because it contemplates just a single gatekeeper → vendors have developed their own customizations providing this functionality, which are incompatible with one another;
- the H.323 standard does not provide any support for the communication among different zones → a corporation cannot 'merge' its zone with another corporation's one;
- messages are encoded by using the ASN.1 format: this is not textual, therefore debugging is very difficult and requires dealing with low-level machine details (e.g. little-endian byte order);
- the protocol stack is made up of a lot of protocols, one for every feature.
Session Initiation Protocol (SIP) is an application-layer signaling protocol standardized by IETF via RFC. Nowadays SIP is growing much faster than H.323, mainly thanks to its approach of following the internet philosophy ('keep it simple'): for example, it uses a text-based approach (like HTTP), so its messages are easy to read. The interaction is client-server.
The SIP protocol stack is simpler than the H.323 one because SIP is a common layer in the control plane. SIP only covers signaling: it delegates aspects not related to signaling, such as bandwidth management, to other already existing protocols, reducing the complexity of its design:
- RTP/RTCP: it is used to transmit and control a multimedia flow;
- SDP: it is used to notify control information about multimedia flows;
- RTSP (Real Time Streaming Protocol): it is an RTP-like protocol used to handle both real-time flows and other kinds of resources (e.g. fast-forward of a recorded voice message for a voice mailbox);
- RSVP: it is used to reserve resources on IP networks, trying to build a sort of circuit-switching network over a packet-switching one.
SIP can operate over one of three possible transport layers:
- UDP: no TCP connection has to be kept alive → good for low-power devices;
- TCP: it guarantees more reliability and it is useful for NAT traversal and to cross firewalls;
- TLS (TCP with SSL): the messages are encrypted for security purposes, but the advantage of human-readable text messages is lost.
SIP provides voice calls with some main services:
- user localization: it defines the destination terminal to be contacted for the call;
- user capacity: it defines the media (audio, video...) and the parameters (codecs) to be used;
- user availability: it defines whether the callee wants to accept the call;
- call setup: it establishes a connection with all its parameters;
- call management.
SIP signaling can be used for several additional services besides voice calls: e-presence (the user status: available, busy, etc.), instant messaging, whiteboard sharing, file transfer, interactive games, and so on. SIP supports nomadicity: an account is associated to every user, so he will keep being reachable even if he moves changing his IP address.
SIP network components
- Terminal: every host needs to be both client and server (server in order to be reachable).
- Registrar server: it is in charge of keeping track of mappings between hosts and IP addresses.
- It implements the gatekeeper: a host has to register in order to enter a SIP network.
- Proxy server: it manages the exchange of messages between hosts and other servers.
- A host may decide to talk just with the proxy server, delegating to it all the tasks required for SIP calls.
- Redirect server: it is used to redirect incoming calls (e.g. a user wants to be reachable at his work mobile phone only during working hours).
- Media server: it is used to store value-added contents (e.g. voice mailbox).
- Media proxy: it can be used as a relay server for firewall traversal.
- Location server: it is used for locating users.
- When a host wants to make a phone call it asks the location server to find the destination user address.
- AAA server (Authentication, Authorization, Accounting): the registrar server exchanges messages with the AAA server to check users (e.g. whether the user is authorized to enter the network).
- Gateway: it connects the IP network with the PSTN network, by translating SIP packets to samples and vice versa.
- Multipoint Control Unit (MCU): it implements the RTP mixer, with the same functionality as in H.323.
In many cases a single machine, called SIP server (or SIP proxy), implements the functionalities of registrar server, proxy server, redirect server, media proxy. In addition, the location server is usually located in the DNS server, and the AAA server is usually located in the corporate AAA server.
Accounting and domains
Each user has a SIP account, so he will keep being reachable even if he moves changing his IP address (nomadicity). Account addresses are in the form email@example.com; telephone terminals can have SIP addresses too in the form telephone_number@gateway.
A SIP network has a distributed architecture: each SIP server is in charge of a SIP domain (the equivalent of an H.323 zone), and all the hosts referring to the same SIP server belong to the same SIP domain and have the same domain name in their account addresses. In contrast to H.323, a user can contact a user belonging to another SIP domain: his SIP server will be in charge of contacting the other user's SIP server.
Let us suppose that an American user belonging to the Verizon domain moves to Italy and connects to the Telecom Italia network. In order to keep being reachable he needs to contact the Verizon SIP server to register himself, but he is using the Telecom Italia network infrastructure → he needs to pass through the Telecom Italia SIP server, which is his outbound proxy server, as a roaming-like service, and in this way Telecom Italia can keep track of the user's calls for billing purposes.
In order to interconnect domains, it must be possible to find all registrar servers, since they store the mappings between account aliases and IP addresses → besides the usual A/AAAA records, two additional kinds of records are required in DNS servers for locating registrar servers:
- NAPTR record: it defines which transport protocol can be used for the specified domain, specifying the alias to be used for the SRV query;
- SRV record: it specifies the registrar server alias, to be used for the A/AAAA query, and the port for the specified transport protocol;
- A/AAAA record: it specifies the IPv4/IPv6 address for the specified registrar server alias.
The DNS record table may contain more than one SRV/NAPTR record:
- multiple NAPTR records: multiple transport protocols are available for the specified domain, and the 'Preference' field specifies the order preference (in order: TLS/TCP, TCP, UDP);
- multiple SRV records: multiple registrar servers are available for the specified transport protocol, and the 'Priority' field specifies the order preference;
or it may contain no SRV/NAPTR records:
- no NAPTR records: the host just tries SRV queries (often UDP) and it will use the transport protocol corresponding to the first SRV reply;
- no SRV records: the registrar server address must be statically configured on the host, and the host will use standard port 5060.
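The lookup chain NAPTR → SRV → A/AAAA, including the fallback when no NAPTR records exist, can be sketched with an in-memory stand-in for the DNS. Everything here is hypothetical: the domain example.com, the record contents, the order of the default services, and the function itself, which performs no real DNS query.

```python
# Hypothetical zone data: (preference, service), (priority, port, target), name -> address.
NAPTR = {"example.com": [(20, "_sip._tcp.example.com"),
                         (10, "_sips._tcp.example.com"),
                         (30, "_sip._udp.example.com")]}
SRV = {"_sip._tcp.example.com": [(0, 5060, "sip1.example.com")],
       "_sip._udp.example.com": [(0, 5060, "sip2.example.com")]}
A = {"sip1.example.com": "192.0.2.10", "sip2.example.com": "192.0.2.20"}

DEFAULT_SERVICES = ["_sip._udp", "_sip._tcp", "_sips._tcp"]  # assumed order, UDP first

def locate_server(domain, naptr, srv, a):
    """Return (service, address, port) of the first server found, or None."""
    services = [service for _pref, service in sorted(naptr.get(domain, []))]
    if not services:
        # No NAPTR records: try SRV queries directly.
        services = [f"{prefix}.{domain}" for prefix in DEFAULT_SERVICES]
    for service in services:
        for _priority, port, target in sorted(srv.get(service, [])):
            if target in a:
                return service, a[target], port
    return None  # no SRV records: the address must be statically configured

print(locate_server("example.com", NAPTR, SRV, A))
print(locate_server("example.com", {}, SRV, A))
```

In the first call the preferred TLS service has no SRV record in this made-up zone, so the lookup falls through to the TCP service; in the second call there are no NAPTR records and the first SRV answer, here the UDP one, decides the transport protocol.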
- ENUM standard
How to type an account address on a traditional telephone to contact a SIP user? Every SIP account is associated by default to a phone number called E.164 address:
- the user types the phone number on his traditional telephone;
- the gateway between the POTS network and the SIP network converts the phone number to an alias with fixed domain e164.arpa and it queries the DNS asking whether NAPTR records exist:
- if NAPTR records are found by the DNS server, the phone number is associated to a SIP account and the call is forwarded to the target SIP proxy;
- if no NAPTR records are found by the DNS server, the phone number corresponds to a user in the POTS network.
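The conversion performed by the gateway can be sketched in a few lines. Note that the digit reversal is not described in the text above: it comes from the ENUM specification (RFC 6116), where the digits are reversed and separated by dots before appending e164.arpa. The function name and the phone number are made up for the example.

```python
def enum_domain(phone_number: str) -> str:
    """Map an E.164 phone number to the domain name queried for NAPTR records."""
    digits = [character for character in phone_number if character.isdigit()]
    return ".".join(reversed(digits)) + ".e164.arpa"

print(enum_domain("+1 555 0100"))  # 0.0.1.0.5.5.5.1.e164.arpa
```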
Each SIP message has the following textual format:
- message type (one line): it specifies the message type;
- SIP header: it includes information about the multimedia flow;
- empty line (HTTP-like behaviour);
- SDP message (payload): it includes control information about the multimedia flow.
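The textual layout above can be split with a few string operations. The sketch below is a simplification: it assumes CRLF line endings and keeps only the last value when a header field is repeated (real messages may carry several Via fields, for instance). The addresses and the function name are hypothetical.

```python
def parse_sip_message(raw: str):
    """Split a SIP message into first line, header fields and payload."""
    head, _, body = raw.partition("\r\n\r\n")   # the empty line ends the header
    lines = head.split("\r\n")
    first_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return first_line, headers, body

raw_message = "\r\n".join([
    "INVITE sip:bob@example.org SIP/2.0",
    "Via: SIP/2.0/UDP pc.example.com",
    "From: <sip:alice@example.com>",
    "To: <sip:bob@example.org>",
    "Content-Type: application/sdp",
    "",
    "v=0",
])
first_line, headers, body = parse_sip_message(raw_message)
print(first_line)       # INVITE sip:bob@example.org SIP/2.0
print(headers["From"])  # <sip:alice@example.com>
print(body)             # v=0
```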
Main message types
A SIP message can be of one of several types, including:
- REGISTER message: it is used to register oneself to a domain, and it can be sent via multicast to all registrar servers;
- INVITE message: it is used to set up a phone call;
- ACK message: it is the last SIP message just before the beginning of the RTP flow;
- BYE message: it is used to close a phone call;
- CANCEL message: it is used to cancel a pending request for a call setup;
- SUBSCRIBE, NOTIFY, MESSAGE messages: they are used for e-presence and instant messaging;
- code messages: they include:
- 1xx = Provisional codes: they refer to operations in progress (e.g. 100 Trying, 180 Ringing);
- 2xx = Success codes: they report that the request was successfully completed (e.g. 200 OK);
- 4xx = Client Error codes: they are error codes (e.g. 401 Unauthorized).
Main fields in SIP header
The SIP header can contain several fields, including:
- From field: it includes the SIP address of the terminal that would like to start the call;
- To field: it includes the SIP address of the terminal that the caller terminal would like to contact;
- Contact field: it is used by the SIP server to specify the callee terminal's IP address, that can be used by the caller terminal to contact directly the callee terminal;
- Via field: it is used to keep track of all the SIP servers which the message should pass through (e.g. outbound proxy servers);
- Record Routing field (Record-Route in the message header): it specifies whether all the SIP messages should pass through the proxy, useful for NAT traversal;
- Subject field: it includes the subject for the SIP connection;
- Content-Type, Content-Length, Content-Encoding fields: they include information about payload type (in a MIME-like format, e.g. SDP), length (in bytes), and encoding.
SDP (Session Description Protocol) is a text-based protocol used to describe multimedia sessions: number of multimedia streams, media type (audio, video, etc.), codec, transport protocol (e.g. RTP/UDP/IP), bandwidth, addresses and ports, start/end times of each stream, source identification.
SDP is included in the payload of a SIP packet to notify control information about the multimedia flow (e.g. the SIP message carrying an invitation to a phone call also needs to notify which codec to use). Since SDP was designed some time ago, it has some features (such as start/end times of each stream) that are useless for SIP, but SDP was adopted by SIP without any change in order to re-use existing software.
- SDP message format
Every SDP message is made up of a session section and one or more media sections (one for each multimedia flow):
- session section: starting with a line v=, it includes parameters for all the multimedia flows within the current session;
- media section: starting with a line m=, it includes parameters for the current multimedia flow.
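The two kinds of sections can be separated by looking for the lines starting with m=. The following sketch uses a made-up session description (addresses, ports and identifiers are hypothetical) and a hypothetical helper name.

```python
def split_sdp(sdp: str):
    """Return (session lines, list of media sections) of an SDP message."""
    session, media = [], []
    for line in sdp.strip().splitlines():
        if line.startswith("m="):
            media.append([line])        # a new media section begins
        elif media:
            media[-1].append(line)      # parameter of the current media flow
        else:
            session.append(line)        # parameter valid for the whole session
    return session, media

example_sdp = """
v=0
o=alice 2890844526 2890844526 IN IP4 192.0.2.1
s=-
c=IN IP4 192.0.2.1
t=0 0
m=audio 49170 RTP/AVP 0
a=rtpmap:0 PCMU/8000
m=video 51372 RTP/AVP 31
a=rtpmap:31 H261/90000
"""
session, media = split_sdp(example_sdp)
print(len(session), len(media))  # 5 2
print(media[0][0])               # m=audio 49170 RTP/AVP 0
```

In each m= line the last number is the RTP payload type, which the following a=rtpmap line binds to a codec name and clock rate.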
Steps for a SIP call
A SIP call happens in 4 steps:
- registration: the caller terminal registers itself to a domain;
- invitation: the caller terminal asks to set up a call;
- data transfer: the voice is carried by RTP;
- tear down: the call is closed.
User agent A wants to register itself to domain A by contacting its SIP proxy:
- 1. 2. DNS queries and replies (NAPTR, SRV, A/AAAA): A asks the DNS server for the SIP proxy IP address;
- 3. REGISTER message: A asks the SIP proxy to be registered, without inserting its password here;
- 4. 401 Unauthorized message: the SIP proxy asks for authentication by inserting a challenge, which is changed on every registration;
- 5. REGISTER message: A computes a hash function based on the challenge and the password and it sends the resulting string to the SIP proxy;
- 6. 200 OK message: the registrar server checks the reply to the challenge and it grants access to the user.
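The challenge-response exchange in steps 4 to 6 can be sketched as follows. The text only says that a hash of the challenge and the password is computed; the exact formula below is an assumption taken from the HTTP Digest scheme (MD5, without the optional qop parameter) that SIP reuses, and the user name, realm, password and nonce are invented.

```python
import hashlib

def md5_hex(text: str) -> str:
    return hashlib.md5(text.encode()).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    """Compute the reply to a digest challenge (the nonce) without sending the password."""
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")

# Step 4: the proxy sends a fresh challenge (nonce) in the 401 Unauthorized message.
nonce = "hypothetical-nonce-1"
# Step 5: the user agent answers with the hash in the second REGISTER message.
reply = digest_response("alice", "example.com", "secret", "REGISTER",
                        "sip:example.com", nonce)
# Step 6: the registrar repeats the computation with the stored credentials.
expected = digest_response("alice", "example.com", "secret", "REGISTER",
                           "sip:example.com", nonce)
print(reply == expected)  # True
```

Because the challenge changes on every registration, a reply captured by an eavesdropper cannot be reused for a later registration.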
User agent A wants to set up a call with user agent B through B's SIP proxy:
- 1. A asks its SIP proxy to contact B by sending an INVITE message to it;
- 2. 3. A's SIP proxy performs the DNS queries to find B's SIP proxy IP address (NAPTR, SRV, A/AAAA);
- 4. A's SIP proxy sends an INVITE message to B's SIP proxy.
- 5. B's SIP proxy sends an INVITE message to B;
- 6. 7. 8. B makes A's phone ring by sending, through the SIP proxies, a 180 Ringing message to A;
- 9. 10. 11. B accepts the call by sending, through the SIP proxies, a 200 OK message to A;
- 12. A, either through the SIP proxies or directly according to the Record Routing field value, notifies B that it has received the OK message.
Tear down step
At the end of the call, after closing the RTP flow:
- 1. BYE message: B notifies A that it wants to close the call;
- 2. OK message: A notifies B that it has received the BYE message.
- The distinction between media and signaling gateway is often not clear: in fact signaling tones are normal audio samples, and signaling packets are normal data packets.
- It would be better to speak more in general about enterprise networks, because H.323 actually does not give any assumption on the topology of the underlying network.
- The gatekeeper is not mandatory: a client can contact directly a destination if it knows its address.
- The Admission step is not mandatory if the caller knows the callee's IP address.
- RSVP just tries to do that, because it is impossible to guarantee a circuit-switching service over a packet-switching network.
- This message should not be confused with the TCP ACK packets: it works at application layer, so also on UDP.
- Here the registrar server is supposed to be implemented into the SIP proxy.