Protocols

H.323

The H.323 Series of Recommendations evolved out of the ITU-T's work on video telephony and multimedia conferencing: after completing standardization on video telephony and video conferencing for ISDN at up to 2 Mbit/s in the H.320 series, the ITU-T took on work on similar multimedia communication over ATM networks (H.310, H.321), over the analog Public Switched Telephone Network (PSTN) using modem technology (H.324), and over the still-born Isochronous Ethernet (H.322). The most widely adopted and hence most promising network infrastructure - and the one bearing the largest difficulties to achieve well-defined Quality of Service - was addressed in the beginning of 1995 in H.323: Local Area Networks, with the focus on IP as network layer protocol. The primary goal was to interface multimedia communication equipment on LANs to the reasonably well-established base on circuit-switched networks.

The initial version of H.323 was approved by the ITU-T about one year later in June 1996, thereby providing a basis on which the industry could converge. The initial focus was clearly on local network environments, as QoS mechanisms for IP-based wide area networks such as the Internet were not well established at this point. In early 1996 Internet-wide deployment of H.323 was already explicitly included in the scope as was the aim to support voice-only applications and, thus, the foundations to use H.323 for IP Telephony were laid. H.323 has continuously evolved towards becoming a technically sound and functionally rich protocol platform for IP telephony applications, the first major additions to this end being included in H.323 version 2 approved by the ITU-T in January 1998. In September 1999, H.323 version 3 was approved by the ITU-T, incorporating numerous further functional and conceptual extensions to enable H.323 to serve as a basis for IP telephony on a global scale and to make it meet requirements in enterprise environments as well.

Scope

As stated before, the scope of H.323 encompasses multimedia communication in IP-based networks, with significant consideration given to gatewaying to circuit-switched networks (in particular to ISDN-based video telephony and to PSTN/ISDN/GSM for voice communication).

Figure 2.1. Scope and Components defined in H.323

Picture showing the scope and the components defined in H.323

H.323 defines a number of functional / logical components as shown in Figure 2.1:

  • Terminal -- Terminals are H.323-capable endpoints, which may be implemented in software on workstations or as stand-alone devices (such as telephones). They are assigned one or more aliases (e.g. a user's name / URI) and/or telephone number(s).
  • Gateway -- Gateways interconnect H.323 entities (such as endpoints, MCUs, or other gateways) to other network/protocol environments (such as the telephone network). They are also assigned one or more aliases and/or telephone number(s). The H.323 series of Recommendations provides detailed specifications for interfacing H.323 to H.320, ISDN/PSTN, and ATM based networks. Recent work also addresses control and media gateway specifications for telephony trunking networks such as SS7/ISUP.
  • Gatekeeper -- The gatekeeper is the core management entity in an H.323 environment. It is, among other things, responsible for access control, address resolution, and H.323 network (load) management and provides the central hook to implement any kind of utilization / access policies. An H.323 environment is subdivided into zones (which may but need not be congruent with the underlying network topology); each zone is controlled by one primary gatekeeper (with optional backup gatekeepers). Gatekeepers may also provide value-add, e.g. act as conferencing bridge or offer supplementary call services.
  • Multipoint Controller (MC) -- A multipoint controller is a logical entity that interconnects call signaling and conference control channels of two or more H.323 entities in a star topology. MCs coordinate the (control aspects of) media exchange between all entities involved in a conference; they also provide the endpoints with participant lists, exercise floor control, etc. MCs may be embedded in any H.323 entity (terminals, gateways, gatekeepers) or implemented as stand-alone entities. They can be cascaded to allow conferences spanning multiple MCs.
  • Multipoint Processor (MP) -- For multipoint conferences with H.323, an optional Multipoint Processor may be used that receives media streams from the individual endpoints, combines them through some mixing/switching technique, and transmits the resulting media streams back to the endpoints.
  • Multipoint Control Unit (MCU) -- In the H.323 world, an MCU simply is a combination of an MC and an MP in a single device. The term originates in the ISDN videoconferencing world where MCUs were needed to create multipoint conferences out of a set of point-to-point connections.

Signaling protocols

H.323 resides - similar to the IETF protocols discussed in the next subsection - on top of the basic Internet Protocols (IP, IP Multicast, TCP, UDP) and makes use of integrated and differentiated services along with resource reservation protocols.

Figure 2.2. H.323 protocol architecture

Picture showing the H.323 protocol architecture

For basic call signaling and conference control interactions with H.323, the aforementioned components communicate using three control protocols:

  • H.225.0 Registration, Admission, and Status (RAS) -- The RAS channel is used for communication between H.323 endpoints and their gatekeeper and for some inter-gatekeeper communication. Endpoints use RAS to register with their gatekeeper, to request permission to utilize system resources, to have addresses of remote endpoints resolved, etc. Gatekeepers use RAS to keep track of the status of their associated endpoints and to collect information about actual resource utilization after call termination. RAS provides mechanisms for user / endpoint authentication and call authorization.
  • H.225.0 Call Signaling -- The call signaling channel is used to signal call setup intention, success, failures, etc. as well as to carry operations for supplementary services (see below). Call signaling messages are derived from Q.931 (ISDN call signaling), as is the protocol; however, simplified procedures and only a subset of the messages are used in H.323. The call signaling channel is used end-to-end between caller and callee and may optionally run through one or more gatekeepers (the call signaling models are later described in the Signaling models section).

    Optimizations: Since version 3, H.225.0 supports the following enhancements:

    • Multiple Calls - To avoid using a dedicated TCP connection for each call, gateways can be built to handle multiple calls on each connection.
    • Maintain Connection - Similar to Multiple Calls, this enhancement reduces the need to open new TCP connections. After the last call has ended, the endpoint may decide to maintain the TCP connection to provide a better call setup time for the next call.

    Both enhancements are primarily used for communication between servers (gatekeepers, MCUs) or gateways. While in theory both mechanisms were possible before, beginning with H.323v3 the messages contain fields to indicate support for the mechanisms.

  • H.245 Conference Control -- The conference control channel is used to establish and control two-party calls (as well as multiparty conferences). Its functionality includes determining possible modes for media exchange (e.g. selecting media encoding formats both parties understand) and configuring actual media streams (including exchanging transport addresses to send media streams to and receive them from). H.245 can be used to carry user input (such as DTMF); it also enables confidential media exchange and defines syntax and semantics for multipoint conference operation (see below). Finally, it provides a number of maintenance messages. This logical channel, too, may run either through one or more gatekeepers or directly between caller and callee (please refer to the Signaling models section for details).

    It should be noted that H.245 is a legacy protocol inherited from the collective work on multimedia conferencing over ATM, PSTN, and other networks. Hence it carries a lot of fields and procedures that do not apply to H.323 but make the protocol specification quite heavyweight.

    Optimizations: The conference control channel is also subject to optimizations. By default it is transported over an exclusive TCP connection, but it may also be tunneled within the signaling connection (H.245 tunneling). Other optimizations deal with the call setup time. The last chance to start an H.245 channel is on receipt of the CONNECT message, which implies that no media is transmitted during the first seconds after the user has accepted the call. H.245 may also start in parallel with the setup of the H.225 call signaling, which is not really a new feature but another way of dealing with H.245. Vendors often call this Early connect or Early media. Since H.323 version 2 it is possible to start a call using a less powerful but sufficient capability exchange by simply offering possible media channels that just have to be accepted. This procedure is called FastConnect or FastStart; it requires fewer round trips and is transported over the H.225 channel. After the FastConnect procedure is finished, or when it fails, the normal H.245 procedures start.
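
The Multiple Calls and Maintain Connection enhancements described above for H.225.0 Call Signaling can be illustrated with a small connection pool. This is only a sketch: the class and method names are hypothetical, no real sockets are opened, and the two boolean flags merely stand in for the support indications that the messages carry since H.323v3.

```python
from dataclasses import dataclass, field


@dataclass
class SignalingConnection:
    """Stand-in for one TCP call signaling connection (no real socket)."""
    remote: str
    multiple_calls: bool        # peer indicated support for Multiple Calls
    maintain_connection: bool   # peer indicated support for Maintain Connection
    calls: set = field(default_factory=set)
    is_open: bool = True


class ConnectionPool:
    def __init__(self):
        self.by_remote = {}   # remote address -> list of connections
        self.opened = 0       # how many connections had to be opened

    def start_call(self, remote, call_id, multiple_calls=True, maintain_connection=True):
        """Reuse an open connection when allowed, otherwise open a new one."""
        connections = self.by_remote.setdefault(remote, [])
        for conn in connections:
            # An idle, maintained connection is always reusable; a busy one
            # only if the peer supports several calls per connection.
            if conn.is_open and (conn.multiple_calls or not conn.calls):
                break
        else:
            conn = SignalingConnection(remote, multiple_calls, maintain_connection)
            connections.append(conn)
            self.opened += 1
        conn.calls.add(call_id)
        return conn

    def end_call(self, conn, call_id):
        """Close the connection after its last call unless it is maintained."""
        conn.calls.discard(call_id)
        if not conn.calls and not conn.maintain_connection:
            conn.is_open = False
```

With both flags set, two calls to the same gateway share one connection and a third call placed after both have ended still reuses it; with both flags cleared, every call opens its own connection.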

A number of extensions to H.323 include mechanisms for more efficient call setup (H.323 Annex E) and reduction of protocol overhead e.g. for simple telephones (SETs, simple endpoint types, H.323 Annex F).

Gatekeeper Discovery and Registration

An H.323 endpoint usually registers with a gatekeeper that provides services like address resolution to the endpoint. There are two possibilities for an endpoint to find its gatekeeper:

  • Multicast discovery - The endpoint sends a gatekeeper request (GRQ) to a well known multicast address (224.0.1.41) and port (1718). Receiving gatekeepers may confirm their responsibility for the endpoint (GCF) or ignore the request.
  • Configuration - The endpoint knows the IP address of the gatekeeper from its configuration. While there is no need to send a gatekeeper request (GRQ) to the preconfigured gatekeeper, some products require this protocol step. If a gatekeeper receives a GRQ via unicast it must either confirm the request or reject it (GRJ).

When trying to discover the gatekeeper via multicast, an endpoint may ask for any gatekeeper or narrow the request by adding a gatekeeper identifier to it. Only gatekeepers that have the requested identifier reply positively (see Figure 2.3).
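
The discovery rules above (a multicast GRQ may be ignored, a unicast GRQ must be answered, and a requested gatekeeper identifier must match) can be condensed into a single decision function. This is a minimal sketch, not part of the Recommendation: the function and parameter names are hypothetical, and `accepts_endpoint` stands in for whatever policy a gatekeeper applies.

```python
GK_MULTICAST_ADDR = "224.0.1.41"   # well-known discovery address from the text
GK_DISCOVERY_PORT = 1718           # well-known discovery port from the text


def answer_grq(own_id, accepts_endpoint, requested_id=None, via_multicast=True):
    """Return "GCF", "GRJ", or None (stay silent) for an incoming GRQ."""
    # A GRQ without identifier addresses any gatekeeper.
    id_matches = requested_id is None or requested_id == own_id
    if id_matches and accepts_endpoint:
        return "GCF"
    # A multicast request may simply be ignored; a unicast one must be rejected.
    return None if via_multicast else "GRJ"
```

For example, `answer_grq("gk1", True, requested_id="gk2")` stays silent on multicast, while the same request received via unicast is answered with a GRJ.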

Figure 2.3. Discovery and registration process

Picture showing the message flow of gatekeeper discovery and registration.

After the endpoint knows the location of the gatekeeper it tries to register itself (RRQ). Such a registration includes (among other information):

  • The addresses of the endpoint - For a terminal this may be the user ids or telephone numbers. An endpoint may have more than one address. In theory it is possible that addresses belong to different users to enable multiple users to share a single phone - in practice this depends on the phones and gatekeeper implementation.
  • Prefixes - If the registering endpoint is a gateway it may register number prefixes instead of addresses.
  • Time to live - An endpoint may request how long the registration shall last. This value can be overridden by gatekeeper policies.

The gatekeeper checks the registration information and confirms the (possibly modified) values (RCF). It may also reject a registration, e.g. because of invalid addresses. In case of a confirmation the gatekeeper assigns a unique identifier to the endpoint that shall be used in subsequent requests to indicate that the endpoint is already registered.
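
A gatekeeper's handling of a registration request might look roughly like the following sketch. The class, the dictionary layout of the replies, the `ep-N` identifier format, and the validity check are all assumptions made for illustration; only the behaviour (policy may shorten the time to live, invalid addresses are rejected, a confirmation carries a unique endpoint identifier) is taken from the text.

```python
import itertools


class Gatekeeper:
    def __init__(self, max_ttl=300):
        self.max_ttl = max_ttl            # policy limit in seconds (assumed value)
        self._ids = itertools.count(1)
        self.endpoints = {}               # endpoint identifier -> registration data

    def handle_rrq(self, aliases, requested_ttl=None):
        """Answer a registration request with an RCF or RRJ (as plain dicts)."""
        # Hypothetical validity check: at least one alias, none of them empty.
        if not aliases or any(not alias.strip() for alias in aliases):
            return {"type": "RRJ", "reason": "invalid address"}
        # The gatekeeper policy may override the requested time to live.
        ttl = self.max_ttl if requested_ttl is None else min(requested_ttl, self.max_ttl)
        endpoint_id = f"ep-{next(self._ids)}"
        self.endpoints[endpoint_id] = {"aliases": list(aliases), "ttl": ttl}
        return {"type": "RCF", "endpoint_id": endpoint_id, "ttl": ttl}
```

An endpoint asking for a one-hour registration from a gatekeeper limited to 300 seconds thus receives an RCF with the shortened value and has to refresh accordingly.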

Addresses and registrations

H.323 distinguishes several address types. The most commonly used, derived from the PSTN world, is the Dialed digit type, which defines a number as dialed by the endpoint. It doesn't include further information (e.g. about the dialplan) and needs to be interpreted by the server. The server might convert the dialed number into a Party Number that includes information about the type of number and the dialplan.

To provide name dialing, H.323 supports H.323-IDs, which represent names or e-mail-like addresses, and the more general URL-ID, which represents any kind of URL.

Unlike in SIP, in H.323 an address can be registered by only one endpoint (per zone), so a call to that address resolves to a single endpoint. Calling multiple destinations simultaneously requires a gatekeeper that actively maps a single address to multiple different addresses and tries to contact them.
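
The one-endpoint-per-address rule, and the gatekeeper-side mapping needed to reach several destinations, can be sketched as a small per-zone table. The class and the `hunt_groups` mapping are hypothetical names chosen for this illustration.

```python
class ZoneAddressTable:
    def __init__(self):
        self.owner = {}         # alias -> endpoint identifier (one per zone)
        self.hunt_groups = {}   # alias -> list of aliases, configured on the gatekeeper

    def register(self, endpoint_id, alias):
        """Accept the alias unless another endpoint already holds it."""
        current = self.owner.get(alias)
        if current is not None and current != endpoint_id:
            return False
        self.owner[alias] = endpoint_id
        return True

    def resolve(self, alias):
        """Return the endpoints to contact for a called alias."""
        targets = self.hunt_groups.get(alias, [alias])
        return [self.owner[target] for target in targets if target in self.owner]
```

A plain alias always resolves to at most one endpoint; only an alias that the gatekeeper itself maps to several others (here via `hunt_groups`) yields multiple destinations.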

Updating registrations

A registration expires after a defined time and must therefore be refreshed. This can be done by simply sending another registration request including the assigned endpoint identifier. To reduce the overhead of regular registrations, H.323 supports KeepAlive registrations that contain just the previously assigned endpoint identifier. Of course these registrations may only be sent if the registration information (esp. the addresses) is unchanged.

Especially for endpoints that register a large number of addresses (which would exceed the size of a UDP packet), H.323 version 4 supports Additive Registration, a mechanism that allows an endpoint to send multiple registration requests (RRQ) in which the addresses don't replace existing registrations but are added to them.
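
The difference between a full, a KeepAlive, and an additive registration can be shown with a few lines operating on a plain dictionary. The function name and its flags are hypothetical; rejecting a KeepAlive from an unknown endpoint is an assumption of this sketch.

```python
def apply_rrq(registrations, endpoint_id, aliases=(), keep_alive=False, additive=False):
    """Update `registrations` (endpoint identifier -> set of aliases).

    Returns "RCF" or "RRJ".
    """
    known = endpoint_id in registrations
    if keep_alive:
        # A KeepAlive only refreshes an existing registration (assumed: else reject).
        return "RCF" if known else "RRJ"
    if additive and known:
        registrations[endpoint_id] |= set(aliases)   # add to existing addresses
    else:
        registrations[endpoint_id] = set(aliases)    # replace the registration
    return "RCF"
```

A gateway with more addresses than fit into one UDP packet would send a first RRQ followed by further additive RRQs, and afterwards refresh the whole set with KeepAlive registrations.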

Signaling models

The call signaling messages and the H.245 control messages may be exchanged either end-to-end between caller and callee or through a gatekeeper. Depending on the role the gatekeeper plays in the call signaling and in the H.245 signaling, the H.323 specification foresees three different types of signaling models:

  • Direct signaling, with this signaling model only H.225.0 RAS messages are routed through the Gatekeeper while the other logical channel messages are directly exchanged between the two endpoints;
  • Gatekeeper routed call signaling, with this signaling model H.225.0 RAS and H.225.0 Call signaling messages are routed through the Gatekeeper while the H.245 Conference control messages are directly exchanged between the two endpoints;
  • Gatekeeper routed H.245 control, with this signaling model H.225.0 RAS, H.225.0 Call signaling and H.245 Conference control messages are routed through the Gatekeeper and only the media streams are directly exchanged between the two endpoints.
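
The three signaling models in the list above differ only in which logical channels pass through the Gatekeeper. A minimal lookup table makes this explicit; the model and channel identifiers are names invented for this sketch.

```python
# Channels that are routed through the Gatekeeper in each signaling model.
ROUTED_VIA_GATEKEEPER = {
    "direct": {"RAS"},
    "gatekeeper_routed_call_signaling": {"RAS", "CALL_SIGNALING"},
    "gatekeeper_routed_h245_control": {"RAS", "CALL_SIGNALING", "H245"},
}


def next_hop(model, channel):
    """Tell where an endpoint sends a message of the given channel."""
    if channel == "MEDIA":
        return "remote endpoint"   # media streams always flow directly
    if channel in ROUTED_VIA_GATEKEEPER[model]:
        return "gatekeeper"
    return "remote endpoint"
```

For instance, `next_hop("gatekeeper_routed_call_signaling", "H245")` yields the remote endpoint, whereas the same channel under `"gatekeeper_routed_h245_control"` goes to the Gatekeeper.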

In the following sub-sections we detail each signaling model. The figures in this section apply both to the use of a single Gatekeeper and to the use of a "Gatekeeper network". Since the signaling model is determined by the configuration of the endpoint's Gatekeeper and applies to all the messages that Gatekeeper handles, the extension to the multiple Gatekeeper case is straightforward: simply apply the definition of the signaling model given in the list above to each Gatekeeper involved. The only exception is the location of zone external targets, which is described later in the Locating zone external targets section; we do not show that message exchange in any of the figures of this section, as it remains confined to the ellipse in which the H.323 Gatekeeper is depicted. Please note that the signaling model sub-sections do not cover call termination; please refer to the Communication phases section for details.

Direct signaling model

The Direct signaling model is depicted in Figure 2.4. In this model the H.225.0 Call signaling and H.245 Conference control messages are exchanged directly between the call endpoints. As shown in the figure, the communication starts with an ARQ (Admission ReQuest) message sent by the caller (which may be either a Terminal or a Gateway) to the Gatekeeper. With the ARQ message the endpoint asks the Gatekeeper for permission to access the packet-based network; the Gatekeeper either grants the request with an ACF (Admission ConFirm) or denies it with an ARJ (Admission ReJect). If an ARJ is issued, the call is terminated. After this first step the Call signaling part of the call begins with the transmission of the SET UP message from the caller to the callee. The caller retrieves the transport address to which the SET UP message (and all other H.225.0 Call signaling messages) is sent from the "destCallSignalAddress" field carried inside the ACF received; in the Direct signaling model it is the address of the destination endpoint. Upon receiving the SET UP message the callee starts its own H.225.0 RAS procedure with the Gatekeeper; if it is successful, a CONNECT message is sent back to the caller to indicate acceptance of the call. Before sending the CONNECT message, the callee may send two other messages to the caller (they are not depicted in the figure, which shows only mandatory messages):

  • ALERTING message, this message may be sent by the called user to indicate that called user alerting has been initiated (in everyday terms, the "phone is ringing");
  • CALL PROCEEDING message, this message may be sent by the called user to indicate that requested call establishment has been initiated and no more call establishment information will be accepted.

Figure 2.4. Direct signaling model

Direct signaling model

The CONNECT message closes the H.225.0 Call signaling part of the call and makes the terminals start the H.245 Conference control part. In this call model the H.245 Conference control messages are exchanged directly between the two endpoints (the correct "h245Address" is retrieved from the CONNECT message itself). The procedures carried out on the H.245 Conference control channel are used to:

  • allow the exchange of audiovisual and data capabilities, with the TERMINAL CAPABILITY messages;
  • request the transmission of a particular audiovisual and data mode, with the LOGICAL CHANNEL SIGNALLING messages;
  • manage the logical channels used to transport the audiovisual and data information;
  • establish which terminal is the master terminal and which is the slave terminal for the purposes of managing logical channels, with the MASTER SLAVE DETERMINATION messages;
  • carry various control and indication signals;
  • control the bit rate of individual logical channels and the whole multiplex, with the MULTIPLEX TABLE SIGNALLING messages;
  • measure the round trip delay, from one terminal to the other and back, with the ROUND TRIP DELAY messages.

Once the H.245 Conference control messages are exchanged the two endpoints have all the necessary information to open the media streams.
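
The mandatory message exchange of the Direct signaling model can be written down as a simple trace generator. This is a sketch under assumptions: the function name and the tuple layout are hypothetical, the optional ALERTING and CALL PROCEEDING messages are left out, and clearing the call with RELEASE COMPLETE after the callee's admission is denied is an assumption that the text above does not spell out.

```python
def direct_call_flow(caller_admitted=True, callee_admitted=True):
    """Return the mandatory messages as (sender, receiver, message) tuples."""
    flow = [("caller", "gatekeeper", "ARQ")]
    if not caller_admitted:
        flow.append(("gatekeeper", "caller", "ARJ"))    # the call is terminated
        return flow
    flow.append(("gatekeeper", "caller", "ACF"))        # carries destCallSignalAddress
    flow.append(("caller", "callee", "SET UP"))
    flow.append(("callee", "gatekeeper", "ARQ"))
    if not callee_admitted:
        flow.append(("gatekeeper", "callee", "ARJ"))
        # Assumption: the callee clears the call towards the caller.
        flow.append(("callee", "caller", "RELEASE COMPLETE"))
        return flow
    flow.append(("gatekeeper", "callee", "ACF"))
    flow.append(("callee", "caller", "CONNECT"))        # carries h245Address
    flow.append(("caller", "callee", "H.245 Conference control"))
    return flow
```

In every trace the Gatekeeper only ever sees RAS messages (ARQ, ACF, ARJ), which is exactly what distinguishes the Direct signaling model from the two gatekeeper routed models.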

Gatekeeper routed call signaling model

The Gatekeeper routed call signaling model is depicted in Figure 2.5. In this model the H.245 Conference control messages are exchanged directly between the call endpoints. As in every call, the communication starts with an ARQ (Admission ReQuest) message sent by the caller to its Gatekeeper. With the ARQ message the endpoint asks the Gatekeeper for permission to access the packet-based network; the Gatekeeper either grants the request with an ACF (Admission ConFirm) or denies it with an ARJ (Admission ReJect). After this first step the Call signaling part of the call begins with the transmission of the SET UP message from the caller to its Gatekeeper. The caller retrieves the transport address to which the SET UP message (and all other H.225.0 Call signaling messages) is sent from the "destCallSignalAddress" field carried inside the ACF received; in the Gatekeeper routed call signaling model it is the address of the Gatekeeper itself. The SET UP message is then forwarded by the Gatekeeper (or by the "Gatekeeper network") to the called endpoint. Upon receiving the SET UP message the callee starts its own H.225.0 RAS procedure with its Gatekeeper; if it is successful, a CONNECT message is sent to indicate acceptance of the call. Because of the call model, this message is also sent to the called endpoint's Gatekeeper, which is in charge of forwarding it to the caller endpoint (either directly or using the "Gatekeeper network"). Before sending the CONNECT message, the callee may send two other messages to its Gatekeeper (they are not depicted in the figure, which shows only mandatory messages):

  • ALERTING message, this message may be sent by the called user to indicate that called user alerting has been initiated (in everyday terms, the "phone is ringing");
  • CALL PROCEEDING message, this message may be sent by the called user to indicate that requested call establishment has been initiated and no more call establishment information will be accepted.

Figure 2.5. Gatekeeper Routed call signaling model

Gatekeeper Routed call signaling model

The two optional messages listed above are then forwarded by the Gatekeeper (or by the "Gatekeeper network") to the caller. After receiving the CONNECT message, the caller starts the H.245 Conference control channel procedures directly with the callee (the correct "h245Address" is retrieved from the CONNECT message itself). The purposes of the H.245 Conference control channel procedures are the same as detailed above; please refer to the Direct signaling model section for details.

Gatekeeper routed H.245 control model

The Gatekeeper routed H.245 control model is depicted in Figure 2.6. In this model only the media streams are exchanged directly between the call endpoints. As in every call, the communication starts with an ARQ (Admission ReQuest) message sent by the caller to its Gatekeeper. With the ARQ message the endpoint asks the Gatekeeper for permission to access the packet-based network; the Gatekeeper either grants the request with an ACF (Admission ConFirm) or denies it with an ARJ (Admission ReJect). After this first step the Call signaling part of the call begins with the transmission of the SET UP message from the caller to its Gatekeeper. The caller retrieves the transport address to which the SET UP message (and all other H.225.0 Call signaling messages) is sent from the "destCallSignalAddress" field carried inside the ACF received; in the Gatekeeper routed H.245 control model it is the address of the Gatekeeper itself. The SET UP message is then forwarded by the Gatekeeper (or by the "Gatekeeper network") to the called endpoint. Upon receiving the SET UP message the callee starts its own H.225.0 RAS procedure with its Gatekeeper; if it is successful, a CONNECT message is sent to indicate acceptance of the call. Because of the call model, this message is also sent to the called endpoint's Gatekeeper, which is in charge of forwarding it to the caller endpoint (either directly or using the "Gatekeeper network"). Before sending the CONNECT message, the callee may send two other messages to its Gatekeeper (they are not depicted in the figure, which shows only mandatory messages):

  • ALERTING message, this message may be sent by the called user to indicate that called user alerting has been initiated (in everyday terms, the "phone is ringing");
  • CALL PROCEEDING message, this message may be sent by the called user to indicate that requested call establishment has been initiated and no more call establishment information will be accepted.

Figure 2.6. Gatekeeper Routed H.245 control model

Gatekeeper Routed H.245 control model

The two optional messages listed above are then forwarded by the Gatekeeper (or by the "Gatekeeper network") to the caller. After receiving the CONNECT message, the caller starts the H.245 Conference control channel procedures with its Gatekeeper (the correct "h245Address" is retrieved from the CONNECT message itself). All the H.245 channel messages are then exchanged by the endpoints with their Gatekeeper (or Gatekeepers); it is the Gatekeeper (or "Gatekeeper network") that takes care of forwarding them to the remote endpoint, as foreseen by the Gatekeeper routed H.245 control model. The purposes of the H.245 Conference control channel procedures are the same as detailed above; please refer to the Direct signaling model section for details.

Communication Phases

An H.323 communication can be divided into five phases:

  • Call set up;
  • Initial communication and capability exchange;
  • Establishment of audiovisual communication;
  • Call services;
  • Call termination.

Call set up

Recommendation H.225.0 defines the Call set up messages and procedures detailed here. The Recommendation foresees that requests for bandwidth reservation should take place at the earliest possible phase. Unlike in other protocols, there is no explicit synchronization between two endpoints during the call setup procedure (two endpoints can send a SET UP message to each other at exactly the same time). Synchronization problems during the SET UP message exchange are resolved by the application itself: applications not supporting multiple simultaneous calls should issue a busy signal whenever they have an outstanding SET UP message, while applications supporting multiple simultaneous calls should issue a busy signal only to the endpoint to which they have sent an outstanding SET UP message. Moreover, an endpoint shall be capable of sending ALERTING messages. Alerting means that the called party has been alerted of an incoming call ("phone ringing" in the language of the old telephony). Only the ultimate called endpoint shall originate the ALERTING message, and only when the application has already alerted the user. If a Gateway is involved, the Gateway shall send ALERTING when it receives a ring indication from the Switched Circuit Network (SCN). Sending an ALERTING message is not required if an endpoint can respond to a SET UP message with a CONNECT, CALL PROCEEDING, or RELEASE COMPLETE within 4 seconds. After successfully sending a SET UP message, an endpoint can expect to receive an ALERTING, CONNECT, CALL PROCEEDING, or RELEASE COMPLETE message within 4 seconds. Finally, to keep the meaning of the CONNECT message consistent between packet-based networks and circuit-switched networks, the CONNECT message should be sent only if it is certain that the capability exchange will successfully take place and a minimum level of communications can be performed.
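
The rules for simultaneous SET UP messages and for the 4 second response window can be captured in two small helper functions. Both are sketches with hypothetical names; `outstanding_setup_to` is assumed to be the set of endpoints to which the application has sent a SET UP message that has not yet been answered.

```python
SETUP_RESPONSE_LIMIT = 4.0   # seconds, as stated in the text


def should_signal_busy(supports_multiple_calls, outstanding_setup_to, incoming_from):
    """Decide whether an incoming SET UP message is answered with busy."""
    if not outstanding_setup_to:
        return False                      # no collision possible
    if not supports_multiple_calls:
        return True                       # any outstanding SET UP means busy
    # Busy only towards the endpoint we are currently calling ourselves.
    return incoming_from in outstanding_setup_to


def alerting_required(seconds_until_final_response):
    """ALERTING may be skipped if another response follows within the limit."""
    return seconds_until_final_response > SETUP_RESPONSE_LIMIT
```

A multi-call application that is calling endpoint B and receives a SET UP message from endpoint C therefore accepts it, while a SET UP message arriving from B itself is answered with a busy signal.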

The Call set up phase may be realized in different ways; basically we can identify the following kinds of call set up:

  • Basic call setup when neither endpoint is registered, in this call set up the two endpoints communicate directly;
  • Both endpoints registered to the same gatekeeper, in this call set up the communication is decided by the signaling model configured on the Gatekeeper;
  • Only calling endpoint has gatekeeper, in this call set up only the caller sends messages to the Gatekeeper depending on the signaling models configured while the callee sends the messages directly to the caller endpoint;
  • Only called endpoint has gatekeeper, in this call set up only the callee sends messages to the Gatekeeper depending on the signaling model configured while the caller sends the messages directly to the called endpoint;
  • Both endpoints registered to different gatekeepers, each of the two endpoints communicates with its Gatekeeper depending on the signaling model configured, additional H.225.0 RAS messages may be exchanged between the gatekeepers in order to retrieve location information (see Locating zone external targets section for more details);
  • Call set up with Fast Connect procedure, in this call set up the media channels are established using the "Fast Connect" procedure. The Fast Connect procedure speeds up the establishment of a basic point-to-point call (only one round-trip message exchange is needed), enabling immediate media stream delivery upon call connection. The calling endpoint initiates the Fast Connect procedure by sending a SETUP message containing the fastStart element (to signal that it is going to use the Fast Connect procedure). This element contains, among other things, a sequence of all the parameters necessary to immediately open the channels and begin transferring media on them. The called endpoint may refuse the Fast Connect procedure (either because it wants to use features that require H.245 or because it does not implement the procedure). The Fast Connect procedure may be refused with any H.225.0 Call signaling message up to and including the CONNECT message. Refusing the Fast Connect procedure (or not initiating it) requires that H.245 procedures be used for capabilities exchange and opening of media channels. Moreover, the Fast Connect procedure is relevant to H.323/SIP gatewaying (further details can be found in Chapter 4);
  • Call setup via gateways, when a gateway is involved the call setup between it and the network endpoint is the same as the endpoint-to-endpoint call set up;
  • Call setup with an MCU, when an MCU is involved all endpoints exchange call signalling with the MCU (and with the interested Gatekeepers if any). No changes are foreseen between an endpoint and the MCU call set up since it proceeds the same as the endpoint-to-endpoint;
  • Broadcast call setup, this kind of call set up follows the procedures defined in Recommendation H.332.
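
The Fast Connect item in the list above describes an offer that the called endpoint either accepts or refuses. The following sketch shows one way the callee's side of that decision might look. The function name and the data layout are hypothetical, the codec names are only illustrative values, and picking the first supported proposal per media type is an assumption of this example rather than a rule from the Recommendation.

```python
def answer_fast_start(offered, supported):
    """Pick one offered channel per media type, or refuse Fast Connect.

    `offered` is a list of (media, codec) proposals in the caller's order of
    preference; `supported` maps a media type to the codecs the callee knows.
    Returns a dict media -> codec, or None if nothing can be accepted, in
    which case the normal H.245 procedures have to be used.
    """
    accepted = {}
    for media, codec in offered:
        if media not in accepted and codec in supported.get(media, ()):
            accepted[media] = codec
    return accepted or None
```

For example, if the caller offers G.711 and G.729 audio plus H.261 video to a callee that only knows G.729, the answer is `{"audio": "G.729"}`; a callee that supports none of the proposals returns `None` and falls back to H.245.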

Initial communication and capability exchange

After exchanging call setup messages, the endpoints shall, if they plan to use H.245, establish the H.245 Control Channel. The H.245 Control Channel is used for the capability exchange and to open the media channels. The H.245 Control Channel procedures shall either not be started or be closed if CONNECT does not arrive (an H.245 Control Channel can also be opened on reception of ALERTING or CALL PROCEEDING messages) or if an endpoint sends RELEASE COMPLETE. H.323 endpoints shall support the capabilities exchange procedure of H.245. The H.245 TERMINALCAPABILITYSET message is used for exchanging endpoint system capabilities. This message shall be the first H.245 message sent. H.323 compliant endpoints must also support the master-slave determination procedure of H.245. In cases where multipoint controller (MC) capability is present in more than one endpoint, the master-slave determination is used for determining which MC will play an active role. The H.245 Control Channel procedure also provides master-slave determination for opening bi-directional channels for data. After Terminal Capability Exchange has been initiated, the master-slave determination procedure (consisting of either MASTERSLAVEDETERMINATION or MASTERSLAVEDETERMINATIONACK) has to be started as the first H.245 Conference control procedure. Upon failure of the initial capability exchange or master-slave determination procedures, a maximum of two retries shall be performed before the endpoint passes to the Call Termination phase. Normally, after successful completion of the requirements of this phase, the endpoints shall proceed directly to the Establishment of audiovisual communication phase.
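
The retry rule at the end of this phase (one initial attempt plus at most two retries, then Call Termination) can be sketched as follows. The function name is hypothetical, and `procedure` is assumed to be any callable that runs the capability exchange or master-slave determination and returns True on success.

```python
def initial_h245_phase(procedure, max_retries=2):
    """Run `procedure` once plus up to `max_retries` retries.

    Returns the name of the phase the endpoint enters next.
    """
    for _ in range(1 + max_retries):
        if procedure():
            return "establishment of audiovisual communication"
    return "call termination"
```

A procedure that succeeds on its third attempt still lets the call proceed, whereas one that fails three times in a row sends the endpoint to the Call Termination phase.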

Encapsulation of H.245 messages within H.225.0 Call signaling messages

Encapsulation of H.245 messages inside H.225.0 Call signaling messages, instead of establishing a separate H.245 channel, is possible in order to save resources, synchronize call signalling and control, and reduce call setup time. This process is called "encapsulation" or "tunneling" of H.245 messages. This procedure allows the terminal to copy the encoded H.245 message, using one structure, into the data of the Call Signalling Channel. If tunneling is used, any H.225.0 Call signaling message may contain one or more H.245 messages. If no H.225.0 Call signaling message needs to be sent at the time an H.245 message has to be transmitted, a FACILITY message shall be sent, with the appropriate fields inside indicating the reason for the message.
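
The choice of carrier message for tunneled H.245 data can be illustrated with plain dictionaries. This is a simplified sketch: real messages are ASN.1 encoded, the function name is hypothetical, and the dictionary keys `h245Tunneling` and `h245Control` are used here only as illustrative labels for the tunneling indication and the list of encoded H.245 messages.

```python
def tunnel_h245(h245_messages, pending=None):
    """Attach encoded H.245 messages to a call signaling message.

    `pending` is a call signaling message that is about to be sent anyway;
    if there is none, a FACILITY message is created as the carrier.
    """
    carrier = dict(pending) if pending else {"type": "FACILITY"}
    carrier["h245Tunneling"] = True
    # One call signaling message may carry one or more H.245 messages.
    carrier["h245Control"] = list(carrier.get("h245Control", [])) + list(h245_messages)
    return carrier
```

Calling `tunnel_h245([b"..."])` without a pending message yields a FACILITY carrier, while passing a pending CONNECT message returns a copy of it with the H.245 payload attached.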

Establishment of audiovisual communication

The Establishment of audiovisual communication shall follow the procedures of Recommendation H.245. Logical channels for the various information streams are opened using the H.245 procedures. The audio and video streams are transported using an unreliable protocol, while data communications are transported using a reliable protocol. The transport address that the receiving endpoint has assigned to a specific logical channel (audio, video or data) is carried in the OPENLOGICALCHANNELACK message (an example is given in Figure 2.7). That transport address is used to transmit the information stream associated with that logical channel.

Figure 2.7. OPENLOGICALCHANNELACK message content

Call services

When the call is active, the terminal may request additional call services; of these, we describe the bandwidth change services and the supplementary services. With regard to bandwidth change services, the endpoints or the Gatekeeper (if involved) may, at any time during a conference, request an increase or decrease in the call bandwidth. If the aggregate bit rate of all transmitted and received channels does not exceed the current call bandwidth, an endpoint may change the bit rate of a logical channel without requesting a bandwidth change. After requesting a bandwidth change, the endpoint shall wait for confirmation (which usually comes from the Gatekeeper) before actually changing the bit rate. A change of call bandwidth is requested using a BANDWIDTH CHANGE REQUEST (BRQ) message. If the request is not accepted, a BANDWIDTH CHANGE REJECT (BRJ) message is returned to the endpoint. If the request is accepted, a BANDWIDTH CHANGE CONFIRM (BCF) is sent back to the endpoint. Support for supplementary services is optional. The H.450-Series of Recommendations describes a method of providing supplementary services in the H.323 environment. Figure 2.8 lists some of the supplementary services defined so far and their Recommendation numbers.
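The bandwidth rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name `needs_bandwidth_request` and the use of plain integers in bit/s are hypothetical and not taken from H.225.0.

```python
def needs_bandwidth_request(channel_rates, old_rate, new_rate, call_bandwidth):
    """Return True if changing one logical channel requires a BRQ first.

    channel_rates  -- current bit rates (bit/s) of all transmitted and
                      received channels of the call
    old_rate       -- current bit rate of the channel to be changed
    new_rate       -- intended bit rate of that channel
    call_bandwidth -- bandwidth currently granted to the call
    """
    aggregate_after_change = sum(channel_rates) - old_rate + new_rate
    # Within the granted call bandwidth, the endpoint may change the
    # bit rate without asking; beyond it, a BRQ must be confirmed (BCF).
    return aggregate_after_change > call_bandwidth
```

If the function returns True, the endpoint would send a BRQ and change the bit rate only after a BCF arrives; on a BRJ it would keep the old rate.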

Figure 2.8. Supplementary services of the H.450-Series

Call termination

A call may be terminated either by an endpoint or by the Gatekeeper. Call termination follows this procedure:

  • video should be terminated after a complete picture and then all logical channels for video closed;
  • data transmission should be terminated and then all logical channels for data closed;
  • audio transmission should be terminated and then all logical channels for audio closed;
  • the H.245 ENDSESSIONCOMMAND message (H.245 Control Channel) should be sent by the endpoint/Gatekeeper; this message indicates that the call has to be disconnected, and the H.245 message transmission should then be terminated;
  • the ENDSESSIONCOMMAND message should be sent back to the sending endpoint and then the H.245 Control Channel should be closed;
  • a RELEASE COMPLETE message should be sent, closing the Call Signaling channel if this is still open.

An endpoint that receives an ENDSESSIONCOMMAND message does not need to receive it back again after replying to it in order to clear a call. Terminating a call within a conference does not mean that the whole conference needs to be terminated. To terminate a conference, an H.245 message (DROPCONFERENCE) is used; the MC should then terminate the calls with the endpoints as described above.

A call may be terminated differently depending on the Gatekeeper presence and on the party issuing the call termination:

  • Call clearing without a Gatekeeper - No further action is required.
  • Call clearing with a Gatekeeper - The Gatekeeper needs to be informed about the call termination. After RELEASE COMPLETE is sent, an H.225.0 DISENGAGE REQUEST (DRQ) message should be sent by each endpoint to its Gatekeeper. A DISENGAGE CONFIRM (DCF) message is sent back to the endpoints to acknowledge the reception.
  • Call clearing issued by the Gatekeeper - A call may be terminated by the Gatekeeper by sending a DRQ to an endpoint. The endpoint should immediately follow the call termination procedure described above, up to and including the RELEASE COMPLETE message, and then reply to the Gatekeeper with a DCF message. The other endpoint should follow the same call termination procedure upon receiving the ENDSESSIONCOMMAND message. Moreover, if a multipoint conference is taking place, the Gatekeeper should send a DRQ to each endpoint in the conference in order to close the entire conference.
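The ordering of the termination steps and the three Gatekeeper variants can be summarized in a short sketch. The function `termination_steps` and its step descriptions are hypothetical names chosen for illustration; the sketch only lists the actions of one endpoint in order and does not implement any H.225.0 or H.245 encoding.

```python
def termination_steps(gatekeeper_present=False, initiated_by_gatekeeper=False):
    """Return the ordered actions one endpoint performs to clear a call."""
    steps = [
        "stop video after a complete picture; close video logical channels",
        "stop data transmission; close data logical channels",
        "stop audio transmission; close audio logical channels",
        "send H.245 ENDSESSIONCOMMAND; stop sending H.245 messages",
        "receive peer ENDSESSIONCOMMAND; close H.245 Control Channel",
        "send RELEASE COMPLETE; close Call Signaling channel if still open",
    ]
    if initiated_by_gatekeeper:
        # The Gatekeeper already sent a DRQ, so the endpoint answers it.
        steps.append("send DCF to Gatekeeper")
    elif gatekeeper_present:
        # The endpoint informs its Gatekeeper and waits for the DCF.
        steps.append("send DRQ to Gatekeeper; wait for DCF")
    return steps
```

Without a Gatekeeper the list ends with RELEASE COMPLETE, matching the "no further action is required" case.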

Locating zone external targets

Calling an address that is registered at the same gatekeeper the caller is registered with is simple - the gatekeeper just needs to look up its internal tables to resolve the address. It is more complex if the destination is registered with another gatekeeper. While Chapter 7, Global telephony integration, covers this topic in more detail, the most basic mechanism H.323 provides is explained here.

A gatekeeper may explicitly request the resolution of an address from other gatekeepers. On receipt of a request to call an address that is not registered with it, the gatekeeper can send out a location request (LRQ) to other gatekeepers (see Figure 2.9). The receiving gatekeeper - assuming it knows the address - will reply with the TSAP (IP+Port) of either the requested address or its own call signaling TSAP.

Figure 2.9. External address resolution using LRQs

A picture showing the message flow of an endpoint initiating an ARQ, resulting in an LRQ/LCF between two gatekeepers, an ACF reply and the start of the call between two endpoints.

A location request can be sent via Unicast or Multicast. If sent via Multicast only the gatekeeper that can resolve the address shall reply. If a gatekeeper receives a unicast LRQ it shall either confirm or reject the request.

This mechanism can be used to maintain a list of peer gatekeepers that are asked in parallel or sequentially. It is also possible to assign a domain suffix or number prefix to each peer, so that an address with the number prefix of an institution will result in a request to the gatekeeper of that institution. By defining default peers one could also build a hierarchy of gatekeepers (again, see Chapter 7, Global telephony integration, for more details).
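A minimal sketch of prefix-based peer selection might look as follows. The function name, the table layout and the host names are hypothetical, and preferring the longest matching prefix is an assumption made for this example rather than something H.323 prescribes.

```python
def select_gatekeepers(number, prefix_table, default_peers):
    """Choose the peer gatekeepers that should receive an LRQ for `number`.

    prefix_table  -- dict mapping a number prefix to a peer gatekeeper
    default_peers -- peers to ask when no prefix matches (e.g. a parent
                     gatekeeper in a hierarchy)
    """
    matches = [(prefix, gk) for prefix, gk in prefix_table.items()
               if number.startswith(prefix)]
    if matches:
        # Assumption: the most specific (longest) prefix wins.
        _, gatekeeper = max(matches, key=lambda item: len(item[0]))
        return [gatekeeper]
    return list(default_peers)
```

A gatekeeper could then send a unicast LRQ to each returned peer, either in parallel or one after another.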

Sample Call Scenario

Figure 2.10 depicts an example of inter-zone call setup using H.323. The caller in zone A contacts its gatekeeper to ask for permission to call the callee in zone B (1). The gatekeeper of zone A confirms this request and provides the caller with the address of zone B's gatekeeper (2). The caller establishes a call signaling channel (and subsequently / in parallel the conference control channel) to the gatekeeper of zone B (3), which determines the location of the callee and forwards the request to the callee (4).

Figure 2.10. Sample H.323 Call Setup Scenario

Picture showing the signaling flow of H.323 messages in a simple inter-zone call.

The callee explicitly confirms with its gatekeeper that it is allowed to accept the call (5, 6) and, if yes, alerts the recipient of the call, returns an alerting indication and (once the receiving user picks up the call) eventually an indication of successful connection setup back to the caller (7, 8). In (parallel to) this exchange, capability negotiation and media stream configuration take place. When the setup has completed, both parties start sending media streams directly to each other.

Additional (Call) Services

As known from daily interaction with PBXes, telephony service comprises far more than just call setup and teardown: n-way conferencing and various supplementary services (such as call transfer, call waiting, etc.) are available. Similar features - at least the more commonly known and used ones - need to be provided by IP telephony systems as well to be accepted by customers. Additional call services in H.323 can be grouped into three categories:

  • Conferencing -- H.323 inherently supports multipoint tightly-coupled conferencing - i.e. conferences with access control, optional support for conference chairs, and close synchronization of conference state among all participants - from the outset: through the concept of a Multipoint Controller and an optional Multipoint Processor. While control is centralized in the MC, data exchange may be either via IP multicast, multi-unicast (i.e. peer-wise fan-out between endpoints without MP), or through an MP. The distribution mode may be selected per-media and per endpoint peer and is controlled by the MC.
  • "Broadcast conferencing" -- H.323 also provides an interface to support large loosely-coupled conferences as are frequently used in the Mbone to multicast seminars, events, etc. In this case, the MC defines a session description (using the Session Description Protocol, SDP, see below) for the H.323 media sessions (which have to operate using multicast) and announces this description by some means (e.g. the Session Announcement Protocol, SAP). Details are defined in ITU-T H.332.
  • Supplementary Services -- H.323 provides a variety of supplementary services with additional ones continuously being defined. While some services can be accomplished using the basic H.323 specifications, the H.450.x Recommendations define a framework (derived from QSIG, the ECMA/ISO/ETSI standard for supplementary service signaling in PBXes) and a number of services (call transfer, call diversion, call hold, call park & pickup, call waiting, message waiting indication, call completion).

Further extensions for supplementary services and other functional enhancements are on the way. In particular, an HTTP-based extension framework is being defined at the time of writing to enable rapid introduction of new services without the need for standardization.

H.235 Security

The H.235 recommendation defines security mechanisms for H.323. These include:

  • Authentication - Authentication can be achieved by using a shared secret (password) or digital signatures. The RAS messages include a token that was generated using either the shared secret or the signature. A receiving entity authenticates the sender by comparing the received token with a token it generates itself.
  • Message Integrity - Integrity is achieved by generating a password-based checksum over the message.
  • Privacy - Mechanisms are provided to set up encryption on the media streams. They must be used in conjunction with the H.245 protocol and use DES, Triple DES or RC2 - the use of SRTP isn't supported yet (in H.235v2).
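The idea of comparing a received token with a self-generated one can be illustrated with a keyed hash. This is a sketch only: HMAC-SHA1 over the raw message bytes is chosen here for illustration, the function names are hypothetical, and the actual token layout and algorithms are defined by H.235 and its security profiles.

```python
import hashlib
import hmac


def make_token(shared_secret: bytes, message: bytes) -> bytes:
    """Compute a password-based checksum (HMAC) over the message."""
    return hmac.new(shared_secret, message, hashlib.sha1).digest()


def verify_token(shared_secret: bytes, message: bytes, received: bytes) -> bool:
    """Recompute the token locally and compare it with the received one."""
    expected = make_token(shared_secret, message)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, received)
```

Because the token depends on both the secret and the message content, the same comparison covers authentication and message integrity: a wrong password or a modified message yields a different token.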

These mechanisms are grouped into so-called Security Profiles: the Baseline Security Profile provides authentication and message integrity, making it suitable for subscription-based environments, and the Voice Encryption Profile provides confidential end-to-end media channels.

Protocol Profiles

H.323 has its origin - as mentioned before - in the area of multimedia conferencing. This implies that a vast number of options are available which are not necessary for providing telephony services. The TIPHON project of the European Telecommunication Standards Institute (ETSI) has defined a Telephony Profile for H.323 that specifies which combination of options should be implemented.

Similarly, H.323 contains a security framework (H.235) that describes a collection of algorithms and protocol mechanisms but lacks - because of international political constraints - a precise specification of a mandatory baseline. This is accounted for by the ETSI TIPHON security profile: this specification fills in the gaps and provides the foundation for interoperable implementations.

In summary, it can be said that the H.323 family of standards provides a mature basis for commercial products in the field of IP telephony. While the details of the protocol are often dominated by their legacy from various earlier ITU protocols, there is an active effort to profile and simplify the protocol to reduce the complexity.

SIP

Purpose of SIP

SIP stands for Session Initiation Protocol. It is an application-layer control protocol which has been developed and designed within the IETF. The protocol has been designed with easy implementation, good scalability, and flexibility in mind.

The specification is available in the form of several RFCs, the most important of which is RFC3261, which contains the core protocol specification. The protocol is used for creating, modifying, and terminating sessions with one or more participants. By a session we understand a set of senders and receivers that communicate and the state kept in those senders and receivers during the communication. Examples of sessions include Internet telephone calls, distribution of multimedia, multimedia conferences, distributed computer games, etc.

SIP is not the only protocol that the communicating devices will need. It is not meant to be a general purpose protocol. The purpose of SIP is just to make the communication possible; the communication itself must be achieved by other means (and possibly another protocol). The two protocols most often used along with SIP are RTP and SDP. RTP is used to carry the real-time multimedia data (including audio, video, and text); it makes it possible to encode and split the data into packets and transport such packets over the Internet. SDP is used to describe and encode the capabilities of session participants. Such a description is then used to negotiate the characteristics of the session so that all the devices can participate (that includes, for example, negotiation of the codecs used to encode media so that all the participants will be able to decode it, negotiation of the transport protocol used, and so on).

SIP has been designed in conformance with the Internet model. It is an end-to-end oriented signalling protocol, which means that all the logic is stored in end devices (except routing of SIP messages). State is also stored in end devices only, so there is no single point of failure and networks designed this way scale well. The price that we have to pay for the distributed design and scalability is higher message overhead, caused by the messages being sent end-to-end.

It is worth mentioning that the end-to-end concept of SIP is a significant divergence from the regular PSTN (Public Switched Telephone Network), where all the state and logic is stored in the network and end devices (telephones) are very primitive. The aim of SIP is to provide the same functionality that traditional PSTNs have, but the end-to-end design makes SIP networks much more powerful and open to the implementation of new services that can hardly be implemented in traditional PSTNs.

SIP is based on the HTTP protocol, which in turn inherited the format of its message headers from RFC822. HTTP is probably the most successful and widely used protocol in the Internet, and SIP tries to combine the best of both. In fact, HTTP can be classified as a signalling protocol too, because user agents use the protocol to tell an HTTP server which documents they are interested in. SIP is used to carry the description of session parameters; the description is encoded into a document using SDP. The RFC822 header encoding shared by HTTP and SIP has proven to be robust and flexible over the years.

SIP URI

SIP entities are identified using SIP URIs (Uniform Resource Identifiers). A SIP URI has the form sip:username@domain, for instance, sip:joe@company.com. As we can see, a SIP URI consists of a username part and a domain name part delimited by the @ (at) character. SIP URIs are similar to e-mail addresses; it is, for instance, possible to use the same address for e-mail and SIP communication, and such URIs are easy to remember.
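The two-part structure of such a URI can be shown with a few lines of code. The sketch below handles only the plain sip:username@domain form described above; it is not a full SIP URI parser (ports, parameters and other schemes are deliberately ignored), and the function name `split_sip_uri` is hypothetical.

```python
def split_sip_uri(uri):
    """Split 'sip:username@domain' into (username, domain)."""
    scheme, colon, rest = uri.partition(":")
    if colon != ":" or scheme.lower() != "sip":
        raise ValueError("not a sip: URI: %r" % uri)
    username, at, domain = rest.partition("@")
    if at != "@" or not username or not domain:
        raise ValueError("expected sip:username@domain, got %r" % uri)
    return username, domain
```

For the example from the text, split_sip_uri("sip:joe@company.com") yields the username "joe" and the domain "company.com".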

SIP Network Elements

Although in the simplest configuration it is possible to use just two user agents that send SIP messages directly to each other, a typical SIP network will contain more than one type of SIP elements. Basic SIP elements are user agents, proxies, registrars, and redirect servers. We will briefly describe them in this section.

Note that the elements, as presented in this section, are often only logical entities. It is often advantageous to co-locate them, for instance, to increase the speed of processing, but that depends on the particular implementation and configuration.

User Agents

Internet end points that use SIP to find each other and to negotiate session characteristics are called user agents. User agents usually, but not necessarily, reside on a user's computer in the form of an application--this is currently the most widely used approach, but user agents can also be cellular phones, PSTN gateways, PDAs, automated IVR systems and so on.

User agents are often referred to as User Agent Server (UAS) and User Agent Client (UAC). UAS and UAC are logical entities only; each user agent contains a UAC and a UAS. The UAC is the part of the user agent that sends requests and receives responses. The UAS is the part of the user agent that receives requests and sends responses.

Because a user agent contains both a UAC and a UAS, we often say that a user agent behaves like a UAC or a UAS. For instance, the caller's user agent behaves like a UAC when it sends an INVITE request and receives responses to the request. The callee's user agent behaves like a UAS when it receives the INVITE and sends responses.

But this situation changes when the callee decides to send a BYE and terminate the session. In this case the callee's user agent (sending BYE) behaves like UAC and the caller's user agent behaves like UAS.

Figure 2.11. UAC and UAS

Picture showing UAC and UAS

Figure 2.11 shows three user agents and one stateful forking proxy. Each user agent contains UAC and UAS. The part of the proxy that receives the INVITE from the caller in fact acts as a UAS. When forwarding the request statefully the proxy creates two UACs, each of them is responsible for one branch.

In our example callee B picked up; later, when B wants to tear down the call, B's user agent sends a BYE. At this time the user agent that was previously a UAS becomes a UAC and vice versa.

Proxy Servers

In addition, SIP allows the creation of an infrastructure of network hosts called proxy servers. User agents can send messages to a proxy server. Proxy servers are very important entities in the SIP infrastructure. They perform routing of session invitations according to the invitee's current location, authentication, accounting and many other important functions.

The most important task of a proxy server is to route session invitations “closer” to callee. The session invitation will usually traverse a set of proxies until it finds one which knows the actual location of the callee. Such a proxy will forward the session invitation directly to the callee and the callee will then accept or decline the session invitation.

There are two basic types of SIP proxy servers--stateless and stateful.

Stateless Servers

Stateless servers are simple message forwarders. They forward messages independently of each other. Although messages are usually arranged into transactions (see the section “SIP Transactions”), stateless proxies do not take care of transactions.

Stateless proxies are simple and faster than stateful proxy servers. They can be used as simple load balancers, message translators and routers. One of the drawbacks of stateless proxies is that they are unable to absorb retransmissions of messages or to perform more advanced routing, for instance, forking or recursive traversal.

Stateful Servers

Stateful proxies are more complex. Upon reception of a request, stateful proxies create a state and keep the state until the transaction finishes. Some transactions, especially those created by INVITE, can last quite long (until callee picks up or declines the call). Because stateful proxies must maintain the state for the duration of the transactions, their performance is limited.

The ability to associate SIP messages into transactions gives stateful proxies some interesting features. Stateful proxies can perform forking, that means upon reception of a message two or more messages will be sent out.

Stateful proxies can absorb retransmissions because they know, from the transaction state, if they have already received the same message (stateless proxies cannot do the check because they keep no state).
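How transaction state lets a proxy absorb retransmissions can be sketched with a small class. This is a simplified illustration: the class name is hypothetical, and keying a transaction by the tuple (Call-ID, CSeq, Via branch) is an assumption for this example; the exact transaction matching rules are defined in RFC3261.

```python
class StatefulProxySketch:
    """Keeps per-transaction state so that retransmissions can be absorbed."""

    def __init__(self):
        self.transactions = {}  # transaction key -> state

    def receive_request(self, call_id, cseq, branch):
        key = (call_id, cseq, branch)
        if key in self.transactions:
            # Same transaction seen before: a retransmission, not forwarded.
            return "absorbed"
        self.transactions[key] = "proceeding"
        return "forwarded"

    def finish(self, call_id, cseq, branch):
        """Drop the state once the transaction has finished."""
        self.transactions.pop((call_id, cseq, branch), None)
```

A stateless proxy has no equivalent of the `transactions` table, so every copy of a request would simply be forwarded again.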

Stateful proxies can perform more complicated methods of finding a user. It is, for instance, possible to try to reach user's office phone and when he doesn't pick up then the call is redirected to his cell phone. Stateless proxies can't do this because they have no way of knowing how the transaction targeted to the office phone finished.

Most SIP proxies today are stateful because their configuration is usually very complex. They often perform accounting, forking, some sort of NAT traversal aid and all those features require a stateful proxy.

Proxy Server Usage

A typical configuration is that each centrally administered entity (a company, for instance) has its own SIP proxy server which is used by all user agents in the entity. Let's suppose that there are two companies A and B and each of them has its own proxy server. Figure 2.12 shows how a session invitation from employee Joe in company A will reach employee Bob in company B.

Figure 2.12. Session Invitation

Picture showing a session invitation message flow

User Joe uses the address sip:bob@b.com to call Bob. Joe's user agent doesn't know how to route the invitation itself, but it is configured to send all outbound traffic to the company SIP proxy server proxy.a.com. The proxy server figures out that user sip:bob@b.com is in a different company, so it will look up B's SIP proxy server and send the invitation there. B's proxy server can be either preconfigured at proxy.a.com, or the proxy will use DNS SRV records to find B's proxy server. The invitation reaches proxy.b.com. The proxy knows that Bob is currently sitting in his office and is reachable through the phone on his desk, which has IP address 1.2.3.4, so the proxy will send the invitation there.

Registrar

We mentioned that the SIP proxy at proxy.b.com knows Bob's current location, but haven't mentioned yet how a proxy can learn the current location of a user. Bob's user agent (SIP phone) must register with a registrar. The registrar is a special SIP entity that receives registrations from users, extracts information about their current location (IP address, port and username in this case) and stores the information into a location database. The purpose of the location database is to map sip:bob@b.com to something like sip:bob@1.2.3.4:5060. The location database is then used by B's proxy server. When the proxy receives an invitation for sip:bob@b.com it will search the location database. It finds sip:bob@1.2.3.4:5060 and will send the invitation there. A registrar is very often a logical entity only. Because of their tight coupling with proxies, registrars are usually co-located with proxy servers.

Figure 2.13 shows a typical SIP registration. A REGISTER message containing the Address of Record sip:jan@iptel.org and the contact address sip:jan@1.2.3.4:5060, where 1.2.3.4 is the IP address of the phone, is sent to the registrar. The registrar extracts this information and stores it into the location database. If everything went well, the registrar sends a 200 OK response to the phone and the process of registration is finished.

Figure 2.13. Registrar Overview

Picture showing a typical registrar

Each registration has a limited lifespan. The Expires header field or the expires parameter of the Contact header field determines how long the registration is valid. The user agent must refresh the registration within its lifespan; otherwise it will expire and the user will become unavailable.
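The interplay of registration, lookup and expiry can be sketched as a tiny in-memory location database. The class and method names are hypothetical, only one contact per Address of Record is stored (a simplification), and the current time is passed in explicitly so that the behaviour is easy to follow.

```python
class LocationDatabase:
    """Maps an Address of Record to a contact address with an expiry time."""

    def __init__(self):
        self._bindings = {}  # address of record -> (contact, expiry time)

    def register(self, aor, contact, expires, now):
        """Store or refresh a binding that is valid for `expires` seconds."""
        self._bindings[aor] = (contact, now + expires)

    def lookup(self, aor, now):
        """Return the current contact, or None if unknown or expired."""
        binding = self._bindings.get(aor)
        if binding is None:
            return None
        contact, expiry = binding
        if now >= expiry:
            del self._bindings[aor]  # registration was not refreshed in time
            return None
        return contact
```

A proxy co-located with the registrar would call `lookup` for the Request URI of an incoming invitation and forward the request to the returned contact.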

Redirect Server

The entity that receives a request and sends back a reply containing a list of the current locations of a particular user is called a redirect server. A redirect server receives requests and looks up the intended recipient of the request in the location database created by a registrar. It then creates a list of current locations of the user and sends it to the request originator in a response of the 3xx class.

The originator of the request then extracts the list of destinations and sends another request directly to them. Figure 2.14 shows a typical redirection.

Figure 2.14. SIP Redirection

Picture showing a redirection

SIP Messages

Communication using SIP (often called signalling) consists of a series of messages. Messages can be transported independently by the network. Usually each message is transported in a separate UDP datagram. Each message consists of a “first line”, a message header, and a message body. The first line identifies the type of the message. There are two types of messages--requests and responses. Requests are usually used to initiate some action or to inform the recipient of the request of something. Responses (replies) are used to confirm that a request was received and processed, and contain the status of the processing.

A typical SIP request looks like this:

INVITE sip:7170@iptel.org SIP/2.0
Via: SIP/2.0/UDP 195.37.77.100:5040;rport
Max-Forwards: 10
From: "jiri" <sip:jiri@iptel.org>;tag=76ff7a07-c091-4192-84a0-d56e91fe104f
To: <sip:jiri@bat.iptel.org>
Call-ID: d10815e0-bf17-4afa-8412-d9130a793d96@213.20.128.35
CSeq: 2 INVITE
Contact: <sip:213.20.128.35:9315>
User-Agent: Windows RTC/1.0
Proxy-Authorization: Digest username="jiri", realm="iptel.org", 
  algorithm="MD5", uri="sip:jiri@bat.iptel.org", 
  nonce="3cef753900000001771328f5ae1b8b7f0d742da1feb5753c", 
  response="53fe98db10e1074b03b3e06438bda70f"
Content-Type: application/sdp
Content-Length: 451

v=0
o=jku2 0 0 IN IP4 213.20.128.35
s=session
c=IN IP4 213.20.128.35
b=CT:1000
t=0 0
m=audio 54742 RTP/AVP 97 111 112 6 0 8 4 5 3 101
a=rtpmap:97 red/8000
a=rtpmap:111 SIREN/16000
a=fmtp:111 bitrate=16000
a=rtpmap:112 G7221/16000
a=fmtp:112 bitrate=24000
a=rtpmap:6 DVI4/16000
a=rtpmap:0 PCMU/8000
a=rtpmap:4 G723/8000
a=rtpmap:3 GSM/8000
a=rtpmap:101 telephone-event/8000
a=fmtp:101 0-16

The first line tells us that this is an INVITE message, which is used to establish a session. The URI on the first line--sip:7170@iptel.org--is called the Request URI and contains the URI of the next hop of the message. In this case it will be the host iptel.org.

A SIP request can contain one or more Via header fields, which are used to record the path of the request. They are later used to route SIP responses back along exactly the same path. The INVITE message contains just one Via header field, which was created by the user agent that sent the request. From the Via field we can tell that the user agent is running on host 195.37.77.100 and port 5040.

The From and To header fields identify the initiator (caller) and the recipient (callee) of the invitation (just like in SMTP, where they identify the sender and recipient of a message). The From header field contains a tag parameter which serves as a dialog identifier and will be described in the section “SIP Dialogs”.

The Call-ID header field is a dialog identifier and its purpose is to identify messages belonging to the same call. Such messages have the same Call-ID identifier. CSeq is used to maintain the order of requests. Because requests can be sent over an unreliable transport that can re-order messages, a sequence number must be present in the messages so that the recipient can identify retransmissions and out-of-order requests.

The Contact header field contains the IP address and port on which the sender is awaiting further requests sent by the callee. The other header fields are not important here and will not be described.

The message header is delimited from the message body by an empty line. The message body of the INVITE request contains a description of the media types accepted by the sender, encoded in SDP.
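The layout just described (first line, header fields, an empty line, then the body) can be made concrete with a short parsing sketch. It is not a complete SIP parser: the function name is hypothetical, lines are assumed to end in CRLF, and header field values are kept as plain strings. Continuation lines that start with whitespace, as in the Proxy-Authorization header above, are appended to the preceding header field.

```python
def parse_sip_message(text):
    """Split a SIP message into (first_line, headers, body).

    headers is a list of (name, value) pairs in the order of appearance.
    """
    head, _, body = text.partition("\r\n\r\n")  # empty line ends the header
    lines = head.split("\r\n")
    first_line = lines[0]
    headers = []
    for line in lines[1:]:
        if line[:1] in (" ", "\t") and headers:
            # Folded header: continuation of the previous header field.
            name, value = headers[-1]
            headers[-1] = (name, value + " " + line.strip())
        else:
            name, _, value = line.partition(":")
            headers.append((name.strip(), value.strip()))
    return first_line, headers, body
```

Applied to the INVITE shown above, the first element would be the request line, the header list would start with the Via field, and the body would hold the SDP description.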

SIP Requests

We have described what an INVITE request looks like and said that the request is used to invite a callee to a session. Other important requests are:

  • ACK--This message acknowledges receipt of a final response to INVITE. Session establishment uses a 3-way handshake due to the asymmetric nature of the invitation. It may take a while before the callee accepts or declines the call, so the callee's user agent periodically retransmits a positive final response until it receives an ACK (which indicates that the caller is still there and ready to communicate).
  • BYE--BYE messages are used to tear down multimedia sessions. A party wishing to tear down a session sends a BYE to the other party.
  • CANCEL--CANCEL is used to cancel a session that is not yet fully established. It is used when the callee hasn't replied with a final response yet but the caller wants to abort the call (typically when a callee doesn't respond for some time).
  • REGISTER--The purpose of the REGISTER request is to let the registrar know the user's current location. Information about the current IP address and port on which a user can be reached is carried in REGISTER messages. The registrar extracts this information and puts it into a location database. The database can later be used by SIP proxy servers to route calls to the user. Registrations are time-limited and need to be periodically refreshed.

The listed requests usually have no message body because it is not needed in most situations (but can have one). In addition to that many other request types have been defined but their description is out of the scope of this document.

SIP Responses

When a user agent or proxy server receives a request, it sends a reply. Each request must be answered, except ACK requests, which trigger no replies.

A typical reply looks like this:

SIP/2.0 200 OK
Via: SIP/2.0/UDP 192.168.1.30:5060;received=66.87.48.68
From: sip:sip2@iptel.org
To: sip:sip2@iptel.org;tag=794fe65c16edfdf45da4fc39a5d2867c.b713
Call-ID: 2443936363@192.168.1.30
CSeq: 63629 REGISTER
Contact: <sip:sip2@66.87.48.68:5060;transport=udp>;q=0.00;expires=120
Server: Sip EXpress router (0.8.11pre21xrc (i386/linux))
Content-Length: 0
Warning: 392 195.37.77.101:5060 "Noisy feedback tells:  
    pid=5110 req_src_ip=66.87.48.68 req_src_port=5060 in_uri=sip:iptel.org 
    out_uri=sip:iptel.org via_cnt==1"

As we can see, responses are very similar to requests, except for the first line. The first line of a response contains the protocol version (SIP/2.0), the reply code, and the reason phrase.

The reply code is an integer number from 100 to 699 and indicates type of the response. There are 6 classes of responses:

  • 1xx are provisional responses. A provisional response tells its recipient that the associated request was received but the result of the processing is not known yet. Provisional responses are sent only when the processing doesn't finish immediately. The sender must stop retransmitting the request upon reception of a provisional response.

    Typically proxy servers send responses with code 100 when they start processing an INVITE and user agents send responses with code 180 (Ringing) which means that the callee's phone is ringing.

  • 2xx responses are positive final responses. A final response is the ultimate response that the originator of the request will ever receive. Final responses therefore express the result of the processing of the associated request. Final responses also terminate transactions. Responses with codes from 200 to 299 are positive responses, which means that the request was processed successfully and accepted. For instance, a 200 OK response is sent when a user accepts an invitation to a session (INVITE request).

    A UAC may receive several 200 messages to a single INVITE request. This is because a forking proxy (described later) can fork the request so it will reach several UAS and each of them will accept the invitation. In this case each response is distinguished by the tag parameter in To header field. Each response represents a distinct dialog with unambiguous dialog identifier.

  • 3xx responses are used to redirect a caller. A redirection response gives information about the user's new location or an alternative service that the caller might use to satisfy the call. Redirection responses are usually sent by proxy servers. When a proxy receives a request and doesn't want or can't process it for any reason, it will send a redirection response to the caller and put another location into the response which the caller might want to try. It can be the location of another proxy or the current location of the callee (from the location database created by a registrar). The caller is then supposed to re-send the request to the new location. 3xx responses are final.
  • 4xx are negative final responses. A 4xx response means that the problem is on the sender's side. The request couldn't be processed because it contains bad syntax or cannot be fulfilled at that server.
  • 5xx means that the problem is on the server's side. The request is apparently valid but the server failed to fulfill it. Clients should usually retry the request later.
  • 6xx means that the request cannot be fulfilled at any server. This response is usually sent by a server that has definitive information about a particular user. User agents usually send a 603 Decline response when the user doesn't want to participate in the session.
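The response classes listed above can be derived from the first digit of the status code. The following is a minimal sketch of that mapping; the function names and the class labels are illustrative choices made for this example, not part of any SIP library.

```python
def response_class(status_code: int) -> str:
    """Map a SIP status code to the response class it belongs to."""
    if not 100 <= status_code <= 699:
        raise ValueError(f"not a SIP status code: {status_code}")
    classes = {
        1: "provisional",
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
        6: "global failure",
    }
    # The class is determined by the hundreds digit only.
    return classes[status_code // 100]


def is_final(status_code: int) -> bool:
    """Every response outside the 1xx class is a final response."""
    return response_class(status_code) != "provisional"
```

With these helpers, 180 is classified as provisional, while 200, 302, 486 and 603 are all reported as final.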

In addition to the response code, the first line also contains a reason phrase. The code number is intended to be processed by machines: it is not very human-friendly, but it is easy to parse. The reason phrase usually contains a human-readable message describing the result of the processing. A user agent should render the reason phrase to the user.

The request to which a particular response belongs is identified using the CSeq header field. In addition to the sequence number, this header field also contains the method of the corresponding request. In our example it was a REGISTER request.
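As a rough illustration of the two elements just described, the sketch below splits a status line into version, code and reason phrase, and a CSeq value into sequence number and method. The function names and sample values are hypothetical, and the parsing deliberately ignores malformed input beyond what `split` and `int` reject on their own.

```python
def parse_status_line(line: str) -> tuple[str, int, str]:
    """Split 'SIP/2.0 200 OK' into (version, status code, reason phrase)."""
    # The reason phrase may itself contain spaces, so split at most twice.
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason


def parse_cseq(value: str) -> tuple[int, str]:
    """Split a CSeq header value such as '1 REGISTER' into (number, method)."""
    number, method = value.split()
    return int(number), method
```

The numeric code is what the software acts on; the reason phrase is kept only so that it can be shown to the user.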

SIP Transactions

Although we said that SIP messages are sent independently over the network, they are usually arranged into transactions by user agents and certain types of proxy servers. Therefore SIP is said to be a transactional protocol.

A transaction is a sequence of SIP messages exchanged between SIP network elements. A transaction consists of one request and all responses to that request. That includes zero or more provisional responses and one or more final responses (remember that an INVITE might be answered by more than one final response when a proxy server forks the request).

If a transaction was initiated by an INVITE request then the same transaction also includes ACK, but only if the final response was not a 2xx response. If the final response was a 2xx response then the ACK is not considered part of the transaction.

As we can see, this behavior is quite asymmetric: ACK is part of transactions with a negative final response but is not part of transactions with a positive final response. The reason for this separation is the importance of delivering all 200 OK messages. Not only do they establish a session, but a 200 OK can also be generated by multiple entities when a proxy server forks the request, and all of them must be delivered to the calling user agent. Therefore user agents take responsibility in this case and retransmit 200 OK responses until they receive an ACK. Also note that only responses to INVITE are retransmitted!

SIP entities that have a notion of transactions are called stateful. Such entities usually create state associated with a transaction and keep it in memory for the duration of the transaction. When a request or response arrives, a stateful entity tries to associate it with an existing transaction. To do so, it must extract a unique transaction identifier from the message and compare it to the identifiers of all existing transactions. If such a transaction exists, its state is updated from the message.

In the previous SIP specification, RFC 2543, the transaction identifier was calculated as a hash of all important message header fields (including To, From, Request-URI and CSeq). This proved to be slow and complex, and during interoperability tests such transaction identifiers were a common source of problems.

In the new RFC 3261 the way of calculating transaction identifiers was completely changed. Instead of complicated hashing of important header fields, a SIP message now includes the identifier directly: the branch parameter of the Via header field contains the transaction identifier. This is a significant simplification, but old implementations that do not support the new method still exist, so even new implementations have to support the old way in order to remain backwards compatible.
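A minimal sketch of branch-based transaction matching might look as follows. It assumes the prefix "z9hG4bK", which RFC 3261 requires at the start of branch values generated by compliant implementations; the function names and the sample Via values are hypothetical, and a real server-side match would also compare the sent-by part of the Via, which is omitted here for brevity.

```python
MAGIC_COOKIE = "z9hG4bK"  # prefix RFC 3261 requires on branch values


def via_branch(via_value: str):
    """Return the branch parameter of a Via header value, or None."""
    for param in via_value.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "branch":
            return value
    return None


def transaction_key(top_via: str, cseq_method: str):
    """Build a lookup key for the transaction table from the topmost Via."""
    branch = via_branch(top_via)
    if branch is None or not branch.startswith(MAGIC_COOKIE):
        # Peer follows RFC 2543: caller must fall back to header-based matching.
        return None
    # An ACK for a negative final response belongs to the INVITE transaction.
    method = "INVITE" if cseq_method == "ACK" else cseq_method
    return (branch, method)
```

A stateful element can use the returned tuple as a dictionary key; a `None` result signals that the older, header-based procedure is needed for that message.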

Figure 2.15 shows which messages belong to which transactions during a conversation between two user agents.

Figure 2.15. SIP Transactions

Message flow showing messages belonging to the same transaction.

SIP Dialogs

We have shown what transactions are: one transaction includes INVITE and its responses, and another transaction includes BYE and its responses when a session is being torn down. But we feel that those two transactions should somehow be related: both of them belong to the same dialog. A dialog represents a peer-to-peer SIP relationship between two user agents. A dialog persists for some time and is a very important concept for user agents. Dialogs facilitate proper sequencing and routing of messages between SIP endpoints.

Dialogs are identified using Call-ID, From tag, and To tag. Messages in which these three identifiers are the same belong to the same dialog. We have shown that the CSeq header field is used to order messages; in fact, it is used to order messages within a dialog. The number must be monotonically increased for each message sent within a dialog, otherwise the peer will handle it as an out-of-order request or a retransmission. In fact, the CSeq number identifies a transaction within a dialog, because we have said that a request and its associated responses are called a transaction. This means that only one transaction in each direction can be active within a dialog. One could also say that a dialog is a sequence of transactions. Figure 2.16 extends Figure 2.15 to show which messages belong to the same dialog.

Figure 2.16. SIP Dialog

Message flow showing transactions belonging to the same dialog.
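The CSeq ordering rule for requests within a dialog can be sketched as below. This is a simplified illustration under stated assumptions: the class and method names are invented for the example, the sample identifiers are made up, and special cases such as ACK and CANCEL (which reuse the CSeq number of the INVITE they refer to) are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DialogState:
    call_id: str
    local_tag: str
    remote_tag: str
    remote_cseq: Optional[int] = None  # highest CSeq number seen from the peer

    def classify_request(self, cseq: int) -> str:
        """Decide how an in-dialog request from the peer should be treated."""
        if self.remote_cseq is None or cseq > self.remote_cseq:
            self.remote_cseq = cseq  # accept and remember the new number
            return "new"
        if cseq == self.remote_cseq:
            return "retransmission"
        return "out of order"
```

For example, after requests numbered 1 and 3 have been accepted, a late request numbered 2 is classified as out of order.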

Some messages establish a dialog and some do not. This makes it possible to express the relationship between messages explicitly, and also to send messages outside a dialog when they are not related to other messages. The latter is easier to implement because user agents do not have to keep dialog state.

For instance, an INVITE message establishes a dialog, because it will later be followed by a BYE request that tears down the session established by the INVITE. This BYE is sent within the dialog established by the INVITE.

But if a user agent sends a MESSAGE request, such a request doesn't establish any dialog. Any subsequent messages (even MESSAGE) will be sent independently of the previous one.

Dialogs Facilitate Routing

We have said that dialogs are also used to route messages between user agents; let us describe this in a little more detail.

Let's suppose that user sip:bob@a.com wants to talk to user sip:pete@b.com. He knows the SIP address of the callee (sip:pete@b.com), but this address says nothing about the current location of the user, i.e. the caller doesn't know to which host to send the request. Therefore the INVITE request will be sent to a proxy server.

The request will be sent from proxy to proxy until it reaches one that knows the current location of the callee. This process is called routing. Once the request reaches the callee, the callee's user agent will create a response that will be sent back to the caller. The callee's user agent will also put a Contact header field into the response, containing the current location of the user. The original request also contained a Contact header field, which means that both user agents know the current location of the peer.

Because the user agents know each other's location, it is not necessary to send further requests to any proxy: they can be sent directly from user agent to user agent. That's exactly how dialogs facilitate routing.

Further messages within a dialog are sent directly from user agent to user agent. This is a significant performance improvement because proxies do not see all the messages within a dialog; they are used to route just the first request that establishes the dialog. The direct messages are also delivered with much smaller latency because a typical proxy usually implements complex routing logic. Figure 2.17 contains an example of a message within a dialog (BYE) that bypasses the proxies.

Figure 2.17. SIP Trapezoid

Message flow showing SIP trapezoid.

Dialog Identifiers

We have already shown that dialog identifiers consist of three parts, Call-ID, From tag, and To tag, but it is not yet clear why dialog identifiers are constructed in exactly this way and who contributes which part.

Call-ID is the so-called call identifier. It must be a unique string that identifies a call. A call consists of one or more dialogs. Multiple user agents may respond to a request when a proxy along the path forks the request. Each user agent that sends a 2xx establishes a separate dialog with the caller. All such dialogs are part of the same call and have the same Call-ID.

The From tag is generated by the caller and uniquely identifies the dialog in the caller's user agent.

The To tag is generated by a callee and, just like the From tag, uniquely identifies the dialog in the callee's user agent.

This hierarchical dialog identifier is necessary because a single call invitation can create several dialogs and the caller must be able to distinguish them.
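To make the role of the three parts concrete, the sketch below builds a dialog identifier from header values and shows how two answers to one forked INVITE yield two dialogs within the same call. The helper names, tag values and Call-ID are hypothetical, and the header parsing is deliberately naive (it simply scans the semicolon-separated parameters for one named tag).

```python
def tag_of(header_value: str):
    """Extract the tag parameter from a From or To header value, or None."""
    for param in header_value.split(";")[1:]:
        name, _, value = param.strip().partition("=")
        if name.lower() == "tag":
            return value
    return None


def dialog_id(call_id: str, from_value: str, to_value: str):
    """Dialog identifier as seen by the caller: (Call-ID, From tag, To tag)."""
    return (call_id, tag_of(from_value), tag_of(to_value))


# Two callees answer the same forked INVITE with different To tags.
call_id = "a84b4c76e66710@a.com"
caller = "<sip:bob@a.com>;tag=1928301774"
first = dialog_id(call_id, caller, "<sip:pete@b.com>;tag=a6c85cf")
second = dialog_id(call_id, caller, "<sip:pete@b.com>;tag=314159")
```

Here `first` and `second` share the Call-ID and From tag but differ in the To tag, so the caller can keep the two dialogs apart.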

Typical SIP Scenarios

This section gives a brief overview of typical SIP scenarios that usually make up the SIP traffic.

Registration

Users must register themselves with a registrar to be reachable by other users. A registration comprises a REGISTER message followed by a 200 OK sent by the registrar if the registration was successful. Registrations are usually authenticated, so a 401 or 407 reply can appear if the user didn't provide valid credentials. Figure 2.18 shows an example of registration.

Figure 2.18. REGISTER Message Flow

Message flow of a registration.

Session Invitation

A session invitation consists of one INVITE request, which is usually sent to a proxy. The proxy immediately sends a 100 Trying reply to stop retransmissions and forwards the request further.

All provisional responses generated by the callee are sent back to the caller. See the 180 Ringing response in the call flow; it is generated when the callee's phone starts ringing.

Figure 2.19. INVITE Message Flow

Picture showing a session invitation.

A 200 OK is generated once the callee picks up the phone and it is retransmitted by the callee's user agent until it receives an ACK from the caller. The session is established at this point.

Session Termination

Session termination is accomplished by sending a BYE request within the dialog established by the INVITE. BYE messages are sent directly from one user agent to the other unless a proxy on the path of the INVITE request indicated that it wishes to stay on the path by using record routing (see the section “Record Routing”).

The party wishing to tear down a session sends a BYE request to the other party involved in the session. The other party sends a 200 OK response to confirm the BYE and the session is terminated. See Figure 2.20, left message flow.

Record Routing

All requests sent within a dialog are by default sent directly from one user agent to the other. Only requests outside a dialog traverse SIP proxies. This approach makes a SIP network more scalable because only a small number of SIP messages hit the proxies.

There are certain situations in which a SIP proxy needs to stay on the path of all further messages. For instance, proxies controlling a NAT box or proxies doing accounting need to stay on the path of BYE requests.

The mechanism by which a proxy can inform user agents that it wishes to stay on the path of all further messages is called record routing. Such a proxy inserts a Record-Route header field, containing the address of the proxy, into SIP messages. Messages sent within a dialog will then traverse all SIP proxies that put a Record-Route header field into the message.

The recipient of the request receives a set of Record-Route header fields in the message. It must mirror all the Record-Route header fields into responses because the originator of the request also needs to know the set of proxies.

Figure 2.20. BYE Message Flow (With and without Record Routing)

Picture showing BYE message flow with and without record routing.

The left message flow of Figure 2.20 shows how a BYE (a request within the dialog established by the INVITE) is sent directly to the other user agent when there is no Record-Route header field in the message. The right message flow shows how the situation changes when the proxy puts a Record-Route header field into the message.
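How each user agent turns the recorded proxies into a route for later in-dialog requests can be sketched as follows. The ordering rule used here (the callee keeps the Record-Route entries in the order received, the caller reverses them) is taken from RFC 3261; the function name and the proxy addresses are hypothetical.

```python
def route_set(record_route: list[str], is_caller: bool) -> list[str]:
    """Build the route set a user agent stores for a dialog.

    record_route holds the Record-Route values in the order in which they
    appear in the message (the proxy nearest the callee comes first,
    because each proxy adds its own entry on top).
    """
    if is_caller:
        # The caller learns the list from the response and must reverse it,
        # so that its nearest proxy is visited first.
        return list(reversed(record_route))
    # The callee uses the list from the request as is.
    return list(record_route)


# The INVITE passed p1.a.com first and p2.b.com second.
recorded = ["<sip:p2.b.com;lr>", "<sip:p1.a.com;lr>"]
```

A BYE sent by the caller would then visit p1.a.com before p2.b.com, while a BYE sent by the callee would visit them in the opposite order.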

Event Subscription And Notification

The SIP specification has been extended to support a general mechanism allowing subscription to asynchronous events. Such events can include SIP proxy statistics changes, presence information, session changes and so on.

The mechanism is used mainly to convey information on the presence (willingness to communicate) of users. Figure 2.21 shows the basic message flow.

Figure 2.21. Event Subscription And Notification

Picture showing subscription and notification.

A user agent interested in event notification sends a SUBSCRIBE message to a SIP server. The SUBSCRIBE message establishes a dialog and is immediately answered by the server with a 200 OK response. At this point the dialog is established. The server sends a NOTIFY request to the user every time the event to which the user subscribed changes. NOTIFY messages are sent within the dialog established by the SUBSCRIBE.

Note that the first NOTIFY message in Figure 2.21 is sent regardless of any event that triggers notifications.

Subscriptions, like registrations, have a limited lifespan and therefore must be periodically refreshed.

Instant Messages

Instant messages are sent using the MESSAGE request. MESSAGE requests do not establish a dialog and therefore they will always traverse the same set of proxies. This is the simplest form of sending instant messages. The text of the instant message is transported in the body of the SIP request.

Figure 2.22. Instant Messages

Picture showing a MESSAGE.
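As an illustration of how the text travels in the request body, the sketch below assembles a minimal MESSAGE request. It is an assumption-laden example rather than a complete implementation: the function name, the host name passed as `sent_by` and the randomly generated tags and identifiers are all invented for illustration, and only a basic set of header fields is emitted.

```python
import uuid


def build_message(from_uri: str, to_uri: str, text: str, sent_by: str) -> bytes:
    """Assemble a stand-alone SIP MESSAGE request carrying plain text."""
    body = text.encode("utf-8")
    headers = [
        f"MESSAGE {to_uri} SIP/2.0",
        f"Via: SIP/2.0/UDP {sent_by};branch=z9hG4bK{uuid.uuid4().hex}",
        "Max-Forwards: 70",
        f"From: <{from_uri}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <{to_uri}>",  # no To tag: the request is sent outside any dialog
        f"Call-ID: {uuid.uuid4().hex}@{sent_by}",
        "CSeq: 1 MESSAGE",
        "Content-Type: text/plain",
        f"Content-Length: {len(body)}",  # length of the body in bytes
    ]
    # Header lines end with CRLF; an empty line separates headers and body.
    return "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n" + body


raw = build_message("sip:bob@a.com", "sip:pete@b.com", "Hello", "pc.a.com")
```

Because every call generates a fresh Call-ID and From tag, two consecutive messages built this way are unrelated to each other, which matches the dialog-less behavior described above.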

Media Gateway Control Protocols

In a traditional telephone network, the infrastructure consists of large telephone switches which interconnect with each other to create the backbone network and which also connect to customer premise equipment (PBXs, telephones). While the internal network today is based upon digital communication, links to customers may be either analog (PSTN) or digital (ISDN). The links to customers are shared between call signaling (for dialing, invocation of supplementary services, etc.) and carriage of voice/data; in the backbone, dedicated (virtual) links interconnecting switches are reserved for call signaling (de-facto creating a dedicated network of its own) whereas voice/data traffic is carried on separate links. The Signaling System No. 7 (SS7) or variants of it are used as the call signaling protocol between switches; this protocol is used to route voice/data channels across the backbone network by instructing each switch on the way which incoming "line" is to be forwarded to which outgoing "line" and which other processing (such as simple voice compression, in-band signaling detection to customer premise equipment, etc.) is to be applied. Voice/data channels themselves are plain bit pipes identified by roughly a trunk and line identifier at each switch.

Figure 2.23. Application Scenario for Media Gateway Control Protocols

Application Scenario for Media Gateway Control Protocols

A similar construction is now considered by a number of telcos for IP-based backbone networks that may successively replace parts of their overall switched network infrastructure, as depicted in Figure 2.23. Instead of voice switches, IP routers are used to build up a backbone network that employs IP routing, possibly MPLS, and, most likely, some explicit form of QoS support to carry voice and data packets from any point in the network to any other. In contrast to voice switches, this does not require explicit configuration of the individual routers per voice connection; rather, only the entry and exit points need to be configured with each other's addresses so that they know where to send their voice/data packets. Two types of gateways are used at the edges of the IP network to connect to the conventional telephone network: signaling gateways to convert SS7 signaling into IP-based call control (which may make use of H.323 or SIP or simply provide a transport to carry SS7 signaling in IP packets [SIGTRAN]) and media gateways that perform voice transcoding. Some central entity (actually, probably a number of co-operating entities) forms the intelligent core of the backbone: the Media Gateway Controller(s). They interpret call signaling and decide how to route calls, provide supplementary services, etc. Having decided on how a call is to be established, they inform the (largely passive and "dumb") media gateways at the edges (ingress and egress gateways) how and where to transmit the voice packets. The Media Gateway Controllers also reconfigure the gateways in case of any changes in the call, invocation of supplementary services, etc. The media gateways may be capable of detecting invocation of control features in the media channel (e.g. through DTMF tones) and notifying the Media Gateway Controller(s), which then initiate the appropriate actions.

A number of protocols have been defined for communication between Media Gateway Controllers and media gateways: initial versions were developed by multiple camps, some of which merged to create the Media Gateway Control Protocol (MGCP), the only one of the proprietary protocols that is documented as an Informational RFC (RFC 2705). An effort was launched to make the two remaining camps cooperate and develop a single protocol to be standardized which resulted in work groups in the ITU-T (rooted in Study Group 16, Q.14) and in the IETF (Media Gateway Control, MEGACO WG). The protocol being jointly developed is referred to as H.248 in the ITU-T and as MEGACOP in the IETF.

It turned out that due to largely political issues between the two camps and "religious" differences between the IETF and the ITU-T (use of text vs. binary encoding) the work progressed slowly with much time being spent discussing procedural aspects and sorting out minor technical work. Nevertheless, progress was made over time and improvements have been achieved over the individual input protocols, and it is expected that this work will come to closure some time during the year 2000. In the meantime, various stages of the proprietary protocols are being deployed, with vendors probably providing migration paths to the finally standardized H.248/MEGACOP.

One particular protocol extension currently discussed in the IETF is the definition of a protocol for communication with an IP telephone at the customer premises that fits seamlessly with the Media Gateway Control architecture. Such a telephone would be a rather simple entity essentially capable of transmitting and receiving events and reacting to them while the call services are provided directly by the network infrastructure.

Proprietary Signaling Protocols

Today nearly every vendor that offers VoIP products uses its own VoIP protocol, e.g. Cisco's Skinny or Siemens's CorNet. These protocols were invented by the vendors to provide more, or more specific, supplementary services in the Voice over IP world, giving customers all the features they already know from their classic PBX. Enterprise solutions usually feature such a proprietary protocol and provide simple support for a standardized protocol (until now usually H.323) with only basic call functionality.

Giving detailed information about those protocols is out of the scope of this document, and it would usually be difficult to provide because most of these protocols are not publicly available.

Real-time Transport Protocol (RTP) and Real-time Transport Control Protocol (RTCP)

RTP and RTCP are the transport protocols used for IP telephony. Both were defined in RFC 1889: the former as a protocol to carry data that has real-time properties, the latter to monitor the quality of service and to convey information about the participants in an ongoing session. The services provided by RTP are:

  • identification of the carried information (audio and video codecs);
  • checking in-order delivery of packets and, if necessary, re-ordering out-of-sequence blocks;
  • transport of the coder/decoder synchronization information;
  • monitoring of the information delivery.

RTP uses the underlying User Datagram Protocol (UDP) to manage multiple connections between two entities and to check data integrity (checksum). An important point to stress is that RTP neither provides any means of guaranteeing QoS nor assumes that the underlying network delivers packets in order.

RTCP uses the same underlying protocols as RTP to periodically send control packets to all session participants. Every RTP channel using port number N has its own RTCP channel with port number N+1. The services provided by RTCP are:

  • giving feedback on the quality of the data distribution, which is used to keep control of the active codecs;
  • transporting a constant identifier for the RTP source (CNAME), used by the receiver to link an SSRC identifier to its source and to synchronize audio and video data;
  • advertising the number of session participants, which is used to adjust the RTCP transmission rate;
  • carrying session control information, used to identify the session participants.

In the next two subsections we describe the RTP and RTCP headers and the different types of packets these two protocols use.

RTP Header

Figure 2.24 shows the RTP header. The first 12 bytes are present in every RTP packet; the final bytes, containing the list of CSRC (Contributing SouRCe) identifiers, are present only when the packet has crossed a mixer (by mixer we mean a system that receives two or more RTP flows, combines them and forwards the resulting flow).

Figure 2.24. RTP Header

RTP Header

The header fields are detailed here:

  • version (V - 2 bits), contains the RTP protocol version;
  • padding (P - 1 bit), if set to 1 then the packet contains one or more additional padding bytes after the data field;
  • extension (X - 1 bit), if set to 1 then the header is followed by an extension;
  • CSRC count (CC - 4 bits), contains the number of CSRC identifiers that follow the fixed header;
  • marker (M - 1 bit), a field available to the application;
  • payload type (PT - 7 bits), identifies the data field format of the RTP packet and determines its interpretation by the application;
  • sequence number (16 bits), a value incremented by one for each RTP packet sent, used by the receiver to detect losses and to restore the right sequence;
  • RTP timestamp (32 bits), the sampling time of the first byte of the RTP data, used for synchronization and jitter calculation;
  • SSRC ID (32 bits), identifies the synchronization source, chosen randomly within an RTP session;
  • CSRC ID list (from 0 to 15*32 bits), an optional field identifying the sources that contribute to the data in the packet; the number of CSRC IDs is given in the CSRC count field.
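The field layout listed above can be decoded with a few bit operations. The following is a minimal sketch of such a parser using Python's standard `struct` module; the type and function names are chosen for this example only, and header extensions and padding are reported but not processed.

```python
import struct
from typing import NamedTuple


class RtpHeader(NamedTuple):
    version: int
    padding: bool
    extension: bool
    csrc_count: int
    marker: bool
    payload_type: int
    sequence_number: int
    timestamp: int
    ssrc: int
    csrcs: tuple


def parse_rtp_header(packet: bytes) -> RtpHeader:
    """Decode the 12-byte fixed RTP header and the optional CSRC list."""
    if len(packet) < 12:
        raise ValueError("RTP packet shorter than the 12-byte fixed header")
    # Network byte order: two single bytes, 16-bit sequence, two 32-bit words.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", packet[:12])
    cc = b0 & 0x0F  # low four bits of the first byte
    if len(packet) < 12 + 4 * cc:
        raise ValueError("truncated CSRC list")
    csrcs = struct.unpack(f"!{cc}I", packet[12:12 + 4 * cc])
    return RtpHeader(
        version=b0 >> 6,
        padding=bool(b0 & 0x20),
        extension=bool(b0 & 0x10),
        csrc_count=cc,
        marker=bool(b1 & 0x80),
        payload_type=b1 & 0x7F,
        sequence_number=seq,
        timestamp=ts,
        ssrc=ssrc,
        csrcs=csrcs,
    )
```

The payload starts at offset 12 + 4 * csrc_count, provided the extension bit is not set.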

RTCP packet types and format

In order to transport the session control information, RTCP defines a number of packet types:

  • SR, Sender Report, carries information from active transmitters, telling the other participants what they should have received (number of bytes, number of packets, etc.);
  • RR, Receiver Report, carries the statistics of the session participants that are not active transmitters;
  • SDES, Source DEScription, carries the source description items (including the CNAME identifier);
  • BYE, notifies the intention of leaving the session;
  • APP, carries application-specific functions, intended for experimental use by new applications.

Every RTCP packet begins with a fixed part similar to that of RTP packets, followed by structured elements of variable length. More than one RTCP packet may be concatenated to build a compound packet. Moreover, in order to maximize the resolution of the statistics, the SR and RR packet types are sent more often than the other packet types.
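Splitting a compound packet into its individual RTCP packets can be sketched as below. The sketch relies on details taken from the RTP specification rather than from the text above: the packet type numbers 200 to 204, version value 2, and a 16-bit length field that counts 32-bit words minus one. The function name is hypothetical.

```python
import struct

# Packet type numbers as assigned in the RTP specification.
RTCP_TYPES = {200: "SR", 201: "RR", 202: "SDES", 203: "BYE", 204: "APP"}


def split_compound(data: bytes):
    """Return (type name, count field, raw bytes) for each RTCP packet."""
    packets = []
    offset = 0
    while offset < len(data):
        if offset + 4 > len(data):
            raise ValueError("truncated RTCP header")
        b0, pt, length_words = struct.unpack("!BBH", data[offset:offset + 4])
        size = 4 * (length_words + 1)  # length field excludes the first word
        if b0 >> 6 != 2 or offset + size > len(data):
            raise ValueError("malformed RTCP packet")
        name = RTCP_TYPES.get(pt, str(pt))
        packets.append((name, b0 & 0x1F, data[offset:offset + size]))
        offset += size
    return packets
```

The five-bit count field returned in the second position has a type-specific meaning, for example the number of report blocks in SR and RR packets.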