Boutique Tech Conference · 4. – 6. June in Rostock (Germany)

OpenSIPS - clustering and balancing media servers

in English by Bogdan Iancu of OpenSIPS project at AMOOCON 2009

Abstract

Starting with version 1.5.0, OpenSIPS has the ability to perform real load balancing between heterogeneous peers.
Peers can provide different sets of resources (like voicemail, transcoding, gatewaying, conferencing, etc).

For a call requiring a set of resources, OpenSIPS can determine which peer can complete the call while keeping the load level across the system.

The load-balancing mechanism is dynamic and can be tuned at runtime; peers can also provide feedback to the load balancer regarding their current load or changes in capacity.

Transcript

Bogdan Iancu: The topic for today is how to control a cluster, a bunch of servers, particularly media servers, with OpenSIPS. More or less, how to do the balancing of servers. I will start the presentation by telling a very short and, I think, nice story that starts, like all stories, with “once upon a time.”

There was a switch and a PBX and, to give names, it was about Diana from Yate. In 90% of installations we have both a switch and a PBX, so they do not exclude each other; they are complementary things. So the idea was how to make this integration easier, because everybody wants it.

I think the first time was in 2003 when I heard the question: how do you make OpenSIPS work with Asterisk? And that was the starting point where we could do all kinds of things. So there were different solutions for this and I think right now we got more or less to a point where it works and tried to come up with some real solutions. But more or less for these source targets we needed to make the installations in an easier way.

Now, more technically, everything started more or less last year when we said, “OK, let’s put everything on paper and see what we need to develop on both sides to make it work, so you can just simply put it together and it easily works.” We learned that if the integration effort is too high, it is easier to give up and try something else. It’s hard.

From our point of view, from an Open Source point of view, the result of this “inter-project” cooperation was the development of a new functionality, what we call right now the “load balancer”. The main purpose of this load balancer is to control a cluster or a bunch of servers that are not necessarily identical peers. So we are talking about a heterogeneous media cluster with different kinds of servers, different kinds of functionality, different properties and so on. So it would be able to control such a cluster.

And because there is a kind of relationship between the load balancer and the cluster it’s balancing, it’s bi-directional communication. We also have feedback the other way around, from the cluster to the load balancer, informing it about changes that affect the balancing rules, for example if some server is getting overloaded for whatever reason. Maybe there is a failure and you need to readjust the balancing rules. So the cluster is able at all times to push information back to the load balancer in order to adjust the rules of the system.

The most important thing about load balancing: I stress the words “load balancing” because for sure you have heard that you could do a kind of load balancing with OpenSIPS for a long time. Actually, that was not load balancing, it was simple dispatching. You have boxes and you have a bunch of calls: you send one call to one box, the next call to the other one and so on. So it’s just a distribution of calls without any knowledge of the actual load on the machine. You have no idea how many ongoing calls are on the peer. You just simply send one call to a peer, the next call to the next peer and so on.
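The simple dispatching described here can be sketched in a few lines of Python. This is an illustration only, not OpenSIPS code: the peer addresses and the `make_dispatcher` helper are hypothetical. The point to notice is that the rotation never looks at how many calls a peer is already carrying.

```python
from itertools import cycle

def make_dispatcher(peers):
    """Return a function that hands out peers in strict rotation."""
    rotation = cycle(peers)

    def next_peer():
        # No load information is consulted: the choice depends only
        # on whose turn it is.
        return next(rotation)

    return next_peer

# Hypothetical peer addresses, for illustration only.
next_peer = make_dispatcher(["sip:10.0.0.1", "sip:10.0.0.2", "sip:10.0.0.3"])
picks = [next_peer() for _ in range(4)]
print(picks)
```

The fourth call lands on the first peer again, even if that peer is still busy with long calls while the others are idle.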

Load balancing actually means, because the word “load” is the key here, doing the routing based on the current load of each peer. So this is what the new module brings in with it. The interesting properties of the module are that, first of all, it’s able to monitor the load. By load I mean how many ongoing calls there are for each peer. So, OK, for this peer I have four ongoing calls and for the other one I have some. Depending on this information it can route the calls.
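In contrast to plain dispatching, a load-aware choice needs a per-peer count of ongoing calls. A minimal sketch, assuming that count is already available in a dictionary; the addresses, the numbers and the `route_call` helper are made up for illustration and ignore resources and capacities, which the talk introduces later.

```python
# Ongoing calls per peer (hypothetical values).
ongoing = {"sip:10.0.0.1": 4, "sip:10.0.0.2": 1, "sip:10.0.0.3": 7}

def route_call(ongoing_calls):
    """Pick the peer with the fewest ongoing calls and account for the new call."""
    peer = min(ongoing_calls, key=ongoing_calls.get)
    ongoing_calls[peer] += 1
    return peer

chosen = route_call(ongoing)
print(chosen, ongoing[chosen])
```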

And also a nice addition was the addition of resources, which you’re going to see later. What it means exactly: the peers are not alike, so they are not necessarily peers with the same properties. Each peer may provide different kinds of functionality. For example, if the peers are media servers, you can have transcoding, voicemail, conferencing and probably any kind of combination; a peer may offer more than one single functionality. So OpenSIPS can deal with such clusters.

This is quite interesting because we mentioned dialogues, and originally OpenSIPS is a proxy, and a proxy and dialogue state don’t make a good family. In the latest versions we did quite a lot of work on adding a dialogue level in OpenSIPS. So right now there is a dialogue state; you can trace dialogues, you can store dialogue-related information. The simple presence of this dialogue support made the development of the load balancing module possible.

And, as with any self-respecting load balancer, there is support for failover, so you can detect the failure of a peer at the signaling level and re-route according to what routes you have. Also very important: because of the flexibility of the OpenSIPS script, the load balancing functionality can be quite easily combined with additional routing logic. For example, you can have separate groups, each being a cluster you want to balance. You may detect what kinds of resources are required by a call, so you can say exactly what you want from the cluster and so on. We’re going to see some examples later.

Before talking about the implementation details, just to be clear to everybody what we had in mind when we developed this module, I will briefly present some of what we consider standard clustering scenarios, starting with inbound calls. A very simple scenario: there are a lot of businesses, a lot of providers, starting at small sizes. They have an Asterisk or a Yate, for example. They reach 1,000 customers and a large number of calls, and a single box cannot scale because it cannot handle such large amounts of traffic.

So the next step for them is to try to add more boxes, identical boxes, and somehow to balance the load between these boxes. So a simple scenario: you put a balancer in front of what you had before, multiplied, and you start sending calls to these PBXs.

Also with inbound, a typical scenario is a call center with operators, where you use the load balancer to always send the call to an available, free operator. You may also have different types of operators: a standard one, or maybe somebody dedicated to business customers, or some supervisor or whatever. This brings back again that the peers are not necessarily identical, which complicates the load balancing rules, the logic, a bit. Because of these scenarios there was a need to introduce what we call resources: what kind of thing you need for the call to be completed.

This is another example. If you are a PSTN provider, you need something in front of all your gateways, not only from a security point of view but also to balance all the traffic across the gateways accordingly.

A load balancer can also be used for outbound calls, so it’s not only for inbound. A simple example: you have a service running on your SIP proxy server and you have to deliver the calls for several kinds of services to a bunch of machines providing media applications. So you have here servers for announcements and several media-related services. The proxy will simply detect, “Oh, the guy called for some media service.” The call gets sent to the load balancer and the load balancer will decide which server to relay to in order to complete the call.

So in this case it’s very easy to expand this and also to achieve high availability, because if one server fails, the load balancer will automatically route the call to an available one and so on. In this instance it is absolutely necessary because if it [inaudible] we can actually combine these two together in a single instance. It’s justified to use it only in some cases.

And similar to the media servers balancing, you can consider outbound calls going to PSTN to be balanced. Consider it not necessarily to be a single gateway; this can be a whole set of providers so you can have – as I said at the beginning, the load balancer rules logic can be combined with whatever other kind of routing that you have. So you may consider sending to some provider or to some local gateways or to some other kind of types of gateways. Because you can have PSTN, GSM and so on. So you can combine the logic to determine and also to do the load balancing between the available gateways.

Also you can mix this inbound and outbound. A simple example, we have PBXs, so you balance the traffic, all the traffic you get over these PBXs. And when the PBX detects that a call needs to be sent for example to PSTN, then that sends all the traffic to a balancer and the balancer takes care of all the traffic properly routing the call to the available gateways.

This makes the balancing logic more complex because, as I said, we may have different groups to balance. In this case you have the first balancer using the PBX group and this balancer using the gateway group. So you may again consider a single one that is able to do both jobs at the same time: a single OpenSIPS installation that can do the same type of balancing for the inbound traffic and for the outbound traffic by simply using different groups to identify the pools of servers.

Now, some details about how this actually works. Before using something it’s better to understand how it works, just to see how it fits exactly in your set-up. First of all you need dialogue support, so you have to load the dialog module; it is mandatory, or you won’t be able to use the load balancing module. Then, the peers, or destinations from the load balancer point of view, are identified simply by SIP addresses, so by SIP URI. You can identify a peer by IP, port and transport if you want, and so on.

Again, it’s important to mention that the destinations are not alike. They may differ in capacity, so one peer may have 30 channels for whatever and a different peer may have only 20 channels. And they may also differ in what resources they offer. I can have an Asterisk providing voicemail and conference and I can have a different Asterisk providing only conference. So if I have a call for conference, I have to consider balancing over the boxes with conference support.

Now, I keep mentioning resources. What are resources, actually? They are the capabilities of a peer. Just for better understanding I have an example. We have a bunch of media servers and a server can support one or several of these functions. So it can be a transcoder, voicemail, conference, announcement or PSTN server, or it can support a set of these functionalities.

So imagine a case where you have a call to PSTN and you detect before sending to the PSTN, by simply looking at the SDP part, that you also need to do transcoding. In that case, in order to complete the call, at least from a resource point of view, you will need a box that does both transcoding and PSTN gatewaying. You need two functionalities. You also may have a situation where you don’t need transcoding and you simply go to PSTN. In that case you need to inform the load balancer that only the PSTN resource is required in order to perform load balancing.

Then we have the groups, the load balancing groups. We saw the scenario with the mixed load balancing, where we balance both the inbound and the outbound traffic. Well, that’s exactly what a group is for: you can have multiple groups of servers to balance over. In this scenario you can define, for example, group “0” for the inbound part, the PBXs, and group “1” for the outbound part, to do balancing over the gateways.

How you define a peer, because the load balancing works with a peer/destination. First of all, as I said, a peer is identified by the SIP URI, so by the address of the peer from a SIP point-of-view. Then you have the group it belongs to and for each peer you put the capacity. Actually, I got it wrong. First you have to put the list of resources and then for each resource you put the capacity for that specific resource.

The capacity of the peer more or less is how many simultaneous calls the peer can provide for that resource. So, a simple example, if you have four peers with the first peer having 30 channels for transcoding and 32 for PSTN, so it’s able to provide two resources. And the second has 100 for voicemail and 10 for transcoding and so on.

Here we have for example how it’s defined in the configuration for the balancer. So you have the group, in this case it’s only one single group. You have the addresses for the peers and then you describe the resource and the maximum capacity for that resource. So there is a list of resources with the capacity for each one. And this is information that the load balancer will use in order to perform the balancing. So when you invoke the balancing logic you have to tell first what group to use, so over what servers you want the balancing, and second what resources are required in order to complete the call.
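A peer definition of this shape (group, address, list of resources with a capacity each) could be held in memory roughly as follows. This is a sketch under assumptions: the `name=capacity; name=capacity` text format, the resource labels and the IP addresses are hypothetical, and only the numbers for the first two peers come from the talk.

```python
def parse_resources(spec):
    """Turn 'transc=30; pstn=32' into {'transc': 30, 'pstn': 32}."""
    capacities = {}
    for item in spec.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        name, sep, maximum = item.partition("=")
        if not sep:
            raise ValueError(f"missing capacity in {item!r}")
        capacities[name.strip()] = int(maximum)
    return capacities

# (group, peer address, resources) rows. Addresses are hypothetical;
# the capacities of these two peers are the ones quoted in the talk.
rows = [
    (1, "sip:10.0.0.1", "transc=30; pstn=32"),
    (1, "sip:10.0.0.2", "vm=100; transc=10"),
]

peers = [
    {"group": group, "uri": uri, "capacity": parse_resources(spec)}
    for group, uri, spec in rows
]
print(peers)
```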

The resource detection is done in routing scripts. So you’re looking for whatever information you think is necessary. For example, you may look at RURI, that targets the voicemail application or its conference number and so on. You may look at the SDP to see if some transcoding is needed. So you look at whatever is necessary in order to identify the required resources and then you just simply pass the list to the load balancing function.
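The detection logic lives in the OpenSIPS routing script; the Python below only illustrates the kind of decision involved. Everything specific in it is an assumption made for the example: the `vm` and `conf` number prefixes, the resource names, and the idea that the gateways accept only PCMA and PCMU.

```python
def detect_resources(ruri_user, offered_codecs,
                     gateway_codecs=frozenset({"PCMA", "PCMU"})):
    """Return the list of resources a call needs (illustrative rules only)."""
    resources = []
    if ruri_user.startswith("vm"):
        # Hypothetical convention: 'vm...' targets the voicemail application.
        resources.append("voicemail")
    elif ruri_user.startswith("conf"):
        # Hypothetical convention: 'conf...' is a conference room number.
        resources.append("conference")
    elif ruri_user.lstrip("+").isdigit():
        # A plain phone number goes out to the PSTN.
        resources.append("pstn")
        # If the SDP offers no codec the gateway understands,
        # transcoding is needed as well.
        if not set(offered_codecs) & gateway_codecs:
            resources.append("transcoding")
    return resources

print(detect_resources("+4938112345", ["opus"]))
print(detect_resources("+4938112345", ["PCMA", "opus"]))
```

The resulting list is what would be handed to the load balancing function together with the group.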

So we know the group, we know the resources; then, let’s say, what’s the magic behind the function to get the least loaded peer? First of all, the module will simply select the peers from the requested group, and from the remaining set it will identify the peers that are able to provide the requested resources. So a single box must offer all the requested resources.
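The two filtering steps (by group, then by resources) can be sketched as below. The peer names and the extra group 0 entry are hypothetical, and the resource set given to peer 3 is an assumption, since the talk does not list it.

```python
peers = [
    {"uri": "peer1", "group": 1, "resources": {"transcoding", "pstn"}},
    {"uri": "peer2", "group": 1, "resources": {"voicemail", "transcoding"}},
    {"uri": "peer3", "group": 1, "resources": {"voicemail", "conference"}},  # assumed
    {"uri": "peer4", "group": 1, "resources": {"transcoding", "pstn"}},
    {"uri": "pbx1", "group": 0, "resources": {"voicemail"}},  # hypothetical other group
]

def candidates(all_peers, group, required):
    """Keep peers of the group that offer every required resource."""
    in_group = [p for p in all_peers if p["group"] == group]
    # A single box must offer all requested resources, not just some.
    return [p["uri"] for p in in_group if set(required) <= p["resources"]]

print(candidates(peers, 1, ["transcoding", "pstn"]))
```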

For the remaining set, the load balancer will evaluate in real time what the load is for each peer, per resource, because peers may have multiple resources. The tricky part of the algorithm is here: the winning peer is the peer which has the maximum value for the minimum available load per resource. It sounds a bit strange; that’s why I have an example to show you exactly what this complicated phrase means.

So let’s go back to the definition. We have these four peers, and from the OpenSIPS script you call load balance with the number of the group and with transcoding and PSTN as the requested resources. As all of these peers are in the same group, all will be selected in the first step, and then only boxes one and four will be kept because these are the only boxes that provide both transcoding and PSTN. The others offer only one or none of them.

So we have a subset of peers one and four. And then the next step would be to evaluate the load. I just said what they were, now I put in some values, assuming that the first one has 10 channels in use for transcoding and 18 for PSTN and add in some numbers for the number four peer, like 9 and 16 – that’s the load. So how many ongoing calls are going through that peer. And then we evaluate what’s the available load because that’s more or less the important part.

The decision is not based on the load but on how much available load is on the peer. So from the maximum numbers of channels, the current load is subtracted and you get an available number of channels for each peer and for each resource. And then we get the magic phrase: looking at each peer, we say, “OK, what is the minimum value for the available resources?”

So the first one has only 14 channels available for PSTN, versus 20 for transcoding. The critical resource for the first peer is the PSTN because it has the minimum available load. For the fourth peer, the critical resource is transcoding because there is only one channel left available. So that’s the minimum for the peer.

Right now, for peer one the minimum available load is 14, for PSTN, and for peer four the minimum available load is one, for transcoding. Of course the biggest minimum is selected, the one with more available space: peer number one, because 14 is higher than 1.
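The selection rule, "maximum of the minimum available load per resource", can be replayed with the numbers from the example. This is a sketch of the rule as described in the talk, not the module's source code. Peer 4's transcoding capacity of 10 follows from the talk (9 in use, 1 free); its PSTN capacity of 32 is an assumption, as the talk does not state it.

```python
capacity = {
    "peer1": {"transcoding": 30, "pstn": 32},
    "peer4": {"transcoding": 10, "pstn": 32},  # pstn=32 is assumed
}
load = {
    "peer1": {"transcoding": 10, "pstn": 18},
    "peer4": {"transcoding": 9, "pstn": 16},
}

def critical_free(peer, required):
    """Free channels on the peer's scarcest required resource."""
    return min(capacity[peer][r] - load[peer][r] for r in required)

def pick_peer(peers, required):
    """Choose the peer whose scarcest resource has the most room left."""
    return max(peers, key=lambda p: critical_free(p, required))

required = ["transcoding", "pstn"]
scores = {peer: critical_free(peer, required) for peer in capacity}
winner = pick_peer(capacity, required)
print(scores, winner)
```

Peer 1 scores min(20, 14) = 14 and peer 4 scores min(1, 16) = 1, so peer 1 is selected.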

Just making a nice graphical presentation of this algorithm. So the color blue represents the total number of, the maximum number of channels. We have the used part and free part and here in red you have the critical resource of the minimum available for a peer. That’s the first peer and that’s the fourth peer. Here we have the 14 minimum available space and here we have 1, the minimum available space. So we take the one with the maximum of minimum available space. Selected here would be the first one.

So the whole idea here is to try to avoid overloading a specific resource on a machine. You might have here 100 channels left for PSTN, but for transcoding you may have only one. So if I push a call that requires both transcoding and PSTN to this machine, I will completely use up the transcoding part. I don’t want to exhaust a resource on the machine, because that’s the idea of the load balancer: not to drive any resource to its maximum capacity. In that case I am sending the call here, because I have more available resources for everything.

So that was the way the new load balancing functionality works in OpenSIPS. Now, ideally you would be able to control all this load balancing logic. So there are a bunch of what we call MI commands, management interface commands. The management interface is an external interface through which you can push commands from outside, for example from web applications, to OpenSIPS. You may force an internal action or you can fetch some data from OpenSIPS.

The first idea is to have the possibility to change this load balancing information without restarting. So you have a reload command available, with which you can tell the module to reload everything from the database. You can maybe add a new peer, or maybe there are some new resources, some changes in the resources offered by a peer and so on.

An easy way is to operate these changes first in the database: you do all the changes in the database and then, when you are done, you reload all the load balancing data from the database. For operating changes there is the OpenSIPS Control Panel, a nice provisioning web interface. In the next version it will have a load balancing tool; this tool will allow you to perform changes in the database in a very nice way and also to remotely trigger the reload command.
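The "edit the database first, then reload" workflow can be illustrated with an in-memory SQLite database. This is only a sketch of the idea: the `peers` table, its columns and the `load_peers` helper are hypothetical and do not reflect the actual OpenSIPS schema or its reload command.

```python
import sqlite3

# Hypothetical provisioning table, not the real OpenSIPS schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE peers (group_id INTEGER, uri TEXT, resources TEXT)")
db.execute("INSERT INTO peers VALUES (1, 'sip:10.0.0.1', 'pstn=32')")

def load_peers(conn):
    """Build a fresh in-memory table from the database contents."""
    query = "SELECT group_id, uri, resources FROM peers"
    return {uri: (group_id, resources)
            for group_id, uri, resources in conn.execute(query)}

running = load_peers(db)  # what the balancer uses at startup

# Provisioning change: a new gateway is added in the database only.
db.execute("INSERT INTO peers VALUES (1, 'sip:10.0.0.2', 'pstn=64')")
before_reload = sorted(running)   # the running table has not changed yet

running = load_peers(db)          # the explicit reload step
after_reload = sorted(running)
print(before_reload, after_reload)
```

Until the reload is triggered, the running configuration keeps routing with the old data, which is what allows several database changes to be batched and applied at once.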

Another important way to interact with the load balancing is the resize command. As I said in the beginning, the connection between the load balancer and the pool of servers it is balancing is a bi-directional interaction. So far we’ve seen how the load balancer interacts with the cluster by simply doing the load balancing, but this command is used for sending feedback back from the peers to the load balancer.

For example you may have monitoring tools on the peers, and if some overload or failures are detected on a peer, you can provide feedback to the load balancer and resize the capacity of the resource on that peer. Let’s say you have a conference system and you say, “OK, my conference system can handle up to 100 channels” and you put the size 100. But conferencing is not so simple, because you can’t make an accurate prediction. You might have 100 people in the same conference room or you might have conference rooms with two people inside, and the load will be higher if you have 100 people in a single conference room.

In that case, you can configure the best-case situation with 100 channels and put a monitor on the machine. If you are in the worst case and all these 100 people are going into the same room, you can simply provide information to the load balancer and say, “OK, I’m already starting to be overloaded. Actually I do not accept 100 anymore, I will accept only 80 channels for conference.” So you can provide feedback and resize, and tell the load balancer not to send you any more calls because your capacity was reduced.

Also, it can be used for example in case of failure. Imagine you have a PSTN, you have a gateway, you have two cards, each with 32 channels. So you have 64 channels for PSTN gateway. You put that number in your load balancing definition and the load balancer knows not to send any more than 64 calls to the peer.

Maybe at some moment one card fails, or you want to replace it, or whatever. So you can resize without affecting the whole logic. You can resize the capacity and say, “OK, right now I want the 64 to be 32 because I am operating some changes.” Maybe one card is broken and so on. You can provide the feedback automatically, with failure detection, or simply have some admin make the adjustment. You can even resize to zero if you want the load balancer to stop sending any kind of calls for that resource.
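The effect of resizing can be sketched with the gateway example: two cards of 32 channels, then one card is lost. The data layout and the helper names are hypothetical; in this sketch a resize only changes the limit used for new calls and leaves ongoing calls alone.

```python
capacity = {("gw1", "pstn"): 64}  # two cards with 32 channels each
load = {("gw1", "pstn"): 40}      # ongoing calls, not touched by a resize

def resize(peer, resource, new_max):
    """Change the maximum number of calls accepted for one resource."""
    if new_max < 0:
        raise ValueError("capacity cannot be negative")
    capacity[(peer, resource)] = new_max

def accepts_new_call(peer, resource):
    """A peer gets new calls only while it is below its current capacity."""
    return load[(peer, resource)] < capacity[(peer, resource)]

before = accepts_new_call("gw1", "pstn")              # 40 < 64

resize("gw1", "pstn", 32)                             # one card failed
after_card_failure = accepts_new_call("gw1", "pstn")  # 40 < 32 is false

load[("gw1", "pstn")] = 10                            # calls drain over time
after_calls_ended = accepts_new_call("gw1", "pstn")   # 10 < 32

resize("gw1", "pstn", 0)                              # take the resource out
after_zero = accepts_new_call("gw1", "pstn")
print(before, after_card_failure, after_calls_ended, after_zero)
```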

So these are the examples I just gave. Well, this load balancing module is available in the latest stable version, 1.5, which was released sometime in March. It’s a new feature, right now in beta stage. We’re trying now to collect information about how to enhance this functionality. Right now we have the engine and we use it in all sorts of scenarios. Of course I just gave you very simple scenarios so you have examples.

We found out people have started using all sorts of things so we have the most amazing scenarios. We don’t want to leave it to only that scenario. You never know what new functionalities are required on top of the load balancing in some very strange scenarios.

Right now we have started building around the engine and putting in more and more stuff, just to try to cover as much as possible. That’s why we call this the snowball effect. Right now we have just the core and we start putting more on it. On the roadmap there is some work to be done on failure detection, so as to have better supervision of the peer from the load balancing point of view. To be able at the [inaudible] level to detect and mark the [inaudible] as failed and stop sending traffic to a peer that has been detected as failed.

Also to do automatic re-enabling of a peer: if a peer was detected as failed, for example it is not responding to the SIP traffic, then you can automatically switch probing on, so the load balancer starts to send some probing packets to detect when the peer is back online. Once the peer is back online, it will automatically enable the peer so you can start using it again.

So the good part is that all this is an automatic process, so you don’t need any human intervention. For example, consider the automatic reboot of a server, of a gateway. The load balancer will take the gateway out for the reboot time. It will not send anything to the gateway during the reboot, but once the gateway is back online, it will start using this gateway again automatically, without human intervention.
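The disable, probe and re-enable cycle described as roadmap work might look like the following. It is a sketch under assumptions: the `Peer` class, the helper names and the `is_alive` probe callback are hypothetical, and a real probe would send a SIP request rather than consult a set.

```python
class Peer:
    def __init__(self, uri):
        self.uri = uri
        self.enabled = True

def report_failure(peer):
    """Called when signaling towards the peer fails: stop using it."""
    peer.enabled = False

def probe_disabled(peers, is_alive):
    """Probe only the disabled peers and re-enable the ones that answer."""
    revived = []
    for peer in peers:
        if not peer.enabled and is_alive(peer.uri):
            peer.enabled = True
            revived.append(peer.uri)
    return revived

def usable(peers):
    return [p.uri for p in peers if p.enabled]

# Simulated gateway reboot; 'online' stands in for real probe replies.
peers = [Peer("sip:gw1"), Peer("sip:gw2")]
online = {"sip:gw2"}

def is_alive(uri):
    return uri in online

report_failure(peers[0])                           # gw1 stops answering
during_reboot = usable(peers)
first_probe = probe_disabled(peers, is_alive)      # still down
online.add("sip:gw1")                              # gw1 is back
second_probe = probe_disabled(peers, is_alive)
after_reboot = usable(peers)
print(during_reboot, first_probe, second_probe, after_reboot)
```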

Also there is some work on simple management, to be able to issue commands, for example to enable or disable some peers and some resources, and to finish the integration work with the control panel. The whole control panel is a bunch of tools that allows you to provision OpenSIPS, and there is a new tool coming for provisioning the load balancing stuff. So, questions? Yeah?

Man 1: How do you find how many channels are in use? You said that all the channels use this track called a dialogue module?

Bogdan: Yes, so the load balancing module comes on top of this dialogue module and it’s automatically forcing what we call profiling. So it’s counting for this destination how many dialogues are for each peer.

Man 1: So if I use the dialogue module in both directions, it’s clear to me that if I send a call to a PSTN gateway through the load balancing module, that will count as usage of a channel. But if the call comes from the PSTN gateway, the channel won’t be counted, correct?

Bogdan: Depends – OK, you mean a case where you have an inbound/outbound gateway? Well in that case, that’s part of the normal effect. Yeah, it’s probably quite easy to force counting of inbound calls from the gateway than it is to mark the channels used, without doing the load balancing. Automatically, OK, I have one extra channel in use by whatever reasons.

That’s easy to do because you can do it from the script because you need to know exactly which profiles the module is building to keep track of the call. It’s quite a simple, interesting addition to do.

[inaudible question]

Bogdan: No, by established calls. So the load is the number of ongoing calls.

Man 2: If I have a PSTN gateway then it would also do it from the outside – I don’t know what is the exact order of the gateway and why they are in a provisional state. [inaudible]

Bogdan: Yeah, don’t consider the ongoing calls, only the ones that have been established. But actually the call starts from the invite, so that’s the point where you consider it. These ongoing calls are also considered calls in the early stage, so in between the invite and the [inaudible].
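The counting described in these answers (one counter per destination, incremented when the INVITE is routed and decremented when the dialogue ends) can be sketched as below. The `CallCounter` class and its method names are hypothetical; in OpenSIPS this bookkeeping is done by the dialog module's profiling, not by code like this.

```python
from collections import Counter

class CallCounter:
    """Per-peer count of ongoing calls, including calls still being set up."""

    def __init__(self):
        self.peer_of_call = {}     # Call-ID -> peer the call was sent to
        self.per_peer = Counter()

    def on_invite(self, call_id, peer):
        # The call counts from the INVITE on, before it is answered.
        if call_id not in self.peer_of_call:
            self.peer_of_call[call_id] = peer
            self.per_peer[peer] += 1

    def on_dialog_end(self, call_id):
        # Covers both a failed setup and a hang-up after an answered call.
        peer = self.peer_of_call.pop(call_id, None)
        if peer is not None:
            self.per_peer[peer] -= 1

counter = CallCounter()
counter.on_invite("call-a", "sip:gw1")
counter.on_invite("call-b", "sip:gw1")
counter.on_invite("call-c", "sip:gw2")
counter.on_dialog_end("call-a")
print(dict(counter.per_peer))
```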

Man 1: I don’t remember how the dialogue module works.

Bogdan: So once you try to establish, it would be considered a call and you can reply either “OK, the call is successful” and then eliminate it. So it’s including also the final stage of the dialogue. Other questions?

Man 3: Is there the possibility to have more than one load balancer in case you – it might be better than if you have just one?

Bogdan: Well you mean to share information between two load balancers? So they take over?

Man 3: That is the second question. If you would have more than one then how would you share information?

Bogdan: Right now this part is not possible because there is still some work to be done on the dialogue part. So right now you cannot share the dialogue information between two instances. You cannot do a hot backup, I mean to have two instances running and one taking over from the other. You have to have an active one and, if it crashes, bring up the other one, because the standby cannot monitor the calls.

Man 2: [inaudible]

Bogdan: No, because – OK, for this question, more or less, yes. On the roadmap there is a part, not in the load balancer itself, to make the data persistent across reboots, meaning the information about the profiles. The profiles are the mechanism used to count how many ongoing calls there are for a destination, in our case.

So once this information is saved, when you bring up the same [inaudible] but on a different machine, it will say, “OK, last time there were 20 calls on that machine.” Of course, depending on the period of time during which there was no OpenSIPS running, some calls may have terminated. Somebody hangs up, nobody talks anymore, so the call is terminated. But in the meantime OpenSIPS has no idea that a call was terminated. So it will just count them as ongoing calls that simply expire. [inaudible]

From the middle of the traffic and you simply disappear and appear again, it cannot identify the event that took place in the time that you were offline. But [inaudible] will disappear.

Other questions? No? So in this case I really thank you for staying so late. I was not expecting to see so many of you at this hour. I was just chatting to a friend on the panel before and everyone said, “Oh, I have a plane to catch, I have a train to catch” or “I want to hit the road before dark” and so on. So I said, OK. Sounds dangerous.

Anyway, thank you all a lot and I hope this was interesting for you, this new load balancing stuff.

[applause]