Monitoring is an essential means in the management of service-oriented applications.
Specifically, the observation and analysis of events happening at different abstraction layers support the governance of service choreographies at run-time. Furthermore, event correlation proves crucial when monitoring rules aim at checking the exposed levels of Quality of Service against the Service Level Agreements established among the choreography participants.
However, in the extremely dynamic context of large-scale choreographies enacted over distributed networks or clouds, the relevant monitoring rules might not be completely defined a priori, as they may need to be adapted to the specific infrastructure and to the evolution of events. Therefore, we propose here an adaptive multi-source monitoring architecture that can synthesize instances of rules from generic templates whenever some meta-rule is matched.
In the figure below you can see the overall infrastructure of the ChoreOS Multi-Source Monitoring Framework.
The Multi-source Monitoring Framework that we have developed can correlate the messages monitored at the business-service level with the observations captured by the infrastructure that monitors low-level resources.
In the following we describe the tool infrastructure implementing multi-source monitoring at three levels, namely:
The above three different tools are loosely integrated by means of the Distributed Service Bus (DSB).
The DSB distinguishes between a set of channels dedicated to the monitoring activities (i.e., the Control Plane) and other channels where both coordination and application messages can flow (i.e., the Data Plane).
The CEP correlates the data passing through the Control Plane.
The Business Service Monitoring (BSM) level provides the monitoring functionality related to business services, orchestrations and choreographies. Its architecture is exclusively message-driven, since it is composed of components that receive data as WS-Notifications from the ChoreOS middleware and send higher-level notifications (QoS, SLA violations, choreography status) resulting from runtime analysis.
BSM enables the monitoring of business services involved in the enacted choreographies: services participating in the choreography are exposed via the EasyESB bus.
Precisely, BSM is an EasyESB node that has a specific profile dedicated to monitoring purposes.
BSM includes a Data collector, a QoS Runtime Assessment component, and a Glimpse Component.
The Data collector aggregates the observed non-functional data generated by the exchanges between services.
It uses data coming from the QoS Runtime Assessment Component.
Moreover, the Glimpse Component is an integration bridge that enables communication between the BSM and the Glimpse Complex Event Processor by means of JMS notifications.
The QoS Runtime Assessment is the component that evaluates compliance with the non-functional contracts between services. It is composed of an SLA Manager and the Web Service Distributed Management Component.
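As a minimal sketch of what such a runtime assessment amounts to, the Python fragment below compares observed response times against an agreed bound. The `Sla` and `SlaViolation` classes, their fields, and the `check_sla` function are hypothetical names introduced for illustration only; they are not the API of the BSM or of its SLA Manager, and the choice of mean response time as the metric is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Sla:
    """Hypothetical non-functional contract: a bound on mean response time."""
    service: str
    max_mean_response_ms: float


@dataclass(frozen=True)
class SlaViolation:
    """Hypothetical notification payload describing a broken contract."""
    service: str
    observed_mean_ms: float
    threshold_ms: float


def check_sla(sla: Sla, response_times_ms: List[float]) -> Optional[SlaViolation]:
    """Return a violation if the mean observed response time exceeds the bound."""
    if not response_times_ms:
        return None  # nothing observed yet, so nothing to assess
    mean = sum(response_times_ms) / len(response_times_ms)
    if mean > sla.max_mean_response_ms:
        return SlaViolation(sla.service, mean, sla.max_mean_response_ms)
    return None
```

In a message-driven setting, a non-`None` result would be the point at which a violation notification is emitted.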
To effectively detect unexpected or undesirable service behaviors, locate the origin of an issue, or even predict potential failures, it is generally necessary to track, combine, and analyze events occurring at different abstraction levels. Therefore, rather than using several monitors operating in separate contexts, a promising strategy pursued in CHOReOS is to architect SLA monitoring solutions able to reveal or predict run-time anomalies caused by the combination of phenomena originating from sources operating at different levels.
The source code of all three main components is available on the CHOReOS public SVN.
The resource monitor in ChoreOS leverages previous works by borrowing heavily from Ganglia and has three main components:
The gmetad component is deployed to every node, and its data is available to the system administrator if needed; however, we have not identified a strong use case for that data in ChoreOS, so it is currently unused. The other components should be deployed as shown in the figure below.
In any event-based monitor, a central element is the CEP, the rule engine that analyzes the primitive events generated by probes in order to infer complex events matching the consumer requests. Several rule engines can be used for this task (such as Drools Fusion or RuleML); in our release we used Drools.
The Complex Event Processor within the ChoreOS Multi-source Monitoring Framework is able to combine information generated from different abstraction levels and infer the reason why an SLA violation occurs.
A violation could be due either to the current status of the infrastructure hosting some of the services involved in the interaction, or to the implementation of some services.
In the former case, the Multi-source Monitoring framework, through the GLIMPSE CEP, may reveal that the SLA has been violated due to either an overload or a crash of the node hosting the service.
A reaction to this scenario may involve migrating the service to another (more powerful, more reliable) node. In the latter case, the Multi-source Monitoring framework reveals that the SLA has been violated even though the node is available and is not overloaded.
Here, the reaction can be a notification requesting the deployment of an updated version of some of the services involved.
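The two cases above can be read as a small decision procedure. The following Python sketch illustrates it under stated assumptions: the `Cause` and `Reaction` enumerations, the `diagnose` and `react` functions, and the `overload_threshold` value are hypothetical and chosen only for illustration; the framework expresses this logic as CEP rules rather than as code of this form.

```python
from enum import Enum
from typing import Optional


class Cause(Enum):
    NODE_CRASH = "node crash"
    NODE_OVERLOAD = "node overload"
    SERVICE_IMPLEMENTATION = "service implementation"


class Reaction(Enum):
    MIGRATE_SERVICE = "migrate the service to another node"
    REQUEST_REDEPLOYMENT = "request deployment of an updated service version"


def diagnose(node_alive: bool, load: Optional[float],
             overload_threshold: float = 0.9) -> Cause:
    """Classify an SLA violation from the state of the hosting node.

    `load` is a normalized utilization in [0, 1]; the 0.9 threshold is an
    arbitrary illustrative value, not one taken from the framework.
    """
    if not node_alive:
        return Cause.NODE_CRASH
    if load is not None and load >= overload_threshold:
        return Cause.NODE_OVERLOAD
    # Node is up and not overloaded: the service itself is the likely cause.
    return Cause.SERVICE_IMPLEMENTATION


def react(cause: Cause) -> Reaction:
    """Map a diagnosed cause to the reaction described in the text."""
    if cause in (Cause.NODE_CRASH, Cause.NODE_OVERLOAD):
        return Reaction.MIGRATE_SERVICE
    return Reaction.REQUEST_REDEPLOYMENT
```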
The services will interact within the enacted choreography through the Distributed Service Bus that has been deployed on the ChoreOS architecture.
The DSB can be split into two main logical partitions.
The first one is the "Control Plane", a component in charge of receiving and forwarding all the information and messages related to the communication and management of the components of the proposed monitoring framework.
The second one is the "Data Plane", a component implemented by means of an Enterprise Service Bus, with the specific task of receiving and managing all the data payload generated at choreography runtime.
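To make the separation between the two planes concrete, the sketch below models a bus whose channels are partitioned by plane, so that a subscriber on the Control Plane never sees Data Plane traffic. The `ServiceBus` class, its methods, and the channel names used in the usage note are hypothetical; this is an in-memory illustration, not the DSB or EasyESB API.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Handler = Callable[[dict], None]


class ServiceBus:
    """Toy bus with two logical partitions: 'control' and 'data'."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, DefaultDict[str, List[Handler]]] = {
            "control": defaultdict(list),  # monitoring-related messages
            "data": defaultdict(list),     # coordination and application payload
        }

    def subscribe(self, plane: str, channel: str, handler: Handler) -> None:
        """Register a handler on a channel of the given plane."""
        self._subscribers[plane][channel].append(handler)

    def publish(self, plane: str, channel: str, message: dict) -> int:
        """Deliver a message to the handlers of one plane only.

        Returns the number of handlers that received the message.
        """
        handlers = self._subscribers[plane].get(channel, [])
        for handler in handlers:
            handler(message)
        return len(handlers)
```

For example, a CEP-like consumer subscribed with `bus.subscribe("control", "sla-alerts", handler)` receives only what is published on the control plane, even if a data-plane channel happens to share the same name.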
When the BSM reveals that an SLA violation has occurred, its associated Glimpse Probe sends a warning to the CEP.
The CEP first interacts with an internal registry in order to identify the IP address of the machine running the specific instance of the service that violated the SLA;
then the Rule Generator component synthesizes and loads a new rule that looks for issues on the node hosting that service.
The generated rule (see an example below) is composed of two parts: the first represents the SLA Alert event sent by the BSM to the CEP.
It is identified by the timestamp, a parameter indicating whether the event has already been handled by the CEP (i.e., isConsumed), and the name of the event.
The second part represents the infrastructure event that the Multi-Source Monitoring framework tries to match.
Notably, this second part specifies a parameter called getMachineIP containing the IP address of the node that generated the infrastructure-level notification, which is matched against the IP address retrieved from the SLA notification during the generation of the rule.
In addition, such a declaration refers to a filter on the window frame within which the correlation should be considered valid.
Similarly, the meta-rule synthesizes an additional rule matching the case in which an SLA notification has been received but no notification came from the infrastructure layer within a given time-frame, see example below.
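The logic of the two generated rules can be sketched in Python as follows. This is not the Drools rule text used by Glimpse; it is an illustration of the matching conditions described above. The `SlaAlert` and `InfraEvent` classes and both functions are hypothetical names, and treating the time window as symmetric around the alert timestamp is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class SlaAlert:
    name: str
    timestamp: float         # seconds
    machine_ip: str          # resolved from the registry when the rule is generated
    is_consumed: bool = False


@dataclass
class InfraEvent:
    name: str
    timestamp: float         # seconds
    machine_ip: str          # node that produced the infrastructure notification


def _correlates(alert: SlaAlert, event: InfraEvent, window_s: float) -> bool:
    """Same node, and within the (assumed symmetric) time window."""
    same_node = event.machine_ip == alert.machine_ip
    in_window = abs(event.timestamp - alert.timestamp) <= window_s
    return same_node and in_window


def match_infrastructure_cause(alert: SlaAlert, infra_events: Iterable[InfraEvent],
                               window_s: float) -> Optional[InfraEvent]:
    """First rule: an unconsumed SLA alert plus a correlated infrastructure event."""
    if alert.is_consumed:
        return None
    for event in infra_events:
        if _correlates(alert, event, window_s):
            alert.is_consumed = True  # avoid handling the same alert twice
            return event
    return None


def no_infrastructure_cause(alert: SlaAlert, infra_events: Iterable[InfraEvent],
                            window_s: float) -> bool:
    """Second rule: an unconsumed SLA alert with no correlated infrastructure event."""
    if alert.is_consumed:
        return False
    return not any(_correlates(alert, e, window_s) for e in infra_events)
```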
We focus here on the specific components that support adaptiveness: as depicted in the figure below, we have extended the functionality of the CEP by including the following sub-components: Rules Repository, Rule Generator and Template Repository.
The SLA Repository and the Infrastructure Repository can obviously also include sets of static rules that do not depend on the generative process discussed above.
The Rules Repository component abstracts the definition of three kinds of repositories, each linking a dedicated kind of rule-set.
Specifically, there is a repository storing the rules matching infrastructure events; a repository storing event rules about the SLA agreed among the choreographed business services.
A meta-rule is a special rule whose body implements the run-time synthesis procedure for populating both the SLA Repository and Infrastructure Repository.
The figure below depicts a UML Sequence Diagram modeling the interaction schema that takes place among the traditional CEP and its new sub-components.
Specifically, the rule generation is done in two steps. First, whenever a meta-rule within the CEP matches, it triggers the synthesis by the Rule Generator component.
The Rule Generator then refers to the entries of the Template Repository relevant to the kind of rule to be generated.
A rule template is a rule skeleton, the specification of which has to be completed at run-time by instantiating a set of template-dependent placeholders.
The Rule Generator will instantiate the latter with appropriate values inferred at run-time.
Once the run-time synthesis of the new set of rules is completed, the Rule Generator loads the new rules into their corresponding repository (either SLA Repository or Infrastructure Repository) and enables them by refreshing the CEP's rule engine.
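The generation steps described above can be sketched with Python's standard `string.Template`, which fills named placeholders in a skeleton. Everything specific in this fragment is an assumption made for illustration: the template text is pseudo-rule prose rather than real Drools syntax, and `TEMPLATES`, `RuleEngine`, `RuleGenerator`, and `on_meta_rule_match` are hypothetical names, not Glimpse classes.

```python
from string import Template
from typing import Dict, List

# Hypothetical Template Repository: one skeleton per kind of rule.
# The text is illustrative pseudo-rule prose, not Drools syntax.
TEMPLATES: Dict[str, Template] = {
    "infrastructure": Template(
        "rule ${rule_name}: SLA alert for ${service} "
        "AND infrastructure event with machine IP ${machine_ip} "
        "within ${window_s}s"
    ),
}


class RuleEngine:
    """Stand-in for the CEP rule engine and its rule repositories."""

    def __init__(self) -> None:
        self.repositories: Dict[str, List[str]] = {"sla": [], "infrastructure": []}
        self.refresh_count = 0

    def load(self, repository: str, rule: str) -> None:
        self.repositories[repository].append(rule)
        self.refresh_count += 1  # stands in for refreshing the engine


class RuleGenerator:
    def __init__(self, templates: Dict[str, Template],
                 registry: Dict[str, str], engine: RuleEngine) -> None:
        self.templates = templates
        self.registry = registry  # service name -> IP of the hosting node
        self.engine = engine

    def on_meta_rule_match(self, kind: str, service: str, window_s: int) -> str:
        """Instantiate the template for `kind` and load the resulting rule."""
        rule = self.templates[kind].substitute(
            rule_name=f"{kind}_{service}",
            service=service,
            machine_ip=self.registry[service],  # value inferred at run-time
            window_s=window_s,
        )
        self.engine.load(kind, rule)
        return rule
```

`Template.substitute` raises `KeyError` when a placeholder is left without a value, which is a convenient guard against loading an incompletely instantiated rule.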
The main requirements for the Complex Event Processor (Glimpse) are a running ActiveMQ instance and a Java 6 runtime.
The Complex Event Processor's license is GPL 3.0.
You can find more information about the Event Monitoring (GLIMPSE) at: