CHOReOS was an FP7 project. It is now fully completed. This website is kept online for information purposes only and is no longer updated. Please visit CHOReVOLUTION, the project that takes over from CHOReOS.

Multi-Source Monitoring Framework

Monitoring is an essential means for managing service-oriented applications.
Specifically, the observation and analysis of events happening at different abstraction layers support the governance of service choreographies at run-time. Furthermore, event correlation is crucial when monitoring rules aim at checking the exposed levels of Quality of Service against the Service Level Agreements established among the choreography participants.
However, in the extremely dynamic context of large-scale choreographies enacted over distributed networks or clouds, the relevant monitoring rules might not be completely defined a priori, as they may need to be adapted to the specific infrastructure and to the evolution of events. Therefore, we propose here an adaptive multi-source monitoring architecture that can synthesize instances of rules from generic templates whenever some meta-rule is matched.

Overall Monitoring Architecture

In the figure below you can see the overall infrastructure of the ChoreOS Multi-Source Monitoring Framework.

The Multi-source Monitoring Framework that we have developed can correlate the messages monitored at the business-service level with the observations captured by the infrastructure monitoring of the low-level resources.

In the following, we describe the tool infrastructure implementing multi-source monitoring at three levels, namely:

  • business layer, reported in section Business Service Monitoring;
  • infrastructure layer, reported in section Infrastructure Monitoring;
  • event monitoring (Complex Event Processor, CEP), reported in section Event Monitoring.

These three tools are loosely integrated by means of the Distributed Service Bus (DSB).

The DSB distinguishes between a set of channels dedicated to the monitoring activities (i.e., the Control Plane) and other channels where both coordination and application messages can flow (i.e., the Data Plane).
The CEP correlates the data passing through the Control Plane.

monitoringOverall-2.png

Business Service Monitoring

The Business Service Monitoring (BSM) level provides the monitoring functionality related to business services, orchestrations and choreographies. Its architecture is exclusively message-driven: it is composed of components that receive data as WS-Notifications from the CHOReOS middleware and send higher-level notifications (QoS, SLA violations, choreography status) resulting from runtime analysis.

BSM enables the monitoring of the business services involved in the enacted choreographies: services participating in the choreography are exposed via the EasyESB bus.
Precisely, BSM is an EasyESB node that has a specific profile dedicated to monitoring purposes.

BSM includes a Data collector, a QoS Runtime Assessment component, and a Glimpse Component.

The Data collector aggregates the observed non-functional data generated by the exchanges between services.
It uses data coming from the QoS Runtime Assessment component.
The Glimpse Component is an integration bridge that enables communication between the BSM and the Glimpse Complex Event Processor by means of JMS notifications.
The QoS Runtime Assessment component evaluates whether the non-functional contracts between services are respected. It is composed of an SLA Manager and the Web Service Distributed Management Component.


Target Audience

Anyone who needs to effectively detect unexpected or undesirable behaviors of services, locate the origin of an issue, or even predict potential failures. For these tasks it is generally necessary to track, combine, and analyze events occurring at different abstraction levels. Therefore, in contrast with the use of several monitors operating in separate contexts, a promising strategy pursued in CHOReOS is to architect SLA monitoring solutions able to reveal or predict run-time anomalies due to the combination of phenomena originating from sources operating at different levels.

Source code

The source code of all three main components is available on the CHOReOS public SVN.

Infrastructure Monitoring

The resource monitor in CHOReOS builds on previous work, borrowing heavily from Ganglia, and has three main components:

  • A set of data collectors that gather local information such as load average, I/O rates, and network utilization. These collectors run on every active node of the cloud.  Data for each node is both made available on demand over TCP/IP and, at the same time, periodically pushed over UDP to be replicated in nearby nodes. These data collectors are simply instances of the Ganglia gmond component.
  • A set of aggregators that periodically pull data from the collectors, summarize it, and keep a historical record of the values received. Each aggregator may be responsible for a subset of the nodes in the system and may also be replicated. Finally, aggregators may be organized hierarchically, so that higher-level ones provide a broader view of the system. Data stored in each aggregator is made available on demand over TCP/IP. These aggregators are instances of the Ganglia gmetad component.
  • A notification mechanism that detects potentially relevant events, such as an exceptional load average or too little available disk space, and notifies the higher-level event monitoring system described in section Event Monitoring for further analysis.
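
The notification mechanism in the last item can be pictured with a minimal sketch. It is not the CHOReOS implementation: the class name, the notification strings and the thresholds are hypothetical, and the metric names "load_one" and "disk_free" are assumed to follow Ganglia naming.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class name, thresholds and notification format are
// illustrative assumptions, not part of CHOReOS or Ganglia.
class InfrastructureNotifierSketch {

    /** One metric sample, in the shape a collector might report it (assumed). */
    static final class Sample {
        final String machineIp;
        final String metric; // assumed metric names: "load_one", "disk_free"
        final double value;

        Sample(String machineIp, String metric, double value) {
            this.machineIp = machineIp;
            this.metric = metric;
            this.value = value;
        }
    }

    // Assumed thresholds, chosen only for illustration.
    static final double MAX_LOAD_ONE = 4.0;
    static final double MIN_DISK_FREE_GB = 1.0;

    /** Returns one notification per sample that crosses its threshold. */
    static List<String> detect(List<Sample> samples) {
        List<String> notifications = new ArrayList<String>();
        for (Sample s : samples) {
            if ("load_one".equals(s.metric) && s.value > MAX_LOAD_ONE) {
                notifications.add("high_load@" + s.machineIp);
            } else if ("disk_free".equals(s.metric) && s.value < MIN_DISK_FREE_GB) {
                notifications.add("low_disk@" + s.machineIp);
            }
        }
        return notifications;
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<Sample>();
        samples.add(new Sample("67.215.65.132", "load_one", 7.5));
        samples.add(new Sample("67.215.65.132", "disk_free", 20.0));
        for (String n : detect(samples)) {
            System.out.println(n);
        }
    }
}
```

In a real deployment the resulting notifications would be forwarded to the event monitor rather than printed.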

The gmetad component is deployed to every node and data is available for the system administrator if needed, but we have not identified a strong use case for the data it provides in ChoreOS, and it is consequently unused. The other components should be deployed in a similar fashion to the figure below.

platform-monitoring-deployment.png

Event Monitoring

In any event-based monitor, a central element is the CEP, which is the rule engine that analyzes the primitive events generated by probes in order to infer complex events matching the consumer requests. Several rule engines can be used for this task (such as Drools Fusion or RuleML); in our release we used Drools.

The Complex Event Processor within the CHOReOS Multi-source Monitoring Framework is able to combine information generated from different abstraction levels and infer the reason why an SLA violation occurs.

How does it work


A violation could be due either to the current status of the infrastructure hosting some of the services involved in the interaction, or to the implementation of some services.

In the former case, the Multi-source Monitoring framework, through the GLIMPSE CEP, may reveal that the SLA has been violated due to either an overload or a crash of the node hosting the service. A possible reaction to this scenario is the migration of the service to another (more powerful, more reliable) node.

In the latter case, the Multi-source Monitoring framework reveals that the SLA has been violated even though the node is available and is not overloaded. Here, the reaction can be a notification requesting the deployment of an updated version of some of the services involved.
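
The two cases above amount to a simple decision on the status of the hosting node at the time of the SLA violation. The following sketch only illustrates that reasoning; the enum and method names are hypothetical and do not come from the GLIMPSE code base.

```java
// Hypothetical sketch of the diagnosis described above; all names are
// illustrative assumptions, not GLIMPSE or CHOReOS APIs.
class ViolationDiagnosisSketch {

    /** Status of the node hosting the service, as inferred from infrastructure events. */
    enum NodeStatus { HEALTHY, OVERLOADED, CRASHED }

    /** Possible reactions to an SLA violation. */
    enum Reaction { MIGRATE_SERVICE, REQUEST_UPDATED_DEPLOYMENT }

    /** Maps the node status observed together with an SLA violation to a reaction. */
    static Reaction diagnose(NodeStatus status) {
        switch (status) {
            case OVERLOADED:
            case CRASHED:
                // Infrastructure is the likely cause: move the service elsewhere.
                return Reaction.MIGRATE_SERVICE;
            default:
                // Node is available and not overloaded: suspect the service itself.
                return Reaction.REQUEST_UPDATED_DEPLOYMENT;
        }
    }

    public static void main(String[] args) {
        for (NodeStatus status : NodeStatus.values()) {
            System.out.println(status + " -> " + diagnose(status));
        }
    }
}
```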

The services interact within the enacted choreography through the Distributed Service Bus that has been deployed on the CHOReOS architecture.

The DSB can be split into two main logical partitions.

The first one is the "Control Plane", a component in charge of receiving and forwarding all the information and messages related to the communication and management of the components of the proposed monitoring framework.

The second one is the "Data Plane", a component implemented by means of an Enterprise Service Bus with the specific task of receiving and managing all the data payload generated at choreography run-time.

monitoringDeployment2nomachineName.png

Rules repositories and Rules Generative process

When the BSM reveals that an SLA violation has occurred, its associated Glimpse Probe sends a warning to the CEP.

The CEP first interacts with an internal registry in order to identify the IP address of the machine running the specific instance of the service that violated the SLA.

Then the Rule Generator component synthesizes and loads a new rule looking for issues on the node hosting that service.


The generated rule (see an example below) is composed of two parts. The first represents the SLA Alert event sent by the BSM to the CEP.

It is identified by the timestamp, a parameter indicating whether the event has already been managed by the CEP (i.e., isConsumed), and the name of the event.

The second part represents the infrastructure event that the Multi-Source Monitoring framework tries to match.

<ComplexEventRuleActionList xmlns="http://labse.isti.cnr.it/glimpse/xml/ComplexEventRule" ... >
 <Insert RuleType="drools">
  <RuleName>SLA violation_Alive_Autogenerated_SecutityCompanyService</RuleName>
  <RuleBody>
      ... ...        
   declare GlimpseBaseEventChoreos
    @role( event )
    @timestamp( timeStamp )
   end
   rule "SecurityCompanyService_INFRASTRUCTURENOTRECEIVED"
   no-loop
   salience 1
   dialect "java"
   when
    $aEvent : GlimpseBaseEventChoreos(this.isConsumed == true, this.getTimeStamp == 1360752984631, this.getEventName == "SLA Alert - SecurityCompanyService")
    $bEvent : GlimpseBaseEventChoreos(this.isConsumed == false, this.getEventName == "alive", this.getMachineIP == "67.215.65.132", this after[0,10s] $aEvent);
   then        
    ResponseDispatcher.LogViolation("...","auto_generated_rule", "\nSLA violation\noccurred on: SecurityCompanyService");
    retract($aEvent);
   end
  </RuleBody>
 </Insert>
</ComplexEventRuleActionList>

Notably, this second part specifies a parameter called getMachineIP containing the IP address of the node that generated the infrastructure-level notification, which is matched against the IP address retrieved from the SLA notification during the generation of the rule.

In addition, the declaration specifies a time window (after[0,10s] in the example) within which the correlation is considered valid.
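
To make the combined IP and time-window condition concrete, here is a minimal sketch of the check that such a rule expresses. It assumes point events with millisecond timestamps and inclusive window bounds; the class and method names are hypothetical, and the actual matching is performed by the Drools engine, not by code like this.

```java
// Hypothetical sketch: mimics the condition "same machine IP and infrastructure
// event within 10 s after the SLA alert". Assumes point events with millisecond
// timestamps and inclusive bounds; not the Drools implementation.
class CorrelationWindowSketch {

    static final long WINDOW_MS = 10000L; // corresponds to after[0,10s]

    static boolean correlates(long slaAlertTimestamp, long infraTimestamp,
                              String expectedIp, String infraIp) {
        long delta = infraTimestamp - slaAlertTimestamp;
        return expectedIp.equals(infraIp) && delta >= 0 && delta <= WINDOW_MS;
    }

    public static void main(String[] args) {
        long alert = 1360752984631L; // timestamp taken from the example rule
        System.out.println(correlates(alert, alert + 4000L, "67.215.65.132", "67.215.65.132"));
        System.out.println(correlates(alert, alert + 15000L, "67.215.65.132", "67.215.65.132"));
    }
}
```

An infrastructure event that arrives before the alert, after the window closes, or from a different machine does not correlate.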

Similarly, the meta-rule synthesizes an additional rule matching the case in which an SLA notification has been received but no notification came from the infrastructure layer within a given time-frame (see the example below).

<ComplexEventRuleActionList xmlns="http://labse.isti.cnr.it/glimpse/xml/ComplexEventRule"...>
 <Insert RuleType="drools">
  <RuleName>SLA_violation_Autogenerated_SecutityCompanyService</RuleName>
  <RuleBody>
      ... ...        
   rule "SecurityCompanyService_INFRASTRUCTURENOTRECEIVED"
      ... ...        
   when
    $aEvent : GlimpseBaseEventChoreos(this.isConsumed == true, this.getTimeStamp == 1360752984631, this.getEventName == "SLA Alert - SecurityCompanyService")
    not(GlimpseBaseEventChoreos(this.isConsumed == false, this.getEventName == "load_one", this.getMachineIP == "67.215.65.132", this after[0,10s] $aEvent));
   then        
    ResponseDispatcher.LogViolation("...","auto_generated_rule", "\nSLA violation\noccurred on: SecurityCompanyService");
    retract($aEvent);
   end
  </RuleBody>
 </Insert>
</ComplexEventRuleActionList>

We now focus on the specific components that support adaptiveness: as depicted in the figure below, we have extended the functionality of the CEP by including the following sub-components: Rules Repository, Rule Generator and Template Repository.


cepAndRuleRepositories4.png


The SLA Repository and the Infrastructure Repository can also include sets of static rules that do not depend on the generative process discussed above.

The Rules Repository component abstracts the definition of three kinds of repositories, each linking a dedicated kind of rule-set.

Specifically, there is a repository storing the rules matching infrastructure events; a repository storing event rules about the SLA agreed among the choreographed business services; and a repository storing the meta-rules.

A meta-rule is a special rule whose body implements the run-time synthesis procedure for populating both the SLA Repository and Infrastructure Repository.


The figure below depicts a UML Sequence Diagram modeling the interaction schema that takes place among the traditional CEP and its new sub-components.

Specifically, the rule generation is done in two steps. First, whenever a meta-rule within the CEP matches, it triggers the synthesis by the Rule Generator component.

The Rule Generator then refers to the entries of the Template Repository relative to the kind of rule to be generated.

ruleSynthesis.png


A rule template is a rule skeleton, the specification of which has to be completed at run-time by instantiating a set of template-dependent placeholders.

The Rule Generator instantiates these placeholders with appropriate values inferred at run-time.

Once the run-time synthesis of the new set of rules is completed, the Rule Generator loads the new rules into their corresponding repository (either the SLA Repository or the Infrastructure Repository) and enables them by refreshing the CEP's rule engine.
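
The placeholder instantiation step can be sketched as plain string substitution. This is only an illustration under stated assumptions: the "${...}" placeholder syntax, the placeholder names and the class name are hypothetical, since the actual template format used by the Rule Generator is not described here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of template instantiation. The "${name}" placeholder
// syntax and all names are illustrative assumptions, not the GLIMPSE format.
class RuleTemplateSketch {

    /** Replaces every "${key}" in the template with the corresponding value. */
    static String instantiate(String template, Map<String, String> values) {
        String rule = template;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            rule = rule.replace("${" + entry.getKey() + "}", entry.getValue());
        }
        return rule;
    }

    public static void main(String[] args) {
        // Skeleton modeled on the infrastructure pattern of the example rules.
        String template = "GlimpseBaseEventChoreos(this.isConsumed == false, "
                + "this.getEventName == \"${eventName}\", "
                + "this.getMachineIP == \"${machineIp}\")";

        Map<String, String> values = new LinkedHashMap<String, String>();
        values.put("eventName", "alive");
        values.put("machineIp", "67.215.65.132"); // value inferred at run-time

        System.out.println(instantiate(template, values));
    }
}
```

After instantiation, the resulting rule text would be loaded into the corresponding repository, as described above.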


Target audience

Developers who would like to correlate information coming from different sources and different layers.

Requirements

The main requirements for the Complex Event Processor (Glimpse) are a running ActiveMQ instance and the ability to run Java 6.

 

Source code

http://websvn.ow2.org/listing.php?repname=choreos&path=%2Ftrunk%2Fmonitoring%2Fem%2F

 

License

The Complex Event Processor is licensed under GPL 3.0.

Documentation

You can find more information about the Event Monitoring (GLIMPSE) at: http://labsedc.isti.cnr.it/tools/glimpse

Contacts


This wiki is licensed under a Creative Commons 2.0 license - Legal Notice