Friday, January 14, 2011

Design considerations: Logging framework for Distributed Service

Model 1: Customizing Log4J & Perf4J Framework

Use asynchronous Log4J appenders (AsyncAppender/JMSAppender or custom appenders) to push log data, as it arrives, to a JMS provider, from where the log messages are picked up and stored in file-based storage or databases as appropriate, per the logging rules and policies.

In the above, the appenders may be termed agent nodes, the JMS provider may be termed a collector node, and the file storage or database may be termed storage nodes/repositories.

Key design considerations:

1. Asynchronous.
2. a) Customization required to store the log data locally if the MQ store fails; b) replay the locally stored messages when the MQ comes back up.
3. JMSAppender needs to be evaluated under high stress.
4. Scalability of the collector and storage nodes needs to be evaluated.
5. Clustering considerations need to be evaluated.
6. Perf4J is used to generate performance reports.
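A minimal Log4J 1.2 configuration for this model could look as follows. This is a sketch only: the JNDI names, topic name, and ActiveMQ provider URL are assumptions and would need to match the actual JMS provider setup.

```xml
<!-- log4j.xml sketch: AsyncAppender wrapping a JMSAppender (JNDI names assumed) -->
<appender name="JMS" class="org.apache.log4j.net.JMSAppender">
  <param name="InitialContextFactoryName"
         value="org.apache.activemq.jndi.ActiveMQInitialContextFactory"/>
  <param name="ProviderURL" value="tcp://mq-host:61616"/>
  <param name="TopicConnectionFactoryBindingName" value="ConnectionFactory"/>
  <param name="TopicBindingName" value="logTopic"/>
</appender>

<appender name="ASYNC" class="org.apache.log4j.AsyncAppender">
  <param name="BufferSize" value="512"/>
  <!-- drop rather than block the service thread when the buffer fills -->
  <param name="Blocking" value="false"/>
  <appender-ref ref="JMS"/>
</appender>

<root>
  <level value="INFO"/>
  <appender-ref ref="ASYNC"/>
</root>
```

Note that the stock JMSAppender simply fails if the JMS provider is down, which is exactly the gap point 2 above says must be filled by customization.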

Model 2: Flume & Log4J/Perf4J framework

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant, with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.

Key Design Considerations:

1. Usage of FileAppenders with appropriate flush policy may lead to enhanced performance.
2. Reliable: Flume has a well-defined recovery strategy if an agent node crashes, and a store-on-failure mode if collector/storage nodes crash.
3. Well defined clustering considerations available for scaling out agent, collector and storage nodes.
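In Flume 0.9.x (the CDH3 release linked below), the agent-to-collector flow above is wired up declaratively from the flume shell. A sketch, where the node names, log path, port, and HDFS destination are all assumptions:

```
# flume shell commands (Flume 0.9.x syntax; names and paths assumed)
exec config agent1 'tail("/var/log/myapp/app.log")' 'agentE2ESink("collector1", 35853)'
exec config collector1 'collectorSource(35853)' 'collectorSink("hdfs://namenode/logs/%Y-%m-%d/", "app-")'
```

The agentE2ESink gives the end-to-end acknowledged (most reliable) delivery mode; agentDFOSink (disk failover) and agentBESink (best effort) trade reliability for throughput.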


http://archive.cloudera.com/cdh/3/flume/UserGuide.html
http://archive.cloudera.com/cdh/3/flume/Cookbook.html


Model 3: Scribe & Log4J/Perf4J framework

Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures. There is a scribe server running on every node in the system, configured to aggregate messages and send them to a central scribe server (or servers) in larger groups. If the central scribe server isn’t available the local scribe server writes the messages to a file on local disk and sends them when the central server recovers. The central scribe server(s) can write the messages to the files that are their final destination, typically on an nfs filer or a distributed filesystem, or send them to another layer of scribe servers.

Key Design Considerations:
1. Usage of FileAppenders with appropriate flush policy may lead to enhanced performance
2. Reliability: The scribe system is designed to be robust to failure of the network or any specific machine, but does not provide transactional guarantees. If a scribe instance on a client machine (we’ll call it a resender for the moment) is unable to send messages to the central scribe server it saves them on local disk, then sends them when the central server or network recovers. To avoid overloading the central server upon a restart, the resender waits a random time between reconnect attempts, and if the central server is near capacity it will return TRY_LATER, which tells the resender to not attempt another send for several minutes.
3. Well defined clustering considerations available for scaling out agent, collector and storage nodes.
4. The architecture is based on a Scribe agent running on every server, with these agents sending logs to a central Scribe server.
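The store-and-forward behaviour described in point 2 is configured on the client-side Scribe instance as a buffer store with a network primary and a file secondary. A sketch of such a scribe.conf, where the host name and spool path are assumptions:

```
port=1463

<store>
category=default
type=buffer

# flush buffered messages to the primary once it recovers
target_write_size=20480
max_write_interval=1

<primary>
type=network
remote_host=central-scribe.example.com
remote_port=1463
</primary>

# local-disk fallback used while the central server is unreachable
<secondary>
type=file
fs_type=std
file_path=/var/spool/scribe
max_size=1000000
</secondary>
</store>
```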


Model 4: Custom Log4J AsyncAppender/SocketAppenders and NIO SocketServer (Non-blocking IO SocketServer)

Using this model, the AsyncAppender sends messages to SocketAppenders (agents) that write the data to a SocketServer (collector), which writes the data to the appropriate storage nodes (file-based or DB).

Key Design Considerations:

1. a) Customization required to store the log data locally if the SocketServer fails; b) replay the locally stored messages when the SocketServer comes back up.
2. Scalability of the collector and storage nodes needs to be evaluated.
3. Clustering considerations need to be evaluated.
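To make the collector side concrete, here is a minimal sketch of a single-threaded NIO SocketServer loop using java.nio. The class name and the in-memory queue are illustrative only; a real collector would append the received data to the file-based or DB storage nodes instead of queueing it.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a non-blocking collector: one selector thread accepts agent
// connections and drains whatever log bytes arrive on them.
class NioLogCollector implements Runnable {
    private final Selector selector;
    private final ServerSocketChannel server;
    // Stand-in for the storage node: received log payloads land here.
    final BlockingQueue<String> received = new LinkedBlockingQueue<>();

    NioLogCollector(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(port)); // port 0 = ephemeral
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    int port() throws IOException {
        return ((InetSocketAddress) server.getLocalAddress()).getPort();
    }

    @Override
    public void run() {
        ByteBuffer buf = ByteBuffer.allocate(8192);
        try {
            while (selector.isOpen()) {
                selector.select(200);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        SocketChannel ch = server.accept();
                        ch.configureBlocking(false);
                        ch.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        SocketChannel ch = (SocketChannel) key.channel();
                        buf.clear();
                        int n = ch.read(buf);
                        if (n < 0) { key.cancel(); ch.close(); continue; }
                        // A real collector would hand this to a storage writer.
                        received.put(new String(buf.array(), 0, n, StandardCharsets.UTF_8));
                    }
                }
            }
        } catch (Exception e) {
            // selector closed or interrupted: fall through to shutdown
        }
    }

    void stop() throws IOException { selector.close(); server.close(); }
}
```

A Log4J SocketAppender on the agent side would connect to this port; the single selector thread is what lets one collector multiplex many agent connections without a thread per socket.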

The following NIO SocketServer implementation is very similar to what I have in mind. The SocketServer from project Voldemort could be customized to meet the centralized logging requirements.

http://sna-projects.com/blog/2009/08/introducing-the-nio-socketserver-implementation/


Across all of the above design models, one thing is pretty clear: the key components of the architecture are the following:

- A set of agent nodes residing on the web/app servers (physical nodes) that host one or more services
- A set of collector nodes, logically separated from the agent nodes, which may reside on the same or different physical nodes as the web/app servers
- A set of storage nodes, which are basically file-based or database repositories


My recommendation, in order of preference:

1. Model 4, which involves socket-based communication, provided some staff can be allocated for the customization and it proves out well. A further advantage is that the framework could be customized as and when required.
2. Model 2/3: Flume/Scribe & Log4J/Perf4J framework
3. Model 1: Customizing Log4J & Perf4J Framework to work with AsyncAppender/JMSAppender
