GIT – Summary: Logs are an essential part of any computing system, supporting capabilities from audits to error management. As logs grow and the number of log sources increases (such as in cloud environments), a scalable system is necessary to process logs efficiently. This practice session explores processing logs with Apache Hadoop from a typical Linux system.

Logs come in all shapes, but as applications and infrastructures grow, the result is a massive amount of distributed data that’s useful to mine. From web and mail servers to kernel and boot logs, modern servers hold a rich set of information. Massive amounts of distributed data are a perfect application for Apache Hadoop, as are log files, which are time-ordered, structured textual data.

You can use log processing to extract a variety of information. One of its most common uses is to extract errors or count the occurrence of some event within a system (such as login failures). You can also extract some types of performance data, such as connections or transactions per second. Other useful information includes the extraction (map) and construction of site visits (reduce) from a web log. This analysis can also support detection of unique user visits in addition to file access statistics.


About this article

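You may want to read these articles before working through the exercises:

  • Distributed data processing with Hadoop, Part 1: Getting started
  • Distributed data processing with Hadoop, Part 3: Application development
  • Data processing with Apache Pig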

These exercises give you practice in:

  • Getting a simple Hadoop environment up and running
  • Interacting with the Hadoop file system (HDFS)
  • Writing a simple MapReduce application
  • Writing a filtering Apache Pig query
  • Writing an aggregating Pig query

To get the most from these exercises, you should have a basic working knowledge of Linux®. Some knowledge of virtual appliances is also useful for bringing a simple environment up.

 Exercise 1. Get a simple Hadoop environment up and running

There are two ways to get Hadoop up and running. The first is to install the Hadoop software and then configure it for your environment (the simplest case is a single-node instance, in which all daemons run on a single node). See Distributed data processing with Hadoop, Part 1: Getting started for details.

The second and simpler way is to use Cloudera’s Hadoop Demo VM (which contains a Linux image plus a preconfigured Hadoop instance). The Cloudera virtual machine (VM) runs on VMware, Kernel-based Virtual Machine (KVM), or VirtualBox.

Choose a method, and complete the installation. Then, complete the following task:

  • Verify that Hadoop is running by issuing an HDFS ls command.

Exercise 2. Interact with the HDFS
The HDFS is a special-purpose file system that manages data and replicas within a Hadoop cluster, distributing them to compute nodes for efficient processing. Even though HDFS is a special-purpose file system, it implements many of the typical file system commands. To retrieve help information for Hadoop, issue the command hadoop dfs. Perform the following tasks:

  • Create a test subdirectory within the HDFS.
  • Copy a file from the local file system into the HDFS subdirectory using copyFromLocal.
  • For extra credit, view the file within HDFS using a hadoop dfs command.

Exercise 3. Write a simple MapReduce application

As demonstrated in Distributed data processing with Hadoop, Part 3: Application development, writing a word count map and reduce application is simple. Using the Ruby example demonstrated in that article as a guide, develop a Python map and reduce application, and run it on a sample set of data. Recall that Hadoop sorts the output of map so that like words are contiguous, which provides a useful optimization for the reducer.

 Exercise 4. Write a simple Pig query

As you saw in Data processing with Apache Pig, Pig allows you to build simple scripts that are translated into MapReduce applications. In this exercise, you extract all log entries (from /var/log/messages) that contain both the word kernel: and the word terminating.

  • Create a script that extracts all log lines with the predefined criteria.

Exercise 5. Write an aggregating Pig query

Log messages are generated by a variety of sources within a Linux system (such as kernel or dhclient). In this exercise, you want to discover the various sources that generate log messages and the number of log messages per source.

  • Create a script that counts the number of log messages for each log source.

Exercise solutions

The specific output depends on your particular Hadoop installation and configuration.

 Solution for Exercise 1. Get a simple Hadoop environment up and running

In Exercise 1, you perform an ls command on the HDFS. Listing 1 illustrates the proper solution.

Listing 1. Performing an ls operation on the HDFS

$ hadoop dfs -ls /
drwxrwxrwx    - hue       supergroup           0 2011-12-10 06:56 /tmp
drwxr-xr-x    - hue       supergroup           0 2011-12-08 05:20 /user
drwxr-xr-x    - mapred    supergroup           0 2011-12-08 10:06 /var

More or fewer files might be present depending on use.

Solution for Exercise 2. Interact with the HDFS

In Exercise 2, you create a subdirectory within HDFS and copy a file into it. Note that you create test data by writing the kernel message buffer to a file with dmesg. For extra credit, view the file within the HDFS using the cat command (see Listing 2).

Listing 2. Manipulating the HDFS

$ dmesg > kerndata
$ hadoop dfs -mkdir /test
$ hadoop dfs -ls /test
$ hadoop dfs -copyFromLocal kerndata /test/mydata
$ hadoop dfs -cat /test/mydata
Linux version 2.6.18-274-7.1.el5 ([email protected])...
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX
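
The same sequence can also be scripted. The following is a minimal sketch, assuming the hadoop command used in Listing 2 is on your PATH. The helper names dfs_command and run_dfs are hypothetical and exist only for this illustration. The sketch builds the argument lists for the hadoop dfs subcommands shown above and runs a command only when a hadoop executable is found.

```python
import shutil
import subprocess


def dfs_command(action, *args):
    """Build the argument list for one `hadoop dfs` subcommand (hypothetical helper)."""
    return ["hadoop", "dfs", "-" + action] + list(args)


def run_dfs(action, *args):
    """Run a `hadoop dfs` subcommand and return its output.

    Returns None when no hadoop executable is found on the PATH.
    """
    if shutil.which("hadoop") is None:
        return None
    result = subprocess.run(dfs_command(action, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout


# The HDFS steps from Listing 2, expressed as argument lists.
steps = [
    dfs_command("mkdir", "/test"),
    dfs_command("copyFromLocal", "kerndata", "/test/mydata"),
    dfs_command("cat", "/test/mydata"),
]

# Print the commands rather than running them, so the sketch is safe anywhere.
for step in steps:
    print(" ".join(step))
```

Keeping command construction separate from execution makes it possible to inspect the commands before they touch the cluster.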

Solution for Exercise 3. Write a simple MapReduce application

In Exercise 3, you create a simple word count MapReduce application in Python. Python is actually a great language in which to implement the word count example. You can find a useful writeup on Python MapReduce in Writing a Hadoop MapReduce Program in Python by Michael G. Noll.

This example assumes that you performed the steps of Exercise 2 (to ingest data into the HDFS). Listing 3 provides the map application.

Listing 3. Map application in Python

#!/usr/bin/env python

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print('%s\t1' % word)

Listing 4 provides the reduce application.

Listing 4. The reduce application in Python

#!/usr/bin/env python

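import sys

last_word = None
last_count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue

    # Each input line is "word<TAB>count", as emitted by the mapper.
    cur_word, count = line.split('\t', 1)
    count = int(count)

    if last_word == cur_word:
        # Same word as the previous line: keep accumulating.
        last_count += count
    else:
        # New word: emit the finished total, then start a new one.
        if last_word is not None:
            print('%s\t%s' % (last_word, last_count))
        last_word = cur_word
        last_count = count

# Emit the total for the final word, if any input was seen.
if last_word is not None:
    print('%s\t%s' % (last_word, last_count))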
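
Before submitting the job to Hadoop, you can check the map, sort, and reduce stages locally. The sketch below is a stand-in for Hadoop Streaming and is not part of the original solution. The function names map_words, reduce_counts, and word_count are hypothetical, the sample input is made up, and Python's sorted() plays the role of the sort that Hadoop performs between map and reduce.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map stage: emit a (word, 1) pair for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(sorted_pairs):
    """Reduce stage: sum the counts of each run of identical, contiguous words."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, then sort (standing in for Hadoop's sort step), then reduce."""
    return list(reduce_counts(sorted(map_words(lines))))


# Made-up sample input, for illustration only.
sample = ["eth0 link up", "eth0 link down"]
for word, total in word_count(sample):
    print("%s\t%d" % (word, total))
```

Because groupby only merges adjacent items, the reduce stage relies on the sort in the same way that the reducer in Listing 4 relies on Hadoop delivering like words contiguously.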

Listing 5 illustrates the process of invoking the Python MapReduce example in Hadoop. In the command, MAPPER and REDUCER are placeholders for the names of the files in which you saved Listings 3 and 4.

Listing 5. Testing Python MapReduce with Hadoop

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u2.jar \
    -file MAPPER -mapper MAPPER -file REDUCER -reducer REDUCER \
    -input /test/mydata -output /test/output
$ hadoop dfs -cat /test/output/part-00000
write	3
write-combining	2
wrong.	1
your	2
zone:	2
zonelists.	1

Solution for Exercise 4. Write a simple Pig query

In Exercise 4, you extract /var/log/messages log entries that contain both the word kernel: and the word terminating. In this case, you use Pig in local mode to query the local file (see Listing 6). Load the file into a Pig relation (log), filter its contents to only kernel messages, and then filter that resulting relation for terminating messages.

Listing 6. Extracting all kernel + terminating log messages

$ pig -x local
grunt> log = LOAD '/var/log/messages';
grunt> logkern = FILTER log BY $0 MATCHES '.*kernel:.*';
grunt> logkernterm = FILTER logkern BY $0 MATCHES '.*terminating.*';
grunt> dump logkernterm;
(Dec  8 11:08:48 localhost kernel: Kernel log daemon terminating.)
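
For comparison, the two FILTER steps can be mirrored in plain Python. This is a minimal sketch rather than a replacement for the Pig query. The function name kernel_terminating is hypothetical, and only the first sample line comes from Listing 6; the other two are made up. Because Pig's MATCHES operator must match the entire string, the sketch uses fullmatch() with the same patterns.

```python
import re

# The same regular expressions used in Listing 6.
KERNEL = re.compile(r".*kernel:.*")
TERMINATING = re.compile(r".*terminating.*")


def kernel_terminating(lines):
    """Keep lines that match both patterns, mirroring the two FILTER steps."""
    kernel_lines = [line for line in lines if KERNEL.fullmatch(line)]
    return [line for line in kernel_lines if TERMINATING.fullmatch(line)]


# The first line is from Listing 6; the other two are made up for illustration.
sample = [
    "Dec  8 11:08:48 localhost kernel: Kernel log daemon terminating.",
    "Dec  8 11:08:49 localhost kernel: example message",
    "Dec  8 11:08:50 localhost example: terminating",
]
for line in kernel_terminating(sample):
    print(line)
```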

Solution for Exercise 5. Write an aggregating Pig query

In Exercise 5, extract the log sources and log message counts from /var/log/messages. In this case, create a script for the query, and execute it through Pig’s local mode. In Listing 7, you load the file and parse the input using a space as a delimiter. You then assign the delimited string fields to your named elements. Use the GROUP operator to group the messages by their source, and then use the FOREACH operator and COUNT to aggregate your data.

Listing 7. Log sources and counts script for /var/log/messages

log = LOAD '/var/log/messages' USING PigStorage(' ') AS (month:chararray,
  day:int, time:chararray, host:chararray, source:chararray);
sources = GROUP log BY source;
counts = FOREACH sources GENERATE group, COUNT(log);
dump counts;

Listing 8 shows how to execute the script, saved here as logsources.pig.

Listing 8. Executing your log sources script

$ pig -x local logsources.pig
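
The grouping and counting that the script performs can be sketched in Python for comparison. The function name count_sources is hypothetical and the sample lines are made up; no real output is reproduced here. One assumption differs from Listing 7: the sketch splits on runs of whitespace, whereas PigStorage(' ') splits on every single space, so a padded day such as "Dec  8" may shift the fields in the Pig version.

```python
from collections import Counter


def count_sources(lines):
    """Count log lines per source, taken as the fifth whitespace-separated field."""
    counts = Counter()
    for line in lines:
        fields = line.split()  # splits on runs of whitespace
        if len(fields) >= 5:
            counts[fields[4]] += 1
    return counts


# Made-up sample lines in the /var/log/messages layout, for illustration only.
sample = [
    "Dec  8 11:08:48 localhost kernel: example message one",
    "Dec  8 11:08:49 localhost kernel: example message two",
    "Dec 10 06:56:01 localhost dhclient: example message three",
]
for source, total in sorted(count_sources(sample).items()):
    print("%s\t%d" % (source, total))
```

The Counter plays the role of GROUP followed by COUNT: each distinct source becomes a key, and its value is the number of lines seen for that source.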


Get products and technologies

  • Cloudera’s Hadoop Demo VM (May 2012): Start using Apache Hadoop with a set of virtual machines that include a Linux image and a preconfigured Hadoop instance.