Wednesday, October 14, 2009

Eric Wong’s mostly pure-Ruby HTTP backend, Unicorn, is an inspiration. I’ve studied this file for a couple of days now and it’s undoubtedly one of the best, most densely packed examples of Unix programming in Ruby I’ve come across.

Unicorn is basically Mongrel (including the fast Ragel/C HTTP parser), minus the threads, and with teh Unix turned up to 11.

That means processes. And all the tricks and idioms required to use them reliably. We’re going to get into how Unicorn uses the OS kernel to balance connections between backend processes using a shared socket, fork(2), and accept(2) — the basic Unix prefork model in 100% pure Ruby.

I like Unicorn because it’s Unix

I really, really love the whole philosophy this article points to. Aside from how easily it leverages the good ideas in a good operating system, it’s the antidote to all the pain and suffering a lot of my colleagues go through day after day due to the thread-based applications they’ve designed or inherited.


“One of the best pieces of advice Mickey ever gave us was to go rent a warehouse and build a prototype of a store, and not, you know, just design it, go build 20 of them, then discover it didn’t work,” says Jobs.

In other words, design it as you would a product.

Apple Store Version 0.0 took shape in a warehouse near the Apple campus. “Ron and I had a store all designed,” says Jobs, when they were stopped by an insight: The computer was evolving from a simple productivity tool to a “hub” for video, photography, music, information, and so forth. The sale, then, was less about the machine than what you could do with it.

But looking at their store, they winced. The hardware was laid out by product category - in other words, by how the company was organized internally, not by how a customer might actually want to buy things.

“We were like, ‘Oh, God, we’re screwed!’” says Jobs.

But they weren’t screwed; they were in a mockup.

“So we redesigned it,” he says. “And it cost us, I don’t know, six, nine months. But it was the right decision by a million miles.”

When the first store finally opened, in Tysons Corner, Va., only a quarter of it was about product. The rest was arranged around interests: along the right wall, photos, videos, kids; on the left, problems. A third area - the Genius Bar in the back - was Johnson’s brainstorm.

“When we launched retail, I got this group together, people from a variety of walks of life,” says Johnson. “As an icebreaker, we said, ‘Tell us about the best service experience you’ve ever had.’” Of the 18 people, 16 said it was in a hotel. This was unexpected. But of course: The concierge desk at a hotel isn’t selling anything; it’s there to help. “We said, ‘Well, how do we create a store that has the friendliness of a Four Seasons Hotel?’” The answer: “Let’s put a bar in our stores. But instead of dispensing alcohol, we dispense advice.”

Why Apple is the best retailer in America - March 19, 2007

This is the kind of thing I want to be a part of when developing software.

Sunday, September 20, 2009

Software Development in a World Without Clear Requirements

A bit of history…

Sometime in early 2004 an awful project I was on got cancelled. It was awful for lots of reasons, none of them especially technical. I was almost the only developer on the project. Normally, this is great for me. I get to write a lot of code and see the whole thing out the door from top to bottom. But not in this case. One of the main issues I had to contend with was constantly shifting requirements, due, in large part, to the fact that the project involved many stakeholders, each one bidding to have their particular vision accepted as the right one.

To get the job done (a web application), I used a company customized version of Struts to develop an application running on a cluster of WebLogic servers. Struts is not the most agile of technologies, obviously enough, but it’s not the worst, either. In fact, I got quite good at getting things done with it.

Nevertheless, the constant stream of requirements changes, some of them sprung on me after I’d finished the application, some of them due to my own misinterpretation of what was said in meetings, meant that I could not keep up with the code. We brought on another developer, who didn’t help much because he was completely unfamiliar with Struts, web applications in general, and all the discussions that had happened before.

(I won’t even bring up the nightmare that was integration via SOAP when you don’t control both sides of the interface. Shudder.)

After the project was cancelled, I had a few months in which I didn’t work on any project at all. In that time, I studied Lisp, Scheme, Python and asynchronous messaging systems and key/value store databases. When I finally got another project to work on, it was a video conferencing project.

I and another developer used Python. I wrote all the front end bits (controlling scheduling and the user interface), and he wrote the backend bits (controlling hardware).

At first we tried to bind our code together, his as a library to my application, but there were just too many issues, mostly involving the wacky Python threading model (at that time), so we used a message bus inspired by IRC: separate processes communicating through a central broker.

Our collaboration using this technology was extremely successful: we implemented a massive amount of functionality in parallel in a very short time using a surprisingly small amount of code. (It took the subsequent team about a year to rebuild the functionality we had. They ported the code to Java, using a distributed object model, rather than messaging. And a much bigger team. They even had a full time build guy!)

Comparing the success of the new project, and the failure of the old one, I came up with a few notes (around late 2004) I’d like to share now. I recently found these notes, and what I noticed about them is that I’ve been trying to get these ideas allowed into my daily practice ever since. I’m pleased to say that when I did manage to get my team members to use these sorts of things, the projects were on time, required fewer people, and were generally easy to change and always adjusted well to constantly changing requirements.

Anyway, here are the notes which were a sketch of a presentation I gave at a Developer Days conference (sponsored by our group at that company):

(Italics denote my current commentary on my ancient notes.)

Frustrations

As we all know, requirements help us figure out how to get done what we need to get done and when. Here’s a list of frustrations you’re likely to encounter even if things are going wonderfully well:

  • “That’s not what we meant.”

  • “That’s cool, but now, let’s change it.”

  • “We appreciate the work you’ve put in, but we’ve been discussing this among ourselves. Let me tell you what we’ve concluded while you were working on that.”

  • Cross-team stake-holders:

    • Partner 1: Basically, it all comes down to A.

    • Partner 2: Basically, it all comes down to B.

    • A completely contradicts B.

    • (Basically, all vision is Parallax vision.)

  • “Well, in a few months, we’re going to hire some developers.”

  • “Let’s use our differing assumptions about what the customers want to justify our disagreements, given that we can’t talk to the actual customers.”

  • “That solution seems simple enough, but we want to use buzzword X with buzzword Y because:”

    • We’ve heard everyone’s going that way,
    • We’ve heard it’s Best Practices,
    • The company is standardizing on that technology,
    • Your choice is not company approved.

I think I’ve heard each of those at status meetings and demos when I thought the work was essentially done.

Problems

As a developer, you need to make engineering choices that help you deal with the following problems:

  • No requirements, or not enough to justify one technology choice or design over another.

  • Frequent “sea” changes that require (or ought to require) a rethink of the basic software architecture, or at least changes from the database on up through to the UI.

  • One day your “model” (code organization, problem breakdown) is what you need, the next day it seems like it was a bad choice: recipe for spaghetti because it’s too hard to start over.

Solutions

The following was my assessment, in 2004, about what features a solution to the above problem should have in order to have a chance at success:

  • Make it as easy to change code as it is to change a paragraph in a requirements doc, a slide, or a diagramming tool.

  • Discover requirements and be able to adjust to them.

  • Discover constraints, and be able to live within them.

Note: Most folks think that you need to have all the requirements down and locked in stone at least for an initial implementation. I’ve never experienced this, and given that projects live or die based on how fast you can bootstrap them, I’ve come to believe that one gathers requirements via implementing ideas. The goal, then, is to figure out how to embrace that and stop doing the things that make it difficult.

If the basic precondition of much software methodology and technology choices is a good set of requirements, then we need a different set of methodologies and technologies when that precondition cannot be met.

Technology Goals

What we need are tools, philosophies, and techniques encouraging:

  • Ability to get stuff done quickly based on little information, guesses, proposals, etc.

  • Guessing wrong should be cheap.

  • Ability to change practically everything without too much cost.

  • Writing executable requirements.

Keeping it simple

  • Python, Ruby, Lisp, Scheme, Groovy, PHP, Erlang

  • Avoid complex build systems (e.g., prefer ant to maven, and avoid projects depending on other projects in an elaborate tree of dependencies).

  • Avoid complex data / business models (data pipeline and transformation over elaborate relational state, if possible).

  • Ability to change things while the app is running.

  • No need to re-compile, re-deploy.

  • Extreme-programming techniques, as much as makes sense, but unit tests if nothing else (esp. with late-bound, dynamic languages).

  • Super decoupled application architecture: talking separate processes for layers, not just object interfaces. Write code to network interface specs, not giant libraries which hide the details. (Think HTTP/Rest vs SOAP.)

  • Prefer asynchronous to synchronous everywhere possible.

  • In-memory DB, then flat-file object persistence, then RDBMS. (An RDBMS should always be your last choice, not your first. Do you really, really need one for your app?).

  • Prefer computed HTML over templates.

  • If using PHP, embed everything in each page: no fancy MVC framework: too hard to debug, too hard to maintain, and really difficult for future maintainers to unravel.

I’d add that AJAX or Javascript clients are preferred over computed HTML or template languages of any sort. Keeps the backend very clean, and it’s very easy to change Javascript quickly as requirements shift over time.

Question Frameworks

  • Fear frameworks: don’t trade the possible complexity of the problem for the definite complexity of the framework.

  • Prefer straight SQL to object/relational mapping frameworks.

  • Recognize that frameworks often exist to overcome the shortcomings of less dynamic languages.

Documentation

  • Generated from code if possible.

  • The code, and what it does, is the documentation.

  • We can help best by documenting the problems we’re solving rather than the implementation. Rather than an SQL E-R diagram (for instance), a description of the type of information and the basic domain entities is more important.

  • Always document for future re-implementors and re-writers, not for future maintainers.

And so it goes…

A lot of what I wrote up back in 2004 still seems controversial among a lot of people I work with. I find, though, that the controversy usually breaks down to a single point: requirements.

When you work in an environment where you start with vague ideas and write code in order to solidify those ideas, to discover the requirements as you evolve a system, the above makes a lot of sense (or would, if I fleshed them out a bit more). This kind of environment is much more on the artistic, intuitive side of software development, the side that acknowledges that every new project is a “first time” situation. If the solution already existed, we’d just buy it, so we might as well embrace the uncertainty and develop techniques to minimize bad choices.

When you work in an environment where the outcome of a given project is absolutely clear, then I think most of the above is not necessary. It’s easy enough to go with the waterfall method, or at least to start there, by gathering all the requirements, making sure they’re written down, and then using those requirements to schedule and scope. In such a case, you can use any technology you want because you can know up front if you’re going down the wrong path simply by looking at your requirements document.

A comfortable world, if you can get it, and one that, I’m convinced, no longer exists. In fact, I bet it never existed.

Sunday, August 2, 2009

A Groovy Clock

Hello!

I write a lot of code that needs to execute periodically. For instance, I’ve written services that need to check a remote file system for changes and import those changes once detected.

I’ve also written quite a few services that need to send information over a message bus periodically to, say, report status information or provide information about a long running process, or update the status of a given resource (such as whether or not a database is up).

Certainly, with Java 1.5 and its built in java.util.concurrent libraries, a lot of this sort of thing has become easier, but it’s always a good idea to make it easier still and, more importantly for me, much more readable.

For instance, given a class that does something useful, a timer (or clock, as I call it) should be super easy, something like (in Groovy):

class Monitor {

    def clock

    Monitor() {
        clock = new Clock(120, this.&execute)
    }

    def execute() {
        println "doing something useful every 2 minutes"
    }

    def start() {
        clock.start()
    }

    def stop() {
        clock.stop()
    }
}

When the Monitor is instantiated, it instantiates a Clock object which takes the number of seconds to wait between ticks, and a closure, which gets called whenever the clock ticks. The idea is that all the machinery for starting, stopping, and maintaining the life-cycle of a periodic timer is encapsulated in the Clock class, including all that complicated Java factory, object-wrapping-other-object stuff that is un-Groovy-like, and, frankly, makes code so unreadable.

Another use case for this kind of thing is a module in your application that fires events at regularly scheduled intervals. With the Clock object, you could separate that concern from your various worker modules, and implement the concern in one place as simply as:

class Events {

    def clocks = []

    def fireDatabaseCheck() {
        Notifications.fire(Notifications.DB_CHECK)
    }

    def fireStatusUpdate() {
        Notifications.fire(Notifications.STATUS_UPDATE)
    }

    def start() {
        clocks = [
            new Clock(120, this.&fireDatabaseCheck),
            new Clock(30, this.&fireStatusUpdate)
        ]
    }

    def stop() {
        clocks.each { clock ->
            clock.stop()
        }
    }
}

You’ll have to imagine a Notifications class used by other long-lived objects to subscribe to various events. And let’s hope that the Notifications class somehow manages concurrency such that messages delivered to subscribers are not held up if the subscribers take a long time to handle the messages. Oy. Well, I think I have a solution for that problem, but I’ll leave that for another blog entry.

The advantage (arguably) of the Events class is that you have all your timed events in one place so that maintainers can easily find them, and once found, can be assured that this one file contains everything they need to know about when events are generated. Regardless, the use of the Clock object helps make the code as clear as possible under the circumstances.

The end of this entry has the complete code, but I’ll take you through some bits and pieces just for fun. Please note that this all works in Java just as well as Groovy, though you’ll have to implement some interfaces and anonymous classes to emulate Groovy’s true closures.

Implementing an actual scheduled task using the java.util.concurrent library is fairly straightforward:

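import java.util.concurrent.*

def scheduler = Executors.newSingleThreadScheduledExecutor()

// Any Runnable will do; this one just prints a message.
def runnable = new Runnable() {
    void run() { println "do something" }
}

// Wait 0 seconds before the first run, then run every 120 seconds.
def task = scheduler.scheduleAtFixedRate(runnable,
                        0, 120, TimeUnit.SECONDS)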

All you have to do is acquire a scheduled thread executor from the library then create a task, which is just an instance that implements the Runnable interface. The parameters are typical: the number of time units to wait before the first run (in this case, 0), the number of time units between ticks (120, here), and the time unit itself.

In Groovy, all closures implement the Runnable interface, so you can just use a closure rather than an object instance:

def start = 0
def interval = 120
def units = TimeUnit.SECONDS

def closure = { println "do something " }
def task = scheduler.scheduleAtFixedRate(closure, start, interval, units)

There’s a problem here, though: if the closure throws an exception, it’ll get swallowed, and the executor quietly stops running the task from then on. You’ll never know it happened unless you catch Throwable inside the task, or add an exception handler to the thread (which I won’t show here).
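To see the swallowing in action, here is a minimal sketch in plain Java (8 or later, for the lambdas). The class name, the 10 ms period, and the "throw on the second tick" rule are all made up for the demonstration. The exception only shows up if something asks the returned future for its result, and no further ticks happen after it.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical demo class, not part of the Clock code in this post.
class SwallowedExceptionDemo {

    // Schedules a task that throws on its second tick and returns the
    // number of ticks that actually ran.
    static int countTicks() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        AtomicInteger ticks = new AtomicInteger();

        ScheduledFuture<?> task = scheduler.scheduleAtFixedRate(() -> {
            if (ticks.incrementAndGet() == 2) {
                throw new IllegalStateException("boom");
            }
        }, 0, 10, TimeUnit.MILLISECONDS);

        try {
            task.get(); // blocks until the task fails; nothing is printed otherwise
        } catch (ExecutionException e) {
            System.out.println("task died quietly: " + e.getCause());
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        sleepQuietly(100); // leave room for more ticks, if there were going to be any
        scheduler.shutdownNow();
        return ticks.get();
    }

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("ticks before the schedule went silent: " + countTicks());
    }
}
```

Without the call to task.get(), nothing at all would be printed about the failure.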

With Groovy, you can use closures to simulate new control structures to help with things like these. For instance:

def safe(Closure closure) {
    return {
        try {
            closure()
        }
        catch (Throwable t) {
            log.info "thread terminated: $t"
        }
    }
}

What this interesting bit of code does is return a closure, which wraps a closure in a try/catch block. It’s a function that takes a function, and returns another function. I know that sounds confusing, but it’s a really nice thing to do in a language that supports functions as first class entities.

What the above enables you to do is something like:

def closure = safe { println "do something" }

The closure variable will be a closure that wraps another closure in a try/catch block. The try catch block does nothing other than log the error. You can imagine adding some functionality that will enable you to recover from certain error conditions, but I’ve found that for the most part, you really can’t as far as periodic events go. You log the error, and then just die until it’s time for another attempt. Presumably the closure you’re using as a scheduled task will know how to report errors to other parts of your application.
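For readers following along in Java rather than Groovy, roughly the same wrapper can be sketched with a Runnable and a lambda (Java 8 or later). The class and method names here are illustrative only, and the task deliberately throws on every tick so you can see that the schedule survives.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical Java counterpart of the Groovy safe() function.
class SafeRunnableDemo {

    // Takes a task and returns a new task that logs, rather than
    // propagates, anything the original throws.
    static Runnable safe(Runnable task) {
        return () -> {
            try {
                task.run();
            } catch (Throwable t) {
                System.err.println("task failed: " + t);
            }
        };
    }

    // Runs an always-failing task for about 200 ms and returns how many
    // times it was attempted.
    static int ticksWithSafe() {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor();
        AtomicInteger ticks = new AtomicInteger();

        scheduler.scheduleAtFixedRate(safe(() -> {
            ticks.incrementAndGet();
            throw new IllegalStateException("boom");
        }), 0, 10, TimeUnit.MILLISECONDS);

        try {
            Thread.sleep(200);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        scheduler.shutdownNow();
        return ticks.get();
    }

    public static void main(String[] args) {
        System.out.println("attempts despite failures: " + ticksWithSafe());
    }
}
```

Because the wrapper never lets the exception reach the executor, the task keeps getting another attempt at every tick.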

At any rate, what you end up with is:

def start = 0
def interval = 120
def units = TimeUnit.SECONDS

def closure = safe { println "do something " }
def task = scheduler.scheduleAtFixedRate(closure, start, interval, units)

Or:

def task = scheduler.scheduleAtFixedRate(
        safe { closure() }, 0, 120, TimeUnit.SECONDS)

The iffy-ist (technical term) part of this whole proposition is stopping the scheduled task. I’ve come up with the following:

if (task)
    task.cancel(true)

if (scheduler) {
    // shutdownNow() hands back whatever was still waiting to run
    def pendingTasks = scheduler.shutdownNow()
    pendingTasks.each { pending ->
        pending.cancel(true)
    }
}

This all seems reasonable, but the problem is the task itself. If it’s doing something like talking to an SFTP server, or hanging on to an HTTP connection, or writing to a database, the task won’t really cancel. Or will it? I’ve never gotten consistent results, and in fact, have given up trying too hard for a generic solution. (Again, I’d rather just move to a better platform for concurrent processing.)

Assuming, though, that the tasks themselves properly terminate via other means, such as setting a shared variable, as in def quitNow = true, then I think the above is sufficient.
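As a rough illustration of the shared-variable idea, here is a Java sketch (8 or later). Everything in it is hypothetical: the class name, the quitNow flag held in an AtomicBoolean, and the fake "work" that is just a loop of short sleeps. The point is only that the task checks the flag between steps, so stopping becomes something the task agrees to rather than something forced on it.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical worker that terminates cooperatively via a shared flag.
class CooperativeStopDemo {
    private final AtomicBoolean quitNow = new AtomicBoolean(false);
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    void start() {
        scheduler.scheduleAtFixedRate(this::longRunningWork,
                0, 50, TimeUnit.MILLISECONDS);
    }

    // Stand-in for slow work: many small steps, checking the flag between them.
    private void longRunningWork() {
        for (int step = 0; step < 1000; step++) {
            if (quitNow.get()) {
                return;
            }
            try {
                Thread.sleep(5);
            } catch (InterruptedException e) {
                return; // shutdownNow() interrupted us; treat it as a request to quit
            }
        }
    }

    // Returns true if the worker thread actually finished within two seconds.
    boolean stop() {
        quitNow.set(true);
        scheduler.shutdownNow();
        try {
            return scheduler.awaitTermination(2, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Starts the worker, lets it run briefly, then stops it.
    static boolean demo() {
        CooperativeStopDemo worker = new CooperativeStopDemo();
        worker.start();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return worker.stop();
    }

    public static void main(String[] args) {
        System.out.println("stopped cleanly: " + demo());
    }
}
```

Work that blocks inside a library call that ignores interrupts would still hang here, which is exactly the inconsistency described above.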

I’ve developed a Worker class which at least helps with this sort of thing. The Worker class takes a closure and manages its life-cycle. It’s great for using as a task in the Clock class.

An interesting modification of the Clock class might be to provide not only a closure that gets called when work needs to be done, but another closure that gets called when work needs to be interrupted.
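One possible shape for that modification, sketched in Java (8 or later) rather than Groovy: a clock that takes a second callback and invokes it before shutting the scheduler down. The class name, the constructor signature (a period plus a TimeUnit instead of plain seconds), and the onStop name are assumptions made for this sketch, not part of the Clock class in this post.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical variant of Clock with a second "please wind down" callback.
class TwoCallbackClock {
    private final long period;
    private final TimeUnit unit;
    private final Runnable onTick;
    private final Runnable onStop;
    private ScheduledExecutorService scheduler;

    TwoCallbackClock(long period, TimeUnit unit, Runnable onTick, Runnable onStop) {
        this.period = period;
        this.unit = unit;
        this.onTick = onTick;
        this.onStop = onStop;
    }

    synchronized void start() {
        scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            try {
                onTick.run();
            } catch (Throwable t) {
                System.err.println("tick failed: " + t);
            }
        }, 0, period, unit);
    }

    synchronized void stop() {
        if (scheduler == null) {
            return; // never started, or already stopped
        }
        onStop.run();            // give the work a chance to wind itself down
        scheduler.shutdownNow(); // then interrupt and discard whatever is left
        scheduler = null;
    }

    // Runs a clock for about 100 ms; returns {tick count, stop-callback count}.
    static int[] demo() {
        AtomicInteger ticks = new AtomicInteger();
        AtomicInteger stops = new AtomicInteger();
        TwoCallbackClock clock = new TwoCallbackClock(
                10, TimeUnit.MILLISECONDS,
                ticks::incrementAndGet, stops::incrementAndGet);
        clock.start();
        try {
            Thread.sleep(100);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        clock.stop();
        clock.stop(); // second call is a no-op
        return new int[] { ticks.get(), stops.get() };
    }

    public static void main(String[] args) {
        int[] counts = demo();
        System.out.println("ticks=" + counts[0] + " stop callbacks=" + counts[1]);
    }
}
```

In practice the onStop callback would do something like set a quit flag or close a connection the tick is blocked on.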

Regardless, please know that the code presented here is in no way foolproof against the host of concurrency problems any Java based technology is heir to. What I hope it does is provide a way to get something working in your application more quickly than you might otherwise have done, or better yet, spur your own “simplifying” thoughts for working with this sort of issue.

Personally, I cut/paste this code into every application I write that needs this sort of thing, and then tweak it to fit the specifics of that app. For me, this is much, much easier than developing a far more comprehensive and flexible library with dozens of classes and tens of ways to modify behavior via interfaces, overrides, injections, subclassing, delegation, and so on. Sounds great: way too hard to maintain.

So, here’s the full code for the Clock class:

package com.zentrope.lib

import java.util.concurrent.*
import java.util.concurrent.TimeUnit.*

import org.apache.log4j.*

class Clock {

    static log = Logger.getLogger(Clock.class)

    def scheduler = null
    def task = null
    def seconds = 0
    def timeUnit = TimeUnit.SECONDS  // no minutes in Java 1.5
    def closure = null

    Clock(seconds, Closure closure) {
        this.seconds = seconds
        this.closure = closure
    }

    def safe(function) {
        return {
            try {
                function()
            }
            catch (Throwable t) {
                log.info "thread terminated: $t"
            }
        }
    }

    def start() {
        scheduler = Executors.newSingleThreadScheduledExecutor()
        task = scheduler.scheduleAtFixedRate(safe { closure() }, 
            0, seconds, timeUnit)
    }

    def stop() {
        if (task)
            task.cancel(true)

        if (scheduler) {
            def tasks = scheduler.shutdownNow()
            tasks.each { task ->
                task.cancel(true)
            }
        }
    }

}

Caveats (and just to repeat yet again):

As implemented above, the closure called by the scheduled thread might block, or take a long time to run, or spawn its own thread and return immediately. As with all things concurrent in Java, you have to pay close attention to these issues or they’ll bite. Chances are, they’ll bite you anyway. That’s why I prefer an Actors metaphor, and thus Erlang.

Sunday, July 26, 2009

Erlang Multicast Presence/Discovery Notification

Overview

Just for fun, I thought I’d show a tiny Erlang app (or library) I started for a project I was going to write in Erlang, but which got cancelled before I could finish up all the infrastructure pieces.

elib_ping, as I call it, enables services running on Erlang nodes to find each other without having to configure them at start time. Of course, this works as long as they’re all running on the same subnet in, I assume, the same data center.

The idea I was working from was that in a given distributed system, you might have dozens of Erlang nodes, all hosting applications of specific types. For instance, one node might be a “worker” node for putting data into a database, and another might be a “reader” node for getting data out of a database. (Think of this as something of a super specialized grid or map/reduce system, but don’t think too much about that: this is just a contrived example scenario.)

What I wanted to do was have all the worker nodes find the reader nodes, and start sending messages to those nodes indicating that the worker was ready to do some work.

What I did NOT want to do was have to pre-configure each node with the location of all the other nodes. Instead, I wanted each node to build up a map of all the others, and then send messages to them according to the work a given node can do. Although not shown in any code presented below, I envisioned the ping application calling net_adm:ping() when it discovered a node, thus registering that node in the nodes() list, which carries with it a lot of functionality, such as heartbeat monitoring.
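The "map of all the others" idea is independent of Erlang, so here is a small language-neutral illustration in Java. It is only a sketch: the class and method names are invented, and the fields it tracks (node, type, status) are borrowed from the presence packet built later in this post. Each incoming ping overwrites the previous entry for that node, and a lookup returns the nodes that match a type and status.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical registry of discovered nodes, keyed by node name.
class PresenceRegistry {

    // node name -> { type, status }; a later ping replaces an earlier one.
    private final Map<String, String[]> nodes = new TreeMap<>();

    // Called whenever a presence packet arrives.
    synchronized void record(String node, String type, String status) {
        nodes.put(node, new String[] { type, status });
    }

    // Returns the names of all known nodes with the given type and status,
    // in alphabetical order.
    synchronized List<String> nodesWith(String type, String status) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, String[]> entry : nodes.entrySet()) {
            String[] info = entry.getValue();
            if (info[0].equals(type) && info[1].equals(status)) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        PresenceRegistry registry = new PresenceRegistry();
        registry.record("w1@alpha", "worker", "running");
        registry.record("r1@beta", "reader", "running");
        registry.record("w2@gamma", "worker", "paused");
        System.out.println(registry.nodesWith("worker", "running"));
    }
}
```

A worker would consult something like nodesWith("reader", "running") to decide where to send its "ready for work" messages.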

Because this is my first little article dealing with Erlang, I’ll provide a simple outline of the parts of an Erlang application in case you’re unfamiliar with such things, and then proceed with the code samples making up the initial pass at the idea.

Erlang Applications

In general, simple Erlang applications are built on a few supplementary files that, taken with the main functional modules, make up an “application” that can be loaded into an Erlang VM and made use of by other applications.

To make this work, you need the following:

  1. an application descriptor file containing metadata describing the modules making up the application,

  2. a main application callback module used to start the application, and containing any useful “interface” functions you’d like to be the main API of the application,

  3. a supervisor callback module, tasked with monitoring the application’s running processes and keeping them going if they fail,

  4. and at least one module making up the actual application itself, which is, quite often, a gen_server callback module of some sort.

With all that said, let’s get the basic boilerplate part of the application out of the way before moving on to the stuff that makes this work.

The Application Part (ping.app)

The application file describes the modules making up the application and the roles those modules play:

{application, ping, [
    {description, "Node Discoverer"},
    {vsn, "1.0"},
    {modules, [ping_serv, ping_sup, ping_app,ping]},
    {registered, [ping_serv,ping_sup,ping_event]},
    {applications, [kernel,stdlib]},
    {mod, {ping_app, []}},
    {start_phases, []}
]}.

The application definition is really just a series of key/value pairs, or properties. When the application is started via:

application:start(ping)

OTP will call the start function on the ping_app module named by the mod property. The registered property notes which processes will have registered names, and the modules property notes all the modules participating in the application. OTP uses this data for, among other things, figuring out what to do if you need to live-upgrade the code.

The Main Module Part (ping_app.erl)

The main application callback module is nothing more than boilerplate:

-module(ping_app).
-behaviour(application).

-export([start/2,stop/1]).

start(_Type, _StartArgs) ->
    ping_sup:start_link().

stop(_State) ->
    ok.

As you can see, the start function is responsible for starting the supervisor. Simple and straightforward.

The Supervisor (ping_sup.erl)

The supervisor is also not much more than boilerplate:

-module(ping_sup).
-behaviour(supervisor).

-export([start_link/0,start_link/1]).

-export([init/1]).

-define(SERVER, ?MODULE).

start_link() ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, []).

start_link(Args) ->
    supervisor:start_link({local, ?SERVER}, ?MODULE, Args).

init([]) ->
    PingServ = {tag1, {ping_serv,start_link,[]},
        permanent,2000,worker,[ping_serv]},
    PingEvent = {tag2, {gen_event,start_link,[{local,ping_event}]},
        permanent,2000,worker,dynamic},
    {ok, {{one_for_one,3,10}, [PingEvent, PingServ]}}.

The init function is responsible (in this case) for defining the two processes the supervisor will monitor:

  1. The Ping Server process, which actually does the work of sending and receiving pings to and from other Erlang nodes,

  2. and the Ping Event process, with which you register handlers so that your application can be notified of ping events.

So far, so good.

The Functional Interface Part (ping.erl)

The functional interface part is not part of the OTP system, but it’s a common convention to have a module named after your application that contains the functions needed to access the application’s functionality, start and stop convenience functions, and so on.

-module(ping).

-export([start/0,stop/0,make/0,set_status/1,set_type/1,add_handler/2]).

start() ->
    case node() of
        'nonode@nohost' ->
            node_name_not_defined;
        _ ->
            application:start(sasl),
            application:start(ping)
    end.

stop() ->
    application:stop(ping).

make() ->
    make:all([load]).

set_status(Status) ->
    when_running(fun() -> ping_serv:set_status(Status) end).

set_type(Type) ->
    when_running(fun() -> ping_serv:set_type(Type) end).

add_handler(Module, Params) ->
    when_running(fun() -> 
        gen_event:add_sup_handler(ping_event, Module, Params) end).

%% Internal functions

when_running(Fun) ->
    case whereis(ping_serv) of
        undefined ->
            ping_app_not_started;
        _Pid ->
            Fun()
    end.

The start function invokes application:start for you and makes sure that all the applications it depends on (in this case, sasl) are also started. I’ve added a make function for interactive development, and an add_handler function to register callbacks for handling ping events.

Whenever the ping server gets a ping packet from somewhere else, it sends a message to (a.k.a. notifies) the ping_event process, so to make use of the ping server, your specific application will need to register a handler.

The set_status and set_type functions are conveniences for configuring the ping server itself. The idea is to set a status (running, paused, busy, sick, etc.), and a type (reader, worker, logger, etc.) so that other nodes can recognize both what the sending node can do, and whether or not it’s ready to do it.

The Ping Server (ping_serv.erl)

The main functionality is a gen_server which uses timeouts as a way to generate timed events for sending out packets.

Here’s the code:

-module(ping_serv).
-behaviour(gen_server).

-define(INFO(Fmt, Args), 
    error_logger:info_msg("INFO:  [~p:~p] " ++ Fmt ++ "~n", 
        [?MODULE, ?LINE | Args])).

-define(ERROR(Fmt, Args), 
    error_logger:error_msg("ERROR: [~p:~p] " ++ Fmt ++ "~n", 
        [?MODULE, ?LINE | Args])).

-define(SERVER, ?MODULE).

-define(PULSE, 5000).

%% Should allow these to be configured.

-define(PORT, 40000).
-define(MULTICAST_ADDR, {239, 10, 11, 12}).
-define(LOCAL_ADDR, {0,0,0,0}).   % neither "localhost" nor {127,0,0,1} works

-export([start/0,start_link/0,stop/0,set_status/1,set_type/1]).

-export([init/1,handle_call/3,handle_cast/2,
    handle_info/2,code_change/3,terminate/2]).

-record(state, {send, recv, type, status, pids=[]}).

%% API

start() ->
    gen_server:start({local, ?SERVER}, ?MODULE, [], []).

start_link() ->
    gen_server:start_link({local, ?SERVER}, ?MODULE, [], []).

stop() ->
    gen_server:call(?SERVER, stop).

set_status(Status) ->
    gen_server:cast(?SERVER, {status, Status}).

set_type(Type) ->
    gen_server:cast(?SERVER, {type, Type}).

%% gen_server

init([]) ->
    ?INFO("spawning send socket", []),
    SendSocket = make_send_socket(),
    RecvSocket = make_recv_socket(),
    ?INFO("sockets: ~p ~p", [SendSocket, RecvSocket]),
    {ok, #state{send=SendSocket, recv=RecvSocket}, ?PULSE}.

%% Reply to the caller before stopping, so ping_serv:stop/0 returns ok.
handle_call(stop, _From, State) ->
    {stop, normal, ok, State};

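handle_call(_Request, _From, State) ->
    %% Return ?PULSE here too; without it the timeout is infinity and
    %% the heartbeat stalls until the next message arrives.
    {reply, ok, State, ?PULSE}.

handle_cast({status, Status}, State) ->
    NewState = State#state{status=Status},
    {noreply, NewState, ?PULSE};

handle_cast({type, Type}, State) ->
    NewState = State#state{type=Type},
    {noreply, NewState, ?PULSE};

handle_cast(_Request, State) ->
    {noreply, State, ?PULSE}.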

handle_info(timeout, #state{send=Send}=State) ->
    Packet = make_packet(State),
    case gen_udp:send(Send, ?MULTICAST_ADDR, ?PORT, Packet) of
        ok ->
            ok;
        {error, Reason} ->
            ?ERROR("~p unable to send packet (~p)", [node(), Reason])
    end,
    {noreply, State, ?PULSE};

handle_info({udp, _Socket, _IP, _InPortNo, Packet}, State) ->
    %% Packet is the {presence, Props} term built by make_packet/1.
    Presence = binary_to_term(Packet),
    gen_event:notify(ping_event, Presence),
    {noreply, State, ?PULSE};

handle_info(Info, State) ->
    ?INFO("info msg: ~p", [Info]),
    {noreply, State, ?PULSE}.

code_change(_OldVersion, State, _Extra) ->
    {ok, State}.

terminate(Reason, #state{send=undefined, recv=undefined}) ->
    ?INFO("terminating (~p)", [Reason]),
    ok;

terminate(Reason, State) ->
    ?INFO("terminating (~p)", [Reason]),
    gen_udp:close(State#state.send),
    gen_udp:close(State#state.recv),
    ok.

%% Internal Functions

make_recv_socket() ->
    Opts = [ { active, true },
             { ip, ?MULTICAST_ADDR },
             { add_membership, { ?MULTICAST_ADDR, ?LOCAL_ADDR } },
             { multicast_loop, true },
             { reuseaddr, true },
             binary ],

    {ok, Socket} = gen_udp:open(?PORT, Opts),
    Socket.

make_send_socket() ->
    Options = [ { ip, ?LOCAL_ADDR },
                { multicast_ttl, 255 }, 
                { multicast_loop, true } ],
    {ok, Socket} = gen_udp:open(0, Options),
    Socket.

make_packet(State) ->
    Packet = {presence, [
                {node, node()},
                {host, inet:gethostname()},
                {type, State#state.type},
                {status, State#state.status}]},
    term_to_binary(Packet).

There’s lots to cover, but I’ll try to be brief. I’m going to assume that if you’re reading this, you’re familiar with the callback nature of the gen_server OTP mechanism. The code contains some constants, definitions and exports, the module’s API, the gen_server callbacks, and internal functions.

The server works by constructing a send socket and a receive socket via gen_udp, with settings I discovered in some code for a similar project called, I think, nodefinder on Google Code. Once the sockets are created, they’re added to a state record and returned from the gen_server’s init function along with a timeout (named ?PULSE here). If no other message arrives before the timeout expires, the gen_server delivers a timeout message to handle_info. The matching handle_info clause sends out a packet on the send socket, and then returns another ?PULSE timeout.

In effect, we have a timer loop, of sorts.
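One subtlety of this loop is worth spelling out: a gen_server timeout is an idle timeout, so any message that arrives first cancels it, and the timer restarts only if the handler returns a new timeout. The following Python sketch models just that rule. It is not Erlang and not part of ping_serv; the function name and the abstract time units are made up for illustration.

```python
def timeout_fire_times(arrivals, pulse, until):
    """Model an idle timeout that restarts after every handled message.

    arrivals: times at which other messages reach the server
    pulse:    idle timeout returned by every callback (like ?PULSE)
    until:    stop the simulation at this time
    Returns the times at which the timeout fires.
    """
    fires = []
    pending = sorted(arrivals)
    next_msg = 0
    deadline = pulse  # init/1 returns at time 0 with a timeout of `pulse`
    while deadline <= until:
        if next_msg < len(pending) and pending[next_msg] < deadline:
            # A message beat the timer: the handler runs and returns a
            # fresh timeout, so the deadline moves out.
            deadline = pending[next_msg] + pulse
            next_msg += 1
        else:
            # Nothing arrived in time: the timeout fires, a packet is
            # sent, and the handler again returns `pulse`.
            fires.append(deadline)
            deadline += pulse
    return fires


print(timeout_fire_times([], 5, 20))      # quiet network: [5, 10, 15, 20]
print(timeout_fire_times([3, 7], 5, 20))  # early packets:  [12, 17]
```

On a quiet network the timer fires every 5 units. With packets arriving at 3 and 7, the first send slips to 12, which is why every callback needs to return the timeout if the heartbeat is to keep going.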

So far so good.

If there are any other similar listeners out on the network (other ping_serv instances running on other nodes), they’ll get that packet.

When a packet arrives across the multicast interface, the receive socket (opened in active mode) delivers it as a message to the ping_serv process, where it is matched against one of the handle_info clauses. The packet is decoded and then sent to ping_event via an asynchronous notify.

If there are any handlers registered with ping_event, they’ll get the packet and can do something interesting with it, such as call net_adm:ping(), or keep a local state of the metadata in the packet. Or do nothing.
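For readers who don’t speak Erlang, here is roughly the same socket setup sketched with Python’s standard socket module. The group address and port are copied from the macros in the listing; the function names, the bind to all interfaces, and the use of 0.0.0.0 as the membership interface are assumptions made for the sketch, and none of it is part of ping_serv.

```python
import socket
import struct

MULTICAST_ADDR = "239.10.11.12"  # same group as ?MULTICAST_ADDR
PORT = 40000                     # same port as ?PORT


def membership_request(group, interface="0.0.0.0"):
    """Pack the ip_mreq structure that IP_ADD_MEMBERSHIP expects."""
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface))


def make_recv_socket(group=MULTICAST_ADDR, port=PORT):
    """Open a UDP socket that receives packets sent to the group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group))
    return sock


def make_send_socket(ttl=255):
    """Open a UDP socket for sending to the group from an ephemeral port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    return sock
```

A sender would then call sendto(payload, (MULTICAST_ADDR, PORT)) on the send socket, and a receiver would loop on recvfrom on the receive socket.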

The packet itself:

{presence, [
    {node, node()},
    {host, inet:gethostname()},
    {type, State#state.type},
    {status, State#state.status}]}

is not all that interesting. I’ve modeled it here as a “presence” packet, named after the similar concept in XMPP. It’s a property list containing:

  • node:
    The node name of the sender (which you can use to ping via the net_adm module), which acts as, among other things, a “process-id”, if you want to do some sort of failover scheme.

  • host:
    The host on which the node is running, mostly just for tourist information, but also to facilitate some sort of balancing, or to help identify which server went down if you no longer get any presence packets from them.

  • type:
    A type, which amounts to the purpose of the node sending out the packet. Examples, as mentioned above, would be “reader” or “worker” or what have you.

  • status:
    A status, which represents the state of the node, such as paused, running, busy.

These are really quite arbitrary, but I hope this gives you an idea how to do this sort of thing.
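To make the property list concrete, here is a hedged Python sketch of a handler that keeps a local table of the metadata in each packet. JSON stands in for term_to_binary, and the function and class names are hypothetical; only the four keys come from the packet above.

```python
import json


def make_packet(node, host, node_type, status):
    """Encode a presence packet (JSON here, in place of term_to_binary)."""
    return json.dumps({"presence": {"node": node, "host": host,
                                    "type": node_type,
                                    "status": status}}).encode("utf-8")


class PresenceTable:
    """Remember the latest presence data seen for each node."""

    def __init__(self):
        self.nodes = {}

    def handle_packet(self, packet, now):
        props = json.loads(packet.decode("utf-8"))["presence"]
        self.nodes[props["node"]] = dict(props, last_seen=now)

    def ready(self, node_type):
        """Names of nodes of the given type whose status is 'running'."""
        return sorted(name for name, props in self.nodes.items()
                      if props["type"] == node_type
                      and props["status"] == "running")


table = PresenceTable()
table.handle_packet(make_packet("a@alpha", "alpha", "worker", "running"), now=0)
table.handle_packet(make_packet("b@beta", "beta", "worker", "paused"), now=1)
print(table.ready("worker"))  # ['a@alpha']
```

Keeping last_seen alongside the properties is what would let a handler notice that a node has gone quiet.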

I’d not recommend this as a main transport for valuable data, but it works well for heart-beat style applications in which a few dropped or garbled messages really don’t matter.

Thursday, July 23, 2009

Case Study: An Asynchronous Web Service (Part 3 of 3)

Overview

There are two other parts to this epic story:

  • Part 1, about the problem we had to solve (validating product serial numbers), and the resources available to us to solve the problem, and

  • Part 2, about our solution: using an asynchronous web service as an external interface to our application, and asynchronous messaging as the backbone of the internal architecture.

This third and final part is a catch all covering some of the operational details of the service, including build, deployment, and monitoring, connecting to the outside world, and testing.

Digression on the Evils of the “Software Factory”

Why would I, a developer of this distributed, asynchronous architecture, have much at all to say about operational details? Let me begin with a digression:

A lot of my colleagues (developers, operational staff, and quality-assurance folks) tend to think that software can be done in an assembly line fashion. The developers write the code, someone else builds it, yet another team tests it, and a final team deploys it. They see this as a sign of organizational maturity, or even as part of the maturation of the software industry at large.

Alas, I don’t believe the above for a minute. Yes, it can work, but, in my own experience, it turns a going concern into a slow-moving, classical IT shop that says “no” to product, marketing, or sales groups, bogging them down in endless progress and process details. (And I’m not even talking about what it does to developers.) In fact, I’d say that projects running in Software Factory mode are, essentially, dead projects. No growth, no change, no evolution, and no radical discoveries that open up whole new possibilities.

I’ve found that the more detached a given development team is from issues of testing and deployment, the more mistakes they make, and the more mandated policy and management is required, thus causing even more mistakes. At best, moving things along is slow. At worst, developers make fundamentally bad designs not because the designs don’t work, but because they’re too hard to operate. What makes sense in a single binary doesn’t make sense when an application is spread over several binaries, and what makes sense on a single workstation doesn’t make sense running on multiple hosts in a data center.

But let me leave all this for another rant. What I’d like to talk about is how the asynchronous messaging architecture facilitated operational concerns.

Build And Deploy

If you read Part 2, you’ll remember that we created five services making up the serial number validation application:

  • Submitter: Accepted jobs for validation.

  • Publisher: Published results of validation.

  • Oracle Querier: Queried a remote Oracle Database for serial number data.

  • Web Service Querier: Queried a remote Web Service for serial number data.

  • Refiner: Delegated validation requests to the above query services, and assembled results for publication, including “fuzzy logic” for “almost” matches.

To build for deployment, we decided on the following principles:

  • Developers should be able to check out each project, compile and run it with no extra environment setup on their development machines. In other words, projects, as organized in a revision control system, should be optimized for developer productivity. And by optimize we meant quick edit-compile-test cycles, and minimal (or no) documentation about how to set up your machine.

  • Production deployment issues should be captured in their own project, which knows how to check out the services, build them, and apply operational details such as configuration, production-oriented log4j.properties (say), file locations, init.d start/stop scripts, etc.

The guiding principle for all of the above was to separate the issue of developing the code from the issue of deploying it and then solving each of the problems according to the problem’s specific requirements. (Using a single build process for both issues makes for something far more complicated than keeping the concerns separate.)

Each of these services existed in a separate directory in a Subversion Repository. Each service was build-able on the command line using ant, which created a “target” subdirectory, moved all the third-party jars, log4j.properties configuration and application classes into that directory, and included a run.sh script which could start the application for testing as you developed code.

Edit, compile, run, test wasn’t much more than the following command line:

target$ (cd .. ; ant ; cd target ; ./run.sh)

After changing the code, you could just hit Control-C, up-arrow (to get the above line), and return. (Experts could refashion the above command line to terminate if the compile was unsuccessful, rather than run the code regardless, by chaining the steps with && instead of ;.) IDE lovers could configure their software to do the above, but why bother? Using the command line guaranteed that other software (such as the packager or tester) could also check out and build your app without involving an IDE.

We created a sixth project directory called the packager, which was responsible for building the code for deployment. The packager created RPM packages (our target was a RedHat Linux VMWare instance). The project contained the production oriented log4j.properties files, RPM spec files for the post/pre-install steps, and so on.

On installation or update, the RPM packages:

  • created non-shell users for each service,

  • installed config files in /etc/,

  • installed RedHat style start/stop scripts in /etc/init.d,

  • deployed the binaries in /opt/apps/,

  • created a data partition for storing published files in /data,

  • configured Apache to redirect all port 80 traffic to port 443,

  • configured Apache to use mod_jk for proxying to the submitter and publisher,

  • managed and rotated the SSL certificate for Apache,

  • set up HTTP Basic Authentication,

and so on and so forth. In other words, installing the RPMs turned a commodity, standard-ops RedHat Linux machine into a Serial Number Validation machine without any user intervention.

The slightly-modified RedHat installed by the operations group had apt-get installed and pointed to a corporate repository for Linux, and so all we had to do in terms of “manual” configuration was add a line to the apt-get config file to point to our own repository.

From then on, deploying code for the first time was a simple:

apt-get install snv

with snv being a meta package which depended on the Apache config package, Apache itself, Java, our services, and so on. The dependencies were arranged such that everything was installed in the correct order.

To upgrade to new versions of the service:

apt-get dist-upgrade

and that was all there was to it. This worked for test environments, QA environments, and so on.

Because of apt-get, we were assured that all dependencies we needed were downloaded and installed, even if we introduced new ones with new versions of the application. It was impossible to install our code if a dependency couldn’t be met, and that’s exactly what we wanted.

The Ops Staff, overworked, underpaid, and under constant threat of being “right-shored,” were very happy about this situation. We developers were happy because our documentation for setting up and maintaining the service wasn’t more than a single page, most of which was letterhead, introductory remarks, contact information, and so on.

Connecting to the Outside World

The serial number validation service was in no way a public service, and was meant, at least initially, to serve only a single client. (We accounted for the possibility of other clients inside the batch submission format, the publication URL construction formula, and other authentication schemes). As such, The Company insisted on a two-way, SSL certificate authentication scheme.

What we ended up with was something like the following:

Front End

The client used a certificate to communicate with a load balanced web proxy farm running in the data center. The web proxy redirected traffic over an SSL encrypted socket connection to an Apache server running as part of our service. The Apache server only allowed connections via port 443, using HTTPS, and redirected all other traffic to an error page. Also, the Apache server was configured with basic HTTP authentication so as to protect it from other services also running on the internal network, of which it was a part.

This is a fairly traditional setup for web services, so I don’t really need to go into it. The setup was also out of our hands as a development team. The one thing to note is that we deployed the Apache setup, including the locally generated certs it needed, as part of our installation, so it needed no intervention by the Ops staff.

In fact, the Ops staff took a cue from us and began to deploy certificates via RPMs on most of their other machines. This made things very easy for them when it came time to update them.

Monitoring, Observing, Etc

The one thing we needed to instrument for the first pass of the application was whether or not the external web interfaces to the application (Submitter, Publisher) were up or down. The idea was that the load balancer between the web proxy farm and our application would detect if the service was down and alert the appropriate support group.

Rather than affix this concern to either of the web services, we decided to apply the idea of ruthlessly separating concerns by creating another service, called the health monitor, which would monitor all the other services, and publish a static Apache page containing the status of the given services.

What this required was that each service implement a module which subscribed to a ping topic, and published to a pong topic. A message on the ping topic would produce an event that led to a message on the pong topic. That message contained the name of the component, its location, and any other details we cared about. For the first pass, all we sent was the name of the component, which was good enough.
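As a rough illustration of how small such a module can be, here is a Python sketch of a ping responder. The original services were Java programs talking to a JMS broker; the class name, the message shape, and the publish callable below are all hypothetical stand-ins.

```python
class PingResponder:
    """Answer every message on the ping topic with one on the pong topic."""

    def __init__(self, component_name, publish):
        self.component_name = component_name
        self.publish = publish  # callable(topic, message), supplied by the host

    def on_ping(self, _message):
        # First pass: the pong carries only the component name.
        self.publish("pong", {"component": self.component_name})


sent = []
responder = PingResponder("submitter",
                          lambda topic, message: sent.append((topic, message)))
responder.on_ping({})
print(sent)  # [('pong', {'component': 'submitter'})]
```

Because the responder only sees a publish callable, it knows nothing about the broker or about who is listening on the pong topic.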

Here’s an illustration of the anatomy of a given service running in our application:

Anatomy of a Service

(The above shows how easy it is to write event driven services which are largely ignorant of the applications feeding them data, and are also largely clear of complicated, data flow logic.)

The monitor service subscribed to all the pong topics, kept track of the last time it saw a pong for a given service, and displayed an error message on a web page if it had not seen a message in over a minute. (In honor of the national security ‘color alert’ system going on at the time, we added a colored square next to the name of the component: with yellow, red, and green, for just how ‘late’ a pong notification was.)
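The bookkeeping behind that page could look something like the following Python sketch. The one-minute limit comes from the description above; the yellow threshold, the class name, and the method names are invented for illustration.

```python
class HealthMonitor:
    """Track the last pong per service and grade how late it is."""

    def __init__(self, yellow_after=30.0, red_after=60.0):
        # red_after follows the one-minute rule in the text;
        # yellow_after is an invented threshold for this sketch.
        self.yellow_after = yellow_after
        self.red_after = red_after
        self.last_pong = {}

    def record_pong(self, component, now):
        self.last_pong[component] = now

    def color(self, component, now):
        seen = self.last_pong.get(component)
        if seen is None:
            return "red"  # never heard from this component
        age = now - seen
        if age > self.red_after:
            return "red"
        if age > self.yellow_after:
            return "yellow"
        return "green"


monitor = HealthMonitor()
monitor.record_pong("submitter", now=0.0)
print(monitor.color("submitter", now=10.0))  # green
print(monitor.color("submitter", now=45.0))  # yellow
print(monitor.color("submitter", now=61.0))  # red
```

Passing the current time in as an argument keeps the grading logic easy to test without waiting a real minute.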

We never went any further than this, but it was pretty clear to us that we could leverage that ping topic for all kinds of status messages, and that we could use a similar set of topics for adjusting service parameters on the fly. I worked on subsequent services where we did this, but that story’s for another day.

Testing

We created a Python test script, similar to JUnit, but suitable for asynchronous testing. It could spawn a process to send a serial number to the service, then wait a bit, then poll for the result, test it, eventually timing out if something went wrong. A Black Box tester. After I left the group, another developer rewrote the whole thing in Java because he was more comfortable with the language and with the threading tools available. The fact that the testing module was separate from all the others, that it was just another project within the source code vault, made it easy to do just this sort of thing. No need to touch all the other code: just create a new testing module, a better one, ditch the old one, and there you go.
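The original script isn’t shown here, but the core of such a tester is a poll-with-timeout loop. A minimal sketch, assuming a fetch function that returns None until the result has been published (all names below are hypothetical):

```python
import time


def poll_for_result(fetch, timeout, interval,
                    clock=time.monotonic, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    Returns the result, or raises TimeoutError once `timeout` seconds
    have passed without one.
    """
    deadline = clock() + timeout
    while True:
        result = fetch()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError("no result within %s seconds" % timeout)
        sleep(interval)


# A stand-in for the real service: the result shows up on the third poll.
answers = iter([None, None, "valid"])
print(poll_for_result(lambda: next(answers), timeout=5, interval=0.01))  # valid
```

The clock and sleep parameters exist only so the loop itself can be tested with a fake clock instead of real waiting.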

Conclusion

These three long semi-essays are really all the conclusion I need: I wouldn’t have written this up if I didn’t think that designing a service in just this way, using the underlying principles, anyway, was just about always the right way to go.

For me, the big win was using topic-based, asynchronous messaging as the way to do interprocess communication between the components of the distributed application.

Using topics disassociated consumers from producers, simulating the adaptability and conceptual simplicity of the stdin, stdout filters making up most of the tools we all know and love on the UNIX command line.

Using an asynchronous mode encourages event-based programming, which tends to make each component much easier to write and far more fault tolerant. Actually, I should amend that: asynchronous messaging forces you to deal with fault tolerance as a design issue rather than an afterthought when you go about making your code production worthy. For instance, if you ship data off to a topic, and don’t know when you’re going to get the results back, the solution of persisting the intermediate state to disk (say), and then re-loading it when a message comes back, is both the solution to an asynchronous request/response pairing, and services the needs of a fault-tolerant system that might crash (or get re-installed) at any time.
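Here is one way that persist-then-reload idea could be sketched in Python. The file layout, the class name, and the request-id scheme are assumptions for illustration, not a description of the original services.

```python
import json
import os
import tempfile


class PendingStore:
    """Keep the state of in-flight requests on disk, one file per request."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, request_id):
        return os.path.join(self.directory, request_id + ".json")

    def save(self, request_id, state):
        # Write to a temp file and rename, so a crash never leaves a
        # half-written state file behind.
        tmp = self._path(request_id) + ".tmp"
        with open(tmp, "w") as handle:
            json.dump(state, handle)
        os.replace(tmp, self._path(request_id))

    def resume(self, request_id):
        """Load and remove the saved state; None if nothing is pending."""
        path = self._path(request_id)
        if not os.path.exists(path):
            return None
        with open(path) as handle:
            state = json.load(handle)
        os.remove(path)
        return state


directory = tempfile.mkdtemp()
PendingStore(directory).save("job-1", {"serial": "ABC123", "step": "web-query"})
# ... the process may crash and restart here ...
print(PendingStore(directory).resume("job-1"))
```

The second PendingStore is a fresh object that shares nothing with the first except the directory, which is the point: whatever survives on disk is all the service needs in order to pick up where it left off.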

Asynchronous modes are so usable, I think, that they should be the default for how you design services rather than an exception. You should only use synchronous calls when there’s no way you can get around it (such as we did for the submitter). And even then, you can often simulate asynchronicity.

Finally, a big win all the way around is the use of packages (or installers) native to the Operating System on which you’re going to deploy the distributed application. This encourages automation, gives you dependency checking for free, reduces the amount of documentation you have to write, and builds trust between the development and operations sides of the house. (Any Ops person who’s had to read a five page “cookbook” for installing updates while in a cube being interrupted by marketing and sales folks will very much appreciate your diligence.)

I went on to use these techniques in a couple of later, much bigger projects, and the developers I worked with, once they gave up thinking I was crazy or self-serving, ended up really liking these techniques. We were always done way ahead of schedule, never had to work weekends (at least not because of our own code), and were generally insufferable in our glee at being ahead of the game.

And what’s not to like? You write very simple, single-purpose applications, and, somehow, as a side effect, you end up with a rich and complex distributed system.

Complexity is, after all, an emergent property of systems of simple components. Make those components OS processes, and you’ve got a distributed application that works and is easy to evolve. Ideas worth embracing.

Wednesday, July 15, 2009

Case Study: An Asynchronous Web Service (Part 2 of 3)

Overview

This is part 2 of a 3-part series about an asynchronous web service I worked on a few years ago which led to a lot of the ideas I now hold about how to design distributed systems. In part 1, I talked about the problem we had to solve, which was:

  • create a web service to validate serial numbers, and

  • figure out how to negotiate numerous internal resources, not all of which are available all of the time, and most of which were expected to change over time.

We ended up deciding to solve the problem by creating an asynchronous web service with the following interaction (from the client’s point of view):

  • A client submits a job containing one or more serial numbers for validation.

  • At a later point in time, the client retrieves the results by using the individual serial numbers to compute a URL.

In other words, the client polls for the results and may resubmit numbers if it feels that the result has not appeared in a reasonable amount of time.

Areas of Concern

The first thing we did in figuring out how to build the application was to figure out what problems we had to solve, or what areas of concern we had as far as solving the problem. Here’s what we came up with:

  • Accepting serial numbers from external clients for validation.

  • Publishing validation results.

  • Querying the Web Service resource for serial number data.

  • Querying the Oracle Database resource for serial number data.

  • Given all the results, refining an appropriate answer.

What we ended up with was a rough pre-design as follows:

Areas of Concern

As you can see, these areas of concern line up pretty obviously along the lines of what external resources they interact with, or clients they serve.

The dashed lines represent the division between the solution space in which we implement pieces, and the partners with whom we need to integrate.

The submitter and publisher line up with the client: they’re client interfaces and most of their concern is made up of how best to interface with a remote client, the Rebate Processing company.

The Oracle and Web Service Query concerns line up with the services they consult. Most of their concern is with how to contact, authenticate, query for and process the results of the data.

It doesn’t take much of a leap of the imagination to see the above five areas of concerns as five separate services within a distributed application. You could also see them as five separate modules in a single monolithic application. (Or even, say, five different Web Applications in a Web Container.)

Why Separate Services?

In my own experience, writing monolithic apps in the presence of ever-changing requirements, integration touch points, and implementation technologies was quite painful: the work it takes to manage separate concerns in the same code base slowed me down considerably.

Therefore, I advocated strongly for maintaining separate applications for each area of concern.

It’s not that monolithic applications are all that bad (oh, all right, they are), it’s just that by merging all five concerns into a single application, one ends up introducing all kinds of additional abstractions in order to manage substantially different tasks.

For instance, just about everyone is tempted to treat the results of the Web Service in a way similar to the results of the Oracle SQL result set in some data abstraction layer that’s really, really cool to implement, but becomes the absolutely wrong thing to do when you need to adjust to a new requirement. And don’t get me started on elaborate XML meta/domain-specific languages meant to bind disparate concerns together into a single binary in hopes of creating something easy to refactor.

Finally, if the implementation of one of your areas of concern is a bit wonky, or uses memory-leaking libraries, it’ll take down the whole app and you’ll never figure out why. Is it due to the implementation of one of the concerns, or due to the impact of one concern’s implementation on another’s when they’re running in the same address space?

By splitting the app into five separate services, you can at least rule out the other four areas if something goes wrong.

Pass 1: A Vague Architecture

Okay, so we decided to write a bunch of stand-alone services rather than a single application.

Here’s what we ended up with:

Vague Architecture

Each area of concern became a new service in the distributed architecture. Each service can be optimized and designed according to its specific concern. For instance, the Submitter Service is good at accepting client connections and validating data without having any part of its code base deal with Oracle JDBC drivers. It can implement caching schemes, thread pools, and so on, depending on how the service needs to work under load.

The biggest win for this kind of separation is how much easier it makes any given developer’s life. If the author of the Submitter Service wrote a lot of ad hoc, not-well-planned, first-draft code, well, no big deal for any subsequent maintainer. Because the code only does one thing, even the worst code ends up being easier to figure out.

The question the above illustration brought up next was how these applications were going to communicate with each other. Back then, SOAP was on the way out. Even when you own all sides of a given distributed application, SOAP proves to be just too much book-keeping all the way around.

That left “simple” sockets and a custom protocol, or HTTP, or something asynchronous, like a JMS provider, which is what we went with.

Pass 2: Asynchronous Messaging

The things we liked about the JMS / asynchronous-messaging approach were:

  • emphasis on interfaces: In a message-based system, the messages are the architecture. As long as the messages are self-describing, complete, autonomous data qua data, nouns instead of verbs, the application becomes easy to document and easy to understand for the people who have to maintain it. In other words, given a certain message going in, and another message coming out, you can pretty much deduce what the service does without any documentation at all. This is a good thing. Burying interface decisions in shared libraries (say) of your own composing, or available via app stacks, such as J2EE or .Net, often hides how things are done and thus makes debugging and integration difficult.

  • decoupled concerns: With asynchronous messaging (using topics rather than queues), your interfaces are decoupled even from the other services making up your application. A given service publishes data to a topic and doesn’t need to be concerned if, or where, a given consumer of those messages resides, or what its purpose is. (With HTTP, you have to know the URL to post to, and that URL has to be up. If it’s not, you have to manage fault tolerance yourself.) The service consumes messages in the same way. It’s as close as you can get to a standard-in, standard-out kind of UNIX command-line filter.

  • event driven: With such a system, individual services can be event driven. A message comes in, it gets processed, and then it gets posted to another topic. Very clean, especially given that messaging infrastructure, in our case, ActiveMQ, provides all fault-tolerant communication for you. Writing the individual services in such a distributed application becomes similar to the callback methodologies in GUI programming (though I’d not want to press on that analogy too much).

  • easy to evolve: If all interprocess communication (except leaf nodes, e.g., the interfaces to the outside world) use topics (the JMS version of the blackboard metaphor), you can hook additional clients to those topics to expand functionality without having to change any of the existing components. This comes in handy for monitoring and metrics, especially during development. Given that we weren’t sure what additional internal resources we might need to consult as the project matured, being easy to evolve with as little code change as possible was a very good thing for us.

  • hot upgrades: If you’re okay with the external interfaces going down for brief moments, you can re-install all the components making up a message based system without taking special care to be “down for maintenance.” As one service shuts down, the message broker keeps the messages in its local store. When the message broker itself goes down, each producer client blocks until it comes up again.
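The “decoupled” and “easy to evolve” points can be sketched with a toy in-memory bus in Python. This is a stand-in, not JMS or ActiveMQ: delivery here is synchronous and nothing is stored, and the class, the callbacks, and the message shape are hypothetical. Only the job-submit topic name is borrowed from this application.

```python
from collections import defaultdict


class TopicBus:
    """In-memory stand-in for a message broker with topics."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Every subscriber gets its own delivery; the publisher never
        # learns who (if anyone) is listening.
        for callback in list(self.subscribers[topic]):
            callback(message)


bus = TopicBus()
received = []
counts = defaultdict(int)

# An existing consumer of the job-submit topic.
bus.subscribe("job-submit", received.append)


# Added later for metrics; neither the producer nor the consumer changes.
def count_job(_message):
    counts["job-submit"] += 1


bus.subscribe("job-submit", count_job)

bus.publish("job-submit", {"serial": "ABC123"})
print(received, dict(counts))
```

The metrics tap was attached with one subscribe call; the code that publishes and the code that consumes are exactly what they were before.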

Another aspect of the technical choice we made was that a message-based system was new to most of us, and it’s good to gain a much broader perspective on the types of architecture one can use to solve problems. The appeal of trying something new rather than suffering from the same old problems is not something to be shrugged off. We’re all human and software is an art and a craft. Sure, it makes use of some engineering principles, some science, some mathematics, and even rules of thumb, but so does any fine art. We wanted to recognize the need to explore alternatives and embrace rather than deny it under the rubric of traditional “best practices”.

If the above seems kind of sketchy for justification, chalk it up partly to my faulty memory, and also to the fact that an architecture that embraces the asynchronous style everywhere it can is best justified by how easy it is to maintain, how little code you have to write to support it, and how simple it is to understand and trouble-shoot. These are experiential justifications which are hard to justify via diagrams or the simple three-tier design that so many managers and stake holders and operational staffs are familiar with.

Pass 3: How it turned out

Given the above diagram as a starting point, and the notion that we wanted to use asynchronous, topic-based messaging as our data pipeline, all we had to do was place a topic between each stand-alone service along all the internal interfaces:

Topics

The above illustrates the one big drawback to messaging systems: they’re hard to draw in such a way that managers or architects don’t rub their eyes and mutter, “too complicated,” or, as I translate it, “too many notes.”

The most complicated part of all of the above is the Refiner Service which has a lot of inputs and outputs.

Here’s a narrative of how the Refiner worked, which should give you a flavor of how easy it is to think about something rather complicated:

  • The Refiner receives a message from the job-submit topic, unpacks it, crafts up a serial-number message, and posts it to the web-service-query topic, and then it’s ready for the next message.

  • The Refiner receives a message from the query-service-result topic, unpacks the message, and examines it. If the serial number is validated, it posts the message to the job-complete topic. If it is NOT validated, it posts the message to the database-query topic. And then it’s done (or, rather, is ready to process the next message).

  • The Refiner receives a message from the database-query-result topic, examines all the results available, figures out how to describe the result (good, shaky, invalid), appends that data to the message, and writes it to the job-complete topic, and we’re done.

With not much imagination, you can see how each of these flows can be organized as a “plugin” floating in the Refiner Service, with an outbox publisher and an inbox subscriber object with appropriate callbacks. Need to query additional resources? Just add more plugins and adjust the existing inboxes or outboxes as necessary. (And remember, changing topic names in your code is a lot easier than changing XML bindings in three config files.)
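The three flows in the narrative can be sketched in Python as small functions keyed by inbox topic. The topic names are taken from the narrative; the message fields (serial, validated, matches) and the grading rule are hypothetical, since the real message format and “fuzzy logic” aren’t described here.

```python
def on_job_submit(message):
    """A new job: ask the web service about the serial number first."""
    return [("web-service-query", {"serial": message["serial"]})]


def on_query_service_result(message):
    """Web service answered: finish if validated, else try the database."""
    if message.get("validated"):
        return [("job-complete", message)]
    return [("database-query", message)]


def on_database_query_result(message):
    """Database answered: grade the result and complete the job."""
    graded = dict(message, result=grade(message))
    return [("job-complete", graded)]


def grade(message):
    # Hypothetical rule; the real "fuzzy logic" is not described in the text.
    matches = message.get("matches", 0)
    if matches >= 2:
        return "good"
    return "shaky" if matches == 1 else "invalid"


# Inbox topic -> plugin. Adding a resource means adding an entry here.
PLUGINS = {
    "job-submit": on_job_submit,
    "query-service-result": on_query_service_result,
    "database-query-result": on_database_query_result,
}


def refine(topic, message):
    """Dispatch one incoming message; return (topic, message) pairs to post."""
    return PLUGINS[topic](message)


print(refine("query-service-result", {"serial": "ABC123", "validated": False}))
# [('database-query', {'serial': 'ABC123', 'validated': False})]
```

Each plugin returns the messages it wants posted rather than posting them itself, so the flows can be exercised without a broker.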

Don’t like how things are going? You might choose to rewrite the Refiner Service, or split it into three services. Regardless, the Submitter, Publisher, and both Resource Query services all remain untouched.

That is, unless you change the message format. But, again, changing the message format is changing the architecture, and even that’s pretty easy (if you use, say, XML and XPath, or even JSON, in which case adding new elements does not require immediate changes if the consuming services don’t need the extra data).
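A tiny Python sketch of that last point, using JSON: a consumer that reads only the field it needs keeps working when new elements are added. The field names are made up for the example.

```python
import json


def serial_of(raw_message):
    """A consumer that reads only the one field it needs."""
    return json.loads(raw_message)["serial"]


old_format = '{"serial": "ABC123"}'
new_format = '{"serial": "ABC123", "region": "EU", "submitted_by": "client-7"}'

print(serial_of(old_format) == serial_of(new_format))  # True
```

Renaming or removing the serial field would be a different matter, of course; that is the kind of change that really does alter the architecture.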

The upshot of all this is not that there won’t be change over time, or that a particular change might not have to be done in multiple places, but that it’s always clear what the impact of any change will be, and, because each service in the application is small and single-focussed, it’s easy to assess the impact of the change on any given subsystem.

I cannot over-emphasize how important this kind of architecture turned out to be for managing change over time with only one or two developers and an extremely over-worked operations staff.

Operational Details

In the next part of this long, long essay, I’d like to discuss some of the operational details that the messaging backbone afforded us, how we deployed and maintained the application as a series of services running on a Linux VMware instance, and, finally, what happened to the service when the maintainers were forced to move it to a J2EE WebLogic cluster solution.

Saturday, July 11, 2009

Case Study: An Asynchronous Web Service (Part 1 of 3)

General Problem

A company I worked for (let’s just call it The Company) sold a lot of products and offered a lot of rebates. The rebates were processed by a Rebate Processor company which took in the serial numbers and other rebate information from customers (such as product descriptions), did all the paperwork, sent out the cash, then billed The Company for its efforts.

In other words, The Company outsourced the handling of rebates, as I imagine many companies do.

The problem, though, was that it was possible for miscreants to introduce fraud into the system by submitting properly formatted serial numbers which were, nevertheless, fake.

What The Company wanted to do was offer a service such that the Rebate Processor could ask us if a given serial number was not only valid (proper numbers and letters in the right order), but had actually been issued against a product instance.

(My apologies for the vague language, but hopefully you understand the legal implications of mentioning anything to anyone. Oy.)

Technical Problem

You’d think that the solution should be pretty easy. Just offer a web service that, when you post a serial number, responds with a “true” or “false,” depending on whether or not the serial number was ever used.

However….

My company did not have a single data source with all the serial numbers ever used for all the products it sold, or had ever sold.

Why?

  • Acquisitions: The Company had acquired many other companies, each of which had their own methods and data stores dedicated to issuing, managing and tracking serial numbers.

  • Federated Divisions: The Company itself had, for a long time, developed a federated culture in which each division was locally managed, with only minimal oversight from the corporate leadership. Each of those divisions represented quite varying products and product families, and each one had its own way of managing serial numbers.

Over the years, there were efforts to consolidate this information, and those efforts were largely successful in that there were just two sources to consult about the validity of serial numbers:

  • Web Service: A web service, with a complicated XML/HTTP interface. Not SOAP, not REST, but just XML posted to and retrieved from an HTTP endpoint.

  • Oracle Database: An Oracle database. A BIG Oracle database, with lots of views, and many tables containing many serial numbers defined in a not-readily-discoverable, and potentially ever-changing way.

These two data sources were internal data sources, and did not have particularly stringent service level agreements. If the Oracle Database needed to go down for maintenance, it went down for maintenance, users beware. Same with the web service. The potentially lackadaisical uptime for these services was reasonable, given what they were normally used for, and given their role in the normal business operations of the company.

Finally, there was a good chance that a perfectly valid serial number on an actual, physical product was not in either data store.

Yikes!

Summary of Complicating Factors

Let’s summarize the situation:

  • More than one internal service.

  • Unreliable internal services.

  • Both internal services (potentially) must be consulted to resolve a question about each submitted serial number.

  • Valid serial numbers might not be found in any internal data source.

  • With new acquisitions, there might be additional data sources to be integrated.

  • The two existing internal services might merge, or morph into a third, grand-vision, data-warehouse-like thing (which is always the threat in a corporation as mind-bogglingly Borg-imitating as The Company).

The bottom line is that any design, we thought, would have to accommodate change over time and maybe even, dare we say it, make that change easy.

Solution Space

In attempting to work out what to do about the above, we contemplated several options, which boil down to the following three approaches:

  1. Synchronous with Synchronized Cache:
    Periodically import all serial number data from all available systems into a local service database, and serve out answers synchronously.

  2. Synchronous, Luck of the Draw:
    Each incoming web request should consult each internal service in turn, and respond with the results, as best it can, even if one or more of them are down.

  3. Asynchronous:
    Submit a request asynchronously, and look for the completed request at a later time. Internally, we move the job around, consulting each source, make our best guess about the validity of the number, and “publish” the result for later pickup.

We chose the last option (thankfully, or there’d be no reason to write this, at least as far as my interests are concerned).

We couldn’t use Option 1, in which we’d import all available data, for several reasons. Even if we imported only the serial numbers with no associated metadata, we’d have more data than our little effort could sustain, and some Architect who didn’t understand that shared state is bad would see the copy as duplication, rather than caching, and nix the project. On top of that, we would also have had to import additional metadata so that we could guess whether a given serial number we didn’t have was at least likely to be legitimate. (We’d publish a “confidence factor” if we couldn’t find an exact match.) And, of course, we only had about two months to develop the entire solution, and even if importing data was fast and easy, procuring enough infrastructure to make it happen most definitely wasn’t.

Option 2 seems, on the surface, the most reasonable, except that one of the resources we needed to consult was a database with millions of rows of data. It was unclear that the queries we’d have to run to make it work could complete before an HTTP request could complete. Timeouts are unpleasant on either side of a remote procedure call. Also, the resulting monolithic webapp code would be further complicated each time we added a new resource to consult, or had to change our strategy. How would we know if fixing one part of the app would break the other, seemingly unrelated part? What we needed was a way to handle a potentially long-running request.

And so, finally, we settled on Option 3, an asynchronous web request style service, which is maybe another way of saying “batch processing”.

The client would submit a batch of serial numbers and any associated metadata (such as the model of the product). We’d return with an OK if the batch job was valid, and at some later point in time, the client would use the numbers in that batch to poll for any results. If the client could find no results for a specific number, they were free to re-submit it in another job.
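The submit-then-poll contract described above can be sketched in a few lines. This is an in-memory illustration under stated assumptions, not the service we built: the class name, the `"OK"`/`"REJECTED"` strings, and the `lookup` callable that stands in for all the back-end processing are hypothetical.

```python
class RebateCheckService:
    """In-memory stand-in for the asynchronous front end (names hypothetical)."""

    def __init__(self):
        self.queue = []    # accepted batches waiting for back-end processing
        self.results = {}  # serial number -> published result

    def submit(self, batch):
        # Only the shape of the batch is checked here; no lookups happen yet.
        if not batch or any(not item.get("serial") for item in batch):
            return "REJECTED"
        self.queue.append(batch)
        return "OK"

    def poll(self, serial):
        # None means "no result yet"; the client is free to re-submit later.
        return self.results.get(serial)

    def process_pending(self, lookup):
        # Stand-in for the back end. `lookup` is a hypothetical callable that
        # returns a result for a serial number, or None if it has no answer.
        while self.queue:
            for item in self.queue.pop(0):
                result = lookup(item["serial"])
                if result is not None:
                    self.results[item["serial"]] = result
```

The point of the shape is that `submit` can only fail in one simple way (the batch is malformed), while everything slow or unreliable happens later and out of the client’s sight.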

Using the above strategy also allowed us to carry the idea of asynchronous services behind the scenes and make it the underlying methodology of the entire supporting architecture.

That architecture, with all the unintended benefits it provided, is the whole point of this long exposition, and will be the main subject matter in the next article.

Comments (View)

Indeed, it’s not really clear to me that operating systems is a valid academic field at all. If someone had axed its funding (the dirtiest word in the English language has seven letters and starts with “F”) in 1980, how different would your computer be?

If abject failure were an obstacle to continued funding, most of “computer science” would have ceased to exist sometime in the ’90s.

Unqualified Reservations: Wolfram Alpha and hubristic user interfaces
Thursday, July 9, 2009

Initial Thoughts on Asynchronous REST

Tim Bray’s article got me thinking about REST and about synchronous vs asynchronous interfaces.

What really got me interested in asynchronous services, especially message-based services of the fire-and-forget kind, was how helpful such things were as you develop and maintain services over time, and across organizations, or even across the “divide” between one head-strong developer and another, or between two tasks that are completely different, but share data.

But that’s for another note, another time.

One of the issues I’ve had with REST is not the style itself, but its synchronous nature, or at least how it’s used in common web-service style architectures as not much more than a function call. Cleaner than SOAP, certainly, more maintainable and understandable, but, basically, a function call.

Nevertheless, a step in the right direction, at least for me, is to be able to use a REST style HTTP interface in a fire-and-forget kind of way.

Tim’s article is mostly about making HTTP requests which initiate actions that take longer than a traditional HTTP request should last. How do you work around connection timeouts?

I’m interested in a slightly different but related idea: how do you make a REST request without getting any answer, but then set things up so that you can get the results of that request at a later time?

Polling

The idea of polling is that you submit a request to a specific resource, which returns an in-progress result code, and a payload with a URL providing you with a resource you can consult about the status of the job.

You can periodically issue a GET on that resource to find out the status of the job. Presumably, when the job is complete, the poll request will provide a link to the finished results (if there are any).

I’d guess that this solution is pretty hard to scale, though that might be done by returning not only how complete the job is as a percentage, but an allowance for how many times in a given time frame a client is allowed to check back. For instance, a client might be allowed to check back 3 times a minute, or each check might provide a suggested time for when to check back next (and refuse any checks earlier than that).
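Here is a rough sketch of that polling scheme, with the "suggested check-back time" variant of throttling. Everything specific in it is an assumption for illustration: the URL layout, the choice of 202 for in-progress and 429 for too-early checks, the `retry_after` field, and the 20-second interval (which corresponds to 3 checks a minute). Time is passed in explicitly so the logic can be followed without a real clock or a real HTTP server.

```python
class JobStatusResource:
    """Status resource for one job, with a suggested check-back interval."""

    def __init__(self, job_id, min_interval=20.0):
        self.job_id = job_id
        self.min_interval = min_interval  # seconds between allowed checks
        self.next_allowed = 0.0           # earliest time the next check is accepted
        self.result_url = None            # set once the job has finished

    def complete(self):
        # Called by the back end when the job is done.
        self.result_url = f"/jobs/{self.job_id}/result"

    def get(self, now):
        """Handle a GET at time `now` (seconds); returns (status code, body)."""
        if now < self.next_allowed:
            # Too early: refuse the check and say how long to wait.
            return 429, {"retry_after": self.next_allowed - now}
        self.next_allowed = now + self.min_interval
        if self.result_url is None:
            return 202, {"state": "in-progress", "retry_after": self.min_interval}
        return 200, {"state": "complete", "result": self.result_url}


def submit_job(job_id):
    """POST handler: accept the job and point the client at its status resource."""
    status = JobStatusResource(job_id)
    return 202, {"status_url": f"/jobs/{job_id}/status"}, status
```

A client that respects `retry_after` never sees the 429; one that hammers the status resource gets cheap refusals instead of tying up the job machinery.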

I actually implemented a polling-style service like this a few years ago. The client would submit a request to the service with a payload containing a unique ID. At a later time, the client was supposed to use that ID to construct a URL to look for the result. If no result appeared after a sufficiently long time, it was considered okay for the client to re-submit the request. As far as I know, the service is still in production.

Callback

My favorite method is the callback.

When you submit a job, you include in that job a URL to which the results should be POSTed. Your client can then consider its task of submitting the job as complete and go on doing other things. Sometime later, another part of your app, the “server” part, gets a request, which is the result of the job.

Event driven logic.

Very clean. And it works well if you control both sides of the network, as in, both sides reside within your data center. Not so good if you want to make a request from your data center to the external world, given firewall issues. (Of course, given that the callback is HTTP, it’s probably not as controversial as your average Ops manager might make you think.)
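The callback style can be sketched like this. To keep it runnable without a network, the HTTP POST is replaced by a plain callable passed in as `post`; the class names, the job fields, and the placeholder "work" (upper-casing the serial number) are all assumptions for illustration only.

```python
class JobService:
    """Accepts jobs and later POSTs each result to the job's callback URL."""

    def __init__(self, post):
        self.post = post    # callable(url, payload); stands in for an HTTP POST
        self.accepted = []  # jobs acknowledged but not yet worked on

    def submit(self, job):
        # Fire-and-forget: acknowledge receipt now, do the work later.
        self.accepted.append(job)
        return "accepted"

    def work(self):
        while self.accepted:
            job = self.accepted.pop(0)
            # Placeholder for the real work: upper-case the serial number.
            result = {"job_id": job["id"], "answer": job["serial"].upper()}
            self.post(job["callback_url"], result)


class ClientApp:
    """The client's "server" part: receives results at its callback URL."""

    def __init__(self):
        self.received = []

    def handle_post(self, url, payload):
        self.received.append((url, payload))
```

The submitting side is finished as soon as `submit` returns; the result shows up later as an ordinary incoming request in `handle_post`, which is where the event-driven logic lives.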

Thoughts

The really important part of this, though, is to allow for asynchronous behavior on top of a strategy that (in the pop culture that rules the tech world) is mostly conceived of as synchronous remote procedure calls.

If you submit a request without requiring an immediate response, that job can be shipped off to be handled by many internal (and invisible to you) services, any one of which may fault, or time out, or throw exceptions. With an asynchronous methodology, the immediate transaction, that of connecting to the job-submission service, is very simple. Either the job was submitted, or it was not.

It’s also straightforward to find out if the job succeeded or failed via the polling or callback methods, at which point you can re-submit the whole thing, or simply log it for later human intervention.

Large-grained simplicity over fine-grained complexity.

Tuesday, July 7, 2009 Sunday, July 5, 2009

erlang and jobs

People in my line of work (writing distributed, network applications) seem to be afraid of Erlang. Several reasons, I think:

  1. Developers
    Something new to learn, and thus they feel that they’re at a disadvantage, or won’t be able to contribute.

  2. Managers
    Worried that they won’t be able to find developers to work on the code base once the original authors leave, and fear that any new technology is just as bad as the last few “hacks” they graciously allowed to invade the architecture (except Java, which is somehow always the right thing).

The strange thing is, I see people all over the place who want to work with Erlang (or Python or Ruby or Scheme or Lisp or Smalltalk), not so much because Erlang is a cool language, but because it solves so many of the problems they have to deal with day in and day out. And it’s a cool language.

The whole thing strikes me as a destructive, self-fulfilling prophecy. We won’t invest in the small amount of time it takes to learn the language because we don’t see that there are potential employees out there to be hired if we need them. And there aren’t any potential employees because no one will let anyone use Erlang for production projects.

A Catch-22, to be sure.

What distresses me is that, deep down, such decisions aren’t rational. They’re based on fear. Fear of change, and fear that one’s employees might be necessary to one’s success, rather than being discardable, fungible assets. One could claim that choosing Erlang in itself is emotional, and I’d tend to agree, but it’s a positive emotion: pleasure at being able to solve problems more easily, pleasure learning something new, pleasure at opening up the possible range of solutions for any given problem. If one makes a decision on emotional grounds, these are the right emotions.

What also distresses me is that I don’t believe hiring managers know anything at all about who they need to hire, and the skills they need, and thus fall back on the notion that expertise in a set of platform tools means anything at all. (A man who can wield a hammer with the best of them may or may not be good at building cabinets.)

Right now, when I see job positions asking for candidates well-versed in Java, J2EE, Spring, Hibernate, Inversion-of-Control, and so on, I see a shop in which very little gets done over a long period of time, a process-bound, over-hierarchicalized, overly large team: the Mythical Man-Month incarnate. I see a recipe for failure.

Like the lumbering empires of old, the appearance of strength and stability hides a vast and empty core.

Comments (View)