a better mousetrap #5: a Couch potato’s take on transports

As pointed out in my last post, I am about to dive a little deeper into our prototypical, experimental usage of Apache CouchDB in an environment that is a bit off the typical “web application” use case, yet seems like a reasonable fit for a technology like CouchDB. Meet the “transport engine”.

First things first

To start with, the name “transport engine” is probably misleading and way too big for what it actually covers. Here’s the point: in our system, which basically is a large-scale document management facility, automatically providing external users with information is an essential part of the business. There are a bunch of ways to make the system send out such messages, and going into too much detail is definitely out of scope here. In the end, what is left in all cases is one of several data structures containing obvious information, such as

  • who should be notified,
  • what documents (files) the notification should contain,
  • what additional information should be provided along with the files.

Looking at this description with an open eye, you may already have spotted an interesting aspect: providing “additional information” with such a notification. Not too much of a surprise, this depends to quite some degree upon various aspects such as the way the notification should be transported (obviously, setting a subject line or a message text in an FTP transmission is pointless or not doable in an obvious way). The fact that different means of notification transport differ in some aspects led to an early design decision to keep some of these data structures, in a more or less object-oriented manner, in quite a few different places, and the initial understanding of how the transport of these notifications should be done reinforced this decision even more.

Ultimately, this ended up with certain cross-cutting concerns (think of queueing or polling notifications that need to be processed, dealing with transports that failed due to error situations, …) either being reimplemented in various subsystems in similar yet slightly different ways or not being addressed at all. For example, a “pre-transport” implementation of e-mail transport would quietly discard any SMTP-related errors and thus mark any message “successfully delivered”, even in cases in which the generated mails had already been rejected by the local MTA.
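
One way to picture the fix for the silently discarded errors is to let the outcome of the transport decide the job status. What follows is only a minimal Java sketch, not our actual implementation: the `Transport` interface and the `DeliveryOutcome` class are hypothetical, and the status names are borrowed from the status values that show up further down in this post.

```java
// Hypothetical sketch: derive the job status from the transport outcome
// instead of discarding errors. All names here are assumptions.
final class DeliveryOutcome {

    enum Status { NEW, SENT, FAIL }

    // Stand-in for any transport (SMTP, FTP, ...); send() may fail.
    interface Transport {
        void send() throws Exception;
    }

    // Returns SENT only if send() completed without throwing.
    static Status attempt(Transport transport) {
        try {
            transport.send();
            return Status.SENT;
        } catch (Exception e) {
            // A real system would log or store the error for later retries.
            return Status.FAIL;
        }
    }

    public static void main(String[] args) {
        System.out.println(attempt(() -> { }));
        System.out.println(attempt(() -> {
            throw new java.io.IOException("rejected by local MTA");
        }));
    }
}
```

The point of the sketch is merely that a rejected mail ends up as a failed job rather than as a “successfully delivered” one.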

Examining the Status Quo

Consequently, the notion of “transport jobs” was introduced to this system, with the idea of “unifying” all these things, establishing some kind of separation of concerns and having most of the processing and transport logic isolated in a well-encapsulated “transport” subsystem of its own. Transport jobs were and are modeled by subclassing an abstract TransportObject base class and, in their most bare-bones form, look akin to this:

{
 "id" : "some-sort-of-UUID",
 "targetUrl": "ftp://foo:bar@remoteSystem/inbound/fileStore/",
 "attachedFiles" : 
 {
    "foo.pdf": "document12131/fileStore/data/file.pdf",
    "bar.pdf": "document244123/fileStore/data/file.pdf",
    "baz.pdf": "document441243/fileStore/data/file.pdf"
 }
}

In these structures, the targetUrl is the most important piece of data: all of the transport subsystem logic uses it to apply various more or less configurable processing steps to the data included in the structure and, in the end, to have some “transport implementation” handle the actual data transfer. Along with this, there is a bunch of preprocessing logic to create these transport objects out of the existing legacy database structure, which can’t be refactored all too easily without dumping most of the production environment, and that is not an option at the moment.
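
To illustrate how the URL in the `targetUrl` member can drive the choice of a transport implementation, here is a minimal Java sketch under stated assumptions. The class `TransportDispatch` and the handler names are made up for illustration and are not the classes of our system; the sketch only shows dispatching on the URL scheme.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: choose a transport implementation by targetUrl scheme.
final class TransportDispatch {

    // Handler names are illustrative placeholders, not the real classes.
    private static final Map<String, String> HANDLERS = Map.of(
            "ftp", "FtpTransport",
            "mailto", "MailTransport",
            "http", "SoapTransport");

    static String handlerFor(String targetUrl) {
        String scheme = URI.create(targetUrl).getScheme();
        if (scheme == null) {
            throw new IllegalArgumentException("No scheme in " + targetUrl);
        }
        String handler = HANDLERS.get(scheme.toLowerCase(Locale.ROOT));
        if (handler == null) {
            throw new IllegalArgumentException("Unsupported scheme: " + scheme);
        }
        return handler;
    }

    public static void main(String[] args) {
        System.out.println(handlerFor("ftp://foo:bar@remoteSystem/inbound/fileStore/"));
        System.out.println(handlerFor("mailto:someguy@somwhere.outthere.net"));
    }
}
```

A real dispatcher would return handler objects rather than names, but the lookup by scheme would presumably look similar.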

Building a mess from a mess

What initially worked out pretty well and looked really clean, however, started to get dusty as time passed. Worst of all, the whole structure needed to be extended in virtually every way possible, which left us introducing new members to the object structures: some of them depending upon the transport URL and thus processed in the “transport” subsystem, some of them “just” depending upon the transport technology used and thus pretty friendly to use (like “usePassiveFtp” or “useBccForRecipients”). So the data structures grew more complex, as in a fully-fledged e-mail transport description…

{
 "id" : "some-sort-of-UUID",
 "targetUrl": "mailto:someguy@somwhere.outthere.net",
 "mailSubject" : "Document Transfer Notification 123-1231-1",
 "mailBccsTo" : [ "foo@localhost", "boss@outthere.net" ],
 "requestDSN" : "true",
 "messageTextParts" : [ "transport/some-sort-of-UUID/message.html", "transport/some-sort-of-UUID/message.txt" ],
 "attachedFiles" : 
 {
    "foo.pdf": "document12131/fileStore/data/file.pdf",
    "bar.pdf": "document244123/fileStore/data/file.pdf",
    "baz.pdf": "document441243/fileStore/data/file.pdf"
 }
}

… or the same for an external FTP server transmission, including transport of local attribute metadata (basically a set of key/value pairs – similar to JSON except for keys being limited to simple data types), transformation of “our” metadata to the recipient-side data schema and waiting for the receiving site to place a “received” status file in a defined output folder for each file transferred:

{
 "id" : "some-sort-of-UUID",
 "targetUrl": "ftp://foo:bar@somehost.net/transfer/inbound/",
 "ftpUsePassive" : "true",
 "ftpConfirmationFileFolder": "ftp://foo:bar@somehost.net/transfer/status/",
 "metadataTransformationRule" : "this-project-set",
 "attachedFiles" : 
 {
    "foo.pdf": "document12131/fileStore/data/file.pdf",
    "bar.pdf": "document244123/fileStore/data/file.pdf",
    "baz.pdf": "document441243/fileStore/data/file.pdf"
 }
}

Given SOAP, this could be even more complicated (even though this solution so far is not in productive use but mainly another internal prototype), including considerations such as uploading files outside the context of SOAP by posting to some HTTP URI, or invoking some remote method to trigger automatic document processing on the remote side:

{
 "id" : "some-sort-of-UUID",
 "targetUrl": "http://user:credential@some.remote.host/ServiceEndpoint.wsdl",
 "soapPostTransmissionMethod" : "importDocumentsToStore",
 "soapPostTransmissionParameters" : ["foo.pdf","bar.pdf","baz.pdf"],
 "metadataTransformationRule" : "ServiceEndpoint.xsd",
 "attachedFiles" : 
 {
    "foo.pdf": "document12131/fileStore/data/file.pdf",
    "bar.pdf": "document244123/fileStore/data/file.pdf",
    "baz.pdf": "document441243/fileStore/data/file.pdf"
 }
}

In the end, these structures live somewhere in between modestly complex (mainly dynamic) data objects and some sort of “configuration” or job language to describe tasks to be done automatically. Though not too difficult and just the way to go given the actual requirements, this proved to be painful mainly in two different ways:

  • Lacking a better approach, most of the code for dealing with these various specialties ended up in dedicated Java classes. This left us with a bunch of more or less deep inheritance trees in Java code, along with a bunch of semi-specialized handler class implementations working with particular subclasses (of subclasses)* of TransportObject, again slightly similar in some ways while particularly different in others, which turned out not to be a maintenance heaven either.
  • Similar problems arose from keeping instances of these classes temporarily persisted in an SQL database for processing. The options were to use O/R mapping to keep these objects stored in single or multiple table inheritance (ending up with a bunch of tables), to emulate a key/value storage structure on top of a relational database engine, or to just keep the old structures “as is” and recompute the transport object data each time it is accessed.

Though in most ways still better than the original implementation, the solution didn’t really please, and we came across the idea behind CouchDB in the midst of optimizing the second problem: keeping track of these data objects temporarily for processing purposes.

A new approach

It seemed to make sense immediately. Given the structure of our TransportObject implementations and the variety of member variables present or absent depending upon the very class (or even the actual instance of a given class), the schema-less nature of CouchDB seemed a way more natural fit than mapping these structures to any RDBMS table. In those days, we were playing with CouchDB just in a prototype system outside our production environment, but this provided a more real-life use case that was still safe enough to give it a try:
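
To make the “present or absent” point a bit more tangible, here is a small Java sketch of how a handler might read optional members from a schema-less job document held as a plain map. The class `SchemalessJob` and its helper methods are assumptions made up for this illustration; the member names are taken from the sample documents earlier in this post.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a transport job as a schema-less map, where members
// may be present or absent depending on the kind of transport.
final class SchemalessJob {

    // Reads a flag stored as the string "true"/"false"; absent means false.
    static boolean flag(Map<String, Object> job, String name) {
        return Boolean.parseBoolean(String.valueOf(job.getOrDefault(name, "false")));
    }

    // Returns a list-valued member, or an empty list if it is absent.
    @SuppressWarnings("unchecked")
    static List<String> strings(Map<String, Object> job, String name) {
        Object value = job.get(name);
        return value instanceof List ? (List<String>) value : List.of();
    }

    public static void main(String[] args) {
        Map<String, Object> ftpJob = Map.of(
                "id", "some-sort-of-UUID",
                "targetUrl", "ftp://foo:bar@somehost.net/transfer/inbound/",
                "ftpUsePassive", "true");
        System.out.println(flag(ftpJob, "ftpUsePassive"));
        System.out.println(strings(ftpJob, "mailBccsTo"));
    }
}
```

No table or subclass has to know about every possible member here; a handler simply asks for what it understands and falls back to a default otherwise.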

The data structures seemed just made for it, and still the application wasn’t so critical that all the data had to be kept safe and sound all the time at any price. We didn’t have to expect large amounts of binary data as in the core system of our production environment, and security requirements were also low, given that the system in question just has to be available internally and to the few systems that are part of the transport processing system. And, all along, we did and do have a playground for trying out many of the CouchDB “stock” features and seeing how they do in terms of data storage, handling and performance of the solution, application and infrastructure architecture. It’s a process of learning, with some insights.

Data mapping

Our initial attempts, pretty straightforward, involved jcouchdb for accessing the CouchDB instance from Java. Most of the data in this system is written solely by a server-side Java application, so this was the first thing to start with. This worked well. By now we live somewhere in between jcouchdb in some cases, HTTP client classes along with the svenson JSON library in others and, more and more, ektorp for dealing with more complex object structures. As pointed out earlier, the ability to easily access the data stored in CouchDB on various levels of abstraction really comes in handy here. And yet, most of our data structures living in CouchDB aren’t all that complicated, at least not as far as the transport engine is concerned, so in most situations we just use a very limited subset of the features provided by these frameworks.

Database handling

Again, our initial versions were quite straightforward, again exploiting the schema-less nature of CouchDB and simply dumping all the TransportObjects into a single database structure. In this setup, all data was written by the backend Java application and read by a small set of client applications making use of CouchDB views, designated “type” fields and certain status information (“NEW”, “SENT”, “FAIL”, …) to find the data they were supposed to handle.
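
As a rough idea of how such a client could ask a view for “its” documents, here is a Java sketch that only builds the query URI. It assumes a view that emits `[type, status]` as its key; the database name `transport`, the design document `jobs` and the view name `by_type_status` are hypothetical, and the sketch assumes plain type and status values without quotes or other characters that would need JSON escaping.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: build the URI for querying a CouchDB view keyed by
// [type, status]. Database, design document and view names are assumptions.
final class ViewQuery {

    static URI byTypeAndStatus(String baseUrl, String type, String status) {
        // CouchDB expects the key parameter as URL-encoded JSON.
        String key = "[\"" + type + "\",\"" + status + "\"]";
        String encoded = URLEncoder.encode(key, StandardCharsets.UTF_8);
        return URI.create(baseUrl
                + "/transport/_design/jobs/_view/by_type_status"
                + "?include_docs=true&key=" + encoded);
    }

    public static void main(String[] args) {
        System.out.println(byTypeAndStatus("http://localhost:5984", "mail", "NEW"));
    }
}
```

A mail client would then fetch this URI, process the returned documents and update their status.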

This has grown considerably easier since we moved different kinds of objects to databases of their own and adopted the change notification concept to make client applications listen to changes in “their” structures, in some way resembling “topics” in Java Message Service. By now, this has made us almost completely give up on views for the core transport logic and just use them in some places for querying information in the database structures manually.
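
Listening to changes in one database can be sketched in Java roughly as follows, using the continuous mode of the CouchDB `_changes` feed and the HTTP client that ships with Java 11 and later. This is an illustration under assumptions, not our client code: the class `ChangesListener`, the database name `mail-transport` and the heartbeat interval are made up, and each change is handed over as a raw JSON line without parsing.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.function.Consumer;
import java.util.stream.Stream;

// Hypothetical sketch: follow the continuous changes feed of one database.
final class ChangesListener {

    static URI changesUri(String baseUrl, String database, String since) {
        return URI.create(baseUrl + "/" + database
                + "/_changes?feed=continuous&include_docs=true"
                + "&heartbeat=10000&since=" + since);
    }

    // Blocks while the feed is open; each non-empty line is one JSON change.
    static void listen(URI feed, Consumer<String> onChange)
            throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(feed).GET().build();
        HttpResponse<Stream<String>> response =
                client.send(request, HttpResponse.BodyHandlers.ofLines());
        try (Stream<String> lines = response.body()) {
            // Empty lines are heartbeats that keep the connection alive.
            lines.filter(line -> !line.isEmpty()).forEach(onChange);
        }
    }

    public static void main(String[] args) throws Exception {
        // The database name is an assumption for illustration.
        URI feed = changesUri("http://localhost:5984", "mail-transport", "now");
        System.out.println(feed);
        if (args.length > 0 && args[0].equals("--listen")) {
            listen(feed, System.out::println); // needs a reachable CouchDB
        }
    }
}
```

With one database per kind of object, each client only ever sees changes it is responsible for, which is what makes the comparison to JMS topics fit.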

There is another advantage of this structure, however, which we first started exploiting in our “fax” transport database. This, by the way, is our only automated “inbound transport” so far and the only CouchDB use case in which data is not written by the backend Java application but by the local fax service, which runs on a GNU/Linux machine and is implemented mostly in Perl and Python. In this situation, it’s not just “metadata” (fax sender identifier, fax recipient identifier, date/time) but also the received document itself in various representations (SFF, TIFF, JPG images, thumbnails) that is stored in CouchDB, and, with a size of around 2GB and close to 20,000 documents, this is also our largest CouchDB use case.

As we run the fax server as a “standalone” installation (also in terms of disk storage) yet wanted to keep the information in this database stored somewhere in our storage backend environment, this is the first situation in which we used continuous replication to copy the fax documents from the fax server to the “backup database”, which works reliably and, compared to most of the SQL databases we have worked with so far, with astoundingly little ado. We are considering doing the same for the transport structure, as this situation in some cases begs for a solution just like that: why not have all the mail transport functionality kept on a dedicated mailer host and just have a CouchDB replication target running there to which all the relevant documents are replicated? I am currently evaluating this, as it seems a logical next step.
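
For the curious, triggering continuous replication boils down to one POST against the `_replicate` endpoint of a CouchDB server. The following Java sketch only builds that request and does not send it; the host names `faxserver` and `storage-backend` and the database names `fax` and `fax-backup` are placeholders I made up, not our actual setup.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical sketch: request continuous replication from one CouchDB
// database to another via the _replicate endpoint.
final class ReplicationRequest {

    // Assumes source and target contain no characters that need JSON escaping.
    static String body(String source, String target) {
        return "{\"source\":\"" + source + "\","
                + "\"target\":\"" + target + "\","
                + "\"continuous\":true}";
    }

    static HttpRequest build(String serverUrl, String source, String target) {
        return HttpRequest.newBuilder(URI.create(serverUrl + "/_replicate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(source, target)))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = build("http://faxserver:5984",
                "fax", "http://storage-backend:5984/fax-backup");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(body("fax", "http://storage-backend:5984/fax-backup"));
    }
}
```

Sending the request with an HTTP client is all it takes to start the replication; whether credentials are needed depends on how the servers are configured.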

Where not (yet) to go next

So far, this is the use case of our transport system, and in there, CouchDB did and does pretty well. We actually even thought about exposing the CouchDB transport structure, or at least relevant parts of it, to the “outer world” to allow people to get whichever data they need. Compared to the approach outlined above, this would of course require adding the actual document files to the CouchDB documents; by now, they are just kept inline as references, assuming each system attached to the structure knows well where to find these files. The major showstopper here, in my opinion, is the lack of access control to resources that is fine-grained enough to keep people or clients away from data they aren’t supposed to see or edit. Despite being a royal PITA as far as (container-independent) configuration and deployment is concerned, the JAX-RS security annotations found in the Java EE REST world really are a desirable feature here, yet I am unsure whether one wants to wrap a Java authentication REST proxy around a CouchDB infrastructure. So at the moment CouchDB will be available just internally to the machines allowed to access it, and so far this does pretty well.

Even if it works rather well in the fax environment, another thing we will probably keep ourselves from doing on top of CouchDB is attaching too much binary content to the documents. Though the approach of “representations” and “attachments” is really neat and in some ways definitely reflects our understanding of documents and files, it is not really feasible at the moment because of the sheer data volume in our system. So far I hesitate to store files up to the size of DVD images in a CouchDB database. Likewise, the current way of attachment handling in CouchDB is painful if you run a large storage system (like, in our case, a NetApp filer) where most of the binary data is supposed to live. This is where one would really like to see a concept akin to the external data stores found in JCR / Apache Jackrabbit, which provide a way of having large binary attachments (or, in general, large data sets) stored in files outside the “repository”, invisible to the developer working on top of this platform.

Likewise, then and now, we are unsure whether CouchDB might completely replace the RDBMS in our structure as far as indexing and searching data is concerned. Right now, this is also a challenge in our SQL backend, as document retrieval and indexing work on top of complex and extensible metadata structures and in all cases have to deal with rather large document sets. But maybe, then again, this ain’t important at all. As we have learnt so far, and to point this out once again, it’s a good idea to pick a “smart” tool for a given job and to keep the overall structure open and flexible enough to actually allow for working with a bunch of different “smart tools”. And if this is just the lesson to be learnt here, it’s definitely worthwhile. 😉
