hross: June 2007 Archives

Practice What You Preach


Here at BEA we tend to spit out a lot of marketing literature about SOA, which makes sense, since that is one of our main product focuses. The truth is (sorry, marketing folks), I don't read most of that literature. I am sure there is a place for it, but if I want to know what a product stack is really about, I need to see it in action (and not just in a glitzy sales demo).

Normally, I don't take a non-technical focus with this blog, since there are plenty of other people (who are smarter than me) who can give you excellent content about product management, enterprise development planning, or any number of other 'soft' skills. I did, however, get a chance to play with an application we call Ensemble at a recent all-hands meeting, and I wanted to share my impressions...

Create Your Own Portal

The simple fact is, the 'mash up' capability of the product is what gets me excited the most. We already embed a tagging framework in the ALUI portal, but if it's extended to provide any piece of an application in a standard HTML document, the buzzword mashup might actually mean something.

Watch out... we're actually using SOA.

The reason this post is titled Practice What You Preach is that we're actually using SOA in our own products. Ensemble is really a collection of enterprise services (a configuration service, a security service, a content proxy, an admin service), rather than one giant application. This lets you better manage each piece of the product and, I have a hunch, will eventually build to an overall integration across the product stack, plus a set of web-based APIs you can use for customization.

Perimeter Security: It's cool, but why is it the first thing mentioned?

Some of our literature positions Ensemble as a 'perimeter security' product, which is true to a certain extent, but, in my opinion, off the mark when it comes to the true power of the application. Getting excited about SSO is like getting excited about dishwashing detergent. There are plenty of varieties out there and most of them work just fine.

What's the endgame?

If it were up to me, the tagging and security framework would eventually replace the ALUI portal entirely. The ALUI portal you see now would become a set of standard widgets and HTML pages with custom markup that are fully modifiable. I'm not sure if this really is the plan, but if I were thinking about the future of the portal and the enterprise market, I'd be considering it.

Just to show you I haven't been replaced by someone from our marketing department, I'll leave you with some lingering doubts:

  • Many enterprise applications are still custom built, and are certainly not expecting to end up as embedded tags in someone else's HTML artwork. There will be plenty of headaches when fantasy meets reality.
  • Will the performance of a SOA-based product stack up to a homogeneous environment? If there are HTTP calls on the back end for security plus every piece of a page, what guarantee do we have that our rendering times are going to be acceptable to our front end users?
  • How will this all fit together without replicating functionality other vendors are already providing (or even doing better than us)? One differentiation we have had from the market in general is our less technical approach to the portal. Sure, you can customize, but you can also run it out of the box. The last thing we want is to lose our business users by letting programmers run the show.
  • Will SOA be more of a headache than a problem solver? How many services do we need to make the product stack flexible without making it overly complicated?

Anyway, the Ensemble product team deserves a lot of kudos for their efforts thus far. We're only at 1.0 and I'm already excited. (I promise to write something obscure and technical in my next post.)

Those of you still with me have successfully navigated my series on the document repository with only one last and mysterious segment to go: Utilizing the Repository. Why the mysterious title? I was hoping not to explicitly give away the fun part of this series: writing unsupported Java code.

For a preview of things to come, take a look at the output of my sample application:

******************************************
A simple document repository test application.
  by H. Ross Brodbeck, BEA Systems Inc


No warranty is provided with this application. It is not even
guaranteed to function. I am sure there are bugs. Use it at your own risk.

******************************************

A test of DR encryption:
Password is: password
Encrypted password is: RDQyMzUxNjYwRDQ0OTI4QQ==
Decrypted password is: password

Testing document upload: (uploading C:/sandbox/drtest/dummy_text.txt)
Binding to service at - http://localhost:8020/dr
Uploaded ID is: F0321F33/D027E5FA.ACT

Archiving document...
Archived ID is: F01768DC/D02B4333.ARK

Deleting archive...Success!

Deleting document...Success!

Document repository test completed. Exiting...

If you've been following part 1 and part 2, then you already know, theoretically, what this application is doing. Perhaps if you're the adventurous type you already know how to write this application. If not, that's what the final segment of this series is for.

In this post I'll take you on a tour of a toy Java application which creates its own document repository and exposes the upload/download/archive capabilities of the repository. Essentially, this is what every embedded application server that uses the document repository does.

Setup

To get started, you're going to need to either edit or create the following files on your client (I put them all in the same directory):

  • dr.xml - I copied the one from the content upload service ($PT_HOME/ptupload/6.1/settings/config/dr.xml) and modified it, per the instructions below.
  • dr.jar - the meat and potatoes of the DR remote client
  • hessian-3.0.12.jar (or whichever version ships with your repository) - This is needed for the file transfer part of the application
  • a test file to upload (could be anything -- be creative)

You can find a copy of the jar files in a number of war files (use a zip file editor or the jar utility to extract them), including the knowledge directory upload and document repository war files ($PT_HOME/ptupload/6.1/webapp/). They will need to be added to the classpath of the sample application.

The repository itself must be made aware of our application, unless we are planning on using an existing repository. Since this is a new entity, I need to configure a new repository. I'm going to call my application PTHack (due to the hacked up nature of the code). As I explained in part 1, I'm going to set up an application and provider element in settings/config/dr-server.xml. Under the applications node:

<application>
  <name>pthack</name>
  <enabled>true</enabled>
  <password>RDQyMzUxNjYwRDQ0OTI4QQ==</password>
  <provider>filesystem</provider>
</application>

Under the file system provider:

<application>
  <name>pthack</name>
  <paths>
    <active>C:\sandbox\bea/ptdr/documents/PTHack/Active</active>
    <archived>C:\sandbox\bea/ptdr/documents/PTHack/Archived</archived>
  </paths>
</application>

Finally, I need a dr.xml for my client application. For the purposes of this application I modified the one from PTUpload (see above for location). I could have changed the password for the application if I had edited the <application> node above and the relevant segment of dr.xml. The sample code I provided has examples of the DR's encrypt/decrypt methods, in case you want to change the password. I'm lazy, so I kept the default. Here is what my dr.xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<dr>
  <application>
    <name>pthack</name>
    <password>RDQyMzUxNjYwRDQ0OTI4QQ==</password>
  </application>
  <transport>glue</transport>
  <transports>
    <transport>
      <name>local</name>
      <factory>com.plumtree.dr.transport.local.client.LocalClientFactory</factory>
    </transport>
    <transport>
      <name>glue</name>
      <factory>com.plumtree.dr.transport.glue.client.GlueClientFactory</factory>
      <startup>
        <url>http://localhost:8020/dr</url>
        <home>@ELECTRIC_INSTALL_PATH@</home>
      </startup>
    </transport>
  </transports>
</dr>

The Sample Application

The sample application I've provided below is fairly straightforward. It contains methods for creating a remote repository object, as well as methods for uploading and downloading files from that object. It wouldn't take much imagination to adapt this into a back end repository for whatever application you might decide to write. A few highlights:

  • The main method has an example of using the encrypt/decrypt methods in the repository. I didn't bother deciphering the encryption scheme.
  • You'll notice the need to specify an application name, password, and configuration directory twice: once in your config file (dr.xml) and once in the application itself (see the static strings at the top of the attached source). This is kind of... annoying. If you look at the content upload service, you'll see these same properties specified twice as well: once in dr.xml and once in the application.conf file in the same directory (settings/config). I suppose at some deep level before transport this is unnecessary, but I didn't bother going far enough to find it.
  • The rest of the methods use simple method calls and a stream for uploading/downloading the file. That's pretty much it.
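One thing about the stored password can be checked without any DR classes at all: the value in dr.xml and in the sample output is plain Base64 on the outside. The sketch below only strips that outer layer. It assumes a Java 8 or later runtime (java.util.Base64 did not exist when this post was written), and the class name PasswordPeek is made up for illustration. Whatever cipher sits underneath is not handled here and remains unknown.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

/** Hypothetical helper; it does not use or imitate the DR's own encrypt/decrypt methods. */
class PasswordPeek {

    /** Strips only the outer Base64 layer of a stored password value. */
    static String decodeOuterLayer(String stored) {
        byte[] raw = Base64.getDecoder().decode(stored);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        String stored = "RDQyMzUxNjYwRDQ0OTI4QQ==";   // value from dr.xml above
        String inner = decodeOuterLayer(stored);
        System.out.println(inner);                    // prints D42351660D44928A
        System.out.println(inner.length() / 2);       // prints 8 (bytes, if read as hex)
    }
}
```

The inner value is 16 hex characters, so 8 bytes, which happens to be the same length as the plain text "password". That says nothing definite about the algorithm, so treat it as an observation rather than a finding.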

None of this is supported, of course, but that doesn't mean it isn't a fun or interesting exercise. Click here for the full source to the application (RepositoryDemo.java).

Welcome back to the second installment of my blog series on Deconstructing the Document Repository. In the first installment, I tried to give you a foundation in the fundamentals of the repository. This second installment is intended to give you a better perspective on the operation of the repository by deconstructing a real world example: the content upload component of the ALUI knowledge directory.

Rather than deconstruct the actual functionality of the KD, I'm going to focus more on the upload component's 'bridge' role as it relates to the KD. By the end of this post you should have a good idea of what the DR is used for, and exactly how it is used in this capacity.

Content Upload

The content upload component of the ALUI portal product allows us to directly upload files into the Knowledge Directory. If you are unfamiliar with the KD and its basic operation, I'll sum it up in one sentence: a directory of links to documents and other web sites that is searchable and organizable based on a structure similar to your file system. If you're still in the dark, check out the Administrator Guide to the portal for details.

The upload component itself is nothing more than a way to provide document upload and download capability directly in the KD. Most of the time, the KD is used to index other sources of content, such as other web sites or document stores (intranet sites, content management systems, etc). In the case of content upload, the goal is to allow users to directly upload files to the KD, apply security, and then directly download them, without a back end repository (other than the DR).

Knowledge Directory Cards

The KD operates using a 'card' indexing system, sort of like a library card catalog (or, at least, that's how it was explained to me). The directory is merely a folder system of index cards, organized according to a taxonomy. The cards themselves are pointers to an HTTP-based content source. Requests for the document are redirected through the portal, and the data is served by the underlying provider (a content source, in portal parlance) rather than by the directory itself.

How does the Document Repository fit in?

How indeed? The problems with integrating the document repository into this system are manifold:

  • First, the document repository has no central directory of documents like a traditional content management system. In other words, it doesn't know how many documents it contains, or where those documents are.
  • Second, even if the DR knew where those documents were located, it does not know the content type or names of the documents.

Thus the creation of the content upload component of the portal, which serves to alleviate both of these problems. The content upload service is really just another web service that wraps the document repository and provides an HTTP link for downloading documents, as well as a way to translate between the document in the repository and the type of document the user is downloading.

How does it work?

The content upload service is actually spread across multiple pieces of the portal:

  1. The card creation and document upload interface (the part of the KD where you actually choose the document you want to submit)
  2. The web service that allows for single URL download of a document from the repository
  3. The PTUpload application folder in the document repository

The actual process works something like this:

  1. User navigates to knowledge directory, chooses to upload a document directly
  2. User is redirected to the Directory (Dir) activity space
  3. Dir activity space posts multi-part form data (the uploaded document) to the upload data source, via a gatewayed URL
  4. The document is read into the upload component and redirected to a store in the document repository
  5. Upon upload completion, the DR responds with a document ID and a 302 redirect back to the portal (see part 1 for more info on document IDs)
  6. The document ID is forwarded back to the portal via the querystring in the redirect URL
  7. The Dir activity space uses the ID in the URL, along with the document content type and name, to create a card in the knowledge directory
  8. Subsequent requests for the document via the card are responded to by the upload component, which grabs the document from the repository, sets the content-type in the response and sends it back to the user (again via a gatewayed URL)

All of the information that relates to the individual document (where it is in the repository, what its original name is, and what its content type is) is stored in the Knowledge Directory card for the document. That means the upload component really is just a middle man, with no underlying database or knowledge store. Cool, huh?

You can find most of the source code for this entire process in the UI source code distributed for portal developers. The functions that create KD cards are located in com.plumtree.portalpages.browsing.directory.DirModel. Specifically, check out the SubmitCardWithPropertyBag function. Source code for the document submission user interface can be found, among other places, in the com.plumtree.portalpages.browsing.directory.documentsubmitsimple package.

KD Card Properties

As I mentioned above, the actual information about the document is stored in various properties of a knowledge directory card. If you click the Details link for an uploaded document, you can view these properties with the insight I've provided below:

  • Open Document URL - URL the knowledge directory will use to open the document
  • Document Upload Repository Server - the application name in the document repository (this is always ptupload)
  • Document Upload DocID - the file ID, including directory, in the document repository
  • URL (Customized Document Property) - an encoded value which specifies the following:
    • Document ID in the repository (including folder ID)
    • Content Type of uploaded document (this is important so that the user's browser correctly recognizes downloaded documents)
    • Original file name of the document

The URL property can be decoded using the below (simple) function, which references the class com.plumtree.portalupload.common.Utilities (this is in documentupload.jar in the ptupload.war file installed with the upload component):

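// Needs java.util.StringTokenizer and com.plumtree.portalupload.common.Utilities.
// Returns { document ID, content type, file name }, or null if the value cannot be parsed.
public static final String[] parseDocumentProperty(String propertyValue) {
    if (!propertyValue.startsWith("download")) return null;
    StringTokenizer tokenizer = new StringTokenizer(propertyValue, "/");
    tokenizer.nextToken(); // skip the leading "download" token
    try
    {
        String id = Utilities.simpleDecode(tokenizer.nextToken());
        // the content type itself contains a slash, so it spans two tokens
        String contentType = tokenizer.nextToken() + "/" + tokenizer.nextToken();
        String fileName = Utilities.simpleDecode(tokenizer.nextToken());

        return new String[] { id, contentType, fileName };
    }
    catch(Exception e)
    {
        return null;
    }
}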
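If you don't have documentupload.jar handy, here is a self-contained sketch of the same idea that runs on a stock JDK. Everything in it is an assumption for illustration: the class name CardUrlSketch, the build method, and above all the use of URL encoding as a stand-in for Utilities.simpleDecode (I don't know what encoding the real method uses). The layout download/<id>/<type>/<subtype>/<file name> is inferred from the parsing function above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.util.StringTokenizer;

/** Illustration only; URL encoding stands in for the portal's own simple encoding. */
class CardUrlSketch {

    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Builds a property value; the ID is encoded so its slash does not split the tokens. */
    static String build(String id, String contentType, String fileName) {
        return "download/" + encode(id) + "/" + contentType + "/" + encode(fileName);
    }

    /** Returns { id, contentType, fileName }, or null if the value does not fit the layout. */
    static String[] parse(String value) {
        if (!value.startsWith("download")) return null;
        StringTokenizer tokenizer = new StringTokenizer(value, "/");
        if (tokenizer.countTokens() != 5) return null;
        tokenizer.nextToken(); // skip "download"
        String id = decode(tokenizer.nextToken());
        String contentType = tokenizer.nextToken() + "/" + tokenizer.nextToken();
        String fileName = decode(tokenizer.nextToken());
        return new String[] { id, contentType, fileName };
    }

    public static void main(String[] args) {
        String value = build("F0321F33/D027E5FA.ACT", "text/plain", "dummy_text.txt");
        System.out.println(value);
        System.out.println(String.join(" | ", parse(value)));
    }
}
```

The point of the round trip is simply that the card property can carry everything the upload component needs, which is why the component can get away without a database of its own.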

Document Repository Configuration

The document repository itself is configured in much the same way as discussed in part 1 of this series. That is to say, there is an application defined for the upload service called PTUpload, which in turn contains an Active/Archived setup. The content upload service configuration contains the requisite dr.xml under settings/config.

PTUpload's Dirty Little Secret

There is one problem with this whole setup, which I have alluded to with some of my comments on repository functionality. That is, there is no way for the repository to know when a document is deleted from its client application. The application itself has to tell it. It may be possible that I've missed something with the weekly housekeeping agent, but it has been my experience that documents which are uploaded directly via the content upload service are not removed from the repository when they are deleted. That being the case, there are two possible solutions to this problem:

  1. An application could be developed which would search the repository for all uploaded document IDs, then compare them to the current file/folder structure in the PTUpload directory of the repository.
  2. A Model/View override for the Directory space could be developed which would signal the repository that a card has been deleted.

AFAIK, neither has been developed at the time of this writing, but neither would take an undue amount of effort if it were determined that this was a problem.
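To give a feel for the first option, here is a rough sketch of the comparison step. It is not an existing tool. The class name OrphanFinderSketch is invented, and it assumes you already have the set of document IDs still referenced by KD cards (getting those out of the portal is the part I'm waving my hands at). It also assumes the folder/file layout from part 1, with document folders starting with F.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Stream;

/** Hypothetical clean-up helper: reports stored documents that no card references. */
class OrphanFinderSketch {

    /** Lists IDs such as F0321F33/D027E5FA.ACT found two levels below the given root. */
    static Set<String> listStoredIds(Path root) {
        Set<String> ids = new TreeSet<>();
        try (Stream<Path> paths = Files.walk(root, 2)) {
            paths.filter(Files::isRegularFile)
                 // skip files directly under the root, such as nextfolder.cfg
                 .filter(p -> !root.equals(p.getParent()))
                 // skip the Temp directory and anything else that is not a document folder
                 .filter(p -> p.getParent().getFileName().toString().startsWith("F"))
                 .forEach(p -> ids.add(p.getParent().getFileName() + "/" + p.getFileName()));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return ids;
    }

    /** Stored IDs minus referenced IDs: the candidates for deletion. */
    static Set<String> findOrphans(Set<String> stored, Set<String> referenced) {
        Set<String> orphans = new TreeSet<>(stored);
        orphans.removeAll(referenced);
        return orphans;
    }

    public static void main(String[] args) {
        // usage: java OrphanFinderSketch <path to PTUpload/Active> <referenced id>...
        Set<String> stored = listStoredIds(Paths.get(args[0]));
        Set<String> referenced = new TreeSet<>(Arrays.asList(args).subList(1, args.length));
        findOrphans(stored, referenced).forEach(System.out::println);
    }
}
```

The sketch only reports candidates. Actually removing them should go through the repository's own delete call rather than the file system, for the same reason you wouldn't hand-edit any other application's data store.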

Summary

Hopefully, this has been another enlightening installment on my blog. Comments or questions welcome; as always I make no guarantees as to the accuracy of this information in the past, present or future (but I'm pretty sure it's right). Next time, the mysteriously named third portion of this series: Part 3 - Utilizing the Repository.

If you've been paying attention to my blog for its short and relatively sparse lifetime, then you probably already know that one of my hobbies is taking apart seemingly meaningless and esoteric pieces of the portal. Let's face it, there's nothing more satisfying than thinking 'I wonder how this works', then many painful hours later thinking 'now I know'. Of course, it's much more satisfying when you can write about it on dev2dev and bask in the praise and adoration of your colleagues (hopefully sarcasm translates well in the blogosphere). Which, inevitably, leads us on today's journey into... the Document Repository.

(feel free to break out your 2001 soundtrack at any point during this post)

What is the Document Repository?

The Document Repository, or DR as I will refer to it hereafter, is exactly what it sounds like: a place to store and retrieve files. That's it.  Sure, there are a few extra features like archiving, passwords, and the fact that it's available via a single port, but those are just the details of the "get a file"/"give me a file" mechanism (all of which will be shared below).

The reason you may care about the operation of the document repository has to do with its core role in nearly all of the ALUI embedded application servers (Publisher, Collaboration, direct Knowledge Directory upload, and I think BPM, but don't quote me on that), and the various implications it may have on system management.

How does it work?

At its heart, the repository itself is nothing more than a clever, sometimes complicated, way of exposing a subset of Java interfaces via a simple web-based binary protocol (Hessian). Once these interfaces have been exposed and are available to both a client (a program using the DR as its file store), and a server (the DR itself), it is a trivial matter for the client to store and retrieve files from a central repository.

Hessian is an open source binary web service protocol for exposing POJOs (plain old Java objects) and other pieces of system infrastructure. See the Hessian project documentation for more information, including sample source code. In the case of the DR, the web services used to expose DR objects are actually hosted on their own port (default 8020) using the ALUI embedded application server (Tomcat). For details on the embedded application server, see my previous post.
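The "shared interface" idea is easier to see in code. The sketch below is not Hessian and not the DR API: it uses the JDK's java.lang.reflect.Proxy to show the general pattern of a client calling an interface while a handler forwards each call somewhere else. The FileStore interface, its method names, and the in-memory server are all invented for illustration. Where the comment says so, a real remoting library would serialize the call and send it over HTTP instead of invoking the target directly.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

/** Illustration of the proxy pattern behind interface-based remoting. */
class RemoteInterfaceSketch {

    /** Hypothetical interface known to both client and server. */
    interface FileStore {
        String put(byte[] content);
        byte[] get(String id);
    }

    /** "Server" side: a trivial in-memory implementation. */
    static class InMemoryStore implements FileStore {
        private final Map<String, byte[]> files = new HashMap<>();
        private int counter = 0;

        public String put(byte[] content) {
            String id = "doc-" + (++counter);
            files.put(id, content.clone());
            return id;
        }

        public byte[] get(String id) {
            byte[] content = files.get(id);
            return content == null ? null : content.clone();
        }
    }

    /** "Client" side: returns a proxy that forwards every interface call to the target. */
    static FileStore connect(final FileStore remote) {
        InvocationHandler handler = new InvocationHandler() {
            public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
                // A remoting library would serialize the method name and arguments,
                // POST them to the server, and deserialize the reply at this point.
                return method.invoke(remote, args);
            }
        };
        return (FileStore) Proxy.newProxyInstance(
                FileStore.class.getClassLoader(),
                new Class<?>[] { FileStore.class },
                handler);
    }

    public static void main(String[] args) {
        FileStore store = connect(new InMemoryStore());
        String id = store.put("hello".getBytes());
        System.out.println(id + " -> " + new String(store.get(id)));
    }
}
```

The client code never sees anything but the interface, which is the property that makes the DR client so thin.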

Since nothing else is hosted out of the DR container, some of the coolness of using a web-based binary protocol is lost in the shuffle. On the other hand, it allows your friendly ALUI developers to provide other services in the Tomcat container, like a diagnostics URL, which can be found at: http://<dr-host>:<dr-port>/drdiagnostics.

Repository Structure

First and foremost, let's take a tour of the guts of the repository: its organization. Below is a diagram of the repository's functional structure (what happens when a document is submitted):

[Figure: repository_structure]

Looking at the diagram above, I'd like to point out a few significant features of the repository: its ability to split out content on a per-application basis, and the password based authentication provided out of the box. When an application actually connects to the DR to send a document, it provides both an application name and an application password which are configured within the repository (see the configuration section of this post for elaboration). Once the correct application 'bucket' is determined, the document itself is stored using the file system structure described by the diagram below (the first two levels are directories):

[Figure: individual_repository_structure]

The repository first determines whether or not the document is active or being archived (simply a difference in web service call -- most documents are active). Next, the repository stores the document in a pseudo-random folder with a pseudo-random ID. Documents are not encrypted or compressed, so if you know the document type and ID you can actually go to the repository, rename the document on the file system and open it.

The naming convention for documents and folders is 'pseudo-random', meaning there is a structure to the file system, but the actual names are nearly impossible to predict. Folder names always start with F, followed by a hex string representation of a pseudo-random number mixed with a serial generation scheme (the number comes from the java.util.Random class, initialized with the startup time of the repository in milliseconds). The file names are generated in much the same fashion, with a bit more structure.

The important thing to know about the file structure is that every time the DR is restarted, newly uploaded files will always go into a new folder in the repository for each application. The name of that folder can be found in nextfolder.cfg in the relevant Active or Archived folder. In addition, no folder will ever have more than 256 files in it. After 256 files have been generated, the repository will automatically move to the folder specified in nextfolder.cfg.
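As a toy model of the behavior described above (and emphatically not the real code from com.plumtree.dr.provider.filesystem), the sketch below seeds java.util.Random with a startup time, hands out IDs of the form folder/file, and moves to a new folder after 256 files. The class name, the "letter plus 7 hex digits" format, and the .ACT/.ARK suffixes are assumptions read off the sample IDs in this series. The real generator also mixes in a serial scheme and records the next folder in nextfolder.cfg, neither of which is modeled here.

```java
import java.util.Random;

/** Toy model only; the real naming logic in dr.jar is more involved. */
class FolderRolloverSketch {
    static final int MAX_FILES_PER_FOLDER = 256;

    private final Random random;
    private final String suffix;       // ".ACT" or ".ARK", inferred from sample IDs
    private String currentFolder;
    private int filesInFolder = 0;

    FolderRolloverSketch(long startupMillis, String suffix) {
        this.random = new Random(startupMillis);  // seeded with the startup time
        this.suffix = suffix;
        this.currentFolder = randomName('F');     // every restart begins a new folder
    }

    /** Prefix letter plus 7 upper-case hex digits, e.g. F0321F33 (format assumed). */
    private String randomName(char prefix) {
        return prefix + String.format("%07X", random.nextInt(1 << 28));
    }

    /** Returns an ID like F0321F33/D027E5FA.ACT and rolls over after 256 files. */
    String nextDocumentId() {
        if (filesInFolder == MAX_FILES_PER_FOLDER) {
            currentFolder = randomName('F');
            filesInFolder = 0;
        }
        filesInFolder++;
        return currentFolder + "/" + randomName('D') + suffix;
    }

    public static void main(String[] args) {
        FolderRolloverSketch active =
                new FolderRolloverSketch(System.currentTimeMillis(), ".ACT");
        for (int i = 0; i < 3; i++) {
            System.out.println(active.nextDocumentId());
        }
    }
}
```

Because the seed is the startup time, two runs started at different moments produce different folder names, which matches the "new folder after every restart" behavior even in this simplified form.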

One nice thing about the repository is that it uses a temporary file structure (the Temp directory under Active/Archived) when actually uploading and downloading files, meaning read-only backups and virus scans should not disturb the normal operation of the repository.

For more details on the folder/file name generation convention, see the package com.plumtree.dr.provider.filesystem in the dr.jar. The most relevant classes include Suffixes, Names, and NodeIds.

Repository Configuration

Finally, we can examine the repository configuration files to determine how the structure described above is actually built. Most of the configuration files are either self explanatory, or relate to the embedded application server. In fact, the only 'important' configuration file is:

 $DR_HOME/settings/config/dr-server.xml

There are two main elements we are concerned about in this file:

  • applications - This element allows us to configure which applications are allowed to connect to the document repository and what password will be used for the connection. This portion of the server configuration must be matched by a dr.xml file provided with any client that uses the repository. The password and application name fields must match on both client and server in order for a successful connection to be made (see settings/config/dr.xml on any embedded application server). One interesting exception to this is that the password field can be provided as an unencrypted string in dr.xml on the client (see the difference by looking at dr.xml for collaboration server and comparing it to the one for the KD upload component).
  • providers - The providers element itself is unimportant, as there is only one provider (the file system). The more important element is the applications element below it, which allows us to specify file system locations for archived and active content on a per application basis.
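Since a mismatch between the two files is an easy mistake to make, here is a small sanity check you could run before starting a client. It is only a sketch: the class name DrConfigCheck is invented, it uses the JDK's XPath support rather than anything from dr.jar, and it only assumes that both files contain application elements with name and password children, as in the fragments shown in this series. The root element names used in any test data are guesses.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

/** Hypothetical pre-flight check comparing client and server DR configuration. */
class DrConfigCheck {

    /** Returns the password of the named application, or "" if there is none. */
    static String passwordFor(String xml, String appName) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // appName is assumed to be a trusted value without quote characters
        String expr = "//application[name='" + appName + "']/password";
        try {
            return xpath.evaluate(expr, new InputSource(new StringReader(xml))).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Could not read configuration", e);
        }
    }

    /** True when both files define the application with the same password string. */
    static boolean credentialsMatch(String serverXml, String clientXml, String appName) {
        String server = passwordFor(serverXml, appName);
        String client = passwordFor(clientXml, appName);
        return !server.isEmpty() && server.equals(client);
    }
}
```

Note that this compares the strings literally. A client that supplies its password unencrypted, as mentioned above, would be reported as a mismatch even though the connection may work.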

Interestingly, the providers element seems to indicate there could be more than one way of storing documents using the document repository, though only one seems to have been implemented at the time of this writing. It might be a worthwhile exercise to develop a database based, or even more distributed, version of a provider.

It also seems, from looking at the configuration, that document and folder IDs could be more tightly controlled by specifying a node ID element in the configuration. This is purely conjecture, as I haven't seen much evidence to support this as part of my spelunking into the repository while researching this post.

Summary

One final technical caveat before I throw out the 'wrap up paragraph': you may have noticed that during this whole discussion I have yet to mention how the DR tracks what documents are contained in the repository. The answer: it doesn't! There really is no way for the DR to track whether or not a document has been deleted from the corresponding application unless the application itself tells it.

At the end of the day, the repository provides a robust yet simple mechanism for remote file storage and retrieval. Once we understand the basics, there isn't much more to know on the back end. Hopefully this article has shed some light on the dark inner workings of something you normally take for granted. Tune in next time for parts 2 and 3 of this series:

Part 2 - Content Upload and the Knowledge Directory

Part 3 - Utilizing the Repository

Until next time...

About this Archive

This page is an archive of recent entries written by hross in June 2007.
