If you've been paying attention to my blog for its short and relatively sparse lifetime, then you probably already know that one of my hobbies is taking apart seemingly meaningless and esoteric pieces of the portal. Let's face it, there's nothing more satisfying than thinking 'I wonder how this works', then many painful hours later thinking 'now I know'. Of course, it's much more satisfying when you can write about it on dev2dev and bask in the praise and adoration of your colleagues (hopefully sarcasm translates well in the blogosphere). Which, inevitably, leads us on today's journey into... the Document Repository.
(feel free to break out your 2001 soundtrack at any point during this post)
What is the Document Repository?
The Document Repository, or DR as I will refer to it hereafter, is exactly what it sounds like: a place to store and retrieve files. That's it. Sure, there are a few extra features like archiving, passwords, and the fact that it's available via a single port, but those are just the details of the "get a file"/"give me a file" mechanism (all of which will be shared below).
The reason you may care about the operation of the document repository has to do with its core role in nearly all of the ALUI embedded application servers (Publisher, Collaboration, direct Knowledge Directory upload, and I think BPM, but don't quote me on that), and the various implications it may have on system management.
How does it work?
At its heart, the repository itself is nothing more than a clever, sometimes complicated, way of exposing a subset of Java interfaces via a simple web-based binary protocol (Hessian). Once these interfaces have been exposed and are available to both a client (a program using the DR as its file store) and a server (the DR itself), it is a trivial matter for the client to store and retrieve files from a central repository.
Hessian is an open source binary web service protocol for exposing POJOs (Java objects) and other pieces of system infrastructure. See the Hessian project site for more information, including sample source code. In the case of the DR, the web services used to expose DR objects are actually hosted on their own port (default 8020) using the ALUI embedded application server (Tomcat). For details on the embedded application server, see my previous post.
Since nothing else is hosted out of the DR container, some of the coolness of using a web-based binary protocol is lost in the shuffle. On the other hand, it allows your friendly ALUI developers to provide other services in the Tomcat container, like a diagnostics URL, which can be found at: http://<dr-host>:<dr-port>/drdiagnostics.
Repository Structure
First and foremost, let's take a tour of the guts of the repository: its organization. Below is a diagram of the repository's functional structure (what happens when a document is submitted):
Looking at the diagram above, I'd like to point out a few significant features of the repository: its ability to split out content on a per-application basis, and the password-based authentication provided out of the box. When an application actually connects to the DR to send a document, it provides both an application name and an application password, which are configured within the repository (see the configuration section of this post for elaboration). Once the correct application 'bucket' is determined, the document itself is stored using the file system structure described by the diagram below (the first two levels are directories):
The repository first determines whether the document is active or being archived (simply a difference in web service call -- most documents are active). Next, the repository stores the document in a pseudo-random folder with a pseudo-random ID. Documents are not encrypted or compressed, so if you know the document type and ID you can actually go to the repository, rename the document on the file system and open it.
The naming convention for documents and folders is 'pseudo-random', meaning there is a structure to the file system, but the actual names are nearly impossible to predict. Folder names always start with F, followed by a hex string representation of a pseudo-random number mixed with a serial generation scheme (the random part comes from the java.util.Random class, initialized with the startup time of the repository in milliseconds). The file names are generated in much the same fashion, with a bit more structure.
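Since stored documents are plain bytes, peeking at one is just a file operation. Here's a minimal sketch of that idea. It copies the stored file to a scratch directory and appends an extension, rather than renaming it in place, so the repository itself is left alone (that's my choice, not something the DR requires). The class name, method name, and the assumption that the stored file name carries no useful extension are all mine for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper; not part of dr.jar. */
final class StoredDocumentCopySketch {

    /**
     * Copies a stored document into targetDir, appending the given extension
     * (for example "pdf") so a desktop application will recognize it.
     */
    static Path copyWithExtension(Path storedFile, Path targetDir, String extension)
            throws IOException {
        Files.createDirectories(targetDir);
        String newName = storedFile.getFileName().toString() + "." + extension;
        Path target = targetDir.resolve(newName);
        // The original file in the repository is only read, never modified.
        return Files.copy(storedFile, target, StandardCopyOption.REPLACE_EXISTING);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 3) {
            System.err.println("usage: <stored-file> <target-dir> <extension>");
            return;
        }
        Path copy = copyWithExtension(Paths.get(args[0]), Paths.get(args[1]), args[2]);
        System.out.println("Copied to " + copy);
    }
}
```

You still need to know the document type yourself; nothing in this sketch tries to detect it.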
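To make the folder naming idea a bit more concrete, here's a rough sketch of what "F plus a hex string mixing a pseudo-random number with a serial counter" could look like. To be clear, this is NOT the algorithm in dr.jar: the class name, the 16-bit widths, and the way the two parts are concatenated are all assumptions I made up for illustration. The only pieces taken from the description above are the leading F, the hex encoding, and java.util.Random seeded with the startup time in milliseconds.

```java
import java.util.Random;

/** Illustrative only; the real logic lives in com.plumtree.dr.provider.filesystem. */
final class FolderNameSketch {
    private final Random random;
    private int serial = 0;

    FolderNameSketch(long startupMillis) {
        // Seeded with the repository startup time, as described above.
        this.random = new Random(startupMillis);
    }

    String nextFolderName() {
        int randomPart = random.nextInt() & 0xFFFF; // assumed width: 16 bits
        int serialPart = serial++ & 0xFFFF;         // assumed serial counter
        return String.format("F%04X%04X", randomPart, serialPart);
    }

    public static void main(String[] args) {
        FolderNameSketch names = new FolderNameSketch(System.currentTimeMillis());
        for (int i = 0; i < 3; i++) {
            System.out.println(names.nextFolderName());
        }
    }
}
```

Because the seed is the startup time, two runs started at different moments produce different sequences, which is why the names look unpredictable from the outside even though there is structure underneath.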
The important thing to know about the file structure is that every time the DR is restarted, newly uploaded files will always go into a new folder in the repository for each application. The name of that folder can be found in nextfolder.cfg in the relevant Active or Archived folder. In addition, no folder will ever have more than 256 files in it. After 256 files have been generated, the repository will automatically move to the folder specified in nextfolder.cfg.
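The 256-file rollover is easy to picture in code. The sketch below is my own toy model, not the DR's implementation: the class name is invented, and the Supplier is a stand-in for "whatever name is currently recorded in nextfolder.cfg". It only shows the counting behavior described above.

```java
import java.util.function.Supplier;

/** Toy model of the per-folder file limit; names here are hypothetical. */
final class FolderRolloverSketch {
    static final int MAX_FILES_PER_FOLDER = 256;

    private final Supplier<String> nextFolderName; // stand-in for nextfolder.cfg
    private String currentFolder;
    private int filesInCurrentFolder = 0;

    FolderRolloverSketch(Supplier<String> nextFolderName) {
        this.nextFolderName = nextFolderName;
        // A fresh folder is picked at startup, mirroring the restart behavior.
        this.currentFolder = nextFolderName.get();
    }

    /** Returns the folder that the next uploaded file would be written to. */
    String folderForNextFile() {
        if (filesInCurrentFolder == MAX_FILES_PER_FOLDER) {
            currentFolder = nextFolderName.get();
            filesInCurrentFolder = 0;
        }
        filesInCurrentFolder++;
        return currentFolder;
    }

    public static void main(String[] args) {
        int[] counter = {0};
        FolderRolloverSketch sketch = new FolderRolloverSketch(() -> "F" + counter[0]++);
        String last = null;
        for (int i = 0; i < 600; i++) {
            String folder = sketch.folderForNextFile();
            if (!folder.equals(last)) {
                System.out.println("file " + i + " -> " + folder);
                last = folder;
            }
        }
    }
}
```

The practical upshot for system management is that the number of folders grows with both upload volume and restart count, so a repository that gets bounced often will have many sparsely filled folders.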
One nice thing about the repository is that it uses a temporary file structure (the Temp directory under Active/Archived) when actually uploading and downloading files, meaning read-only backups and virus scans should not disturb the normal operation of the repository.
For more details on the folder/file name generation convention, see the package com.plumtree.dr.provider.filesystem in the dr.jar. The most relevant classes include Suffixes, Names, and NodeIds.
Repository Configuration
Finally, we can examine the repository configuration files to determine how the structure described above is actually built. Most of the configuration files are either self explanatory, or relate to the embedded application server. In fact, the only 'important' configuration file is:
$DR_HOME/settings/config/dr-server.xml
There are two main elements we are concerned about in this file:
- applications - This element allows us to configure which applications are allowed to connect to the document repository and what password will be used for the connection. This portion of the server configuration must be matched by a dr.xml file provided with any client that uses the repository. The password and application name fields must match on both client and server in order for a successful connection to be made (see settings/config/dr.xml on any embedded application server). One interesting exception to this is that the password field can be provided as an unencrypted string in dr.xml on the client (see the difference by looking at dr.xml for collaboration server and comparing it to the one for the KD upload component).
- providers - The providers element itself is unimportant, as there is only one provider (the file system). The more important element is the applications element below it, which allows us to specify file system locations for archived and active content on a per application basis.
Interestingly, the providers element seems to indicate there could be more than one way of storing documents using the document repository, though only one seems to have been implemented at the time of this writing. It might be a worthwhile exercise to develop a database-backed, or even more distributed, version of a provider.
It also seems, from looking at the configuration, that document and folder IDs could be more tightly controlled by specifying a node ID element in the configuration. This is purely conjecture, as I haven't seen much evidence to support this as part of my spelunking into the repository while researching this post.
Summary
One final technical caveat before I throw out the 'wrap up paragraph': you may have noticed that during this whole discussion I have yet to mention how the DR tracks what documents are contained in the repository. The answer: it doesn't! There really is no way for the DR to track whether or not a document has been deleted from the corresponding application unless the application itself tells it.
At the end of the day, the repository provides a robust yet simple mechanism for remote file storage and retrieval. Once we understand the basics, there isn't much more to know on the back end. Hopefully this article has shed some light on the dark inner workings of something you normally take for granted. Tune in next time for parts 2 and 3 of this series:
Part 2 - Content Upload and the Knowledge Directory
Part 3 - Utilizing the Repository
Until next time...