"Why is our portal so slow? What's going on with the database? How is content getting served up to users? What is being cached? What do you mean, 'published content redirect'?"
If you use Publisher with any sort of frequency in your portal (and a lot of our customers do), then this post is for you. Understanding Publisher is the first step in diagnosing, tuning and solving many performance-related problems. Plus, it really helps when you're troubleshooting portlet issues.
A Logical Diagram
To kick this off, here is my working knowledge of Publisher, in diagram form:
On the left, you'll see our web client. In every case, this is actually going to be the portal, making a request on behalf of a portlet. More on this later, but for now, let's start with the assumption that there's only one portlet per page and there is no caching going on.
On the right are what I'll call the Publisher 'data stores'. All of your content is going to fit into one of these stores.
As you can (sort of) see from the diagram, Publisher consists of four main parts:
- Publisher Administration Host
- Publisher Content Store
- Published Content
- Published Content Redirect
What does each of those parts do, and what does the above diagram mean? Read on...
Publisher Administration Host
When I say 'Publisher Administration', I don't just mean what you see in the portlet of the same name. I mean anything related to the administration of content in Publisher, whether you are creating a new portlet, navigating through Publisher Explorer, or creating a workflow. All of this gets served up via a Java application server, usually from the location http://publisherhost:7087/ptcs, or something similar. It is important to note that the Publisher Administration host is basically its own product. Its only real role is to manipulate the information in the Publisher Content Store and publish it to a Published Content location. Technically, it doesn't really have anything to do with the finished product: the published content.
Publisher Content Store
Whenever you edit, change, manipulate, or otherwise do anything that must be saved, that information goes straight to the store. The store is really just a database and a web service connection to our old friend the document repository. And really, the only things that go into the document repository are binary files (images, spreadsheets, etc). Any text-based content items, presentation templates, data entry templates, and versioning information are all stored in the database.
What does that mean to you? It means the only things you need to back up to recover from total catastrophic failure are the database and the document repository. It helps to sync them up, but since the DR is mostly binary content anyway, you can probably deal with the loss of a few images or Word documents. What does that mean about the content, though?
Published Content
That means that the content itself is all published from the store to a file system, then hosted by your favorite web server (Apache, IIS, Tomcat, etc). If you run the 'out of the box' Publisher configuration, the Java application server hosting the administration piece doubles as the web server. That doesn't mean you have to host out of this container. In fact, I recommend you don't, since any problems with your content will bog down Publisher Administration and vice versa. Every folder in Publisher can be configured with a publishing path (somewhere to put the file, either on the local file system or via FTP) and a hosting path (the URL your web server will host it at), so you can do all kinds of crazy things (like publish JSPs to a Java application server via FTP).
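To make the publishing path / hosting path pairing concrete, here is a rough Python sketch. It is not Publisher code: `FolderConfig`, `published_locations`, and the example path and URL are all names I made up for illustration. It only shows how one folder's two settings determine where a file lands and where it is served from.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Tuple


@dataclass(frozen=True)
class FolderConfig:
    """Hypothetical model of one Publisher folder's publishing settings."""
    publishing_path: str  # where the file is written (local path or FTP target)
    hosting_url: str      # base URL the web server exposes for that location


def published_locations(folder: FolderConfig, file_name: str) -> Tuple[str, str]:
    """Return (where the file lands, URL it is served from) for one item."""
    target = str(PurePosixPath(folder.publishing_path) / file_name)
    url = folder.hosting_url.rstrip("/") + "/" + file_name
    return target, url


# Assumed example values; substitute whatever your own folders use.
news = FolderConfig(
    publishing_path="/var/www/published/news",
    hosting_url="http://webhost.example.com/news/",
)
print(published_locations(news, "index.html"))
```

The point of keeping the two values separate is that the file can be written to one machine (say, over FTP) and served by a completely different web server.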
Of course, if Publisher can put files in all kinds of locations, how come there is only one web service that points to all of the content (the Published Content Web Service in the portal)? And how does it know how to host it? (I love foreshadowing)
Published Content Redirect
The final piece of this puzzle is the Published Content Redirect, featured above in my half-hearted attempt at time-lapse Visio. If you look at the numbered requests on the left side of the diagram, you'll see what really happens when a portlet hits the Published Content Web Service (the web service in the portal that you build all your portlets from). To elaborate:
- Web client hits published content redirect (usually something like http://publisherhost:7087/ptcs/redirect.jsp)
- Published content redirect instantiates a Publisher session, checks the database for the published content URL (see Published Content above), and returns an HTTP 302 (Redirect)
- Web client is redirected to new location (the actual published content) and your web server of choice serves the page
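The three steps above can be mimicked in a few lines of Python. This is a toy model, not how redirect.jsp is actually implemented: the `PUBLISHED_URLS` dictionary stands in for the database lookup, the item ID and URL are invented, and the 404 for an unknown item is my own assumption about a sensible failure mode.

```python
from typing import Dict, Tuple

# Stand-in for the Publisher database lookup; the ID and URL are invented.
PUBLISHED_URLS: Dict[str, str] = {
    "header": "http://webhost.example.com/portal/header.html",
}


def handle_redirect(item_id: str) -> Tuple[int, Dict[str, str]]:
    """Mimic step 2: look up the published URL and answer with a 302."""
    url = PUBLISHED_URLS.get(item_id)
    if url is None:
        return 404, {}  # assumed behavior for unknown items
    return 302, {"Location": url}


def fetch(item_id: str) -> str:
    """Mimic steps 1 and 3 from the web client's side."""
    status, headers = handle_redirect(item_id)  # step 1: hit the redirect
    if status != 302:
        raise LookupError(f"no published content for {item_id!r}")
    # Step 3: the client follows Location; the web server there serves the page.
    return headers["Location"]


print(fetch("header"))
```

Notice that the redirect never touches the content itself. It only answers "where is it?", which is why it stays cheap per request but sits in the path of every request.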
Load Balancing, Failover and Performance
What I am not going to talk about: how to set up load balancing or failover in the portal using DNS or any other method. If you want a quick rundown on this, Gerald Kanapathy has an excellent post on the subject. Read that, then come back and read the Publisher specifics.
You're back? Okay, good...
The problem we run into with Publisher, as with many COTS products, is that it's not simply a static web host or database that we can tune for performance. We actually have to know how it works in order to make it work for us. That said, there are 5 pieces you'll have to deal with separately, both in terms of performance and failover. Here is each one, accompanied by a brief discourse on how to handle it:
- The Database - Load balancing, performance and failover are all subjects best left to your friendly neighborhood DBA. Implementation is well documented and should be pretty easy.
- The Document Repository - The DR, the other back-end data store, is a bit of a different animal. If you want to implement failover, it's possible (and probably should be addressed in a separate post). In a nutshell:
  - Install multiple document repositories on different hosts.
  - Point the 'documents' directory on each to a shared file system.
  - Configure a DNS entry to point to a load-balanced pool of the DR IPs.
  - Use the above DNS entry when configuring the Publisher document repository configuration entry.
- Published Content Redirect - The redirect can be a sticky wicket. It's probably the most overlooked, yet most important, piece of Publisher availability. If the redirect goes down, none of your published content will show up, whether or not the content itself is set up for load balancing or failover. Luckily, BEA has provided you with a Publisher installation option called the 'Published Content Host'. What this install does is only install the redirector. Keeping that in mind, you can do the following:
  - Install the redirector on multiple hosts.
  - Configure a DNS entry to point to the pool of IPs hosting redirectors (either via round robin DNS or a load balancing solution).
  - Point all of the redirectors at the same Publisher database.
  - Change the Published Content Web Service URL to point at your new pool of redirector IPs.
  - Change the Preferences section of the Published Content Web Service to point back to the admin server (don't forget this, or you'll have problems editing content from portlets).
- Publisher Administration - Publisher administration can be set up for clustering, similar to our Collaboration product. I believe it uses some form of UDP multicast to sync multiple administrative instances. Unfortunately, I've never actually done this, so I can't give you an enlightened tutorial. On the other hand, it's not essential. Most of the time you will be concerned about your content being available. If the editing interface goes down for any period, it won't affect end users of your portal, only administrators.
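For the redirector pool in particular, it may help to see what "round robin with failover" means in practice. The sketch below is plain Python, not a DNS or load balancer configuration; the host names, the `pick_redirector` helper, and the health check are all hypothetical. A real setup would do this in your DNS or load balancing layer.

```python
from itertools import cycle
from typing import Callable, Iterator, Optional


def pick_redirector(pool: Iterator[str], size: int,
                    is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the next healthy host in round-robin order, or None if all are down."""
    for _ in range(size):  # try each host at most once per request
        host = next(pool)
        if is_healthy(host):
            return host
    return None


# Invented host names; every one of them would point at the same Publisher database.
HOSTS = ["redirect1.example.com", "redirect2.example.com", "redirect3.example.com"]
down = {"redirect2.example.com"}  # pretend this redirector has failed

pool = cycle(HOSTS)
picks = [pick_redirector(pool, len(HOSTS), lambda h: h not in down) for _ in range(4)]
print(picks)
```

With the second host marked as down, requests simply alternate between the first and third. Plain round robin DNS skips the health check entirely, which is the main reason to consider a real load balancer for this piece.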
Other Avenues
Finally, there is the matter of highly available content, caching, and the image server. At the end of the day, you may still have content that will be available to almost all your users, shown on every page. This type of content may require different caching strategies than the rest of your Publisher content. For instance, say you have a header portlet that is displayed on every page of your portal and you want to publish it using Publisher.
In these types of cases, what I normally do is set up a web service independently from Publisher. What this allows me to do is configure a separate caching strategy for the content (if published content is available immediately, maybe my header only changes every 6 hours). It also allows me to avoid putting unnecessary load on the published content redirect, since I normally point this separate web service directly at the content. And finally, if I'm really looking for performance, I tend to publish the content directly to the image server. As long as I configure the publishing URL to point to the image server URL, this allows me to publish binary content (say, all the images in my header) straight to a static host, and serve them up from the image server, rather than going through the gateway (one extra layer of unnecessary HTTP traffic).
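The "header only changes every 6 hours" idea boils down to a time-to-live cache. In the portal you would configure this on the web service rather than write code, so treat the following Python as an illustration of the concept only; `TtlCache` and the placeholder fetch function are my own inventions.

```python
import time
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class TtlCache(Generic[T]):
    """Cache one value and refetch it only after ttl_seconds have passed."""

    def __init__(self, fetch: Callable[[], T], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the expiry logic can be tested
        self._value: Optional[T] = None
        self._expires_at: Optional[float] = None  # None means nothing cached yet

    def get(self) -> T:
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            self._value = self._fetch()  # only now do we hit the content host
            self._expires_at = now + self._ttl
        return self._value  # type: ignore[return-value]


SIX_HOURS = 6 * 60 * 60
# Placeholder fetch; a real one would request the published header directly.
header_cache = TtlCache(lambda: "<div>header markup</div>", SIX_HOURS)
print(header_cache.get())
```

Every call inside the six-hour window is answered from memory, so neither the redirect nor the content host sees the request.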
and... I'm spent.
Using the above information, you should be able to configure a highly available, high-performing Publisher environment. If not, please don't hesitate to give our consulting services department a call (and feel free to mention my name ;-).