hross: October 2008 Archives

Back in the Saddle (Again)


As you may have noticed (or not noticed, if you read this site from your aggregator), my blog has undergone some much-needed renovation. As a blog reader myself, I appreciate a dearth of "State of the Union" posts in the blogs I read. Nonetheless, there have been quite a few changes, so here's an infrequent update...

  • I use Movable Type to publish, and despite its complexity, I love the customizability and performance. I found a great style from the CMR Movable Type Styles Blog, and have customized it to my liking. Finally, no more blah "Minimalist Red" scheme.
  • Dev2Dev is finally gone for good, which means some of my blog images were 404'ing. Thank you, helpful readers, for pointing this out. I finally went back and reset them to point to my own blog.
  • Bill Benac recommended we Plumtree bloggers start linking to each other, so you'll notice a blogroll on the right side of the page. If you're blogging and you want to be listed here, give me a shout.
  • The blog slogan has now officially become "Tech + Caffeine = blog", since I'm tired of dealing with these product name changes. As are other people, apparently.
  • That Jeep in the picture is mine, featured in some mind-blowingly isolated terrain right outside the Badlands.

So what's in store for the future? Well, I still have some posts to write in my Search Series, I have a long backlog of posts regarding various technical minutiae, and then there is your input... Thoughts?

Let's take a quick timeout from Search for a more basic post...

I don't have a "Cool Tools" section of my blog, like some other notable ALUI bloggers, but I do know of a few "cool tools" that have helped me do my job. One of my favorites is a fancy diff utility called WinMerge.

(go download it now if you haven't already)

One of the primary things I use it for is validating product upgrades. If you're as lazy and/or paranoid as I am, you have probably paused during an ALUI upgrade when you saw the step "re-import the PTE". As most of us know, re-importing a PTE is a mixed bag: it drags along a lot of dependencies and can frequently wipe out customizations to web services, portlets, etc. Worse yet, you never quite know what's happening when you import.

What if we could analyze a PTE and figure out what changes were made so that we could either:

    • make the changes ourselves
    • not bother re-importing
    • at least know what changes were going to be made to our existing data?

Turns out this is rather simple (and, obviously, involves WinMerge).

Let's use a relevant example to demonstrate: a Publisher upgrade from 6.4 to 6.5. This is an upgrade of a minor revision number, so you would think there would be relatively few changes to the PTEs. Nonetheless, the install guide tells me to re-import, re-import, re-import.

Yuck.

Instead, I'll take an alternate approach. First, I run the Publisher 6.5 upgrade installer as I normally would. However, once I get to the re-import step, I navigate to the ptcs/6.4/serverpackages directory of my previous Publisher install and grab the publisher.pte file therein. Next, I grab the same PTE file from my ptcs/6.5/serverpackages directory.

Now I have both default install PTEs. Any differences between them are the changes introduced by the 6.4 to 6.5 upgrade. Since these PTEs are really just XML files with fairly obvious naming conventions, I simply open them up side by side in WinMerge and compare the differences...

pte_diff

As it turns out, the only changes to the Publisher package in 6.5 are some /jspell URLs that have been added to the gateway settings for some web services. Since I can read the new URLs in WinMerge, I can copy the gateway URLs and add them manually. Now I no longer need to import the PTE.

... and even if there were more changes and I had to re-import, I would be well informed of what they were before running the import.
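The same comparison can be scripted if you'd rather not eyeball it. Below is a minimal Python sketch using the standard library's difflib. It assumes the PTEs can be read as UTF-8 text, and the function names are made up for illustration; nothing here ships with the product.

```python
import difflib
from pathlib import Path


def added_lines(old_text: str, new_text: str) -> list:
    """Return the lines that appear in new_text but not in old_text."""
    old = old_text.splitlines()
    new = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both introduce lines on the new side.
        if tag in ("insert", "replace"):
            added.extend(line.strip() for line in new[j1:j2])
    return added


def compare_pte_files(old_path: str, new_path: str) -> list:
    """Read two PTE files (assumed to be UTF-8 XML) and list the new lines."""
    old_text = Path(old_path).read_text(encoding="utf-8")
    new_text = Path(new_path).read_text(encoding="utf-8")
    return added_lines(old_text, new_text)
```

Calling compare_pte_files("ptcs/6.4/serverpackages/publisher.pte", "ptcs/6.5/serverpackages/publisher.pte") would list whatever the newer package adds. This is a plain line diff, so reordered or reformatted XML will show up as noise; a visual tool is still the better choice for a careful review.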

Okay. We now return you to your regularly scheduled programming.

Here we are, back again for another installment in my new blog "mini-series" about search. When I first started researching these posts (er... presentation, actually) the mini-series might have been more aptly titled "Lost" (not to be confused with ABC's hit series, except for the mass confusion and never-ending storyline).

Last time I promised some hard-hitting dirt on Search Administration, and as always, I deliver on my blog promises. Okay, maybe hard hitting is a bit of a stretch... let's talk about Search Administration. Most of you are probably familiar with the Search Cluster Manager and Search Service Manager in the Administrative Utilities drop down, but what are they and how do they work?

Let's start tackling this with a diagram:

search_admin_1 

This diagram represents the end-all be-all of the search administration process. There are two parts:

  1. Portal communication with a search node directly. This is the Search Service Manager (left side of the diagram). It is basically the portal asking the node about the health and topology of the search server and the node replying with this information. This contact node is extremely important, since it tells the portal front end which search nodes to query and how. The query is performed over the same port as any other search request, using the same mechanisms, and will show up in your search logs if you have them at a high enough verbosity.
  2. Portal communication with the search topology indirectly. This is done via the Search Cluster Manager (right side of the diagram). I have heard much rumor and hearsay regarding the Search Cluster Manager, so let me clear up any misconceptions you might have with a properly bolded and formatted statement:

The Search Cluster Manager is a Java web application that reads and writes files on the Cluster File System.

What this really means is that the Search Cluster Manager is totally unnecessary. All administration can be done with the cadmin tool (in your search server's bin directory) or via direct changes to specific initialization files (this is what the Search Cluster Manager does, anyway). So basically, the diagram above actually looks like this:

search_admin_2

Wrap Up

So that's it. Basically, the takeaways here are:

  1. Search Cluster Manager is simply a prettied-up version of the command line utility and does not need to run for search to function in the portal.
  2. Search Service Manager controls the contact node and determines search topology for the portal front end.

Pretty simple, eh? Next up... some more interesting details on node operation.

Once again, I'm back from the dead. I admit it, I haven't been that busy lately, just had a hard time motivating myself to get through this search series. Perhaps more coffee will do the trick...

Breaking Down a Search Collection

Last time I listed the various functions of search and reposted my first search slide. It was fairly simple, just an abstract "Search Collection" diagram. This time let's break that diagram down a bit more:

what_is_search_3

What we see above is a less abstract view of the same diagram. Instead of one giant "Search" lump, we actually have an API, which makes the communication decisions, and a collection of search nodes. These nodes are just processes running somewhere, listening on a specific port. More about them later.

Partitions

That was pretty simple, right? Let's throw in one more wrinkle before moving on to the complicated bits: Partitions. A partition is simply a grouping of search data into a set of nodes. Applying that concept to the above diagram, a partitioning of our search collection might look something like:

what_is_search_partitions

In other words, some of the data indexed by search (search results) will reside in Partition 1 on Node 1, and some of the data will reside in Partition 2 on Nodes 1 and 2. If we draw out the partitions in a more abstract manner, they look like this:

what_is_search_partitions_abstract

As you can see, there are two separate "bins" of data. When new information is indexed it goes into one of these two bins. It is important to note that neither partition contains duplicate data, so when you search for something the results from Partition 1 and Partition 2 must be aggregated together. Duplicate data will, however, exist on Nodes 1 and 2 in Partition 2 (see above).
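To make the aggregation point concrete, here is a toy Python model of the layout described above. It is purely conceptual: the document strings, the dictionary layout, and the search function are all invented for illustration and are not how the search product stores or queries data.

```python
# Each partition maps to a list of replicas; every replica in a partition
# holds the same documents, and no document lives in two partitions.
partition_1_docs = ["portal admin guide", "search tuning notes"]
partition_2_docs = ["publisher upgrade notes", "search cluster setup"]

partitions = {
    "Partition 1": [partition_1_docs],                          # one node
    "Partition 2": [partition_2_docs, list(partition_2_docs)],  # two nodes, same data
}


def search(partitions: dict, term: str) -> list:
    """Query one replica per partition and merge the hits."""
    results = []
    for replicas in partitions.values():
        node_docs = replicas[0]  # any replica will do, since they are copies
        results.extend(doc for doc in node_docs if term in doc)
    return sorted(results)


print(search(partitions, "search"))
```

Because the partitions never overlap, merging needs no de-duplication; the duplicate copy inside Partition 2 only matters when choosing which node to ask.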

Search Coordination

With all this data moving about, being partitioned, searched, etc., you may be wondering how all of the search nodes communicate with one another. How do they know which partition they belong to, which node they are, and what data has already been indexed?

The answer, it turns out, is extremely simple. They all must share at least one common set of files and directories, which I'll call the "Cluster File System". There is no special port-to-port communication, magic pixie dust, or any other way for search nodes to talk to each other. The cluster file system contains configuration information about the entire search topology, as well as a common queue/locking mechanism for incoming search indexing requests (more detail later). In other words, our previous diagram now looks like this:

what_is_search_cluster_file_system
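For a feel of how processes can coordinate through nothing but shared files, here is a small Python sketch of a lock file in a shared directory. This is a generic technique, not the search server's actual locking scheme (which I have not shown here); the function names and the lock file name are hypothetical, and whether exclusive create is truly atomic depends on the file system underneath.

```python
import os
from pathlib import Path


def try_acquire(lock_path: Path, node_name: str) -> bool:
    """Try to create the lock file; return False if another node holds it."""
    try:
        # O_CREAT | O_EXCL fails if the file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(node_name)  # record the owner for debugging
    return True


def release(lock_path: Path) -> None:
    """Give up the lock by deleting the lock file."""
    lock_path.unlink()
```

A node that wants to work on the shared queue would call try_acquire first, do its work only if it gets True, and call release afterwards. No node ever talks to another directly; the shared directory is the only meeting point.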

And that's really all there is to it. I've just covered all of the concepts you'll need for a basic understanding of search.

Wrap Up

Alright, we've covered the basics, but as you know, I'm never fully satisfied with the basics. Hopefully you now have a base understanding of search operation and are ready to stick with me for the under-the-covers part. Most of the information I've provided to this point is covered in the docs, just (in my opinion) not very well. Next time, look for some more detailed information on how search administration works and on under-the-covers node operation.

About this Archive

This page is an archive of recent entries written by hross in October 2008.

