We fear what we do not understand. As a consequence, one of the things that has scared me in the past is how the Knowledge Directory (KD) and search work (cards, indexing and metadata, oh my!). However, an emphasis on preserving KD cards and ensuring crawlers operate properly at a major client means that my fear has been greatly reduced. As has become a theme with this blog, the reader should benefit as a result of my suffering.
Knowledge Directory Cards
The KD has a very simple concept at its heart: a centrally located list of information about documents and their locations. The documents themselves are elsewhere: on a file system, in a content management system, on a web site. They are made accessible through crawlers, which basically categorize HTTP-accessible document links and the corresponding information about those documents.
The folder structure you see in the KD user interface is a bit misleading, since the KD is really just a flat set of document metadata. The folders are just for human understanding and classification. That's why duplicate documents will show a (1), (2), etc., no matter where they are in the KD (you can "fix" this with a simple portal customization, but that's another post).
What's in a Card?
So what is all this stuff that's in a card, and how can you check it out? The easiest way, in my opinion, is to query the portal database. All of the data for KD cards is in the following tables in the portal database: PTCARDS, PTCARDPROPERTIES, PTINTERNALCARDINFO. If it's not there, it's not in the KD. How about a bit more detail?
PTCARDS - This is all basic information: the name of the document, its description and instructions to the portal on what URL opens it.
Brief Sidebar: Why doesn't the portal build URLs dynamically? Well, for starters there's your browser: you want someone downloading a file to get the relevant txt, mpg or jpg at the end of a file name so they know what to do with it. The second reason is that you may or may not want to point the user's browser directly at the document location (rather than gatewaying the request through the portal). Remember, there are some content sources you'll want to allow to use their own security mechanisms to prevent unauthorized access, and there are other sources, like web sites, that are better off being accessed directly, both from a traffic load standpoint and from a usability perspective (when I click a link to Google, I want my browser's URL to say http://www.google.com).
PTCARDPROPERTIES - This table contains all of that juicy metadata. When you view card properties in the portal, this table pretty much sums it up. It also has meta information that tells a crawler or search indexing agent what to do with the document (e.g. should it be searchable?). You'll notice there is a weird XML format to the data in this table. That's because the data is stored using the somewhat controversial PropertyBag structure.
You can find some good examples of the card submission internals by going back to our old friend the UI source. The key thing to look for in the case of card properties is the com.plumtree.server.PT_CARD_SETTINGS class, which contains constants for the internal crawler metadata. It should also give you a good idea of what I mean by property bag.
PTINTERNALCARDINFO - The fun table. This table tells the card refresh agent about what to do with the card. Most of these settings can be changed/manipulated in the crawler options screen. A few relevant fields are listed below, with some explanations:
- CRAWLERID - object ID of the crawler this card belongs to.
- DATASOURCE - object ID of the data source this card belongs to.
- REFRESHDATE, LASTREFRESHED, REFRESHRATEUNITS - Properties that determine and store the last time the document properties were refreshed in the KD, as well as the next time these properties should be refreshed.
- EXPIRATIONDATE - Controls the expiration date of the document. If this date/time passes, the document is deleted. This property is set in the crawler configuration under Document Settings | Document Expiration.
- MISSINGDELETEUNITS - Set in the crawler configuration under Document Settings | Broken Links. If set, this property determines how long past the refresh date a card is kept before it is deleted when its document is no longer found by the crawler (NULL means it won't be deleted).
- LOCATIONA_CRC, LOCATIONB_CRC - CRC values calculated from the document location. I'm not quite sure what these are actually used for.
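To make a few of those fields concrete, here is a minimal sketch in Python of the kind of query I have in mind: find the cards for one crawler that are due to expire before a given date. Big caveats: it runs against an in-memory SQLite database as a stand-in (the real portal database is not SQLite), the CARDID column and the sample rows are made up for illustration, and storing dates as ISO text is my assumption. Only the other column names come from the list above.

```python
import sqlite3

# Stand-in for the portal database. The real one is not SQLite; this only
# lets the query below run anywhere.
conn = sqlite3.connect(":memory:")

# CARDID is a hypothetical key column; the rest are fields from the list above.
conn.execute(
    """
    CREATE TABLE PTINTERNALCARDINFO (
        CARDID INTEGER PRIMARY KEY,
        CRAWLERID INTEGER,
        DATASOURCE INTEGER,
        LASTREFRESHED TEXT,
        REFRESHDATE TEXT,
        EXPIRATIONDATE TEXT,
        MISSINGDELETEUNITS INTEGER
    )
    """
)

# Invented sample data, with dates stored as ISO strings (an assumption).
sample_rows = [
    (1, 200, 10, "2008-01-01", "2008-02-01", "2008-03-01", None),
    (2, 200, 10, "2008-01-05", "2008-02-05", None, 7),
    (3, 201, 11, "2008-01-09", "2008-02-09", "2008-06-01", None),
]
conn.executemany(
    "INSERT INTO PTINTERNALCARDINFO VALUES (?, ?, ?, ?, ?, ?, ?)", sample_rows
)


def cards_expiring_before(connection, crawler_id, cutoff):
    """Return (CARDID, EXPIRATIONDATE) for one crawler's cards expiring before cutoff."""
    cursor = connection.execute(
        """
        SELECT CARDID, EXPIRATIONDATE
        FROM PTINTERNALCARDINFO
        WHERE CRAWLERID = ?
          AND EXPIRATIONDATE IS NOT NULL
          AND EXPIRATIONDATE < ?
        ORDER BY EXPIRATIONDATE
        """,
        (crawler_id, cutoff),
    )
    return cursor.fetchall()


expiring = cards_expiring_before(conn, 200, "2008-04-01")
print(expiring)  # [(1, '2008-03-01')]
```

Cards with no expiration date set are skipped by the query. Against a real portal database you would swap in your own driver and check the actual column names first.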
Update (great info from reader danyadsmith -- thanks for the info and the kudos):
The PTCARDSTATUS table holds link and property refresh settings and contains the following columns:
- OBJECTID (int, not null)
- STATUS (int, not null)
- INDEXLASTUPDATED (datetime, null)
- LASTMODIFIED (datetime, null)
The STATUS field is the most useful of the four. By modifying the integer value, you can delete, refresh, or re-crawl cards into the directory. The available values are:
0 - Do Nothing
1 - Refresh Properties
2 - Not Used/Disregard
3 - Delete
4 - Recrawl and Refresh
Any changes to this table kick in with the run of Search Update and Doc Refresh jobs.
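If you want to script a status change rather than type the integers by hand, something like the following Python sketch is one way to do it. The enum member names and the set_card_status helper are mine, not anything from the portal; the numbers and column names come from the lists above. The database here is an in-memory SQLite stand-in, so the `?` placeholder style and the column types are assumptions that may not match your real database driver.

```python
import sqlite3
from enum import IntEnum


class CardStatus(IntEnum):
    """STATUS values from the list above (member names are my own)."""
    DO_NOTHING = 0
    REFRESH_PROPERTIES = 1
    NOT_USED = 2
    DELETE = 3
    RECRAWL_AND_REFRESH = 4


def set_card_status(connection, object_id, status):
    """Set STATUS for one card and return how many rows were changed."""
    status = CardStatus(status)  # raises ValueError for anything outside 0-4
    cursor = connection.execute(
        "UPDATE PTCARDSTATUS SET STATUS = ? WHERE OBJECTID = ?",
        (int(status), object_id),
    )
    connection.commit()
    return cursor.rowcount


# In-memory SQLite stand-in with the four columns listed above.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE PTCARDSTATUS (
        OBJECTID INTEGER NOT NULL,
        STATUS INTEGER NOT NULL,
        INDEXLASTUPDATED TEXT,
        LASTMODIFIED TEXT
    )
    """
)
# Invented sample card with object ID 1001.
conn.execute("INSERT INTO PTCARDSTATUS VALUES (1001, 0, NULL, NULL)")

changed = set_card_status(conn, 1001, CardStatus.REFRESH_PROPERTIES)
current = conn.execute(
    "SELECT STATUS FROM PTCARDSTATUS WHERE OBJECTID = 1001"
).fetchone()[0]
print(changed, CardStatus(current).name)  # 1 REFRESH_PROPERTIES
```

Going through the enum means a typo like 7 fails loudly instead of quietly writing a value the jobs don't understand. The update itself still does nothing until the jobs mentioned above have run.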
As always, I make no guarantees as to the past, present, or future accuracy of this information and I encourage you to do your homework before doing anything related to the portal database. Official sources would probably tell you not to touch it. I might be inclined to disagree, but that's mostly because I like causing headaches for our support guys.