Tuesday, January 27, 2009

Deduplicating apparent duplicates?

In an effort to use the blog to document SouthComb activities, this entry is the first in what I hope will be a long line of ongoing conversations within the development team. This project is designed to provide transparent access to scholarly materials related to the U.S. South. Some of the issues in providing that access are easy to understand. However, I have a feeling this one will be a bit tricky.

Building a search application means running the risk of collecting what might appear as duplicate records. Web sites, for example, can have multiple pages (home page, navigation page, contact page, site map page, multiple content pages). To an automated web crawler, the content of these pages may be different, but the title is the same. This causes a good bit of confusion when trying to find specific content.
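
To make the problem concrete, here is a minimal Python sketch of how "apparent duplicates" could be detected: records that share a title but carry different content. The record layout (a dict with "url", "title" and "text" keys) and the function names are assumptions made for illustration only, not how SouthComb actually stores or processes its records.

```python
from collections import defaultdict


def group_by_title(records):
    """Map each normalized title to the list of records that share it."""
    groups = defaultdict(list)
    for record in records:
        # Normalize case and whitespace so trivial differences do not split a group.
        key = " ".join(record["title"].lower().split())
        groups[key].append(record)
    return groups


def apparent_duplicates(records):
    """Return only the groups whose title is shared but whose text differs."""
    result = {}
    for title, group in group_by_title(records).items():
        distinct_texts = {record["text"] for record in group}
        if len(group) > 1 and len(distinct_texts) > 1:
            result[title] = group
    return result


# Hypothetical crawl results: three pages from one site, all with the same title.
records = [
    {"url": "http://example.org/", "title": "Example Archive", "text": "Welcome."},
    {"url": "http://example.org/contact", "title": "Example Archive", "text": "Write to us."},
    {"url": "http://example.org/sitemap", "title": "Example  archive", "text": "Site map."},
    {"url": "http://example.org/essay", "title": "A Delta Essay", "text": "Essay text."},
]

for title, group in apparent_duplicates(records).items():
    print(title, [record["url"] for record in group])
```

A report like this only flags the problem; it does not decide which record, if any, should be shown to the searcher.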

So how do we develop SouthComb so that each record it provides is unique and meaningful?

2 comments:

Chick said...

One thing would be to identify the structure of web pages on a per-site basis (there really aren't that many sites) and work out which element, probably a named div, contains the significant text content. Doing this would significantly improve the quality of our contents.
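
A rough sketch of what such a per-site rule could look like in Python, using only the standard library. The hostnames, the div ids and the names SITE_RULES, DivTextExtractor and extract_content are all made up for illustration; the real rules would have to come from looking at each site by hand.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical per-site rules: hostname -> id of the div holding the main text.
SITE_RULES = {
    "example.org": "content",
    "archive.example.edu": "main-text",
}


class DivTextExtractor(HTMLParser):
    """Collect the text inside the first <div> whose id matches target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0       # div nesting depth inside the target; 0 means outside
        self.done = False    # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_content(url, html):
    """Return the significant text for a page, or None if the site has no rule."""
    target_id = SITE_RULES.get(urlparse(url).hostname)
    if target_id is None:
        return None
    parser = DivTextExtractor(target_id)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


page = (
    '<html><body><div id="nav">Home | Contact</div>'
    '<div id="content"><div class="intro">Oral histories</div>'
    "<p>from the Delta.</p></div>"
    '<div id="footer">Copyright</div></body></html>'
)
print(extract_content("http://example.org/page1", page))
```

The idea is that navigation, footers and other boilerplate shared by every page on a site are left out, so the indexed text reflects what is actually different about each page.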

Stacey Martin said...

What exactly do you mean by "per site basis"? How does that translate to an actual task?

Li mentioned building a rule that defines how to generate a title from content. Is this along the same lines?
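
For discussion, here is one guess at what a "title from content" rule might look like in Python: take the first sentence of the page text and trim it to a few words. This is only an assumed rule, not necessarily what Li has in mind, and the function name and the max_words parameter are hypothetical.

```python
import re


def title_from_content(text, max_words=10):
    """Build a record title from the first sentence of the page text."""
    cleaned = " ".join(text.split())
    if not cleaned:
        return ""
    # Split once at the first sentence-ending punctuation followed by whitespace.
    first_sentence = re.split(r"(?<=[.!?])\s+", cleaned, maxsplit=1)[0]
    words = first_sentence.split()
    if len(words) <= max_words:
        return first_sentence.rstrip(".")
    # Too long for a title: keep the leading words and mark the cut.
    return " ".join(words[:max_words]) + "..."


print(title_from_content("Oral histories from the Delta. Collected in 1974."))
```

A rule like this would work best on text that has already been reduced to the significant content of the page, since otherwise the "first sentence" is likely to be navigation text.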