How to Organize Content Sources – Best Practices

Working with search for years has given me a lot of experience with content sources, as well as a lot of questions about them: "Should I create one huge content source, or would it be better to split it up into smaller chunks?", "Can I merge my small content sources into one big one?", "How should I schedule the crawls for each of my content sources?". In this post, we look at how to organize content sources from a best practices perspective.

As usual, there is no "silver bullet" answer for this; the only general answer is "it depends…"

But I know the slogan "it depends" is not why you started reading this blog post, so let me at least give you some ideas about what to consider when facing these questions.

Keep in mind, though, that sometimes you don't have a choice: your content source is one big unit that cannot be split, for example a big database.
You might also consider merging your small content sources into one big one, for example when you have small file shares that are "similar enough" (again, the definition of similarity is up to you).

In many cases, though, you can split a huge content source into two or more smaller ones, and/or merge the small ones into bigger ones.
For example, a huge file share can be split up by subfolders (or groups of subfolders). A SharePoint farm can be split up by web applications, site collections, or even sites (although that's a rare situation). A third-party document management system can be split up by repositories.
Small file shares can be treated as one big content source, as can small SharePoint sites.
You get the point.

The real question is: when should you keep or create one big content source, and when should you create multiple smaller ones, IF splitting is possible?

Considerations to make here:

• Content Source types – In SharePoint, the following content source types are available: SharePoint Sites, Web Sites, File Shares, Exchange Public Folders, Line of Business Data, and Custom Repository. These types cannot be mixed (for example, you cannot have a SharePoint site and a file share in the same content source), but within the same type, you can add more than one start address of your choice. This can be a good option if you have multiple small content sources (see the first script sketch after this list).

• Crawling time and schedule – The more changes you have since the last crawl, the longer the crawl takes. The more often you crawl, the fewer changes you have to process during an incremental crawl. The more often you run an incremental crawl, the less idle time your system will have. On the other hand, the more often you crawl, the more resources you consume on both the crawler and the source system. And the more often you crawl, the bigger the chance that a crawl cannot finish before the next one is scheduled to start; the result is worse content freshness and worse search performance than you expect. Moreover, if you have multiple content sources, you have to align their schedules to keep your system from being overloaded by multiple parallel crawls (the inventory sketch after this list can help with this).

• Performance effect on the crawler components – This is an obvious one: crawling takes resources. The more you crawl, the more resources you take. If you crawl multiple content sources in parallel, it takes more resources. If you run one huge crawl, it takes resources for a longer time. And if you don't have enough resources, the crawl might fail or run "forever", affecting other crawls.

• Performance effect on the source system – This is usually the least considered one: crawling takes resources on the source system as well!

• Bandwidth – Crawling pulls data from the source system to be processed on the indexing components. This data has to be transferred, and that takes bandwidth. In many cases, this is the bottleneck in the whole crawling process, even if the source system and the crawler perform well. The more crawl processes you run at the same time, and the more parallel threads they have, the more bandwidth is needed. Serialized crawls mean more balanced bandwidth requirements.

• Similar content sources? – At the same time, you might have similar content sources that should be treated the same way. For example, if you have small file shares, you might "aggregate" them, collecting them into one content source, so that their crawls can be managed together. You definitely have to do a detailed inventory for this (again, see the inventory sketch after this list).

• Live content vs. archive – While "live" content changes often, archive content either doesn't change at all or changes very rarely. While "live" content has to be crawled often, an archive doesn't need incremental crawls to run very often. Remember: after the initial full crawl, the content is in the index, and due to the rare changes, it can be considered pretty up-to-date. So if you have a system (of any kind) with both live and archive content, you'd better split them, crawling the live content often, while the archive doesn't need any special attention after the initial full crawl (see the scheduling sketch after this list).

• Automated jobs running on the content source – There are many systems where automated jobs create or update content. In most cases, these jobs are time-scheduled, running in the late evenings or early mornings, for example. As these jobs are predictable, we have two best practices here: schedule your crawls to start right after these jobs finish, so the new content gets indexed as soon as possible; and make sure crawls never overlap with the jobs' run time, so the two workloads don't compete for resources on the source system.
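To make the "aggregation" option more tangible, here is a minimal PowerShell sketch: one File Share content source with multiple start addresses. This is illustrative only; the Search Service Application name and the UNC paths are placeholders you'd replace with your own.

    # Hypothetical example: aggregate several small, "similar enough" file
    # shares into a single content source (names and paths are placeholders).
    Add-PSSnapin Microsoft.SharePoint.PowerShell -ErrorAction SilentlyContinue

    $ssa = Get-SPEnterpriseSearchServiceApplication -Identity "Search Service Application"

    # One File Share content source with several start addresses:
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Small File Shares" `
        -Type File `
        -StartAddresses "\\server1\projects,\\server2\templates,\\server3\policies"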
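The live vs. archive split and the "crawl after the nightly jobs" idea can be sketched the same way. Assume content sources named "Live Intranet" and "Archive" already exist, and that the nightly jobs finish by 5:00 AM; both the names and the times are assumptions for illustration.

    # "Live" content: incremental crawl once a day, starting right after the
    # (assumed) nightly jobs finish at 5:00 AM. More frequent repeats within
    # the day can be configured as well.
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Live Intranet" `
        -ScheduleType Incremental `
        -DailyCrawlSchedule `
        -CrawlScheduleRunEveryInterval 1 `
        -CrawlScheduleStartDateTime "05:00"

    # "Archive" content: no frequent incrementals at all; just an occasional
    # full crawl, for example on the first day of every month.
    Set-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Identity "Archive" `
        -ScheduleType Full `
        -MonthlyCrawlSchedule `
        -CrawlScheduleDaysOfMonth 1 `
        -CrawlScheduleStartDateTime "02:00"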
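Finally, the inventory sketch: listing every content source with its type, start addresses, and last completed crawl is a good starting point both for aligning crawl schedules and for spotting "similar enough" sources that could be merged.

    # List all content sources with their start addresses and crawl state,
    # as a starting point for the inventory.
    Get-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa |
        Select-Object Name, Type,
            @{ Name = "StartAddresses"
               Expression = { ($_.StartAddresses | ForEach-Object { $_.AbsoluteUri }) -join "; " } },
            CrawlStatus, CrawlCompleted |
        Format-Table -AutoSize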

It isn’t an easy decision, is it?

During the planning phase of a search project, each of these points should be evaluated, and the result would be something like this table:

Source system   | Type            | Amount of content | Content Source(s)
----------------|-----------------|-------------------|--------------------------------------------------
X:              | file share      | 20,000,000        | Marketing (X:\Marketing), HR (X:\HR), IT (X:\IT)
Z:              | file share      | 15,000            | Documents
Y:              | file share      | 100,000           | Documents
http://intranet | SharePoint site | 2,000,000         | Local SharePoint Content
http://extranet | SharePoint site | 150,000           | Local SharePoint Content
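Translated into provisioning, the plan above might look something like the sketch below. The table only shows drive letters, so the server names and share paths here are placeholders, and it assumes the $ssa variable from the earlier sketches.

    # Hypothetical provisioning of the plan above (paths are placeholders).

    # X: is huge (20,000,000 items), so it gets split into three content sources:
    "Marketing", "HR", "IT" | ForEach-Object {
        New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
            -Name $_ -Type File -StartAddresses "\\fileserver\X\$_"
    }

    # Z: and Y: are small, so they get merged into a single content source:
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Documents" -Type File `
        -StartAddresses "\\fileserver\Z,\\fileserver\Y"

    # The two SharePoint sites can share one content source as well:
    New-SPEnterpriseSearchCrawlContentSource -SearchApplication $ssa `
        -Name "Local SharePoint Content" -Type SharePoint `
        -StartAddresses "http://intranet,http://extranet"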

Agnes was a speaker at ESPC13. Check out Agnes' blog for more insightful content, as well as her ESPC13 conference presentation, '10 Things I Like in SharePoint 2013 Search'.
