<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.1-alpha-2539" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
		>
	<channel>
		<title>LucidWorks support Forum &#187; Tag: web crawler - Recent Posts</title>
		<link>http://forum.lucidworks.com/tags/web-crawler</link>
		<description>LucidWorks support Forum &#187; Tag: web crawler - Recent Posts</description>
		<language>en-US</language>
		<pubDate>Wed, 22 May 2013 20:22:14 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.1-alpha-2539</generator>
				<atom:link href="http://forum.lucidworks.com/rss/tags/web-crawler" rel="self" type="application/rss+xml" />

		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-76</link>
			<pubDate>Thu, 16 Dec 2010 05:49:39 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">76@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>I also wondered whether <a href="http://home.ccil.org/~cowan/XML/tagsoup/" rel="nofollow">http://home.ccil.org/~cowan/XML/tagsoup/</a> might be useful to clean up XHTML before it was indexed by Aperture. In our experience (and we've built a lot of web crawlers over the years!) web content is never, ever clean....</p>]]></description>
					</item>
		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-75</link>
			<pubDate>Thu, 16 Dec 2010 05:36:38 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">75@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>I've been working with the new version and I'm glad to see more detailed crawler errors being reported.</p>]]></description>
					</item>
		<item>
			<title>Mark Miller on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-74</link>
			<pubDate>Wed, 15 Dec 2010 08:57:38 +0000</pubDate>
			<dc:creator>Mark Miller</dc:creator>
			<guid isPermaLink="false">74@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>It's available now: <a href="http://www.lucidimagination.com/lwe/download" rel="nofollow">http://www.lucidimagination.com/lwe/download</a></p>
<p>Keep in mind that there is currently no upgrade path from the last developer release, so start from a new install.</p>]]></description>
					</item>
		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-70</link>
			<pubDate>Mon, 06 Dec 2010 09:26:48 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">70@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Is there a timetable for the next release of LWE?</p>]]></description>
					</item>
		<item>
			<title>Mark Miller on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-57</link>
			<pubDate>Tue, 30 Nov 2010 05:08:01 +0000</pubDate>
			<dc:creator>Mark Miller</dc:creator>
			<guid isPermaLink="false">57@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Hi Charlie -</p>
<p>I think there is some confusion here: this is what you will see in the next release, not what already exists in 1.5.</p>
<p>- Mark</p>]]></description>
					</item>
		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-56</link>
			<pubDate>Tue, 30 Nov 2010 02:57:12 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">56@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>&#62;- AFAIK, there are no warnings of the form "Doc failed:"+&#60;document-URI&#62;+&#60;nothing-else&#62;</p>
<p>&#62;It is possible that due to line folding the text after the document URI got wrapped onto the next line.</p>
<p>Nope, I'm afraid that the errors had no subsequent detail. The errors were shown exactly as I showed them at the top of this thread. Thanks for the further detail, but I think there's still an issue with the amount of detail reported by the crawler.</p>]]></description>
					</item>
		<item>
			<title>hwasu.kim on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-54</link>
			<pubDate>Mon, 29 Nov 2010 13:18:33 +0000</pubDate>
			<dc:creator>hwasu.kim</dc:creator>
			<guid isPermaLink="false">54@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Hi Charlie,<br />One of our engineers, Jack Krupansky, looked into this issue, and I'm learning a lot from his findings, which I'd like to share.</p>
<p>Here are the known "Doc Failed" warnings that the LWE crawler (Aperture) can output to the log: </p>
<p>1. Exception.</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Exception while crawling:&#60;document-URI&#62;&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;document-URI&#62; -cause:&#60;exception-cause-message&#62;</p>
<p>2. Out of memory.</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN File caused an Out of Memory Exception, skipping:&#60;document-URI&#62;&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;document-URI&#62;  - cause:&#60;OOM-exception-message&#62;</p>
<p>3. SubCrawlerException.</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;document-URI&#62;  - cause:&#60;exception-message&#62;</p>
<p>4. Unknown file type.</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed: Could not find extractor:&#60;document-URI&#62;</p>
<p>5. I/O error.</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN IO Exception processing:&#60;document-URI&#62;&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160;&#160; WARN Doc failed:&#60;document-URI&#62;  - cause:&#60;exception-message&#62;</p>
<p>6. HTML/XML/XHTML parsing errors:</p>
<p>&#160;&#160;&#160;&#160; WARN Doc failed:&#60;exception-with-stack-trace&#62;</p>
<p>&#160;&#160;&#160;&#160; WARN Doc failed:&#60;document-URI&#62; - cause:&#60;exception-cause-message&#62;</p>
<p>These are HTML syntax errors that browsers tend to ignore but that are treated as fatal by a code library the Aperture crawler uses.</p>
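<p>As a rough illustration of the warning shapes listed above, the Python sketch below sorts log lines into a few categories. The function names, category labels, and example.com sample lines are assumptions made for illustration; none of this is part of LWE or Aperture, and the exact spacing in real logs may differ.</p>

```python
import re
from collections import Counter

# Loosely matches any WARN line containing "Doc failed:" and captures the rest.
# Whitespace is matched loosely because the real log spacing is an assumption.
DOC_FAILED = re.compile(r"WARN\s+.*?Doc failed:\s*(.*)$")


def classify(line):
    """Return a rough category for one log line, or None if it is not a 'Doc failed' warning."""
    match = DOC_FAILED.search(line)
    if match is None:
        return None
    rest = match.group(1).strip()
    if rest.startswith("Could not find extractor:"):
        return "unknown file type"
    if re.search(r"-\s*cause:", rest):
        return "failed with cause"
    if "://" in rest and re.fullmatch(r"\S+", rest):
        # Only a URI follows "Doc failed:", with no further detail.
        return "no detail"
    return "other"


def summarize(lines):
    """Count how many lines fall into each category, ignoring unrelated lines."""
    return Counter(c for c in map(classify, lines) if c is not None)


# Hypothetical sample lines modeled on the formats above.
SAMPLE = [
    "WARN Doc failed: Could not find extractor:http://example.com/a.js",
    "WARN Doc failed:http://example.com/page  - cause:parse error",
    "WARN  handler.SolrApertureCallbackHandler  - Doc failed:  http://example.com/x",
    "INFO  handler.SolrApertureCallbackHandler  - new  http://example.com/x",
]

for category, count in sorted(summarize(SAMPLE).items()):
    print(category, count)
```

<p>A summary like this would make it easier to see whether a crawl is dominated by unknown file types or by parse failures, and whether any warnings arrive with no detail at all.</p>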
<p>- PDF files are notorious for causing exceptions in their processing, but that is primarily for file system crawls <br />- Type #4 is a good bet for web crawl failures. Typically they may be media files or "resources" intended for download but not necessarily the kinds of documents that would be indexed in a search engine. It is possible that the Lucid crawler simply doesn't recognize the file type for whatever reason.</p>
<p>- AFAIK, there are no warnings of the form "Doc failed:"+&#60;document-URI&#62;+&#60;nothing-else&#62;</p>
<p>It is possible that due to line folding the text after the document URI got wrapped onto the next line.</p>
<p>Specific to the site you were crawling: the site is actually XHTML rather than HTML. This means that Aperture uses a different parser: it would normally use HtmlParser for HTML, but for XML/XHTML the SAXParser is used. Jack crawled the site and did see lots of "Doc failed" warnings. They all seemed to be due to ill-formed HTML on the web pages. Browsers are tolerant of such errors, but unfortunately the SAXParser that our Aperture crawler uses is not: it requires well-formed markup and throws an exception if even one small error is encountered.<br />- Hwasu (credit to Jack!)</p>
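<p>The difference between a strict XML parser and a tolerant HTML parser can be sketched with Python's standard library. This is only an analogy: xml.sax and html.parser stand in for the Java SAXParser and HtmlParser mentioned above, and the page fragment and helper names are made up for illustration.</p>

```python
import html
import xml.sax
from html.parser import HTMLParser

# Hypothetical page fragment whose second paragraph is never closed. It is
# written with character references, as elsewhere in this post, and
# html.unescape turns them back into real angle brackets.
PAGE = html.unescape(
    "&#60;html&#62;&#60;body&#62;&#60;p&#62;first&#60;/p&#62;"
    "&#60;p&#62;second&#60;/body&#62;&#60;/html&#62;"
)


def strict_parse_ok(data):
    """Return True if a strict SAX parse accepts the markup, False if it raises."""
    try:
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False


class TextCollector(HTMLParser):
    """Tolerant parser that simply collects non-empty text chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def lenient_text(markup):
    """Extract text with the tolerant HTML parser, which does not raise on unclosed tags."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


print("strict parse accepted:", strict_parse_ok(PAGE.encode("utf-8")))
print("lenient text:", lenient_text(PAGE))
```

<p>The strict parser rejects the whole fragment because of one unclosed tag, while the tolerant parser still recovers the text, which mirrors the behaviour described for XHTML pages in this crawl.</p>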
<p>&#160;</p>]]></description>
					</item>
		<item>
			<title>Mark Miller on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-49</link>
			<pubDate>Sat, 27 Nov 2010 13:25:32 +0000</pubDate>
			<dc:creator>Mark Miller</dc:creator>
			<guid isPermaLink="false">49@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Hi Charlie,</p>
<p>Actually, our upcoming release of LWE will provide further information about why a file/URL could not be crawled. Hold tight!</p>
<p>- Mark</p>]]></description>
					</item>
		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-44</link>
			<pubDate>Fri, 19 Nov 2010 03:41:40 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">44@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Ah. That's really not great, as it means there is no way to fix files/URLs that have failed the crawl.</p>]]></description>
					</item>
		<item>
			<title>Cassandra Targett on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-41</link>
			<pubDate>Tue, 09 Nov 2010 16:20:51 +0000</pubDate>
			<dc:creator>Cassandra Targett</dc:creator>
			<guid isPermaLink="false">41@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Hi Charlie,</p>
<p>The web crawler that we're using is Aperture, which has some limitations in its ability to provide good feedback on problems found during the crawl. The cause of failure may be the type of content (.js files aren't supported, for example) or difficulties parsing the page. We are working to replace the Aperture crawler in a future version, and until then more robust error reporting is not possible.</p>
<p>Thanks,</p>
<p>Cassandra</p>]]></description>
					</item>
		<item>
			<title>charlie on "Crawler errors"</title>
			<link>http://forum.lucidworks.com/lwe/crawler-errors#post-37</link>
			<pubDate>Mon, 08 Nov 2010 06:49:23 +0000</pubDate>
			<dc:creator>charlie</dc:creator>
			<guid isPermaLink="false">37@http://forum.lucidworks.com/</guid>
			<description><![CDATA[<p>Hi,</p>
<p>I'm using the web spider in LWE to crawl a ColdFusion site, prior to demonstrating it to a customer. There are a lot of errors. Inspecting the logs (lucid.log.[datetime]) shows entries such as:</p>
<p>2010-11-05 13:18:38,857 INFO&#160; handler.SolrApertureCallbackHandler&#160; - accessingObject crawler: org.semanticdesktop.aperture.crawler.web.WebCrawler@ffc3fc url: <a href="http://www.somethingsomething/something" rel="nofollow">http://www.somethingsomething/something</a><br />2010-11-05 13:18:39,615 INFO&#160; handler.SolrApertureCallbackHandler&#160; - new  <a href="http://www.somethingsomething/something" rel="nofollow">http://www.somethingsomething/something</a><br />2010-11-05 13:18:39,620 WARN&#160; handler.SolrApertureCallbackHandler&#160; - Doc failed:  <a href="http://www.somethingsomething/something" rel="nofollow">http://www.somethingsomething/something</a></p>
<p>However there's no further information. How do I find out *why* the document failed (and tweak the crawl appropriately)?</p>]]></description>
					</item>

	</channel>
</rss>
