
Crawler errors

(11 posts) (4 voices)
  • Started 2 years ago by charlie
  • Latest reply from charlie

Tags:

  • web crawler
  1. charlie
    Member

    Hi,

    I'm using the web spider in LWE to crawl a ColdFusion site, prior to demonstrating to a customer. There are a lot of errors: inspecting the logs (lucid.log.[datetime]) gives errors such as:

    2010-11-05 13:18:38,857 INFO  handler.SolrApertureCallbackHandler  - accessingObject crawler: org.semanticdesktop.aperture.crawler.web.WebCrawler@ffc3fc url: http://www.somethingsomething/something
    2010-11-05 13:18:39,615 INFO  handler.SolrApertureCallbackHandler  - new http://www.somethingsomething/something
    2010-11-05 13:18:39,620 WARN  handler.SolrApertureCallbackHandler  - Doc failed: http://www.somethingsomething/something

    However there's no further information. How do I find out *why* the document failed (and tweak the crawl appropriately)?

    Posted 2 years ago #
  2. Cassandra Targett

    Hi Charlie,

    The web crawler that we're using is Aperture, which has some limitations in its ability to provide good feedback on problems found during the crawl.  The cause of failure may be the type of content (.js files aren't supported, for example) or difficulties parsing the page.  We are working to replace the Aperture crawler in a future version, and until then more robust error reporting is not possible.

    Thanks,

    Cassandra

    Posted 2 years ago #
  3. charlie
    Member

    Ah. That's really not great, as it means there is no way to fix files/URLs that have failed the crawl.

    Posted 2 years ago #
  4. Mark Miller
    Moderator

    Hi Charlie,

    Actually, our upcoming release of LWE will provide further information about why a file/URL could not be crawled. Hold tight!

    - Mark

    Posted 2 years ago #
  5. hwasu.kim
    Moderator

    Hi Charlie,
    One of our engineers, Jack Krupansky, looked into this issue, and I'm learning a lot from his findings, which I'd like to share.

    Here are the known "Doc Failed" warnings that the LWE crawler (Aperture) can output to the log:

    1. Exception.

          WARN Exception while crawling:<document-URI><exception-with-stack-trace>

          WARN Doc failed:<exception-with-stack-trace>

          WARN Doc failed:<document-URI> - cause:<exception-cause-message>

    2. Out of memory.

          WARN File caused an Out of Memory Exception, skipping:<document-URI><exception-with-stack-trace>

          WARN Doc failed:<exception-with-stack-trace>

          WARN Doc failed:<document-URI> - cause:<OOM-exception-message>

    3. SubCrawlerException.

          WARN Doc failed:<exception-with-stack-trace>

          WARN Doc failed:<document-URI> - cause:<exception-message>

    4. Unknown file type.

          WARN Doc failed: Could not find extractor:<document-URI>

    5. I/O error.

          WARN IO Exception processing:<document-URI><exception-with-stack-trace>

          WARN Doc failed:<exception-with-stack-trace>

          WARN Doc failed:<document-URI> - cause:<exception-message>

    6. HTML/XML/XHTML parsing errors:

         WARN Doc failed:<exception-with-stack-trace>

         WARN Doc failed:<document-URI> - cause:<exception-cause-message>

    These are HTML syntax errors that browsers tend to ignore, but that are treated as fatal errors by a code library that the Aperture crawler uses.

    - PDF files are notorious for causing exceptions during processing, but that applies mainly to file system crawls.
    - Type #4 is a good bet for web crawl failures. Typically they may be media files or "resources" intended for download but not necessarily the kinds of documents that would be indexed in a search engine. It is possible that the Lucid crawler simply doesn't recognize the file type for whatever reason.

    - AFAIK, there are no warnings of the form "Doc failed:"+<document-URI>+<nothing-else>

    It is possible that due to line folding the text after the document URI got wrapped onto the next line.
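
    To check a log against the warning shapes listed above, a small script can sort each "Doc failed" line into a bucket. This is a minimal sketch in Python, not part of LWE: the function and category names are made up for illustration, and it assumes the texts "Could not find extractor" and "- cause:" appear literally in the log, as written in the list above.

```python
import re
from collections import Counter

# Matches a WARN line containing "Doc failed:" and captures what follows it.
DOC_FAILED = re.compile(r"WARN\s+.*?Doc failed:\s*(?P<rest>.*)$")


def classify(line):
    """Return a bucket name for a 'Doc failed' line, or None for any other line."""
    match = DOC_FAILED.search(line)
    if match is None:
        return None
    rest = match.group("rest").strip()
    if rest.startswith("Could not find extractor"):
        return "unknown-file-type"   # shape 4 in the list above
    if " - cause:" in rest or " -cause:" in rest:
        return "has-cause"           # URI followed by a cause message
    if "://" in rest and re.fullmatch(r"\S+", rest):
        return "bare-uri"            # URI with nothing after it
    return "other"                   # e.g. a stack trace


def summarize(lines):
    """Count 'Doc failed' lines per bucket."""
    counts = Counter()
    for line in lines:
        bucket = classify(line)
        if bucket is not None:
            counts[bucket] += 1
    return counts
```

    A line that lands in "bare-uri" is the case reported at the top of this thread: a URI with no cause after it. Passing an open log file to summarize() gives a quick count of how many failures carry no detail.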

    Specific to the site you were crawling: the site is actually XHTML rather than HTML, which means that Aperture uses a different parser. It would normally use HtmlParser for HTML, but for XML/XHTML the SAXParser is used. Jack crawled the site and did see lots of "Doc failed" warnings, all apparently caused by malformed markup on the web pages. Browsers tolerate such errors, but the SAXParser that our Aperture crawler uses requires well-formed markup and throws an exception if even one small error is encountered.
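
    The difference between a strict XML parser and a tolerant HTML parser can be reproduced outside the crawler. The sketch below uses Python's standard library as a stand-in: xml.etree.ElementTree plays the role of the strict parser and html.parser.HTMLParser the tolerant one. These are not the parsers Aperture uses, and the helper names and sample markup are invented for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Tolerant parser: records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_parse_ok(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def tolerant_tags(markup):
    """Return the start tags found by the tolerant parser."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


# <p> and <br> are never closed: fine for a browser, fatal for an XML parser.
BROKEN = "<html><body><p>Unclosed paragraph<br></body></html>"
WELL_FORMED = "<html><body><p>Closed paragraph<br/></p></body></html>"
```

    Here strict_parse_ok(BROKEN) returns False while tolerant_tags(BROKEN) still walks the whole page, which mirrors the behaviour described above: one unclosed tag is enough to fail the document under a strict parser.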
    - Hwasu (credit to Jack!)

     

    Posted 2 years ago #
  6. charlie
    Member

    >- AFAIK, there are no warnings of the form "Doc failed:"+<document-URI>+<nothing-else>

    >It is possible that due to line folding the text after the document URI got wrapped onto the next line.

    Nope, I'm afraid that the errors had no subsequent detail. The errors were shown exactly as I showed them at the top of this thread. Thanks for the further detail, but I think there's still an issue with the amount of detail reported by the crawler.

    Posted 2 years ago #
  7. Mark Miller
    Moderator

    Hi Charlie -

    I think there is some confusion here: the warnings listed above are what you will see in the next release, not what already existed in 1.5.

    - Mark

    Posted 2 years ago #
  8. charlie
    Member

    Is there a timetable for the next release of LWE?

    Posted 2 years ago #
  9. Mark Miller
    Moderator

    It's available now: http://www.lucidimagination.com/lwe/download

    Keep in mind that there is no upgrade path from the last developer release currently - start from a new install.

    Posted 2 years ago #
  10. charlie
    Member

    I've been working with the new version and I'm glad to see more detailed crawler errors being reported.

    Posted 2 years ago #
  11. charlie
    Member

    I also wondered whether http://home.ccil.org/~cowan/XML/tagsoup/ might be useful to clean up XHTML before it was indexed by Aperture. In our experience (and we've built a lot of web crawlers over the years!) web content is never, ever clean....
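
    The clean-up idea can be illustrated with a toy repair pass. The sketch below is Python using only the standard library; it is not TagSoup and not anything LWE does, and the Repairer class, the VOID set, and the repair() helper are hypothetical names. It closes unclosed elements and escapes bare ampersands so that a strict XML parser accepts the result. Comments, doctypes, and anything outside a single root element are not handled.

```python
from html import escape
from html.parser import HTMLParser

# Elements that never have a closing tag in HTML (partial list, for illustration).
VOID = {"br", "hr", "img", "input", "meta", "link"}


class Repairer(HTMLParser):
    """Re-emit tolerant-parsed HTML as well-formed XML text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # output fragments
        self.stack = []  # currently open elements

    def handle_starttag(self, tag, attrs):
        rendered = ""
        for name, value in attrs:
            rendered += ' %s="%s"' % (name, escape(value or "", quote=True))
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, rendered))
        else:
            self.out.append("<%s%s>" % (tag, rendered))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.stack:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.stack:  # close whatever is still open at end of input
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def repair(markup):
    """Return a well-formed rendering of sloppy HTML (toy version)."""
    repairer = Repairer()
    repairer.feed(markup)
    return repairer.result()
```

    The output of repair() can be fed to a strict parser such as xml.etree.ElementTree.fromstring() to confirm that it now parses. A real crawl would need a proper library for this, since real pages break in far more ways than this sketch covers.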

    Posted 2 years ago #
