Friday, 1 February 2013

Using Sam Spade to Parse a Website


One of the many functions that Sam Spade performs is scraping, or
crawling, a website. It parses the HTML code looking for characters,
words and string patterns specified by the user. The search options
are versatile and include:


• Mirror the website to the local hard disk. This makes a manual
inspection of the website easier. Note, however, that some website
administrators can detect such an action and may disapprove of it.
• Apply a general filter to the pages that are parsed, e.g. only .html,
.asp and .txt pages. The website may contain types of pages or files
that do not offer any searchable information, so only those that can
be successfully parsed are let through the filter.
• Parse the website for e-mail addresses. The author may have written
an e-mail address in the comment fields or as part of the contact
information. This can potentially provide a username for future login
attempts and brute-force attacks.
• Search for images on this or other servers. If the website is mirrored
elsewhere, whether on this or another server, these alternate sources
may hold additional information, or a different server may have a
badly configured security policy.
• Search for links to this or other servers. The links can provide
information relating to the location of significant files and
directories on a server.
• Parse for hidden form values. Some pages use predefined values
and constants for such tasks as default authentication to other
servers. Obviously, these are not displayed on a web page but can be
found by searching through the HTML source.
• Parse for a user-defined text string. This option includes the ability
to specify a string exactly or use a combination of wildcards and
user-definable character sets.
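
Several of the searches listed above (e-mail addresses, links, images
and hidden form values) can be illustrated with a short Python sketch.
This is not how Sam Spade itself is implemented, which the post does not
describe; it is only an approximation using Python's standard library.
The names PageScanner, scan, EMAIL_RE and SAMPLE, the sample page and
the simplified e-mail pattern are all assumptions made for illustration.

```python
import re
from html.parser import HTMLParser

# Simplified e-mail pattern (an assumption; real addresses can be more exotic).
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


class PageScanner(HTMLParser):
    """Hypothetical helper: collects links, image sources and hidden form fields."""

    def __init__(self):
        super().__init__()
        self.links = []    # href values of <a> tags
        self.images = []   # src values of <img> tags
        self.hidden = {}   # name -> value of <input type="hidden">

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "a" and attributes.get("href"):
            self.links.append(attributes["href"])
        elif tag == "img" and attributes.get("src"):
            self.images.append(attributes["src"])
        elif tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            self.hidden[attributes.get("name")] = attributes.get("value")


def scan(html):
    """Scan one page of HTML source and return what was found."""
    scanner = PageScanner()
    scanner.feed(html)
    scanner.close()
    # E-mail addresses are searched in the raw source so that HTML
    # comments are covered as well as visible text.
    emails = sorted(set(EMAIL_RE.findall(html)))
    return {
        "emails": emails,
        "links": scanner.links,
        "images": scanner.images,
        "hidden": scanner.hidden,
    }


# Made-up sample page, used only to exercise the sketch.
SAMPLE = """<html><body>
<!-- contact: admin@example.com -->
<a href="/files/index.html">Files</a>
<img src="http://img.example.org/logo.png">
<form><input type="hidden" name="auth" value="default"></form>
</body></html>"""

if __name__ == "__main__":
    for kind, found in scan(SAMPLE).items():
        print(kind, found)
```

The sketch works on HTML source that has already been retrieved, so it
covers only the parsing step and not the crawling or mirroring of a
site.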

