Tuesday, August 11, 2009

Getting started - The Crawler

OK, first off, let's introduce this baby, which aims to be the first web-based app scanner to address the various issues and missing features found in other similar tools.

It is meant to be the Next Generation Web Scanner because of its new features, so keep reading the coming posts.

NOTE: The main purpose of the creation of this tool is to learn the guts of this kind of testing.

So, the name of this baby is NeZa. Why? I will explain in coming posts.

OK, basically the big picture consists of 4 parts:

  1. A crawler: To give the "carnita" (meat) to our attack engine.
  2. An attack engine: To send different kinds of attacks to the web app and analyze the responses to identify potential flaws.
  3. Defect management: We found a bug, then what? Insert the bug into the QA process until it is closed, and help:
     • The developer, by preparing a technical description of the bug.
     • The tester, to reproduce the defect easily (with a right-click) to eliminate false positives.
     • The business, to prepare non-technical, self-explanatory proofs of concept.
     • Developers, by pointing out the line of code that needs to be fixed. This means combining dynamic and static analysis. No tool actually does this; we will try.
  4. Next Generation Features!

Before starting with our crawler design, make a note of the development environment I will be using so that you can copy the code explained here into your own setup without problems.

Environment Details:

IDE : MyEclipse
Lang: Java
Web Server/App: Tomcat 6.0
Technology used:
  • Apache Struts
  • Dojo for Ajax
  • Castor for XML marshalling

Getting started with Crawler

Goal: Automatically identify all URLs related to a specific web site.

I used the public code of a multi-threaded web crawler by Andreas Hess that I found on the Internet, which you can find here:


The crawler uses a ThreadController that tracks the execution of each running thread. I like this because I am planning to use it to report, via AJAX, the status of the URLs being identified while showing them in the browser. But that is part of future features.

Main steps of the Crawler:

1. Put Starting URL in the queue

// Add the starting URL to the queue (level 0)
URLQueue q = new URLQueue();
q.push(startURL, 0);

// Set maxLevel (how many levels of sublinks to analyze)
// and the number of threads to use
new Crawler(q, _maxLevel, _maxThread);
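To make the push(url, level) idea above more concrete, here is a minimal sketch of a level-aware queue. This is not Andreas Hess's URLQueue: the class name LevelQueue, the "seen" set and the one-queue-per-depth layout are my own assumptions for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for URLQueue: one FIFO per crawl depth, plus a
// "seen" set so the same URL is never queued twice.
final class LevelQueue {
    private final List<Deque<String>> levels = new ArrayList<>();
    private final Set<String> seen = new HashSet<>();

    // Queues the URL at the given depth; returns false if the depth is
    // negative or the URL was already queued before.
    synchronized boolean push(String url, int level) {
        if (level < 0 || !seen.add(url)) {
            return false;
        }
        while (levels.size() <= level) {
            levels.add(new ArrayDeque<>());
        }
        levels.get(level).addLast(url);
        return true;
    }

    // Returns the next URL at the given depth, or null when that depth is empty.
    synchronized String pop(int level) {
        if (level < 0 || level >= levels.size()) {
            return null;
        }
        return levels.get(level).pollFirst();
    }
}
```

Both methods are synchronized because several crawler threads would share the queue. Returning null from pop matches the way the worker loop in step 2 detects that a level is finished.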

2. Get first URL from queue

// We get the first URL from the queue to start analyzing it,
// and keep going until the queue for this level is empty.
for (Object newTask = queue.pop(level);
        newTask != null;
        newTask = queue.pop(level)) {
    // Tell the message receiver what we're doing now
    mr.receiveMessage(newTask, id);
    // Process the newTask
    process(newTask, queue.getHostname());
}
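The loop above runs inside every crawler thread. As a rough sketch of that pattern (not the actual ThreadController code), the following starts a few threads that drain a shared task queue and report each task to a listener first. The names StatusListener and WorkerDemo are hypothetical, and the "processing" is only a placeholder.

```java
import java.util.Collection;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical callback, playing the role of the message receiver (mr) above.
interface StatusListener {
    void onTask(String task, int workerId);
}

final class WorkerDemo {
    // Starts `workers` threads that drain the tasks, reporting each task to
    // the listener before "processing" it, and waits for all of them.
    static List<String> drain(Collection<String> initialTasks, int workers,
                              StatusListener listener) {
        Queue<String> tasks = new ConcurrentLinkedQueue<>(initialTasks);
        List<String> processed = new CopyOnWriteArrayList<>();
        Thread[] threads = new Thread[workers];
        for (int id = 0; id < workers; id++) {
            final int workerId = id;
            threads[id] = new Thread(() -> {
                // Same shape as the loop above: pop until the queue is empty.
                for (String task = tasks.poll(); task != null; task = tasks.poll()) {
                    listener.onTask(task, workerId);
                    processed.add(task); // placeholder for process(task, hostname)
                }
            });
            threads[id].start();
        }
        for (Thread t : threads) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return processed;
    }
}
```

Note that in this sketch a thread stops as soon as it sees an empty queue. That is fine for a fixed list of tasks, but a real crawler keeps adding links while it runs, so it needs something that knows when all threads are done with a level.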

3. Save Document from URL. We call getURL to open a connection to the web site and get the content.

/**
 * Writes the contents of the URL to a string by calling saveURL with a
 * StringWriter as argument.
 * Note: this relies on a saveURL(URL, Writer) overload that is not shown here.
 */
public static String getURL(URL url)
        throws IOException {
    StringWriter sw = new StringWriter();
    saveURL(url, sw);
    return sw.toString();
}

/**
 * Opens a stream on the URL and copies the contents to the OutputStream
 * through a 1 MB buffer.
 */
public static void saveURL(URL url, OutputStream os)
        throws IOException {
    InputStream is = url.openStream();
    try {
        byte[] buf = new byte[1048576];
        int n = is.read(buf);
        while (n != -1) {
            os.write(buf, 0, n);
            n = is.read(buf);
        }
    } finally {
        is.close(); // always release the connection
    }
}
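For comparison, here is a small self-contained sketch of the same "read everything from the URL into a string" idea. It is not the original code: the class name PageFetcher is hypothetical, and the explicit charset parameter is my own assumption, added because bytes have to be decoded somehow before links can be extracted from the text.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.nio.charset.Charset;

// Hypothetical helper, not part of the original crawler.
final class PageFetcher {
    // Reads everything the URL returns and decodes it with the given charset.
    static String fetch(URL url, Charset charset) throws IOException {
        // try-with-resources closes the stream even if reading fails
        try (InputStream in = url.openStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            for (int n = in.read(buf); n != -1; n = in.read(buf)) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), charset);
        }
    }
}
```

A call would look like PageFetcher.fetch(someUrl, StandardCharsets.UTF_8). A real crawler would probably want to read the charset from the response headers instead of hard-coding it.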

4. Extract links from document. Here I added support for extracting links from area, frame and iframe tags.

We call this function: saveURL.extractLinks.

public static Vector extractLinks(String rawPage, String page) {

    final int ROWS = 4;
    final int COLS = 2;

    Vector links = new Vector();
    String[][] tags = new String[ROWS][COLS];

    // Getting links via a, area, frame and iframe tags
    tags[0][0] = "<a ";      tags[0][1] = "href";
    tags[1][0] = "<area ";   tags[1][1] = "href";
    tags[2][0] = "<frame ";  tags[2][1] = "src";
    tags[3][0] = "<iframe "; tags[3][1] = "src";
    int index = 0;

    String strLink = "";
    String remaining = "";
    StringTokenizer st;

    for (int i = 0; i < ROWS; i++) {
        index = 0;
        while ((index = page.indexOf(tags[i][0], index)) != -1) {

            if ((index = page.indexOf(tags[i][1], index)) == -1) break;
            if ((index = page.indexOf("=", index)) == -1) break;

            remaining = rawPage.substring(++index).replaceAll("^\\s+", "");
            st = new StringTokenizer(remaining, "\t\n\r\"'>#");
            if (!st.hasMoreTokens()) break;
            strLink = st.nextToken();
            // Skip mail links. Checking the extracted link (instead of searching
            // the rest of the page for "mailto") keeps later links from being dropped.
            if (strLink.toLowerCase().startsWith("mailto")) continue;
            if (!links.contains(strLink)) links.add(strLink);
        }
    }
    return links;
}
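The same extraction can also be sketched with a regular expression instead of indexOf and StringTokenizer. To be clear, this is a different technique from the one used above, and the class name LinkExtractor is hypothetical. It only handles quoted attribute values and is not a real HTML parser.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical regex-based alternative to extractLinks.
final class LinkExtractor {
    // Matches href on <a>/<area> and src on <frame>/<iframe>, with the
    // attribute value in single or double quotes (captured in group 2).
    private static final Pattern LINK = Pattern.compile(
        "<(?:(?:a|area)\\s[^>]*?\\bhref|i?frame\\s[^>]*?\\bsrc)\\s*=\\s*([\"'])(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the distinct links in document order, without #fragments
    // and without mailto: links.
    static List<String> extract(String html) {
        Set<String> links = new LinkedHashSet<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            String link = m.group(2).trim();
            int hash = link.indexOf('#');
            if (hash >= 0) {
                link = link.substring(0, hash);
            }
            if (link.isEmpty() || link.regionMatches(true, 0, "mailto:", 0, 7)) {
                continue;
            }
            links.add(link);
        }
        return new ArrayList<>(links);
    }
}
```

Because the pattern is case-insensitive, a single page string is enough here, so there is no need to pass both a raw and a normalized copy of the page.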

I also added a validation to make sure that only links belonging to the domain being analyzed are added. In other words, if you scan an app and it contains a link to google.com, that link should not be added, right?

5. Add Linked URLS to the Queue

We iterate through link vector.

NOTE: The newly identified links are added to a second queue (level + 1). The crawler only uses 2 queues: one for the current links and one for the next links identified.

for (int n = 0; n < links.size(); n++) {
    try {
        // URLs might be relative to the current page
        Pattern p = Pattern.compile("^(https?|ftp)");

        Matcher m = p.matcher(links.elementAt(n).toString());
        if (m.find()) {
            URL hostLink = new URL((String) links.elementAt(n));

            // The host of an absolute URL needs to be the same as the one in the starting URL
            if (_hostname.equals(hostLink.getHost())) {
                link = new URL(pageURL, (String) links.elementAt(n));
                queue.push(link, level + 1);
            }
        } else {
            // Relative link: resolve it against the current page
            link = new URL(pageURL, (String) links.elementAt(n));
            queue.push(link, level + 1);
        }
    } catch (MalformedURLException e) {
        // Ignore malformed URLs, the link extractor might have failed.
    }
}
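The two checks in this step (resolve relative links, keep only links on the same host) can also be written as one small function. The sketch below uses java.net.URI instead of the URL class used above, and the name LinkFilter is hypothetical, so treat it as an illustration rather than the crawler's actual code.

```java
import java.net.URI;

// Hypothetical helper combining link resolution and the same-host check.
final class LinkFilter {
    // Resolves href against the page it was found on and returns the result
    // only if it is an http(s) link on the same host; otherwise returns null.
    static URI resolveSameHost(URI pageUri, String href) {
        URI link;
        try {
            link = pageUri.resolve(href.trim());
        } catch (IllegalArgumentException e) {
            return null; // malformed link, the extractor might have failed
        }
        String scheme = link.getScheme();
        boolean web = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
        String host = pageUri.getHost();
        boolean sameHost = host != null && host.equalsIgnoreCase(link.getHost());
        return (web && sameHost) ? link : null;
    }
}
```

With a page at http://example.com/a/b.html, the link page2.html resolves to http://example.com/a/page2.html and is kept, while http://google.com/ and mailto: links come back as null and would not be pushed to the queue.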

So, our Crawler is running.

Now the next step is to support authentication, so that whenever the crawler finds a login page it can authenticate automatically and keep identifying new URLs (links).

There are 3 steps to do next:

1. Prepare a configuration settings page where we can give the scanner the login URL and the credentials to use while crawling.
2. Identify the cookies sent by the web application so that we can keep the session alive.
3. Add this feature to the crawler so that it authenticates automatically while crawling.
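As a first idea for step 2, here is a minimal sketch of remembering the cookies a web application sends and replaying them on later requests. It uses the standard java.net.HttpCookie parser. The class name SessionCookies is hypothetical, and the sketch ignores cookie domain, path and expiry, which a real implementation would have to respect.

```java
import java.net.HttpCookie;
import java.util.ArrayList;
import java.util.List;

// Hypothetical session store: keeps the latest value of each cookie by name.
final class SessionCookies {
    private final List<HttpCookie> cookies = new ArrayList<>();

    // Records every cookie found in one Set-Cookie response header value.
    void remember(String setCookieHeader) {
        for (HttpCookie c : HttpCookie.parse(setCookieHeader)) {
            // A cookie with the same name replaces the older one.
            cookies.removeIf(old -> old.getName().equals(c.getName()));
            cookies.add(c);
        }
    }

    // Builds the value of the Cookie request header, e.g. "A=1; B=2".
    String cookieHeader() {
        StringBuilder sb = new StringBuilder();
        for (HttpCookie c : cookies) {
            if (sb.length() > 0) {
                sb.append("; ");
            }
            sb.append(c.getName()).append('=').append(c.getValue());
        }
        return sb.toString();
    }
}
```

The crawler would call remember for each Set-Cookie header of a response and send cookieHeader() as the Cookie header of the next request, so a session cookie such as JSESSIONID survives between pages.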

