Tuesday, August 11, 2009

Getting started - The Crawler

OK, first off, let's introduce this baby: it will be the first web-based app scanner that tries to resolve the different issues and missing features found in other similar tools.

It aims to be the Next Generation Web Scanner because of its new features, so keep reading the coming posts.

NOTE: The main purpose of the creation of this tool is to learn the guts of this kind of testing.

So, the name of this baby is NeZa. Why? I will explain in coming posts.

OK, basically the big picture consists of 4 parts:

  1. A crawler: to give the "carnita" (meat) to our attack engine.
  2. An attack engine: to send different kinds of attacks to the web app and analyze the responses to identify potential flaws.
  3. Defect management: we got a bug, then what? Insert the bug into the QA process until it is closed, and help:
     • The developer, by preparing a technical description of the bug.
     • The tester, to reproduce the defect easily (in a right-click way) to eliminate false positives.
     • Business, to prepare non-technical, self-explanatory proofs of concept.
     • Developers, by pointing out the line of code that needs to be fixed. This means combining dynamic and static analysis. As far as I know, no tool actually does this today; we will try.
  4. Next Generation Features!

Before starting with our crawler design, take note of the development environment I will be using, so that you can copy the code explained here into your own setup without problems.

Environment Details:

IDE : MyEclipse
Lang: Java
Web Server/App: Tomcat 6.0
Technology used:
Apache Struts
Dojo for Ajax
Castor for XML marshalling

Getting started with Crawler

Goal: Identify, in an automated way, all URLs related to a specific web site.

I used the public code of a multi-threaded web crawler by Andreas Hess that I found on the Internet, which you can find here:


The crawler uses a ThreadController, which tracks the execution of each running thread. I liked that because I am planning to use this feature to report, via AJAX, the status of the URLs being identified while showing them in the browser. But this is part of future features.
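To give an idea of how per-thread status could be collected for a later AJAX poll, here is a minimal sketch. It is my own illustration and not code from the crawler: the class CrawlStatusBoard and its snapshot method are hypothetical, and receiveMessage only mirrors the shape of the mr.receiveMessage(newTask, id) call that appears in step 2 below.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical status collector; the receiver in the real crawler may differ.
class CrawlStatusBoard {
    private final List<String> messages =
            Collections.synchronizedList(new ArrayList<String>());

    // Called by a crawler thread each time it starts working on a task (URL).
    void receiveMessage(Object task, int threadId) {
        messages.add("thread " + threadId + ": " + task);
    }

    // Returns a copy of everything reported so far, e.g. to answer an AJAX poll.
    List<String> snapshot() {
        synchronized (messages) {
            return new ArrayList<String>(messages);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final CrawlStatusBoard board = new CrawlStatusBoard();
        Thread[] workers = new Thread[2];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    board.receiveMessage("http://example.com/page" + id, id);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join(); // wait until both workers have reported
        }
        System.out.println(board.snapshot());
    }
}
```

A servlet or Struts action could then serialize snapshot() for the browser; that part is left out here.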

Main steps of the Crawler:

1. Put Starting URL in the queue

URLQueue q = new URLQueue();
// Add the starting URL to the queue
q.push(startURL, 0);

// Set maxLevel (levels of URL sublinks to analyze)
// and the number of threads to use
new Crawler(q, _maxLevel, _maxThread);

2. Get first URL from queue

// We get the first URL from the queue to start analyzing it.
for (Object newTask = queue.pop(level);
     newTask != null;
     newTask = queue.pop(level)) {
    // Tell the message receiver what we're doing now
    mr.receiveMessage(newTask, id);
    // Process the newTask
    process(newTask, queue.getHostname());
}

3. Save the document from the URL. We call getURL to open a connection to the web site and get the content.

/**
 * Writes the contents of the url to a string by calling saveURL with a
 * string writer as argument
 */
public static String getURL(URL url)
        throws IOException {
    StringWriter sw = new StringWriter();
    // This call needs a saveURL(URL, Writer) overload, which is not shown in this post
    saveURL(url, sw);
    return sw.toString();
}

/**
 * Opens a buffered stream on the url and copies the contents to OutputStream
 */
public static void saveURL(URL url, OutputStream os)
        throws IOException {
    InputStream is = url.openStream();

    byte[] buf = new byte[1048576];
    int n = is.read(buf);
    while (n != -1) {
        os.write(buf, 0, n);
        n = is.read(buf);
    }
}

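As a rough sketch of what a character-based copy loop can look like, the following reads any InputStream into a String and closes it when done. This is my own illustration, not code from Andreas Hess's crawler: the class name PageReader, the method readAll and the choice of passing the charset name explicitly are all assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringWriter;

// Hypothetical helper, not part of the original crawler code.
class PageReader {

    // Copies everything from the stream into a String, decoding the bytes
    // with the given charset name (for example "UTF-8").
    static String readAll(InputStream in, String charsetName) throws IOException {
        StringWriter sw = new StringWriter();
        Reader reader = new InputStreamReader(in, charsetName);
        try {
            char[] buf = new char[8192];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sw.write(buf, 0, n);
            }
        } finally {
            reader.close(); // also closes the underlying stream
        }
        return sw.toString();
    }

    public static void main(String[] args) throws IOException {
        // An in-memory stream stands in for url.openStream() here.
        byte[] html = "<html><a href=\"/home\">Home</a></html>".getBytes("UTF-8");
        System.out.println(readAll(new ByteArrayInputStream(html), "UTF-8"));
    }
}
```

In the crawler, the stream would come from url.openStream(), and the charset would ideally be taken from the response headers instead of being hard-coded.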
4. Extract links from the document. Here I added support for extracting links from area, frame and iframe tags.

We call this function: saveURL.extractLinks.

public static Vector extractLinks(String rawPage, String page) {

    final int ROWS = 4;
    final int COLS = 2;

    Vector links = new Vector();
    String[][] tags = new String[ROWS][COLS];

    // Getting links via href, area, frames and iframes
    tags[0][0] = "<a ";      tags[0][1] = "href";
    tags[1][0] = "<area ";   tags[1][1] = "href";
    tags[2][0] = "<frame ";  tags[2][1] = "src";
    tags[3][0] = "<iframe "; tags[3][1] = "src";

    int index = 0;
    String strLink = "";
    String remaining = "";
    StringTokenizer st;

    for (int i = 0; i < ROWS; i++) {
        index = 0;
        while ((index = page.indexOf(tags[i][0], index)) != -1) {

            if ((index = page.indexOf(tags[i][1], index)) == -1) break;
            if ((index = page.indexOf("=", index)) == -1) break;

            remaining = rawPage.substring(++index).replaceAll("^\\s+", "");
            st = new StringTokenizer(remaining, "\t\n\r\"'>#");
            strLink = st.nextToken();

            // Skip mail links. The check is done on the extracted link: searching
            // the rest of the page for "mailto" would stop the scan for this tag
            // as soon as any mail link appears further down the page.
            if (strLink.toLowerCase().startsWith("mailto:")) continue;

            if (!links.contains(strLink)) links.add(strLink);
        }
    }
    return links;
}

I also created a validation to make sure that only links belonging to the domain being analyzed are added. That is, if you scan an app and there is a link to google.com, you should not add that link, right?
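The idea behind that validation can be sketched on its own as follows. This is only an illustration under my own assumptions: the class SameHostFilter and the method resolveIfSameHost are hypothetical names, and comparing hosts case-insensitively is my choice, not necessarily what the crawler does.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper for illustration; the names are not from the crawler.
class SameHostFilter {

    // Resolves a link (absolute or relative) against the page it was found on
    // and returns it only when it points at the expected host; otherwise null.
    static URL resolveIfSameHost(URL pageURL, String link, String hostname) {
        try {
            URL resolved = new URL(pageURL, link);
            return hostname.equalsIgnoreCase(resolved.getHost()) ? resolved : null;
        } catch (MalformedURLException e) {
            return null; // the link extractor may have produced garbage
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://example.com/app/index.html");
        // Relative link: resolved against the current page, same host, kept.
        System.out.println(resolveIfSameHost(page, "page2.html", "example.com"));
        // Absolute link to another host: rejected.
        System.out.println(resolveIfSameHost(page, "http://www.google.com/", "example.com"));
    }
}
```

Resolving first and comparing hosts afterwards means relative and absolute links go through the same check.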

5. Add linked URLs to the queue

We iterate through the link vector.

NOTE: The newly identified links are added to a second queue (level + 1). The crawler only uses 2 queues: one for the current links and one for the next links identified.

// Absolute URLs start with a scheme; anything else is relative to the current page
Pattern p = Pattern.compile("^(https?|ftp):");

for (int n = 0; n < links.size(); n++) {
    try {
        Matcher m = p.matcher(links.elementAt(n).toString());
        if (m.find()) {
            URL hostLink = new URL((String) links.elementAt(n));

            // The host needs to be the same as the one in the starting URL
            if (_hostname.equals(hostLink.getHost())) {
                link = new URL(pageURL, (String) links.elementAt(n));
                queue.push(link, level + 1);
            }
        } else {
            // Relative link: resolve it against the current page
            link = new URL(pageURL, (String) links.elementAt(n));
            queue.push(link, level + 1);
        }
    } catch (MalformedURLException e) {
        // Ignore malformed URLs, the link extractor might have failed.
    }
}
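To make the two-queue idea from the NOTE above more concrete, here is a simplified sketch of a queue that only holds the current level and the next one. It is a stand-in written for this post, not the crawler's actual URLQueue: the class TwoLevelQueue, its method names and the duplicate check via a Set are assumptions.

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Set;

// Hypothetical, simplified stand-in for the crawler's URLQueue.
class TwoLevelQueue {
    private LinkedList<String> current = new LinkedList<String>();
    private LinkedList<String> next = new LinkedList<String>();
    private final Set<String> seen = new HashSet<String>();
    private int currentLevel = 0;

    // Accepts a URL for the current level or the one after it.
    // Any other level, and any URL seen before, is rejected.
    synchronized boolean push(String url, int level) {
        if (level != currentLevel && level != currentLevel + 1) return false;
        if (!seen.add(url)) return false;
        (level == currentLevel ? current : next).add(url);
        return true;
    }

    // Returns the next URL of the current level, or null when it is drained.
    synchronized String pop() {
        return current.isEmpty() ? null : current.removeFirst();
    }

    // Once the current level is drained, the "next" queue becomes "current".
    synchronized boolean advance() {
        if (!current.isEmpty() || next.isEmpty()) return false;
        current = next;
        next = new LinkedList<String>();
        currentLevel++;
        return true;
    }

    synchronized int level() {
        return currentLevel;
    }

    public static void main(String[] args) {
        TwoLevelQueue q = new TwoLevelQueue();
        q.push("http://example.com/", 0);
        do {
            for (String url = q.pop(); url != null; url = q.pop()) {
                System.out.println("level " + q.level() + ": " + url);
                // Pretend the start page links to two sub pages.
                if (q.level() == 0) {
                    q.push("http://example.com/a", q.level() + 1);
                    q.push("http://example.com/b", q.level() + 1);
                }
            }
        } while (q.advance());
    }
}
```

Because links found at level N can only belong to level N + 1, two lists are enough no matter how deep the crawl goes.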

So, our Crawler is running.

Now the next step is to also support authentication, so that whenever the crawler finds a login page, it can authenticate automatically and keep identifying new URLs (links).

There are 3 steps to do next:

1. Prepare a configuration settings page where we can give the scanner the login URL and the credentials to use while crawling.
2. Identify the cookies sent by the web application so that we can keep the session alive.
3. Add this feature to the crawler so that it authenticates automatically while crawling.
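For step 2, a minimal sketch of remembering session cookies and sending them back could look like this. It is not part of the crawler yet: the class SessionCookies and its methods are hypothetical, and the cookie names in the example are made up. It relies only on java.net.HttpCookie from the standard library (available since Java 6).

```java
import java.net.HttpCookie;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; the crawler itself does not contain this class.
class SessionCookies {
    // Cookie name -> value, in the order the application sent them.
    private final Map<String, String> cookies = new LinkedHashMap<String, String>();

    // Remembers every cookie found in one Set-Cookie header value.
    void store(String setCookieHeader) {
        for (HttpCookie c : HttpCookie.parse(setCookieHeader)) {
            cookies.put(c.getName(), c.getValue());
        }
    }

    // Builds the value of the Cookie request header for the next request.
    String cookieHeader() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : cookies.entrySet()) {
            if (sb.length() > 0) sb.append("; ");
            sb.append(e.getKey()).append('=').append(e.getValue());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        SessionCookies jar = new SessionCookies();
        // Example header values, as a login response might send them.
        jar.store("JSESSIONID=ABC123; Path=/app; HttpOnly");
        jar.store("lang=en; Path=/");
        System.out.println(jar.cookieHeader());
    }
}
```

With HttpURLConnection, the Set-Cookie values can be read from getHeaderFields().get("Set-Cookie") and sent back with setRequestProperty("Cookie", jar.cookieHeader()). Installing a java.net.CookieManager through CookieHandler.setDefault is another option that may be simpler, at the cost of less control over what gets sent.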


Sunday, March 8, 2009

Building our own Web App Scanner - First Time

Hi all, I decided to create this blog mainly to share knowledge with the community and to learn from it. Three months ago I started developing my own Web Application Security Scanner. Basically, this new tool will try to be the Next Generation Web App Scanner.

I have been working with the main, well-known web app scanners, and I think no tool today covers the 3 most important roles in this kind of effort:

1. Business
2. Developers
3. Testers

Basically, I will be talking about WebInspect, Acunetix and Watchguard, which are the tools I know and the ones that inspired me to create my own.

Some tools focus on Business by delivering good security compliance reports; others focus on Testers by providing a good interface to reproduce vulnerabilities so that testers can rule out false positives. And, by the way, no one takes care of developers. I think this last team needs to understand how to reproduce a vuln so that it can fix it, right? The problem is that these tools give you the URL and the injected parameter that exposed the bug, but what about the flow you must follow to reach the POST request where the parameter is injected? I mean, maybe you need to authenticate and then click the 5th check box, which displays a new window where you need to press the "Save" button to get to the vulnerable request.

What about scan coverage? These tools show the URLs they assessed and the bugs they identified, but who can guarantee the whole application got tested?
I ask business: how do you know the scan is not missing some important sections or hidden transactions of your application?

But in order to know whether scan coverage was complete, business needs to:

Compare the 80 URLs of the app, the 1000 different POST/GET parameters and the 4000 lines of JavaScript (AJAX) against the technical documentation of the web app to try to identify any gap, right?
But this human effort is not feasible.

Another question for business: these tools say "I am PCI Compliant" or "I am OWASP Top Ten Compliant", but how can business validate that all the Top Ten kinds of attacks are being sent to the app? And how old are those attacks?

Technology supported by the tool.

Let's suppose the business has its own implementation of AJAX: how do you know whether the web app scanner supports it? And if it does not, does the tool inform you that it was not able to test such "weird" transactions?

Vulnerabilities management

OK, good, the tool found 50 confirmed vulnerabilities, so what is next? Is there an integrated interface to manage these new bugs until the Dev Team fixes them?

These kinds of improvements are what I think will produce the Next Gen Web App Scanner.

In coming posts I will start talking about the new features I am integrating into my app, and I will share the problems I am facing, how I worked them out, and technical stuff!