Loading URLs

Last week I talked about how to develop a fast search engine and I received some questions about how you go about actually retrieving the HTML pages reliably.

Sun has implemented a simple class called 'URL' (in the java.net package, which ships with every version of Java) that does this beautifully. The following code retrieves and displays the HTML text for any absolute URL (starting with http:// or https://), or for an absolute URL followed by a relative URL (such as an HTML filename), that you pass on the command line:

import java.net.*;
import java.io.*;

public class URLReader {
  public static void main(String[] args) throws Exception {
    if (args.length != 1 && args.length != 2) {
      System.out.println("Syntax: URLReader absoluteURLToDisplay");
      System.out.println("     or URLReader absoluteURL relativeURLToDisplay");
      return;
    }

    // One argument: an absolute URL.
    // Two arguments: resolve the second (relative) URL against the first.
    URL url = null;
    if (args.length == 1) {
      url = new URL(args[0]);
    } else {
      URL contextURL = new URL(args[0]);
      url = new URL(contextURL, args[1]);
    }

    // Open the URL and print the document line by line.
    BufferedReader in = new BufferedReader(
                          new InputStreamReader(
                            url.openStream()));
    String inputLine;
    while ((inputLine = in.readLine()) != null)
      System.out.println(inputLine);
    in.close();

    // Guess the content type from the file name in the URL text.
    System.out.println("\nThe content type for the URL '" + url.toString() +
                       "' was: " +
                       URLConnection.getFileNameMap().getContentTypeFor(
                                                        url.toString()));
  }
}

For example, to compile and test the code you could simply use:

javac URLReader.java
java URLReader https://www.digitalscores.com

Or, to display the contents of the web page '../all-features.html' as linked from inside the web page 'https://www.digitalscores.com/featured-articles/03/loading-urls.html', you could use (typed as a single command line):

java URLReader https://www.digitalscores.com/featured-articles/03/loading-urls.html ../all-features.html

A very nice thing about how Sun has implemented the URL class is that it makes relative URLs so easy to work with. This is particularly helpful if you are writing a search engine that involves a web crawler, because most of the URLs that people specify in <a href..> tags inside their HTML documents are relative to the path of the document in which they are written.

In the second example above we supplied what could be the URL for the actual HTML file loaded and then provided the relative URL. In the actual code, this is implemented by simply using the URL(URL context, String relativeURL) constructor instead of the URL(String absoluteURL) constructor.
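
To see the two-argument constructor on its own, without fetching anything, here is a minimal sketch that resolves a few href values against the URL of the page they might appear in. The class name RelativeLinkResolver, the resolve helper and the 'index.html' link are assumptions made up for illustration; the other URLs are the ones used in this article.

```java
import java.net.MalformedURLException;
import java.net.URL;

class RelativeLinkResolver {

    // Resolves an href found in a page against that page's own URL,
    // using the URL(URL context, String spec) constructor.
    // No network access happens here; this is pure string resolution.
    @SuppressWarnings("deprecation") // the URL constructors are deprecated in recent JDKs but still work
    static String resolve(String pageUrl, String href) {
        try {
            URL context = new URL(pageUrl);
            return new URL(context, href).toString();
        } catch (MalformedURLException e) {
            // Rethrown unchecked so callers need no 'throws' clause.
            throw new IllegalArgumentException(
                "Cannot resolve '" + href + "' against '" + pageUrl + "'", e);
        }
    }

    public static void main(String[] args) {
        String page =
            "https://www.digitalscores.com/featured-articles/03/loading-urls.html";
        // 'index.html' is a hypothetical link, included only to show a
        // same-directory reference.
        String[] hrefs = { "../all-features.html", "index.html", "/3dmatrix/index.html" };
        for (String href : hrefs) {
            System.out.println(href + " -> " + resolve(page, href));
        }
    }
}
```

With the page URL above, '../all-features.html' resolves to https://www.digitalscores.com/featured-articles/all-features.html, which is what a browser would request when following that link.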

If the example above cannot find a relative URL then, from my testing, the popular browsers probably cannot find it either when that relative URL is placed inside the HTML page. In practice, resolving URLs the same way browsers do is exactly the test you want when writing a web crawler: you are really trying to find the documents that humans browsing public websites via hyperlinks can also find.

Identifying the type of document that the URL points to

When you type a URL into a browser, it will generally bring up an HTML document. Even if it returns a page not found error, this will often be an HTML document generated by the server. On the other hand, the content might be an image, a PDF file or some other type that is not text. How do you work out the content type?

It is important to understand that while https://www.digitalscores.com/3dmatrix and https://www.digitalscores.com/3dmatrix/index.html both return exactly the same document using the code above, the final line, which tries to identify the content type, will return 'text/html' for the second URL but null for the first URL. This is because getContentTypeFor only guesses the type from the file name in the URL text, and the first URL does not point directly to a file.

I am not sure what the perfect solution is. In practice, however, it seems reasonable to assume that if the document is retrieved (that is, the call to url.openStream() in the code above does not raise an exception) and the content type is either 'text/html' or null, then the document is a text/html document.
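
One way to sketch this rule of thumb in code is shown below. It is only an illustration under the assumption stated above, namely that the document has already been retrieved successfully; the class name ContentTypeGuess and the helper names guessFromName and treatAsHtml are made up for this example.

```java
import java.net.URLConnection;

class ContentTypeGuess {

    // Guesses a content type from the URL text alone (no network access).
    // Returns null when the file name map cannot work out a type.
    static String guessFromName(String url) {
        return URLConnection.getFileNameMap().getContentTypeFor(url);
    }

    // The rule of thumb: for a document that was retrieved successfully,
    // a guess of 'text/html' or null is treated as HTML.
    static boolean treatAsHtml(String guessedType) {
        return guessedType == null || guessedType.equalsIgnoreCase("text/html");
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://www.digitalscores.com/3dmatrix",
            "https://www.digitalscores.com/3dmatrix/index.html"
        };
        for (String url : urls) {
            String guess = guessFromName(url);
            System.out.println(url + " -> " + guess
                + ", treat as HTML: " + treatAsHtml(guess));
        }
    }
}
```

If you are already opening a connection, URLConnection also has a getContentType() method that reports the type declared by the server, which may be a more direct signal than guessing from the file name, although I have not compared the two here.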

Also, if you are writing a web crawler then I encourage you to read my section Proceed With Caution.

Julian Cochran
DigitalScores
www.digitalscores.com


© DigitalScores 2025
