Introduction
The dependencies page lists all the jars that you will need to have in your classpath.
The class org.htmlunit.WebClient is the main starting point. This simulates a web browser and will be used to execute all of the tests. (see WebClient - the browser)
Android
Using HtmlUnit on Android poses some challenges because of subtle technical differences
in Java on Android. Because of this, we offer a customized distribution that works around these problems.
Please check out htmlunit-android on GitHub.
Most unit testing will be done within a framework like JUnit, so all the examples here assume that we are using it.
In the first sample, we create the web client and have it load the homepage from the HtmlUnit website. We then verify that this page has the correct title. Note that getPage() can return different types of pages based on the content type of the returned data. In this case, we are expecting a content type of text/html, so we cast the result to an org.htmlunit.html.HtmlPage.
@Test
public void homePage() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.htmlunit.org/");
        Assertions.assertEquals("HtmlUnit – Welcome to HtmlUnit", page.getTitleText());

        final String pageAsXml = page.asXml();
        Assertions.assertTrue(pageAsXml.contains("<body class=\"topBarDisabled\">"));

        final String pageAsText = page.asNormalizedText();
        Assertions.assertTrue(pageAsText.contains("Support for the HTTP and HTTPS protocols"));
    }
}
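To illustrate how the returned page type follows the content type, here is a minimal sketch that serves content from an in-memory MockWebConnection instead of a real server. The class name PageTypeSketch, the method describe and the URL http://localhost/sample are made up for this illustration; the sketch assumes HtmlUnit is on the classpath.

```java
import java.net.URI;
import java.net.URL;

import org.htmlunit.MockWebConnection;
import org.htmlunit.Page;
import org.htmlunit.TextPage;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlPage;

final class PageTypeSketch {

    // Serves the given content with the given content type from an in-memory
    // connection and reports which kind of page getPage() returned for it.
    static String describe(final String content, final String contentType) throws Exception {
        // Hypothetical URL; it is answered by the mock connection, not by the network.
        final URL url = URI.create("http://localhost/sample").toURL();

        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setResponse(url, content, contentType);
            webClient.setWebConnection(connection);

            final Page page = webClient.getPage(url);
            if (page instanceof HtmlPage) {
                return "html:" + ((HtmlPage) page).getTitleText();
            }
            if (page instanceof TextPage) {
                return "text:" + ((TextPage) page).getContent();
            }
            return "other:" + page.getClass().getSimpleName();
        }
    }
}
```

Checking the type before casting avoids a ClassCastException when a server answers with something other than text/html.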
Submitting a form
Frequently we want to change values in a form and submit the form back to the server. The following example shows how you might do this.
@Test
public void submittingForm() throws Exception {
    try (WebClient webClient = new WebClient()) {
        // Get the first page
        final HtmlPage page = webClient.getPage("http://some_url");

        // Get the form that we are dealing with and within that form,
        // find the submit button and the field that we want to change.
        final HtmlForm form = page.getFormByName("myform");
        final HtmlSubmitInput button = form.getInputByName("submitbutton");
        final HtmlTextInput textField = form.getInputByName("userid");

        // Change the value of the text field
        textField.type("root");

        // Now submit the form by clicking the button and get back the second page.
        final HtmlPage secondPage = button.click();
    }
}
Finding form elements
To fill out a form, you first have to find the form elements you want to interact with.
final HtmlTextInput textField = form.getInputByName("userid");
In addition to all the general ways of finding DOM elements (see below), the HtmlForm object offers some convenience methods for finding form elements:
- HtmlForm.getButtonByName(String)
- HtmlForm.getButtonsByName(String)
- HtmlForm.getCheckedRadioButton(String)
- HtmlForm.getInputByName(String)
- HtmlForm.getInputByValue(String)
- HtmlForm.getInputsByName(String)
- HtmlForm.getInputsByValue(String)
- HtmlForm.getRadioButtonsByName(String)
- HtmlForm.getSelectByName(String)
- HtmlForm.getSelectsByName(String)
- HtmlForm.getTextAreaByName(String)
- HtmlForm.getTextAreasByName(String)
- HtmlForm.getElements()
All these methods work on a list of all DOM elements associated with this form. This list includes all descendants of the form element and all other elements associated with this form through the 'form' attribute. In general, the method HtmlForm.getElements() builds this list, and all other methods use it as the basis for further filtering.
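The following sketch shows the association through the 'form' attribute on a small in-memory page. It assumes a recent HtmlUnit version that behaves as described above; the class name FormAssociationSketch, the markup and the URL are invented for the illustration.

```java
import org.htmlunit.MockWebConnection;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlForm;
import org.htmlunit.html.HtmlPage;
import org.htmlunit.html.HtmlTextInput;

final class FormAssociationSketch {

    // One input is a descendant of the form, the other one sits outside
    // of it and is linked to the form only by its 'form' attribute.
    static final String HTML =
        "<html><body>"
        + "<form id='f1' name='myform' action='result'>"
        + "<input type='text' name='inside'>"
        + "</form>"
        + "<input type='text' name='outside' form='f1'>"
        + "</body></html>";

    static String findBoth() throws Exception {
        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(HTML);
            webClient.setWebConnection(connection);

            // Hypothetical URL; the mock connection answers every request.
            final HtmlPage page = webClient.getPage("http://localhost/");
            final HtmlForm form = page.getFormByName("myform");

            // Both lookups go through the form, although only the first
            // input is nested inside the form element.
            final HtmlTextInput inside = form.getInputByName("inside");
            final HtmlTextInput outside = form.getInputByName("outside");
            return inside.getNameAttribute() + "," + outside.getNameAttribute();
        }
    }
}
```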
Text input <input type='text'>
These form elements are represented as instances of class HtmlTextInput.
final HtmlTextInput textField = form.getInputByName("userid");
To replace the value with some new text you should use the method HtmlElement#type(String). This call takes care of setting the focus (if required; including triggering all the focus related events) and then simulating the typing of the provided string (char by char, including the keyboard events).
textField.type("RBRi");
If the events are not really needed, you can also use the method HtmlSelectableTextInput#setValue(String).
Text area <textarea>
These form elements are represented as instances of class HtmlTextArea.
final HtmlTextArea textArea = form.getTextAreaByName("comment");
The usage of HtmlTextArea is similar to HtmlTextInput (because both are derived from HtmlSelectableTextInput). This means you can also use type(String) or setValue(String) to update these elements.
textArea.type("HtmlUnit is a great library...");
Radio buttons <input type='radio'> and Checkboxes <input type='checkbox'>
These form elements are represented as instances of class HtmlRadioButtonInput/HtmlCheckBoxInput.
final HtmlRadioButtonInput countryGermany = form.getInputByName("radio_country_germany");
final HtmlCheckBoxInput programmingLanguage = form.getInputByName("check_language_java");
Usually your form contains many of these elements organized in groups. To check a radio button or a checkbox, use HtmlRadioButtonInput#setChecked(boolean) or HtmlCheckBoxInput#setChecked(boolean).
countryGermany.setChecked(true);
programmingLanguage.setChecked(true);
Checking a single radio button will automatically uncheck all other radio buttons in the same group.
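A small sketch of this group behavior, run against an in-memory page. The class name RadioGroupSketch, the markup and the URL are assumptions made for the illustration.

```java
import java.util.List;

import org.htmlunit.MockWebConnection;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlForm;
import org.htmlunit.html.HtmlPage;
import org.htmlunit.html.HtmlRadioButtonInput;

final class RadioGroupSketch {

    // Two radio buttons that share the name 'country' and therefore form one group.
    static final String HTML =
        "<html><body><form name='myform' action='result'>"
        + "<input type='radio' name='country' value='germany' checked>"
        + "<input type='radio' name='country' value='france'>"
        + "</form></body></html>";

    static String checkFrance() throws Exception {
        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(HTML);
            webClient.setWebConnection(connection);

            // Hypothetical URL; the mock connection answers every request.
            final HtmlPage page = webClient.getPage("http://localhost/");
            final HtmlForm form = page.getFormByName("myform");

            final List<HtmlRadioButtonInput> radios = form.getRadioButtonsByName("country");
            final HtmlRadioButtonInput germany = radios.get(0);
            final HtmlRadioButtonInput france = radios.get(1);

            // Checking 'france' is expected to uncheck 'germany'.
            france.setChecked(true);

            return "germany=" + germany.isChecked()
                + ", france=" + france.isChecked()
                + ", checked=" + form.getCheckedRadioButton("country").getValueAttribute();
        }
    }
}
```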
Select <select>
These form elements are represented as instances of class HtmlSelect. The individual options are represented by instances of class HtmlOption.
final HtmlSelect currency = form.getSelectByName("currency");
The simplest way to select one of the options is the method HtmlSelect#setSelectedIndex(int).
currency.setSelectedIndex(1);
To make your code more readable and robust, you can search for the HtmlOption to select and then use HtmlSelect#setSelectedAttribute(HtmlOption, boolean).
HtmlOption euro = currency.getOptionByValue("Euro");
currency.setSelectedAttribute(euro, true);
For single selection select elements, this call also deselects all other options.
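The sketch below runs the option lookup and selection against an in-memory page and then reads back which options are selected. The class name SelectSketch, the markup and the URL are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.htmlunit.MockWebConnection;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlForm;
import org.htmlunit.html.HtmlOption;
import org.htmlunit.html.HtmlPage;
import org.htmlunit.html.HtmlSelect;

final class SelectSketch {

    // A single selection select with 'Dollar' preselected.
    static final String HTML =
        "<html><body><form name='myform' action='result'>"
        + "<select name='currency'>"
        + "<option value='Dollar' selected>US Dollar</option>"
        + "<option value='Euro'>Euro</option>"
        + "</select>"
        + "</form></body></html>";

    static List<String> selectEuro() throws Exception {
        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(HTML);
            webClient.setWebConnection(connection);

            // Hypothetical URL; the mock connection answers every request.
            final HtmlPage page = webClient.getPage("http://localhost/");
            final HtmlForm form = page.getFormByName("myform");

            final HtmlSelect currency = form.getSelectByName("currency");
            final HtmlOption euro = currency.getOptionByValue("Euro");
            currency.setSelectedAttribute(euro, true);

            // Collect the values of all options that are selected now.
            final List<String> selected = new ArrayList<>();
            for (final HtmlOption option : currency.getSelectedOptions()) {
                selected.add(option.getValueAttribute());
            }
            return selected;
        }
    }
}
```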
Finding a specific element
Once you have a reference to an HtmlPage, you can search for a specific HtmlElement with one of the 'get' methods, or by using XPath or CSS selectors.
Traversing the DOM tree
Below is an example of finding a 'div' by an ID, and getting an anchor by name:
@Test
public void getElements() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://some_url");
        final HtmlDivision div = page.getHtmlElementById("some_div_id");
        final HtmlAnchor anchor = page.getAnchorByName("anchor_name");
    }
}
A simple way of finding elements is to find all elements of a specific type.
@Test
public void getElements() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("http://some_url");
        final DomNodeList<DomElement> inputs = page.getElementsByTagName("input");
        final Iterator<DomElement> nodesIterator = inputs.iterator();
        // now iterate
    }
}
There is a rich set of methods for locating page elements, e.g.
- HtmlPage.getAnchors(); HtmlPage.getAnchorByHref(String); HtmlPage.getAnchorByName(String); HtmlPage.getAnchorByText(String)
- HtmlPage.getElementById(String); HtmlPage.getElementsById(String); HtmlPage.getElementsByIdAndOrName(String);
- HtmlPage.getElementByName(String); HtmlPage.getElementsByName(String)
- HtmlPage.getFormByName(String); HtmlPage.getForms()
- HtmlPage.getFrameByName(String); HtmlPage.getFrames()
You can also start searching from the document element (HtmlPage.getDocumentElement()) and then traverse the DOM tree:
- HtmlElement.getElementsByAttribute(String, String, String)
- DomElement.getElementsByTagName(String); DomElement.getElementsByTagNameNS(String, String)
- DomElement.getChildElements(); DomElement.getChildElementCount()
- DomElement.getFirstElementChild(); DomElement.getLastElementChild()
- HtmlElement.getEnclosingElement(String); HtmlElement.getEnclosingForm()
- DomNode.getChildNodes(); DomNode.getChildren(); DomNode.getDescendants(); DomNode.getDomElementDescendants(); DomNode.getFirstChild(); DomNode.getHtmlElementDescendants(); DomNode.getLastChild(); DomNode.getNextElementSibling(); DomNode.getNextSibling(); DomNode.getPreviousElementSibling(); DomNode.getPreviousSibling()
XPath queries
XPath is the suggested way for more complex searches; a brief tutorial can be found at W3Schools.
@Test
public void xpath() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.htmlunit.org/");

        // get list of all divs
        final List<?> divs = page.getByXPath("//div");

        // get div which has an 'id' attribute of 'banner'
        final HtmlDivision div = (HtmlDivision) page.getByXPath("//div[@id='banner']").get(0);
    }
}
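When the expression may match nothing, calling get(0) on the result list fails with an IndexOutOfBoundsException. A possible alternative is getFirstByXPath(String), which returns null when there is no match. The sketch below assumes an in-memory page; the class name XPathSketch, the method bannerText and the URL are made up for this illustration.

```java
import org.htmlunit.MockWebConnection;
import org.htmlunit.WebClient;
import org.htmlunit.html.HtmlDivision;
import org.htmlunit.html.HtmlPage;

final class XPathSketch {

    // Returns the visible text of the div with the id 'banner',
    // or a placeholder when the page has no such div.
    static String bannerText(final String html) throws Exception {
        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(html);
            webClient.setWebConnection(connection);

            // Hypothetical URL; the mock connection answers every request.
            final HtmlPage page = webClient.getPage("http://localhost/");

            // getFirstByXPath returns the first match, or null if there is none.
            final HtmlDivision banner = page.getFirstByXPath("//div[@id='banner']");
            return banner == null ? "(no match)" : banner.asNormalizedText();
        }
    }
}
```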
CSS Selectors
You can also use CSS selectors:
@Test
public void cssSelector() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.htmlunit.org/");

        // get list of all divs
        final DomNodeList<DomNode> divs = page.querySelectorAll("div");
        for (final DomNode div : divs) {
            // ....
        }

        // get div which has the id 'breadcrumbs'
        final DomNode div = page.querySelector("div#breadcrumbs");
    }
}
Extracting text
When you need to extract text from a web page, think of it as a two-step process: find it, then extract it.
Use HtmlUnit's search methods (like getElementById(), XPath, or CSS selectors) to find the specific element
containing the text you want. Then simply call the asNormalizedText() method on that element to get the text
exactly as a user would see it in their browser.
See the following example:
@Test
public void extractTextToc() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.htmlunit.org/");
        final DomNode sponsoringDiv = page.querySelector("#bodyColumn > section:nth-child(1) > div:nth-child(2)");

        // A normalized textual representation of this element that represents
        // what would be visible to the user if this page was shown in a web browser.
        // Whitespace is normalized like in the browser and block tags are separated by '\n'.
        final String content = sponsoringDiv.asNormalizedText();
    }
}
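To see the difference between the markup and the normalized text of one element, the sketch below extracts both from a small in-memory page. The class name TextExtractionSketch, the markup and the URL are assumptions made for the illustration.

```java
import org.htmlunit.MockWebConnection;
import org.htmlunit.WebClient;
import org.htmlunit.html.DomNode;
import org.htmlunit.html.HtmlPage;

final class TextExtractionSketch {

    // Two paragraphs with extra whitespace and an inline tag.
    static final String HTML =
        "<html><body><div id='content'>"
        + "<p>First   paragraph</p>"
        + "<p>Second <b>paragraph</b></p>"
        + "</div></body></html>";

    // Returns { normalized text, XML } for the div with the id 'content'.
    static String[] extract() throws Exception {
        try (WebClient webClient = new WebClient()) {
            final MockWebConnection connection = new MockWebConnection();
            connection.setDefaultResponse(HTML);
            webClient.setWebConnection(connection);

            // Hypothetical URL; the mock connection answers every request.
            final HtmlPage page = webClient.getPage("http://localhost/");

            // Step 1: find the element. Step 2: extract its text.
            final DomNode content = page.querySelector("#content");
            return new String[] {content.asNormalizedText(), content.asXml()};
        }
    }
}
```

The first array entry holds only the text a user would read, while the second one still contains the tags.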
Extracting the whole page content
If you want to extract all text from an entire page without targeting specific elements, you can call asNormalizedText()
directly on the body element. This is useful for getting a quick overview of all visible content or for full-text indexing.
Here's a simple example that loads a page and extracts all its text:
@Test
public void extractTextFromBody() throws Exception {
    try (WebClient webClient = new WebClient()) {
        final HtmlPage page = webClient.getPage("https://www.htmlunit.org/");
        final HtmlBody body = page.getBody();

        // A normalized textual representation of this element that represents
        // what would be visible to the user if this page was shown in a web browser.
        // Whitespace is normalized like in the browser and block tags are separated by '\n'.
        final String bodyContent = body.asNormalizedText();
    }
}