How to Select and Format a Portion of a Webpage Using Jsoup and HtmlCleaner in Android


There are times when an Android application needs to display only a portion of a webpage in a WebView rather than the entire page. For example, an application might need to display just the blog post portion of this page and not the comments section. In this blog post I will explain how this can be achieved using the open source libraries jsoup and htmlcleaner.

Needed Libraries

To do this, open the Eclipse IDE and create a new Android project named PortionofPageViewer. Then download the jsoup library and the htmlcleaner library. Jsoup will be used to select a portion of the DOM, and htmlcleaner to clean up the retrieved page contents or tags.

Building the application

First, the Internet permission needs to be added to the AndroidManifest.xml file.

<uses-permission android:name="android.permission.INTERNET"/>

Next, add the WebView, ProgressBar and TextView widgets to the main.xml file in the res/layout folder.

Then, in the Package Explorer, right-click the project and add a new folder named libs. Right-click the folder, select Import, choose File System in the dialog box, browse to the two jar files that you downloaded and import them into the folder. Finally, right-click the project and select Properties. In the dialog box select Java Build Path, click Add JARs, select the two libraries from the project's libs folder and click OK.

In the auto-generated PortionofPageViewerActivity class, import the following classes, because they will be needed.

import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

The jsoup library can be used to select and manipulate elements using CSS or jQuery-like selectors. The lines of code below connect to a page, fetch its raw HTML and select the needed elements from it.

Document doc = Jsoup.connect(url).get();
Elements newsRawTag = doc.select("div#cntent");
String newPage = newsRawTag.html();
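To try the selection step without a network connection, the sketch below parses an in-memory HTML string instead of calling Jsoup.connect. The class name SelectPortionExample, the helper selectPortion and the sample markup are illustrative assumptions and not part of the application. Only Jsoup.parse, select and html() are jsoup calls.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class SelectPortionExample {
    // Returns the inner HTML of every element matched by the CSS query,
    // or an empty string when nothing matches.
    static String selectPortion(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matched = doc.select(cssQuery);
        return matched.html();
    }

    public static void main(String[] args) {
        // Made-up page with a post section and a comments section.
        String page = "<html><body>"
                + "<div id=\"postcontent\"><p>Blog post text</p></div>"
                + "<div id=\"comments\"><p>A comment</p></div>"
                + "</body></html>";
        // Only the markup inside div#postcontent is printed.
        System.out.println(selectPortion(page, "div#postcontent"));
    }
}
```

The same selector string can then be passed to doc.select on a document fetched with Jsoup.connect(url).get().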

Jsoup does a good job of returning the elements or portion of the page needed. Htmlcleaner then cleans up the retrieved elements and adds the required HTML tags around them. The lines of code below do that.

HtmlCleaner cleaner = new HtmlCleaner();
CleanerProperties props = cleaner.getProperties();
TagNode tagNode = cleaner.clean(newPage);
SimpleHtmlSerializer htmlSerializer = new SimpleHtmlSerializer(props);
// the cleaned html is then passed to the webview widget for rendering
browser.loadDataWithBaseURL(null, htmlSerializer.getAsString(tagNode),
    "text/html", "UTF-8", null);

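As a rough illustration of what the cleaning step produces, the sketch below runs an unclosed HTML fragment through htmlcleaner and returns the serialized result. CleanFragmentExample and toFullDocument are hypothetical names, and the method declares throws Exception only so that it compiles whether or not the htmlcleaner release in use declares checked exceptions on these calls.

```java
import org.htmlcleaner.CleanerProperties;
import org.htmlcleaner.HtmlCleaner;
import org.htmlcleaner.SimpleHtmlSerializer;
import org.htmlcleaner.TagNode;

public class CleanFragmentExample {
    // Cleans a (possibly malformed) fragment and serializes it as a
    // complete document with html, head and body elements.
    static String toFullDocument(String fragment) throws Exception {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        TagNode root = cleaner.clean(fragment);
        return new SimpleHtmlSerializer(props).getAsString(root);
    }

    public static void main(String[] args) throws Exception {
        // The <p> tag is deliberately left unclosed.
        System.out.println(toFullDocument("<p>Hello"));
    }
}
```

The returned string is the kind of value that is passed as the data argument of loadDataWithBaseURL.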
Screenshots of the device during loading and after the page has been rendered by the WebView widget are shown below.

[Screenshots: device when first loading; device when fully loaded]

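The full source code of the PortionofPageViewerActivity class is shown below. The view ids (R.id.browser, R.id.loadingProgressBar, R.id.txtLoading) and PAGE_URL are placeholders; use the ids from your main.xml and the address of the page you want to load.

import java.io.IOException;
import android.annotation.SuppressLint;
import android.app.Activity;
import android.os.Bundle;
import android.view.View;
import android.webkit.WebView;
import android.widget.ProgressBar;
import android.widget.TextView;
// plus the jsoup and htmlcleaner imports listed earlier

public class PortionofPageViewerActivity extends Activity {
  // Placeholder: the address of the page to load.
  private static final String PAGE_URL = "http://example.com/";

  private WebView browser;
  private ProgressBar loadingProgressBar;
  private TextView txtLoading;

  /** Called when the activity is first created. */
  @Override
  public void onCreate(Bundle savedInstanceState) {
    super.onCreate(savedInstanceState);
    setContentView(R.layout.main);
    browser = (WebView) findViewById(R.id.browser); // placeholder id
    LoadContent(PAGE_URL);
  }

  @SuppressLint("ParserError")
  private void LoadContent(final String url) {
    loadingProgressBar = (ProgressBar) findViewById(R.id.loadingProgressBar); // placeholder id
    txtLoading = (TextView) findViewById(R.id.txtLoading); // placeholder id
    // Fetch on a worker thread so the UI thread is not blocked.
    new Thread(new Runnable() {
      public void run() {
        final String newPage;
        try {
          Document doc = Jsoup.connect(url).get();
          Elements newsRawTag = doc.select("div#postcontent");
          newPage = newsRawTag.html();
        } catch (IOException e) {
          e.printStackTrace();
          return;
        }
        // Views may only be touched from the UI thread.
        runOnUiThread(new Runnable() {
          public void run() {
            try {
              HtmlCleaner cleaner = new HtmlCleaner();
              CleanerProperties props = cleaner.getProperties();
              TagNode tagNode = cleaner.clean(newPage);
              SimpleHtmlSerializer htmlSerializer =
                  new SimpleHtmlSerializer(props);
              browser.loadDataWithBaseURL(null,
                  htmlSerializer.getAsString(tagNode), "text/html", "UTF-8", null);
              // Hide the loading indicators once the content is handed to the WebView.
              loadingProgressBar.setVisibility(View.GONE);
              txtLoading.setVisibility(View.GONE);
            } catch (Exception e) { // e.g. IOException from the serializer
              e.printStackTrace();
            }
          }
        });
      }
    }).start();
  }
}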
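The activity fetches the page on a worker thread and then hands the result to the UI thread. That hand-off can be sketched in plain Java, outside Android, as shown here. BackgroundLoadSketch, Callback and loadInBackground are hypothetical names, and a single-thread Executor stands in for Android's runOnUiThread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BackgroundLoadSketch {
    /** Receives the fetched HTML; stands in for the WebView update. */
    interface Callback {
        void onLoaded(String html);
    }

    // Runs 'fetch' on a new worker thread, then delivers the result
    // to 'callback' on the executor that plays the role of the UI thread.
    static void loadInBackground(final Callable<String> fetch,
                                 final Executor uiThread,
                                 final Callback callback) {
        new Thread(new Runnable() {
            public void run() {
                final String result;
                try {
                    result = fetch.call();
                } catch (Exception e) {
                    e.printStackTrace();
                    return; // nothing to render if the fetch failed
                }
                uiThread.execute(new Runnable() {
                    public void run() {
                        callback.onLoaded(result);
                    }
                });
            }
        }).start();
    }

    public static void main(String[] args) throws Exception {
        final ExecutorService ui = Executors.newSingleThreadExecutor();
        final CountDownLatch done = new CountDownLatch(1);
        loadInBackground(
            new Callable<String>() {
                public String call() {
                    return "<p>fetched content</p>"; // stand-in for the jsoup fetch
                }
            },
            ui,
            new Callback() {
                public void onLoaded(String html) {
                    System.out.println("Rendering: " + html);
                    done.countDown();
                }
            });
        done.await();
        ui.shutdown();
    }
}
```

In the activity, the Callable corresponds to the jsoup fetch and the Callback to the code that cleans the HTML and loads it into the WebView.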






Older Comment(s)

Posted by    Phani

Wednesday, September 12, 2012    4:35 PM

hi thanks for the useful information, I came across your blog while searching for pro/cons of jsoup over htmlcleaner. I was wondering why you used both of them as it can be achieved using either of them alone! Is convenient selection syntax of jsoup the reason to use it? If so we can do the cleanup using Jsoup.parseBodyFragment() and Jsoup.clean() I am just trying to make a choice between them. While htmlcleaner is smaller in size (matters for mobile dev right?) jsoup has got nice api to work with.

Posted by    Ayobami Adewole

Saturday, September 15, 2012    8:20 AM

@Phani Having too many libraries in your application can increase the size of the .apk file that is built at the end of development. Both libraries are good, but if the size of your .apk file matters a lot, I suggest you stick to one of them.
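As a rough illustration of the single-library route Phani mentions, the sketch below uses only jsoup to wrap a selected fragment in a complete document. JsoupOnlyExample and wrapFragment are hypothetical names. Jsoup.clean is left out because it additionally needs an allow-list argument, and whether this output suits a given page as well as htmlcleaner's is untested here.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class JsoupOnlyExample {
    // Parses the fragment as body content and returns the whole
    // document, including the html, head and body elements.
    static String wrapFragment(String fragment) {
        Document wrapped = Jsoup.parseBodyFragment(fragment);
        return wrapped.outerHtml();
    }

    public static void main(String[] args) {
        System.out.println(wrapFragment("<p>Hello"));
    }
}
```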

Posted by    ethanfel

Monday, October 29, 2012    9:32 PM

Hello, your work is great; it works very well. I'm trying to use your code to load a table in my activity. It works, but the table appears as plain text. How would you change your code to keep the table?

Posted by    Ayobami Adewole

Tuesday, October 30, 2012    9:15 PM

@Ethanfel make sure you use HtmlCleaner to clean up the html tags before you pass it to the webview for rendering.

Posted by    Shahid

Wednesday, November 07, 2012    4:43 PM

Hi. I am a newbie to this sort of stuff. I tried running your code however I get an error that says "vm does not provide monitor information" and then points to the line "Document doc =Jsoup.connect(url).get();". What am I doing wrong? Any help would be helpful. Thank You.

Posted by    Ayobami Adewole

Wednesday, November 07, 2012    9:58 PM

@Shahid that line of code connects to the website or IP address you are fetching your HTML content from, so check that you have an active connection before you proceed.

Posted by    Shahid

Thursday, November 08, 2012    12:44 AM

Hi there. Thank you for your reply. I managed to solve my problem. Basically I had to place the Jsoup and Htmlcleaner Jar library file into the "libs" folder, so now it works. I was using Eclipse. I noticed that in the above example the 2 images are not displayed when running the app just text only. Is there a way to fetch and display images as well?

Posted by    Ayobami Adewole

Thursday, November 08, 2012    8:57 AM

In the above example, I only selected the area of this page that has text only. If you want an image to be displayed, you can select an area with an image in it, and the WebView will display the image as well.

Posted by    Lawrence Macharia

Thursday, January 03, 2013    8:23 AM

Hey Ayobami, nice work on the tutorial. However, I have a problem when I implement your code. When I use your link, the progress bar keeps loading indefinitely, and when I use mine the app force closes. What could be the problem, and is it possible for you to share the XML file as well? Thanks

Posted by    Lawrence Macharia

Thursday, January 03, 2013    8:45 AM

Hey Ayobami, I figured out what my problem was; it was the same problem Shahid above had. Thanks

Posted by    Shahid

Thursday, January 17, 2013    6:28 PM

Hi Ayobami. I am trying to fetch and display a table from a website. The problem is that when the table is displayed it does not show the lines of the table, just the text, which looks very unorganised. Is there a way to fetch and properly display tables? Thank You!

Posted by    Ayobami Adewole

Friday, January 18, 2013    12:27 AM

@Shahid you will probably do something like Document doc = Jsoup.connect("some url").get(); Elements newsRawTag = doc.select("the table name goes here"); you can now do something like newPage = newsRawTag.html(); to get the HTML form of the table before loading it in a WebView.
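One detail worth checking with tables: html() returns the inner HTML of the matched elements, so when the selector matches the table itself, the table tag is not part of the result, which may be why the rows show up as plain text. The sketch below keeps the tag by using outerHtml() and adds a small style block so that cell borders are drawn. TableSelectionExample, tableAsPage and the CSS rules are illustrative assumptions, not the code used in this post.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableSelectionExample {
    // Returns a small HTML page containing the first element matched by
    // cssQuery (including its own tag), or an empty string if none matches.
    static String tableAsPage(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element table = doc.select(cssQuery).first();
        if (table == null) {
            return "";
        }
        // Assumed styling: draw a border around every cell.
        String style = "<style>table{border-collapse:collapse}"
                + "td,th{border:1px solid #000;padding:4px}</style>";
        return "<html><head>" + style + "</head><body>"
                + table.outerHtml() + "</body></html>";
    }

    public static void main(String[] args) {
        // Made-up page with one table.
        String page = "<html><body><p>Intro</p>"
                + "<table id=\"scores\"><tr><td>A</td><td>1</td></tr></table>"
                + "</body></html>";
        System.out.println(tableAsPage(page, "table#scores"));
    }
}
```

The returned string can be passed to the WebView in the same way as the cleaned HTML in the post.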