How to scrape web page content with Java


Bounty: 3 niubi (site coins). Tags: Java, web scraping

All answers (5)

martin0321 March 12, 2015

最代码官方 LV167 March 12, 2015

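The simplest approach is targeted scraping of known pages, which needs nothing more than Java's built-in HttpURLConnection class. See: "Java demo for scraping HTML page data".

You can also use the open-source Java library jsoup. See: "Using jsoup to scrape the HTML content of a given class from a specified site URL".

A third option is the Java crawler framework WebMagic. See: "A crawler built on the WebMagic framework". It scrapes pages according to rules you define, is flexible to use, and the demo can be viewed as soon as it is deployed.

For more Java scraping code, see: "Scraping web page data with a Java crawler".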

wwwphp LV2 March 18, 2015

Fetch the page content with HttpClient, then parse the HTML with jsoup.
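The fetch half of this approach might look like the sketch below. It uses the JDK's built-in java.net.http.HttpClient (Java 11 and later), which is an assumption: the answer most likely refers to Apache HttpClient, whose API differs. The names PageFetcher and fetch, the timeouts, and the User-Agent value are illustrative, and the body is assumed to be UTF-8.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.time.Duration;

class PageFetcher {
    // One shared client; it follows ordinary redirects and gives up
    // on connections that take longer than 10 seconds to establish.
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .connectTimeout(Duration.ofSeconds(10))
            .build();

    /** Downloads the page at the given URL and returns its body decoded as UTF-8. */
    static String fetch(String url) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "Mozilla/5.0") // illustrative value
                .GET()
                .build();
        HttpResponse<String> response = CLIENT.send(
                request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        // Treat anything other than 200 OK as a failure instead of parsing an error page.
        if (response.statusCode() != 200) {
            throw new IOException("Unexpected HTTP status " + response.statusCode());
        }
        return response.body();
    }
}
```

The returned string can then be handed to jsoup, for example with Jsoup.parse(html), for the parsing half.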

wxzain LV4 March 25, 2015

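    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    /**
     * Fetches the content of a web page.
     * Assumes the enclosing class defines a logger named log (e.g. SLF4J or Log4j).
     * @param url  the page URL
     * @param code the character encoding, e.g. GB2312 or UTF-8
     * @return the page content; empty if the request fails before any data is read
     */
    public static String getContent(String url, String code) {
        HttpURLConnection conn = null;
        // Initialized up front so the final return cannot throw a NullPointerException.
        StringBuilder sb = new StringBuilder();
        try {
            conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestProperty("User-Agent", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; H010818)");
            // try-with-resources closes the reader even if reading fails.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), code))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // readLine() strips the line terminator, so add it back.
                    sb.append(line).append('\n');
                }
            }
        } catch (Exception e) {
            log.error(e.getMessage(), e);
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
        return sb.toString();
    }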

xhmlwaf LV2 April 9, 2015

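Just parse the page with jsoup. It is fast, and its selectors are powerful: a single CSS-style expression can pick out any element in the page's HTML.

Import the jsoup JAR, then obtain a Document object from the URL.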

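    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    public static Document getDocument(String url) {
        Document doc = null;
        try {
            doc = Jsoup.connect(url)
                    .data("jquery", "java")    // sample request parameter; drop it if the site does not need one
                    .userAgent("Mozilla")
                    .cookie("auth", "token")   // sample cookie; replace with real values or drop it
                    .timeout(100000)           // timeout in milliseconds
                    .get();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return doc; // null if the request failed
    }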

Finally, use the Document's select method to pick out the elements you need.
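As a rough illustration of that last step, the sketch below parses an HTML string with jsoup and collects the text of every element matching a CSS query. It assumes the jsoup library is on the classpath; the class name SelectDemo, the method name texts, and the sample markup in the usage note are made up for this example.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class SelectDemo {
    /** Returns the text of each element in the HTML that matches the CSS query. */
    static List<String> texts(String html, String cssQuery) {
        // Parse from a string here; a Document obtained from Jsoup.connect(url).get() works the same way.
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        List<String> result = new ArrayList<>();
        for (Element element : matches) {
            result.add(element.text());
        }
        return result;
    }
}
```

For example, with the query "div.news a" and a page whose div of class news contains two links, the method returns the text of those two links and ignores links elsewhere on the page. To read an attribute instead of the text, use element.attr("href").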
