Web Crawler 2: Crawling Web Content with crawler4j
Published: 2019-06-06


https://github.com/yasserg/crawler4j

Two JARs are required:

  crawler4j-4.1-jar-with-dependencies.jar

  slf4j-simple-1.7.22.jar (without it you get the warning: SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".)

Download for both JARs:

http://download.csdn.net/detail/talkwah/9747407

 

(There is little material about crawler4j-4.1-jar-with-dependencies.jar, and the GitHub download failed after a long wait, so I collected the files in one place.)

References:

http://blog.csdn.net/zjm131421/article/details/13093869

 

http://favccxx.blog.51cto.com/2890523/1691079/

 

import java.util.Set;
import java.util.regex.Pattern;

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.crawler.Page;
import edu.uci.ics.crawler4j.crawler.WebCrawler;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.parser.HtmlParseData;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;
import edu.uci.ics.crawler4j.url.WebURL;

public class AhCrawler extends WebCrawler {

    // Three essentials:
    // _Whom to visit?
    // _How to visit?
    // _What to do with a page once it has been visited?
    private static final String C_URL = "http://www.ximalaya.com";

    // Skip resources ending in mp3|jpg|png. The pattern is lowercase because
    // shouldVisit lowercases the URL before matching; compiling it once here
    // avoids recompiling it for every candidate URL.
    private static final Pattern FILTERS = Pattern.compile(".*(\\.(mp3|jpg|png))$");

    @Override
    public boolean shouldVisit(Page referringPage, WebURL url) {
        String href = url.getURL().toLowerCase();
        // Visit only non-media URLs that stay on the seed site
        return !FILTERS.matcher(href).matches() && href.startsWith(C_URL);
    }

    @Override
    public void visit(Page page) {
        String url = page.getWebURL().getURL();
        String parentUrl = page.getWebURL().getParentUrl();
        String anchor = page.getWebURL().getAnchor();

        System.out.println("↓↓↓↓↓↓↓↓↓");
        System.out.println("URL        :" + url);
        System.out.println("Parent page:" + parentUrl);
        System.out.println("Anchor text:" + anchor);

        // logger is inherited from WebCrawler
        logger.info("URL: {}", url);
        logger.debug("Parent page: {}", parentUrl);
        logger.debug("Anchor text: {}", anchor);

        if (page.getParseData() instanceof HtmlParseData) {
            HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
            String text = htmlParseData.getText();
            String html = htmlParseData.getHtml();
            Set<WebURL> links = htmlParseData.getOutgoingUrls();

            System.out.println("--------------------------");
            // System.out.println("Text length: " + text.length());
            // System.out.println("Html length: " + html.length());
            System.out.println("Number of outgoing links: " + links.size());
        }
        System.out.println("↑↑↑↑↑↑↑");
    }

    public static void main(String[] args) throws Exception {
        // In the example shipped with the source code, these two values are
        // passed in as two arguments.
        // Configure a path; it acts like a temp folder and does not need to
        // exist beforehand.
        String crawlStorageFolder = "/data/crawl/root";
        int numberOfCrawlers = 7;

        CrawlConfig crawlConf = new CrawlConfig();
        crawlConf.setCrawlStorageFolder(crawlStorageFolder);

        PageFetcher pageFetcher = new PageFetcher(crawlConf);
        RobotstxtConfig robotConf = new RobotstxtConfig();
        RobotstxtServer robotServ = new RobotstxtServer(robotConf, pageFetcher);

        // Controller
        CrawlController c = new CrawlController(crawlConf, pageFetcher, robotServ);
        // Add the seed URL
        c.addSeed(C_URL);
        // Start the crawl
        c.start(AhCrawler.class, numberOfCrawlers);
    }
}
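The filtering rule in shouldVisit can be tried out on its own, without starting a crawl. The following is only a minimal sketch: UrlFilter and SEED_PREFIX are hypothetical names that are not part of crawler4j, and the pattern is made case-insensitive (an assumption on my part) so that the extension check does not depend on the URL's letter case.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper (not part of crawler4j) that mirrors the shouldVisit rule.
public class UrlFilter {

    // Assumed seed prefix, taken from C_URL in the crawler above.
    private static final String SEED_PREFIX = "http://www.ximalaya.com";

    // Matches URLs ending in .mp3, .jpg or .png, ignoring letter case.
    private static final Pattern MEDIA =
            Pattern.compile(".*\\.(mp3|jpg|png)$", Pattern.CASE_INSENSITIVE);

    // True when the URL stays on the seed site and is not a media file.
    public static boolean shouldVisit(String url) {
        String href = url.toLowerCase(Locale.ROOT);
        return !MEDIA.matcher(href).matches() && href.startsWith(SEED_PREFIX);
    }

    public static void main(String[] args) {
        // A page on the seed site: visited
        System.out.println(shouldVisit("http://www.ximalaya.com/dq/music-ACG/"));  // true
        // A media file on the seed site: skipped
        System.out.println(shouldVisit("http://www.ximalaya.com/a/track.MP3"));    // false
        // A page on another site: skipped
        System.out.println(shouldVisit("http://example.com/index.html"));          // false
    }
}
```

Keeping the rule in a plain static method like this makes it easy to check a few URLs by hand before letting seven crawler threads loose on the site.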

 

Where CrawlController c comes from:

Sample output:

↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5333001/sound/25320285
Parent page:http://www.ximalaya.com/dq/music-ACG/
Anchor text:俊豪演奏 - 琵琶版《刀劍如夢》
[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5333001/sound/25320285
--------------------------
Number of outgoing links: 131
↑↑↑↑↑↑↑
[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30119950/sound/12181402
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/30119950/sound/12181402
Parent page:http://www.ximalaya.com/dq/book-果麦文化/
Anchor text:第二十六集 人生的意思不在于留下什么,而在于经历
--------------------------
Number of outgoing links: 134
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/zhubo/56833971/
Parent page:http://www.ximalaya.com/4932085/sound/21902925
Anchor text:null
[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/zhubo/56833971/
--------------------------
Number of outgoing links: 68
↑↑↑↑↑↑↑
[Crawler 4] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5413571/sound/2349697
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5413571/sound/2349697
Parent page:http://www.ximalaya.com/dq/renwen-新知/
Anchor text:41-方明-西江月·夜行黄沙道中 南宋 辛弃疾
--------------------------
Number of outgoing links: 134
↑↑↑↑↑↑↑
[Crawler 6] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/5011186/sound/30650945
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/5011186/sound/30650945
Parent page:http://www.ximalaya.com/dq/finance-大咖/
Anchor text:03
--------------------------
Number of outgoing links: 111
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/1000144/album/3559805
Parent page:http://www.ximalaya.com/dq/music-文艺/
Anchor text:null
[Crawler 2] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/1000144/album/3559805
--------------------------
Number of outgoing links: 85
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/4932085/sound/21902925/liker
Parent page:http://www.ximalaya.com/4932085/sound/21902925
Anchor text:更多
[Crawler 1] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/4932085/sound/21902925/liker
--------------------------
Number of outgoing links: 96
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/30895669/sound/19945445
Parent page:http://www.ximalaya.com/dq/music-ACG/
Anchor text:宫崎骏-久石让
[Crawler 3] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/30895669/sound/19945445
--------------------------
Number of outgoing links: 131
↑↑↑↑↑↑↑
↓↓↓↓↓↓↓↓↓
URL        :http://www.ximalaya.com/9112346/album/2903291
Parent page:http://www.ximalaya.com/dq/book-果麦文化/
Anchor text:null
[Crawler 7] INFO edu.uci.ics.crawler4j.crawler.WebCrawler - URL: http://www.ximalaya.com/9112346/album/2903291
--------------------------
Number of outgoing links: 90
↑↑↑↑↑↑↑

 

 

 

Reposted from: https://www.cnblogs.com/AndyHoo/p/6367622.html
