How to Use the Spring Boot + WebMagic + MyBatis Crawler Framework

2023-06-20 20:35

This article shows how to build a crawler with Spring Boot, WebMagic, and MyBatis, walking through the project configuration and each class in turn.

WebMagic is an open-source Java crawler framework. How to use WebMagic itself is not the focus of this article; for details, see the official documentation: http://webmagic.io/docs/.

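This article integrates Spring Boot, WebMagic, and MyBatis: WebMagic crawls the data, and MyBatis persists the crawled data to a MySQL database. The source code provided here can serve as scaffolding for a Java crawler project.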

1. Add the Maven dependencies

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>hyzx</groupId>
    <artifactId>qbasic-crawler</artifactId>
    <version>1.0.0</version>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>1.5.21.RELEASE</version>
        <relativePath/> <!-- lookup parent from repository -->
    </parent>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <maven.test.skip>true</maven.test.skip>
        <java.version>1.8</java.version>
        <maven.compiler.plugin.version>3.8.1</maven.compiler.plugin.version>
        <maven.resources.plugin.version>3.1.0</maven.resources.plugin.version>
        <mysql.connector.version>5.1.47</mysql.connector.version>
        <druid.spring.boot.starter.version>1.1.17</druid.spring.boot.starter.version>
        <mybatis.spring.boot.starter.version>1.3.4</mybatis.spring.boot.starter.version>
        <fastjson.version>1.2.58</fastjson.version>
        <commons.lang3.version>3.9</commons.lang3.version>
        <joda.time.version>2.10.2</joda.time.version>
        <webmagic.core.version>0.7.3</webmagic.core.version>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-devtools</artifactId>
            <scope>runtime</scope>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-configuration-processor</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.connector.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>druid-spring-boot-starter</artifactId>
            <version>${druid.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>org.mybatis.spring.boot</groupId>
            <artifactId>mybatis-spring-boot-starter</artifactId>
            <version>${mybatis.spring.boot.starter.version}</version>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>${fastjson.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.commons</groupId>
            <artifactId>commons-lang3</artifactId>
            <version>${commons.lang3.version}</version>
        </dependency>
        <dependency>
            <groupId>joda-time</groupId>
            <artifactId>joda-time</artifactId>
            <version>${joda.time.version}</version>
        </dependency>
        <dependency>
            <groupId>us.codecraft</groupId>
            <artifactId>webmagic-core</artifactId>
            <version>${webmagic.core.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>org.slf4j</groupId>
                    <artifactId>slf4j-log4j12</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven.compiler.plugin.version}</version>
                <configuration>
                    <source>${java.version}</source>
                    <target>${java.version}</target>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-resources-plugin</artifactId>
                <version>${maven.resources.plugin.version}</version>
                <configuration>
                    <encoding>${project.build.sourceEncoding}</encoding>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
                <configuration>
                    <fork>true</fork>
                    <addResources>true</addResources>
                </configuration>
                <executions>
                    <execution>
                        <goals>
                            <goal>repackage</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>

    <repositories>
        <repository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>public</id>
            <name>aliyun nexus</name>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
            <releases>
                <enabled>true</enabled>
            </releases>
            <snapshots>
                <enabled>false</enabled>
            </snapshots>
        </pluginRepository>
    </pluginRepositories>
</project>

2. Project configuration file: application.properties

Configure the MySQL data source, the Druid connection pool, and the location of the MyBatis mapper files.

# MySQL data source configuration
spring.datasource.name=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
spring.datasource.driver-class-name=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://192.168.0.63:3306/gjhzjl?useUnicode=true&characterEncoding=utf8&useSSL=false&allowMultiQueries=true
spring.datasource.username=root
spring.datasource.password=root

# Druid connection pool configuration
spring.datasource.druid.initial-size=5
spring.datasource.druid.min-idle=5
spring.datasource.druid.max-active=10
spring.datasource.druid.max-wait=60000
spring.datasource.druid.validation-query=SELECT 1 FROM DUAL
spring.datasource.druid.test-on-borrow=false
spring.datasource.druid.test-on-return=false
spring.datasource.druid.test-while-idle=true
spring.datasource.druid.time-between-eviction-runs-millis=60000
spring.datasource.druid.min-evictable-idle-time-millis=300000
spring.datasource.druid.max-evictable-idle-time-millis=600000

# MyBatis configuration
mybatis.mapperLocations=classpath:mapper*.xml

3. Database table structure

CREATE TABLE `cms_content` (
  `contentId` varchar(40) NOT NULL COMMENT 'Content ID',
  `title` varchar(150) NOT NULL COMMENT '',
  `content` longtext COMMENT 'Article content',
  `releaseDate` datetime NOT NULL COMMENT 'Release date',
  PRIMARY KEY (`contentId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='CMS content table';

4. Entity class

package com.hyzx.qbasic.model;

import java.util.Date;

// Maps one row of the cms_content table.
public class CmsContentPO {

    private String contentId;
    private String title;
    private String content;
    private Date releaseDate;

    public String getContentId() {
        return contentId;
    }

    public void setContentId(String contentId) {
        this.contentId = contentId;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }

    public Date getReleaseDate() {
        return releaseDate;
    }

    public void setReleaseDate(Date releaseDate) {
        this.releaseDate = releaseDate;
    }
}

5. Mapper interface

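package com.hyzx.qbasic.dao;

import com.hyzx.qbasic.model.CmsContentPO;

public interface CrawlerMapper {

    // Inserts one record and returns the number of affected rows.
    int addCmsContent(CmsContentPO record);
}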

6. The CrawlerMapper.xml file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE mapper PUBLIC "-//mybatis.org//DTD Mapper 3.0//EN" "http://mybatis.org/dtd/mybatis-3-mapper.dtd">
<mapper namespace="com.hyzx.qbasic.dao.CrawlerMapper">
    <insert id="addCmsContent" parameterType="com.hyzx.qbasic.model.CmsContentPO">
        insert into cms_content (contentId,
                                 title,
                                 releaseDate,
                                 content)
        values (#{contentId,jdbcType=VARCHAR},
                #{title,jdbcType=VARCHAR},
                #{releaseDate,jdbcType=TIMESTAMP},
                #{content,jdbcType=LONGVARCHAR})
    </insert>
</mapper>

7. Zhihu page processor class: ZhihuPageProcessor

This class parses the Zhihu HTML pages that the crawler fetches.

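import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

@Component
public class ZhihuPageProcessor implements PageProcessor {

    // Retry failed downloads up to 3 times and wait 1000 ms between requests.
    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue every link on the page that points to an answer.
        page.addTargetRequests(page.getHtml().links().regex("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*").all());
        page.putField("title", page.getHtml().xpath("//h2[@class='QuestionHeader-title']/text()").toString());
        page.putField("answer", page.getHtml().xpath("//div[@class='QuestionAnswer-content']/tidyText()").toString());
        if (page.getResultItems().get("title") == null) {
            // A list page has no title: skip it so the pipeline does no further processing.
            page.setSkip(true);
        }
    }

    @Override
    public Site getSite() {
        return site;
    }
}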
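To see which links the expression passed to regex() lets through, the filtering can be approximated with plain java.util.regex. This is only a sketch: AnswerLinkFilter and its filter method are hypothetical names, and WebMagic's own selector may differ in detail from a simple find().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of WebMagic: applies the same expression
// that ZhihuPageProcessor passes to regex() to a list of links.
class AnswerLinkFilter {

    // Same expression as in process(); the dots are escaped so they match literally.
    static final Pattern ANSWER_URL =
            Pattern.compile("https://www\\.zhihu\\.com/question/\\d+/answer/\\d+.*");

    // Returns the matched portion of every link that contains an answer URL.
    static List<String> filter(List<String> links) {
        List<String> result = new ArrayList<>();
        for (String link : links) {
            Matcher matcher = ANSWER_URL.matcher(link);
            if (matcher.find()) {
                result.add(matcher.group());
            }
        }
        return result;
    }
}
```

Given a question URL without an answer part, or the explore page itself, filter drops the link; a link of the form /question/<digits>/answer/<digits>, with or without a trailing query string, is kept in full because of the final `.*`.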

8. Zhihu data pipeline class: ZhihuPipeline

This class stores the data parsed from the Zhihu HTML pages in the MySQL database.

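import com.hyzx.qbasic.dao.CrawlerMapper;
import com.hyzx.qbasic.model.CmsContentPO;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

import java.util.Date;
import java.util.UUID;

@Component
public class ZhihuPipeline implements Pipeline {

    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuPipeline.class);

    @Autowired
    private CrawlerMapper crawlerMapper;

    @Override
    public void process(ResultItems resultItems, Task task) {
        String title = resultItems.get("title");
        String answer = resultItems.get("answer");

        CmsContentPO contentPO = new CmsContentPO();
        contentPO.setContentId(UUID.randomUUID().toString());
        contentPO.setTitle(title);
        contentPO.setReleaseDate(new Date());
        contentPO.setContent(answer);

        try {
            // addCmsContent returns the number of inserted rows.
            boolean success = crawlerMapper.addCmsContent(contentPO) > 0;
            if (success) {
                LOGGER.info("Saved Zhihu article: {}", title);
            } else {
                LOGGER.warn("Zhihu article was not saved: {}", title);
            }
        } catch (Exception ex) {
            LOGGER.error("Failed to save Zhihu article", ex);
        }
    }
}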
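The pipeline writes the scraped title straight into a varchar(150) NOT NULL column, so depending on the server's SQL mode an over-long title is either rejected or silently truncated. A minimal sketch of a check that could run before the insert is shown below; ContentGuard, normalizeTitle, and newContentId are hypothetical names, not part of the original project.

```java
import java.util.UUID;

// Hypothetical helper: checks scraped fields against the cms_content
// column definitions before the insert is attempted.
class ContentGuard {

    // title is declared as varchar(150) NOT NULL in the table definition.
    static final int MAX_TITLE_LENGTH = 150;

    // Returns the trimmed title cut to the column width,
    // or null when there is nothing worth storing.
    static String normalizeTitle(String rawTitle) {
        if (rawTitle == null) {
            return null;
        }
        String title = rawTitle.trim();
        if (title.isEmpty()) {
            return null;
        }
        return title.length() > MAX_TITLE_LENGTH
                ? title.substring(0, MAX_TITLE_LENGTH)
                : title;
    }

    // contentId is varchar(40); a canonical UUID string is 36 characters long.
    static String newContentId() {
        return UUID.randomUUID().toString();
    }
}
```

With such a helper, process could return early when normalizeTitle yields null instead of attempting an insert that is bound to fail.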

9. Zhihu crawler task class: ZhihuTask

This class starts the crawler once every ten minutes.

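import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import us.codecraft.webmagic.Spider;

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

@Component
public class ZhihuTask {

    private static final Logger LOGGER = LoggerFactory.getLogger(ZhihuTask.class);

    @Autowired
    private ZhihuPipeline zhihuPipeline;

    @Autowired
    private ZhihuPageProcessor zhihuPageProcessor;

    private ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    public void crawl() {
        // Scheduled task: crawl once every 10 minutes.
        timer.scheduleWithFixedDelay(() -> {
            Thread.currentThread().setName("zhihuCrawlerThread");

            try {
                Spider.create(zhihuPageProcessor)
                        // Start crawling from https://www.zhihu.com/explore
                        .addUrl("https://www.zhihu.com/explore")
                        // Store the crawled data in the database
                        .addPipeline(zhihuPipeline)
                        // Crawl with 2 threads
                        .thread(2)
                        // Start the spider asynchronously
                        .start();
            } catch (Exception ex) {
                LOGGER.error("Scheduled Zhihu crawl thread failed", ex);
            }
        }, 0, 10, TimeUnit.MINUTES);
    }
}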
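Because start() launches the spider on its own thread, the scheduled lambda returns immediately. The ten-minute delay is therefore counted from the launch, not from the end of the crawl, and a slow crawl may still be running when the next one begins. The following is a minimal sketch of one way to skip a round in that case, using only the JDK. CrawlGate and tryStart are hypothetical names; in the task above the argument could be the Spider instance itself, on the assumption that its blocking run() method is used in place of start().

```java
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper: allows at most one crawl to be in flight at a time.
class CrawlGate {

    private final AtomicBoolean running = new AtomicBoolean(false);

    // Starts the blocking crawl on its own thread unless the previous one
    // has not finished yet. Returns a future that completes when the crawl
    // ends, or an empty Optional if this round was skipped.
    Optional<CompletableFuture<Void>> tryStart(Runnable blockingCrawl) {
        if (!running.compareAndSet(false, true)) {
            return Optional.empty();
        }
        CompletableFuture<Void> done = new CompletableFuture<>();
        Thread worker = new Thread(() -> {
            Throwable failure = null;
            try {
                blockingCrawl.run();
            } catch (Throwable t) {
                failure = t;
            }
            // Reopen the gate before signalling completion.
            running.set(false);
            if (failure == null) {
                done.complete(null);
            } else {
                done.completeExceptionally(failure);
            }
        }, "zhihuCrawlerThread");
        worker.start();
        return Optional.of(done);
    }
}
```

Inside the scheduled lambda, a call such as gate.tryStart(spider) would then either launch the crawl or return an empty Optional, which the task could log as a skipped round.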

10. Spring Boot application entry class

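import org.mybatis.spring.annotation.MapperScan;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.CommandLineRunner;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
@MapperScan(basePackages = "com.hyzx.qbasic.dao")
public class Application implements CommandLineRunner {

    @Autowired
    private ZhihuTask zhihuTask;

    public static void main(String[] args) {
        SpringApplication.run(Application.class, args);
    }

    @Override
    public void run(String... strings) throws Exception {
        // Start crawling Zhihu once the application context is ready.
        zhihuTask.crawl();
    }
}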

That covers everything needed to wire Spring Boot, WebMagic, and MyBatis together into a working crawler.
