WebMagic Example: Crawling School Course Data

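The goal is to build a complete crawler example with the WebMagic framework that automatically extracts course-related information from the University of Hawaii class availability pages and stores it in a database. The process consists of the following key steps:

  1. Data analysis: analyze the HTML structure of the target pages and determine what to crawl and where it is located, for example the institution list, term information, subject list, and course details.

  2. Database table design: create one database table for each type of crawled data (institution, semester, subject, course).

  3. Crawler implementation: use the WebMagic framework to write a PageProcessor and a Pipeline class for each data type (institution, semester, subject, course) that fetch the target page and store its data.

  4. Data storage: each Pipeline writes the crawled data to the corresponding table, so the institution, semester, subject, and course levels are all persisted.

  5. Testing: run test programs to verify that the crawler extracts and stores the data correctly.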

1. Crawl target: institution

1.1 Web page data analysis

Target page: https://www.sis.hawaii.edu/uhdad/avail.classes

The HTML structure of the page is as follows:

<body>
  <div class="header">
    <p class="system">
      <a href="./avail.classes" class="system_link">University of Hawaii</a>
    </p>
    <p class="institution">
      <a href="./avail.classes" class="title_link">Class Availability</a>
    </p>
  </div>
  <div class="bodydiv">
    <p><b>By institution<sup>*</sup>:</b></p>
    <ul class="institutions">
      <li class="HAW">
        <a href="./avail.classes?i=HAW">Hawaii Community College</a>
      </li>
      <li class="HON">
        <a href="./avail.classes?i=HON">Honolulu Community College</a>
      </li>
      <li class="KAP">
        <a href="./avail.classes?i=KAP">Kapi'olani Community College</a>
      </li>
      <li class="KAU">
        <a href="./avail.classes?i=KAU">Kauai Community College</a>
      </li>
      <li class="LEE">
        <a href="./avail.classes?i=LEE">Leeward Community College</a>
      </li>
      <li class="MAU">
        <a href="./avail.classes?i=MAU">University of Hawaii Maui College</a>
      </li>
      <li class="WOA">
        <a href="./avail.classes?i=WOA">University of Hawaii West Oahu</a>
      </li>
      <li class="HIL">
        <a href="./avail.classes?i=HIL">University of Hawaii at Hilo</a>
      </li>
      <li class="MAN">
        <a href="./avail.classes?i=MAN">University of Hawaii at Manoa</a>
      </li>
      <li class="WIN">
        <a href="./avail.classes?i=WIN">Windward Community College</a>
      </li>
      <li class="SYS">
        <a href="https://www.uhonline.hawaii.edu/courses/">System-wide Distance Learning</a>
      </li>
    </ul>
    <p style="font-size: small"><sup>*</sup> The University of Hawaii System comprises ten institutions.</p>
  </div>
  <div class="footer">
    <hr size="1" width="100%" />
    <p class="copyright">&copy;2024 <a href="http://www.hawaii.edu/">University of Hawaii</a></p>
    <p class="updated">Updated: 09/13/2024 02:52:44 PM HST</p>
  </div>
</body>

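Data analysis:

  • The target data is in the <li> elements inside <ul class="institutions">. Each <li> contains an <a> element whose href carries the institution code (abbr_name) and whose text is the institution name (name).
  • The href has the form ./avail.classes?i=<institution code>, so the link must be parsed to extract the code. The last entry (SYS) points to an external site and has no i parameter.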
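
As a minimal sketch of the parsing step described above, the institution code can be pulled out of the href with a regular expression. The class name InstitutionLinkParser and the method extractCode are hypothetical and not part of the original project; the pipeline shown later in this section splits the link on "=" instead.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
final class InstitutionLinkParser {

  // Matches the "i" query parameter, e.g. "?i=HAW" or "&i=HAW".
  private static final Pattern CODE = Pattern.compile("[?&]i=(\\w+)");

  // Returns the institution code, or empty when the link has no "i" parameter.
  static Optional<String> extractCode(String href) {
    if (href == null) {
      return Optional.empty();
    }
    Matcher m = CODE.matcher(href);
    return m.find() ? Optional.of(m.group(1)) : Optional.empty();
  }

  public static void main(String[] args) {
    // prints Optional[HAW]
    System.out.println(extractCode("./avail.classes?i=HAW"));
    // prints Optional.empty (the external SYS link has no "i" parameter)
    System.out.println(extractCode("https://www.uhonline.hawaii.edu/courses/"));
  }
}
```

Returning an Optional makes the "no code in this link" case explicit, which corresponds to the rows the pipeline skips.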

1.2 Create the database table

First, create the table spider_us_hi_uh_institution, which stores the crawled data. The SQL statement (PostgreSQL) is:

DROP TABLE IF EXISTS "spider_us_hi_uh_institution";
CREATE TABLE "spider_us_hi_uh_institution" (
  "id" "pg_catalog"."int8" NOT NULL PRIMARY KEY,
  "abbr_name" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "name" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "creator" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "create_time" "pg_catalog"."timestamp",
  "updater" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "update_time" "pg_catalog"."timestamp",
  "deleted" "pg_catalog"."int2" DEFAULT 0,
  "tenant_id" "pg_catalog"."int8" DEFAULT 1
);

1.3 Write the crawler

1 Create a PageProcessor to parse the page

PageProcessor is the WebMagic interface that defines how a page is fetched and parsed. Here we create an InstitutionProcessor class:

import java.util.List;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class InstitutionProcessor implements PageProcessor {

  private Site site = Site.me().setRetryTimes(3).setSleepTime(10000);

  @Override
  public void process(Page page) {
    // CSS selector that matches the links in the institution list
    String selector = "ul.institutions li a";

    // Extract the href and the text of every matched link
    List<String> links = page.getHtml().css(selector, "href").all();
    List<String> names = page.getHtml().css(selector, "text").all();

    // Store the extracted data in ResultItems for the Pipeline
    page.putField("links", links);
    page.putField("names", names);
  }

  @Override
  public Site getSite() {
    return site;
  }
}

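Code analysis:

  1. process method: the core method, responsible for extracting data from the page.

    • Uses the CSS selector "ul.institutions li a" to locate the links in the institution list.
    • links collects the href attribute of every link, and names collects the link text (the institution name).
    • page.putField stores the extracted data in the ResultItems object for later processing.
  2. getSite method: configures the site settings for the crawler, here 3 retries and a sleep time of 10000 ms between requests.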

2 Create a Pipeline to store the data in the database

A Pipeline processes the data extracted from the page. Here we store it in the database:

import java.util.List;
import com.litongjava.data.utils.SnowflakeIdGenerator;
import com.litongjava.jfinal.plugin.activerecord.Db;
import com.litongjava.jfinal.plugin.activerecord.Row;
import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Slf4j
public class InstitutionPipeline implements Pipeline {

  @Override
  public void process(ResultItems resultItems, Task task) {
    List<String> links = resultItems.get("links");
    List<String> names = resultItems.get("names");

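    // Insert into the database
    for (int i = 0; i < links.size(); i++) {
      String link = links.get(i);
      // Links look like ./avail.classes?i=HAW; a link without "=" (the external SYS entry) is skipped below
      String[] split = link.split("=");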
      if (split.length < 2) {
        continue;
      }
      String abbrName = split[1];
      Integer count = Db.queryInt("select count(1) from spider_us_hi_uh_institution where abbr_name=?", abbrName);

      if (count > 0) {
        continue; // Skip if the institution code already exists in the database
      } else {
        Row row = new Row();
        row.put("id", new SnowflakeIdGenerator(0, 0).generateId());
        row.put("abbr_name", abbrName);
        row.put("name", names.get(i));
        boolean saved = Db.save("spider_us_hi_uh_institution", row);
        log.info("Saved {},{}", abbrName, saved);
      }
    }
  }
}

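Code analysis:

  1. process method: stores the data from ResultItems in the database.
    • Gets the links and names from the ResultItems object.
    • Parses the institution abbreviation abbrName from each link.
    • Checks whether the institution already exists in the database and inserts a new record only if it does not.
    • Uses SnowflakeIdGenerator to generate a unique ID and saves the row to the spider_us_hi_uh_institution table.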
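
The skip-or-insert decision above can be tried without a database. The sketch below is an illustration only: InstitutionDedup and selectNew are hypothetical names, and a Set of already stored codes stands in for the select count(1) query used by the pipeline.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of the original project.
final class InstitutionDedup {

  // Returns code -> name for the institutions that still need to be inserted, in page order.
  static Map<String, String> selectNew(List<String> links, List<String> names, Set<String> existingCodes) {
    Map<String, String> toInsert = new LinkedHashMap<>();
    int n = Math.min(links.size(), names.size());
    for (int i = 0; i < n; i++) {
      String[] split = links.get(i).split("=");
      if (split.length < 2) {
        continue; // no "=" in the link, e.g. the external distance-learning entry
      }
      String code = split[1];
      if (existingCodes.contains(code)) {
        continue; // stand-in for "count > 0" in the pipeline
      }
      toInsert.putIfAbsent(code, names.get(i));
    }
    return toInsert;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "./avail.classes?i=HAW",
        "./avail.classes?i=HON",
        "https://www.uhonline.hawaii.edu/courses/");
    List<String> names = Arrays.asList(
        "Hawaii Community College",
        "Honolulu Community College",
        "System-wide Distance Learning");
    Set<String> existing = new HashSet<>(Arrays.asList("HON"));
    // prints {HAW=Hawaii Community College}
    System.out.println(selectNew(links, names, existing));
  }
}
```

Because existing codes are skipped, running the crawler a second time inserts nothing new, which is the behavior the pipeline aims for.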

3 Create the crawler service

Create a SpiderInstitutionService class to start the crawler:

import us.codecraft.webmagic.Spider;

public class SpiderInstitutionService {

  public void index() {
    String url = "https://www.sis.hawaii.edu/uhdad/avail.classes";
    Spider.create(new InstitutionProcessor())
        // Add the target URL
        .addUrl(url)
        // Add the Pipeline
        .addPipeline(new InstitutionPipeline())
        // Use 5 threads and run the crawler
        .thread(5).run();
  }
}

4 Test program

Write a test program to verify the crawler:

import java.util.List;
import org.junit.Test;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.open.chat.config.DbConfig;
import com.litongjava.table.utils.MarkdownTableUtils;
import com.litongjava.tio.utils.environment.EnvUtils;

public class SpiderInstitutionServiceTest {

  @Test
  public void test() {
    EnvUtils.load();
    new DbConfig().config();
    Aop.get(SpiderInstitutionService.class).index();
  }

  @Test
  public void findAll() {
    EnvUtils.load();
    new DbConfig().config();
    List<Row> all = Db.findAll("spider_us_hi_uh_institution");
    System.out.println(MarkdownTableUtils.to(all));
  }
}

1.4 Data preview

Sample of the crawled result:

| id | abbr_name | name | creator | create_time | updater | update_time | deleted | tenant_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 424467593466740736 | HAW | Hawaii Community College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593504489472 | HON | Honolulu Community College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593521266688 | KAP | Kapi'olani Community College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593542238208 | KAU | Kauai Community College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593563209728 | LEE | Leeward Community College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593579986944 | MAU | University of Hawaii Maui College | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593596764160 | WOA | University of Hawaii West Oahu | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593617735680 | HIL | University of Hawaii at Hilo | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593634512896 | MAN | University of Hawaii at Manoa | NULL | NULL | NULL | NULL | 0 | 1 |
| 424467593647095808 | WIN | Windward Community College | NULL | NULL | NULL | NULL | 0 | 1 |

With the steps above, we have built a complete WebMagic crawler that extracts the institution data from the page and stores it in the database.

2. Crawl target: semester

2.1 Web page data analysis

Target page: https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP

Page structure:

<body>
  <div class="header">
    <p class="system">
      <a href="./avail.classes" class="system_link">University of Hawaii</a>
      <a
        href="https://hawaii-kapiolani.verbacompare.com/"
        target="_blank"
        class="book_link"
        style="color: white; text-decoration: none; float: right"
        >Textbooks/Course Materials</a
      >
    </p>
    <p class="institution">
      <a href="./avail.classes?i=KAP" class="title_link">Kapi'olani Community College</a>
      &#149; Class Availability
    </p>
    <p class="transfer-info">
      <a href="http://www.hawaii.edu/admissions/transfers.html">(UH Transfer Information)</a>
    </p>
  </div>
  <p class="backlink">
    <a href="https://www.hawaii.edu/myuhinfo/class-availability/">Back to list of UH System Institutions</a>
  </p>
  <div class="bodydiv">
    <p>
      <b>Active and upcoming terms at <a href="./avail.classes?i=KAP">Kapi'olani Community College</a>:</b>
    </p>
    <ul class="terms">
      <li><a href="./avail.classes?i=KAP&t=202510">Fall 2024</a></li>
      <li><a href="./avail.classes?i=KAP&t=202440">Summer 2024</a></li>
      <li><a href="./avail.classes?i=KAP&t=202430">Spring 2024</a></li>
      <li><a href="./avail.classes?i=KAP&t=202410">Fall 2023</a></li>
      <li><a href="./avail.classes?i=KAP&t=202340">Summer 2023</a></li>
      <li><a href="./avail.classes?i=KAP&t=202330">Spring 2023</a></li>
    </ul>
  </div>
  <div class="footer">
    <hr size="1" width="100%" />
    <p class="copyright">&copy;2024 <a href="http://www.hawaii.edu/">University of Hawaii</a></p>
    <p class="updated">Updated: 09/13/2024 02:54:37 PM HST</p>
  </div>
</body>

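Data analysis:

  • The target data is in the <li> elements inside <ul class="terms">. Each <li> contains an <a> element whose href carries the term code (the t value) and whose text is the term name (for example, "Fall 2024").
  • The href has the form ./avail.classes?i=<institution code>&t=<term code>, so the link must be parsed to extract both the term code t and the institution code i.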
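
A minimal sketch of that parsing step, assuming the link always carries i before t as in the HTML above. TermLinkParser and TermLink are hypothetical names introduced for illustration; the regular expression is the same one the SemesterPipeline later in this section uses.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original project.
final class TermLinkParser {

  // Simple value holder for the two parameters found in a term link.
  static final class TermLink {
    final String institution;
    final int term;

    TermLink(String institution, int term) {
      this.institution = institution;
      this.term = term;
    }
  }

  // Group 1 is the institution code (i), group 2 is the term code (t).
  private static final Pattern TERM_LINK = Pattern.compile("i=(\\w+)&t=(\\d+)");

  // Returns the parsed link, or empty when the href has no i/t pair.
  static Optional<TermLink> parse(String href) {
    Matcher m = TERM_LINK.matcher(href);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new TermLink(m.group(1), Integer.parseInt(m.group(2))));
  }

  public static void main(String[] args) {
    TermLink link = parse("./avail.classes?i=KAP&t=202510").get();
    // prints KAP 202510
    System.out.println(link.institution + " " + link.term);
  }
}
```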

2.2 Create the database table

First, create the table spider_us_hi_uh_semester, which stores the crawled data. The SQL statement is:

DROP TABLE IF EXISTS "spider_us_hi_uh_semester";
CREATE TABLE "public"."spider_us_hi_uh_semester" (
  "id" "pg_catalog"."int8" NOT NULL PRIMARY KEY,
  "institution_id" "pg_catalog"."int8",
  "name" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "t" "pg_catalog"."int4",
  "creator" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "create_time" "pg_catalog"."timestamp",
  "updater" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "update_time" "pg_catalog"."timestamp",
  "deleted" "pg_catalog"."int2" DEFAULT 0,
  "tenant_id" "pg_catalog"."int8" DEFAULT 1
);

This table stores the term information: the row ID, the institution ID, the term name, and the term code (t).

2.3 Write the crawler

1 Create a PageProcessor to parse the page

PageProcessor is the WebMagic interface that defines the crawl logic. We create a SemesterProcessor class:

package com.litongjava.open.chat.spider.uh.semester;

import java.util.List;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

public class SemesterProcessor implements PageProcessor {
  private Site site = Site.me().setRetryTimes(3).setSleepTime(10000);

  @Override
  public void process(Page page) {
    String selector = "ul.terms li a";
    List<String> links = page.getHtml().css(selector, "href").all();
    List<String> names = page.getHtml().css(selector, "text").all();

    page.putField("links", links);
    page.putField("names", names);
  }

  @Override
  public Site getSite() {
    return site;
  }
}

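Code analysis:

  1. process method: extracts the data from the page.

    • Uses the CSS selector "ul.terms li a" to locate the links in the term list.
    • links collects the href attribute of every link, and names collects the link text (the term name).
    • page.putField stores the extracted term links and names in the ResultItems object for the Pipeline to process.
  2. getSite method: configures the site settings for the crawler, here 3 retries and a sleep time of 10000 ms between requests.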

2 Create a Pipeline to store the data in the database

A Pipeline processes the data extracted by the PageProcessor. Here we store it in the database:

package com.litongjava.open.chat.spider.uh.semester;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.tio.utils.mcid.McIdUtils;
import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

@Slf4j
public class SemesterPipeline implements Pipeline {
  String pattern = "i=(\\w+)&t=(\\d+)";
  String tableName = "spider_us_hi_uh_semester";
  private List<Row> records;

  public SemesterPipeline(List<Row> records) {
    this.records = records;
  }

  @SneakyThrows
  @Override
  public void process(ResultItems resultItems, Task task) {
    List<String> links = resultItems.get("links");
    List<String> names = resultItems.get("names");

    int size = links.size();
    List<Row> saveRecords = new ArrayList<>(size);
    Pattern compiledPattern = Pattern.compile(pattern);

    for (int i = 0; i < size; i++) {
      String uri = links.get(i);
      String abbrName = null;
      Integer t = null;

      Matcher matcher = compiledPattern.matcher(uri);

      if (matcher.find()) {
        abbrName = matcher.group(1);
        t = Integer.parseInt(matcher.group(2));
      } else {
        log.info("No match, skipping link at index {}", i);
        continue;
      }

      Long institutionId = getInstitutionId(abbrName);
      Row row = new Row();
      row.put("id", McIdUtils.id());
      row.put("institution_id", institutionId);
      row.put("name", names.get(i));
      row.put("t", t);
      saveRecords.add(row);
    }

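    // Replace the table contents in one transaction: truncate first, then batch-insert the parsed rows
    Db.tx(() -> {
      Db.delete("truncate table " + tableName);
      Db.batchSave(tableName, saveRecords, saveRecords.size());
      return true;
    });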
  }

  private Long getInstitutionId(String abbrName) {
    for (Row row : records) {
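      // Compare from the non-null side so a NULL abbr_name in the table cannot cause a NullPointerException
      if (abbrName.equals(row.get("abbr_name"))) {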
        return row.getLong("id");
      }
    }
    return null;
  }
}

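Code analysis:

  1. Regular expression: pattern extracts the institution code abbrName and the term code t from each link.
  2. process method: parses the term data and stores it in the database.
    • Gets the link and name lists from ResultItems.
    • Uses the regular expression to extract the institution code and the term code from each link.
    • Calls getInstitutionId to look up the institution ID.
    • Builds a Row for each term and collects the rows for insertion into the spider_us_hi_uh_semester table.
    • Runs the truncate and the batch insert inside Db.tx so that both happen in a single database transaction. Because the table is truncated first, each run replaces the previously stored terms.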
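
getInstitutionId scans the whole institution list for every link. With only a handful of institutions this is fine; if the list grows, one option is to index it by abbr_name once. The sketch below is an assumption, not part of the original project: InstitutionIndex is a hypothetical name, and plain Map objects stand in for Row.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; plain maps stand in for Row objects.
final class InstitutionIndex {

  // Builds an abbr_name -> id lookup table from the institution rows.
  static Map<String, Long> byAbbrName(List<Map<String, Object>> rows) {
    Map<String, Long> index = new HashMap<>();
    for (Map<String, Object> row : rows) {
      Object abbr = row.get("abbr_name");
      Object id = row.get("id");
      // Skip rows with a missing code or a non-numeric id
      if (abbr != null && id instanceof Number) {
        index.put(abbr.toString(), ((Number) id).longValue());
      }
    }
    return index;
  }

  public static void main(String[] args) {
    Map<String, Object> kap = new HashMap<>();
    kap.put("abbr_name", "KAP");
    kap.put("id", 424467593521266688L);

    List<Map<String, Object>> rows = new ArrayList<>();
    rows.add(kap);

    Map<String, Long> index = byAbbrName(rows);
    // prints 424467593521266688
    System.out.println(index.get("KAP"));
    // prints null (unknown code)
    System.out.println(index.get("XXX"));
  }
}
```

As with getInstitutionId, an unknown code yields null, so the caller still has to decide whether to store a row without an institution_id.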

3 Create the crawler service

Create a SpiderSemesterService class to run the crawler:

package com.litongjava.open.chat.spider.uh.semester;

import java.util.List;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import us.codecraft.webmagic.Spider;

public class SpiderSemesterService {
  public void index() {
    List<Row> all = Db.findAll("spider_us_hi_uh_institution");
    String url = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP";
    Spider.create(new SemesterProcessor())
        .addUrl(url)
        .addPipeline(new SemesterPipeline(all))
        .thread(5).run();
  }
}

4 Test program

Write a test program to verify the crawler:

package com.litongjava.open.chat.spider.uh.semester;

import java.util.List;
import org.junit.Test;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.open.chat.config.DbConfig;
import com.litongjava.table.utils.MarkdownTableUtils;
import com.litongjava.tio.utils.environment.EnvUtils;

public class SpiderSemesterServiceTest {

  @Test
  public void test() {
    EnvUtils.load();
    new DbConfig().config();
    Aop.get(SpiderSemesterService.class).index();
  }

  @Test
  public void findAll() {
    EnvUtils.load();
    new DbConfig().config();
    List<Row> all = Db.findAll("spider_us_hi_uh_semester");
    System.out.println(MarkdownTableUtils.to(all));
  }
}

2.4 Data sample

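| id | institution_id | name | t | creator | create_time | updater | update_time | deleted | tenant_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7070836632642528 | 424467593521266688 | Fall 2024 | 202510 | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070836632642529 | 424467593521266688 | Summer 2024 | 202440 | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070836632642530 | 424467593521266688 | Spring 2024 | 202430 | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070836632642531 | 424467593521266688 | Fall 2023 | 202410 | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070836632642532 | 424467593521266688 | Summer 2023 | 202340 | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070836632642533 | 424467593521266688 | Spring 2023 | 202330 | NULL | NULL | NULL | NULL | 0 | 1 |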

With the steps above, we have built a complete WebMagic crawler that extracts the term data from the page and stores it in the database.

3. Crawl target: subjects

3.1 Web page data analysis

Target page: https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510

Page structure:

<body>
  <div class="header">
    <p class="system">
      <a href="./avail.classes" class="system_link">University of Hawaii</a>
      &nbsp;&nbsp;&nbsp;&nbsp;
      <a href="./avail.classes?i=KAP&t=202510&frames=i" class="no_frames">FRAMES</a>
      <a
        href="https://hawaii-kapiolani.verbacompare.com/"
        target="_blank"
        class="book_link,"
        style="color: white; text-decoration: none; float: right"
        >Textbooks/Course Materials</a
      >
    </p>
    <p class="institution">
      <a href="./avail.classes?i=KAP" class="title_link">Kapi'olani Community College</a>
      &#149; Fall 2024 Class Availability
    </p>
    <p class="transfer-info">
      <a href="http://www.hawaii.edu/admissions/transfers.html">(UH Transfer Information)</a>
    </p>
  </div>
  <p class="backlink">
    <a href="./avail.classes?i=KAP">Back to list of terms available for Kapi'olani Community College</a>
  </p>
  <div class="bodydiv">
    <p style="margin-bottom: 0">
      <b
        >Subjects offered by
        <a href="./avail.classes?i=KAP">Kapi'olani Community College</a>
        for Fall 2024:</b
      >
    </p>
    <div
      style="border: 1px solid #333; background-color: #f7f7f7; font-size: smaller; padding: 0.2em 0.8em 0; width: 80%; margin-top: 1em; margin-left: auto; margin-right: auto;"
    >
      <span class="infotext">
        <p>
          <b
            >Classes for Fall Semester 2024 at Kapi'olani Community College have been modified and will be delivered in
            a variety of formats, including entirely online (scheduled/synchronous or unscheduled/asynchronous),
            face-to-face, and hybrid, as indicated in this class availability list. As changes to the schedule may
            occur, students are encouraged to regularly check their class schedule in
            <a href="https://www.star.hawaii.edu/studentinterface/" target="_blank">STAR</a>/MyUH, their hawaii.edu
            email, and Laulima messages from their instructors. The Fall 2024 semester starts August 26, 2024 and ends
            December 20, 2024.</b
          >
        </p></span
      >
    </div>
  </div>
  <!-- bodydiv -->
  <div class="columns">
    <div class="leftcolumn">
      <ul class="subjects">
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ACC">Accounting (ACC)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ASL">American Sign Language (ASL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ANTH">Anthropology (ANTH)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ART">Art (ART)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=BIOC">Biochemistry (BIOC)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=BIOL">Biology (BIOL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=BOT">Botany (BOT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=BUS">Business (BUS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=BLAW">Business Law (BLAW)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=CHEM">Chemistry (CHEM)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=CHN">Chinese Language & Literature (CHN)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=COM">Communication (COM)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=CHW">Community Health Worker (CHW)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=CULN">Culinary Arts (CULN)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=DNCE">Dance (DNCE)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=DENT">Dental Assisting (DENT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ERTH">Earth Sciences</a>
          <a href="./avail.classes?i=KAP&t=202510&s=ERTH" style="color: red;">
            (ERTH formerly GG Geology &amp; Geophysics)</a
          >
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=EALL">East Asian Languages & Lit (EALL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ECON">Economics (ECON)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ED">Education (ED)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=EE">Electrical Engineering (EE)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=EMT">Emergency Medical Technician (EMT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ESOL">Eng for Speakers of Other Lang (ESOL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ENG">English (ENG)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ESL">English as a Second Language (ESL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ENT">Entrepreneurship (ENT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ES">Ethnic Studies (ES)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ESS">Exercise & Sport Science (ESS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=FIL">Filipino (FIL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=FSHE">Food Service & Hosp Educ (FSHE)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=FR">French (FR)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=GEO">Geography & Environment (GEO)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HAW">Hawaiian (HAW)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HWST">Hawaiian Studies (HWST)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HLTH">Health (HLTH)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HIST">History (HIST)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HOST">Hospitality & Tourism (HOST)</a>
        </li>
      </ul>
    </div>
    <div class="rightcolumn">
      <ul class="subjects">
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HDFS">Human Dev & Family Studies</a>
          <a href="./avail.classes?i=KAP&t=202510&s=HDFS" style="color: red;"> (HDFS formerly FAMR Family Resources)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=HUM">Humanities (HUM)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ITS">Information Technology (ITS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ICS">Information& Computer Sciences (ICS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=IS">Interdisciplinary Studies (IS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=JPN">Japanese Language & Literature (JPN)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=KOR">Korean (KOR)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=LAW">Law (LAW)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=LING">Linguistics (LING)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MATH">Mathematics (MATH)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=ME">Mechanical Engineering (ME)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MEDA">Medical Assisting (MEDA)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MLT">Medical Laboratory Technician (MLT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MICR">Microbiology (MICR)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MICT">Mobile Intensv Care Technician (MICT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=MUS">Music (MUS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=NURS">Nursing (NURS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=OTA">Occupational Therapy Assistant (OTA)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PACS">Pacific Islands Studies (PACS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PHIL">Philosophy (PHIL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PTA">Physical Therapist Assistant (PTA)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PHYS">Physics (PHYS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PHYL">Physiology (PHYL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=POLS">Political Science (POLS)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=PSY">Psychology (PSY)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=RAD">Radiologic Technology (RAD)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=REL">Religion (REL)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=RESP">Respiratory Care (RESP)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SCI">Science (SCI)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SLT">Second Language Teaching (SLT)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SSCI">Social Sciences (SSCI)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SW">Social Work (SW)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SOC">Sociology (SOC)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SPAN">Spanish (SPAN)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=SP">Speech (SP)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=THEA">Theatre (THEA)</a>
        </li>
        <li>
          <a href="./avail.classes?i=KAP&t=202510&s=WS">Women's Studies (WS)</a>
        </li>
      </ul>
    </div>
    <!-- rightcolumn -->
  </div>
  <!-- to hold columns -->
  <div class="footer">
    <!-- footer div 1 -->
    <p align="center">
      <a href="./avail.classes?i=KAP&t=202510&frames=i">VIEW USING FRAMES</a>
    </p>
  </div>
  <!-- footer div 1 -->
  <div class="footer">
    <hr size="1" width="100%" />
    <p class="copyright">&copy;2024 <a href="http://www.hawaii.edu/">University of Hawaii</a></p>
    <p class="updated">Updated: 09/13/2024 02:56:29 PM HST</p>
  </div>
</body>

Page structure:

<body>
  <div class="header">
    <!-- Header: school name and navigation links -->
  </div>
  <div class="bodydiv">
    <p style="margin-bottom: 0">
      <b>Subjects offered by Kapi'olani Community College for Fall 2024:</b>
    </p>
    <!-- Main page content -->
  </div>
  <div class="columns">
    <!-- Left-hand subject list -->
    <div class="leftcolumn">
      <ul class="subjects">
        <li><a href="./avail.classes?i=KAP&t=202510&s=ACC">Accounting (ACC)</a></li>
        <!-- Other subjects -->
      </ul>
    </div>
    <!-- Right-hand subject list -->
    <div class="rightcolumn">
      <ul class="subjects">
        <li><a href="./avail.classes?i=KAP&t=202510&s=HDFS">Human Dev & Family Studies</a></li>
        <!-- Other subjects -->
      </ul>
    </div>
  </div>
  <!-- Footer: copyright and last-updated information -->
</body>

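Data analysis:

  • The target data sits inside the <ul class="subjects"> elements. Each <li> contains an <a> tag whose link carries the subject code (s) and whose text is the subject name (name). A few entries (ERTH and HDFS) contain two <a> tags with the same code, so each of them yields two results.
  • The href attribute has the form ./avail.classes?i=KAP&t=202510&s=<subject code>. We parse this link to extract the subject code and the term it belongs to.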
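
As a minimal sketch of the link parsing described above, the class below pulls i, t, and s out of an href with a regular expression. The SubjectLink class and its field names are hypothetical and exist only for illustration; the pattern also assumes the parameters always appear in the order i, t, s, as they do on this page.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value holder for the three query parameters of a subject link. */
final class SubjectLink {
  final String institution; // value of i, e.g. KAP
  final String term;        // value of t, e.g. 202510
  final String subject;     // value of s, e.g. ACC

  // Assumes the parameter order i, t, s seen in the page source.
  private static final Pattern QUERY = Pattern.compile("i=([^&]*)&t=([^&]*)&s=([^&]*)");

  SubjectLink(String institution, String term, String subject) {
    this.institution = institution;
    this.term = term;
    this.subject = subject;
  }

  /** Returns the parsed link, or empty when the href does not carry all three parameters. */
  static Optional<SubjectLink> parse(String href) {
    Matcher m = QUERY.matcher(href);
    if (!m.find()) {
      return Optional.empty();
    }
    return Optional.of(new SubjectLink(m.group(1), m.group(2), m.group(3)));
  }

  public static void main(String[] args) {
    SubjectLink link = parse("./avail.classes?i=KAP&t=202510&s=ACC")
        .orElseThrow(IllegalStateException::new);
    // prints "KAP 202510 ACC"
    System.out.println(link.institution + " " + link.term + " " + link.subject);
  }
}
```

Returning an Optional keeps the "no match" case explicit, so a caller can skip links such as ./avail.classes?i=KAP that carry no subject code.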

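2 Create the data table

First, create the database table spider_us_hi_uh_subject to hold the scraped data. It stores the subject information, including the subject ID, term ID, subject name, and subject code. The SQL statement is: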

DROP TABLE IF EXISTS "public"."spider_us_hi_uh_subject";
CREATE TABLE "public"."spider_us_hi_uh_subject" (
  "id" "pg_catalog"."int8" NOT NULL primary key,
  "semester_id" "pg_catalog"."int8",
  "name" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "s" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "creator" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "create_time" "pg_catalog"."timestamp",
  "updater" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "update_time" "pg_catalog"."timestamp",
  "deleted" "pg_catalog"."int2" DEFAULT 0,
  "tenant_id" "pg_catalog"."int8" DEFAULT 1
);

3 Write the program

1 Create a PageProcessor to parse the page

PageProcessor is the WebMagic interface that defines how pages are fetched and parsed. Here we create a SubjectProcessor class:

package com.litongjava.open.chat.spider.uh.subject;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;

import java.util.List;

/**
 * Created by litonglinux@qq.com on 2023/6/20_7:05
 */
public class SubjectProcessor implements PageProcessor {
  private Site site = Site.me().setRetryTimes(3).setSleepTime(10000);

  @Override
  public void process(Page page) {
    String selector = "ul.subjects li a";
    List<String> links = page.getHtml().css(selector, "href").all();
    List<String> names = page.getHtml().css(selector, "text").all();

    page.putField("links", links);
    page.putField("names", names);

  }

  @Override
  public Site getSite() {
    return site;
  }
}

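Code analysis:

  1. process method: the core method, responsible for extracting data from the page.

    • Uses the CSS selector "ul.subjects li a" to locate the links in the subject lists.
    • links collects the href attribute of every link, and names collects the link text (the subject name).
    • page.putField stores the extracted data in the ResultItems object for later processing.
  2. getSite method: configures the site settings for the spider, here 3 retries and a sleep time of 10000 ms between requests.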
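
To make the shape of the two lists concrete, here is a small stand-alone sketch. It does not use WebMagic: a regular expression stands in for the CSS selector (and ignores the ul/li scoping), and the AnchorExtractDemo class is hypothetical. It only illustrates that every <a> tag produces one href and one text, so an <li> with two anchors produces two entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustration only: a regex stand-in for the "ul.subjects li a" selector. */
final class AnchorExtractDemo {
  // Captures the href value and the inner text of each <a> tag in a small fragment.
  private static final Pattern ANCHOR =
      Pattern.compile("<a\\s+href=\"([^\"]*)\"[^>]*>(.*?)</a\\s*>", Pattern.DOTALL);

  /** Returns one {href, text} pair per anchor, in document order. */
  static List<String[]> extract(String html) {
    List<String[]> pairs = new ArrayList<>();
    Matcher m = ANCHOR.matcher(html);
    while (m.find()) {
      pairs.add(new String[] { m.group(1), m.group(2).trim() });
    }
    return pairs;
  }

  public static void main(String[] args) {
    String html = "<ul class=\"subjects\">"
        + "<li><a href=\"./avail.classes?i=KAP&t=202510&s=HOST\">Hospitality & Tourism (HOST)</a></li>"
        + "<li><a href=\"./avail.classes?i=KAP&t=202510&s=HDFS\">Human Dev & Family Studies</a>"
        + "<a href=\"./avail.classes?i=KAP&t=202510&s=HDFS\" style=\"color: red;\">"
        + " (HDFS formerly FAMR Family Resources)</a></li>"
        + "</ul>";
    // Three anchors, so three pairs; the last two share the same href.
    for (String[] pair : extract(html)) {
      System.out.println(pair[0] + " -> " + pair[1]);
    }
  }
}
```

Because links and names are filled from the same anchors in the same order, index i in one list corresponds to index i in the other, which is what the Pipeline relies on.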

2 Create a Pipeline to store the data in the database

A Pipeline handles the data extracted from the page. Here we write that data to the database:

package com.litongjava.open.chat.spider.uh.subject;

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.tio.utils.mcid.McIdUtils;

import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

/**
 * Created by litonglinux@qq.com on 2023/6/20_7:05
 */
@Slf4j
public class SubjectPipeline implements Pipeline {

  String pattern = "i=(.*?)&t=(.*?)&s=(.*)";
  String tableName = "spider_us_hi_uh_subject";

  private List<Row> records;

  public SubjectPipeline(List<Row> records) {
    this.records = records;
  }

  @SneakyThrows
  @Override
  public void process(ResultItems resultItems, Task task) {
    // https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202410&s=ACC
    List<String> links = resultItems.get("links");
    List<String> names = resultItems.get("names");

    List<Row> saveRecords = new ArrayList<>();
    int size = links.size();
    Pattern compiledPattern = Pattern.compile(pattern);
    for (int i = 0; i < size; i++) {
      String uri = links.get(i);
      String name = names.get(i);
      //String abbrName = null;
      String t = null;
      String s = null;

      Matcher matcher = compiledPattern.matcher(uri);

      if (matcher.find()) {
        //abbrName = matcher.group(1);
        t = matcher.group(2);
        s = matcher.group(3);

      } else {
        log.info("No matching value, skipping index {}", i);
        continue;
      }

      Long semesterId = getSemesterId(t);

      Row row = new Row();
      row.put("id", McIdUtils.id());
      row.put("semester_id", semesterId);
      row.put("name", name);
      row.put("s", s);
      saveRecords.add(row);
    }
    Db.tx(() -> {
      Db.delete("truncate table " + tableName);
      Db.batchSave(tableName, saveRecords, saveRecords.size());
      return true;
    });
  }

  private Long getSemesterId(String t) {
    for (Row row : records) {
      // t is never null here, so calling equals on it avoids an NPE when a row has no "t" value
      if (t.equals(row.getStr("t"))) {
        return row.getLong("id");
      }
    }
    return null;
  }

}

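Code analysis:

  1. Regular expression: pattern extracts the institution code i, the term code t, and the subject code s from the link.
  2. process method: parses the subject data and saves it to the database.
    • Reads the link and name lists from ResultItems.
    • Uses the regular expression to extract the term code and subject code from each link.
    • Calls getSemesterId to look up the term ID.
    • Builds one Row per subject for the spider_us_hi_uh_subject table.
    • Runs the truncate and the batch insert inside Db.tx, a database transaction, to keep the data consistent.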
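
getSemesterId scans the whole semester list for every link. As an optional variation, the lookup could be indexed once up front. The sketch below is not the author's code: SemesterIndex is a hypothetical class, and plain Map objects stand in for the Row records so that the example runs without the java-db dependency.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index from term code (t) to semester id, built once per crawl. */
final class SemesterIndex {
  private final Map<String, Long> idByTerm = new HashMap<>();

  /** Each map stands in for one semester row with the keys "t" and "id". */
  SemesterIndex(List<Map<String, Object>> rows) {
    for (Map<String, Object> row : rows) {
      Object t = row.get("t");
      Object id = row.get("id");
      // Skip incomplete rows instead of failing on a null value.
      if (t != null && id instanceof Number) {
        idByTerm.put(t.toString(), ((Number) id).longValue());
      }
    }
  }

  /** Returns the semester id for the term code, or null when the term is unknown. */
  Long idFor(String term) {
    return idByTerm.get(term);
  }

  public static void main(String[] args) {
    Map<String, Object> fall = new HashMap<>();
    fall.put("t", "202510");
    fall.put("id", 7070836632642528L);
    SemesterIndex index = new SemesterIndex(Arrays.asList(fall));
    System.out.println(index.idFor("202510")); // prints 7070836632642528
    System.out.println(index.idFor("209910")); // prints null
  }
}
```

With only a handful of semesters the linear scan is perfectly adequate; the map mainly makes the "unknown term returns null" behaviour explicit.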

3 Create the spider service

Create the SpiderSubjectService class to run the spider:

package com.litongjava.open.chat.spider.uh.subject;

import java.util.List;

import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;

import us.codecraft.webmagic.Spider;

public class SpiderSubjectService {

  public void index() {
    List<Row> all = Db.findAll("spider_us_hi_uh_semester");
    String url = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510";
    Spider.create(new SubjectProcessor())
        // url
        .addUrl(url).addPipeline(new SubjectPipeline(all))
        //
        .thread(5).run();
  }

}

4 Test the program

Write a test to verify that the spider works:

package com.litongjava.open.chat.spider.uh.subject;

import java.util.List;

import org.junit.Test;

import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.open.chat.config.DbConfig;
import com.litongjava.table.utils.MarkdownTableUtils;
import com.litongjava.tio.utils.environment.EnvUtils;

public class SpiderSubjectServiceTest {

  @Test
  public void test() {
    EnvUtils.load();
    new DbConfig().config();
    Aop.get(SpiderSubjectService.class).index();
  }

  @Test
  public void findAll() {
    EnvUtils.load();
    new DbConfig().config();
    List<Row> all = Db.findAll("spider_us_hi_uh_subject");
    System.out.println(MarkdownTableUtils.to(all));
  }

}

4 Sample data

| id | semester_id | name | s | creator | create_time | updater | update_time | deleted | tenant_id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7070842163735376 | 7070836632642528 | Accounting (ACC) | ACC | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735377 | 7070836632642528 | American Sign Language (ASL) | ASL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735378 | 7070836632642528 | Anthropology (ANTH) | ANTH | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735379 | 7070836632642528 | Art (ART) | ART | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735380 | 7070836632642528 | Biochemistry (BIOC) | BIOC | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735381 | 7070836632642528 | Biology (BIOL) | BIOL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735382 | 7070836632642528 | Botany (BOT) | BOT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735383 | 7070836632642528 | Business (BUS) | BUS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735384 | 7070836632642528 | Business Law (BLAW) | BLAW | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735385 | 7070836632642528 | Chemistry (CHEM) | CHEM | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735386 | 7070836632642528 | Chinese Language & Literature (CHN) | CHN | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735387 | 7070836632642528 | Communication (COM) | COM | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735388 | 7070836632642528 | Community Health Worker (CHW) | CHW | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735389 | 7070836632642528 | Culinary Arts (CULN) | CULN | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735390 | 7070836632642528 | Dance (DNCE) | DNCE | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163735391 | 7070836632642528 | Dental Assisting (DENT) | DENT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739472 | 7070836632642528 | Earth Sciences | ERTH | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739473 | 7070836632642528 | (ERTH formerly GG Geology & Geophysics) | ERTH | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739474 | 7070836632642528 | East Asian Languages & Lit (EALL) | EALL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739475 | 7070836632642528 | Economics (ECON) | ECON | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739476 | 7070836632642528 | Education (ED) | ED | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739477 | 7070836632642528 | Electrical Engineering (EE) | EE | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739478 | 7070836632642528 | Emergency Medical Technician (EMT) | EMT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739479 | 7070836632642528 | Eng for Speakers of Other Lang (ESOL) | ESOL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739480 | 7070836632642528 | English (ENG) | ENG | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739481 | 7070836632642528 | English as a Second Language (ESL) | ESL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739482 | 7070836632642528 | Entrepreneurship (ENT) | ENT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739483 | 7070836632642528 | Ethnic Studies (ES) | ES | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739484 | 7070836632642528 | Exercise & Sport Science (ESS) | ESS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739485 | 7070836632642528 | Filipino (FIL) | FIL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739486 | 7070836632642528 | Food Service & Hosp Educ (FSHE) | FSHE | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163739487 | 7070836632642528 | French (FR) | FR | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751760 | 7070836632642528 | Geography & Environment (GEO) | GEO | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751761 | 7070836632642528 | Hawaiian (HAW) | HAW | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751762 | 7070836632642528 | Hawaiian Studies (HWST) | HWST | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751763 | 7070836632642528 | Health (HLTH) | HLTH | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751764 | 7070836632642528 | History (HIST) | HIST | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751765 | 7070836632642528 | Hospitality & Tourism (HOST) | HOST | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751766 | 7070836632642528 | Human Dev & Family Studies | HDFS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751767 | 7070836632642528 | (HDFS formerly FAMR Family Resources) | HDFS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751768 | 7070836632642528 | Humanities (HUM) | HUM | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751769 | 7070836632642528 | Information Technology (ITS) | ITS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751770 | 7070836632642528 | Information& Computer Sciences (ICS) | ICS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751771 | 7070836632642528 | Interdisciplinary Studies (IS) | IS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751772 | 7070836632642528 | Japanese Language & Literature (JPN) | JPN | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751773 | 7070836632642528 | Korean (KOR) | KOR | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751774 | 7070836632642528 | Law (LAW) | LAW | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163751775 | 7070836632642528 | Linguistics (LING) | LING | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755856 | 7070836632642528 | Mathematics (MATH) | MATH | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755857 | 7070836632642528 | Mechanical Engineering (ME) | ME | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755858 | 7070836632642528 | Medical Assisting (MEDA) | MEDA | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755859 | 7070836632642528 | Medical Laboratory Technician (MLT) | MLT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755860 | 7070836632642528 | Microbiology (MICR) | MICR | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755861 | 7070836632642528 | Mobile Intensv Care Technician (MICT) | MICT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755862 | 7070836632642528 | Music (MUS) | MUS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755863 | 7070836632642528 | Nursing (NURS) | NURS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755864 | 7070836632642528 | Occupational Therapy Assistant (OTA) | OTA | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755865 | 7070836632642528 | Pacific Islands Studies (PACS) | PACS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755866 | 7070836632642528 | Philosophy (PHIL) | PHIL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755867 | 7070836632642528 | Physical Therapist Assistant (PTA) | PTA | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755868 | 7070836632642528 | Physics (PHYS) | PHYS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755869 | 7070836632642528 | Physiology (PHYL) | PHYL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755870 | 7070836632642528 | Political Science (POLS) | POLS | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163755871 | 7070836632642528 | Psychology (PSY) | PSY | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759952 | 7070836632642528 | Radiologic Technology (RAD) | RAD | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759953 | 7070836632642528 | Religion (REL) | REL | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759954 | 7070836632642528 | Respiratory Care (RESP) | RESP | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759955 | 7070836632642528 | Science (SCI) | SCI | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759956 | 7070836632642528 | Second Language Teaching (SLT) | SLT | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759957 | 7070836632642528 | Social Sciences (SSCI) | SSCI | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759958 | 7070836632642528 | Social Work (SW) | SW | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759959 | 7070836632642528 | Sociology (SOC) | SOC | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759960 | 7070836632642528 | Spanish (SPAN) | SPAN | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759961 | 7070836632642528 | Speech (SP) | SP | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759962 | 7070836632642528 | Theatre (THEA) | THEA | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070842163759963 | 7070836632642528 | Women's Studies (WS) | WS | NULL | NULL | NULL | NULL | 0 | 1 |

With the steps above, we have built a complete WebMagic spider example that extracts subject data from a web page and stores it in the database.
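
In the sample data, ERTH and HDFS each appear twice because their list items hold two links with the same code. If one row per subject code is preferred, the names could be merged before saving. The sketch below shows one possible approach and is not part of the original pipeline; SubjectMerger is a hypothetical helper that works on plain lists of codes and names.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: joins the names of entries that share a subject code. */
final class SubjectMerger {
  /** Returns code -> merged name, keeping the order in which codes first appear. */
  static Map<String, String> mergeByCode(List<String> codes, List<String> names) {
    if (codes.size() != names.size()) {
      throw new IllegalArgumentException("codes and names must have the same length");
    }
    Map<String, String> merged = new LinkedHashMap<>();
    for (int i = 0; i < codes.size(); i++) {
      // First occurrence stores the name; later ones are appended after a space.
      merged.merge(codes.get(i), names.get(i).trim(), (first, next) -> first + " " + next);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> codes = Arrays.asList("ERTH", "ERTH", "EALL");
    List<String> names = Arrays.asList(
        "Earth Sciences",
        "(ERTH formerly GG Geology & Geophysics)",
        "East Asian Languages & Lit (EALL)");
    // Two entries remain: ERTH with the joined name, then EALL.
    mergeByCode(codes, names).forEach((code, name) -> System.out.println(code + ": " + name));
  }
}
```

A LinkedHashMap is used so that the merged subjects keep the page order.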

4 course

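course is the main data we want to scrape.

1 Page data analysis

Target page: https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC

This page lists the Accounting courses (s=ACC) offered by Kapi'olani Community College in the Fall 2024 term (t=202510). The course details on the page include the course reference number (CRN), course title, credits, instructor, meeting times, location, and number of available seats.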


<body>
  <div class="header">
    <p class="system">
      <a href="./avail.classes" class="system_link">University of Hawaii</a>
      <a
        href="https://hawaii-kapiolani.verbacompare.com/"
        target="_blank"
        class="book_link,"
        style="color: white; text-decoration: none; float: right"
        >Textbooks/Course Materials</a
      >
    </p>
    <p class="institution">
      <a href="./avail.classes?i=KAP" class="title_link">Kapi'olani Community College</a>
      &#149; Fall 2024 Class Availability
    </p>
    <p class="transfer-info">
      <a href="http://www.hawaii.edu/admissions/transfers.html">(UH Transfer Information)</a>
    </p>
  </div>
  <div class="header2">
    <p class="backlink">
      <a href="./avail.classes?i=KAP&t=202510">Back to list of subjects</a>
      <b><big>&nbsp;&nbsp;Click on the CRN for additional class information.</big></b>
    </p>
  </div>
  <!-- header2 -->
  <h1>Accounting (ACC)</h1>
  <!-- f_listcrse called for i: 'KAP t: '202510' -->
  <!-- uh_avail_inst_rec.inst_desc: 'Kapi'olani Community College' -->
  <!-- SCRIPT_NAME: /uhdad -->
  <!-- PATH_INFO: /avail.classes -->
  <!-- VPDI DUMP -->
  <!-- session_user: WWW_USER -->
  <!-- sessionid: 304863132 -->
  <!-- type_context: CC -->
  <!-- code_context: ALLQ -->
  <!-- lmod_context:  -->
  <!-- inst_code: MAU -->
  <!-- multi_context: OVERRIDE -->
  <!-- proc_context: MAN -->
  <!-- hierarchy_context: RESTRICT -->
  <!-- primary_context:  -->
  <!-- secondary_context:  -->
  <!-- current_inst_code: MAN -->
  <!-- sect_rec: i 'KAP', term: '202510', ssbsect_crn: '31175', ssrmeet: '31175' -->
  <!-- pt. a -->
  <!-- pt. b -->
  <!-- pt. c -->
  <!-- pt. d -->
  <!-- pt. e -->
  <table class="listOfClasses" border="0" cellpadding="2" cellspacing="0" width="100%">
    <!-- pt. f -->
    <!-- uh_avail_inst_rec.show_enrl = Y -->
    <thead>
      <tr style="background:#D4D4D4;">
        <th>
          <abbr title="Requirements & Designation Codes (see legend below)">GenEd/Focus/<br />Special Des.</abbr>
        </th>
        <th>CRN</th>
        <th>Course</th>
        <th>Section</th>
        <th>Title</th>
        <th>Credits</th>
        <th>Instructor</th>
        <th>
          <abbr title="Currently Enrolled">Curr.<br />Enrolled</abbr>
        </th>
        <th>
          <abbr title="Number of seats still available">Seats<br />avail.</abbr>
        </th>
        <th>Days</th>
        <th>Time</th>
        <th>Room</th>
        <th>Dates</th>
      </tr>
    </thead>
    <!-- pt. g -->
    <tr class="odd">
      <td class="default">IDAP</td>
      <!-- pt. h -->
      <td class="default"><a href="./avail.class?i=KAP&t=202510&c=31175">31175</a></td>
      <td nowrap="nowrap" class="default">ACC 132</td>
      <td class="default">0</td>
      <!-- ssrmeet_array_length = 0 -->
      <!-- ssrmeet_array_length = 1 -->
      <!-- comment_text=Comment:<br><br>Prerequisite(s): Concurrent enrollment in ACC 124 or ACC 201, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson. <br><br>Recommended Preparation: Credit or concurrent enrollment in ICS 101; and credit or concurrent enrollment in ENG 22 or ESOL 94F or ESOL 94S, or qualification for ENG 100 or ESL 100.<br><br>This ACC 132 class section (CRN 31191) is conducted online. This class is fully ASYNCHRONOUS: students are never required to meet with each other and/or the instructor at a specific time. However, the class includes one or more OPTIONAL synchronous online activities. Instructor will work with students to schedule meeting day/time for the optional synchronous activities. This class does not include proctored exams.  This class does not require fieldwork.<br><br>This course uses Cengage Unlimited and CNOW2 homework system.  This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined.  The instructor will provide instructions on how to access your e-book and other digital course material, including the CNOW2 homework system.  If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp. <br><br>This section of ACC 132 (31175) is cross-listed with ACC 132 (31536). One class, one instructor, dual formats.  Register for CRN 31175 for online asynchronous class format OR register for CRN 31536 for hybrid class format.  To take an online or hybrid class, you need regular access to a desktop or laptop computer and reliable Internet connection.<br><br>Email the program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email account regularly for class notifications. -->
      <td class="default">Payroll/Hawai`i Gen Excise Tax</td>
      <td class="default">3</td>
      <td class="default">
        <abbr title="Leanne N Matsumoto"><span class="abbr" title="Leanne N Matsumoto">L N Matsumoto</span></abbr>
      </td>
      <td class="default" align="center">17</td>
      <td class="default" align="center">4</td>
      <td class="default">TBA</td>
      <td class="default">TBA</td>
      <td class="default">
        <abbr title="Online Online Asynchronous"
          ><span class="abbr" title="Online Online Asynchronous">ONLINE ASYNC</span></abbr
        >
      </td>
      <td class="default">08/26-12/20</td>
    </tr>
    <tr class="odd">
      <td colspan="13">
        Comment:<br /><br />Prerequisite(s): Concurrent enrollment in ACC 124 or ACC 201, or consent of the instructor
        or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson.
        <br /><br />Recommended Preparation: Credit or concurrent enrollment in ICS 101; and credit or concurrent
        enrollment in ENG 22 or ESOL 94F or ESOL 94S, or qualification for ENG 100 or ESL 100.<br /><br />This ACC 132
        class section (CRN 31191) is conducted online. This class is fully ASYNCHRONOUS: students are never required to
        meet with each other and/or the instructor at a specific time. However, the class includes one or more OPTIONAL
        synchronous online activities. Instructor will work with students to schedule meeting day/time for the optional
        synchronous activities. This class does not include proctored exams. This class does not require fieldwork.<br /><br />This
        course uses Cengage Unlimited and CNOW2 homework system. This class will be participating in the Kapi’olani
        Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined. The
        instructor will provide instructions on how to access your e-book and other digital course material, including
        the CNOW2 homework system. If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ
        page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp. <br /><br />This section of ACC 132 (31175) is
        cross-listed with ACC 132 (31536). One class, one instructor, dual formats. Register for CRN 31175 for online
        asynchronous class format OR register for CRN 31536 for hybrid class format. To take an online or hybrid class,
        you need regular access to a desktop or laptop computer and reliable Internet connection.<br /><br />Email the
        program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email
        account regularly for class notifications.
      </td>
    </tr>
    <!-- sect_rec: i 'KAP', term: '202510', ssbsect_crn: '31536', ssrmeet: '31536' -->
    <!-- pt. a -->
    <!-- pt. b -->
    <!-- pt. c -->
    <!-- pt. d -->
    <!-- pt. e -->
    <!-- pt. f -->
    <!-- pt. g -->
    <tr>
      <td class="default">IDAP</td>
      <!-- pt. h -->
      <td class="default"><a href="./avail.class?i=KAP&t=202510&c=31536">31536</a></td>
      <td nowrap="nowrap" class="default">ACC 132</td>
      <td class="default">0</td>
      <!-- ssrmeet_array_length = 0 -->
      <!-- ssrmeet_array_length = 1 -->
      <!-- ssrmeet_array_length = 2 -->
      <!-- comment_text=Comment:<br><br>Prerequisite(s): Concurrent enrollment in ACC 124 or ACC 201, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson. <br><br>Recommended Preparation: Credit or concurrent enrollment in ICS 101; and credit or concurrent enrollment in ENG 22 or ESOL 94F or ESOL 94S, or qualification for ENG 100 or ESL 100.<br><br>This ACC 132 class section (CRN 31586) is conducted partially online and partially face-to-face. Class meetings are scheduled for Thursdays, 5:00 - 5:50 pm on campus, with the remaining class materials and content delivered online. This class is not self-paced. This class does not include proctored exams. This class does not require fieldwork. The class also includes one or more OPTIONAL synchronous online activities. Optional one-on-one conferences can be arranged with the instructor. <br><br>This course uses Cengage Unlimited and CNOW2 homework system. This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined.  The instructor will provide instructions on how to access your e-book and other digital course material, including the CNOW2 homework system.  If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp. <br><br>This section of ACC 132 (31536) is cross listed with ACC 132 (31175). One class, one instructor, dual formats.  Register for CRN 31175 for online asynchronous class format OR register for CRN 31536 for hybrid class format.  To take an online or hybrid class, you need regular access to a desktop or laptop computer and reliable Internet connection.<br><br>Email the program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email account regularly for class notifications. -->
      <td class="default">Payroll/Hawai`i Gen Excise Tax</td>
      <td class="default">3</td>
      <td class="default">
        <abbr title="Leanne N Matsumoto"><span class="abbr" title="Leanne N Matsumoto">L N Matsumoto</span></abbr>
      </td>
      <td class="default" align="center">4</td>
      <td class="default" align="center">4</td>
      <td class="default">TBA</td>
      <td class="default">TBA</td>
      <td class="default">
        <abbr title="Online Online Asynchronous"
          ><span class="abbr" title="Online Online Asynchronous">ONLINE ASYNC</span></abbr
        >
      </td>
      <td class="default">08/26-12/20</td>
    </tr>
    <tr>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">&nbsp;</td>
      <td class="default">R</td>
      <td nowrap="nowrap" class="default">0500-<spacer />0550p</td>
      <td class="default">
        <abbr title="Kopiko Room 101B"><span class="abbr" title="Kopiko Room 101B">KOPIKO 101B</span></abbr>
      </td>
      <td class="default">08/26-12/20</td>
    </tr>
    <tr>
      <td colspan="13">
        Comment:<br /><br />Prerequisite(s): Concurrent enrollment in ACC 124 or ACC 201, or consent of the instructor
        or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson.
        <br /><br />Recommended Preparation: Credit or concurrent enrollment in ICS 101; and credit or concurrent
        enrollment in ENG 22 or ESOL 94F or ESOL 94S, or qualification for ENG 100 or ESL 100.<br /><br />This ACC 132
        class section (CRN 31586) is conducted partially online and partially face-to-face. Class meetings are scheduled
        for Thursdays, 5:00 - 5:50 pm on campus, with the remaining class materials and content delivered online. This
        class is not self-paced. This class does not include proctored exams. This class does not require fieldwork. The
        class also includes one or more OPTIONAL synchronous online activities. Optional one-on-one conferences can be
        arranged with the instructor. <br /><br />This course uses Cengage Unlimited and CNOW2 homework system. This
        class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to
        your MyUH account, price to be determined. The instructor will provide instructions on how to access your e-book
        and other digital course material, including the CNOW2 homework system. If you have any questions, please
        contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp.
        <br /><br />This section of ACC 132 (31536) is cross listed with ACC 132 (31175). One class, one instructor,
        dual formats. Register for CRN 31175 for online asynchronous class format OR register for CRN 31536 for hybrid
        class format. To take an online or hybrid class, you need regular access to a desktop or laptop computer and
        reliable Internet connection.<br /><br />Email the program coordinator (seabolt@hawaii.edu) as needed for
        additional information and check your hawaii.edu email account regularly for class notifications.
      </td>
    </tr>
    <!-- sect_rec: i 'KAP', term: '202510', ssbsect_crn: '31536', ssrmeet: '31536' -->
    <!-- sect_rec: i 'KAP', term: '202510', ssbsect_crn: '31465', ssrmeet: '31465' -->
    <!-- pt. a -->
    <!-- pt. b -->
    <!-- pt. c -->
    <!-- pt. d -->
    <!-- pt. e -->
    <!-- pt. f -->
    <!-- pt. g -->
    <tr class="odd">
      <td class="default">IDAP</td>
      <!-- pt. h -->
      <td class="default"><a href="./avail.class?i=KAP&t=202510&c=31465">31465</a></td>
      <td nowrap="nowrap" class="default">ACC 134</td>
      <td class="default">0</td>
      <!-- ssrmeet_array_length = 0 -->
      <!-- ssrmeet_array_length = 1 -->
      <!-- comment_text=Comment:<br><br>Prerequisite(s): Qualification for ENG 22 or ESOL 94F or ESOL 94S or qualification for an equivalent course, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson.<br><br>Recommended Preparation: Credit or concurrent enrollment in ICS 101. <br><br>This ACC 134 class section (CRN 31507) is conducted online. This class is fully ASYNCHRONOUS: students are never required to meet with each other and/or the instructor at a specific time. However, the class includes one or more OPTIONAL synchronous online activities. Instructor will work with students to schedule meeting day/time for the optional synchronous activities. This class does not include proctored exams.  This class does not require fieldwork.<br><br>This course uses Cengage Unlimited and CNOW2 homework system.  This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined.  The instructor will provide instructions on how to access your e-book and other digital course material, including the CNOW2 homework system.  If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp. <br><br>This section of ACC 134 (31465) is cross-listed with ACC 134 (31537). One class, one instructor, dual formats.  Register for CRN 31465 for online asynchronous class format OR register for CRN 31537 for hybrid class format.  To take an online or hybrid class, you need regular access to a desktop or laptop computer and a reliable Internet connection.<br><br>Email the program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email account regularly for class notifications. -->
      <td class="default">Individual Income Tax Prep</td>
      <td class="default">3</td>
      <td class="default">
        <abbr title="Roy Y Kamida"><span class="abbr" title="Roy Y Kamida">R Y Kamida</span></abbr>
      </td>
      <td class="default" align="center">17</td>
      <td class="default" align="center">3</td>
      <td class="default">TBA</td>
      <td class="default">TBA</td>
      <td class="default">
        <abbr title="Online Online Asynchronous"
          ><span class="abbr" title="Online Online Asynchronous">ONLINE ASYNC</span></abbr
        >
      </td>
      <td class="default">08/26-12/20</td>
    </tr>
    <tr class="odd">
      <td colspan="13">
        Comment:<br /><br />Prerequisite(s): Qualification for ENG 22 or ESOL 94F or ESOL 94S or qualification for an
        equivalent course, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal &
        Technology Education Department Chairperson.<br /><br />Recommended Preparation: Credit or concurrent enrollment
        in ICS 101. <br /><br />This ACC 134 class section (CRN 31507) is conducted online. This class is fully
        ASYNCHRONOUS: students are never required to meet with each other and/or the instructor at a specific time.
        However, the class includes one or more OPTIONAL synchronous online activities. Instructor will work with
        students to schedule meeting day/time for the optional synchronous activities. This class does not include
        proctored exams. This class does not require fieldwork.<br /><br />This course uses Cengage Unlimited and CNOW2
        homework system. This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental
        Charge will be added to your MyUH account, price to be determined. The instructor will provide instructions on
        how to access your e-book and other digital course material, including the CNOW2 homework system. If you have
        any questions, please contact the Kapi’olani bookstore or visit our FAQ page at
        https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp. <br /><br />This section of ACC 134 (31465) is
        cross-listed with ACC 134 (31537). One class, one instructor, dual formats. Register for CRN 31465 for online
        asynchronous class format OR register for CRN 31537 for hybrid class format. To take an online or hybrid class,
        you need regular access to a desktop or laptop computer and a reliable Internet connection.<br /><br />Email the
        program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email
        account regularly for class notifications.
      </td>
    </tr>
    <!-- sect_rec: i 'KAP', term: '202510', ssbsect_crn: '31537', ssrmeet: '31537' -->
    <!-- pt. a -->
    <!-- pt. b -->
    <!-- pt. c -->
    <!-- pt. d -->
    <!-- pt. e -->
    <!-- pt. f -->
    <!-- pt. g -->
    <tr></tr>
  </table>
</body>

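HTML structure of the page:

<body>
  <div class="header">
    <!-- Contains the school name and navigation links -->
  </div>
  <div class="header2">
    <!-- Provides a link back to the course list -->
  </div>
  <h1>Accounting (ACC)</h1>
  <table class="listOfClasses" border="0" cellpadding="2" cellspacing="0" width="100%">
    <thead>
      <!-- Table header defining the fields of the course information -->
    </thead>
    <tr class="odd">
      <!-- Table row: each row represents one course, with CRN, title, instructor, time, and so on -->
    </tr>
    <tr class="odd">
      <!-- Some rows hold course details such as comments or prerequisites -->
    </tr>
  </table>
</body>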

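Data analysis:

  • Table: course information is in the <table class="listOfClasses"> table, and each course row is a <tr> element.
  • Course details: CRN, course title, credits, instructor, meeting time, room, and so on.
  • Special cases: some courses carry an extra comment, usually in a separate <tr> element apart from the course row.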

2. Create the database table

First, create the table spider_us_hi_uh_course to store the scraped course data. The SQL is as follows:

DROP TABLE IF EXISTS "spider_us_hi_uh_course";
CREATE TABLE "spider_us_hi_uh_course" (
  "id" "pg_catalog"."int8" NOT NULL PRIMARY KEY,
  "institution" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "term" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "subject_abbr" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "subject_name" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "focus_on" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "crn" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "course" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "section" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "title" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "credits" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "instructor" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "curr_enrolled" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "seats_avail" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "curr_waitlisted" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "wait_avail" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "days" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "time" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "room" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "dates" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "details_url" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "sources_url" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "remark" "pg_catalog"."text" COLLATE "pg_catalog"."default",
  "creator" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "create_time" "pg_catalog"."timestamp",
  "updater" "pg_catalog"."varchar" COLLATE "pg_catalog"."default",
  "update_time" "pg_catalog"."timestamp",
  "deleted" "pg_catalog"."int2" DEFAULT 0,
  "tenant_id" "pg_catalog"."int8" DEFAULT 1
);

This table stores the details of each course, including institution, term, CRN, instructor, time, and room.

3. Write the crawler

1 Create a PageProcessor to parse the page

PageProcessor is the WebMagic interface that defines how a page is fetched and parsed. Here we create a KapCourseProcessor class:

package com.litongjava.open.chat.spider.uh.course;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.litongjava.tio.utils.hutool.StrUtil;

import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.processor.PageProcessor;


@Slf4j
public class KapCourseProcessor implements PageProcessor {
  private Site site = Site.me().setRetryTimes(3).setSleepTime(10000);


  @Override
  public void process(Page page) {
    Elements rows = page.getHtml().getDocument().select("table.listOfClasses > tbody > tr");
    String sourceUrl = page.getUrl().toString();

    List<Map<String, Object>> dataList = new ArrayList<>();
    int rowSize = rows.size();
    log.info("row size:{}", rowSize);
    for (int i = 0; i < rowSize; i++) {
      Element row = rows.get(i);
      Elements columns = row.select("td");

      int size = columns.size(); // 15 for a normal request
      if (size == 15) {
        i = canEnrolled(i, row, columns, sourceUrl, dataList);
      }
      if (size == 13) {
        i = canNotEnrolled(i, row, columns, sourceUrl, dataList);
      }
    }
    page.putField("dataList", dataList);
    log.info("dataList size:{}", dataList.size());

  }

  /**
   * can not be Enrolled columns is 13
   */
  private int canNotEnrolled(int i, Element row, Elements columns, String sourceUrl, List<Map<String, Object>> dataList) {
    String focusOn = columns.get(0).text().trim();
    String crn = columns.get(1).text().trim();
    String course = columns.get(2).text().trim();
    String section = columns.get(3).text().trim();
    String title = columns.get(4).text().trim();
    String credits = columns.get(5).text().trim();
    String instructor = columns.get(6).text().trim();
    String currEnrolled = columns.get(7).text().trim();
    String seatsAvail = columns.get(8).text().trim();
    String days = columns.get(9).text().trim();
    String time = columns.get(10).text().trim();
    String room = columns.get(11).text().trim();
    String dates = columns.get(12).text().trim();
    String detailsUrl = columns.get(1).select("a").attr("href").trim();

    Map<String, Object> dataMap = new HashMap<>();
    dataMap.put("focus_on", focusOn);
    dataMap.put("crn", crn);
    dataMap.put("course", course);
    dataMap.put("section", section);
    dataMap.put("title", title);
    dataMap.put("credits", credits);
    dataMap.put("instructor", instructor);
    dataMap.put("curr_enrolled", currEnrolled);
    dataMap.put("seats_avail", seatsAvail);
    dataMap.put("days", days);
    dataMap.put("time", time);
    dataMap.put("room", room);
    dataMap.put("dates", dates);
    dataMap.put("details_url", detailsUrl);
    dataMap.put("sources_url", sourceUrl);
    dataList.add(dataMap);

    Element nextRow = row.nextElementSibling();
    if (nextRow != null) {
      Elements nextRowTds = nextRow.select("td");
      if (nextRowTds.size() == 13) {
        // The next row is still course data in the same 13-column layout, not Prerequisite(s) data
        String nextFocusOn = nextRowTds.get(0).text().trim();
        nextFocusOn = ifNull(nextFocusOn, focusOn);
        String nextCrn = nextRowTds.get(1).text().trim();
        nextCrn = ifNull(nextCrn, crn);
        String nextCourse = nextRowTds.get(2).text().trim();
        nextCourse = ifNull(nextCourse, course);
        String nextSection = nextRowTds.get(3).text().trim();
        nextSection = ifNull(nextSection, section);
        String nextTitle = nextRowTds.get(4).text().trim();
        nextTitle = ifNull(nextTitle, title);
        String nextCredits = nextRowTds.get(5).text().trim();
        nextCredits = ifNull(nextCredits, credits);
        String nextInstructor = nextRowTds.get(6).text().trim();
        nextInstructor = ifNull(nextInstructor, instructor);
        String nextCurrEnrolled = nextRowTds.get(7).text().trim();
        nextCurrEnrolled = ifNull(nextCurrEnrolled, currEnrolled);
        String nextSeatsAvail = nextRowTds.get(8).text().trim();
        nextSeatsAvail = ifNull(nextSeatsAvail, seatsAvail);
        String nextDays = nextRowTds.get(9).text().trim();
        nextDays = ifNull(nextDays, days);
        String nextTime = nextRowTds.get(10).text().trim();
        nextTime = ifNull(nextTime, time);
        String nextRoom = nextRowTds.get(11).text().trim();
        nextRoom = ifNull(nextRoom, room);
        String nextDates = nextRowTds.get(12).text().trim();
        nextDates = ifNull(nextDates, dates);

        Map<String, Object> nextRowMap = new HashMap<>();
        nextRowMap.put("focus_on", nextFocusOn);
        nextRowMap.put("crn", nextCrn);
        nextRowMap.put("course", nextCourse);
        nextRowMap.put("section", nextSection);
        nextRowMap.put("title", nextTitle);
        nextRowMap.put("credits", nextCredits);
        nextRowMap.put("instructor", nextInstructor);
        nextRowMap.put("curr_enrolled", nextCurrEnrolled);
        nextRowMap.put("seats_avail", nextSeatsAvail);
        nextRowMap.put("days", nextDays);
        nextRowMap.put("time", nextTime);
        nextRowMap.put("room", nextRoom);
        nextRowMap.put("dates", nextDates);
        nextRowMap.put("details_url", detailsUrl);
        nextRowMap.put("sources_url", sourceUrl);
        dataList.add(nextRowMap);
        i = i + 2;
      } else if (!nextRowTds.isEmpty()) {
        // The next row holds the comment: store it as the remark and skip it
        String html = nextRowTds.first().html();
        dataMap.put("remark", html);
        i++;
      }

    }
    return i;
  }

  /**
   * can be Enrolled columns is 15
   */
  private int canEnrolled(int i, Element row, Elements columns, String sourceUrl, List<Map<String, Object>> dataList) {
    String focusOn = columns.get(0).text().trim();
    String crn = columns.get(1).text().trim();
    String course = columns.get(2).text().trim();
    String section = columns.get(3).text().trim();
    String title = columns.get(4).text().trim();
    String credits = columns.get(5).text().trim();
    String instructor = columns.get(6).text().trim();
    String currEnrolled = columns.get(7).text().trim();
    String seatsAvail = columns.get(8).text().trim();
    String currWaitlisted = columns.get(9).text().trim();
    String waitAvail = columns.get(10).text().trim();
    String days = columns.get(11).text().trim();
    String time = columns.get(12).text().trim();
    String room = columns.get(13).text().trim();
    String detailsUrl = columns.get(1).select("a").attr("href").trim();
    String dates = columns.get(14).text().trim();

    Map<String, Object> dataMap = new HashMap<>();
    dataMap.put("focus_on", focusOn);
    dataMap.put("crn", crn);
    dataMap.put("course", course);
    dataMap.put("section", section);
    dataMap.put("title", title);
    dataMap.put("credits", credits);
    dataMap.put("instructor", instructor);
    dataMap.put("curr_enrolled", currEnrolled);
    dataMap.put("seats_avail", seatsAvail);
    dataMap.put("curr_waitlisted", currWaitlisted);
    dataMap.put("wait_avail", waitAvail);
    dataMap.put("days", days);
    dataMap.put("time", time);
    dataMap.put("room", room);
    dataMap.put("dates", dates);
    dataMap.put("details_url", detailsUrl);
    dataMap.put("sources_url", sourceUrl);
    dataList.add(dataMap);

    Element nextRow = row.nextElementSibling();
    if (nextRow != null) {
      Elements nextRowTds = nextRow.select("td");
      if (nextRowTds.size() == 15) {
        // The next row is still course data, not Prerequisite(s) data
        String nextFocusOn = nextRowTds.get(0).text().trim();
        nextFocusOn = ifNull(nextFocusOn, focusOn);
        String nextCrn = nextRowTds.get(1).text().trim();
        nextCrn = ifNull(nextCrn, crn);
        String nextCourse = nextRowTds.get(2).text().trim();
        nextCourse = ifNull(nextCourse, course);
        String nextSection = nextRowTds.get(3).text().trim();
        nextSection = ifNull(nextSection, section);
        String nextTitle = nextRowTds.get(4).text().trim();
        nextTitle = ifNull(nextTitle, title);
        String nextCredits = nextRowTds.get(5).text().trim();
        nextCredits = ifNull(nextCredits, credits);
        String nextInstructor = nextRowTds.get(6).text().trim();
        nextInstructor = ifNull(nextInstructor, instructor);
        String nextCurrEnrolled = nextRowTds.get(7).text().trim();
        nextCurrEnrolled = ifNull(nextCurrEnrolled, currEnrolled);
        String nextSeatsAvail = nextRowTds.get(8).text().trim();
        nextSeatsAvail = ifNull(nextSeatsAvail, seatsAvail);
        String nextCurrWaitlisted = nextRowTds.get(9).text().trim();
        nextCurrWaitlisted = ifNull(nextCurrWaitlisted, currWaitlisted);
        String nextWaitAvail = nextRowTds.get(10).text().trim();
        nextWaitAvail = ifNull(nextWaitAvail, waitAvail);
        String nextDays = nextRowTds.get(11).text().trim();
        nextDays = ifNull(nextDays, days);
        String nextTime = nextRowTds.get(12).text().trim();
        nextTime = ifNull(nextTime, time);
        String nextRoom = nextRowTds.get(13).text().trim();
        nextRoom = ifNull(nextRoom, room);
        String nextDates = nextRowTds.get(14).text().trim();
        nextDates = ifNull(nextDates, dates);

        Map<String, Object> nextRowMap = new HashMap<>();
        nextRowMap.put("focus_on", nextFocusOn);
        nextRowMap.put("crn", nextCrn);
        nextRowMap.put("course", nextCourse);
        nextRowMap.put("section", nextSection);
        nextRowMap.put("title", nextTitle);
        nextRowMap.put("credits", nextCredits);
        nextRowMap.put("instructor", nextInstructor);
        nextRowMap.put("curr_enrolled", nextCurrEnrolled);
        nextRowMap.put("seats_avail", nextSeatsAvail);
        nextRowMap.put("curr_waitlisted", nextCurrWaitlisted);
        nextRowMap.put("wait_avail", nextWaitAvail);
        nextRowMap.put("days", nextDays);
        nextRowMap.put("time", nextTime);
        nextRowMap.put("room", nextRoom);
        nextRowMap.put("dates", nextDates);
        nextRowMap.put("details_url", detailsUrl);
        nextRowMap.put("sources_url", sourceUrl);
        dataList.add(nextRowMap);
        i = i + 2;
      } else if (!nextRowTds.isEmpty()) {
        // The next row holds the comment: store it as the remark and skip it
        String html = nextRowTds.first().html();
        dataMap.put("remark", html);
        i++;
      }

    }
    return i;
  }

  private String ifNull(String source, String defaultValue) {
    // Fall back to the value from the course row when the cell is blank
    if (!StrUtil.notBlank(source)) {
      source = defaultValue;
    }
    return source;
  }

  @Override
  public Site getSite() {
    return site;
  }
}

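Code analysis:

1 process and getSite
  1. process method: extracts course data from the page.

    • Uses Jsoup to select the list of table row <tr> elements.
    • Iterates over the rows and uses the number of <td> elements to tell course rows from comment rows.
    • When a row has 15 <td> elements, calls canEnrolled to handle a course that can be enrolled in.
    • When a row has 13 <td> elements, calls canNotEnrolled to handle a course that cannot be enrolled in.
    • Uses page.putField to store the parsed result in the ResultItems object for later processing.
  2. getSite method: configures the site settings of the crawler, such as the retry count and the interval between requests.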
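
How the value returned by the helpers moves the loop index can be easier to follow without Jsoup in the way. The following is a simplified sketch, not the processor itself: each row is reduced to its <td> count, courseWidth stands in for 15 or 13, and RowWalkSketch and countRecords are hypothetical names used only for illustration.

```java
final class RowWalkSketch {

  /** Counts the records the row walk would emit for rows described only by their td counts. */
  static int countRecords(int[] tdCounts, int courseWidth) {
    int records = 0;
    for (int i = 0; i < tdCounts.length; i++) {
      if (tdCounts[i] != courseWidth) {
        continue; // comment rows and empty rows produce no record of their own
      }
      records++; // the course row itself
      if (i + 1 < tdCounts.length) {
        if (tdCounts[i + 1] == courseWidth) {
          records++; // continuation row: another meeting time for the same course
          i += 2;    // combined with the loop's i++, the row after the continuation is skipped too
        } else {
          i++;       // the next row is consumed as the remark of this course
        }
      }
    }
    return records;
  }

  public static void main(String[] args) {
    // course + comment, course + continuation + comment, course + comment
    int[] rows = {15, 1, 15, 15, 1, 15, 1};
    System.out.println(countRecords(rows, 15));
  }
}
```

Returning the index from the helper keeps the look-ahead logic in one place, at the cost of a loop counter that is modified from two sites.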

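2 canEnrolled and canNotEnrolled methods

These two methods parse the two kinds of course rows:

  • canEnrolled: parses a course that can be enrolled in, including CRN, title, credits, instructor, meeting time, and so on.
  • canNotEnrolled: parses a course that cannot be enrolled in; it is similar to canEnrolled but has no waitlist fields.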
private int canEnrolled(int i, Element row, Elements columns, String sourceUrl, List<Map<String, Object>> dataList) {
  String focusOn = columns.get(0).text().trim();
  String crn = columns.get(1).text().trim();
  // Parsing of the other fields omitted

  Map<String, Object> dataMap = new HashMap<>();
  dataMap.put("focus_on", focusOn);
  dataMap.put("crn", crn);
  // Adding the other fields omitted

  dataList.add(dataMap);

  // Handle the comment in the next row
  Element nextRow = row.nextElementSibling();
  if (nextRow != null) {
    Elements nextRowTds = nextRow.select("td");
    if (nextRowTds.size() != 15) {
      String html = nextRowTds.first().html();
      dataMap.put("remark", html);
      i++;
    }
  }
  return i;
}
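  • Data handling: course data is extracted from the <td> elements and put into dataMap.
  • Remark: the method checks whether the next row holds a comment; if it does, the comment is added to dataMap.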
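
A continuation row leaves most of its cells empty and relies on the course row above it. Below is a minimal sketch of that fallback using only the JDK. ContinuationRowSketch, isBlankCell, and inherit are hypothetical names, and treating U+00A0 as blank is an assumption based on the &nbsp; cells in the sample HTML (neither String.trim() nor Character.isWhitespace treats a non-breaking space as whitespace).

```java
import java.util.Arrays;

final class ContinuationRowSketch {

  /** True when a cell holds nothing but whitespace or non-breaking spaces. */
  static boolean isBlankCell(String cell) {
    if (cell == null) {
      return true;
    }
    for (int k = 0; k < cell.length(); k++) {
      char c = cell.charAt(k);
      if (c != '\u00A0' && !Character.isWhitespace(c)) {
        return false;
      }
    }
    return true;
  }

  /**
   * Returns a copy of child in which every blank cell is taken from parent.
   * Assumes parent has at least as many cells as child.
   */
  static String[] inherit(String[] parent, String[] child) {
    String[] merged = new String[child.length];
    for (int k = 0; k < child.length; k++) {
      merged[k] = isBlankCell(child[k]) ? parent[k] : child[k];
    }
    return merged;
  }

  public static void main(String[] args) {
    String[] course = {"31536", "ACC 132", "TBA", "TBA", "ONLINE ASYNC"};
    String[] continuation = {"\u00A0", "\u00A0", "R", "0500-0550p", "KOPIKO 101B"};
    System.out.println(Arrays.toString(inherit(course, continuation)));
  }
}
```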

2 Create a Pipeline to save the data to the database

A Pipeline processes the extracted data and stores it in the database:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.tio.utils.mcid.McIdUtils;

import lombok.SneakyThrows;
import lombok.extern.slf4j.Slf4j;
import us.codecraft.webmagic.ResultItems;
import us.codecraft.webmagic.Task;
import us.codecraft.webmagic.pipeline.Pipeline;

/**
 * Created by litonglinux@qq.com on 2023/6/20_7:05
 */
@Slf4j
public class KapCoursePipeline implements Pipeline {

  String regex = "i=(\\w+)&t=(\\w+)&s=(\\w+)";
  String tableName = "spider_us_hi_uh_course";
  String deleteCoursesql = String.format("delete from %s where institution=? and term=? and subject_abbr=? and deleted=0", tableName);

  List<Row> records;

  public KapCoursePipeline(List<Row> records) {
    this.records = records;
  }

  @SneakyThrows
  @Override
  public void process(ResultItems resultItems, Task task) {
    List<Map<String, Object>> dataList = resultItems.get("dataList");

    Pattern pattern = Pattern.compile(regex);

    // Delete old data
    if (dataList.size() > 0) {
      Map<String, Object> courseMap = dataList.get(0);
      Object crn = courseMap.get("crn");
      Object sourcesUrl = courseMap.get("sources_url");
      String institutionName = null;
      String termName = null;
      String subjectName = null;

      Matcher matcher = pattern.matcher((String) sourcesUrl);

      if (matcher.find()) {
        institutionName = matcher.group(1);
        termName = matcher.group(2);
        subjectName = matcher.group(3);
      }

      String semesterName = getSemesterName(termName);
      log.info("i:{},t:{},s:{},semesterName:{},crn:{}", institutionName, termName, subjectName, semesterName, crn);
      // Delete the old rows for this institution, term, and subject
      Db.delete(deleteCoursesql, institutionName, semesterName, subjectName);
    }

    List<Row> saveRecords = new ArrayList<>();

    // Build one Row per course record for insertion
    for (Map<String, Object> courseMap : dataList) {
      Object crn = courseMap.get("crn");
      Object sourcesUrl = courseMap.get("sources_url");
      String i = null;
      String t = null;
      String s = null;

      Matcher matcher = pattern.matcher((String) sourcesUrl);

      if (matcher.find()) {
        i = matcher.group(1);
        t = matcher.group(2);
        s = matcher.group(3);
      }

      String semesterName = getSemesterName(t);
      log.info("i:{},t:{},s:{},semesterName:{},crn:{}", i, t, s, semesterName, crn);

      Row row = new Row();
      row.put("id", McIdUtils.id());
      row.put("institution", i);
      row.put("term", semesterName);
      row.put("subject_abbr", s);
      row.setColumns(courseMap);
      saveRecords.add(row);
    }

    Db.tx(() -> {
      // Note: TRUNCATE clears the whole table, not only the subject deleted above
      Db.delete("truncate table " + tableName);
      Db.batchSave(tableName, saveRecords, saveRecords.size());
      return true;
    });
  }

  private String getSemesterName(String t) {
    for (Row row : records) {
      if (row.getStr("t").equals(t)) {
        return row.getStr("name");
      }
    }
    return null;
  }
}

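Code analysis:

  1. process method: saves the course data held in ResultItems to the database.
    • Iterates over dataList, creating a Row object for each record and setting its fields.
    • Uses Db.batchSave to save the rows to the spider_us_hi_uh_course table in one batch.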
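
The pipeline derives the institution, the term code, and the subject abbreviation from the listing URL, then maps the term code to a display name. Here is a stand-alone sketch of those two steps with the same regular expression. The ListingUrlSketch class and the in-memory Map (standing in for the spider_us_hi_uh_semester rows) are assumptions for illustration; the 202510 to Fall 2024 pair is taken from the sample data in this tutorial.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ListingUrlSketch {
  // Same expression as the regex field in KapCoursePipeline
  private static final Pattern PARAMS = Pattern.compile("i=(\\w+)&t=(\\w+)&s=(\\w+)");

  /** Returns {institution, termCode, subject}, or null when the URL does not match. */
  static String[] parse(String url) {
    Matcher m = PARAMS.matcher(url);
    if (!m.find()) {
      return null;
    }
    return new String[] {m.group(1), m.group(2), m.group(3)};
  }

  /** Looks up the display name of a term code; returns null for an unknown or missing code. */
  static String semesterName(Map<String, String> semesters, String termCode) {
    return termCode == null ? null : semesters.get(termCode);
  }

  public static void main(String[] args) {
    Map<String, String> semesters = Map.of("202510", "Fall 2024");
    String[] p = parse("https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC");
    System.out.println(p[0] + " / " + semesterName(semesters, p[1]) + " / " + p[2]);
  }
}
```

Because the expression expects the parameters in the order i, t, s, a URL with a different parameter order would not match; the null return makes that case explicit for the caller.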

4. Run the crawler service

Create a KapCourseService class to start the crawler:

package com.litongjava.open.chat.spider.uh.course;

import java.util.List;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import us.codecraft.webmagic.Spider;

public class KapCourseService {

  public void index() {
    List<Row> all = Db.findAll("spider_us_hi_uh_semester");
    String url = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC";
    int threadNum = 1;
    Spider.create(new KapCourseProcessor())
      .addUrl(url)
      .addPipeline(new KapCoursePipeline(all))
      .thread(threadNum).run();
  }
}

5. Test the program

Write a test to verify the crawler:

package com.litongjava.open.chat.spider.uh.course;

import java.util.List;
import org.junit.Test;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.open.chat.config.DbConfig;
import com.litongjava.table.utils.MarkdownTableUtils;
import com.litongjava.tio.utils.environment.EnvUtils;

public class KapCourseServiceTest {

  @Test
  public void test() {
    EnvUtils.load();
    new DbConfig().config();
    Aop.get(KapCourseService.class).index();
  }

  @Test
  public void findAll() {
    EnvUtils.load();
    new DbConfig().config();
    List<Row> all = Db.findAll("spider_us_hi_uh_course");
    System.out.println(MarkdownTableUtils.to(all));
  }
}

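4 Sample data

With the steps above, we have built a complete WebMagic crawler example that extracts course data from a web page and stores it in the database. The stored data looks like this: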

| id | institution | term | subject_abbr | subject_name | focus_on | crn | course | section | title | credits | instructor | curr_enrolled | seats_avail | curr_waitlisted | wait_avail | days | time | room | dates | details_url | sources_url | remark | creator | create_time | updater | update_time | deleted | tenant_id |
| 7070851012184720 | KAP | Fall 2024 | ACC | NULL | IDAP | 31175 | ACC 132 | 0 | Payroll/Hawai`i Gen Excise Tax | 3 | L N Matsumoto | 17 | 4 | NULL | NULL | TBA | TBA | ONLINE ASYNC | 08/26-12/20 | ./avail.class?i=KAP&t=202510&c=31175 | https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC | Comment:

Prerequisite(s): Concurrent enrollment in ACC 124 or ACC 201, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson.

Recommended Preparation: Credit or concurrent enrollment in ICS 101; and credit or concurrent enrollment in ENG 22 or ESOL 94F or ESOL 94S, or qualification for ENG 100 or ESL 100.

This ACC 132 class section (CRN 31191) is conducted online. This class is fully ASYNCHRONOUS: students are never required to meet with each other and/or the instructor at a specific time. However, the class includes one or more OPTIONAL synchronous online activities. Instructor will work with students to schedule meeting day/time for the optional synchronous activities. This class does not include proctored exams. This class does not require fieldwork.

This course uses Cengage Unlimited and CNOW2 homework system. This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined. The instructor will provide instructions on how to access your e-book and other digital course material, including the CNOW2 homework system. If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp.

This section of ACC 132 (31175) is cross-listed with ACC 132 (31536). One class, one instructor, dual formats. Register for CRN 31175 for online asynchronous class format OR register for CRN 31536 for hybrid class format. To take an online or hybrid class, you need regular access to a desktop or laptop computer and reliable Internet connection.

Email the program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email account regularly for class notifications. | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070851012184721 | KAP | Fall 2024 | ACC | NULL | IDAP | 31536 | ACC 132 | 0 | Payroll/Hawai`i Gen Excise Tax | 3 | L N Matsumoto | 4 | 4 | NULL | NULL | TBA | TBA | ONLINE ASYNC | 08/26-12/20 | ./avail.class?i=KAP&t=202510&c=31536 | https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC |   | NULL | NULL | NULL | NULL | 0 | 1 |
| 7070851012192912 | KAP | Fall 2024 | ACC | NULL | IDAP | 31465 | ACC 134 | 0 | Individual Income Tax Prep | 3 | R Y Kamida | 17 | 3 | NULL | NULL | TBA | TBA | ONLINE ASYNC | 08/26-12/20 | ./avail.class?i=KAP&t=202510&c=31465 | https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC | Comment:

Prerequisite(s): Qualification for ENG 22 or ESOL 94F or ESOL 94S or qualification for an equivalent course, or consent of the instructor or the Accounting Program Coordinator or the Business, Legal & Technology Education Department Chairperson.

Recommended Preparation: Credit or concurrent enrollment in ICS 101.

This ACC 134 class section (CRN 31507) is conducted online. This class is fully ASYNCHRONOUS: students are never required to meet with each other and/or the instructor at a specific time. However, the class includes one or more OPTIONAL synchronous online activities. Instructor will work with students to schedule meeting day/time for the optional synchronous activities. This class does not include proctored exams. This class does not require fieldwork.

This course uses Cengage Unlimited and CNOW2 homework system. This class will be participating in the Kapi’olani Bookstore’s IDAP program. An IDAP Rental Charge will be added to your MyUH account, price to be determined. The instructor will provide instructions on how to access your e-book and other digital course material, including the CNOW2 homework system. If you have any questions, please contact the Kapi’olani bookstore or visit our FAQ page at https://www.bookstore.hawaii.edu/uhkcc/site_IDAP.asp.

This section of ACC 134 (31465) is cross-listed with ACC 134 (31537). One class, one instructor, dual formats. Register for CRN 31465 for online asynchronous class format OR register for CRN 31537 for hybrid class format. To take an online or hybrid class, you need regular access to a desktop or laptop computer and a reliable Internet connection.

Email the program coordinator (seabolt@hawaii.edu) as needed for additional information and check your hawaii.edu email account regularly for class notifications. | NULL | NULL | NULL | NULL | 0 | 1 |
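
The details_url column holds the link as it appears in the page, which is relative to the listing URL. If an absolute link is needed later, it can be resolved against sources_url with the JDK's java.net.URI. This is a small sketch under that assumption; DetailsUrlSketch and resolve are hypothetical names and are not part of the crawler above.

```java
import java.net.URI;

final class DetailsUrlSketch {

  /** Resolves a relative details link against the listing page it was found on. */
  static String resolve(String sourcesUrl, String detailsUrl) {
    return URI.create(sourcesUrl).resolve(detailsUrl).toString();
  }

  public static void main(String[] args) {
    String sourcesUrl = "https://www.sis.hawaii.edu/uhdad/avail.classes?i=KAP&t=202510&s=ACC";
    String detailsUrl = "./avail.class?i=KAP&t=202510&c=31175";
    System.out.println(resolve(sourcesUrl, detailsUrl));
  }
}
```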

Contributors: Tong Li