提取网内容
1. 简介
StructuredDataExtractor
是基于 Playwright、Jsoup 和 Flexmark 的一个工具类,用于从已加载的网页中提取结构化数据。它提供两种主要能力:
- 纯文本提取:直接获取页面主体的所有可见文本。
- HTML 转 Markdown:将页面主体 HTML 转换为 Markdown,并可选地在末尾附上页面中所有链接。
此外,配合大模型(如 Google Gemini 或 Volc Engine)使用时,可针对特定抽取目标(extraction goal)对 Markdown 内容做深度分析,并返回符合需求的 JSON 结果。
2. 核心代码
2.1 获取纯文本内容
以下单元测试示例演示了如何使用 Playwright 打开网页,并调用 page.innerText("body")
获取 <body>
节点内的纯文本:
package com.litongjava.ai.browser.dom.service;
import org.junit.Test;
import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.BrowserType.LaunchOptions;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;
public class StructuredDataExtractorTest {
@Test
public void test() {
try (Playwright playwright = Playwright.create()) {
LaunchOptions options = new BrowserType.LaunchOptions().setHeadless(false);
Browser browser = playwright.chromium().launch(options);
BrowserContext context = browser.newContext();
Page page = context.newPage();
page.navigate("https://www.taobao.com/");
page.waitForLoadState();
String text = page.innerText("body");
System.out.println(text);
}
}
}
2.2 转为 Markdown 并可选附加链接列表
在 StructuredDataExtractor
类中,通过 Jsoup 解析 HTML,并使用 Flexmark 将其转换为 Markdown。当 extractLinks
参数为 true
时,仅转换 HTML 部分;否则直接返回纯文本。
package com.litongjava.ai.browser.dom.service;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import com.microsoft.playwright.Page;
import com.vladsch.flexmark.html2md.converter.FlexmarkHtmlConverter;
public class StructuredDataExtractor {
/**
* 把整个页面转成 Markdown 文本返回,并可选地在尾部附上所有链接列表。
*
* @param page Playwright 已经打开并加载完成的 Page
* @param extractLinks 是否在 Markdown 末尾附上链接列表
* @return Markdown 字符串
*/
public static String extractStructuredData(Page page, boolean extractLinks) {
if (extractLinks) {
String html = page.content();
Document doc = Jsoup.parse(html);
String bodyHtml = doc.body().html();
// 3. HTML -> Markdown
FlexmarkHtmlConverter converter = FlexmarkHtmlConverter.builder().build();
String markdown = converter.convert(bodyHtml);
return markdown;
} else {
return page.innerText("body");
}
}
}
对应的单元测试示例:
package com.litongjava.ai.browser.dom.service;
import org.junit.Test;
import com.microsoft.playwright.Browser;
import com.microsoft.playwright.BrowserContext;
import com.microsoft.playwright.BrowserType;
import com.microsoft.playwright.BrowserType.LaunchOptions;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.Playwright;
public class StructuredDataExtractorTest {
@Test
public void test() {
try (Playwright playwright = Playwright.create()) {
LaunchOptions options = new BrowserType.LaunchOptions().setHeadless(false);
Browser browser = playwright.chromium().launch(options);
BrowserContext context = browser.newContext();
Page page = context.newPage();
page.navigate("https://www.taobao.com/");
page.waitForLoadState();
String extractStructuredData = StructuredDataExtractor.extractStructuredData(page, true);
System.out.println(extractStructuredData);
}
}
}
3. 配合大模型进行结构化抽取
在实际业务场景中,我们常常需将转换后的 Markdown 交给大模型,按照指定的“抽取目标”(Extraction goal)提取特定信息并返回 JSON。以下示例展示了如何:
- 读取已有的 Markdown 文件
- 调用封装好的服务(
PageExtractStructuredDataService
) - 使用 Google Gemini 模型完成抽取
package com.litongjava.ai.brower.service;
import java.io.File;
import org.junit.Test;
import com.litongjava.consts.ModelPlatformName;
import com.litongjava.exception.GenerateException;
import com.litongjava.gemini.GoogleModels;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.tio.utils.environment.EnvUtils;
import com.litongjava.tio.utils.hutool.FileUtil;
import com.litongjava.tio.utils.json.JsonUtils;
public class PageExtractStructuredDataServiceTest {
@Test
public void test() {
EnvUtils.load();
String md = FileUtil.readString(new File("data/tabo_index.md"));
String query = "Extraction goal: Extract the details and link of the first product results";
try {
//Input length 188799 exceeds the maximum length 131072
//String data = Aop.get(PageExtractStructuredDataService.class).extract(ModelPlatformName.VOLC_ENGINE, VolcEngineModels.DEEPSEEK_V3_250324, md, query);
String data = Aop.get(PageExtractStructuredDataService.class).extract(
ModelPlatformName.GOOGLE,
GoogleModels.GEMINI_2_5_FLASH,
md,
query
);
System.out.println(data);
} catch (GenerateException e) {
e.printStackTrace();
String json = JsonUtils.toJson(e);
FileUtil.writeString(json, "error.json");
}
}
}
示例返回结果:
{
"product_details": {
"title": "大头围渔夫帽2024新款水洗做旧毛边水桶帽素颜遮脸防晒大号大码",
"price": "¥55",
"link": "https://item.taobao.com/item.htm?id=823244092073&scm=1007.40986.420852.520373&pvid=b565b012-b532-4060-9245-01bf1d541c16&xxc=home_recommend&skuId=5705765651860&mi_id=4WEBTxDpit0C-qAAjDb5oRN4EPXGvuNe_oHNxPZL8TZ9YnbAFVAGyBGi21te8zvg1sp7K07Q2wzkaQLbwhT7NQ&utparam=%7B%22abid%22%3A%22520373%22%2C%22item_ctr%22%3A0%2C%22x_object_type%22%3A%22item%22%2C%22pc_pvid%22%3A%22b565b012-b532-4060-9245-01bf1d541c16%22%2C%22item_cvr%22%3A0%2C%22mix_group%22%3A%22%22%2C%22pc_scene%22%3A%2220001%22%2C%22item_ecpm%22%3A0%2C%22aplus_abtest%22%3A%22011604543f5138e6eb63b7409cef36bd%22%2C%22tpp_buckets%22%3A%2230986%23420852%23module%22%2C%22x_object_id%22%3A823244092073%2C%22ab_info%22%3A%2230986%23420852%23-1%23%22%7D"
}
}
4. 使用说明
依赖项
- Playwright for Java
- Jsoup
- Flexmark HTML-to-Markdown 转换器
- 大模型调用客户端(Google Gemini / Volc Engine SDK)
流程概览
- 使用 Playwright 打开目标网页,等待加载完成。
- 调用
StructuredDataExtractor.extractStructuredData(page, extractLinks)
,获取 Markdown 或纯文本。 - (可选)将 Markdown 内容保存至本地文件。
- 根据抽取目标,调用大模型服务,传入 Markdown 和目标描述,获取结构化 JSON 结果。
注意事项
- 页面内容过大时,需自行拆分或分段调用大模型。
- 对于需要附带链接列表的场景,请将
extractLinks
置为true
。 - 大模型的输入长度及费率视具体平台而定。
通过上述组件和示例,您可以快速将任意网页内容提取为纯文本或 Markdown,并借助大模型完成自定义信息抽取,满足多种业务场景的需求。