名人搜索功能实现
本模块主要用于根据用户输入的关键字(如姓名、机构等),调用搜索引擎 API 获取相关信息(例如 LinkedIn、社交媒体、视频等),并通过大语言模型对数据进行整理、润色与摘要,最终返回格式化的 Markdown 输出。整个流程大致包括以下步骤:
- 提示词示例:定义了对用户输入的提示,指导后续信息整理和输出格式(例如生成个人简介、学术信息、兴趣描述等)。
- LinkedIn 数据抓取:利用
LinkedInService
类,通过缓存优先、后端 API 调用的方式,抓取 LinkedIn 上的用户资料和动态。 - 社交媒体信息扩展:使用
SocialMediaService
类,根据用户的姓名和机构信息从搜索结果中提取社交媒体账号信息,同时缓存结果,减少重复调用。 - 综合搜索与信息汇总:在
CelebrityService
类中,根据用户姓名和机构构建查询内容,依次调用搜索引擎接口进行网页、视频和 LinkedIn 信息搜索。最后,调用 PromptEngine 将所有数据汇总并生成最终的系统提示词,以供大模型生成输出。
注意:在搜索过程中,提示词中使用“linked”时可能会非常容易返回 Joshua S. 的信息,使用时请谨慎调整提示以避免误检。
下面依次展示代码及相应的业务流程介绍。
1. 提示词示例
该示例中定义了如何根据用户输入生成结构化输出的提示词模板,要求输出为 Markdown 格式,包含个人简介、学术信息、兴趣等。
<instruction>
1. Write a brief description of the user's "About" in one paragraph.
2. Gather the user's academic information, including background, studies, or achievements.
3. Write a paragraph summarizing the user's academic details.
4. Write a paragraph showcasing the user's interests.
5. Ensure the output is **formatted in Markdown** for better readability.
6. Do not include any XML tags in the final output.
</instruction>
<output>
省略
</output>
<example>
..省略
</example>
<data>
name:#(nane)
institution:#(institution)
search info:#(info)
linkedin profile:#(profile)
</data>
此模板为后续使用 PromptEngine 渲染模板、生成大模型提示词提供了规范。
2. LinkedInService 类
LinkedInService
类主要负责从 LinkedIn 上抓取用户的个人资料及动态。其主要业务逻辑如下:
- 缓存优先:首先从数据库缓存中查询是否已有相关数据,若存在则直接返回。
- API 调用:如果缓存中没有数据,则调用
ApiFyClient.linkedinProfileScraper
或ApiFyClient.linkedinProfilePostsScraper
接口获取最新信息。 - 数据解析与存储:获取的数据会进行 JSON 解析,并存入 PostgreSQL 数据库中,便于下次直接调用。
以下是完整代码,不做任何删减:
package com.litongjava.llm.service;
import org.postgresql.util.PGobject;
import com.alibaba.fastjson2.JSONArray;
import com.litongjava.apify.ApiFyClient;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.kit.PgObjectUtils;
import com.litongjava.llm.consts.AgentTableNames;
import com.litongjava.model.http.response.ResponseVo;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class LinkedInService {
public String profileScraper(String url) {
PGobject pgObject = Db.queryColumnByField(AgentTableNames.linkedin_profile_cache, "profile_data", "source", url);
if (pgObject != null && pgObject.getValue() != null) {
return pgObject.getValue();
}
ResponseVo responseVo = ApiFyClient.linkedinProfileScraper(url);
if (responseVo.isOk()) {
String profile = responseVo.getBodyString();
if (profile.startsWith("[")) {
try {
profile = FastJson2Utils.parseArray(profile).toJSONString();
Row row = Row.by("id", SnowflakeIdUtils.id()).set("source", url).set("profile_data", PgObjectUtils.json(profile));
Db.save(AgentTableNames.linkedin_profile_cache, row);
} catch (Exception e) {
log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
}
} else {
try {
profile = FastJson2Utils.parseObject(profile).toJSONString();
} catch (Exception e) {
log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
}
}
return profile;
}
return null;
}
public String profilePostsScraper(String url) {
PGobject pgObject = Db.queryColumnByField(AgentTableNames.linkedin_profile_posts_cache, "data", "source", url);
if (pgObject != null && pgObject.getValue() != null) {
return pgObject.getValue();
}
ResponseVo responseVo = ApiFyClient.linkedinProfilePostsScraper(url);
if (responseVo.isOk()) {
String profile = responseVo.getBodyString();
if (profile.startsWith("[")) {
try {
JSONArray parseArray = FastJson2Utils.parseArray(profile);
profile = parseArray.toJSONString();
Row row = Row.by("id", SnowflakeIdUtils.id()).set("source", url).set("data", PgObjectUtils.json(profile));
Db.save(AgentTableNames.linkedin_profile_posts_cache, row);
} catch (Exception e) {
log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
}
} else {
try {
profile = FastJson2Utils.parseObject(profile).toJSONString();
} catch (Exception e) {
log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
}
}
return profile;
}
return null;
}
}
3. SocialMediaService 类
SocialMediaService
类主要负责从搜索结果中提取与用户相关的社交媒体账号信息。其主要流程包括:
- 名称和机构预处理:将姓名转为小写,机构转为大写,并构造搜索关键字。
- 数据库缓存:同样优先查询数据库中是否已存在相关社交媒体数据。
- 调用 PromptEngine:若无缓存,则通过 PromptEngine 根据模板生成查询提示,并调用 OpenAiClient 获取结果。
- 结果解析与存储:对 OpenAI 返回的 JSON 结果进行解析后存入数据库。
以下是完整代码:
package com.litongjava.llm.service;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;
import org.postgresql.util.PGobject;
import com.google.common.util.concurrent.Striped;
import com.jfinal.kit.Kv;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.kit.PgObjectUtils;
import com.litongjava.llm.consts.AgentTableNames;
import com.litongjava.openai.chat.ChatMessage;
import com.litongjava.openai.chat.ChatResponseFormatType;
import com.litongjava.openai.chat.OpenAiChatRequestVo;
import com.litongjava.openai.chat.OpenAiChatResponseVo;
import com.litongjava.openai.client.OpenAiClient;
import com.litongjava.openai.constants.OpenAiModels;
import com.litongjava.template.PromptEngine;
import com.litongjava.tio.utils.environment.EnvUtils;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import com.litongjava.volcengine.VolcEngineConst;
import com.litongjava.volcengine.VolcEngineModels;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class SocialMediaService {
private static final Striped<Lock> stripedLocks = Striped.lock(256);
public String extraSoicalMedia(String name, String institution, String searchInfo) {
String lowerCaseName = name.toLowerCase();
institution = institution.toUpperCase();
String key = name + " at " + institution;
Lock lock = stripedLocks.get(key);
lock.lock();
try {
String sql = "select data from %s where name=? and institution=?";
sql = String.format(sql, AgentTableNames.social_media_accounts);
PGobject pgobject = Db.queryPGobject(sql, lowerCaseName, institution);
String content = null;
if (pgobject != null && pgobject.getValue() != null) {
content = pgobject.getValue();
return content;
}
Kv set = Kv.by("data", searchInfo).set("name", name).set("institution", institution);
String renderToString = PromptEngine.renderToStringFromDb("extra_soical_media_prompt.txt", set);
log.info("prompt:{}", renderToString);
ChatMessage chatMessage = new ChatMessage("user", renderToString);
List<ChatMessage> messages = new ArrayList<>();
messages.add(chatMessage);
OpenAiChatRequestVo chatRequestVo = new OpenAiChatRequestVo();
chatRequestVo.setStream(false);
chatRequestVo.setResponse_format(ChatResponseFormatType.json_object);
chatRequestVo.setChatMessages(messages);
OpenAiChatResponseVo chat = useOpenAi(chatRequestVo);
content = chat.getChoices().get(0).getMessage().getContent();
if (content.startsWith("```json")) {
content = content.substring(7, content.length() - 3);
}
content = FastJson2Utils.parseObject(content).toJSONString();
PGobject json = PgObjectUtils.json(content);
Row row = Row.by("id", SnowflakeIdUtils.id()).set("name", lowerCaseName).set("institution", institution).set("data", json);
Db.save(AgentTableNames.social_media_accounts, row);
return content;
} finally {
lock.unlock();
}
}
private OpenAiChatResponseVo useDeepseek(OpenAiChatRequestVo chatRequestVo) {
chatRequestVo.setModel(VolcEngineModels.DEEPSEEK_V3_241226);
String apiKey = EnvUtils.get("VOLCENGINE_API_KEY");
return OpenAiClient.chatCompletions(VolcEngineConst.BASE_URL, apiKey, chatRequestVo);
}
@SuppressWarnings("unused")
private OpenAiChatResponseVo useOpenAi(OpenAiChatRequestVo chatRequestVo) {
chatRequestVo.setModel(OpenAiModels.GPT_4O_MINI);
OpenAiChatResponseVo chat = OpenAiClient.chatCompletions(chatRequestVo);
return chat;
}
}
4. CelebrityService 类
CelebrityService
类是整个流程的调度中心,主要负责:
- 构造搜索关键字:根据用户输入的姓名和机构生成搜索查询字符串,并多次拼接机构以提高搜索准确率。
- 依次调用搜索引擎接口:
- 网页搜索:利用 SearxngSearchClient 搜索相关网页内容,整理搜索结果,并将各个结果整合为 Markdown 格式的摘要和来源信息。
- 社交媒体账号扩展:调用
SocialMediaService.extraSoicalMedia
方法提取社交媒体账号信息,并将结果通过 SSE 推送给前端。 - 视频搜索:利用 SearxngSearchClient 搜索视频类别的内容,并推送结果。
- LinkedIn 搜索:根据社交媒体数据中提取的 LinkedIn 链接,调用
LinkedInService.profileScraper
和LinkedInService.profilePostsScraper
方法抓取 LinkedIn 个人资料和动态,并推送给前端。
- 汇总与生成提示词:最后将姓名、机构、搜索到的 Markdown 内容以及 LinkedIn 数据传入 PromptEngine 模板,生成最终的系统提示词供大模型使用。
以下为完整代码:
package com.litongjava.llm.service;
import java.util.ArrayList;
import java.util.List;
import com.alibaba.fastjson2.JSONArray;
import com.alibaba.fastjson2.JSONObject;
import com.jfinal.kit.Kv;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.llm.consts.AiChatEventName;
import com.litongjava.model.web.WebPageContent;
import com.litongjava.openai.chat.ChatMessageArgs;
import com.litongjava.searxng.SearxngResult;
import com.litongjava.searxng.SearxngSearchClient;
import com.litongjava.searxng.SearxngSearchParam;
import com.litongjava.searxng.SearxngSearchResponse;
import com.litongjava.template.PromptEngine;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.http.common.sse.SsePacket;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.json.JsonUtils;
import lombok.extern.slf4j.Slf4j;
@Slf4j
public class CelebrityService {
private LinkedInService linkedInService = Aop.get(LinkedInService.class);
private SocialMediaService socialMediaService = Aop.get(SocialMediaService.class);
public String celebrity(ChannelContext channelContext, ChatMessageArgs chatSendArgs) {
String name = chatSendArgs.getName();
String institution = chatSendArgs.getInstitution();
//必须要添加两个institution,添加后搜索更准,但是不知道原理是什么?猜测是搜索引擎提高了权重
String textQuestion = name + " (" + institution + ")" + " at " + institution;
if (channelContext != null) {
Kv by = Kv.by("content", "First let me search google with " + textQuestion + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
SearxngSearchResponse searchResponse = Aop.get(SearchService.class).searchapi(textQuestion);
List<SearxngResult> results = searchResponse.getResults();
//SearxngSearchResponse searchResponse = SearxngSearchClient.search(textQuestion);
//SearxngSearchResponse searchResponse2 = Aop.get(SearchService.class).google("site:linkedin.com/in/ " + name + " at " + institution);
// List<SearxngResult> results2 = searchResponse2.getResults();
// for (SearxngResult searxngResult : results2) {
// results.add(searxngResult);
// }
if (results != null && results.size() < 1) {
Kv by = Kv.by("content", "unfortunate Failed to search.I will try again a 3 seconds. ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
searchResponse = SearxngSearchClient.search(textQuestion);
results = searchResponse.getResults();
if (results != null && results.size() < 1) {
by = Kv.by("content", "unfortunate Failed to search.I will try again a 3 seconds. ");
ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
try {
Thread.sleep(3000);
} catch (InterruptedException e) {
e.printStackTrace();
}
searchResponse = SearxngSearchClient.search(textQuestion);
results = searchResponse.getResults();
}
}
if (results != null && results.size() < 1) {
Kv by = Kv.by("content", "unfortunate Failed to search.Please try again later. ");
SsePacket ssePacket = new SsePacket(AiChatEventName.delta, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
return null;
}
List<WebPageContent> pages = new ArrayList<>();
StringBuffer markdown = new StringBuffer();
StringBuffer sources = new StringBuffer();
for (int i = 0; i < results.size(); i++) {
SearxngResult searxngResult = results.get(i);
String title = searxngResult.getTitle();
String url = searxngResult.getUrl();
pages.add(new WebPageContent(title, url));
markdown.append("source " + (i + 1) + " " + searxngResult.getContent());
String content = searxngResult.getContent();
sources.append("source " + (i + 1) + ":").append(title).append(" ").append("url:").append(url).append(" ")
//
.append("content:").append(content).append("\r\n");
}
if (channelContext != null) {
SsePacket ssePacket = new SsePacket(AiChatEventName.citation, JsonUtils.toSkipNullJson(pages));
Tio.send(channelContext, ssePacket);
}
if (channelContext != null) {
Kv by = Kv.by("content", "Second let me extra social media account with " + textQuestion + ".");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
String soicalMediaAccounts = socialMediaService.extraSoicalMedia(name, institution, sources.toString());
if (channelContext != null) {
SsePacket ssePacket = new SsePacket(AiChatEventName.social_media, soicalMediaAccounts);
Tio.send(channelContext, ssePacket);
}
if (channelContext != null) {
Kv by = Kv.by("content", "Third let me search video with " + textQuestion + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
SearxngSearchParam param = new SearxngSearchParam();
param.setFormat("json").setQ(textQuestion).setCategories("videos");
searchResponse = SearxngSearchClient.search(param);
results = searchResponse.getResults();
pages = new ArrayList<>();
for (int i = 0; i < results.size(); i++) {
SearxngResult searxngResult = results.get(i);
String title = searxngResult.getTitle();
String url = searxngResult.getUrl();
pages.add(new WebPageContent(title, url));
}
if (channelContext != null) {
SsePacket ssePacket = new SsePacket(AiChatEventName.video, JsonUtils.toSkipNullJson(pages));
Tio.send(channelContext, ssePacket);
}
if (channelContext != null) {
Kv by = Kv.by("content", "Forth let me search linkedin with " + name + " " + institution + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
//SearxngSearchResponse person = LinkedinSearch.person(name, institution);
//List<SearxngResult> personResults = person.getResults();
//if (personResults != null && personResults.size() > 0) {
//String url = personResults.get(0).getUrl();
String profile = null;
String url = null;
try {
JSONArray social_media = FastJson2Utils.parseObject(soicalMediaAccounts).getJSONArray("social_media");
for (int i = 0; i < social_media.size(); i++) {
JSONObject jsonObject = social_media.getJSONObject(i);
if ("LinkedIn".equals(jsonObject.getString("platform"))) {
url = jsonObject.getString("url");
break;
}
}
} catch (Exception e) {
Kv by = Kv.by("content", "unfortunate Failed to find linkedin url. ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
log.error(e.getMessage(), e);
}
if (StrUtil.isNotBlank(url)) {
if (channelContext != null) {
Kv by = Kv.by("content", "Fith let me read linkedin profile " + url + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
if(url.startsWith("https://www.linkedin.com/in/")) {
try {
profile = linkedInService.profileScraper(url);
if (profile != null) {
SsePacket ssePacket = new SsePacket(AiChatEventName.linkedin, profile);
Tio.send(channelContext, ssePacket);
}
} catch (Exception e) {
log.error(e.getMessage(), e);
Kv by = Kv.by("content", "unfortunate Failed to read linkedin profile " + url + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
}
if (channelContext != null) {
Kv by = Kv.by("content", "Sixth let me read linkedin posts " + url + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
try {
String profilePosts = linkedInService.profilePostsScraper(url);
if (profilePosts != null) {
SsePacket ssePacket = new SsePacket(AiChatEventName.linkedin_posts, profilePosts);
Tio.send(channelContext, ssePacket);
}
} catch (Exception e) {
log.error(e.getMessage(), e);
Kv by = Kv.by("content", "unfortunate Failed to read linkedin profile posts " + url + ". ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
}
if (channelContext != null) {
Kv by = Kv.by("content", "Then let me summary all information and generate user information. ");
SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
Tio.send(channelContext, ssePacket);
}
// 3. 使用 PromptEngine 模版引擎填充提示词
Kv kv = Kv.by("name", name).set("institution", institution).set("info", markdown).set("profile", profile);
String systemPrompt = PromptEngine.renderToStringFromDb("celebrity_prompt.txt", kv);
return systemPrompt;
}
}
5. 业务流程总结
输入构造
根据用户提供的姓名和机构构造查询语句(例如name (institution) at institution
),提高搜索引擎对关键词的权重。初步网页搜索
使用 SearxngSearchClient 搜索网页内容,整合各个搜索结果,并生成 Markdown 格式的源数据摘要。社交媒体信息提取
通过调用SocialMediaService.extraSoicalMedia
方法,将搜索到的网页内容传入模板,获取用户的社交媒体账号信息,并通过 SSE 推送给前端。视频搜索
利用 SearxngSearchClient 搜索视频类别内容,并推送视频结果。LinkedIn 信息抓取
从社交媒体数据中提取 LinkedIn 链接后,调用LinkedInService
的相关方法分别抓取个人资料和动态。抓取过程中优先检查缓存,再调用 API。信息汇总与提示生成
将所有抓取到的信息(网页内容、社交媒体账号、LinkedIn 数据)汇总,通过 PromptEngine 填充模板,生成最终的系统提示词,供大语言模型生成详细的用户描述信息。
6. 注意事项
- 数据缓存:各模块均采用数据库缓存机制,避免重复请求同一数据,提高系统响应速度。
- 搜索结果不稳定:若搜索结果较少,系统会自动重试,确保尽可能获得有效信息。
- LinkedIn 特殊问题:已知在提示词中使用“linked”时,系统很容易返回 Joshua S. 的信息,因此在构造提示词时需要注意避免这种情况或进行额外过滤。