Celebrity Search Feature Implementation
This module takes a user-supplied keyword (such as a name or institution), calls search engine APIs to gather related information (LinkedIn, social media, videos, and so on), then uses a large language model to organize, polish, and summarize the data, finally returning formatted Markdown output. The overall flow consists of the following steps:
- Prompt template example: defines the prompt applied to the user input, guiding how the information is organized and formatted (for example, generating an "About" paragraph, academic details, and an interests description).
- LinkedIn data scraping: the LinkedInService class fetches LinkedIn profiles and posts, checking the cache first and falling back to the backend API.
- Social media extraction: the SocialMediaService class extracts social media account information from the search results based on the user's name and institution, caching results to avoid repeated calls.
- Combined search and aggregation: the CelebrityService class builds the query from the name and institution, then calls the search engine interfaces for web pages, videos, and LinkedIn information in turn. Finally, PromptEngine aggregates all of the data into the final system prompt for the large model.
Note: during search, including the word "linked" in the prompt very easily causes the results to return information about Joshua S.; adjust the prompt carefully to avoid this false match.
The sections below walk through the code together with the corresponding business logic.
1. Prompt Template Example
This template defines how the user input is turned into structured output. It requires the output to be Markdown containing an "About" paragraph, academic information, and interests.
<instruction>
1. Write a brief description of the user's "About" in one paragraph.
2. Gather the user's academic information, including background, studies, or achievements.
3. Write a paragraph summarizing the user's academic details.
4. Write a paragraph showcasing the user's interests.
5. Ensure the output is **formatted in Markdown** for better readability.
6. Do not include any XML tags in the final output.
</instruction>
<output>
omitted
</output>
<example>
..omitted
</example>
<data>
name:#(name)
institution:#(institution)
search info:#(info)
linkedin profile:#(profile)
</data>
This template serves as the contract that PromptEngine later renders to produce the prompt for the large model.
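The `#(name)`-style placeholders in the `<data>` section are filled in when the template is rendered. As a rough illustration only (this is not the actual PromptEngine implementation), the substitution behaves like this minimal regex-based stand-in:

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Minimal stand-in for #(key) placeholder substitution, for illustration only.
public class TemplateSketch {
  private static final Pattern PLACEHOLDER = Pattern.compile("#\\((\\w+)\\)");

  public static String render(String template, Map<String, String> vars) {
    Matcher m = PLACEHOLDER.matcher(template);
    StringBuilder out = new StringBuilder();
    while (m.find()) {
      // Unknown keys render as an empty string in this sketch.
      String value = vars.getOrDefault(m.group(1), "");
      m.appendReplacement(out, Matcher.quoteReplacement(value));
    }
    m.appendTail(out);
    return out.toString();
  }

  public static void main(String[] args) {
    String data = "name:#(name)\ninstitution:#(institution)";
    Map<String, String> vars = Map.of("name", "Ada Lovelace", "institution", "UH");
    System.out.println(render(data, vars)); // prints name:Ada Lovelace / institution:UH
  }
}
```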
2. LinkedInService Class
The LinkedInService class is responsible for scraping a user's LinkedIn profile and posts. Its main logic:
- Cache first: query the database cache; if the data already exists, return it directly.
- API call: on a cache miss, call the ApiFyClient.linkedinProfileScraper or ApiFyClient.linkedinProfilePostsScraper interface to fetch fresh data.
- Parsing and storage: the response is parsed as JSON and stored in PostgreSQL so the next request can be served from the cache.
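The cache-first flow above can be sketched in isolation. In this sketch a ConcurrentHashMap stands in for the PostgreSQL cache table and an injected fetcher function stands in for the ApiFyClient call; it only illustrates the control flow, not the real service:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Cache-first lookup: return the cached value when present,
// otherwise fetch from the backend and store the result.
public class CacheFirstSketch {
  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> fetcher; // stands in for the backend API

  public CacheFirstSketch(Function<String, String> fetcher) {
    this.fetcher = fetcher;
  }

  public String get(String url) {
    String cached = cache.get(url);
    if (cached != null) {
      return cached; // cache hit: no backend call
    }
    String fresh = fetcher.apply(url); // cache miss: call the backend
    if (fresh != null) {
      cache.put(url, fresh); // store so the next request is served locally
    }
    return fresh;
  }

  public static void main(String[] args) {
    CacheFirstSketch cache = new CacheFirstSketch(url -> "profile-of:" + url);
    System.out.println(cache.get("https://www.linkedin.com/in/example")); // fetched
    System.out.println(cache.get("https://www.linkedin.com/in/example")); // from cache
  }
}
```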
The complete code follows, with nothing omitted:
package com.litongjava.llm.service;

import org.postgresql.util.PGobject;

import com.alibaba.fastjson2.JSONArray;
import com.litongjava.apify.ApiFyClient;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.kit.PgObjectUtils;
import com.litongjava.llm.consts.AgentTableNames;
import com.litongjava.model.http.response.ResponseVo;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class LinkedInService {

  public String profileScraper(String url) {
    // Cache first: return the stored profile when this URL was scraped before.
    PGobject pgObject = Db.queryColumnByField(AgentTableNames.linkedin_profile_cache, "profile_data", "source", url);
    if (pgObject != null && pgObject.getValue() != null) {
      return pgObject.getValue();
    }
    // Cache miss: call the backend scraper API.
    ResponseVo responseVo = ApiFyClient.linkedinProfileScraper(url);
    if (responseVo.isOk()) {
      String profile = responseVo.getBodyString();
      if (profile.startsWith("[")) {
        // Array responses are normalized and stored so the next call hits the cache.
        try {
          profile = FastJson2Utils.parseArray(profile).toJSONString();
          Row row = Row.by("id", SnowflakeIdUtils.id()).set("source", url).set("profile_data", PgObjectUtils.json(profile));
          Db.save(AgentTableNames.linkedin_profile_cache, row);
        } catch (Exception e) {
          log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
        }
      } else {
        // Object responses are normalized but not cached.
        try {
          profile = FastJson2Utils.parseObject(profile).toJSONString();
        } catch (Exception e) {
          log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
        }
      }
      return profile;
    }
    return null;
  }

  public String profilePostsScraper(String url) {
    // Cache first: return the stored posts when this URL was scraped before.
    PGobject pgObject = Db.queryColumnByField(AgentTableNames.linkedin_profile_posts_cache, "data", "source", url);
    if (pgObject != null && pgObject.getValue() != null) {
      return pgObject.getValue();
    }
    // Cache miss: call the backend scraper API.
    ResponseVo responseVo = ApiFyClient.linkedinProfilePostsScraper(url);
    if (responseVo.isOk()) {
      String profile = responseVo.getBodyString();
      if (profile.startsWith("[")) {
        try {
          JSONArray parseArray = FastJson2Utils.parseArray(profile);
          profile = parseArray.toJSONString();
          Row row = Row.by("id", SnowflakeIdUtils.id()).set("source", url).set("data", PgObjectUtils.json(profile));
          Db.save(AgentTableNames.linkedin_profile_posts_cache, row);
        } catch (Exception e) {
          log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
        }
      } else {
        try {
          profile = FastJson2Utils.parseObject(profile).toJSONString();
        } catch (Exception e) {
          log.error("Failed to parse:{},{}", profile, e.getMessage(), e);
        }
      }
      return profile;
    }
    return null;
  }
}
3. SocialMediaService Class
The SocialMediaService class extracts the user's social media accounts from the search results. The main flow:
- Name and institution preprocessing: lower-case the name, upper-case the institution, and build the lookup key.
- Database cache: again, check first whether the social media data is already stored.
- PromptEngine call: on a cache miss, render the query prompt from a template with PromptEngine and send it to OpenAiClient.
- Parsing and storage: parse the JSON returned by OpenAI and save it to the database.
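The code below also serializes concurrent lookups for the same person with Guava's `Striped.lock(256)`. For readers unfamiliar with it, a rough stdlib-only equivalent (an illustration, not what the service uses) maps each key onto a fixed pool of reentrant locks:

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;

// Lock striping: requests for the same key always serialize on the same
// lock, while unrelated keys rarely contend because they usually hash
// to different stripes.
public class StripedLockSketch {
  private final Lock[] locks;

  public StripedLockSketch(int stripes) {
    locks = new Lock[stripes];
    for (int i = 0; i < stripes; i++) {
      locks[i] = new ReentrantLock();
    }
  }

  public Lock get(Object key) {
    // Spread the hash bits, then index into the fixed pool.
    int h = key.hashCode();
    h ^= (h >>> 16);
    return locks[Math.floorMod(h, locks.length)];
  }

  public static void main(String[] args) {
    StripedLockSketch striped = new StripedLockSketch(256);
    Lock lock = striped.get("john doe at UH");
    lock.lock();
    try {
      System.out.println("holding the stripe for this key");
    } finally {
      lock.unlock();
    }
  }
}
```

The same key always returns the same lock instance, which is what makes the lock-then-check-cache pattern in the service safe against duplicate API calls.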
The complete code:
package com.litongjava.llm.service;

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.Lock;

import org.postgresql.util.PGobject;

import com.google.common.util.concurrent.Striped;
import com.jfinal.kit.Kv;
import com.litongjava.db.activerecord.Db;
import com.litongjava.db.activerecord.Row;
import com.litongjava.kit.PgObjectUtils;
import com.litongjava.llm.consts.AgentTableNames;
import com.litongjava.openai.chat.ChatMessage;
import com.litongjava.openai.chat.ChatResponseFormatType;
import com.litongjava.openai.chat.OpenAiChatRequestVo;
import com.litongjava.openai.chat.OpenAiChatResponseVo;
import com.litongjava.openai.client.OpenAiClient;
import com.litongjava.openai.consts.OpenAiModels;
import com.litongjava.template.PromptEngine;
import com.litongjava.tio.utils.environment.EnvUtils;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.snowflake.SnowflakeIdUtils;
import com.litongjava.volcengine.VolcEngineConst;
import com.litongjava.volcengine.VolcEngineModels;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class SocialMediaService {

  // 256 lock stripes: concurrent requests for the same key serialize, unrelated keys rarely contend.
  private static final Striped<Lock> stripedLocks = Striped.lock(256);

  public String extraSoicalMedia(String name, String institution, String searchInfo) {
    String lowerCaseName = name.toLowerCase();
    institution = institution.toUpperCase();
    String key = name + " at " + institution;
    Lock lock = stripedLocks.get(key);
    lock.lock();
    try {
      // Cache first: return stored social media data when available.
      String sql = "select data from %s where name=? and institution=?";
      sql = String.format(sql, AgentTableNames.social_media_accounts);
      PGobject pgobject = Db.queryPGobject(sql, lowerCaseName, institution);
      String content = null;
      if (pgobject != null && pgobject.getValue() != null) {
        content = pgobject.getValue();
        return content;
      }

      // Cache miss: render the extraction prompt and ask the model for a JSON answer.
      Kv set = Kv.by("data", searchInfo).set("name", name).set("institution", institution);
      String renderToString = PromptEngine.renderToStringFromDb("extra_soical_media_prompt.txt", set);
      log.info("prompt:{}", renderToString);

      ChatMessage chatMessage = new ChatMessage("user", renderToString);
      List<ChatMessage> messages = new ArrayList<>();
      messages.add(chatMessage);

      OpenAiChatRequestVo chatRequestVo = new OpenAiChatRequestVo();
      chatRequestVo.setStream(false);
      chatRequestVo.setResponse_format(ChatResponseFormatType.json_object);
      chatRequestVo.setChatMessages(messages);

      OpenAiChatResponseVo chat = useOpenAi(chatRequestVo);
      content = chat.getChoices().get(0).getMessage().getContent();
      // Strip an optional ```json ... ``` fence before parsing.
      if (content.startsWith("```json")) {
        content = content.substring(7, content.length() - 3);
      }
      content = FastJson2Utils.parseObject(content).toJSONString();

      // Store the normalized JSON so the next request hits the cache.
      PGobject json = PgObjectUtils.json(content);
      Row row = Row.by("id", SnowflakeIdUtils.id()).set("name", lowerCaseName).set("institution", institution).set("data", json);
      Db.save(AgentTableNames.social_media_accounts, row);
      return content;
    } finally {
      lock.unlock();
    }
  }

  @SuppressWarnings("unused")
  private OpenAiChatResponseVo useDeepseek(OpenAiChatRequestVo chatRequestVo) {
    chatRequestVo.setModel(VolcEngineModels.DEEPSEEK_V3_241226);
    String apiKey = EnvUtils.get("VOLCENGINE_API_KEY");
    return OpenAiClient.chatCompletions(VolcEngineConst.BASE_URL, apiKey, chatRequestVo);
  }

  private OpenAiChatResponseVo useOpenAi(OpenAiChatRequestVo chatRequestVo) {
    chatRequestVo.setModel(OpenAiModels.GPT_4O_MINI);
    OpenAiChatResponseVo chat = OpenAiClient.chatCompletions(chatRequestVo);
    return chat;
  }
}
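The `content.substring(7, content.length() - 3)` call above assumes the model reply always starts with `` ```json `` and ends with `` ``` ``. A slightly more defensive variant (an illustrative helper, not part of the service) tolerates replies with or without a fence:

```java
// Strip an optional Markdown code fence (for example ```json ... ```)
// from a model reply, returning the reply unchanged when no fence is present.
public class FenceStripSketch {
  public static String stripJsonFence(String content) {
    String s = content.trim();
    if (s.startsWith("```")) {
      int firstNewline = s.indexOf('\n');       // end of the opening fence line
      int lastFence = s.lastIndexOf("```");     // start of the closing fence
      if (firstNewline >= 0 && lastFence > firstNewline) {
        s = s.substring(firstNewline + 1, lastFence).trim();
      }
    }
    return s;
  }

  public static void main(String[] args) {
    System.out.println(stripJsonFence("```json\n{\"a\":1}\n```")); // prints {"a":1}
  }
}
```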
4. CelebrityService Class
The CelebrityService class orchestrates the whole flow. It is responsible for:
- Building the search query: generate the query string from the user's name and institution, repeating the institution to improve search accuracy.
- Calling the search engine interfaces in turn:
  - Web search: use SearxngSearchClient to search web pages, then combine the results into a Markdown summary with source information.
  - Social media extraction: call the SocialMediaService.extraSoicalMedia method to extract account information and push the result to the frontend over SSE.
  - Video search: use SearxngSearchClient with the videos category and push the results.
  - LinkedIn scraping: take the LinkedIn link found in the social media data, call the LinkedInService.profileScraper and LinkedInService.profilePostsScraper methods to fetch the profile and posts, and push them to the frontend.
- Aggregation and prompt generation: finally, pass the name, institution, Markdown search content, and LinkedIn data into the PromptEngine template to produce the system prompt for the large model.
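The web-search step below retries when no results come back, with the retry block written out twice inline. The same idea as a small generic helper (a sketch using a hypothetical `Supplier` parameter, not code from the service):

```java
import java.util.List;
import java.util.function.Supplier;

// Generic "retry while the result list is empty" helper, sketching the
// duplicated retry blocks inside CelebrityService.celebrity().
public class RetrySketch {
  public static <T> List<T> retryIfEmpty(Supplier<List<T>> search, int attempts, long delayMillis) {
    List<T> results = search.get();
    for (int i = 1; i < attempts && (results == null || results.isEmpty()); i++) {
      try {
        Thread.sleep(delayMillis); // wait before retrying, as the service does
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        break;
      }
      results = search.get();
    }
    return results;
  }

  public static void main(String[] args) {
    List<String> r = retryIfEmpty(() -> List.of("result"), 3, 100L);
    System.out.println(r);
  }
}
```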
The complete code:
package com.litongjava.llm.service;

import java.util.ArrayList;
import java.util.List;

import com.alibaba.fastjson2.JSONArray;
import com.alibaba.fastjson2.JSONObject;
import com.jfinal.kit.Kv;
import com.litongjava.jfinal.aop.Aop;
import com.litongjava.llm.consts.AiChatEventName;
import com.litongjava.model.web.WebPageContent;
import com.litongjava.openai.chat.ChatMessageArgs;
import com.litongjava.searxng.SearxngResult;
import com.litongjava.searxng.SearxngSearchClient;
import com.litongjava.searxng.SearxngSearchParam;
import com.litongjava.searxng.SearxngSearchResponse;
import com.litongjava.template.PromptEngine;
import com.litongjava.tio.core.ChannelContext;
import com.litongjava.tio.core.Tio;
import com.litongjava.tio.http.common.sse.SsePacket;
import com.litongjava.tio.utils.hutool.StrUtil;
import com.litongjava.tio.utils.json.FastJson2Utils;
import com.litongjava.tio.utils.json.JsonUtils;

import lombok.extern.slf4j.Slf4j;

@Slf4j
public class CelebrityService {

  private LinkedInService linkedInService = Aop.get(LinkedInService.class);
  private SocialMediaService socialMediaService = Aop.get(SocialMediaService.class);

  public String celebrity(ChannelContext channelContext, ChatMessageArgs chatSendArgs) {
    String name = chatSendArgs.getName();
    String institution = chatSendArgs.getInstitution();
    // The institution must appear twice; results are noticeably more accurate this way.
    // The exact reason is unknown; presumably the search engine raises the weight of the repeated term.
    String textQuestion = name + " (" + institution + ")" + " at " + institution;

    if (channelContext != null) {
      Kv by = Kv.by("content", "First, let me search Google with " + textQuestion + ". ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
    }

    // 1. Web search
    SearxngSearchResponse searchResponse = Aop.get(SearchService.class).searchapi(textQuestion);
    List<SearxngResult> results = searchResponse.getResults();
    //SearxngSearchResponse searchResponse = SearxngSearchClient.search(textQuestion);
    //SearxngSearchResponse searchResponse2 = Aop.get(SearchService.class).google("site:linkedin.com/in/ " + name + " at " + institution);
    //List<SearxngResult> results2 = searchResponse2.getResults();
    //for (SearxngResult searxngResult : results2) {
    //  results.add(searxngResult);
    //}

    // Retry up to two more times when the search comes back empty.
    if (results == null || results.isEmpty()) {
      Kv by = Kv.by("content", "Unfortunately the search failed. I will try again in 3 seconds. ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
      try {
        Thread.sleep(3000);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      searchResponse = SearxngSearchClient.search(textQuestion);
      results = searchResponse.getResults();
      if (results == null || results.isEmpty()) {
        by = Kv.by("content", "Unfortunately the search failed again. I will try again in 3 seconds. ");
        ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
        Tio.send(channelContext, ssePacket);
        try {
          Thread.sleep(3000);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
        searchResponse = SearxngSearchClient.search(textQuestion);
        results = searchResponse.getResults();
      }
    }
    if (results == null || results.isEmpty()) {
      Kv by = Kv.by("content", "Unfortunately the search failed. Please try again later. ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.delta, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
      return null;
    }

    // Merge the results into a Markdown summary plus a source list.
    List<WebPageContent> pages = new ArrayList<>();
    StringBuffer markdown = new StringBuffer();
    StringBuffer sources = new StringBuffer();
    for (int i = 0; i < results.size(); i++) {
      SearxngResult searxngResult = results.get(i);
      String title = searxngResult.getTitle();
      String url = searxngResult.getUrl();
      pages.add(new WebPageContent(title, url));
      markdown.append("source " + (i + 1) + " " + searxngResult.getContent()).append("\r\n");
      String content = searxngResult.getContent();
      sources.append("source " + (i + 1) + ":").append(title).append(" ").append("url:").append(url).append(" ")
          //
          .append("content:").append(content).append("\r\n");
    }
    if (channelContext != null) {
      SsePacket ssePacket = new SsePacket(AiChatEventName.citation, JsonUtils.toSkipNullJson(pages));
      Tio.send(channelContext, ssePacket);
    }

    // 2. Social media extraction
    if (channelContext != null) {
      Kv by = Kv.by("content", "Second, let me extract social media accounts with " + textQuestion + ". ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
    }
    String soicalMediaAccounts = socialMediaService.extraSoicalMedia(name, institution, sources.toString());
    if (channelContext != null) {
      SsePacket ssePacket = new SsePacket(AiChatEventName.social_media, soicalMediaAccounts);
      Tio.send(channelContext, ssePacket);
    }

    // 3. Video search
    if (channelContext != null) {
      Kv by = Kv.by("content", "Third, let me search videos with " + textQuestion + ". ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
    }
    SearxngSearchParam param = new SearxngSearchParam();
    param.setFormat("json").setQ(textQuestion).setCategories("videos");
    searchResponse = SearxngSearchClient.search(param);
    results = searchResponse.getResults();
    pages = new ArrayList<>();
    for (int i = 0; i < results.size(); i++) {
      SearxngResult searxngResult = results.get(i);
      String title = searxngResult.getTitle();
      String url = searxngResult.getUrl();
      pages.add(new WebPageContent(title, url));
    }
    if (channelContext != null) {
      SsePacket ssePacket = new SsePacket(AiChatEventName.video, JsonUtils.toSkipNullJson(pages));
      Tio.send(channelContext, ssePacket);
    }

    // 4. LinkedIn profile and posts
    if (channelContext != null) {
      Kv by = Kv.by("content", "Fourth, let me search LinkedIn with " + name + " " + institution + ". ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
    }
    //SearxngSearchResponse person = LinkedinSearch.person(name, institution);
    //List<SearxngResult> personResults = person.getResults();
    //if (personResults != null && personResults.size() > 0) {
    //  String url = personResults.get(0).getUrl();
    String profile = null;
    String url = null;
    try {
      // Find the LinkedIn link inside the extracted social media accounts.
      JSONArray social_media = FastJson2Utils.parseObject(soicalMediaAccounts).getJSONArray("social_media");
      for (int i = 0; i < social_media.size(); i++) {
        JSONObject jsonObject = social_media.getJSONObject(i);
        if ("LinkedIn".equals(jsonObject.getString("platform"))) {
          url = jsonObject.getString("url");
          break;
        }
      }
    } catch (Exception e) {
      Kv by = Kv.by("content", "Unfortunately I failed to find a LinkedIn url. ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
      log.error(e.getMessage(), e);
    }

    if (StrUtil.isNotBlank(url)) {
      if (channelContext != null) {
        Kv by = Kv.by("content", "Fifth, let me read the LinkedIn profile " + url + ". ");
        SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
        Tio.send(channelContext, ssePacket);
      }
      if (url.startsWith("https://www.linkedin.com/in/")) {
        try {
          profile = linkedInService.profileScraper(url);
          if (profile != null) {
            SsePacket ssePacket = new SsePacket(AiChatEventName.linkedin, profile);
            Tio.send(channelContext, ssePacket);
          }
        } catch (Exception e) {
          log.error(e.getMessage(), e);
          Kv by = Kv.by("content", "Unfortunately I failed to read the LinkedIn profile " + url + ". ");
          SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
          Tio.send(channelContext, ssePacket);
        }
      }
      if (channelContext != null) {
        Kv by = Kv.by("content", "Sixth, let me read the LinkedIn posts " + url + ". ");
        SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
        Tio.send(channelContext, ssePacket);
      }
      try {
        String profilePosts = linkedInService.profilePostsScraper(url);
        if (profilePosts != null) {
          SsePacket ssePacket = new SsePacket(AiChatEventName.linkedin_posts, profilePosts);
          Tio.send(channelContext, ssePacket);
        }
      } catch (Exception e) {
        log.error(e.getMessage(), e);
        Kv by = Kv.by("content", "Unfortunately I failed to read the LinkedIn profile posts " + url + ". ");
        SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
        Tio.send(channelContext, ssePacket);
      }
    }

    if (channelContext != null) {
      Kv by = Kv.by("content", "Finally, let me summarize all information and generate the user description. ");
      SsePacket ssePacket = new SsePacket(AiChatEventName.reasoning, JsonUtils.toJson(by));
      Tio.send(channelContext, ssePacket);
    }

    // 5. Fill the prompt template with PromptEngine to build the final system prompt.
    Kv kv = Kv.by("name", name).set("institution", institution).set("info", markdown).set("profile", profile);
    String systemPrompt = PromptEngine.renderToStringFromDb("celebrity_prompt.txt", kv);
    return systemPrompt;
  }
}
5. Business Flow Summary
- Input construction: build the query from the user's name and institution (for example name (institution) at institution) to raise the weight of the key terms in the search engine.
- Initial web search: use SearxngSearchClient to search web pages and merge the results into a Markdown source summary.
- Social media extraction: pass the web results into the template via the SocialMediaService.extraSoicalMedia method, obtain the user's social media accounts, and push them to the frontend over SSE.
- Video search: search the videos category with SearxngSearchClient and push the video results.
- LinkedIn scraping: after extracting the LinkedIn link from the social media data, call the LinkedInService methods to fetch the profile and posts, checking the cache before calling the API.
- Aggregation and prompt generation: combine everything gathered (web content, social media accounts, LinkedIn data), fill the PromptEngine template, and produce the final system prompt for the large language model to generate a detailed user description.
6. Notes
- Data caching: each module caches results in the database, avoiding repeated requests for the same data and improving response time.
- Unstable search results: when a search returns too few results, the system retries automatically to obtain usable information.
- LinkedIn quirk: using the word "linked" in a prompt is known to very easily return information about Joshua S., so build prompts to avoid this case or add extra filtering.
