solidstate-tools/mcp/SearchPaperByEmbedding/README.md
koko c0b2ec5983 sofvBV_mcp refactor v2
2025-10-22 23:59:23 +08:00


Paper Semantic Search

Find similar papers using semantic search. Supports both local models (free) and the OpenAI API (better quality).

Features

  • Fetch papers from OpenReview (e.g., ICLR 2026 submissions)
  • Semantic search with example papers or text queries
  • Embedding caching
  • Embedding models: open-source (e.g., all-MiniLM-L6-v2) or OpenAI

Quick Start

pip install -r requirements.txt

1. Prepare Papers

from crawl import crawl_papers

crawl_papers(
    venue_id="ICLR.cc/2026/Conference/Submission",
    output_file="iclr2026_papers.json"
)

2. Search Papers

from search import PaperSearcher

# Local model (free)
searcher = PaperSearcher('iclr2026_papers.json', model_type='local')

# OpenAI model (better, requires API key)
# export OPENAI_API_KEY='your-key'
# searcher = PaperSearcher('iclr2026_papers.json', model_type='openai')

searcher.compute_embeddings()

# Search with example papers that you are interested in
examples = [
    {
        "title": "Your paper title",
        "abstract": "Your paper abstract..."
    }
]

results = searcher.search(examples=examples, top_k=100)

# Or search with text query
results = searcher.search(query="interesting topics", top_k=100)

searcher.display(results, n=10)
searcher.save(results, 'results.json')

How It Works

  1. Paper titles and abstracts are converted to embeddings
  2. Embeddings are cached automatically
  3. Your query is embedded using the same model
  4. Cosine similarity finds the most similar papers
  5. Results are ranked by similarity score
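
The ranking in steps 3 to 5 can be sketched with NumPy. This is a minimal illustration and not the code in search.py: the function name `rank_by_cosine` is hypothetical, and the 2-D vectors are toy stand-ins for real model embeddings.

```python
import numpy as np

def rank_by_cosine(paper_embeddings, query_embedding, top_k=5):
    """Return (index, score) pairs for the top_k papers closest to the query."""
    # Normalize to unit length so a dot product equals cosine similarity.
    papers = paper_embeddings / np.linalg.norm(paper_embeddings, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = papers @ query
    # Sort by descending score and keep the first top_k indices.
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings (assumption: real ones come from the chosen embedding model).
papers = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.0])
ranked = rank_by_cosine(papers, query, top_k=2)
```

Here paper 0 points in the same direction as the query and ranks first, followed by paper 2, which is at a 45 degree angle to it.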

Cache

Embeddings are cached as cache_<filename>_<hash>_<model>.npy. Delete the cache file to force recomputation.
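
As a rough sketch of how such a cache name could be built: the helper `cache_path` below is hypothetical, and both the choice of hashing the file contents with MD5 and the 8-character digest length are assumptions. The README only specifies the overall cache_<filename>_<hash>_<model>.npy pattern.

```python
import hashlib
from pathlib import Path

def cache_path(papers_file, model_name, content):
    """Build a cache file name following cache_<filename>_<hash>_<model>.npy."""
    stem = Path(papers_file).stem
    # Assumption: the hash covers the paper data, so changed data gets a new cache.
    digest = hashlib.md5(content).hexdigest()[:8]
    # Model names may contain slashes, which are not valid in file names.
    safe_model = model_name.replace("/", "_")
    return Path(f"cache_{stem}_{digest}_{safe_model}.npy")

path = cache_path("iclr2026_papers.json", "all-MiniLM-L6-v2", b'[{"title": "A"}]')
other = cache_path("iclr2026_papers.json", "all-MiniLM-L6-v2", b'[{"title": "B"}]')
```

With a scheme like this, switching models or changing the paper file yields a different cache name, so stale embeddings are never loaded by mistake.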

Example Output

================================================================================
Top 100 Results (showing 10)
================================================================================

1. [0.8456] Paper a
   #12345 | foundation or frontier models, including LLMs
   https://openreview.net/forum?id=xxx

2. [0.8234] Paper b
   #12346 | applications to robotics, autonomy, planning
   https://openreview.net/forum?id=yyy

Tips

  • Use 1-5 example papers for best results, or a paragraph describing the topic you are interested in
  • The local model is good enough for most cases
  • Use the OpenAI model for critical searches (~$1 for 18k queries)

If it's useful, please consider giving a star~