-
Notifications
You must be signed in to change notification settings - Fork 282
[N-3]06-DocumentLoader/14-BiorxivLoader #708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
brian604
wants to merge
2
commits into
LangChain-OpenTutorial:main
Choose a base branch
from
brian604:ADD
base: main
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
Changes from all commits
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,388 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "635d8ebb", | ||
"metadata": {}, | ||
"source": [ | ||
"# Biorxiv Loader\n", | ||
"\n", | ||
"- Author: [frimer](https://github.com/brian604)\n", | ||
"- Design:\n", | ||
"- Peer Review: \n", | ||
"- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", | ||
"\n", | ||
"[](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb) [](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/06-DocumentLoader/14-medrxivLoader.ipynb)\n", | ||
"\n", | ||
"## Overview\n", | ||
"\n", | ||
"This tutorial will introduce you to another archives of health-related and biological-related contents: **medRxiv** and **bioRxiv**, both of which\n", | ||
"are operated by the Cold Spring Harbor Laboratory\n", | ||
"\n", | ||
"### Table of Contents\n", | ||
"\n", | ||
"- [Overview](#overview)\n", | ||
"- [Environment Setup](#environment-setup)\n", | ||
"- [Example Queries](#example-queries)\n", | ||
"\n", | ||
"### References\n", | ||
"\n", | ||
"- [medrxivr](https://github.com/ropensci/medrxivr)\n", | ||
" - Access and search medRxiv and bioRxiv\n", | ||
"- [Arxiv Langchain](https://python.langchain.com/docs/integrations/providers/arxiv/)\n", | ||
"- [medrxiv-langchain](https://github.com/brian604/medrxiv-langchain)\n", | ||
"----" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "c6c7aba4", | ||
"metadata": {}, | ||
"source": [ | ||
"## Environment Setup\n", | ||
"\n", | ||
"Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", | ||
"\n", | ||
"**[Note]**\n", | ||
"- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", | ||
"- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"id": "21943adb", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture --no-stderr\n", | ||
"%pip install --upgrade langchain-opentutorial" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"id": "f25ec196", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m A new release of pip is available: \u001b[0m\u001b[31;49m24.3.1\u001b[0m\u001b[39;49m -> \u001b[0m\u001b[32;49m25.0.1\u001b[0m\n", | ||
"\u001b[1m[\u001b[0m\u001b[34;49mnotice\u001b[0m\u001b[1;39;49m]\u001b[0m\u001b[39;49m To update, run: \u001b[0m\u001b[32;49mpip install --upgrade pip\u001b[0m\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Install required packages\n", | ||
"from langchain_opentutorial import package\n", | ||
"\n", | ||
"package.install(\n", | ||
" [\n", | ||
" \"langsmith\",\n", | ||
" \"langchain\",\n", | ||
" \"langchain_core\",\n", | ||
" \"langchain-anthropic\",\n", | ||
" \"langchain_community\",\n", | ||
" \"langchain_text_splitters\",\n", | ||
" \"langchain_openai\",\n", | ||
" ],\n", | ||
" verbose=False,\n", | ||
" upgrade=False,\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"id": "7f9065ea", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Environment variables have been set successfully.\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"# Set environment variables\n", | ||
"from langchain_opentutorial import set_env\n", | ||
"\n", | ||
"set_env(\n", | ||
" {\n", | ||
" \"OPENAI_API_KEY\": \"\",\n", | ||
" \"LANGCHAIN_API_KEY\": \"\",\n", | ||
" \"LANGCHAIN_TRACING_V2\": \"true\",\n", | ||
" \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", | ||
" \"LANGCHAIN_PROJECT\": \"BiorxivLoader\", # Please set it the same as title\n", | ||
" }\n", | ||
")" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "690a9ae0", | ||
"metadata": {}, | ||
"source": [ | ||
"You can alternatively set API keys such as `OPENAI_API_KEY` in a `.env` file and load them.\n", | ||
"\n", | ||
"**[Note]** This is not necessary if you've already set the required API keys in previous steps." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"id": "4f99b5b6", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"data": { | ||
"text/plain": [ | ||
"True" | ||
] | ||
}, | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"output_type": "execute_result" | ||
} | ||
], | ||
"source": [ | ||
"# Load API keys from .env file\n", | ||
"from dotenv import load_dotenv\n", | ||
"\n", | ||
"load_dotenv(override=True)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2b2fc536", | ||
"metadata": {}, | ||
"source": [ | ||
"## Example Queries\n", | ||
"\n", | ||
"In this step, we will test out few examples to see if the biorxiv loader works as expected so it has a potential to contribute to `langchain_community`\n", | ||
"- We will test the date range from server \"biorxiv\" for the period from 2024-01-01 to 2024-02-17" | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 실제 코드는 총 3번의 쿼리를 통해 simple keyword search, w/date range, w/recent date range를 조회하고 있어서 이를 여기에 반영하거나, 각 검색 쿼리에 대한 코드블럭을 나눠서 각각 설명해주시는 것도 좋을 것 같습니다. 또한, 다른 document loader tutorial을 참고하여 search 함수의 파라미터를 설명해주는 게 좋을 것 같아 제안드립니다. |
||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"id": "d527750a", | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"\n", | ||
"Testing simple keyword search...\n", | ||
"\n", | ||
"Searching biorxiv with URL: https://api.biorxiv.org/details/biorxiv/2000-01-01/2025-02-19/0\n", | ||
"Total results from API: 370698\n", | ||
"Filtered results for query 'machine learning': 0\n", | ||
"\n", | ||
"Testing keyword search with date range...\n", | ||
"\n", | ||
"Searching biorxiv with URL: https://api.biorxiv.org/details/biorxiv/2024-01-01/2024-02-17/0\n", | ||
"Total results from API: 7235\n", | ||
"Filtered results for query 'CRISPR': 2\n", | ||
"\n", | ||
"Searching medrxiv with URL: https://api.biorxiv.org/details/medrxiv/2024-01-01/2024-02-17/0\n", | ||
"Total results from API: 1792\n", | ||
"Filtered results for query 'CRISPR': 1\n", | ||
"\n", | ||
"Testing last 30 days search...\n", | ||
"\n", | ||
"Searching medrxiv with URL: https://api.biorxiv.org/details/medrxiv/2025-01-20/2025-02-19/0\n", | ||
"Total results from API: 1306\n", | ||
"Filtered results for query 'sequencing': 8\n", | ||
"\n", | ||
"Summary Statistics\n", | ||
"-----------------\n", | ||
"Total unique papers: 8\n", | ||
"\n", | ||
"Papers by Category:\n", | ||
"genetic and genomic medicine: 4\n", | ||
"developmental biology: 1\n", | ||
"cell biology: 1\n", | ||
"infectious diseases: 1\n", | ||
"rheumatology: 1\n", | ||
"\n", | ||
"Papers by Server:\n", | ||
"biorxiv: 2\n", | ||
"medrxiv: 6\n", | ||
"\n", | ||
"Sample of Retrieved Papers:\n", | ||
"\n", | ||
"1. Establishment of CRISPR/Cas9-based knock-in in a hemimetabolous insect: targeted gene tagging in the cricket Gryllus bimaculatus\n", | ||
"Server: biorxiv\n", | ||
"Date: 2024-01-23\n", | ||
"DOI: 10.1101/2021.05.10.441399\n", | ||
"\n", | ||
"2. When less is more - A fast TurboID KI approach for high sensitivity endogenous interactome mapping\n", | ||
"Server: biorxiv\n", | ||
"Date: 2024-01-11\n", | ||
"DOI: 10.1101/2021.11.19.469212\n", | ||
"\n", | ||
"3. Tcte1 knockout influence on energy chain transportation, apoptosis and spermatogenesis - implications for male infertility\n", | ||
"Server: medrxiv\n", | ||
"Date: 2024-02-16\n", | ||
"DOI: 10.1101/2022.11.17.22282339\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import requests\n", | ||
"from datetime import datetime, timedelta\n", | ||
"from typing import List, Dict, Optional\n", | ||
"\n", | ||
"class SimpleBioRxivSearch:\n", | ||
" def __init__(self):\n", | ||
" self.base_url = \"https://api.biorxiv.org/details\"\n", | ||
" \n", | ||
" def search(self, \n", | ||
" query: str, \n", | ||
" server: List[str] = [\"biorxiv\"],\n", | ||
" start_date: Optional[str] = None,\n", | ||
" end_date: Optional[str] = None,\n", | ||
" max_results: int = 5) -> List[Dict]:\n", | ||
" \"\"\"\n", | ||
" Search bioRxiv and/or medRxiv papers.\n", | ||
" \"\"\"\n", | ||
" results = []\n", | ||
" \n", | ||
" for srv in server:\n", | ||
" # Construct the API URL\n", | ||
" if start_date and end_date:\n", | ||
" url = f\"{self.base_url}/{srv}/{start_date}/{end_date}/0\"\n", | ||
" else:\n", | ||
" url = f\"{self.base_url}/{srv}/2000-01-01/{datetime.now().strftime('%Y-%m-%d')}/0\"\n", | ||
" \n", | ||
" try:\n", | ||
" response = requests.get(url)\n", | ||
" response.raise_for_status() # Raise an exception for bad status codes\n", | ||
" data = response.json()\n", | ||
" \n", | ||
" # Print API response for debugging\n", | ||
" print(f\"\\nSearching {srv} with URL: {url}\")\n", | ||
" print(f\"Total results from API: {data.get('messages', [{}])[0].get('total', 0)}\")\n", | ||
" \n", | ||
" # Filter results based on query terms\n", | ||
" query_terms = [term.lower() for term in query.replace('\"', '').split(' AND ')]\n", | ||
" \n", | ||
" filtered_results = []\n", | ||
" for paper in data.get('collection', []):\n", | ||
" text_to_search = (paper.get('title', '') + ' ' + paper.get('abstract', '')).lower()\n", | ||
" if all(term in text_to_search for term in query_terms):\n", | ||
" filtered_results.append({\n", | ||
" 'title': paper.get('title', ''),\n", | ||
" 'abstract': paper.get('abstract', ''),\n", | ||
" 'doi': paper.get('doi', ''),\n", | ||
" 'category': paper.get('category', 'Unknown'),\n", | ||
" 'server': srv,\n", | ||
" 'date': paper.get('date', '')\n", | ||
" })\n", | ||
" \n", | ||
" print(f\"Filtered results for query '{query}': {len(filtered_results)}\")\n", | ||
" results.extend(filtered_results[:max_results])\n", | ||
" \n", | ||
" except requests.exceptions.RequestException as e:\n", | ||
" print(f\"Error accessing {srv} API: {str(e)}\")\n", | ||
" continue\n", | ||
" \n", | ||
" return results[:max_results]\n", | ||
"\n", | ||
"# Usage example:\n", | ||
"searcher = SimpleBioRxivSearch()\n", | ||
"\n", | ||
"# 1. Simple keyword search\n", | ||
"print(\"\\nTesting simple keyword search...\")\n", | ||
"docs1 = searcher.search(\n", | ||
" query=\"machine learning\", # Simplified query\n", | ||
" server=[\"biorxiv\"],\n", | ||
" max_results=5\n", | ||
")\n", | ||
"\n", | ||
"# 2. Keyword search with date range\n", | ||
"print(\"\\nTesting keyword search with date range...\")\n", | ||
"docs2 = searcher.search(\n", | ||
" query=\"CRISPR\", # Simplified query\n", | ||
" server=[\"biorxiv\", \"medrxiv\"],\n", | ||
" start_date=\"2024-01-01\",\n", | ||
" end_date=\"2024-02-17\",\n", | ||
" max_results=5\n", | ||
")\n", | ||
"\n", | ||
"# 3. Last 30 days search\n", | ||
"print(\"\\nTesting last 30 days search...\")\n", | ||
"end_date = datetime.now()\n", | ||
"start_date = end_date - timedelta(days=30)\n", | ||
"docs3 = searcher.search(\n", | ||
" query=\"sequencing\", # Simplified query\n", | ||
" server=[\"medrxiv\"],\n", | ||
" start_date=start_date.strftime(\"%Y-%m-%d\"),\n", | ||
" end_date=end_date.strftime(\"%Y-%m-%d\"),\n", | ||
" max_results=5\n", | ||
")\n", | ||
"\n", | ||
"# Print summary statistics\n", | ||
"all_docs = docs1 + docs2 + docs3\n", | ||
"categories = {}\n", | ||
"servers = {\"biorxiv\": 0, \"medrxiv\": 0}\n", | ||
"\n", | ||
"for doc in all_docs:\n", | ||
" # Count by category\n", | ||
" cat = doc['category']\n", | ||
" categories[cat] = categories.get(cat, 0) + 1\n", | ||
" \n", | ||
" # Count by server\n", | ||
" server = doc['server']\n", | ||
" servers[server] += 1\n", | ||
"\n", | ||
"print(\"\\nSummary Statistics\")\n", | ||
"print(\"-----------------\")\n", | ||
"print(f\"Total unique papers: {len(all_docs)}\")\n", | ||
"print(\"\\nPapers by Category:\")\n", | ||
"for cat, count in sorted(categories.items(), key=lambda x: x[1], reverse=True):\n", | ||
" print(f\"{cat}: {count}\")\n", | ||
"\n", | ||
"print(\"\\nPapers by Server:\")\n", | ||
"for server, count in servers.items():\n", | ||
" print(f\"{server}: {count}\")\n", | ||
"\n", | ||
"# Print sample of results\n", | ||
"print(\"\\nSample of Retrieved Papers:\")\n", | ||
"for i, doc in enumerate(all_docs[:3], 1):\n", | ||
" print(f\"\\n{i}. {doc['title']}\")\n", | ||
" print(f\"Server: {doc['server']}\")\n", | ||
" print(f\"Date: {doc['date']}\")\n", | ||
" print(f\"DOI: {doc['doi']}\")" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "langchain-opentutorial-O2Q3XwPg-py3.11", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.11" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"if the biorxic loader works as expected so it has a potential to contribute to
langchain_community
"라는 문장 의미가 다소 불명확한 것 같습니다. 어떤 뜻으로 적으신 걸까요?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Arxiv는 langchain에서는 (langchain_community) 형태로 서포트가 되지만, biorxiv, medrxiv와 같은 생명과학자들을 위한 패키지는 없어서 많이 아쉬워서 표현을 이러한 식으로 한것입니다.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
네 취지는 이해하였는데요,
langchain_community
패키지와 무관해보이는데 해당 코드에 contribution한다는 의미로 읽혀서 여쭤보았습니다. 아래와 같이 수정해보면 어떨까 싶습니다: