diff --git a/13-LangChain-Expression-Language/09-Generator.ipynb b/13-LangChain-Expression-Language/09-Generator.ipynb new file mode 100644 index 000000000..b27a90f18 --- /dev/null +++ b/13-LangChain-Expression-Language/09-Generator.ipynb @@ -0,0 +1,509 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Generator\n", + "\n", + "- Author: [Junseong Kim](https://www.linkedin.com/in/%EC%A4%80%EC%84%B1-%EA%B9%80-591b351b2/)\n", + "- Design: [Junseong Kim](https://www.linkedin.com/in/%EC%A4%80%EC%84%B1-%EA%B9%80-591b351b2/)\n", + "- Peer Review: \n", + "- This is a part of [LangChain Open Tutorial](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial)\n", + "\n", + "[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/02-CommaSeparatedListOutputParser.ipynb) [![Open in GitHub](https://img.shields.io/badge/Open%20in%20GitHub-181717?style=flat-square&logo=github&logoColor=white)](https://github.com/LangChain-OpenTutorial/LangChain-OpenTutorial/blob/main/03-OutputParser/02-CommaSeparatedListOutputParser.ipynb)\n", + "\n", + "## Overview\n", + "\n", + "This tutorial demonstrates how to use a **user-defined generator** (or async generator) in a LangChain pipeline to process text outputs in a streaming fashion. Specifically, we’ll show how to parse a comma-separated string output into a Python list, all while maintaining the benefits of streaming from a Language Model.\n", + "\n", + "We will also cover asynchronous usage, showing how to adopt the same approach with async generators. By the end of this tutorial, you’ll be able to:\n", + "\n", + "Implement a custom generator function that can handle streaming outputs\n", + "Parse comma-separated text chunks into a list in real time\n", + "Use both synchronous and asynchronous approaches for streaming\n", + "Integrate these parsers in a LangChain chain\n", + "Optionally, explore how `RunnableGenerator` can help implement custom generator transformations in a streaming context\n", + "\n", + "### Table of Contents\n", + "\n", + "- [Overview](#overview) \n", + "- [Environment Setup](#environment-setup) \n", + "- [Implementing a Comma-Separated List Parser with a Custom Generator](#implementing-a-comma-separated-list-parser-with-a-custom-generator) \n", + " - [Synchronous Parsing](#synchronous-parsing) \n", + " - [Asynchronous Parsing](#asynchronous-parsing) \n", + "- [Using RunnableGenerator with Our Comma-Separated List Parser](#using-runnablegenerator-with-our-comma-separated-list-parser) \n", + "\n", + "### References\n", + "\n", + "- [LangChain ChatOpenAI API reference](https://python.langchain.com/api_reference/openai/chat_models/langchain_openai.chat_models.base.ChatOpenAI.html)\n", + "- [LangChain custom functions](https://python.langchain.com/docs/how_to/functions/)\n", + "- [LangChain RunnableGenerator](https://python.langchain.com/api_reference/core/runnables/langchain_core.runnables.base.RunnableGenerator.html)\n", + "- [Python Generators Documentation](https://docs.python.org/3/tutorial/classes.html#generators)\n", + "- [Python Async IO Documentation](https://docs.python.org/3/library/asyncio.html)\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Environment Setup\n", + "\n", + "Set up the environment. You may refer to [Environment Setup](https://wikidocs.net/257836) for more details.\n", + "\n", + "**[Note]**\n", + "- `langchain-opentutorial` is a package that provides a set of easy-to-use environment setup, useful functions and utilities for tutorials. \n", + "- You can checkout the [`langchain-opentutorial`](https://github.com/LangChain-OpenTutorial/langchain-opentutorial-pypi) for more details." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "%%capture --no-stderr\n", + "%pip install langchain-opentutorial" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "from langchain_opentutorial import package\n", + "\n", + "package.install(\n", + " [\n", + " \"langsmith\",\n", + " \"langchain\",\n", + " \"langchain_openai\",\n", + " \"langchain_core\",\n", + " \"langchain_community\",\n", + " ],\n", + " verbose=False,\n", + " upgrade=False,\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Environment variables have been set successfully.\n" + ] + } + ], + "source": [ + "# Set environment variables\n", + "from langchain_opentutorial import set_env\n", + "\n", + "set_env(\n", + " {\n", + " \"OPENAI_API_KEY\": \"\",\n", + " \"LANGCHAIN_API_KEY\": \"\",\n", + " \"LANGCHAIN_TRACING_V2\": \"true\",\n", + " \"LANGCHAIN_ENDPOINT\": \"https://api.smith.langchain.com\",\n", + " \"LANGCHAIN_PROJECT\": \"09-Generator\",\n", + " }\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can alternatively set `OPENAI_API_KEY` in `.env` file and load it. \n", + "\n", + "[Note] This is not necessary if you've already set `OPENAI_API_KEY` in previous steps." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "True" + ] + }, + "execution_count": 4, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "from dotenv import load_dotenv\n", + "\n", + "load_dotenv(override=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Implementing a Comma-Separated List Parser with a Custom Generator\n", + "\n", + "When working with Language Models, you may often receive outputs in plain text form, such as comma-separated strings. If you want to parse those outputs into a structured format (e.g., a list) as they are generated, you can implement a custom generator function. This retains the streaming benefits—observing partial outputs in real time—while converting the data into a more usable format." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Synchronous Parsing\n", + "\n", + "In this section, we define a custom generator function `split_into_list()`. It accepts an iterator of tokens (strings) and continuously accumulates them until it encounters a comma. At each comma, it yields the current accumulated text (stripped and split) as a list item.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import Iterator, List\n", + "\n", + "\n", + "# A user-defined parser that splits a stream of tokens into a comma-separated list\n", + "def split_into_list(input: Iterator[str]) -> Iterator[List[str]]:\n", + " buffer = \"\"\n", + " for chunk in input:\n", + " # Accumulate tokens in the buffer\n", + " buffer += chunk\n", + " # Whenever we find a comma, split and yield the segment\n", + " while \",\" in buffer:\n", + " comma_index = buffer.index(\",\")\n", + " yield [buffer[:comma_index].strip()]\n", + " buffer = buffer[comma_index + 1 :]\n", + " # Finally, yield whatever remains in the buffer\n", + " yield [buffer.strip()]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, we create a LangChain pipeline that does the following:\n", + "\n", + "- Defines a prompt template to generate comma-separated outputs.\n", + "- Uses `ChatOpenAI` to get deterministic responses by setting `temperature=0.0`.\n", + "- Converts the raw output into a string using `StrOutputParser`.\n", + "- Pipes (|) that string output into our `split_into_list` function for parsing." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.prompts import ChatPromptTemplate\n", + "from langchain_core.output_parsers import StrOutputParser\n", + "from langchain_openai import ChatOpenAI\n", + "\n", + "prompt = ChatPromptTemplate.from_template(\n", + " \"Write a comma-separated list of 5 companies similar to: {company}\"\n", + ")\n", + "\n", + "# Initialize the model with temperature=0.0 for deterministic output\n", + "model = ChatOpenAI(temperature=0.0, model=\"gpt-4o\")\n", + "\n", + "# Chain 1: Convert to a string\n", + "str_chain = prompt | model | StrOutputParser()\n", + "\n", + "# Chain 2: Parse the comma-separated string into a list using our generator\n", + "list_chain = str_chain | split_into_list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "By streaming the output through `list_chain`, you can see the partial results in real time. Each chunk appears as soon as the parser encounters a comma:" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Microsoft']\n", + "['Apple']\n", + "['Amazon']\n", + "['Facebook']\n", + "['IBM']\n" + ] + } + ], + "source": [ + "# Stream the parsed data\n", + "for chunk in list_chain.stream({\"company\": \"Google\"}):\n", + " print(chunk, flush=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you prefer to get the entire parsed result at once (after the entire generation is completed), use the .`invoke()` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Microsoft', 'Apple', 'Amazon', 'Facebook', 'IBM']\n" + ] + } + ], + "source": [ + "output = list_chain.invoke({\"company\": \"Google\"})\n", + "print(output)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Asynchronous Parsing\n", + "\n", + "The above approach works for synchronous iteration. However, some applications may require **async** iteration to avoid blocking. The following shows how to handle the same comma-separated parsing with an **async generator**.\n", + "\n", + "\n", + "Here, `asplit_into_list()` accumulates tokens in the same way but uses async for to handle asynchronous data streams." + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "from typing import AsyncIterator\n", + "\n", + "\n", + "async def asplit_into_list(input: AsyncIterator[str]) -> AsyncIterator[List[str]]:\n", + " buffer = \"\"\n", + " async for chunk in input:\n", + " buffer += chunk\n", + " while \",\" in buffer:\n", + " comma_index = buffer.index(\",\")\n", + " yield [buffer[:comma_index].strip()]\n", + " buffer = buffer[comma_index + 1 :]\n", + " yield [buffer.strip()]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, you can **pipe** the asynchronous parser into a chain just like the synchronous version:" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [], + "source": [ + "alist_chain = str_chain | asplit_into_list" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you call `astream()`, you can handle each chunk as it arrives, in an async context:\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Microsoft']\n", + "['Apple']\n", + "['Amazon']\n", + "['Facebook']\n", + "['IBM']\n" + ] + } + ], + "source": [ + "async for chunk in alist_chain.astream({\"company\": \"Google\"}):\n", + " print(chunk, flush=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Similarly, you can get the entire parsed list using the asynchronous `ainvoke()` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Microsoft', 'Apple', 'Amazon', 'Facebook', 'IBM']\n" + ] + } + ], + "source": [ + "result = await alist_chain.ainvoke({\"company\": \"Google\"})\n", + "print(result)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using `RunnableGenerator` with Our Comma-Separated List Parser\n", + "In addition to writing your own generator functions, you can leverage `RunnableGenerator` for more advanced or modular streaming behavior. This approach wraps your generator logic in a Runnable, making it easy to plug into a chain and still preserve partial output streaming. Below, we modify our **comma-separated list parser** to demonstrate how `RunnableGenerator` can be used.\n", + "\n", + "### Why Use `RunnableGenerator`?\n", + "- Modularity: Easily encapsulate your parsing logic as a “runnable” component.\n", + "- Consistency: The `RunnableGenerator` interface (`invoke`, `stream`, `ainvoke`, `astream`) is consistent with other LangChain runnables.\n", + "- Extendability: Combine multiple runnables (e.g., `RunnableLambda`, `RunnableGenerator`) in sequence for more complex transformations. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Transforming the Same Parser Logic\n", + "\n", + "Previously, we defined `split_into_list()` as a standalone Python generator function. Let’s do something similar, but as a **transform** function for `RunnableGenerator`. We want to parse a streaming sequence of tokens into a **list** of individual items whenever we see a comma." + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [], + "source": [ + "from langchain_core.runnables import RunnableGenerator\n", + "from typing import Iterator, List\n", + "\n", + "\n", + "def comma_parser_runnable(input_iter: Iterator[str]) -> Iterator[List[str]]:\n", + " \"\"\"\n", + " This function accumulates tokens from input_iter and yields\n", + " each chunk split by commas as a list.\n", + " \"\"\"\n", + " buffer = \"\"\n", + " for chunk in input_iter:\n", + " buffer += chunk\n", + " # Whenever we find a comma, split and yield\n", + " while \",\" in buffer:\n", + " comma_index = buffer.index(\",\")\n", + " yield [buffer[:comma_index].strip()]\n", + " buffer = buffer[comma_index + 1 :]\n", + " # Finally, yield whatever remains\n", + " yield [buffer.strip()]\n", + "\n", + "\n", + "# Wrap it in a RunnableGenerator\n", + "parser_runnable = RunnableGenerator(comma_parser_runnable)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can now integrate 'parser_runnable' into the **same** prompt-and-model pipeline we used before. " + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "list_chain_via_runnable = str_chain | parser_runnable" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When run, partial outputs will appear as single-element lists, just like our original custom generator approach. \n", + "\n", + "The difference is that we’re now using `RunnableGenerator` to encapsulate the logic in a more modular, LangChain-native way." + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['Microsoft']\n", + "['Apple']\n", + "['Amazon']\n", + "['Facebook']\n", + "['IBM']\n" + ] + } + ], + "source": [ + "# Stream partial results\n", + "for parsed_chunk in list_chain_via_runnable.stream({\"company\": \"Google\"}):\n", + " print(parsed_chunk)" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "langchain-opentutorial-QDzDRI-1-py3.11", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.0" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +}