[N-2] 09-Vector Store / 02-ChromaWithLangchain #290
Conversation
- Add Dataset for Image Search : `animal-180.zip`
- Remove `animal-180.zip` Dataset - Add HuggingFace Dataset Code
While fixing the data-loading code, I accidentally closed the previous PR, so I opened this new one. Sorry, reviewers... 😢
🖥️ OS: Mac
✅ Checklist
- Template: Tutorials follows the required template.
- Table of Contents (TOC) Links: All Table of Contents links work. (Yes/No)
- Image: Image filenames follow guidelines.
- *Imports: All import statements use the latest versions. Ensure "langchain-teddynote" is not used.
- Code Execution: Code runs without errors.
- Comments: Great work!! 🙂
- Everything was written very carefully, and I confirmed that it all runs correctly.
- The data-loading time has also dropped a lot; it took a little over a minute for me. Thank you!!
Remove spaces after backticks and add `import os` to the last cell
🖥️ OS: Mac
✅ Checklist
- Template: Tutorials follows the required template.
- I fixed every spot that was missing a space after the backtick and committed the changes.
- Table of Contents (TOC) Links: All Table of Contents links work. (Yes/No)
- Image: Image filenames follow guidelines.
- *Imports: All import statements use the latest versions. Ensure "langchain-teddynote" is not used.
- Code Execution: Code runs without errors.
- The last cell was missing `import os`, so I added it. (Committed)
- The `datasets` package needs to be updated. Details are written below.
- Comments: {Write freely; Korean is allowed}
[ISSUE]
- `ValueError` raised in `load_dataset`
[RESOLVE]
- The installed `datasets` version was 2.14.4; upgrading the package resolved the error.
[Action items]
- The related package version needs to be updated:
  datasets: 3.2.0
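The version bump above can be applied with a standard pip upgrade (a sketch, assuming a pip-managed environment such as the conda env shown in the traceback):

```shell
# Upgrade datasets to at least the version reported to fix the issue
pip install -U "datasets>=3.2.0"
# Confirm the installed version
python -c "import datasets; print(datasets.__version__)"
```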
[Code that raised the error]
```python
from datasets import load_dataset

dataset = load_dataset("Pupba/animal-180", split="train")
```
[Error]
```
ValueError                                Traceback (most recent call last)
Cell In[20], line 3
      1 from datasets import load_dataset
----> 3 dataset = load_dataset("Pupba/animal-180", split="train")
      5 # slice 50 set
      6 images = dataset[:50]["png"]
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:2112, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2107 verification_mode = VerificationMode(
   2108     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2109 )
   2111 # Create a dataset builder
-> 2112 builder_instance = load_dataset_builder(
   2113     path=path,
   2114     name=name,
   2115     data_dir=data_dir,
   2116     data_files=data_files,
   2117     cache_dir=cache_dir,
   2118     features=features,
   2119     download_config=download_config,
   2120     download_mode=download_mode,
   2121     revision=revision,
   2122     token=token,
   2123     storage_options=storage_options,
   2124     **config_kwargs,
   2125 )
   2127 # Return iterable dataset in case of streaming
   2128 if streaming:
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1798, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, **config_kwargs)
   1796 download_config = download_config.copy() if download_config else DownloadConfig()
   1797 download_config.storage_options.update(storage_options)
-> 1798 dataset_module = dataset_module_factory(
   1799     path,
   1800     revision=revision,
   1801     download_config=download_config,
   1802     download_mode=download_mode,
   1803     data_dir=data_dir,
   1804     data_files=data_files,
   1805 )
   1806 # Get dataset builder class from the processing script
   1807 builder_kwargs = dataset_module.builder_kwargs
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1495, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1490 if isinstance(e1, FileNotFoundError):
   1491     raise FileNotFoundError(
   1492         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1493         f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1494     ) from None
-> 1495 raise e1 from None
   1496 else:
   1497     raise FileNotFoundError(
   1498         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory."
   1499     )
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1479, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1464     return HubDatasetModuleFactoryWithScript(
   1465         path,
   1466         revision=revision,
   (...)
   1469         dynamic_modules_path=dynamic_modules_path,
   1470     ).get_module()
   1471 else:
   1472     return HubDatasetModuleFactoryWithoutScript(
   1473         path,
   1474         revision=revision,
   1475         data_dir=data_dir,
   1476         data_files=data_files,
   1477         download_config=download_config,
   1478         download_mode=download_mode,
-> 1479     ).get_module()
   1480 except (
   1481     Exception
   1482 ) as e1:  # noqa all the attempts failed, before raising the error we should check if the module is already cached.
   1483     try:
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1034, in HubDatasetModuleFactoryWithoutScript.get_module(self)
   1029 metadata_configs = MetadataConfigs.from_dataset_card_data(dataset_card_data)
   1030 dataset_infos = DatasetInfosDict.from_dataset_card_data(dataset_card_data)
   1031 patterns = (
   1032     sanitize_patterns(self.data_files)
   1033     if self.data_files is not None
-> 1034     else get_data_patterns(base_path, download_config=self.download_config)
   1035 )
   1036 data_files = DataFilesDict.from_patterns(
   1037     patterns,
   1038     base_path=base_path,
   1039     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1040     download_config=self.download_config,
   1041 )
   1042 module_name, default_builder_kwargs = infer_module_for_data_files(
   1043     data_files=data_files,
   1044     path=self.name,
   1045     download_config=self.download_config,
   1046 )
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:457, in get_data_patterns(base_path, download_config)
   455 resolver = partial(resolve_pattern, base_path=base_path, download_config=download_config)
   456 try:
-> 457     return _get_data_files_patterns(resolver)
   458 except FileNotFoundError:
   459     raise EmptyDatasetError(f"The directory at {base_path} doesn't contain any data files") from None
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:248, in _get_data_files_patterns(pattern_resolver)
   246 for pattern in patterns:
   247     try:
-> 248         data_files = pattern_resolver(pattern)
   249     except FileNotFoundError:
   250         continue
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:332, in resolve_pattern(pattern, base_path, allowed_extensions, download_config)
   330     base_path = ""
   331 pattern, storage_options = _prepare_path_and_storage_options(pattern, download_config=download_config)
-> 332 fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
   333 fs_base_path = base_path.split("::")[0].split("://")[-1] or fs.root_marker
   334 fs_pattern = pattern.split("::")[0].split("://")[-1]
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/core.py:686, in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
   684     paths = _expand_paths(paths, name_function, num)
   685 elif "*" in paths:
-> 686     paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
   687 else:
   688     paths = [paths]
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/huggingface_hub/hf_file_system.py:521, in HfFileSystem.glob(self, path, **kwargs)
   519 kwargs = {"expand_info": kwargs.get("detail", False), **kwargs}
   520 path = self.resolve_path(path, revision=kwargs.get("revision")).unresolve()
-> 521 return super().glob(path, **kwargs)
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/spec.py:611, in AbstractFileSystem.glob(self, path, maxdepth, **kwargs)
   607     depth = None
   609 allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
-> 611 pattern = glob_translate(path + ("/" if ends_with_sep else ""))
   612 pattern = re.compile(pattern)
   614 out = {
   615     p: info
   616     for p, info in sorted(allpaths.items())
   (...)
   621     )
   622 }
File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/utils.py:731, in glob_translate(pat)
   729     continue
   730 elif "**" in part:
-> 731     raise ValueError(
   732         "Invalid pattern: '**' can only be an entire path component"
   733     )
   734 if part:
   735     results.extend(_translate(part, f"{not_sep}*", not_sep))
ValueError: Invalid pattern: '**' can only be an entire path component
```
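Since the error only appears with older `datasets` releases, a notebook could fail fast with a clear message instead of surfacing the fsspec traceback. This is a hypothetical guard (the helper name and the 3.2.0 threshold come from this thread, not from the library); note the numeric, not lexical, comparison:

```python
# Hypothetical guard: refuse to run with a `datasets` release older than
# the first version reported to work in this PR thread (3.2.0).
def needs_upgrade(installed: str, required: str = "3.2.0") -> bool:
    """Compare dotted version strings numerically, not lexically."""
    to_tuple = lambda v: tuple(int(p) for p in v.split("."))
    return to_tuple(installed) < to_tuple(required)

print(needs_upgrade("2.14.4"))  # → True: the version that raised the ValueError
print(needs_upgrade("3.2.0"))   # → False: meets the requirement
```

In a notebook this would be called with `datasets.__version__` before `load_dataset`, raising a `RuntimeError` with an upgrade hint when it returns `True`. The numeric comparison matters because lexically `"3.10.0" < "3.2.0"`, which would misreport a newer release as outdated.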
- add `import os` - Fix `package.install` to `datasets >= 3.2.0`
- Fix : Add a blank space after the backtick - Fix : `package.install` to `datasets >= 3.2.0`
Thanks for the feedback.
Oh! I had also added `import os` in my commit, so it looks like it's duplicated now!
- Fix duplicated `import os`
There were some notebook-file sync issues in my working environment...
Great work! 👍🏻
It looks like I reviewed before fetching the latest changes.
Youngin did a thorough review!
Thank you both.
I ran it again as well and confirmed it executes without errors!
Great work!!
Merged commit 87fddfd into LangChain-OpenTutorial:main
animal-180.zip Dataset
Author Checklist
PR Title Format: I have confirmed that the PR title follows the correct format. (e.g., [N-2] 07-Text Splitter / 07-RecursiveCharacterTextSplitter)
Committed Files: I have ensured that no unnecessary files (e.g., .bin, .gitignore, poetry.lock, pyproject.toml) are included. These files are not allowed.
(Optional) Related Issue: If this PR is linked to an issue, I have referenced the issue number in the PR message. (e.g., Fixes Update 01-PromptTemplate.ipynb #123)
❌ Do not include unnecessary files (e.g., .bin, .gitignore, poetry.lock, pyproject.toml) or other people's code. If included, close the PR and create a new PR.
Review Template (Initial PR)
If no one reviews your PR within a few days, please @-mention one of teddylee777, musangk, BAEM1N