
Conversation

pupba
Contributor

@pupba pupba commented Jan 9, 2025

  • Remove animal-180.zip Dataset
  • Add HuggingFace Dataset Code

Author Checklist

  • PR Title Format: I have confirmed that the PR title follows the correct format. (e.g., [N-2] 07-Text Splitter / 07-RecursiveCharacterTextSplitter)

  • Committed Files: I have ensured that no unnecessary files (e.g., .bin, .gitignore, poetry.lock, pyproject.toml) are included. These files are not allowed.

  • (Optional) Related Issue: If this PR is linked to an issue, I have referenced the issue number in the PR message. (e.g., Fixes Update 01-PromptTemplate.ipynb #123)

  • ❌ Do not include unnecessary files (e.g., .bin, .gitignore, poetry.lock, pyproject.toml) or other people's code. If included, close the PR and create a new PR.

Review Template (Initial PR)

🖥️ OS: Win/Mac/Linux   
✅ Checklist      
 - [ ] **Template**: Tutorial follows the required template. 
 - [ ] **Table of Contents (TOC) Links**: All Table of Contents links work. (Yes/No)
 - [ ] **Image**: Image filenames follow guidelines.
 - [ ] **Imports**: All import statements use the latest versions. Ensure "langchain-teddynote" is not used. 
 - [ ] **Code Execution**: Code runs without errors.
 - Comments: {Write freely, Korean acceptable}     

If no one reviews your PR within a few days, please @-mention one of teddylee777, musangk, BAEM1N

@pupba pupba added the hot fix Quick fix on something label Jan 9, 2025
@pupba
Contributor Author

pupba commented Jan 9, 2025

While fixing the data loading feature, I accidentally closed the previous PR, so I opened a new one. Sorry about that, reviewers...
@namyoungkim @Normalist-K

Contributor

@namyoungkim namyoungkim left a comment


🖥️ OS: Mac
✅ Checklist

  • Template: Tutorial follows the required template.
  • Table of Contents (TOC) Links: All Table of Contents links work. (Yes/No)
  • Image: Image filenames follow guidelines.
  • Imports: All import statements use the latest versions. Ensure "langchain-teddynote" is not used.
  • Code Execution: Code runs without errors.
  • Comments: Great work!! 🙂
    • Everything was written very thoroughly, and I confirmed that it all runs correctly.
    • The data loading time has also dropped a lot; it took just over a minute, I think. Thank you!!

Remove spaces after backticks and add import os to the last cell
Contributor

@Normalist-K Normalist-K left a comment


🖥️ OS: Mac
✅ Checklist

  • Template: Tutorial follows the required template.
    • I fixed every spot missing a space after a backtick and committed the changes.
  • Table of Contents (TOC) Links: All Table of Contents links work. (Yes/No)
  • Image: Image filenames follow guidelines.
  • Imports: All import statements use the latest versions. Ensure "langchain-teddynote" is not used.
  • Code Execution: Code runs without errors.
    1. The last cell was missing import os, so I added it. (Commit pushed.)
    2. The datasets package needs to be updated; details below.
  • Comments: {Write freely, Korean acceptable}

[ISSUE]

  • ValueError in load_dataset

[RESOLVE]

  • The installed datasets version was 2.14.4; upgrading the version resolved the error.

[Action items]

  • The related package version needs to be updated.
  • datasets: 3.2.0
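The minimum-version requirement above can be verified programmatically at the top of the notebook. The following is a minimal sketch, not part of the PR; the helper names `parse_version` and `meets_minimum` are made up for illustration:

```python
def parse_version(v: str) -> tuple:
    """Turn a version string like '3.2.0' into a comparable tuple of ints."""
    return tuple(int(p) for p in v.split(".")[:3] if p.isdigit())


def meets_minimum(installed: str, minimum: str = "3.2.0") -> bool:
    """Check whether `installed` satisfies the `datasets >= 3.2.0` requirement."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    # 2.14.4 was the version that triggered the ValueError in this PR.
    print(meets_minimum("2.14.4"))  # False
    print(meets_minimum("3.2.0"))   # True
```

With the package installed, `importlib.metadata.version("datasets")` would supply the installed version string to check.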

[Failing code]

from datasets import load_dataset

dataset = load_dataset("Pupba/animal-180", split="train")

[Error]

ValueError                                Traceback (most recent call last)
Cell In[20], line 3
      1 from datasets import load_dataset
----> 3 dataset = load_dataset("Pupba/animal-180", split="train")
      5 # slice 50 set
      6 images = dataset[:50]["png"]

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:2112, in load_dataset(path, name, data_dir, data_files, split, cache_dir, features, download_config, download_mode, verification_mode, ignore_verifications, keep_in_memory, save_infos, revision, token, use_auth_token, task, streaming, num_proc, storage_options, **config_kwargs)
   2107 verification_mode = VerificationMode(
   2108     (verification_mode or VerificationMode.BASIC_CHECKS) if not save_infos else VerificationMode.ALL_CHECKS
   2109 )
   2111 # Create a dataset builder
-> 2112 builder_instance = load_dataset_builder(
   2113     path=path,
   2114     name=name,
   2115     data_dir=data_dir,
   2116     data_files=data_files,
   2117     cache_dir=cache_dir,
   2118     features=features,
   2119     download_config=download_config,
   2120     download_mode=download_mode,
   2121     revision=revision,
   2122     token=token,
   2123     storage_options=storage_options,
   2124     **config_kwargs,
   2125 )
   2127 # Return iterable dataset in case of streaming
   2128 if streaming:

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1798, in load_dataset_builder(path, name, data_dir, data_files, cache_dir, features, download_config, download_mode, revision, token, use_auth_token, storage_options, **config_kwargs)
   1796     download_config = download_config.copy() if download_config else DownloadConfig()
   1797     download_config.storage_options.update(storage_options)
-> 1798 dataset_module = dataset_module_factory(
   1799     path,
   1800     revision=revision,
   1801     download_config=download_config,
   1802     download_mode=download_mode,
   1803     data_dir=data_dir,
   1804     data_files=data_files,
   1805 )
   1806 # Get dataset builder class from the processing script
   1807 builder_kwargs = dataset_module.builder_kwargs

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1495, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1490             if isinstance(e1, FileNotFoundError):
   1491                 raise FileNotFoundError(
   1492                     f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory. "
   1493                     f"Couldn't find '{path}' on the Hugging Face Hub either: {type(e1).__name__}: {e1}"
   1494                 ) from None
-> 1495             raise e1 from None
   1496 else:
   1497     raise FileNotFoundError(
   1498         f"Couldn't find a dataset script at {relative_to_absolute_path(combined_path)} or any data file in the same directory."
   1499     )

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1479, in dataset_module_factory(path, revision, download_config, download_mode, dynamic_modules_path, data_dir, data_files, **download_kwargs)
   1464         return HubDatasetModuleFactoryWithScript(
   1465             path,
   1466             revision=revision,
   (...)
   1469             dynamic_modules_path=dynamic_modules_path,
   1470         ).get_module()
   1471     else:
   1472         return HubDatasetModuleFactoryWithoutScript(
   1473             path,
   1474             revision=revision,
   1475             data_dir=data_dir,
   1476             data_files=data_files,
   1477             download_config=download_config,
   1478             download_mode=download_mode,
-> 1479         ).get_module()
   1480 except (
   1481     Exception
   1482 ) as e1:  # noqa all the attempts failed, before raising the error we should check if the module is already cached.
   1483     try:

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/load.py:1034, in HubDatasetModuleFactoryWithoutScript.get_module(self)
   1029 metadata_configs = MetadataConfigs.from_dataset_card_data(dataset_card_data)
   1030 dataset_infos = DatasetInfosDict.from_dataset_card_data(dataset_card_data)
   1031 patterns = (
   1032     sanitize_patterns(self.data_files)
   1033     if self.data_files is not None
-> 1034     else get_data_patterns(base_path, download_config=self.download_config)
   1035 )
   1036 data_files = DataFilesDict.from_patterns(
   1037     patterns,
   1038     base_path=base_path,
   1039     allowed_extensions=ALL_ALLOWED_EXTENSIONS,
   1040     download_config=self.download_config,
   1041 )
   1042 module_name, default_builder_kwargs = infer_module_for_data_files(
   1043     data_files=data_files,
   1044     path=self.name,
   1045     download_config=self.download_config,
   1046 )

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:457, in get_data_patterns(base_path, download_config)
    455 resolver = partial(resolve_pattern, base_path=base_path, download_config=download_config)
    456 try:
--> 457     return _get_data_files_patterns(resolver)
    458 except FileNotFoundError:
    459     raise EmptyDatasetError(f"The directory at {base_path} doesn't contain any data files") from None

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:248, in _get_data_files_patterns(pattern_resolver)
    246 for pattern in patterns:
    247     try:
--> 248         data_files = pattern_resolver(pattern)
    249     except FileNotFoundError:
    250         continue

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/datasets/data_files.py:332, in resolve_pattern(pattern, base_path, allowed_extensions, download_config)
    330     base_path = ""
    331 pattern, storage_options = _prepare_path_and_storage_options(pattern, download_config=download_config)
--> 332 fs, _, _ = get_fs_token_paths(pattern, storage_options=storage_options)
    333 fs_base_path = base_path.split("::")[0].split("://")[-1] or fs.root_marker
    334 fs_pattern = pattern.split("::")[0].split("://")[-1]

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/core.py:686, in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol, expand)
    684     paths = _expand_paths(paths, name_function, num)
    685 elif "*" in paths:
--> 686     paths = [f for f in sorted(fs.glob(paths)) if not fs.isdir(f)]
    687 else:
    688     paths = [paths]

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/huggingface_hub/hf_file_system.py:521, in HfFileSystem.glob(self, path, **kwargs)
    519 kwargs = {"expand_info": kwargs.get("detail", False), **kwargs}
    520 path = self.resolve_path(path, revision=kwargs.get("revision")).unresolve()
--> 521 return super().glob(path, **kwargs)

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/spec.py:611, in AbstractFileSystem.glob(self, path, maxdepth, **kwargs)
    607         depth = None
    609 allpaths = self.find(root, maxdepth=depth, withdirs=True, detail=True, **kwargs)
--> 611 pattern = glob_translate(path + ("/" if ends_with_sep else ""))
    612 pattern = re.compile(pattern)
    614 out = {
    615     p: info
    616     for p, info in sorted(allpaths.items())
   (...)
    621     )
    622 }

File ~/anaconda3/envs/langchain-opentutorial/lib/python3.11/site-packages/fsspec/utils.py:731, in glob_translate(pat)
    729     continue
    730 elif "**" in part:
--> 731     raise ValueError(
    732         "Invalid pattern: '**' can only be an entire path component"
    733     )
    734 if part:
    735     results.extend(_translate(part, f"{not_sep}*", not_sep))

ValueError: Invalid pattern: '**' can only be an entire path component
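For context, the check that raises this error (the fsspec.utils.glob_translate frame at the bottom of the traceback) can be sketched as a small standalone rule. This is a simplified re-implementation for illustration, not fsspec's actual code:

```python
def validate_glob_pattern(pattern: str) -> None:
    """Simplified sketch of the rule enforced in fsspec.utils.glob_translate:
    '**' may only appear as an entire path component, never inside one."""
    for part in pattern.split("/"):
        if part == "**":
            continue  # a whole-component '**' is allowed
        if "**" in part:
            raise ValueError(
                "Invalid pattern: '**' can only be an entire path component"
            )


if __name__ == "__main__":
    validate_glob_pattern("**/train/*.png")  # passes: '**' is its own component
    try:
        validate_glob_pattern("data**/train/*.png")  # '**' embedded in a component
    except ValueError as e:
        print(e)
```

Older datasets releases could emit Hub data-file patterns with '**' embedded inside a component, which newer fsspec versions reject with this check; that is consistent with the fix below of upgrading datasets.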

pupba added 4 commits January 10, 2025 18:12
- add `import os`
- Fix `package.install` to `datasets >= 3.2.0`
- Fix: Add a blank space after the backtick
- Fix: `package.install` to `datasets >= 3.2.0`
@pupba
Contributor Author

pupba commented Jan 10, 2025

@Normalist-K

  1. Fixed the backtick issues!
  2. Changed the package install section to datasets >= 3.2.0 so that version 3.2.0 or later gets installed.
  3. Added import os!

Thanks for the feedback.

@Normalist-K
Contributor

@Normalist-K

  1. Fixed the backtick issues!
  2. Changed the package install section to datasets >= 3.2.0 so that version 3.2.0 or later gets installed.
  3. Added import os!

Thanks for the feedback.

Oh! I had also added import os in my commit, so it looks like it's duplicated now!

- Fix duplicated 'import os'.
@pupba
Contributor Author

pupba commented Jan 10, 2025

@Normalist-K

There were some notebook-file sync issues in my working environment...
I fixed the duplication! Thank you!

Contributor

@Normalist-K Normalist-K left a comment


Great work! 👍🏻

Contributor

@namyoungkim namyoungkim left a comment


It seems I did my review before fetching the latest changes.
Youngin reviewed everything very thoroughly!

Thank you both.
I ran everything again myself and confirmed it executes without errors!

Great work!!

@teddylee777 teddylee777 merged commit 87fddfd into LangChain-OpenTutorial:main Jan 11, 2025
1 of 2 checks passed