instructlab · bbrowning · Apr 9, 2025 · Apr 8, 2025 · Apr 9, 2025
diff --git a/src/instructlab/sdg/configs/knowledge/atomic_facts.yaml b/src/instructlab/sdg/configs/knowledge/atomic_facts.yaml
@@ -0,0 +1,45 @@
+system: You are an AI assistant knowledgeable about the {{domain}} domain. Be accurate but concise in your response.
+
+introduction: |
+  Please break down the following snippet from an article about {{domain}} into atomic facts.
+
+principles: |
+  1. Make sure each fact is grounded in the given text.
+  2. Include any information needed to explain the fact or concept.
+  3. The atomic facts should be as simple as possible. If a fact is a compound sentence, break it down further.
+  4. For clarity, avoid using pronouns like 'it', 'he', 'she', 'this', 'that', etc., and instead use the full names or titles.
+  5. Focus only on key concepts and facts. Skip any questions or problems mentioned in the passage.
+
+examples: |
+  To help you understand the task, here is an example:
+  [Passage]
+  The tournament was contested by ten national teams, maintaining the same format used in 2019. After six weeks of round-robin matches, India, South Africa, Australia, and New Zealand finished as the top four and qualified for the knockout stage. In the knockout stage, India and Australia beat New Zealand and South Africa, respectively, to advance to the final, played on 19 November at the Narendra Modi Stadium in Ahmedabad. Australia won the final by six wickets, winning their sixth Cricket World Cup title.
+  [Facts]
+  1. The tournament was contested by ten national teams.
+  2. The tournament maintained the same format used in 2019.
+  3. The round-robin matches lasted for six weeks.
+  4. India finished as one of the top four teams.
+  5. South Africa finished as one of the top four teams.
+  6. Australia finished as one of the top four teams.
+  7. New Zealand finished as one of the top four teams.
+  8. India, South Africa, Australia, and New Zealand qualified for the knockout stage.
+  9. In the knockout stage, India beat New Zealand.
+  10. In the knockout stage, Australia beat South Africa.
+  11. India advanced to the final.
+  12. Australia advanced to the final.
+  13. The final was played on 19 November.
+  14. The final was held at the Narendra Modi Stadium in Ahmedabad.
+  15. Australia won the final by six wickets.
+  16. Australia won their sixth Cricket World Cup title.
+  [End]
+
+
+generation: |
+  Now it's your turn. Break down the following snippet from an article about {{domain}} into atomic facts, following a similar style to the example above.
+  [Passage]
+  {{document}}
+  [Facts]
+
+
+start_tags: [""]
+end_tags: [""]
diff --git a/src/instructlab/sdg/configs/knowledge/detailed_summary.yaml b/src/instructlab/sdg/configs/knowledge/detailed_summary.yaml
@@ -0,0 +1,17 @@
+system: You are an AI assistant that is expert at summarizing text.
+
+introduction: |
+  Give me a detailed summary of the document below, making sure all key points are covered.
+
+principles: |
+  Do not add any new information.
+  Do not miss any key points from the provided document.
+
+examples: ""
+
+generation: |
+  Document:
+  {{document}}
+
+start_tags: [""]
+end_tags: [""]
diff --git a/src/instructlab/sdg/configs/knowledge/extractive_summary.yaml b/src/instructlab/sdg/configs/knowledge/extractive_summary.yaml
@@ -0,0 +1,17 @@
+system: You are an AI assistant that is expert at summarizing text.
+
+introduction: |
+  Give me a detailed extractive summary of the document below, making sure all key points are covered.
+
+principles: |
+  Do not add any new information.
+  Do not miss any key points from the provided document.
+
+examples: ""
+
+generation: |
+  Document:
+  {{document}}
+
+start_tags: [""]
+end_tags: [""]
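Reviewer note: the three prompt configs above share the same five sections (`system`, `introduction`, `principles`, `examples`, `generation`) and use `{{domain}}` and `{{document}}` placeholders. The sketch below shows one way such a config could be turned into a final prompt. It is an illustration only: `render_prompt`, the newline-joined section order, and the regex-based placeholder substitution are assumptions made for this example, not the package's actual rendering code.

```python
import re

# Section order is an assumption for this sketch; it mirrors the key order in the configs.
SECTIONS = ("system", "introduction", "principles", "examples", "generation")

# Matches placeholders such as {{document}} or {{ domain }}.
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_prompt(config: dict, sample: dict) -> str:
    """Join the config sections and fill each {{name}} from the sample row.

    Hypothetical helper: raises KeyError if the sample lacks a referenced column.
    """
    template = "\n".join(config.get(name, "") for name in SECTIONS)
    return PLACEHOLDER.sub(lambda match: str(sample[match.group(1)]), template)


# A trimmed-down stand-in for the detailed summary config.
summary_config = {
    "system": "You are an AI assistant that is expert at summarizing text.",
    "introduction": "Give me a detailed summary of the document below.",
    "principles": "Do not add any new information.",
    "examples": "",
    "generation": "Document:\n{{document}}\n",
}

prompt = render_prompt(summary_config, {"document": "Australia won the final by six wickets."})
```

A missing column raises `KeyError` here instead of silently leaving the placeholder in the prompt, which makes a misnamed column easier to spot.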
diff --git a/src/instructlab/sdg/pipelines/llama/__init__.py b/src/instructlab/sdg/pipelines/llama/__init__.py
diff --git a/src/instructlab/sdg/pipelines/llama/freeform_skills.yaml b/src/instructlab/sdg/pipelines/llama/freeform_skills.yaml
@@ -0,0 +1,53 @@
+version: "1.0"
+blocks:
+  - name: gen_questions
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/freeform_questions.yaml
+      output_cols:
+        - question
+      batch_kwargs:
+        num_samples: 50
+    drop_duplicates:
+      - question
+  - name: eval_questions
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/evaluate_freeform_questions.yaml
+      output_cols:
+        - evaluation
+        - score
+  - name: filter_questions
+    type: FilterByValueBlock
+    config:
+      filter_column: score
+      filter_value: 1.0
+      operation: eq
+      convert_dtype: float
+    drop_columns:
+      - evaluation
+      - score
+      - num_samples
+  - name: gen_responses
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/freeform_responses.yaml
+      output_cols:
+        - response
+  - name: evaluate_qa_pair
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/evaluate_freeform_pair.yaml
+      output_cols:
+        - evaluation
+        - score
+  - name: filter_qa_pair
+    type: FilterByValueBlock
+    config:
+      filter_column: score
+      filter_value: 2.0
+      operation: ge
+      convert_dtype: float
+    drop_columns:
+      - evaluation
+      - score
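Reviewer note: the `FilterByValueBlock` entries above keep a row only when its `score` column, converted with `convert_dtype`, satisfies `operation` against `filter_value` (`eq 1.0` for questions, `ge 2.0` for question and answer pairs). A minimal sketch of that behaviour follows. `filter_by_value` is a hypothetical helper written for this note, it assumes `operation` names a function in Python's `operator` module, and dropping rows whose score cannot be converted is an assumption of the sketch rather than documented behaviour.

```python
import operator


def filter_by_value(rows, filter_column, filter_value, operation, convert_dtype=None):
    """Keep rows whose column value compares true against filter_value.

    Hypothetical helper: `operation` is the name of a function in the
    `operator` module (for example "eq" or "ge"), and `convert_dtype` is a
    callable such as `float`.
    """
    compare = getattr(operator, operation)
    kept = []
    for row in rows:
        value = row[filter_column]
        if convert_dtype is not None:
            try:
                value = convert_dtype(value)
            except (TypeError, ValueError):
                # Assumption: rows with an unparseable value are dropped.
                continue
        if compare(value, filter_value):
            kept.append(row)
    return kept


rows = [
    {"question": "q1", "score": "1.0"},
    {"question": "q2", "score": "0"},
    {"question": "q3", "score": "n/a"},
    {"question": "q4", "score": "2"},
]

valid_questions = filter_by_value(rows, "score", 1.0, "eq", convert_dtype=float)
good_pairs = filter_by_value(rows, "score", 2.0, "ge", convert_dtype=float)
```

Converting to `float` before comparing matters because model output arrives as text: the string `"1.0"` is not equal to the number `1.0` until it is converted.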
diff --git a/src/instructlab/sdg/pipelines/llama/grounded_skills.yaml b/src/instructlab/sdg/pipelines/llama/grounded_skills.yaml
@@ -0,0 +1,70 @@
+version: "1.0"
+blocks:
+  - name: gen_contexts
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/contexts.yaml
+      output_cols:
+        - context
+      gen_kwargs:
+        temperature: 0.7
+        max_tokens: 4096
+        n: 10
+        seed: 42
+    drop_duplicates:
+      - context
+  - name: gen_grounded_questions
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/grounded_questions.yaml
+      output_cols:
+        - question
+      batch_kwargs:
+        num_samples: 3
+    drop_duplicates:
+      - question
+  - name: eval_grounded_questions
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/evaluate_grounded_questions.yaml
+      output_cols:
+        - evaluation
+        - score
+  - name: filter_grounded_questions
+    type: FilterByValueBlock
+    config:
+      filter_column: score
+      filter_value: 1.0
+      operation: eq
+      convert_dtype: float
+    drop_columns:
+      - evaluation
+      - score
+      - num_samples
+  - name: gen_grounded_responses
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/grounded_responses.yaml
+      output_cols:
+        - response
+  - name: evaluate_grounded_qa_pair
+    type: LLMBlock
+    config:
+      config_path: ../../configs/skills/evaluate_grounded_pair.yaml
+      output_cols:
+        - evaluation
+        - score
+  - name: filter_grounded_qa_pair
+    type: FilterByValueBlock
+    config:
+      filter_column: score
+      filter_value: 2.0
+      operation: ge
+      convert_dtype: float
+  - name: combine_question_and_context
+    type: CombineColumnsBlock
+    config:
+      columns:
+        - context
+        - question
+      output_col: question
diff --git a/src/instructlab/sdg/pipelines/llama/knowledge.yaml b/src/instructlab/sdg/pipelines/llama/knowledge.yaml
@@ -0,0 +1,169 @@
+version: "1.0"
+blocks:
+  - name: duplicate_document_col
+    type: DuplicateColumnsBlock
+    config:
+      columns_map:
+        document: base_document
+
+  - name: gen_detailed_summary
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/detailed_summary.yaml
+      output_cols:
+        - summary_detailed
+      gen_kwargs:
+        max_tokens: 2048
+
+  - name: gen_atomic_facts
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/atomic_facts.yaml
+      output_cols:
+        - summary_atomic_facts
+      gen_kwargs:
+        max_tokens: 2048
+
+  - name: gen_extractive_summary
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/extractive_summary.yaml
+      output_cols:
+        - summary_extractive
+      gen_kwargs:
+        max_tokens: 2048
+
+  - name: flatten_summary_columns
+    type: FlattenColumnsBlock
+    config:
+      var_cols:
+        - summary_detailed
+        - summary_extractive
+        - summary_atomic_facts
+        - base_document
+      value_name: summary
+      var_name: dataset_type
+
+  - name: rename_to_document_column
+    type: RenameColumnsBlock
+    config:
+      columns_map:
+        document: raw_document
+        summary: document
+
+  - name: knowledge generation
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/generate_questions_responses.yaml
+      output_cols:
+        - question
+        - response
+      batch_kwargs:
+        batched: true
+      parser_kwargs:
+        parser_name: custom
+        parsing_pattern: '\[(?:Question|QUESTION)\]\s*(.*?)\s*\[(?:Answer|ANSWER)\]\s*(.*?)\s*(?=\[(?:Question|QUESTION)\]|$)'
+        parser_cleanup_tags:
+          - "[END]"
+          - "[End]"
+      gen_kwargs:
+        max_tokens: 4096
+
+  - name: eval_faithfulness_qa_pair
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/evaluate_faithfulness.yaml
+      output_cols:
+        - explanation
+        - judgment
+      gen_kwargs:
+        max_tokens: 512
+
+  - name: filter_faithfulness
+    type: FilterByValueBlock
+    config:
+      filter_column: judgment
+      filter_value: "YES"
+      operation: eq
+    drop_columns:
+      - judgment
+      - explanation
+
+  - name: eval_relevancy_qa_pair
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/evaluate_relevancy.yaml
+      output_cols:
+        - feedback
+        - score
+      gen_kwargs:
+        max_tokens: 512
+
+  - name: filter_relevancy
+    type: FilterByValueBlock
+    config:
+      filter_column: score
+      filter_value: 2.0
+      operation: eq
+      convert_dtype: float
+    drop_columns:
+      - feedback
+      - score
+
+  - name: eval_verify_question
+    type: LLMBlock
+    config:
+      config_path: ../../configs/knowledge/evaluate_question.yaml
+      output_cols:
+        - explanation
+        - rating
+      gen_kwargs:
+        max_tokens: 512
+
+  - name: filter_verify_question
+    type: FilterByValueBlock
+    config:
+      filter_column: rating
+      filter_value: 1.0
+      operation: eq
+      convert_dtype: float
+    drop_columns:
+      - explanation
+      - rating
+      - __index_level_0__
+
+datamixing:
+  auxiliary_instructions:
+    summary_detailed:
+      - Provide me with a comprehensive summary of the given document.
+      - Prepare a detailed breakdown of the contents of the document for me.
+      - Summarize the document thoroughly, covering all important points.
+      - Create a detailed executive summary of the provided document.
+      - Compose a comprehensive overview of the document's content.
+      - Deliver a detailed synopsis of the material presented in the document.
+      - Furnish me with a detailed analysis of the document's key points.
+      - Generate a thorough summary of the main ideas in the document.
+      - Offer a detailed digest of the information contained in the document.
+      - Supply me with a comprehensive rundown of the document's contents.
+    summary_extractive:
+      - Provide me with a summary of the document using extractive methods.
+      - Create an extractive summary for the given document.
+      - Generate an extractive summary from the document that was given to you.
+      - Summarize the document using extractive techniques.
+      - Create a summary of the provided document using extractive methods.
+      - Generate an extractive summary for the document provided.
+      - Using extractive techniques, summarize the given document.
+      - Create a summary of the document using extractive summarization.
+      - Generate an extractive summary of the document that was provided.
+      - Summarize the provided document using extractive summarization techniques.
+    summary_atomic_facts:
+      - Identify and list all atomic facts from the document.
+      - Extract all key facts from the given document.
+      - List all the important facts from the provided document.
+      - Highlight all the atomic facts present in the document.
+      - Identify and enumerate all key facts from the given text.
+      - List out all the critical information from the document.
+      - Highlight all the essential facts from the provided text.
+      - Identify and summarize all the important details from the document.
+      - Extract all the atomic facts from the given document.
+      - List all the key takeaways from the provided text.
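Reviewer note: the `parsing_pattern` in the `knowledge generation` block splits one model completion into several question and answer pairs, and `parser_cleanup_tags` removes stray `[END]` or `[End]` markers. The sketch below exercises that exact pattern with Python's `re` module. `parse_qa_pairs` is a hypothetical helper for illustration, and the use of `re.DOTALL` and the order of cleanup and stripping are assumptions, not a copy of the package's parser.

```python
import re

# The same pattern as in knowledge.yaml, split across lines for readability.
PARSING_PATTERN = (
    r"\[(?:Question|QUESTION)\]\s*(.*?)\s*"
    r"\[(?:Answer|ANSWER)\]\s*(.*?)\s*"
    r"(?=\[(?:Question|QUESTION)\]|$)"
)
CLEANUP_TAGS = ["[END]", "[End]"]


def parse_qa_pairs(generated: str) -> list[tuple[str, str]]:
    """Return (question, answer) tuples found in a model completion.

    Hypothetical helper: re.DOTALL is assumed so that answers may span lines.
    """
    pairs = []
    for question, answer in re.findall(PARSING_PATTERN, generated, flags=re.DOTALL):
        for tag in CLEANUP_TAGS:
            question = question.replace(tag, "")
            answer = answer.replace(tag, "")
        pairs.append((question.strip(), answer.strip()))
    return pairs


completion = """[Question]
How many national teams contested the tournament?
[Answer]
Ten national teams contested the tournament.
[QUESTION]
Who won the final?
[ANSWER]
Australia won the final
by six wickets.
[END]
"""

qa_pairs = parse_qa_pairs(completion)
```

The trailing lookahead `(?=\[(?:Question|QUESTION)\]|$)` ends each answer at the next question marker or at the end of the text without consuming the marker, so the following pair can still be matched. The last answer therefore absorbs the `[END]` marker, which is why the cleanup tags are needed.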