A Retrieval-Augmented Generation (RAG) system that extracts, ingests, and serves relevant news content from Google News. Leveraging web scraping, text chunking, embedding, and a vector database, this project allows users to ask about any news topic, receive a summarized overview, and follow up with streaming answers.
- Features
- Architecture & Workflow
- Installation
- Usage
- Configuration
- Dependencies
- Project Structure
- Contributing
- License
## Features

- Automated News Extraction
  - Query Google News for a given topic
  - Scrape individual news articles using Beautiful Soup
- Document Processing & Chunking
  - Clean and normalize raw HTML content
  - Chunk documents using LangChain's `RecursiveCharacterTextSplitter`
- Dense Embedding & Vector Storage
  - Embed each text chunk with Sentence-Transformers
  - Store embeddings in a FAISS vector database for fast retrieval
- Retrieval & Summarization
  - On a user query, retrieve the top-k relevant chunks from FAISS
  - Summarize the retrieved chunks via the Groq LLaMA model (streaming output)
- Follow-Up Questions
  - Maintain context to answer follow-up queries about the same news topic
  - Stream responses as the model generates them
- Web Interface (Streamlit)
  - Simple, interactive UI to enter topics and view summaries
  - Real-time, streaming answer display
## Architecture & Workflow

- User Input
  - The user enters a news topic in the Streamlit UI.
- News Retrieval
  - Query Google News for top headlines/links matching the topic.
  - For each news link, scrape the article text using Beautiful Soup.
  - Store the raw text locally.
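A rough sketch of what this scraping step can look like with `requests` and Beautiful Soup; the helper name and the paragraph-only extraction are illustrative assumptions, not the project's actual code.

```python
# Illustrative scraping helper (assumed name, not the project's actual API).
import requests
from bs4 import BeautifulSoup

def fetch_article_text(url: str) -> str:
    """Download a news page and return its visible paragraph text."""
    response = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep only paragraph text; scripts, navigation, and other markup are ignored.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)
```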
- Text Chunking
  - Use LangChain's `RecursiveCharacterTextSplitter` to split each article into manageable chunks.
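For example, the chunking step might look like the following; the chunk size and overlap values are assumptions, not the project's actual settings.

```python
# Illustrative chunking step; size/overlap values are assumptions.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # overlap to preserve context across chunk boundaries
)
chunks = splitter.split_text(article_text)  # article_text comes from the scraping step
```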
- Embedding & Indexing
  - For each text chunk, compute a dense embedding via a Sentence-Transformer model (e.g., `all-MiniLM-L6-v2`).
  - Insert the embeddings into a FAISS index.
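A minimal sketch of this step with `sentence-transformers` and raw FAISS; the flat inner-product index is an assumption about how the vectors are stored.

```python
# Illustrative embedding + indexing step (index type is an assumption).
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)
```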
- Querying & Summarization
  - When the user asks “Summarize the latest news on [topic]”:
    - Embed the user question.
    - Retrieve the top-k closest chunks from FAISS.
    - Concatenate/re-rank the retrieved chunks as needed.
    - Stream the summary from Groq LLaMA (prompted via LangChain).
  - For follow-up questions:
    - The context window includes the system prompt plus the retrieved chunks.
    - Generate a streaming reply via the same Groq LLaMA pipeline.
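The sketch below shows one way to wire retrieval to a streaming Groq completion. It calls the `groq` client directly for brevity, whereas the project itself prompts via LangChain; the model name, prompt wording, and `answer` helper are assumptions.

```python
# Illustrative retrieval + streaming answer step (model name and prompts are assumptions).
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer(question: str, k: int = 5):
    """Yield the model's reply piece by piece for the given question."""
    # Embed the question and pull the k nearest chunks from the FAISS index built above.
    query_vec = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
    _, ids = index.search(query_vec, k)
    context = "\n\n".join(chunks[i] for i in ids[0])

    stream = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model name
        messages=[
            {"role": "system", "content": "Answer using only the provided news excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    for part in stream:
        yield part.choices[0].delta.content or ""
```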
- Streamlit Frontend
  - Displays:
    - A text input for “Enter a news topic”
    - A “Submit” button to trigger the RAG pipeline
    - A live, streaming text area to show the LLaMA-generated summary/answer
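Tying the steps together, a minimal Streamlit front end along these lines could drive the pipeline; the widget labels are illustrative and `answer` is the hypothetical helper from the previous sketch.

```python
# Minimal Streamlit front end (labels and helper names are illustrative).
import streamlit as st

st.title("RAG News Extractor")
topic = st.text_input("Enter a news topic")

if st.button("Submit") and topic:
    # st.write_stream consumes the generator and renders text as it arrives.
    st.write_stream(answer(f"Summarize the latest news on {topic}"))

follow_up = st.text_input("Ask a follow-up question")
if follow_up:
    st.write_stream(answer(follow_up))
```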
## Installation

- Clone the repository

  ```bash
  git clone https://github.com/arjunravi26/rag_news_extractor.git
  cd rag_news_extractor
  ```
- Create a Python virtual environment

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install required packages

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Set up environment variables (if applicable)
  - If your project requires API keys (e.g., a custom Google News API key or Groq LLaMA credentials), create a `.env` file in the root directory:

    ```
    GROQ_API_KEY=your_groq_api_key_here
    ```

  - Ensure that `.env` is listed in `.gitignore` to avoid committing secrets.
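If the project loads the key with `python-dotenv` (an assumption; it is not listed among the core libraries), the startup code would look roughly like this:

```python
# Illustrative .env loading (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads GROQ_API_KEY from the .env file in the project root
groq_api_key = os.getenv("GROQ_API_KEY")
```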
- Initialize FAISS Index
  - If starting fresh, the first run of the Streamlit app will build the FAISS database from scratch.
## Usage

- Run the Streamlit App

  ```bash
  streamlit run app.py
  ```

  - A browser window/tab will open at `http://localhost:8501/`.
  - Enter a news topic (e.g., “Artificial Intelligence”) and click Submit.
- Interact with the System
  - The app displays a streaming summary of the latest news related to your topic.
  - After the initial summary, you can enter follow-up questions in a text box (e.g., “What are the main challenges?”).
  - The answer streams in real time, leveraging the existing context window plus the retrieved chunks.
## Dependencies

- Core Libraries
  - Beautiful Soup 4 — Web scraping
  - LangChain — Text splitting & prompt templates
  - Sentence Transformers — Dense text embeddings
  - FAISS — Vector database
  - Groq LLaMA — LLM for summarization & Q&A
  - Streamlit — Web interface
- Python Version
  - Tested with Python 3.9+. Higher versions may work, but check compatibility with FAISS and Sentence-Transformers.

Install everything via:

```bash
pip install -r requirements.txt
```
## Contributing

Contributions are welcome! If you’d like to:
- Report a Bug
  - Open an issue and provide a clear description of the problem and steps to reproduce.
- Request a Feature
  - Open an issue labeled “enhancement” explaining the feature goal and use case.
- Submit a Pull Request
  - Fork the repository.
  - Create a new branch:

    ```bash
    git checkout -b feature/your-feature-name
    ```

  - Make sure to update tests (if any) and add documentation if your change affects the user interface or CLI.
  - Run a quick formatting check (e.g., `flake8` or `black`).
  - Submit a pull request against the `main` branch.
Thank you for helping improve this project!
## License

This project is licensed under the MIT License. See the LICENSE file for details.
Built by arjunravi26