A Retrieval-Augmented Generation (RAG) system that extracts, ingests, and serves relevant news content from Google News. Leveraging web scraping, text chunking, embedding, and a vector database, this project allows users to ask about any news topic, receive a summarized overview, and follow up with streaming answers.
- Features
- Architecture & Workflow
- Installation
- Usage
- Configuration
- Dependencies
- Project Structure
- Contributing
- License
## Features

- Automated News Extraction
  - Query Google News for a given topic
  - Scrape individual news articles using Beautiful Soup
- Document Processing & Chunking
  - Clean and normalize raw HTML content
  - Chunk documents using LangChain's `RecursiveCharacterTextSplitter`
- Dense Embedding & Vector Storage
  - Embed each text chunk with Sentence-Transformers
  - Store embeddings in a FAISS vector database for fast retrieval
- Retrieval & Summarization
  - On a user query, retrieve the top-k relevant chunks from FAISS
  - Summarize the retrieved chunks via the Groq LLaMA model (streaming output)
- Follow-Up Questions
  - Maintain context to answer follow-up queries about the same news topic
  - Stream responses as the model generates them
- Web Interface (Streamlit)
  - Simple, interactive UI to enter topics and view summaries
  - Real-time, streaming answer display
## Architecture & Workflow

- User Input
  - The user enters a news topic in the Streamlit UI.
- News Retrieval
  - Query Google News for top headlines/links matching the topic.
  - For each news link, scrape the article text using Beautiful Soup.
  - Store the raw text locally.
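A rough sketch of what this scraping step can look like with `requests` and Beautiful Soup; the helper name and the paragraph-only extraction are illustrative assumptions, not the project's actual code.

```python
# Illustrative scraping helper (assumed name, not the project's actual API).
import requests
from bs4 import BeautifulSoup

def fetch_article_text(url: str) -> str:
    """Download a news page and return its visible paragraph text."""
    response = requests.get(url, timeout=10, headers={"User-Agent": "Mozilla/5.0"})
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Keep only paragraph text; scripts, navigation, and other markup are ignored.
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return "\n".join(p for p in paragraphs if p)
```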
- Text Chunking
  - Use LangChain's `RecursiveCharacterTextSplitter` to split each article into manageable chunks.
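For example, the chunking step might look like the following; the chunk size and overlap values are assumptions, not the project's actual settings.

```python
# Illustrative chunking step; size/overlap values are assumptions.
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # max characters per chunk
    chunk_overlap=100,  # overlap to preserve context across chunk boundaries
)
chunks = splitter.split_text(article_text)  # article_text comes from the scraping step
```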
- Embedding & Indexing
  - For each text chunk, compute a dense embedding via a Sentence-Transformer model (e.g., `all-MiniLM-L6-v2`).
  - Insert the embeddings into a FAISS index.
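A minimal sketch of this step with `sentence-transformers` and raw FAISS; the flat inner-product index is an assumption about how the vectors are stored.

```python
# Illustrative embedding + indexing step (index type is an assumption).
import faiss
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(chunks, convert_to_numpy=True, normalize_embeddings=True)

index = faiss.IndexFlatIP(embeddings.shape[1])  # inner product == cosine on normalized vectors
index.add(embeddings)
```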
- Querying & Summarization
  - When the user asks “Summarize the latest news on [topic]”:
    - Embed the user question.
    - Retrieve the top-k closest chunks from FAISS.
    - Concatenate/re-rank the retrieved chunks as needed.
    - Stream the summary from Groq LLaMA (prompted via LangChain).
  - For follow-up questions:
    - The context window includes the system prompt plus the retrieved chunks.
    - Generate a streaming reply via the same Groq LLaMA pipeline.
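The sketch below shows one way to wire retrieval to a streaming Groq completion. It calls the `groq` client directly for brevity, whereas the project itself prompts via LangChain; the model name, prompt wording, and `answer` helper are assumptions.

```python
# Illustrative retrieval + streaming answer step (model name and prompts are assumptions).
import os
from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

def answer(question: str, k: int = 5):
    """Yield the model's reply piece by piece for the given question."""
    # Embed the question and pull the k nearest chunks from the FAISS index built above.
    query_vec = embedder.encode([question], convert_to_numpy=True, normalize_embeddings=True)
    _, ids = index.search(query_vec, k)
    context = "\n\n".join(chunks[i] for i in ids[0])

    stream = client.chat.completions.create(
        model="llama-3.1-8b-instant",  # assumed model name
        messages=[
            {"role": "system", "content": "Answer using only the provided news excerpts."},
            {"role": "user", "content": f"Excerpts:\n{context}\n\nQuestion: {question}"},
        ],
        stream=True,
    )
    for part in stream:
        yield part.choices[0].delta.content or ""
```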
- Streamlit Frontend
  - Displays:
    - A text input for “Enter a news topic”
    - A “Submit” button to trigger the RAG pipeline
    - A live, streaming text area to show the LLaMA-generated summary/answer
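Tying the steps together, a minimal Streamlit front end along these lines could drive the pipeline; the widget labels are illustrative and `answer` is the hypothetical helper from the previous sketch.

```python
# Minimal Streamlit front end (labels and helper names are illustrative).
import streamlit as st

st.title("RAG News Extractor")
topic = st.text_input("Enter a news topic")

if st.button("Submit") and topic:
    # st.write_stream consumes the generator and renders text as it arrives.
    st.write_stream(answer(f"Summarize the latest news on {topic}"))

follow_up = st.text_input("Ask a follow-up question")
if follow_up:
    st.write_stream(answer(follow_up))
```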
## Installation

- Clone the repository

  ```bash
  git clone https://github.com/arjunravi26/rag_news_extractor.git
  cd rag_news_extractor
  ```
- Create a Python virtual environment

  ```bash
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows: .venv\Scripts\activate
  ```
- Install required packages

  ```bash
  pip install --upgrade pip
  pip install -r requirements.txt
  ```
- Set up environment variables (if applicable)
  - If your project requires API keys (e.g., a custom Google News API key or Groq LLaMA credentials), create a `.env` file in the root directory:

    ```
    GROQ_API_KEY=your_groq_api_key_here
    ```

  - Ensure that `.env` is listed in `.gitignore` to avoid committing secrets.
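If the project loads the key with `python-dotenv` (an assumption; it is not listed among the core libraries), the startup code would look roughly like this:

```python
# Illustrative .env loading (assumes python-dotenv is installed).
import os
from dotenv import load_dotenv

load_dotenv()  # reads GROQ_API_KEY from the .env file in the project root
groq_api_key = os.getenv("GROQ_API_KEY")
```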
- Initialize FAISS Index
  - If starting fresh, the first run of the Streamlit app will build the FAISS database from scratch.
## Usage

- Run the Streamlit App

  ```bash
  streamlit run app.py
  ```

  - A browser window/tab will open at `http://localhost:8501/`.
  - Enter a news topic (e.g., “Artificial Intelligence”) and click Submit.
- Interact with the System
  - The app displays a streaming summary of the latest news related to your topic.
  - After the initial summary, you can enter follow-up questions in a text box (e.g., “What are the main challenges?”).
  - The answer streams in real time, leveraging the existing context window plus the retrieved chunks.
## Dependencies

- Core Libraries
  - Beautiful Soup 4 — Web scraping
  - LangChain — Text splitting & prompt templates
  - Sentence Transformers — Dense text embeddings
  - FAISS — Vector database
  - Groq LLaMA — LLM for summarization & Q&A
  - Streamlit — Web interface
- Python Version
  - Tested with Python 3.9+. Higher versions may work, but check compatibility with FAISS and Sentence-Transformers.

Install everything via:

```bash
pip install -r requirements.txt
```
## Contributing

Contributions are welcome! If you’d like to:
- Report a Bug
  - Open an issue and provide a clear description of the problem and steps to reproduce.
- Request a Feature
  - Open an issue labeled “enhancement” explaining the feature goal and use case.
- Submit a Pull Request
  - Fork the repository.
  - Create a new branch:

    ```bash
    git checkout -b feature/your-feature-name
    ```

  - Make sure to update tests (if any) and add documentation if your change affects the user interface or CLI.
  - Run a quick formatting check (e.g., `flake8` or `black`).
  - Submit a pull request against the `main` branch.
Thank you for helping improve this project!
## License

This project is licensed under the MIT License. See the LICENSE file for details.
Built by arjunravi26