
Tech News

About The Project

This is the back-end of a tech news website. It features a web crawler that scrapes news from the zoomit.ir website using Scrapy and Selenium, schedules daily crawls with Celery and Celery Beat, and is containerized with Docker and Docker Compose for easy deployment.

(back to top)

Built With

  • Python

  • Django

  • DjangoREST

  • Postgres

  • Swagger

  • Selenium

  • Scrapy

  • Docker

  • Celery

  • Redis

(back to top)

Getting Started

Prerequisites

  • Python

  • PostgreSQL

Installation

  1. Clone the repo

    git clone https://github.com/farzanmosayyebi/TechNews
  2. Navigate to the src directory

    cd src
  3. Install the requirements

    pip install -r requirements.txt
  4. Apply migrations

    Note: First, you will need to create the PostgreSQL database and set the environment variables in a file named .env in the src directory, using the following format (a settings sketch that reads these variables follows the installation steps):

    TechNews/src/.env:

    SECRET_KEY=your-secret-key
    DB_NAME=your-db-name
    DB_USER=your-db-user
    DB_PASSWORD=your-db-password
    DB_HOST=your-db-host
    DB_PORT=your-db-port
    

    Then run

    python manage.py migrate
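
    For reference, here is a minimal sketch of how settings.py might consume these variables, assuming the project loads src/.env with python-dotenv; the actual settings code in this repository may differ:

    # src/<project>/settings.py  (sketch; assumes python-dotenv is installed)
    import os
    from pathlib import Path
    from dotenv import load_dotenv

    BASE_DIR = Path(__file__).resolve().parent.parent
    load_dotenv(BASE_DIR / ".env")  # reads SECRET_KEY and DB_* from src/.env

    SECRET_KEY = os.environ["SECRET_KEY"]

    DATABASES = {
        "default": {
            "ENGINE": "django.db.backends.postgresql",
            "NAME": os.environ["DB_NAME"],
            "USER": os.environ["DB_USER"],
            "PASSWORD": os.environ["DB_PASSWORD"],
            "HOST": os.environ["DB_HOST"],
            "PORT": os.environ["DB_PORT"],
        }
    }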

Usage

  • To start the project, run the following in the src directory:

    python manage.py runserver
  • The Swagger UI is then available at:

    http://127.0.0.1:8000/swagger/
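
  • If the Swagger UI is served with drf-yasg (an assumption; the project may use a different schema package), the /swagger/ route is typically wired up like the sketch below:

    # src/<project>/urls.py  (sketch; assumes drf-yasg provides the Swagger UI)
    from django.urls import path
    from drf_yasg import openapi
    from drf_yasg.views import get_schema_view
    from rest_framework import permissions

    schema_view = get_schema_view(
        openapi.Info(title="Tech News API", default_version="v1"),
        public=True,
        permission_classes=[permissions.AllowAny],
    )

    urlpatterns = [
        # ... plus the project's API routes
        path("swagger/", schema_view.with_ui("swagger", cache_timeout=0)),
    ]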
    

Running the tests

  • In the src directory, run

    • Windows
    python manage.py test ..\tests
    • Linux/MacOS
    python manage.py test ../tests

Running the crawler

  • In the src directory, run

    python manage.py crawl --limit <number-of-items-to-scrape>
    • This is a custom Django management command that crawls the specified number of items from the zoomit.ir website; the default is 500 (a sketch of such a command follows the example below).

    Example

    • To crawl 50 items
    python manage.py crawl --limit 50
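
    • For context, a minimal sketch of what such a management command can look like; the app path and the spider hook are illustrative assumptions, not this repository's actual source:

    # src/news/management/commands/crawl.py  (illustrative sketch)
    from django.core.management.base import BaseCommand


    class Command(BaseCommand):
        help = "Crawl news items from zoomit.ir"

        def add_arguments(self, parser):
            # --limit controls how many items are scraped; defaults to 500 as documented above
            parser.add_argument("--limit", type=int, default=500)

        def handle(self, *args, **options):
            limit = options["limit"]
            self.stdout.write(f"Crawling {limit} items from zoomit.ir ...")
            # here the Scrapy/Selenium spider would be started with the given limit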

(back to top)

Running with Docker Compose

  1. In the root directory of the project, run:

    docker compose --env-file app.env up

    Note: You need to provide a file named app.env containing the environment variables for the project; it is passed to Compose with the --env-file flag as shown above (a sample app.env follows the Dockerfile notes below).

    About the Dockerfiles:

    • Two Dockerfiles are provided:
      • Dockerfile.base: the base image that only installs the Python dependencies. The backend, celery-beat, and celery-flower containers run on the image built from this file.
      • Dockerfile.worker: additionally installs Google Chrome and the packages needed to run Selenium inside the Celery workers. The celery-worker container runs on the image built from this file.
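
    As a starting point, app.env can mirror the variables shown for src/.env; the values below are placeholders, not this project's actual configuration, and DB_HOST/DB_PORT should point at the database service defined in docker-compose.yml:

    SECRET_KEY=your-secret-key
    DB_NAME=your-db-name
    DB_USER=your-db-user
    DB_PASSWORD=your-db-password
    # DB_HOST is typically the Postgres service name from docker-compose.yml
    DB_HOST=your-db-host
    DB_PORT=5432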

(back to top)

Crawler schedule

At startup, 500 news items are crawled from zoomit.ir. After that, Celery Beat pushes a crawl task to the message queue daily at midnight, so every day at midnight 60 more news items are crawled from zoomit.ir.
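
A minimal sketch of what such a Celery Beat schedule can look like; the task path, broker URL, and module layout are illustrative assumptions rather than this project's actual code:

    # celery.py  (sketch of a daily-midnight beat schedule)
    from celery import Celery
    from celery.schedules import crontab

    app = Celery("technews", broker="redis://localhost:6379/0")

    app.conf.beat_schedule = {
        "daily-crawl": {
            "task": "news.tasks.crawl_zoomit",      # hypothetical task name
            "schedule": crontab(hour=0, minute=0),  # every day at midnight
            "kwargs": {"limit": 60},                # 60 items per run, per the schedule above
        },
    }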

License

  • Distributed under the MIT License. See LICENSE for more information.

(back to top)
