Taraana30 is a web scraper that computes the weekly Top 30 Bollywood songs from various platforms.

The scraper pulls the chart data from each provider's list/playlist and saves it in the data folder: ./data/<date of previous week's saturday>/ (more on <date of previous week's saturday> below).

The scraper provides two main commands:
- To get just top_30.csv and candidates.csv:

  ```sh
  $ python main.py
  ```

  Run the above command from the root of the folder to produce top_30.csv and candidates.csv in the data folder: ./data/<date of previous week's saturday>/
- To get all .csv files from the scraper:

  ```sh
  $ python main.py --all
  ```

  Run the above command from the root of the folder to produce:

  - candidates.csv
  - top_30.csv
  - gaana.csv
  - hungama.csv
  - mirchi.csv
  - saavn.csv
  - wynk.csv

  in the ./data/<date of previous week's saturday>/ folder, creating the folder first if it does not already exist.
<date of previous week's saturday>: The scraper's folder naming scheme uses the date of the previous Saturday to distinguish between two weeks. The date is in the format DD-MM-YYYY. (The week changes on Sunday 00:00:00, i.e., even if the script is run on a Saturday, the folder name will be the date of the previous Saturday.)
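As an illustration, here is a minimal sketch of how that date could be computed; the helper name previous_saturday_folder is hypothetical and not part of the scraper's actual code:

```python
from datetime import date, timedelta

def previous_saturday_folder(today=None):
    """Name of the week's data folder, e.g. "22-06-2019"."""
    today = today or date.today()
    days_back = (today.weekday() - 5) % 7  # Monday=0 ... Saturday=5, Sunday=6
    if days_back == 0:   # the run is on a Saturday: the week has not rolled
        days_back = 7    # over yet, so go one full week back
    return (today - timedelta(days=days_back)).strftime("%d-%m-%Y")
```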
Example:

```
./
└── data
    ├── 22-06-2019
    │   ├── top_30.csv
    │   └── candidates.csv
    └── 29-06-2019
        ├── top_30.csv
        ├── candidates.csv
        ├── gaana.csv
        ├── hungama.csv
        ├── mirchi.csv
        ├── saavn.csv
        └── wynk.csv
```

Currently the tool is only available as a GitHub repository and can only be used from there:
- Fork and clone the repository to your machine.

- Start the pipenv shell:

  ```sh
  pipenv shell
  ```

- Run

  ```sh
  pipenv install --ignore-pipfile
  ```

  to install all dependencies to your machine. The main dependencies are:

  - beautifulsoup4
  - requests
  - lxml
- Language: Python v3.7.3
- Scraping Module: BeautifulSoup v4.7.1 with the lxml parser
- I/O request Module: requests v2.22.0
- Misc: This project is a collection of 5 scrapers, one for each platform:

  - radiomirchi.com
  - gaana.com
  - hungama.com
  - jiosaavn.com
  - wynk.in
To speed up the scraping process, particularly the delays in the I/O requests that fetch the page source from the various platforms, multi-threading is used: 1 thread per scraper (i.e., a thread pool of 5 threads).
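The sketch below shows the general pattern without assuming anything about the scraper's internals: the URLs here are the platform homepages standing in for the real chart pages, and fetch() is an illustrative helper, not a function from this project.

```python
from concurrent.futures import ThreadPoolExecutor

import requests

# Homepage URLs stand in for the actual chart pages the scraper targets.
PLATFORMS = {
    "mirchi":  "https://www.radiomirchi.com",
    "gaana":   "https://gaana.com",
    "hungama": "https://www.hungama.com",
    "saavn":   "https://www.jiosaavn.com",
    "wynk":    "https://wynk.in",
}

def fetch(item):
    name, url = item
    # Each thread performs one blocking GET, so the five round-trips overlap.
    return name, requests.get(url, timeout=30).text

with ThreadPoolExecutor(max_workers=5) as pool:  # 1 thread per scraper
    pages = dict(pool.map(fetch, PLATFORMS.items()))
```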
Note: The data/22-06-2019 folder in the repository is just an example/sample folder, included so you can see the output of the script. No .csv files should be committed alongside the scraper, so *.csv is listed in .gitignore. If you want to disable this feature, remove *.csv from the .gitignore file.
When using Taraana30 as a package, import the main module and call the taraana30() function on it:

```python
from Taraana import main

# This function will just write the .csv files and will not return anything
main.taraana30()
```

The taraana30() function takes 1 optional argument, all_files, which tells it how many files to create:

- all_files=True: the same as passing the --all argument to the script from the terminal
- all_files=False: (default) the same as running the script without any argument
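For example, to get the same output as `python main.py --all` from Python:

```python
from Taraana import main

# Writes all seven .csv files, exactly like `python main.py --all`
main.taraana30(all_files=True)
```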
Note: The taraana30() function creates the ./data/<date of previous week's saturday>/ folder relative to wherever it is called from: it uses os.getcwd() to find the current working directory and makes the ./data/<date of previous week's saturday>/ directory inside it.
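In other words, the output path is resolved roughly like this (a sketch reusing the hypothetical previous_saturday_folder() helper from above):

```python
import os

# Resolved against the caller's working directory, not the package location.
out_dir = os.path.join(os.getcwd(), "data", previous_saturday_folder())
os.makedirs(out_dir, exist_ok=True)  # create ./data/<date>/ if it is missing
```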