This repository presents code for WhatsApp Explorer. An instance of WhatsApp Explorer can be found here: https://whatsapp.whats-viral.me/
WhatsApp Explorer is an end-to-end data collection tool for WhatsApp. It is designed to collect data from WhatsApp groups and individual chats for research purposes. The tool manages the data collection process, including obtaining consent for donated chats, anonymizing media and messages, and storing the data. It also provides a dashboard for monitoring the data collection process. The tool is designed to be scalable and can collect data from multiple WhatsApp accounts simultaneously.
The tool is built using the following technologies:
- Frontend: React
- Backend: Node.js and Python
- Database: MongoDB
- Monitoring Dashboard: Streamlit (Python)
- Downloader: Python
To set up an instance of WhatsApp Explorer, researchers will need the following system requirements:
- A server with the following minimum specifications per WhatsApp account connected in parallel:
  - 1 GB RAM
  - 2 CPU cores
  - Hence, to connect 8 accounts simultaneously, we will need a server with at least 8 GB RAM and 16 CPU cores.
  - The storage requirements depend on the number of accounts we add and the activity of those accounts.
  - We tested the tool on an AWS EC2 instance with 16 GB RAM, 16 CPU cores and 2 TB of storage. The tool does not use much RAM, but multiple CPU cores help with parallel data downloads.
- Ports: You will need to open the following ports on your server:
  - 3000: For the frontend
  - 8000: For the backend
  - 8501: For the monitoring dashboard
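The per-account sizing rule above can be sketched as a small calculation. This is only an illustration of the stated minimums (1 GB RAM and 2 CPU cores per account connected in parallel); `minimum_server_spec` is a hypothetical helper, not part of WhatsApp Explorer, and it does not estimate storage.

```python
# Per-account minimums quoted in the system requirements.
RAM_GB_PER_ACCOUNT = 1
CPU_CORES_PER_ACCOUNT = 2


def minimum_server_spec(parallel_accounts: int) -> dict:
    """Return the minimum RAM and CPU cores for the given number of
    WhatsApp accounts connected in parallel (hypothetical helper)."""
    if parallel_accounts < 1:
        raise ValueError("parallel_accounts must be at least 1")
    return {
        "ram_gb": parallel_accounts * RAM_GB_PER_ACCOUNT,
        "cpu_cores": parallel_accounts * CPU_CORES_PER_ACCOUNT,
    }


if __name__ == "__main__":
    # 8 parallel accounts need at least 8 GB RAM and 16 CPU cores.
    print(minimum_server_spec(8))
```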
- Clone the repository:
git clone ...
- You will have to set up the config files of the project manually. The process is described below.
- Edit the file run.conf in the cloned directory.
DOWNLOADER_TIME="0 17 * * *"
BACKEND_FOLDER="./api/whatsappWebApi" # The path to the backend folder
FRONTED_FOLDER="./frontend/WhatsappMonitorFrontend" # The path to the frontend folder
DASHBOARD_FOLDER="./WMDash" # The path to the monitoring dashboard folder
DOWNLOADER_FOLDER="./downloadTool" # The path to the downloader folder
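The DOWNLOADER_TIME value is a cron expression. Assuming it follows the standard five-field cron layout (minute, hour, day of month, month, day of week), `0 17 * * *` would mean 17:00 every day in the scheduler's time zone. The sketch below only labels the fields so the value is easier to read; `split_cron` is a hypothetical helper and is not how the project itself schedules the downloader.

```python
# Field names of a standard five-field cron expression, in order.
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron(expression: str) -> dict:
    """Label each field of a five-field cron expression (hypothetical helper)."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


if __name__ == "__main__":
    fields = split_cron("0 17 * * *")
    for name, value in fields.items():
        print(f"{name}: {value}")
```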
- DOWNLOADER_TIME: The cron expression for the time at which the downloader will run. The downloader is responsible for backing up all data and shifting the media files to the file system for easy access.
- Open the file main.yaml in the downloader/config directory.
mailer:
  enabled: true
  email: <your_email>
services:
  frontend:
    name: wm-frontend
  backend:
    name: wm-backend
  mongodb:
    uri: mongodb://127.0.0.1:27017/whatsappLogs
data:
  path: /path/to/store/data
backup:
  duration: 5
- mailer: The configuration for the mailer service.
  - The enabled field determines whether the mailer service is enabled or not.
  - The email field is the email address to which the report will be sent.
- services: The configuration for the services.
  - The frontend and backend fields are the names of the frontend and backend services respectively. Do not change these.
  - The mongodb.uri field is the URI of the MongoDB database.
- data: The configuration for the data. The path field is the path to the directory where the data will be stored.
- backup: The configuration for the backup. The duration field is the number of days after which a backup will be created, e.g. 5 means a backup will be created every 5 days. (Note that for the most recent week, a backup is created every day.)
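Before starting the downloader, it can help to confirm that the configuration contains the settings described above. The following is a minimal sketch that checks an already-parsed configuration (a plain dictionary) for a few of those keys. `find_missing_keys` and the sample values are hypothetical, the check covers only a subset of the file, and the downloader itself may validate its configuration differently.

```python
# Settings described above, written as paths into the parsed configuration.
REQUIRED_KEYS = [
    ("mailer", "enabled"),
    ("mailer", "email"),
    ("data", "path"),
    ("backup", "duration"),
]


def find_missing_keys(config: dict) -> list:
    """Return dotted names of required settings absent from config."""
    missing = []
    for path in REQUIRED_KEYS:
        node = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


if __name__ == "__main__":
    # Placeholder values for illustration only.
    sample = {
        "mailer": {"enabled": True, "email": "researcher@example.com"},
        "data": {"path": "/path/to/store/data"},
        "backup": {"duration": 5},
    }
    print(find_missing_keys(sample))                 # nothing missing
    print(find_missing_keys({"mailer": {"enabled": True}}))
```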
- Open the file .env in the frontend directory.
PORT=3000
REACT_APP_API_URL=http://localhost:8000/
# Features
REACT_APP_INDIVIDUAL_USER=true
REACT_APP_DAILY_REPORT=true
REACT_APP_INDIVIDUAL_CHAT=false
- PORT: The port on which the frontend will run.
- REACT_APP_API_URL: The URL of the backend service.
- REACT_APP_INDIVIDUAL_USER: Set to true if you want to enable the individual user feature. This feature enables a user to add themselves as a participant in the survey.
- REACT_APP_DAILY_REPORT: Set to true if you want to enable the daily report feature. The daily report is accessible to the admin only.
- REACT_APP_INDIVIDUAL_CHAT: Set to true if you want to enable the individual chat feature. This feature enables a user to donate bilateral (one-to-one) chats along with the default group chats.
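To double-check which feature flags are switched on before starting the frontend, the .env file can be read as simple KEY=VALUE lines. The sketch below is an illustration only: `parse_env` and `enabled_features` are hypothetical helpers, the parser handles just the plain format shown above (no quoting or variable expansion), and it is not how the frontend itself reads these values.

```python
SAMPLE_ENV = """\
PORT=3000
REACT_APP_API_URL=http://localhost:8000/
# Features
REACT_APP_INDIVIDUAL_USER=true
REACT_APP_DAILY_REPORT=true
REACT_APP_INDIVIDUAL_CHAT=false
"""


def parse_env(text: str) -> dict:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines without an '=' sign
            values[key.strip()] = value.strip()
    return values


def enabled_features(env: dict) -> list:
    """Return the REACT_APP_* flags whose value is 'true', without the prefix."""
    prefix = "REACT_APP_"
    return sorted(
        key[len(prefix):]
        for key, value in env.items()
        if key.startswith(prefix) and value.lower() == "true"
    )


if __name__ == "__main__":
    print(enabled_features(parse_env(SAMPLE_ENV)))
```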
- Open the file prod.yml in the backend/config directory.
port: 8000
IS_HTTPS: true
mongodb:
  uri: mongodb://127.0.0.1:27017/whatsappLogs
allowed_origins:
  - https://www.whatsapp.whats-viral.me
  - https://whatsapp.whats-viral.me
  - http://whatsapp.whats-viral.me
autologger:
  cron: "0 2 * * *" # Daily at 02:00 in the scheduler's time zone
  parallel: 8 # Number of parallel instances to run
messages:
  limit: "Infinity" # Can be any number or "Infinity"
  daysOld: 60 # Maximum age, in days, of messages to store
  recent: 14 # Maximum age, in days, of messages categorized as recent
  status: false # Whether to store message status or not, currently facing bugs
  retries: 10 # Number of tries to get message status or reactions
timeouts:
  chat: 600000 # 10 minutes
  message: 300000 # 5 minutes
  contact: 300000 # 5 minutes
  media: 300000 # 5 minutes
  numMessages: 300000 # 5 minutes
  reactions: 300000 # 5 minutes
  messageStatus: 300000 # 5 minutes
  connection: 300000 # 5 minutes
audio:
  enabled: true
video:
  enabled: true
  maxDuration: "Infinity" # Or a number of seconds, e.g. 60
  forwardingScore: 0
- port: The port on which the backend will run.
- IS_HTTPS: Set to true if your server is accessible over HTTPS.
- mongodb.uri: The URI of the MongoDB database.
- allowed_origins: The list of allowed origins for CORS. This usually includes the frontend URL.
- autologger: The configuration for the autologger service.
  - The cron field is the cron expression for the time at which the autologger will run.
  - The parallel field is the number of accounts to connect in parallel while auto-logging.
- messages: The configuration for the messages service.
  - The limit field is the maximum number of messages to store.
  - The daysOld field is the maximum age, in days, of messages to store.
  - The recent field is the maximum age, in days, of messages to categorize as recent.
  - The status field is whether to store message status or not. Message status includes the read and delivered status of the message.
  - The retries field is the number of tries to get message status or reactions.
- timeouts: The configuration for the timeouts of different services, in milliseconds. This describes how long the application will wait for a service to respond before timing out.
- audio: The configuration for the audio service.
  - The enabled field determines whether we are downloading audio files or not.
- video: The configuration for the video service.
  - The enabled field determines whether we are downloading video files or not.
  - The maxDuration field is the maximum duration of a video we can download.
  - The forwardingScore field is the minimum number of times a video has to be forwarded to be stored in the database (such videos are considered viral).
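The messages and video settings above act as filters on what gets stored. The sketch below shows one way such filters could be applied, using the default values from the sample configuration. It is an illustration under assumptions: the function names are hypothetical, the boundaries are treated as inclusive, and the backend's actual logic may differ.

```python
import math


def parse_limit(value):
    """Turn a config value such as 60 or "Infinity" into a number."""
    return math.inf if value == "Infinity" else float(value)


def classify_message(age_days, days_old=60, recent=14):
    """Classify a message by age in days (inclusive boundaries assumed)."""
    if age_days > days_old:
        return "skipped"  # older than daysOld, so not stored
    if age_days <= recent:
        return "recent"   # stored and categorized as recent
    return "stored"       # stored, but not recent


def should_store_video(duration_seconds, forwarding_score,
                       max_duration="Infinity", min_forwarding_score=0):
    """Apply the maxDuration and forwardingScore thresholds to one video."""
    short_enough = duration_seconds <= parse_limit(max_duration)
    forwarded_enough = forwarding_score >= min_forwarding_score
    return short_enough and forwarded_enough


if __name__ == "__main__":
    for age in (3, 30, 90):
        print(age, classify_message(age))
    # With the sample defaults, every video passes both thresholds.
    print(should_store_video(120, 0))
    # A stricter, hypothetical setup: at most 60 seconds, forwarded at least once.
    print(should_store_video(120, 2, max_duration=60, min_forwarding_score=1))
```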
- Open the file run.conf in the backend directory.
GCLOUD_KEY_PATH=/path/to/gcloud-key.json/
- GCLOUD_KEY_PATH: The path to the Google Cloud key file. This is required for name anonymization using the Google DLP library. The file should be stored in the backend/keys directory.
- Open the file run.conf in the dashboard directory.
PORT=8501
BASE_URL=monitoring
- PORT: The port on which the monitoring dashboard will run.
- BASE_URL: The base URL of the monitoring dashboard.
- Open the file config.yml in the dashboard directory.
# Survey Data
survey: "path/to/backend/formResponse"
# WhatsApp Data
download_paths: [
'download/storage/path1',
'download/storage/path2',
'download/storage/path3'
]
- survey: The path to the survey responses collected during the data collection process. By default, these are in the formResponse folder in the backend directory.
- download_paths: The paths to the folders where the WhatsApp data is downloaded. Each one is a data path given to the downloader tool (the data.path setting in the downloader's config file), where all the media and chat data is stored.
- Run the script run.sh in the root directory of the project. This will start the frontend, backend and downloader services.
- Check that no errors occurred during the setup process. Logs are available at ./<current_date>.log
- The frontend will be accessible at http://localhost:3000/ by default.
- The default admin credentials are:
  - Username: admin
  - Password: 12345
- To add surveyors, go to the add surveyor page in the admin dashboard.
- Log in with the surveyor credentials to access the surveyor dashboard. A surveyor can then add participants to the survey.
- You can monitor the application data in the monitoring dashboard. The monitoring dashboard is accessible at http://<your_ip>:<port>/<base_url>, as set in the run.conf file in the dashboard directory.
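The monitoring dashboard address combines the server's IP address with the PORT and BASE_URL values from the dashboard's run.conf. A minimal sketch of that composition follows; `dashboard_url` is a hypothetical helper and the IP address is a documentation placeholder, not a real server.

```python
def dashboard_url(host: str, port: int, base_url: str) -> str:
    """Build http://<your_ip>:<port>/<base_url> from the run.conf values."""
    path = base_url.strip("/")  # tolerate leading or trailing slashes
    return f"http://{host}:{port}/{path}"


if __name__ == "__main__":
    # PORT=8501 and BASE_URL=monitoring, as in the sample run.conf.
    print(dashboard_url("203.0.113.10", 8501, "monitoring"))
```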
You can also find the code for the frontend, backend and pipelines required for building a dashboard to visualize the data. The dashboard provides an easy way to summarize the collected data and can be set up easily using the pipelines provided by WhatsApp Explorer. The code for the visualizer is in the data_visualizer folder. Please refer to the README in that folder for details.