Sample Datasets for Data and Text Analysis

InfraNodus is an AI-powered text network analysis tool. You can use it to reveal patterns in text data.

Here we provide some of the sample datasets you can use to try out various workflows.

Keyword Stats Datasets

This folder contains data on Google Search volumes for various keywords. Usually you would analyze the column with the keyword combinations to find recurring patterns and use the metadata in the other columns for filtering (e.g. search volume, difficulty, location, etc.).

Keyword Stats / google_us_ai-tools_matching-terms_2025-07-08_20-05-24.csv: Google Search volumes for keywords related to "ai tools". Use the matching keyword workflow to see how to analyze this file step by step.
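As a rough illustration of the idea above (finding recurring patterns in the keyword column while filtering on a metadata column), here is a minimal Python sketch. The column names "Keyword" and "Volume" and the sample rows are assumptions made up for this example; the actual export may use different headers, so check the file first.

```python
import csv
import io
from collections import Counter


def recurring_terms(csv_text, keyword_col="Keyword", volume_col="Volume", min_volume=0):
    """Count how often each word occurs across keyword phrases,
    keeping only rows whose search volume is at least min_volume."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(csv_text)):
        try:
            volume = int(row[volume_col])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows with a missing or non-numeric volume
        if volume >= min_volume:
            counts.update((row.get(keyword_col) or "").lower().split())
    return counts


# Made-up sample rows; column names are hypothetical.
sample = """Keyword,Volume
best ai tools,5000
free ai tools,3000
ai tools for writing,800
"""

counts = recurring_terms(sample, min_volume=1000)
print(counts.most_common(3))
```

To try it on a real file, read the file contents into a string and pass the actual header names as keyword_col and volume_col.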

Open-Ended Survey Datasets

This folder contains samples of open-ended surveys. Usually one or more columns contain the responses while the other columns contain metadata about the survey participants. This metadata can be used for filtering: for example, to see what people from a certain location or background said about a particular topic, or what their sentiment was.

Open Ended Surveys / OSMI-2019-Mental-Health-Tech-Modified.csv: Open Sourcing Mental Illness (OSMI) 2019 Mental Health Tech Survey. Use the open-ended survey workflow to see how to analyze this file step by step.
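The metadata filtering described above can be sketched in a few lines of Python. This is only an illustration: the column names ("country", "role", "comment") and the sample rows are hypothetical and do not come from the OSMI file.

```python
import csv
import io


def responses_where(csv_text, response_col, filters):
    """Return non-empty responses from rows whose metadata matches every
    column/value pair in the filters dict."""
    selected = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if all(row.get(col) == value for col, value in filters.items()):
            text = (row.get(response_col) or "").strip()
            if text:
                selected.append(text)
    return selected


# Made-up sample rows; column names are hypothetical.
sample = """country,role,comment
US,developer,Support from my manager helped a lot
UK,designer,I did not feel comfortable raising it
US,developer,
US,manager,We need clearer policies
"""

us_devs = responses_where(sample, "comment", {"country": "US", "role": "developer"})
print(us_devs)
```

The filtered responses can then be saved to a text or CSV file and uploaded for analysis.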

Listing Datasets

This folder contains samples of listings. Such listings often contain a column with the title and description of each listing, as well as several other columns with categories that can be used for filtering.

Listings / ec_europa_data.csv: European Commission Open Data Portal. Use the listings workflow to see how to analyze this file step by step.
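One way to prepare a listings file for analysis is to join the title and description of each listing into a single text and group the texts by a category column. The sketch below assumes hypothetical column names ("title", "description", "theme") and made-up rows; the real file may be laid out differently.

```python
import csv
import io
from collections import defaultdict


def texts_by_category(csv_text, title_col, description_col, category_col):
    """Join title and description per listing and group the texts by category."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        parts = [
            (row.get(title_col) or "").strip(),
            (row.get(description_col) or "").strip(),
        ]
        text = ". ".join(p for p in parts if p)
        if text:
            category = (row.get(category_col) or "").strip() or "uncategorized"
            groups[category].append(text)
    return dict(groups)


# Made-up sample rows; column names are hypothetical.
sample = """title,description,theme
Air quality 2020,Hourly measurements from urban stations,Environment
Rail network,Geometry of the main railway lines,Transport
River levels,,Environment
"""

groups = texts_by_category(sample, "title", "description", "theme")
for category, texts in groups.items():
    print(category, len(texts))
```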

Network Graphs Datasets

This folder contains network graph data in GEXF format. GEXF (Graph Exchange XML Format) is an XML-based format that encodes nodes, the relations (edges) between them, and related metadata.

Diseasosome / diseasosome-diseases.gexf: A network graph of diseases and their connections based on the “Human Disease Network” study, which contains information about the links between different diseases and their associated genes. To simplify, we’ve removed the information about the gene associations, keeping only the connections between the diseases. Two diseases are linked if there’s at least one gene mutation that is correlated with both of them. Use the network graph workflow to see how to analyze this file step by step.
Related Artists / related-artists.gexf: A network of related classic rock artists extracted from Spotify, provided by Ifeanyi Idiaye. You can see which artists are central to the field (because they are listened to with the most diverse set of artists) and which artists form clusters of interconnected communities.
C Elegans / celegans.gexf: C. elegans connectome of neurons. C. elegans is a relatively simple organism. Its adult hermaphrodite form has 302 neurons, and this network shows how those neurons are connected, which are the most central ones, and which form clusters.
Yeast / yeast.gexf: A yeast molecular interaction network that shows which proteins are more central, which form clusters, etc.
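Because GEXF is XML, a file like the ones above can be inspected with Python's standard library. The sketch below counts how many edges touch each node (degree), which is only one simple notion of centrality and not necessarily the measure InfraNodus uses. The tiny inline graph is made up for illustration, and the namespace is stripped from tag names so the code does not depend on a particular GEXF version.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the XML namespace from a tag name, e.g. '{ns}node' -> 'node'."""
    return tag.rsplit("}", 1)[-1]


def degree_by_label(gexf_text):
    """Return a Counter mapping node labels to the number of edges touching them.
    Nodes without edges are omitted; duplicate labels would be merged."""
    root = ET.fromstring(gexf_text)
    labels = {}
    degree = Counter()
    for el in root.iter():
        name = local(el.tag)
        if name == "node":
            labels[el.get("id")] = el.get("label", el.get("id"))
        elif name == "edge":
            degree[el.get("source")] += 1
            degree[el.get("target")] += 1
    return Counter({labels.get(node_id, node_id): d for node_id, d in degree.items()})


# Made-up four-node graph for illustration only.
sample = """
<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">
  <graph defaultedgetype="undirected">
    <nodes>
      <node id="a" label="Alpha"/>
      <node id="b" label="Beta"/>
      <node id="c" label="Gamma"/>
      <node id="d" label="Delta"/>
    </nodes>
    <edges>
      <edge id="0" source="a" target="b"/>
      <edge id="1" source="a" target="c"/>
      <edge id="2" source="a" target="d"/>
      <edge id="3" source="b" target="c"/>
    </edges>
  </graph>
</gexf>
"""

degree = degree_by_label(sample)
print(degree.most_common(2))
```

For a real file, read its contents as text (or adapt the function to use ET.parse on the file path) and pass it to degree_by_label.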

Also check out our separate archive of network analysis datasets.

Knowledge Graph Datasets

This folder contains knowledge graphs that show relations between different types of entities.

Knowledge Graphs / similar-sites.md: A text file that can be uploaded to InfraNodus to analyze similar sites in the SEO sphere.

Datasets Extracted from Databases

This folder contains extracts from various interesting databases, for example an extract of research paper titles and abstracts from arXiv up to 2025.

It also contains a Python script you can freely reuse (MIT license) to filter the long JSON files into shorter versions that InfraNodus can ingest (up to the 10 MB limit).

Arxiv Research Papers: a list of research papers on graphs extracted from https://www.kaggle.com/datasets/Cornell-University/arxiv. You can generate your own extract from the arXiv file with our Python script filter_graph_papers.py, which prompts you for the categories and keywords to look for in that file. By default the script looks for arxiv-metadata-oai-snapshot.json, the name of the file Cornell University provides in its Kaggle dataset archive; edit the script if you'd like to filter a file with a different name.
Visual Text Analysis Companies: a CSV file listing companies operating in the visual text analysis field, their USPs, strengths and weaknesses, as well as keywords related to their expertise. It can be used for competitive analysis as described in this tutorial: https://support.noduslabs.com/hc/en-us/articles/22905603668636-Competitive-Analysis-Mapping-How-to-Visualize-Expertise-Networks-and-Find-Strategic-Gaps
Trump Administration Personnel: a CSV file with information about the individuals who are part of President Trump's administration, listing their skills, background, affiliation, etc. It can be used for social network analysis as described in this tutorial: https://support.noduslabs.com/hc/en-us/articles/22947832720412-Beyond-Organizational-Skills-Matrix-Social-Expertise-Network-Analysis
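The category-and-keyword filtering described above for the arXiv extract can be sketched roughly as follows. This is not the repository's filter_graph_papers.py, only an illustration of the idea. It assumes the Kaggle snapshot stores one JSON object per line with "id", "title", "abstract", and a space-separated "categories" field; the function name, the byte budget parameter, and the sample records are made up for this example.

```python
import json


def filter_papers(lines, categories, keywords, max_bytes=10 * 1024 * 1024):
    """Return id/title/abstract records that match at least one category and
    at least one keyword, stopping before the serialized size exceeds max_bytes."""
    wanted = set(categories)
    keywords = [k.lower() for k in keywords]
    kept, used = [], 0
    for line in lines:
        line = line.strip()
        if not line:
            continue
        paper = json.loads(line)
        if not wanted & set((paper.get("categories") or "").split()):
            continue
        text = f"{paper.get('title', '')} {paper.get('abstract', '')}".lower()
        if not any(k in text for k in keywords):
            continue
        record = {key: paper.get(key) for key in ("id", "title", "abstract")}
        size = len(json.dumps(record).encode("utf-8")) + 1  # +1 for a newline
        if used + size > max_bytes:
            break
        kept.append(record)
        used += size
    return kept


# Made-up records for illustration only.
sample_lines = [
    json.dumps({"id": "0001", "title": "Community detection in graphs",
                "abstract": "We study clustering.", "categories": "cs.SI physics.soc-ph"}),
    json.dumps({"id": "0002", "title": "Deep image models",
                "abstract": "A vision paper.", "categories": "cs.CV"}),
    json.dumps({"id": "0003", "title": "Social media usage",
                "abstract": "A survey study.", "categories": "cs.SI"}),
]

papers = filter_papers(sample_lines, categories=["cs.SI"], keywords=["graph"])
print([p["id"] for p in papers])

# With the real snapshot, pass the open file object instead of sample_lines:
# with open("arxiv-metadata-oai-snapshot.json", encoding="utf-8") as f:
#     papers = filter_papers(f, ["cs.SI"], ["graph"])
```

Reading the file line by line keeps memory use low, which matters because the full snapshot is far larger than the extract.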

License

All datasets are provided as-is and are subject to the license of the original source.

Try them out with https://infranodus.com.

Use these examples with our InfraNodus tutorials: https://support.noduslabs.com
