This repository contains data sets used in the scuole project.
The instructions below guide you through retrieving and cleaning the requisite data files before you start the update process in scuole.
Data should be updated according to the following timeline:
- January after the latest TAPR data is released at the end of the previous year
- May (or late summer, in the case of 2025) after 8th Grade Cohort data is published
- At other regular intervals to ensure AskTed directory info is updated (though this should be automated)
You can update these data independently from each other, provided you follow the instructions in scuole in sequence.
As you proceed through these steps, please update the vintage of each dataset in the schools data catalog (see the columns with the Schools Explorer header). As of January 2025, most data in this repo was last updated mid-2023.
TEA provides district boundaries and campus coordinates on their open data site, which you can get to via the School District Locator page. The latest district boundaries are listed under "Current Districts" and campus coordinates under "Current Campuses". If you are updating for an earlier school year, go to "Archived Schools Data" instead and click the school year you are updating for.
Put the GeoJSON boundaries and coordinates in the respective folder: tapr/reference/district/shapes/ or tapr/reference/campus/shapes/.
District boundary files are updated approximately once every year, and should be downloaded as GeoJSONs. We don't display the actual shapes on the page because they're not accurate enough and may be misleading. They are useful for determining nearby districts and geolocating.
Campus coordinate files are also updated approximately once every year, and should be downloaded as GeoJSONs. Campus coordinates data can be a bit dated, and likely includes some "zombie schools" which have closed or otherwise have zero enrollment.
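Before dropping a downloaded file into one of the shapes folders, it can help to confirm that it really is GeoJSON and to see how many features it holds. The sketch below is only an illustration: the function name and the in-memory sample are made up, and it assumes the download is a standard GeoJSON FeatureCollection.

```python
import json


def summarize_geojson(text):
    """Parse GeoJSON text and return (feature_count, sorted geometry types)."""
    data = json.loads(text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = data.get("features", [])
    geometry_types = sorted(
        {f["geometry"]["type"] for f in features if f.get("geometry")}
    )
    return len(features), geometry_types


# Tiny made-up sample standing in for a downloaded campus file.
sample = json.dumps(
    {
        "type": "FeatureCollection",
        "features": [
            {
                "type": "Feature",
                "properties": {},
                "geometry": {"type": "Point", "coordinates": [-97.74, 30.27]},
            }
        ],
    }
)

print(summarize_geojson(sample))
```

For a real file, read its contents with `open(path).read()` and pass the text in. District boundaries would be expected to report polygon geometry types and campus coordinates point types, though that is an assumption about TEA's exports.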
Released: as information is updated, but most likely to be up-to-date by September 1
AskTED provides superintendents, principals and directory information for all schools and districts. The scuole repo downloads data from AskTED directly and updates them in our database, so there's no need to manually download and add them to scuole-data.
Instructions on what commands to run to update AskTED are in the scuole README.
As of 2023, the AskTED data is found on their site if you click on Download School and District File with Site Address. This ensures that we have school and district address for the site of the school and district office rather than the mailing addresses (which can be P.O. Boxes).
If the links above don't lead you to the correct data, then AskTED might have changed the link to the data. If so, the commands to update AskTED data in the scuole database will fail, so make sure the links are updated and the header names are correct.
Released: Historically in late November/early December, but was mid-February for SY2023-24 download
All stats are collected by the Texas Education Agency.
To download TAPR data, go to the Texas Academic Performance Report homepage and find the most recent release. Click the Data Download link and go to the Advanced TAPR Data (Numerators, Denominators & Rates) option.
This app requires sheets for College, Career, and Military Readiness (CCMR), TSIA, College Prep, AP/IB, SAT/ACT, Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates, Longitudinal Rate (4-Year, 5-Year, & 6-Year), Staff & Student Information for Campus, District, Region and State. The Reference Information, Accountability Rating and Special Education Determination Status sheet is only required for Campus, District and Region.
For districts and campuses, we need to download one extra file that contains the full A-F rankings. Go back one page and select 20xx Accountability instead. (The year will change depending on what year you're working on!) Download the Accountability Summary for Campus and Districts.
For school year 2022-23, accountability ratings were not included in the TAPR dataset due to a pending lawsuit. That data was subsequently released and is separately available. To acquire this data, I followed these steps:
- From TEA's accountability ratings page:
  - Go to the 2022-23 page.
  - Select 2023 Data Download.
- From the 2023 Data Download page, select:
  - District-level Data (on the next iteration, Campus-level Data)
  - Accountability Summary
  - Continue
  - Select All
  - Comma Delimited
  - Download

Repeat the above steps for Campus-level Data.
After downloading each file, save it in its respective folder (Campus, District, Region, State) under the respective year in the tapr directory, using the file names below. An example of the directory for Campus from 2021-22 can be found here.
| TAPR data file | Download File Name Pattern | Scuole File name |
|---|---|---|
| Reference Information, Accountability Rating and Special Education Determination Status (not needed for state) | *REF.csv | reference.csv |
| Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates | *GRAD.csv | attendance.csv |
| Longitudinal Rate (4-Year, 5-Year, & 6-Year) | *COMP.csv | longitudinal-rate.csv |
| College, Career, and Military Readiness (CCMR), TSIA, College Prep | *PERF1.csv | postsecondary-readiness-and-non-staar-performance-indicators.csv |
| AP/IB, SAT/ACT (added in SY2020-21) | *PERF2.csv | ap-ib-sat-act.csv |
| Staff, Student, and Annual Graduates | *PROF.csv | staff-and-student-information.csv |
| Accountability Summary | Accountability Summary | accountability.csv |
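The renaming in the table above can be sketched as a small lookup. This is an illustration rather than part of the existing tooling: the function name and the sample download names are hypothetical, and the Accountability Summary file is left out because the table gives no file name pattern for it.

```python
from fnmatch import fnmatch

# Download-name patterns and scuole file names, copied from the table above.
PATTERN_TO_SCUOLE_NAME = [
    ("*REF.csv", "reference.csv"),
    ("*GRAD.csv", "attendance.csv"),
    ("*COMP.csv", "longitudinal-rate.csv"),
    (
        "*PERF1.csv",
        "postsecondary-readiness-and-non-staar-performance-indicators.csv",
    ),
    ("*PERF2.csv", "ap-ib-sat-act.csv"),
    ("*PROF.csv", "staff-and-student-information.csv"),
]


def scuole_name(download_name):
    """Return the scuole file name for a downloaded file, or None if unknown."""
    for pattern, target in PATTERN_TO_SCUOLE_NAME:
        # Compare in upper case so the match does not depend on letter case.
        if fnmatch(download_name.upper(), pattern.upper()):
            return target
    return None


# Hypothetical download names, used only to exercise the lookup.
for name in ["CAMPREF.csv", "DISTPERF2.csv", "notes.txt"]:
    print(name, "->", scuole_name(name))
```

Returning `None` for anything unrecognized makes it easy to spot a file that TEA has renamed before it gets copied into the wrong place.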
The TAPR data usually needs cleaning before we load it into the scuole database. During the last few years (early 2020s?), we've done the following cleaning:
- Remove the leading apostrophe in the "DISTRICT", "COUNTY", "REGION" and "CAMPUS" columns (this was used to force these IDs to be recognized as strings)
- The campus, district, region and county codes should have a fixed number of digits, usually padded with leading zeroes. They are:
  - campus (9 digits)
  - district (6 digits)
  - region (2 digits)
  - county (3 digits)
- Make headers all caps. SAT and ACT (added in SY2021-22, from what I can tell) had headers with random letters that were lowercased.
This Jupyter notebook (revised in 2025) should do all that for you.
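For reference, here is a minimal sketch of the three cleaning steps listed above, using only the standard library. It is not the notebook's actual code: the function name, the sample rows, and the lowercase header `cb0gr21r` are made up for illustration.

```python
import csv
import io

# Fixed ID widths from the list above.
ID_WIDTHS = {"CAMPUS": 9, "DISTRICT": 6, "REGION": 2, "COUNTY": 3}


def clean_rows(reader):
    """Yield cleaned dict rows from a csv.DictReader."""
    for row in reader:
        cleaned = {}
        for header, value in row.items():
            header = header.upper()  # make headers all caps
            if header in ID_WIDTHS and value:
                # Drop the leading apostrophe, then pad with leading zeroes.
                value = value.lstrip("'").zfill(ID_WIDTHS[header])
            cleaned[header] = value
        yield cleaned


# Made-up sample standing in for a downloaded TAPR file.
sample = "CAMPUS,DISTRICT,cb0gr21r\n'1902001,'1902,55.5\n"
rows = list(clean_rows(csv.DictReader(io.StringIO(sample))))
print(rows)
```

Reading every value as a string, as `csv.DictReader` does, avoids the usual problem of spreadsheet tools turning zero-padded IDs back into numbers.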
If you're unlucky enough to run into any other formatting errors, first of all (sorry!), second of all, try to write a solution in a Python notebook and add it to this README so it can be reproduced the following year and properly documented.
Each year, there's a possibility that campuses and districts change names, are added, or are removed. We rely on the reference.csv in each year's TAPR folder to create an entities.csv file that will create models for districts and campuses.
Instructions on how to take the reference.csv and create a new entities.csv are in the format_new_entities Jupyter Notebook — we should be doing this every year.
We do some district and campus name re-formatting in the Jupyter Notebook (i.e. Cayuga H S --> Cayuga High School). Abbreviations, the Regex for those abbreviations, and the string to replace them with are in campus_name_abbrev_guidelines.xlsx.
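The kind of regex replacement described above might look roughly like this. Only the `H S` rule comes from the example in this README; the `J H` rule is a made-up placeholder, and the real rules live in campus_name_abbrev_guidelines.xlsx, which this sketch does not read.

```python
import re

# (pattern, replacement) pairs. Only the first rule is taken from the README
# example; the second is a hypothetical placeholder.
ABBREVIATIONS = [
    (r"\bH S\b", "High School"),
    (r"\bJ H\b", "Junior High"),
]


def expand_name(name):
    """Apply each abbreviation rule in order and return the expanded name."""
    for pattern, replacement in ABBREVIATIONS:
        name = re.sub(pattern, replacement, name)
    return name


print(expand_name("Cayuga H S"))
```

The `\b` word boundaries keep a rule from firing inside a longer word, so a name that merely contains the letters "H" and "S" is left alone.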
When you update entities in the scuole database, you will be erasing the existing district and campus models and then re-adding every district and campus based on that entities.csv. This should take care of districts and schools that get renamed/removed/added.
The headers for the data should match the schema found in schema_v2.py, which is what we use to map the data. If, after uploading all of the data into the scuole database, you notice there are fields missing, it could be because the headers in the spreadsheet do not match the schema found in schema_v2.py. There are a lot of headers and columns, so it might get tedious to check each and every one, especially since they don't tend to change year-to-year. But it might be worth checking if you see data missing. The scuole database just skips a column if it doesn't see a matching header, and shows N/A in your local database.
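Checking headers by eye is tedious, so a quick comparison like the one below can narrow things down. This is a sketch under assumptions: it does not import schema_v2.py (whose structure is not described here), so the expected headers are passed in as a plain set, and the column names in the sample are made up.

```python
import csv
import io


def missing_headers(csv_text, expected_headers):
    """Return the expected headers that do not appear in the CSV's first row."""
    header_row = next(csv.reader(io.StringIO(csv_text)))
    actual = {header.strip() for header in header_row}
    return sorted(set(expected_headers) - actual)


# Hypothetical header names, for illustration only.
expected = {"CAMPUS", "CPETALLC"}
sample = "CAMPUS,cpetallc\n001902001,123\n"

print(missing_headers(sample, expected))
```

The comparison is deliberately case-sensitive, on the assumption that a lowercased header counts as a mismatch for the loader too, which is why the cleaning steps make headers all caps.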
Every year, TEA publishes a Master reference of TAPR elements like this one from 2022. It's usually found in the TAPR Advanced Data Download for that year that you use to download the data and called Master Reference (HTML). You can also download it in an Excel format for campuses, districts, regions and state.
It's also good to remember that for some datasets, TAPR releases the latest data, while others are a year behind. For example, if we were updating the 2021-22 TAPR data, the latest data would be for 2021-22 (or Class of 2022) and the previous year would be 2020-21 (or Class of 2021). Here's a handy breakdown:
| TAPR data file | Vintage Year | What It Contains |
|---|---|---|
| Reference Information, Accountability Rating and Special Education Determination Status | latest | A-F scores |
| Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates | previous | chronic absenteeism, dropout rates |
| Longitudinal Rate (4-Year, 5-Year, & 6-Year) | previous | four-year graduation rates |
| College, Career, and Military Readiness (CCMR), TSIA, College Prep | previous | College-ready graduates (TSIA scores) |
| AP/IB, SAT/ACT (added in SY2020-21) | previous | SAT/ACT scores, AP/IB participation/performance |
| Staff, Student, and Annual Graduates | latest | Teacher data (demographics, degree, salaries, students per teacher), student data (demographics, at-risk, economically disadvantaged and limited English proficiency students, bilingual/ESL, gifted & talented, special ed) |
| Accountability Summary | previous | ? |
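To make the vintage rule concrete, the table above can be expressed as a lookup that returns the school year a file's data describes. This is only a sketch: the function names are made up, the file names are the scuole names used elsewhere in this README, and the "previous" entry for the Accountability Summary simply mirrors the table.

```python
# Vintage of each file, copied from the table above (keyed by scuole file name).
VINTAGE_BY_FILE = {
    "reference.csv": "latest",
    "attendance.csv": "previous",
    "longitudinal-rate.csv": "previous",
    "postsecondary-readiness-and-non-staar-performance-indicators.csv": "previous",
    "ap-ib-sat-act.csv": "previous",
    "staff-and-student-information.csv": "latest",
    "accountability.csv": "previous",
}


def previous_school_year(school_year):
    """Return the school year before one written like '2021-22'."""
    start = int(school_year[:4])
    return f"{start - 1}-{start % 100:02d}"


def data_year(file_name, tapr_year):
    """Return the school year that a file's data covers for a TAPR release."""
    if VINTAGE_BY_FILE[file_name] == "latest":
        return tapr_year
    return previous_school_year(tapr_year)


print(data_year("reference.csv", "2021-22"))
print(data_year("attendance.csv", "2021-22"))
```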
Released: Annually in late April/early May
The Texas Higher Ed Coordinating Board (THECB) provides data for the Higher Ed Outcomes section of the app. To obtain and clean the data:
- First, download the latest year of the data from THECB. The latest year should be 11 years from the current year.
- Then, create a folder in the `cohorts/` folder. Name it the year (YYYY) to which the data corresponds.
- Open the spreadsheet (you can do this in Google Sheets if you don't have Excel). Unhide the `Master Raw Data` worksheet in the `.xlsx` file from THECB by going to `Format -> Sheet -> Unhide --> Master Raw Data` (Excel) or `Hamburger Menu -> Unhide` (Google Sheets).
- For the last two years, however, the `Master Raw Data` worksheet has not been for the corresponding year. Check to see if it matches the year. If not, contact the Texas Higher Ed Coordinating Board right away so they can get you a spreadsheet of that `Master Raw Data`. Warning: They tend to take their sweet time when it comes to communication. Note (1/31/25): Cohort data as published on THECB does not include Master Raw Data for the 2013 Cohort. I needed to send a PIR request to [email protected].
- Copy and paste that data into a new spreadsheet. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data. Be sure the data matches the header, and save it as `regionState.csv`.
- Copy and paste the data found in `Region Cty Gender`, `Region Cty Econ` and `Region Cty Ethnicity` into individual csv files. Name them `countyGender.csv`, `countyEcon.csv` and `countyEthnicity.csv`, respectively. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data.
- Be sure to remove any notes found at the top and bottom of all `.xlsx` files.
- Make sure all counts are integers and all percents are floats (which requires changing them in Excel 😬).
- When updating the latest cohorts data, I noticed that they're adding asterisks (*) and N/As into the datasets. We don't want that! We want them to be empty. I created a Jupyter notebook called `clean_cohorts_data.ipynb` to help with any cleanup of that.
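The blanking and number conversion from the last two steps can be sketched in a few lines. This is not the code in the cleanup notebook: the function names are made up, it assumes the only markers to blank are `*` and `N/A`, and it assumes counts may arrive with thousands separators and percents with a trailing `%`.

```python
# Markers that should become empty cells (assumed to be the only two).
SUPPRESSION_MARKERS = {"*", "N/A"}


def clean_cell(value):
    """Blank out suppression markers and leave other values untouched."""
    text = value.strip()
    return "" if text in SUPPRESSION_MARKERS else text


def to_number(value, is_percent):
    """Return an int for counts, a float for percents, or None if empty."""
    text = clean_cell(value)
    if text == "":
        return None
    text = text.replace(",", "").rstrip("%")
    return float(text) if is_percent else int(float(text))


print(clean_cell("*"), clean_cell("N/A"), clean_cell("42"))
print(to_number("1,234", is_percent=False))
print(to_number("45.6%", is_percent=True))
```

Percent values are kept on whatever scale the spreadsheet uses; whether the loader expects 45.6 or 0.456 is not stated here, so check against a previous year's file.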
Note (from 2023 or prior?): According to a spokesperson at THECB, they are planning on building a cohorts dashboard in the near future, and they don't know how they will make the spreadsheets available to download (if at all). That's why we should probably be on the lookout for any changes in the THECB website and plan for any changes in our updating process in the future depending on how the data is released. Contact Mike Eddleman at [email protected] or any other spokesperson at THECB when the time comes for any details.
Follow-up from June 2025: Mike Eddleman is still on the scene. The data is now structured very differently, and I'm working through whether/how we can integrate it into the outcomes explorer even as we're sunsetting this version of the Schools Explorer by end of year.