
scuole-data

This repository contains data sets used in the scuole project.

The instructions below guide you through retrieving and cleaning the requisite data files before you start the update process in the scuole repo.

Data should be updated according to the following timeline:

  • January after the latest TAPR data is released at the end of the previous year
  • May (or late summer, in the case of 2025) after 8th Grade Cohort data is published
  • At other regular intervals to ensure AskTed directory info is updated (though this should be automated)

You can update these data independently from each other, provided you follow the instructions in scuole in sequence.

As you proceed through these steps, please update the vintage of each dataset in the schools data catalog (see the columns with the Schools Explorer header). As of January 2025, most data in this repo was last updated mid-2023.

District boundaries and campus coordinates

TEA provides district boundaries and campus coordinates on their open data site, which you can get to via the School District Locator page. The latest district boundaries are listed under "Current Districts" and campus coordinates under "Current Campuses". If you are updating for an earlier school year, go to "Archived Schools Data" instead and click the school year you are updating for.

Put the GeoJSON boundaries and coordinates in the respective folder: tapr/reference/district/shapes/ or tapr/reference/campus/shapes/.
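Before committing a downloaded file, it can help to confirm it really is a GeoJSON FeatureCollection and contains the geometry you expect. The following is a minimal sketch using only the standard library; the function name and the sample data are made up for illustration and are not part of scuole.

```python
import json


def summarize_geojson(text):
    """Return (feature_count, geometry_types) for a GeoJSON FeatureCollection string."""
    data = json.loads(text)
    if data.get("type") != "FeatureCollection":
        raise ValueError("expected a GeoJSON FeatureCollection")
    features = data.get("features", [])
    # Features with a null geometry are reported as "None".
    geometry_types = sorted(
        {(f.get("geometry") or {}).get("type", "None") for f in features}
    )
    return len(features), geometry_types


# Tiny made-up sample; a real file would be read with open(path).read().
sample = json.dumps({
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "properties": {},
            "geometry": {"type": "Point", "coordinates": [-97.74, 30.27]},
        },
    ],
})
print(summarize_geojson(sample))
```

Campus files would be expected to report point geometries and district files polygons, but that is an assumption worth checking against the actual downloads.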

District boundaries

These files are updated approximately once every year, and should be downloaded as GeoJSONs. We don't display the actual shapes on the page because they're not accurate enough and may be misleading. They are useful for determining nearby districts and geolocating.

Campus coordinates

These files are updated approximately once every year, and should be downloaded as GeoJSONs. Campus coordinates data can be a bit dated, and likely includes some "zombie schools" which have closed or otherwise have zero enrollment.

AskTED

Released: as information is updated, but most likely to be up-to-date by September 1

AskTED provides superintendents, principals and directory information for all schools and districts. The scuole repo downloads data from AskTED directly and updates them in our database, so there's no need to manually download and add them to scuole-data.

Instructions on what commands to run to update AskTED are in the scuole README.

As of 2023, the AskTED data is found on their site if you click on Download School and District File with Site Address. This ensures that we have the site addresses of schools and district offices rather than their mailing addresses (which can be P.O. boxes).

If the links above don't lead you to the correct data, AskTED might have changed the link to the data. In that case, the commands that update AskTED data in the scuole database will fail, so make sure the links and header names are updated to match.

TAPR

Released: Historically in late November/early December, but was mid-February for SY2023-24 download

All stats are collected by the Texas Education Agency.

To download TAPR data, go to the Texas Academic Performance Report homepage and find the most recent release. Click the Data Download link and go to the Advanced TAPR Data (Numerators, Denominators & Rates) option.

This app requires sheets for College, Career, and Military Readiness (CCMR), TSIA, College Prep, AP/IB, SAT/ACT, Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates, Longitudinal Rate (4-Year, 5-Year, & 6-Year), Staff & Student Information for Campus, District, Region and State. The Reference Information, Accountability Rating and Special Education Determination Status sheet is only required for Campus, District and Region.

For districts and campuses, we need to download one extra file that contains the full A-F ratings. Go back one page and select 20xx Accountability instead (the year will change depending on what year you're working on). Download the Accountability Summary for campuses and districts.

For school year 2022-23, accountability ratings were not included in the TAPR dataset due to a pending lawsuit. That data was subsequently released and is separately available. To acquire this data, I followed these steps:

  1. Start from TEA's accountability ratings page.
  2. Go to the 2022-23 page.
  3. Select 2023 Data Download.
  4. On the 2023 Data Download page, choose:
    1. District-level Data (on the next iteration, Campus-level Data)
    2. Accountability Summary
    3. Continue
  5. Choose Select All.
  6. Choose Comma Delimited.
  7. Click Download.

Repeat the above steps for Campus-level Data.

After downloading each file, save it in its respective folder (Campus, District, Region, State) under the respective year in the tapr directory, using the scuole file names listed below. An example of the Campus directory from 2021-22 is found here.

| TAPR data file | Download file name pattern | Scuole file name |
|---|---|---|
| Reference Information, Accountability Rating and Special Education Determination Status (not needed for states) | *REF.csv | reference.csv |
| Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates | *GRAD.csv | attendance.csv |
| Longitudinal Rate (4-Year, 5-Year, & 6-Year) | *COMP.csv | longitudinal-rate.csv |
| College, Career, and Military Readiness (CCMR), TSIA, College Prep | *PERF1.csv | postsecondary-readiness-and-non-staar-performance-indicators.csv |
| AP/IB, SAT/ACT (added in SY2020-21) | *PERF2.csv | ap-ib-sat-act.csv |
| Staff, Student, and Annual Graduates | *PROF.csv | staff-and-student-information.csv |
| Accountability Summary | Accountability Summary | accountability.csv |
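The renaming can be scripted. Below is a rough sketch that maps a downloaded file name to the scuole file name based on the file name patterns listed above. The function name and the example download names (such as `CAMPREF.csv`) are assumptions for illustration; check them against the files TEA actually provides. The Accountability Summary file is left out because its download name is not a suffix pattern.

```python
from pathlib import Path

# Suffix of the downloaded file -> file name scuole expects.
SUFFIX_TO_SCUOLE_NAME = {
    "REF": "reference.csv",
    "GRAD": "attendance.csv",
    "COMP": "longitudinal-rate.csv",
    "PERF1": "postsecondary-readiness-and-non-staar-performance-indicators.csv",
    "PERF2": "ap-ib-sat-act.csv",
    "PROF": "staff-and-student-information.csv",
}


def scuole_name(download_name):
    """Map a downloaded TAPR file name to its scuole name, or None if unrecognized."""
    stem = Path(download_name).stem.upper()
    # Try longer suffixes first so overlapping suffixes cannot mis-match.
    for suffix in sorted(SUFFIX_TO_SCUOLE_NAME, key=len, reverse=True):
        if stem.endswith(suffix):
            return SUFFIX_TO_SCUOLE_NAME[suffix]
    return None


# Hypothetical download names, for illustration only.
for name in ["CAMPREF.csv", "DISTPERF2.csv", "notes.txt"]:
    print(name, "->", scuole_name(name))
```

Returning `None` for unrecognized names, rather than raising, lets you list leftovers that need a manual decision.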

Cleaning the TAPR data

The TAPR data usually needs cleaning before we load it into the scuole database. During the last few years (early 2020s?), we've done the following cleaning:

  • Remove the leading apostrophe in the "DISTRICT", "COUNTY", "REGION" and "CAMPUS" columns (this was used to force these IDs to be recognized as strings)
  • The campus, district, region and county codes should have a fixed number of digits, usually padded with leading zeroes. They are:
    • campus (9 digits)
    • district (6 digits)
    • region (2 digits)
    • county (3 digits)
  • Make headers all caps. SAT and ACT (added in SY2021-22, from what I can tell) had headers with random letters that were lowercased.

This Jupyter notebook (revised in 2025) should do all that for you.

If you're unlucky enough to run into any other formatting errors: first of all, sorry! Second of all, try to write a solution in a Python notebook and document it in this README so it can be reproduced the following year.

District and campus models

Each year, there's a possibility that campuses and districts change names, are added, or are removed. We rely on the reference.csv in each year's TAPR folder to create an entities.csv file, which is used to create models for districts and campuses.

Instructions on how to take the reference.csv and create a new entities.csv are in the format_new_entities Jupyter Notebook — we should be doing this every year.

We do some district and campus name reformatting in the Jupyter Notebook (e.g. Cayuga H S --> Cayuga High School). The abbreviations, the regex for each abbreviation, and the replacement strings are in campus_name_abbrev_guidelines.xlsx.
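To show the general idea of regex-based expansion, here is a small sketch. The three rules are hypothetical stand-ins; the real patterns and replacements live in campus_name_abbrev_guidelines.xlsx and may differ.

```python
import re

# Hypothetical rules for illustration only. Each pattern is anchored to the
# end of the name so that, for example, the "El" in "El Paso" is left alone.
ABBREVIATIONS = [
    (r"\bH S$", "High School"),
    (r"\bJ H$", "Junior High"),
    (r"\bEl$", "Elementary"),
]


def expand_abbreviations(name):
    """Apply each abbreviation rule in order and return the expanded name."""
    for pattern, replacement in ABBREVIATIONS:
        name = re.sub(pattern, replacement, name)
    return name


print(expand_abbreviations("Cayuga H S"))
print(expand_abbreviations("El Paso El"))
```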

When you update entities in the scuole database, you will be erasing the existing district and campus models and then re-adding every district and campus based on that entities.csv. This should take care of districts and schools that get renamed, removed or added.

Making sure TAPR header data matches

The headers for the data should match the schema found in schema_v2.py, which is what we use to map the data. If you notice fields missing after loading all of the data into the scuole database, it could be because the headers in the spreadsheet do not match the schema in schema_v2.py. The scuole database skips any field for which it doesn't find a matching header and shows N/A in your local database. There are a lot of headers and columns, and they don't tend to change from year to year, so checking each one can get tedious. But it is worth checking if you see data missing.
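Rather than checking headers by eye, you could compare a CSV's first row against the names the schema expects. This is a minimal sketch: `EXPECTED` is a made-up placeholder list, and how you extract the real names from schema_v2.py is left open.

```python
import csv
import io


def missing_headers(csv_text, expected_headers):
    """Return the expected headers that are absent from the CSV's first row, sorted."""
    header_row = next(csv.reader(io.StringIO(csv_text)), [])
    present = {h.strip() for h in header_row}
    return sorted(set(expected_headers) - present)


# Placeholder names standing in for the columns listed in schema_v2.py.
EXPECTED = ["CAMPUS", "CPETALLC", "CPETWHIC"]
sample = "CAMPUS,CPETALLC\n001902001,150\n"
print(missing_headers(sample, EXPECTED))
```

An empty list means every expected header was found; anything else is a candidate explanation for N/A values.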

Every year, TEA publishes a Master reference of TAPR elements like this one from 2022. It's usually found in the TAPR Advanced Data Download for that year that you use to download the data and called Master Reference (HTML). You can also download it in an Excel format for campuses, districts, regions and state.

It's also good to remember that for some datasets, TAPR releases the latest data, while others are a year behind. For example, if we were updating the 2021-22 TAPR data, the latest data would be for 2021-22 (or Class of 2022) and the previous year would be 2020-21 (or Class of 2021). Here is a handy breakdown:

| TAPR data file | Vintage year | What it contains |
|---|---|---|
| Reference Information, Accountability Rating and Special Education Determination Status | latest | A-F scores |
| Attendance, Chronic Absenteeism, Graduation (RHSP/DAP & FHSP), and Dropout Rates | previous | chronic absenteeism; dropout rates |
| Longitudinal Rate (4-Year, 5-Year, & 6-Year) | previous | four-year graduation rates |
| College, Career, and Military Readiness (CCMR), TSIA, College Prep | previous | College-ready graduates (TSIA scores) |
| AP/IB, SAT/ACT (added in SY2020-21) | previous | SAT/ACT scores; AP/IB participation/performance |
| Staff, Student, and Annual Graduates | latest | Teacher data (demographics, degree, salaries, students per teacher); student data (demographics, at-risk, economically disadvantaged and limited English proficiency students, bilingual/ESL, gifted & talented, special ed) |
| Accountability Summary | previous | ? |
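The "latest" versus "previous" distinction can be expressed as a small lookup. The sketch below assumes the scuole file names used earlier in this README and the school-year format "2021-22"; the function names are made up for illustration.

```python
# Vintage of each file relative to the TAPR release, per the breakdown above.
VINTAGE = {
    "reference.csv": "latest",
    "attendance.csv": "previous",
    "longitudinal-rate.csv": "previous",
    "postsecondary-readiness-and-non-staar-performance-indicators.csv": "previous",
    "ap-ib-sat-act.csv": "previous",
    "staff-and-student-information.csv": "latest",
    "accountability.csv": "previous",
}


def previous_school_year(school_year):
    """Turn a school year such as '2021-22' into the one before it ('2020-21')."""
    start = int(school_year.split("-")[0])
    return f"{start - 1}-{start % 100:02d}"


def data_year(file_name, release_year):
    """Return the school year the data in file_name describes for a given TAPR release."""
    if VINTAGE[file_name] == "latest":
        return release_year
    return previous_school_year(release_year)


print(data_year("reference.csv", "2021-22"))
print(data_year("attendance.csv", "2021-22"))
```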

Cohorts

Released: Annually in late April/early May

The Texas Higher Ed Coordinating Board (THECB) provides data for the Higher Ed Outcomes section of the app. To obtain and clean the data:

  1. First, download the latest year of the data from THECB. The latest cohort year should be 11 years before the current year.

  2. Then, create a folder in the cohorts/ folder. Name it the year (YYYY) to which the data corresponds.

  3. Open the spreadsheet (you can do this in Google Sheets if you don't have Excel). Unhide the Master Raw Data worksheet in the .xlsx file from THECB by going to Format -> Sheet -> Unhide -> Master Raw Data (Excel) or Hamburger Menu -> Unhide (Google Sheets).

  4. For the last two years, however, the Master Raw Data worksheet has not been for the corresponding year. Check that it matches the year. If not, contact the Texas Higher Ed Coordinating Board right away so they can get you a spreadsheet of that Master Raw Data. Warning: They tend to take their sweet time when it comes to communication. Note (1/31/25): Cohort data as published by THECB does not include Master Raw Data for the 2013 cohort. I needed to send a PIR request to [email protected].

  5. Copy and paste that data into a new spreadsheet. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data. Be sure the data matches the header, and save it as regionState.csv.

  6. Copy and paste the data found in Region Cty Gender, Region Cty Econ and Region Cty Ethnicity into individual csv files. Name them countyGender.csv, countyEcon.csv and countyEthnicity.csv, respectively. Change the headers to match the list of fields in the loader, or copy and paste the headers from a previous year's data.

  7. Be sure to remove any notes found at the top and bottom of all .xlsx files.

  8. Make sure all counts are integers and all percents are floats (which requires changing them in Excel 😬).

  9. When updating the latest cohorts data, I noticed that THECB is adding asterisks (*) and N/As to the datasets. We don't want those; we want those cells to be empty. I created a Jupyter notebook called clean_cohorts_data.ipynb to help with that cleanup.
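The last two steps (typed counts and percents, and blanked-out asterisks and N/As) can be sketched per cell as follows. This is only an illustration of the idea, not the contents of clean_cohorts_data.ipynb; the function name and the `kind` labels are assumptions.

```python
def clean_cell(value, kind="text"):
    """Blank out '*' and 'N/A'; coerce counts to int and percents to float."""
    text = value.strip()
    if text in {"*", "N/A", ""}:
        return ""
    if kind == "count":
        # Handles values such as "1,234" or "12.0" that spreadsheets may export.
        return int(float(text.replace(",", "")))
    if kind == "percent":
        return float(text.rstrip("%"))
    return text


print(clean_cell("*", "count"))
print(clean_cell("1,234", "count"))
print(clean_cell("45.6%", "percent"))
```

You would apply this to each column according to whether it holds counts, percents or text, which depends on the headers in the loader.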

Note (from 2023 or prior?): According to a spokesperson at THECB, they are planning on building a cohorts dashboard in the near future, and they don't know how they will make the spreadsheets available to download (if at all). That's why we should probably be on the lookout for any changes in the THECB website and plan for any changes in our updating process in the future depending on how the data is released. Contact Mike Eddleman at [email protected] or any other spokesperson at THECB when the time comes for any details.

Follow-up from June 2025: Mike Eddleman is still on the scene. The data is now structured very differently, and I'm working through whether and how we can integrate it into the outcomes explorer, even as we're sunsetting this version of the Schools Explorer by the end of the year.
