Skip to main content

Section 5.6 Obtaining and Cleaning the Data

Here, we will find reputable sources for data on anti-trans bills and get the data into a form we can analyze.
In this section, I will...
  • Provide reasons for why data needs to be cleaned.
  • Learn how to access, merge, and clean my own data sets.
  • Explain and execute the data cleaning process for analysis.

Subsection 5.6.1 Importing and Merging the Data

Exercises Exercises

Finding Data Online.
1.
Go to the website tracktranslegislation.com
โ€‰113โ€‰
tracktranslegislation.com
, which in turn obtains its data from LegiScan
โ€‰114โ€‰
legiscan.com
, and browse around. What are two things that you notice about the site? Two things that surprise you? Two things that arenโ€™t on the site that you wonder about?
2.
This data is up-to-date from 2021 through April 21, 2023. In order to track bills prior to 2021, we use data from the ACLUโ€™s โ€œPast Legislation Affecting LGBT Rights Across the Countryโ€ pages for 2018, 2019, and 2020
โ€‰115โ€‰
www.aclu.org/past-legislation-affecting-lgbt-rights-across-country-2018
.
Go to the ACLU "Past Legislation Affecting LGBT Rights Across the Country 2020"
โ€‰116โ€‰
www.aclu.org/past-legislation-affecting-lgbt-rights-across-country-2020
webpage and skim that page. What similarities do you notice with the data found on the Track Trans Legislation website? What differences do you notice? Click on the "View 2019" and "View 2018 Session bills" links and do the same for those years.
3.
Weโ€™d like to get a sense of how anti-trans legislation in the U.S. changed over time, so weโ€™re going to try to merge the Track Trans Legislation (TTL) data with the ACLU data. Since the ACLU data has different bill "Status" categories than TTL, weโ€™ll need to figure out how to classify each ACLU bill into one of the TTL categories.
Use the Terminology
โ€‰117โ€‰
www.tracktranslegislation.com/terminology
page on the TTL website to answer the following question. Which of the TTL categories would you classify "Referred to committee" into? "Hearing scheduled"? "Withdrawn"? You may want to click on the bill numbers on the ACLU site to see how the website LegiScan, a constantly-updated bill tracker, classifies each bill.
Note that the 2020 ACLU page was last updated on March 20, 2020, since many state legislatures were suspended or closed during the first year of the COVID-19 pandemic; the ACLU page promised to โ€œupdate the tracker as major new developments occur[red] .โ€ This data has distinct variable names and organization, so we also modify the ACLU data to match the Track Trans Legislation data as closely as possible.
For example, bills that were withdrawn, not passed by the end of a given legislative session, explicitly listed as โ€œDeadโ€, or were recommended against by a committee and did not proceed in the legislature were relabeled โ€œDeadโ€ (at least for that year). The exception is when the bill description is specifically listed as โ€œhearing scheduledโ€, โ€œreferred to committeeโ€, carried over from another year, or otherwise makes clear that the bill is still under consideration, in which case โ€œIntroducedโ€ or โ€œCrossed Overโ€ (depending on whether the bill had been passed by at least one chamber) is used.
We only include bills in categories tracked by both data sources; this leaves out, for example, bills preventing localities from passing anti-discrimination ordinances within a state. We use a broad reading of the โ€œreligious freedomโ€ category to include bills that allow for people with โ€œsincerely-held religious beliefsโ€ in that state to challenge nondiscrimination laws, discriminate against LGBTQ+ people, refuse to provide healthcare to LGBTQ+ people, refuse to provide adoption services to LGBTQ+ people; discriminate against married LGBTQ+ people, and receive funding for discriminatory student groups at public universities, among others.

Subsection 5.6.2 Cleaning the Data

Cleaning is a process that involves fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset. For more information, check out Tableauโ€™s Guide to Data Cleaning
โ€‰118โ€‰
www.tableau.com/learn/articles/what-is-data-cleaning
. After importing and merging data sets, it is integral to clean the data before working or manipulating the data and producing any conclusions. While the cleaning process can be done manually, it is much more efficient to use a software program such as Microsoft Excel or a programming language such as R when working with multiple data sets or large quantities of data.
Here are the steps to clean your data:
1. Remove duplicate or irrelevant information 2. Fix naming conventions, typos, or any other structural errors that do not make the data uniform 3. Filter unwanted data such as outliers 4. Handle missing data by dropping observations, inputting missing values, or altering the data 5. Validate and ensure that the cleaning process was completed correctly before beginning the coding or visualizing process
First, the 2018-2020 ACLU datasets code state names by their two-letter abbreviations (e.g., โ€œAZโ€ instead of โ€œArizonaโ€), while Track Trans Legislation uses full names. So we use a program to convert full names to abbreviations in the whole dataset. We have done this cleaning and it can be found in ADD GITHUB LINK HERE! You can see the code we used here: ADD THE CODE FOR STUDENTS THAT ARE CURIOUS!!!! We also note that, for example, the 2021 dataset includes some bills passed in January 2022, so we eliminate duplicate bills.
This decreases the number of bills in our dataset from 927 to 893. Next, note that the bills whose status is labeled Introduced* by TTL are those that failed to meet their stateโ€™s โ€œcrossover deadlineโ€, the date by which a bill must pass out of the chamber in which it was introduced and to the other chamber (e.g. State House vs. Senate). According to the site Track Trans Legislation
โ€‰119โ€‰
www.tracktranslegislation.com/
, a bill that is not passed in its initial chamber by the crossover deadline โ€œfaces high procedural hurdles in order to move forward.โ€ Thus, we wish to classify these bills (at least for the current session) as โ€œDead/Failedโ€.
Moreover, one billโ€™s status is listed as โ€œPostedโ€, Kentuckyโ€™s HB132 in 2020. LegiScan research
โ€‰120โ€‰
legiscan.com/KY/bill/HB132/2020
reveals that this bill died in committee, so we update its status to Dead/Failed.
Letโ€™s walk through cleaning the most recent anti-trans legislation data from Track Trans Legislation
โ€‰121โ€‰
www.tracktranslegislation.com/
1. Letโ€™s look at the data from all states, all bill types, and all statuses (i.e. you should not be filtering anything in this case). Click on the โ€œExportโ€ button and download the data as a CSV file.
Data from Track Trans Legislation displaying dates, states, bills, bill types, and statuses with a red box highlighting the "Export" button.
2. We now need to open the CSV file in Excel. To do this, take the following steps: Open Excel. Select File โ†’ Openโ€ฆโ†’ Choose the CSV file. The Excel sheet should populate with the data.
3. To get a preview of what the data looks like, go to the Home tab within Excel and press Analyze Data. A side panel should appear on the right of the screen. Always be sure to check these insights present the information clearly.
Excel file downloaded from Track Trans Legislation displaying the "Analyzing Data" button and corresponding window.
4. Remove any duplicates. To remove duplicates: Data โ†’ Remove Duplicates, ensure that all columns with duplicates are checked off so they are removed.
Excel heading with the "Data" tab and "Remove Duplicates" button highlighted in a red box.
5. Now, letโ€™s make sure the formatting is consistent. Letโ€™s try lowercasing all of the state names. First, create a new column to the side of the states column. Next, in this new column, enter โ€œ=LOWER( )โ€ function into the first cell. In the parentheses, write X:X, replacing the Xs with the letter of the original column and hit enter. This will populate the new column with all of the state names in lowercase letters:
Columns E and F from Excel file showing the before and after results of using the "=LOWER()" function
Note: If you donโ€™t want the formulas in the resultant cells, you just want the new lowercased versions of the names as if they had been hand-typed, you can select the names, press Control-C to copy them, navigate to Paste โ†’ Paste Values, and paste them into a new column or on top of the column you had previously made. Doing this will allow you to delete the original column without affecting the new, lowercase column. We can also capitalize all letters in the cells we select by using โ€œ=UPPER( )โ€ or use โ€œ=PROPER( )โ€ to reset the capitalization so the first letter is the only one capitalized.
Drop down from the "Paste" with "Paste Values" highlighted in green.
6. Next, we will find entries for the state of โ€œOklahomaโ€ and replace them with the state abbreviation โ€œOKโ€ using the Find and Replace function. To do this, the following steps apply: First, navigate to Home โ†’ Editing โ†’ Find & Select. Next, Replace โ†’ Find what: โ€œOklahomaโ€ and Replace with: โ€œOKโ€ โ†’ Replace All. Note, the alert will notify you of how many cells were altered and you should make sure that all planned changes were made.
Excel heading with the "Home" tab and the "Find & Select" button highlighted in red boxes.
The "Find & Replace" pop-up window with "Find what:" and "Replace with:" dropdowns, the "Replace All" button, and alert message highlighted in red boxes.
7. Finally, we can use the TRIM( ) function to eliminate excessive spaces. For example, if in our data set we have โ€œ Arizonaโ€ (note the extra space before the A) and we want โ€œArizona,โ€ we first need to create a new column adjacent to the state column. Next, in this new column, enter โ€œ=TRIM( )โ€ into the first cell. In the parentheses, write X:X, replacing the Xs with the letter of the original state column and hit enter. This will populate the new column with all of the state names without excessive spaces.
Note: The TRIM( ) function will not remove any spaces between two words in a state name (โ€œNorth Carolinaโ€ to โ€œNorthCarolinaโ€), it will just remove excess spaces at the beginning and end of the state name.

Subsection 5.6.3 Exercise

Instructions: Open the Oklahoma Excel file
โ€‰122โ€‰
github.com/BatesDCS23-24/FemSTEM2/blob/b680dd6723a2a4097ad27e978de7d686d9deeb02/Excel%20files/5.3%20Exercise%201.xlsx
and click โ€œview rawโ€ to download the file.
You are part of a data-cleaning team focused on this data set. Your job is to clean the provided data set as follows: tracktranslegislation.com
โ€‰123โ€‰
tracktranslegislation.com
  1. Delete all conditional formatting (i.e. capitalization)
  2. Replace all mentions of โ€œAlabamaโ€ with โ€œOklahomaโ€
  3. Ensure that everything is spelled correctly (use spell check)
  4. Remove all duplicate bills
  5. Create a new column called โ€œBill Description 2.โ€ Use the TRIM() function to remove excess spaces in the โ€œBill Descriptionโ€ column.
  6. Your boss wants a new โ€œState2โ€ column where all of the state names are completely capitalized (i.e. โ€œOKLAHOMAโ€). Create a new โ€œState2โ€ column using the UPPER( ) function.
Note: If you want to know how to clean data using the R language, check out Chapter 5 on Data Structures
โ€‰124โ€‰
rc2e.com/datastructures#intro-DataStructures
in RCookbook.

Subsection 5.6.4 Cleaning for Text Analysis

Text analysis is the process of using computer systems to understand human-written text for analysis purposes including analyzing legislation. Text can be analyzed from unstructured data sources such as emails, surveys, documents, and other online material to extract insights. The main feature of text analysis is training computer software to associate words with meanings, similar to how humans learn a new language through objects, actions, and emotions. Deep learning and natural language processing (NLP) are the principles of text analysis. (Source).
โ€‰125โ€‰
aws.amazon.com/what-is/text-analysis/
The goal of this section is to give foundational steps to clean a txt document in R studio and then use R studio to analyze the text using frequency analysis, bigrams, and optionally networks.
In this chapter, you will explore anti-trans legislation through text analysis. Note, this approach requires additional cleaning. This cleaning process depends on the specific bill, its formatting, and your groupโ€™s preferences. Decisions must be made about how much you want to clean the data for your specific project.
Note: whenever you download a file to open in R or R Studio, make sure to set your working directory to wherever that file is located. For example, if you have the file just on your desktop (not in any folder), entering โ€œsetwd("~/Desktop")โ€ in the console window will set the working directory to your desktop.

Activity 7.

Before we do any text analysis, we have to do some data cleaning. We will use an Arizona bill regarding trans legislation as an example, which can be viewed here
โ€‰126โ€‰
github.com/DAAAAMNSocialJustice/anti-trans-legislation/blob/3f450de86643b58c20df51e910cf7ebd8c4a266e/4-bill-test.xlsx
.
After downloading the "4-bill-test" excel file, we can enter the following code in R:
This will install some necessary packages for cleaning and read the excel file into R. We want to look at first anti-trans bill in the excel file- an Arizona bill vetoed that addresses pronoun use in schools. Add the following code below what you already have:
This code will output the first row of text of the bills file. We will now clean this section with the following code. Run each block one at a time along with the code from above. You can enter "ltext" at any point to check your progress.
  1. Removing line breaks.
  2. Deleting punctuation.
  3. Deleting a phrase or line.
    This text tends to have a 1. and a. and (a) and (b). We already eliminated parentheses and periods, so now we are going to eliminate all single letter words and numbers. We are intentionally not removing all numbers. There are several references to K12 or under 18 so we want to maintain these. We also want to make sure that our words will be recognized as the same when we construct relationships, so we have to convert everything to lowercase.
  4. Removing redundant spaces.
  5. Identifying and eliminating stop words.
    Stop words are common words that carry little textual meaning, so we do not need to include them in our analysis. For example, "the", "are", "is", etc...
After cleaning the data, a block of text remains.
In order to analyze the block of text, we need to get these words into what is called a word vector. For those unfamiliar with the idea of vectors, word vectors are a list of words, with one word on each line. If you are familiar with vectors from other math courses, the idea of a word vector is similar: it is a list of words in an array format. We can create a word vector and save it as a new csv file with the following code:
You can enter "ltext_vec" to view our word vector. Play with the following code to analyze your text.
We can note that the relative frequency (rflel) is the number of times a particular value for a variable has been observed in relation to the total number of values for that variable. We can also see from our top 20 list that "school" and "pronoun" were of the most common words used in the text. This makes sense given the context of this bill.

Subsection 5.6.5 Visualizing Text Analysis Data

We can continue our investigation of the bill by creating and analyzing visualizations.

Activity 8.

  1. Download the atl_heritage.Rmd
    โ€‰127โ€‰
    github.com/DAAAAMNSocialJustice/anti-trans-legislation/blob/3f450de86643b58c20df51e910cf7ebd8c4a266e/atl_heritage.Rmd
    , 4-bill-test.xlsx
    โ€‰128โ€‰
    github.com/DAAAAMNSocialJustice/anti-trans-legislation/blob/3f450de86643b58c20df51e910cf7ebd8c4a266e/4-bill-test.xlsx
    , template legislation.xlsx
    โ€‰129โ€‰
    github.com/DAAAAMNSocialJustice/anti-trans-legislation/blob/3f450de86643b58c20df51e910cf7ebd8c4a266e/template%20legislation.xlsx
    , and Pro-LGBTQ-Bills.csv
    โ€‰130โ€‰
    github.com/DAAAAMNSocialJustice/anti-trans-legislation/blob/3f450de86643b58c20df51e910cf7ebd8c4a266e/Pro-LGBTQ-Bills.csv
    files from GitHub
    โ€‰131โ€‰
    github.com/DAAAAMNSocialJustice/anti-trans-legislation
    .
  2. Open atl_heritage.Rmd in R Studio and run the code by selecting Runโ†’Run All. A small yellow banner may pop up at the top of your screen that you do not have a library installed. Press โ€œInstall,โ€ and run the code again.
  3. Tabs named โ€œbills,โ€ โ€œtemplate_leg,โ€ โ€œprobills,โ€ โ€œltext_cuonts,โ€ and โ€œcounts_orderedโ€ appear in the bar next to the original .Rmd file. What information do each of these tabs contain?
  4. After running the code, some tables and graphs will appear directly in the code. Return to the .Rmd tab and scroll through the code until you find the following table. How many times does the word โ€œemployeeโ€ appear in the bill? Hint: look at the โ€œnโ€ column. What is the relative frequency of the word โ€œeducationโ€? Hint: look at the โ€œrelfโ€ column.
  5. Continue to scroll through the graphs. Talk with a partner about what looks nice in each graph, and what you would want to change. Which graph is your favorite? Why?

Subsection 5.6.6 Exercises

Now, itโ€™s your turn to try!

Checkpoint 5.6.1.

Using the following graph, answer the questions 1-6.
  1. What does the label โ€œnโ€ on the y-axis mean?
  2. What title would you give to the graph?
  3. How many words appear exactly 5 times in the bill?
  4. How many words appear at least 7 times in the bill?
  5. What word appears the most?
  6. What might this information tell us about the bill?

Checkpoint 5.6.2.

List 3 things you can do to clean data learned in Section 5.4 and 3 things you can do to clean data for text analysis (discussed in this section). What are some similarities and differences between the methods you came up with?

Checkpoint 5.6.3.

What do you think would happen if you didnโ€™t clean the data before making a word vector? How might the graph look different?

Checkpoint 5.6.4.

Which cleaning method for text analysis do you think would be most useful to learn? Why? This question is asking for an opinion, there is no right answer.