Skip to Main Content

Tools for Organizing and Cleaning Data Workshop

An introductory guide to methods and tools for cleaning, organizing, and exporting data.

Cleaning Problem Characters

Most of the time your data looks like gibberish, the problem is not the data itself, but a mis-match between the encoding the data is in and the encoding the software expected to read. If this is the case, you can use the processes below to check for encoding issues and to correct many of them.

 

Changing Encoding with Notepad++

1. Save your file as a .csv

2. Open your file in Notepad++ and note whether the characters are displaying correctly.

3. Select "Encoding" and "Encode in ANSI" if using Excel.

4. Save the file again.

4. Open the file in Excel and check whether the characters are displaying properly.

 

Change Encoding expected by Excel

1.  In Excel, save your file as a .txt (tab-delimited) file.

2. Close the file and then re-open the file in Excel.

3. Tell Excel to read the file as UTF-8 encoded data.

4. Notice if the characters are displaying correctly.

Find and Replace to Clean Data

Even simple tools, like Find and Replace, can be useful for reviewing and cleaning your data.

Things to keep in mind:

1. Selecting a portion of your data before "Find and Replace" will search only that subset of data.

This can be helpful when you're only attempting to replace a particular character in one type of field, but it can also deceptively report no occurrences of a character when the character is present in other parts of your data.

2. Selecting "Find All" will display every occurrence of a string in the search area.

Scrolling through this list can help you see patterns for when a string occurs and help you check whether you want to use "Replace All" or verify each replacement one at a time.

3. Using "Find and Replace" to replace diacritics and other characters requires a one-to-one replacement, but there are other tools (like ASAP utilities) or methods (like Regular Expressions) that will allow you to do more complex clean-up.

ASAP Utilities or Notepad++ can be especially useful when characters that don't seem to be a problem in Excel are causing your data to fail to import into other software properly. ASAP Utilities can find many characters and replace them with one character, including replacing various ways to encode spaces with one simple space, or replacing all diacritics with their nearest standard letter. Using "Show All Characters" in Notepad++ can also make visible these characters that may simply show up as a line break or new column or not be visible at all in Excel or Google Sheets.

If you have complicated replacements that require Regular Expressions, you may want to explore using the GREL or Jython functions available in OpenRefine.

4. When you use "Find and Replace" on characters that may occur in other places in your data that you don't want to risk changing, you can use a larger string to help make sure that you only replace the problems.

For example, putting the degree marks in the appropriate places could be done by noticing that the numbers before the problem characters in the string below are always either 29 or 90.

Lat:  29�55'24.19""""N, Long:  90� 5'14.49""""W; Lat:  29�59'0.25""""N, Long:  90� 6'37.19""""W

Thus, Find 29�  and Replace it with 29° and Find 90�  and Replace it with 90° could cut your work in half while also making sure that you don't put degree characters into random places within your data.

Find Outlying Data Mistakes

Filter and Sort

Filter and Sort are two tools that can help you find outliers in your data. These outliers will often be mistakes in your data, such as dates in different formats, missing decimal points, or data added to the wrong column.

Excel includes a button "filter and sort", but these are two different functions.

 
Filter

Filter organizes the contents of the field so that every duplicate is only listed once. To see all of the contents once organized in this way, you can click the arrow next to the header. By selecting or de-selecting particular pieces of data in this drop down, you can limit the data shown to particular problematic pieces of data.

 

Sort

Sort organizes the contents of the field so that each record is listed in a particular order. When sorting, always use "expand the selection" if you'd like to maintain the relationship between these cells and their rows (ie. records). With text contents, this will default to alphabetic order from A-Z. You can also reverse the order, or use "custom sort" to create your own order. Custom sort allows you to create a hierarchy that sorts by two or more columns, which can be used to see the relationship between two columns (for example, every record entered during a certain time stamp range may have the incorrect information in a measurement field because an instrument was incorrectly calibrated).

 
Using Filter and Sort

Use "Sort & Filter" to organize the data in the Restaurant field.

  • Are there any mis-spellings in the Restaurant names?
  • Can you use this data to assess how many Restaurants are represented in the data?
  • Can you use a custom sort to assess the relationship between the restaurants and the dates of the menus represented?
  • Can you use a custom sort to assess the relationship between the restaurants and their locations?
Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.