Eric 6 User's Guide/Eric as a Quick-And-Dirty Editor/Scenario 1

From Wikibooks, open books for an open world
Jump to navigation Jump to search

Scenario 1[edit]

You want to import some data into a database. To do so, you need to manipulate some of the text for consistency, turning text like this:

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered
Anhui,Mainland China,1/22/2020 17:00,1,,
Beijing,Mainland China,1/22/2020 17:00,14,,

and this:

Province/State,Country/Region,Last Update,Confirmed,Deaths,Recovered,Latitude,Longitude
Hubei,Mainland China,2020-03-03T11:43:02,67217,2835,36208,30.9756,112.2707
,South Korea,2020-03-03T09:43:02,5186,28,30,36.0000,128.0000
,Italy,2020-03-03T20:03:06,2502,79,160,43.0000,12.0000
,Iran,2020-03-03T14:23:03,2336,77,291,32.0000,53.0000
Guangdong,Mainland China,2020-03-03T10:23:10,1350,7,1101,23.3417,113.4244

and this:

FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key
45001,Abbeville,South Carolina,US,2020-04-10 22:54:07,34.22333378,-82.46170658,7,0,0,0,"Abbeville, South Carolina, US"
22001,Acadia,Louisiana,US,2020-04-10 22:54:07,30.295064899999996,-92.41419698,94,4,0,0,"Acadia, Louisiana, US"
51001,Accomack,Virginia,US,2020-04-10 22:54:07,37.76707161,-75.63234615,12,0,0,0,"Accomack, Virginia, US"

into this:

Rpt_Date,Confirmed,Deaths,Recovered,Combined_Key
"2020-05-18",35,0,0,"Abbeville, South Carolina, US"
"2020-05-18",198,12,0,"Acadia, Louisiana, US"
"2020-05-18",688,9,0,"Accomack, Virginia, US"
"2020-05-18",773,22,0,"Ada, Idaho, US"
"2020-05-18",5,0,0,"Adair, Iowa, US"

A perfect quick-and-dirty project for Python and Eric. You decide that this will be a console-only application – no need for classes or a fancy GUI just to clean up some text files.

Background[edit]

It all started innocently enough. You just wanted to see how Covid-19 infections were going in some places near and dear to you, like the UK where you have relatives. The web charts you'd looked at weren't flexible enough to display what you wanted to see. The sites offering them credited the Center for Systems Science and Engineering at Johns Hopkins University (JHU CSSE) as the data source. It turned out that JHU was making the data publicly accessible on github.

Checking it out, you found what looked like .csv files. "Great - I'll just download some of those, open them up in LibreOffice Calc, and knock together some charts of my own."

It wasn't quite that simple. There are global reports and more detailed ones limited to the United States. Your interest was global. There are two types of data files: time series and daily reports.

For the time series, JHU updates three omnibus table each day, for confirmed cases, deaths and recovered. Each table contains location names as rows and dates as columns. The number of cases appears at their intersection. A new date column is added each day. New location rows are added intermittently.

The daily reports contain locations in rows, a Last_Updated column and columns for confirmed, deaths, recovered and active cases, plus, at various times, other data like latitude and longitude.

The report names are in the form "<report>.csv" but when you downloaded a few, they turned out to be in HTML. Looking at the data in your browser, you found you could get the data for a page out as bare text by using the page's RAW button. File, Save in the browser generated a basic csv file. A bit of a hassle, but manageable. You were only looking for a few dates.

You downloaded some data, wangled out your charts and were content.

Time marched on. So did Covid. You kept going back for more. It got tedious. You were less content.

Now, you're a bit of data person and have some basic SQL skills. You knew you could roll up a open source relational database management system (RDBMS) like Firebird and maybe work with it using some kind of GUI tool. “Great,” you thought, “all I have to do is import the data into the RDBMS. Then I can slice, dice and chart to my heart’s content.”

If only life were that simple.

You'd been working with the time series reports, but found they didn't fit a database construct comfortably. For example, the reports add a new column for each date. In database terms, that's ALTER DATABASE which is an uncommon way to add new data - more usually it's done by adding a row using INSERT. There were other issues. You decided to work with the daily reports instead, combinign them into one big text file to import into the database.

But the daily reports had some problems too. Some of the locations names were inconsistent: in early days “Mainland China” was used, later it was just “China”. “South Korea” became “Korea, South” then “South Korea” again. As the virus spread, coverage expanded to include more countries, provinces/states and US counties. The date format flipped back and forth a few times from m/d/yyyy to yyyy-mm-dd. A unique text key for each location was eventually added. A table of numeric universal IDs (UIDs) for each location was also created, but the UIDs are not used in the main data files, only the combined key text.

It also turns out that the reports are issued daily but the data isn’t necessarily new each day. Some jurisdictions report much less frequently, so including their data for each new date would create duplication.

It became a classic Extract, Transform, Load process.

ETL Plan[edit]

Extract fixed location data (Latitude, Longitude, etc) into separate tables for US_Counties, Provinces_States and Countries_Regions with Combined_Key string for each.

Mainland China -> China
Korea, South -> South Korea

Extract date from Last_Updated column and restructure to ISO date standard (yyyy-mm-dd).

Add Combined_Key column where missing and populate with concatenation of US_County, Province/State and Country/Region.

Drop unnecessary fields.

Write the lines to an output file using the date as the filename.

Once all the files have been processed, combine them all into one big Python set (to eliminate duplicates) and save to a file. Finally, import the data into the database.

Eric – New Project[edit]

Start Eric and select Project, New. The Project Properties window will be displayed. It’s mostly self-explanatory. This is a quick-and-dirty text-based project so the Project Type is Console.