Collect / Cleanse / Enrich / Store / Query / Visualize (Part I)
Data Mining, Data Science, Big Data, call it what you want, all of those sound a bit scary. We usually see the result of that type of work as great visualizations illustrating thoughtful studies (best case scenario) or simply a tasteless and sad pie chart (worst case scenario), but not what the process entails.
Leaving aside the business and the analysis components of data mining to focus on its more technical level, we are roughly left with Collect / Cleanse / Enrich / Store / Query / Visualize.
That definitely sounds a bit more friendly and if you are willing to start getting into it, no rocket science is required. It's simpler than you think. The following is going to be a simple and practical exercise of that process.
For those who want to be initiated, the purpose of these articles is to bring a humble light to that process in a comprehensive manner by making available the scripts for you to download, inspect thoroughly, and reuse as much as you want. It won't hurt, I promise!
All of that will be broken down in two parts:
- Collect / Cleanse / Enrich (in this first post)
- Store / Query / Visualize (second upcoming post)
No fancy and expensive software required, that would betray the spirit of open source -- open knowledge -- that I like. Before going ahead let me quote the wise words of Meta S. Brown from her book Data Mining For Dummies.
How data miners spend their time
Cooks who serve delicious dinners spend a lot of time chopping vegetables. Runners who win races spend a lot of time stretching and training. Data miners who develop valuable predictive models spend a lot of time preparing data.
People who havenít tried data mining sometimes think that it is a nonstop thrill ride of discovering great insights and developing powerful models. It isnít. Most of your time goes to doing all the things that must be done right before you can start building models. Data preparation isnít the most glamorous aspect of the job.
Itís painstaking work, and you have a lot to do, so much that data miners spend more time preparing data than doing anything else. Yet data preparation is worth the effort, because it makes meaningful discovery possible.
If you haven't, read it please. Now, bear with me.
What we need to make it work
Python is a high-level and powerful programming language, very popular among Data Analysts. PHP is not as serviceable in that regard but the fact it's a very common server-side scripting language designed for web development makes it even more widely popular. This is an advantage if you want to introduce data processing to the largest audience possible so It will serve our educational purpose here.
From a simple TXT file (plain of CSV flavor) to a database. For the latter we'll use MySQL because it is popular, open-source, and available at the snap of your fingers.
How do I get all this? XAMPP to the rescue
Install only these components, the rest are not needed.
That will create a folder named xampp in your hard drive, like c:\xampp in Windows.
If you already have a LAMP host account for your blog or website you can use it instead of making your laptop / desktop play the role of local web server. Up to you. In that case be careful and make sure the scripts have the right permissions to write files in the folders of the project.
This is all about data, we need some. Corruption in Spain has been ramping up since the dictatorial era of General Franco was over. We became a democracy 'supervised' by the old structures of power that transitioned to the new ones. One of the latest corruption scandals is the one known as Caja Madrid Opaque Credit Cards.
Eighty six former senior executives and board members of a savings bank that had to be bailed out with public money lavishly used credit cards for a total of Ä15.5 million while, ironically, they were prescribing austerity measures for the rest of the citizens. The data is in those Excel files and represents more than 82 thousand credit card transactions. Nice guys. Great topic eh?
The thing is most of the relevant newspapers made some analysis and put together some charts but none made the raw data available with one exception. This data will be the starting point of our work.
That data and the way we can retrieve it makes a nice case for our project besides the personal interest in keeping it available. The outcome is right here, ready to download in CSV (compressed) and Excel format:
At the end of the post there is more information on how I got extra data from a Tableau public chart to enrich the previous one with a couple of cheap tricks.
As someone I admire would put it (paraphrasing quite a lot): The outcome of anything paid with public money, whatever the yield is, goods, intellectual work, and its associated data, should remain public and 100% accessible to anyone specially if we want a healthy democracy. I'm lucky enough to count Javier de la Cueva (@jdelacueva) as a friend.
Before we make it work
Download the scripts creditcardsdataproject.zip. Uncompress and place folder 'creditcardsdataproject' under root folder 'htdocs' or whatever is the root folder for your local web server or remote (it could be 'public_html').
Note: all the code could be placed in a single php file but I think it will be easy to understand what the code does and reuse it if chopped in smaller scripts. It also lets me keep an eye on how much memory and time each task requires for some script performance improvement purposes.
Putting the scripts to work for us
First, a general overview of the PHP stuff used to achieve our purpose:
- cURL http://php.net/manual/en/book.curl.php
Used to grab the content of web pages
- DOMXPath http://php.net/manual/en/class.domxpath.php
Used to extract information from pages we got using cURL
- PHPExcel http://www.phpexcel.net
Powerful library that lets us open, manipulate, and save Excel and CSV files
- Some Filesystem functions http://php.net/manual/en/book.filesystem.php
In order to create new files (fopen), place content (fwrite) and save them locally (fclose)
- Basic string functions to cleanse data http://php.net/manual/en/ref.strings.php
Nothing really fancy or complicated.
Open your browser, make sure your local web server is running, and go to http://localhost/creditcardsdataproject/. You will see something like:
Easy to follow instructions there, don't click anything yet!
Let's see what those scripts do.
Collect step 1. Scraping URLs from a web page
This webpage lists a bunch of excel files hosted under that same domain that we want to download for later processing as they hold the data we want to deal with.
That 01-get-xlsx-urls.php script fetches the HTML document using cURL, and using DOMXPath collects all the links included in the table pointing to XSLX (Excel) files. Then it writes or stores to a local simple TXT file all those URLs using file functions (fopen, fwrite, fclose).
Now you can click on the first link that points to 01-get-xlsx-urls.php.
Collect step 2. Downloading Excel files
Now that we have the complete list of files we want (109!) it is time to download them to a local folder /xslx/ using script 02-download-xlsx.php.
Again, we use simple file functions to open xlsx-urls.txt to read line by line, using cURL to verify if those URLs point to something available and if so, fetch and save them to a local folder.
Click on 02-download-xlsx.php. You will see how they appear in your hard drive. Also script 02-list-xlsx.php will list them for you once finished.
Step 3. Parse / Cleanse / Complete / Export Data
This is where things get a bit more interesting. We have the data stored in 109 Excel files which is not very handy. That data needs some cleansing as many rows hold no real information or just data we can consider garbage.
To achieve higher levels of analysis we need to complete those records with data coming from other sources, like the party of each of those guys belong to so we can calculate what party has a higher level of corruption calculated in amount of money spent using the secret credit cards and such. Source is another Excel file, extra_data/descripcion-consejeros.xlsx and as commented I'll explain how I got that later on.
03-xlsxs-to-single-csv-xslx.php starts by opening that \extra_data\descripcion-consejeros.xlsx file with the PHPExcel library mentioned and placing its contents into an array for later use.
Before parsing all the Excel files downloaded to folder \xslx\ the script creates a CSV file where all the cleansed and enriched data will be written line by line. Then we add the first line with the headers or column names.
Time to cycle through the downloaded excel files. For each file:
- Open with PHPExcel library
- Read contents and place them in an array
- Loop through the array
- Cleanse data, lower the casing, capitalize, apply number formatting and such
- Search in the $descriptionsArray for matches using NIF value (works as unique ID) and grab the corresponding values we want in the final CSV to enrich it
- Put together all the values in a single line, properly formatted and separated by commas
- Write every line to the open CSV
- Close the CSV file once all the XLSX files have been parsed
Let's add a bit more:
Just because it only takes a few extra lines of code the script does a couple of tasks more:
- Compresses the CSV file into a ZIP, saving disk space (19.4MB cut down to 1.4MB)
- Creates a XLSX Excel file from the CSV in case you want to use that popular tool to create some pivot tables and charts
Main job is done now. We have a nice CSV with pristine and enriched data ready to move forward. Next step could be:
- Open that into your favorite data visualization software like Tableau, Qlick, or even Excel and start digging and charting right away
- Import it to a database like MySQL but we'll leave that for the next article dealing with the Store / Query / Visualize part of the journey
You might have discovered by now that I'm not the most dexterous programmer out there but hey, nothing breaks and it works!
Cheap tricks. Extract data from Tableau public embeded workbooks
As commented previously I was looking for some extra data related to those corrupt executives that could help performing a deeper analysis. For example, interesting to know how many years they were in the position, what position, party or union and such. That was not available in the initial Excel files but I found it in an online newspaper.
Finding the Tableau workbook's URL
It depends how a public workbook is embed n a web page, the Download' link typically located in the right lower corner is hidden. When that happens right click your mouse somewhere close and using browser's console won't take you much time to find an URL looking like public.tableausoftware.com...
In this case it was: http://public.tableausoftware.com/views/cajamadrid-tarjetasblack/Dashboard1?:embed=y&:showVizHome=no&:host_url=https://public.tableausoftware.com/&:tabs=no&:toolbar=yes&:animate_transition=yes&:display_static_image=no&:display_spinner=yes&:display_overlay=yes&:display_count=yes&:loadOrderID=0
There you have the download link. Click and get the workbook caja madrid-tarjetas black.twbx' in your computer. Next step is download Tableau Public and once installed open the workbook.
Getting the raw data from the Workbook
There is no option to export the raw data from the workbook but you can get it anyway in few seconds.
Click on the first Sheet of the workbook, find and click the View Data' icon and when the window pops up click on the top left corner to select all the cells at once. Copy and paste it into Excel and save or directly into a plain text file and save it as 'tab-separated values' TSV file.
Thanks Belen for that workbook, Cuartopoder.es for the excel files, Javier for the inspiration and Heather for proofreading.
Illustration at the top by freepik.com