Version Control with Git and Representing Data

http://www.phdcomics.com/comics/archive/phd101212s.gif

Version Control with Git

Why Git?

http://www.phdcomics.com/comics/archive/phd101212s.gif

A mental model of Git

Activity

Today we will build a mental model of Git so that you understand what you have been doing when you type commands into the Terminal.

Click this link and add your mental model.

../../../_images/git_model_googledoc.png

What is Git?

https://www.nobledesktop.com/image/blog/git-branches-merge.png

Image Source: Noble Desktop.

The GitHub Flow

Source: GitHub Guides (https://guides.github.com/introduction/flow/)

Activity (continued)

Let’s now add the “After” version of your mental models.

Click this link and add your mental model.

../../../_images/git_model_googledoc.png

Data Representation

Objectives

  • Define: computer, software, memory, data, memory size/data size, cloud

  • Explain “Big Data” and describe data growth in the coming years.

  • Explain the role of metadata for interpreting data.

  • Define: file, file encoding, text file, binary file

Computer Terminology

There is a tremendous amount of terminology related to technology. Using terminology precisely and correctly demonstrates understanding of a domain and simplifies communication. We will introduce terminology as needed.

Basic Computer Terminology

  • A computer is a device that can be programmed to solve problems.

  • Hardware includes the physical components of a computer

    • (e.g. central processing unit, monitor, keyboard, computer data storage, graphics card, speakers).

  • Software is the set of programs that a computer follows to perform functions

    • (e.g. operating system, internet browser).

  • Memory is a piece of technology that allows the computer to store data either temporarily (lost when the computer reboots, e.g. RAM) or permanently (preserved even if power is lost, e.g. hard drive).

  • There are many different technologies for storing data, with varying performance.

  • Some live inside your computer while others are portable and can be used on different devices (e.g. USB drives).

“The Cloud”

“The Cloud” is not part of your computer but rather a network of distributed computers on the Internet that provides storage, applications, and services for your computer.

Examples:

  • Dropbox is a cloud service that allows you to store your files on machines distributed on the Internet. It automatically synchronizes any files in your Dropbox folder with all your machines.

  • iCloud is an Apple service that stores and synchronizes your data, music, apps, and other content across Apple devices.

  • Google Docs is a cloud application that lets you write, edit, and collaborate wherever you are, for free.

What is data?

Data: information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.

Cambridge Dictionary

However, it can be argued (see this article for example) that data != information. In addition, one might refer to raw data as a collection of numbers or facts that have no meaning until they have been analyzed or given context.

How is data measured?

  • Computers represent data digitally, meaning that data is represented using discrete units called bits (binary digits).

  • The real world is analog: information is encoded on a continuous signal (a spectrum of values, i.e. infinitely many sounds or colours).

“Like with the artist’s abstract composition, the trick is to take all of the real-world sound, picture, number, etc. data that we want in the computer and convert it into the kind of data that can be represented in (on/off) switches.” – University of Rhode Island

These “on/off” switches are called “bits”, and a bit can have a state of 1 (on) or 0 (off). All data is represented, at its essence, as collections of bits. A byte contains 8 bits. Your internet speed is measured in either kilo/mega/giga bits per second (bps) or kilo/mega/giga bytes per second (Bps).
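The relationship between bits, bytes, and transfer speeds can be sketched in a few lines of Python. The function names below are illustrative only and are not part of the course material.

```python
BITS_PER_BYTE = 8

def distinct_values(n_bits: int) -> int:
    """Number of different states that n_bits on/off switches can represent."""
    return 2 ** n_bits

def megabits_to_megabytes(megabits: float) -> float:
    """Convert megabits (Mb) to megabytes (MB): divide by 8 bits per byte."""
    return megabits / BITS_PER_BYTE

print(distinct_values(1))          # 2 states: on or off
print(distinct_values(8))          # 256 states in one byte
print(format(65, "08b"))           # 01000001, the eight bits of the number 65
print(megabits_to_megabytes(100))  # 12.5, so 100 Mbps is 12.5 MB per second
```

This is why a connection advertised in megabits per second downloads files more slowly than the same number in megabytes per second would suggest.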

Units of Data

| Unit | Size |
|---|---|
| byte (B) | 8 bits |
| kilobyte (KB) | 1000 bytes |
| megabyte (MB) | 10^6 bytes (or 1000 KB) |
| gigabyte (GB) | 10^9 bytes (or 1000 MB) |
| terabyte (TB) | 10^12 bytes (or 1000 GB) |
| petabyte (PB) | 10^15 bytes (or 1000 TB) |
| exabyte (EB) | 10^18 bytes (or 1000 PB) |
| zettabyte (ZB) | 10^21 bytes (or 1000 EB) |
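To compare sizes given in different units, it helps to convert everything to bytes first. The following is a minimal sketch that assumes the decimal (powers of 1000) units listed above; the helper names `to_bytes` and `human_readable` are made up for this example.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def to_bytes(value: float, unit: str) -> float:
    """Convert a size such as (3, "MB") to a number of bytes."""
    exponent = UNITS.index(unit)      # B -> 0, KB -> 1, MB -> 2, ...
    return value * 1000 ** exponent

def human_readable(n_bytes: float) -> str:
    """Express a byte count in the largest unit that keeps the value below 1000."""
    value = float(n_bytes)
    for unit in UNITS:
        if value < 1000 or unit == UNITS[-1]:
            return f"{value:g} {unit}"
        value /= 1000

print(to_bytes(3, "MB"))              # 3000000
print(human_readable(1500))           # 1.5 KB
print(human_readable(2_500_000_000))  # 2.5 GB
```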

Encoding other data

Documents, music, and videos that we commonly use are encoded in a much more complex way, but the principle is the same: everything boils down to bits and bytes.

As we learn more about representing information, always remember that everything is stored as bits; it is only by interpreting them in context that we get information.
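As a small illustration of how context turns bits into information, the sketch below takes the same four bytes and reads them as text, as an integer, and as a floating-point number. The byte values are arbitrary and were chosen only because they also form readable ASCII.

```python
import struct

raw = bytes([72, 105, 33, 33])   # four bytes: 0x48 0x69 0x21 0x21

as_bits = " ".join(format(b, "08b") for b in raw)   # the underlying bits
as_text = raw.decode("ascii")                        # read as ASCII characters
as_uint = struct.unpack("<I", raw)[0]                # read as one 32-bit unsigned integer
as_float = struct.unpack("<f", raw)[0]               # read as one 32-bit float

print(as_bits)    # 01001000 01101001 00100001 00100001
print(as_text)    # Hi!!
print(as_uint)    # a large integer
print(as_float)   # a very small number
```

The bytes never change; only the rule used to interpret them does.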

Let’s go through a few more definitions of types of data.

Metadata

Metadata is data that describes other data.

Examples of metadata include:

  • names of files

  • column names in a spreadsheet

  • table and column names and types in a database

Metadata helps you understand how to interpret and manipulate the data.

Files

A file is a sequence of bytes on a storage device.

  • A file has a name.

  • A computer reads the file from a storage device into memory to use it.

The operating system manages how to store and retrieve the file bytes from the device.

The program using the file must know how to interpret those bytes based on its information (e.g. metadata) on what is stored in the file.
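The points above can be tried out with a short sketch: it writes a few bytes to a temporary file, asks the operating system for some of the file's metadata, and reads the bytes back into memory. The file name `example.txt` is hypothetical.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "example.txt")   # hypothetical file name

    # Store a sequence of bytes on the storage device.
    with open(path, "wb") as f:
        f.write(b"data\n")

    # Metadata kept by the operating system: the name and the size in bytes.
    file_name = os.path.basename(path)
    file_size = os.stat(path).st_size

    # Read the bytes from the device back into memory.
    with open(path, "rb") as f:
        content = f.read()

print(file_name)       # example.txt
print(file_size)       # 5
print(list(content))   # [100, 97, 116, 97, 10]
```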

File Encoding

A file encoding is how the bytes represent data in a file.

A file encoding is usually indicated by the file extension (e.g. .txt or .xlsx), which tells the operating system (OS) how to process the file.

The extension allows the OS to select the program to use.
The program understands how to process the file in its format.
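A rough sketch of this lookup is shown below. The association table is hypothetical and far smaller than what a real operating system maintains; it only illustrates that the choice is driven by the extension, not by the bytes inside the file.

```python
from pathlib import Path

# Hypothetical association table from extension to default program.
DEFAULT_PROGRAM = {
    ".txt": "text editor",
    ".csv": "spreadsheet program",
    ".xlsx": "spreadsheet program",
    ".png": "image viewer",
}

def program_for(filename: str) -> str:
    """Pick a program based only on the file extension."""
    extension = Path(filename).suffix.lower()
    return DEFAULT_PROGRAM.get(extension, "unknown: ask the user")

print(program_for("fileEncodings.xlsx"))  # spreadsheet program
print(program_for("notes.TXT"))           # text editor
print(program_for("README"))              # unknown: ask the user
```

Renaming a file changes which program is selected, but it does not change the bytes stored in the file.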

Binary vs Text files

  • At a generic level of description, there are two kinds of computer files: text files and binary files.

  • The difference between binary and text files is in how these bytes are interpreted.

  • A text file is a file encoded in a character format such as ASCII or Unicode. These files are readable by humans.

  • Data analytics will often involve processing text files.

  • We can usually tell if a file is binary or text based on its file extension.
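To see the difference between the text view and the byte view of the same content, here is a minimal sketch using Python's built-in `str.encode` and `bytes.decode`. The sample word is arbitrary.

```python
text = "caf\u00e9"               # the word "café"; \u00e9 is the character é

encoded = text.encode("utf-8")   # characters -> bytes, using a character encoding
print(encoded)                   # b'caf\xc3\xa9'
print(len(text), len(encoded))   # 4 5

# Decoding with the same encoding recovers the text.
print(encoded.decode("utf-8") == text)   # True

# Decoding with the wrong encoding fails: these bytes are not valid ASCII.
try:
    encoded.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)                  # False
```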

File Encodings: Text Files

There are many different text file encodings:

  • Web standards: html, xml, css, svg, json, …

    • JSON file: data encoded in JSON (JavaScript Object Notation) format

    • XML file: data encoded in XML (Extensible Markup Language) format

  • Tabular data: csv, tsv, …

    • CSV (comma-separated) file: each line is a record, fields separated by commas

    • TSV (tab-separated) file: each line is a record, fields separated by tabs

  • Documents: txt, tex, markdown, asciidoc, rtf, ps, …
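As an illustration, the same two records can be written in several of these text encodings with Python's standard `csv` and `json` modules. The records and the helper `to_delimited` are invented for this sketch.

```python
import csv
import io
import json

# Hypothetical records used only for illustration.
rows = [
    {"name": "Ada", "score": 90},
    {"name": "Grace", "score": 85},
]

def to_delimited(records, delimiter):
    """Write the records as delimiter-separated text, one record per line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=list(records[0]), delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()

csv_text = to_delimited(rows, ",")    # fields separated by commas
tsv_text = to_delimited(rows, "\t")   # fields separated by tabs
json_text = json.dumps(rows)          # JSON array of objects

print(csv_text)
print(tsv_text)
print(json_text)
```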

Question:
In these file encodings, what is data and what is metadata?
metadataQA1 metadataQA2

File Encodings: Binary File

A binary file encodes data in a format that is not designed to be human-readable and is in the format used by the computer.

Binary files are often faster to process as they do not require translation from text form and may also be smaller.

Processing a binary file requires the user to understand its encoding so that the bytes can be read and interpreted properly.
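A minimal sketch with Python's `struct` module shows why the encoding must be known. The record layout here (a 2-byte unsigned integer followed by a 4-byte float, little-endian) is a made-up format, not one used by any particular program.

```python
import struct

RECORD_FORMAT = "<Hf"   # hypothetical layout: 2-byte unsigned int, then 4-byte float

# Encode one record as 6 raw bytes.
packed = struct.pack(RECORD_FORMAT, 513, 1.5)
print(len(packed))                            # 6

# Reading with the correct layout recovers the values.
print(struct.unpack(RECORD_FORMAT, packed))   # (513, 1.5)

# Reading the same bytes with a wrong layout gives meaningless numbers.
wrong = struct.unpack("<HHH", packed)
print(wrong)                                  # (513, 0, 16320)

# The same values written as text take a different number of bytes.
print(len("513,1.5".encode("ascii")))         # 7
```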

There are many different binary file encodings:

  • Image: jpg, png, gif, bmp, tiff, psd, …

  • Videos: mp4, mkv, avi, mov, mpg, vob, …

  • Audio: mp3, aac, wav, flac, ogg, mka, wma, …

  • Documents: pdf, doc, xls, ppt, docx, odt, …

  • Archive: zip, rar, 7z, tar, iso, …

  • Database: mdb, accde, frm, sqlite, …

  • Executable: exe, dll, so, class, …

Activity

Here are the instructions for this activity:

  1. Download the fileEncodings.xlsx file.

  2. Open fileEncodings.xlsx in Microsoft Excel.

  3. Save the fileEncodings.xlsx file as a “CSV” (comma-separated values) file. Open it in VS Code to see how it looks.

  4. Save the fileEncodings.xlsx file as a “TSV” (tab-separated values) file. Open it in VS Code to see how it looks.

  5. Save the fileEncodings.xlsx file as an “XML” (Extensible Markup Language) file. Open it in VS Code to see how it looks.

Note: Other encodings are also possible, including JSON and YAML.

Look at each file using VS Code.
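The same kind of conversion can also be done in code. The sketch below turns a small piece of CSV text into XML using Python's standard `csv` and `xml.etree.ElementTree` modules. The sample data and the element names `rows` and `row` are assumptions for illustration; the XML that Excel produces is laid out differently.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical CSV content standing in for an exported spreadsheet.
csv_text = "name,score\nAda,90\nGrace,85\n"

root = ET.Element("rows")
for record in csv.DictReader(io.StringIO(csv_text)):
    row = ET.SubElement(root, "row")
    for column, value in record.items():
        # The column name becomes the tag; the cell value becomes the text.
        ET.SubElement(row, column).text = value

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```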

Try It: File Encodings
TryItFileEncoding

Opening xlsx in Excel
inExcel

Opening csv in text editor

Note: This screenshot is old and does not use VS Code, but the data will look similar.

csv

Opening tab-separated in text editor

Note: This screenshot is old and does not use VS Code, but the data will look similar.

tabdelimit

Opening xml in text editor

Note: This screenshot is old and does not use VS Code, but the data will look similar.

xml

UPC Barcodes

Universal Product Codes (UPC) encode the manufacturer on the left side and the product on the right side. Each digit uses 7 bits, with different bit combinations for each side, so a scanner can tell if the code is upside down.
UPCbarcode
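The 7-bit digit patterns can be sketched as follows. The table lists the left-side patterns of the standard UPC-A symbology (1 is a dark module, 0 is a light one), and the right-side pattern of a digit is its bitwise complement. The function names are made up for this example, and guard bars and the check digit are left out.

```python
# Left-side 7-bit patterns for each digit in UPC-A.
LEFT_PATTERNS = {
    "0": "0001101", "1": "0011001", "2": "0010011", "3": "0111101",
    "4": "0100011", "5": "0110001", "6": "0101111", "7": "0111011",
    "8": "0110111", "9": "0001011",
}

def left_pattern(digit: str) -> str:
    return LEFT_PATTERNS[digit]

def right_pattern(digit: str) -> str:
    """Right-side pattern: every bit of the left-side pattern flipped."""
    return "".join("1" if bit == "0" else "0" for bit in LEFT_PATTERNS[digit])

def side_of(pattern: str) -> str:
    """Left patterns have an odd number of 1s, right patterns an even number."""
    return "left" if pattern.count("1") % 2 == 1 else "right"

print(left_pattern("4"))                  # 0100011
print(right_pattern("4"))                 # 1011100
print(side_of(right_pattern("4")[::-1]))  # right
```

Reversing a pattern does not change how many 1s it contains, so a scanner that meets even-parity patterns first can conclude that it is reading the code from the right-hand end.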

QR code

A QR (Quick Response) code is a 2D optical encoding with support for error correction, developed in 1994 by Denso Wave, a Toyota group company.

QR1

Make your own codes at: www.qrstuff.com.

NATO Broadcast Alphabet

The code for broadcast communication is purposefully inefficient, to be distinctive when spoken amid noise.

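| Letter | Meaning | Letter | Meaning | Letter | Meaning |
|---|---|---|---|---|---|
| A | Alpha | J | Juliet | S | Sierra |
| B | Bravo | K | Kilo | T | Tango |
| C | Charlie | L | Lima | U | Uniform |
| D | Delta | M | Mike | V | Victor |
| E | Echo | N | November | W | Whiskey |
| F | Foxtrot | O | Oscar | X | X-ray |
| G | Golf | P | Papa | Y | Yankee |
| H | Hotel | Q | Quebec | Z | Zulu |
| I | India | R | Romeo | | |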
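As a small illustration of this encoding, the sketch below spells a word with the code words from the table and compares the lengths. The function name `spell` is made up for this example.

```python
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliet",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell(word: str) -> list[str]:
    """Replace each letter with its code word; other characters are skipped."""
    return [NATO[ch] for ch in word.upper() if ch in NATO]

spelled = spell("Git")
print(spelled)                             # ['Golf', 'India', 'Tango']
print(len("Git"), len("".join(spelled)))   # 3 14
```

Three letters become fourteen, which is the deliberate inefficiency: the extra length makes each letter hard to confuse with another.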

Clicker Question: Memory Size

Example 1
Which is biggest?
A) 10 TB
B) 100 GB
C) 1,000,000,000,000 bytes
D) 1 PB

Clicker Question: Metadata vs. Data

Example 2
How many of the following are TRUE?

  • It is possible to have data without metadata.

  • Growth rates of data generation are decreasing.

  • It is possible to represent all decimal numbers precisely on a computer.

  • A character encoded in Unicode uses twice as much space as ASCII.

A) 0 B) 1 C) 2 D) 3 E) 4

Conclusion

  • All data is encoded as bits on a computer.

  • Metadata provides the context to understand how to interpret the data to make it useful.

  • Memory capacity and data sizes are measured in bytes.

  • Files are sequences of bytes stored on a device.

  • A file encoding is how the bytes are organized to represent data.

    • Text files (comma/tab separated, JSON, XML) are often processed during data analytics tasks.

    • Binary files are usually only processed by the program that creates them.

As a data analyst, understanding the different ways of representing data is critical as it is often necessary to transform data from one format to another.