Version Control with Git and Representing Data¶

A mental model of Git¶
Activity¶
Today we will build a mental model of Git so you can understand what you have been doing when you type commands into the Terminal.
Click this link and add your mental model.

The GitHub Flow¶
IFrame('https://guides.github.com/introduction/flow/',width=1200,height=800)
Markdown('Source: GitHub Guides')
Activity (continued)¶
Let’s now add the “After” version of your mental models.
Click this link and add your mental model.

Data Representation¶
Objectives¶
Define: computer, software, memory, data, memory size/data size, cloud
Explain “Big Data” and describe data growth in the coming years.
Explain the role of metadata for interpreting data.
Define: file, file encoding, text file, binary file
Computer Terminology¶
There is a tremendous amount of terminology related to technology. Using terminology precisely and correctly demonstrates understanding of a domain and simplifies communication. We will introduce terminology as needed.
Basic Computer Terminology¶
A computer is a device that can be programmed to solve problems.
Hardware includes the physical components of a computer
(e.g. central processing unit, monitor, keyboard, computer data storage, graphics card, speakers).
Software is the set of programs that a computer follows to perform functions
(e.g. operating system, internet browser).
Memory is technology that allows the computer to store data either temporarily (lost when the computer reboots, e.g. RAM) or permanently (preserved even if power is lost, e.g. hard drive).
There are many different technologies for storing data with varying performance.
Some live inside your computer while others are portable and can be used on different devices (e.g. USB drives).
“The Cloud”¶
“The Cloud” is not part of your computer but rather a network of distributed computers on the Internet that provides storage, applications, and services for your computer.
Examples:
Dropbox is a cloud service that allows you to store your files on machines distributed on the Internet. It automatically synchronizes any files in a folder across all your machines.
iCloud is an Apple service that stores and synchronizes your data, music, apps, and other content across Apple devices.
Google Docs lets you write, edit, and collaborate wherever you are, for free.
What is data?¶
Data: information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.
However, it can be argued (see this article for example) that data != information. In addition, one might refer to raw data as a collection of numbers or facts that have no meaning until they have been analyzed or interpreted.
How is data measured?¶
Computers represent data digitally, meaning that data is represented using discrete units called bits (binary digits).
The real world is analog: information is encoded on a continuous signal (a spectrum of values, i.e. infinitely many possible sounds and colours).
“Like with the artist’s abstract composition, the trick is to take all of the real-world sound, picture, number, etc. data that we want in the computer and convert it into the kind of data that can be represented in (on/off) switches.” – University of Rhode Island
These “on/off” switches are called “bits”, and a bit can have a state of 1 (on) or 0 (off). All data is represented, at its essence, as collections of bits. A byte contains 8 bits. Your internet speed is measured in either kilo/mega/giga bits per second (bps) or kilo/mega/giga bytes per second (Bps).
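The relationship between bits and bytes can be sketched in a few lines of Python. The function names below are made up for illustration; the only facts used are that a byte holds 8 bits and that a speed in bits per second is 8 times the same speed in bytes per second.

```python
BITS_PER_BYTE = 8

def to_bits(value: int) -> str:
    """Return the 8-bit binary string for an integer between 0 and 255."""
    if not 0 <= value <= 255:
        raise ValueError("a single byte holds values from 0 to 255")
    return format(value, "08b")

def megabits_to_megabytes_per_second(mbps: float) -> float:
    """Convert a speed in megabits per second to megabytes per second."""
    return mbps / BITS_PER_BYTE

print(to_bits(5))            # 00000101
print(to_bits(255))          # 11111111
print(2 ** BITS_PER_BYTE)    # 256 distinct values fit in one byte
print(megabits_to_megabytes_per_second(100))  # 12.5
```

So a connection advertised as 100 megabits per second moves at most 12.5 megabytes per second.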
Units of Data¶
Unit | Size |
---|---|
byte (B) | 8 bits |
kilobyte (KB) | 1000 bytes |
megabyte (MB) | 10^6 bytes (or 1000 KB) |
gigabyte (GB) | 10^9 bytes (or 1000 MB) |
terabyte (TB) | 10^12 bytes (or 1000 GB) |
petabyte (PB) | 10^15 bytes (or 1000 TB) |
exabyte (EB) | 10^18 bytes (or 1000 PB) |
zettabyte (ZB) | 10^21 bytes (or 1000 EB) |
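As a rough sketch of how these units relate, the helper below turns a raw byte count into the largest unit from the table that keeps the number at or above 1. The name `human_readable` is hypothetical, and the sketch assumes the decimal (powers of 1000) units listed above.

```python
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB"]

def human_readable(num_bytes: float) -> str:
    """Express a byte count using the decimal units from the table."""
    size = float(num_bytes)
    for unit in UNITS:
        # Stop once the value drops below 1000, or at the largest unit.
        if size < 1000 or unit == UNITS[-1]:
            return f"{size:g} {unit}"
        size /= 1000

print(human_readable(999))        # 999 B
print(human_readable(1_500))      # 1.5 KB
print(human_readable(2 * 10**9))  # 2 GB
```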
Encoding other data¶
Documents, music, and videos that we commonly use are encoded in a much more complex way, but the principle is the same: everything boils down to bits and bytes.
As we learn more about representing information, always remember that everything is stored as bits; it is only by interpreting them in context that we get information.
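To illustrate that the same bits mean different things depending on how they are interpreted, here is a small sketch using four arbitrarily chosen bytes. It reads them as ASCII text, as an unsigned integer, and as a 32-bit floating-point number; the byte values are an example, not anything special.

```python
import struct

raw = bytes([0x44, 0x41, 0x54, 0x41])  # four bytes = 32 bits

# The bits themselves, one byte per group.
print(" ".join(format(b, "08b") for b in raw))

as_text = raw.decode("ascii")                  # interpret as characters
as_int = int.from_bytes(raw, byteorder="big")  # interpret as one integer
as_float = struct.unpack(">f", raw)[0]         # interpret as a 32-bit float

print(as_text)   # DATA
print(as_int)    # 1145132097
print(as_float)  # roughly 773.3
```

Nothing in the four bytes says which reading is "right"; that knowledge has to come from outside the data.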
Let’s go through a few more definitions of types of data.
Metadata¶
Metadata is data that describes other data.
Examples of metadata include:
- names of files
- column names in a spreadsheet
- table and column names and types in a database
Metadata helps you understand how to interpret and manipulate the data.
Files¶
A file is a sequence of bytes on a storage device.
A file has a name.
A computer reads the file from a storage device into memory to use it.
The operating system manages how to store and retrieve the file bytes from the device.
The program using the file must know how to interpret those bytes based on its information (e.g. metadata) on what is stored in the file.
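The points above can be tried out with a minimal sketch: write a tiny file, read it back as raw bytes, and ask the operating system for some of its metadata. The file name `hello.txt` and its contents are made up for this example, and the file lives in a temporary folder that is deleted afterwards.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "hello.txt")  # hypothetical file name

    # newline="\n" keeps the stored bytes identical on every platform.
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write("Hi!\n")

    # Reading in binary mode returns the stored bytes without interpretation.
    with open(path, "rb") as f:
        raw = f.read()

    # The operating system keeps metadata about the file, such as its size.
    file_name = os.path.basename(path)
    size_in_bytes = os.stat(path).st_size

print(list(raw))                 # [72, 105, 33, 10]
print(file_name, size_in_bytes)  # hello.txt 4
```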
File Encoding¶
A file encoding is how the bytes represent data in a file.
A file encoding is usually indicated by the file extension (e.g. .txt or .xlsx), which allows the operating system (OS) to know how to process the file.
The extension allows the OS to select the program to use.
The program understands how to process the file in its format.
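Python's standard `mimetypes` module does something loosely analogous to the OS lookup described here: it guesses a file's type from its name alone, without looking at the bytes. This is only an illustration of the idea, not how any particular operating system implements it, and the file names are invented.

```python
import mimetypes
from pathlib import Path

def guess_kind(file_name: str):
    """Guess a file's type from its extension only (the bytes are never read)."""
    kind, _ = mimetypes.guess_type(file_name)
    return kind

for name in ["notes.txt", "table.csv", "page.html", "mystery"]:
    print(f"{name}: extension={Path(name).suffix!r}, guessed type={guess_kind(name)}")
```

A file with no extension gives the guesser nothing to work with, and a misleading extension produces a wrong guess, because renaming a file does not change its bytes.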
Binary vs Text files¶
At a generic level of description, there are two kinds of computer files: text files and binary files.
The difference between binary and text files is in how these bytes are interpreted.
A text file is a file encoded in a character format such as ASCII or Unicode. These files are readable by humans.
Data analytics will often involve processing text files.
We can usually tell if a file is binary or text based on its file extension.
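When the extension is missing or misleading, one rough heuristic is to check whether the bytes decode cleanly as a character encoding. The sketch below assumes UTF-8 and uses a made-up helper name; it can be fooled (some binary data happens to be valid UTF-8), so treat it as an illustration rather than a reliable detector.

```python
def looks_like_text(raw: bytes) -> bool:
    """Return True if the bytes are valid UTF-8 (a rough hint that this is text)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

csv_bytes = b"name,age\nAda,36\n"
# The first eight bytes of every PNG image file.
png_signature = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

print(looks_like_text(csv_bytes))      # True
print(looks_like_text(png_signature))  # False
```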
File Encodings: Text Files¶
There are many different text file encodings:
- Web standards: html, xml, css, svg, json, …
  - JSON file: data encoded in JSON (JavaScript Object Notation) format
  - XML file: data encoded in XML (Extensible Markup Language) format
- Tabular data: csv, tsv, …
  - CSV (comma-separated values) file: each line is a record, fields separated by commas
  - TSV (tab-separated values) file: each line is a record, fields separated by tabs
- Documents: txt, tex, markdown, asciidoc, rtf, ps, …
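To see how the same records look in several of these text encodings, here is a small sketch using only Python's standard `csv` and `json` modules. The two records and the helper name `to_delimited` are invented for the example.

```python
import csv
import io
import json

# Made-up records used only for illustration.
records = [
    {"name": "Ada", "age": 36},
    {"name": "Grace", "age": 45},
]

def to_delimited(rows, delimiter):
    """Write the rows as delimiter-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["name", "age"], delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

csv_text = to_delimited(records, ",")
tsv_text = to_delimited(records, "\t")
json_text = json.dumps(records, indent=2)

print(csv_text)
print(tsv_text)
print(json_text)
```

In the CSV and TSV output the column names appear once, on the first line; in the JSON output they are repeated inside every record.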
Question:
In these file encodings, what is data and what is metadata?
File Encodings: Binary File¶
A binary file encodes data in a format that is not designed to be human-readable and is in the format used by the computer.
Binary files are often faster to process as they do not require translation from text form and may also be smaller.
Processing a binary file requires the user to understand its encoding so that the bytes can be read and interpreted properly.
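The need to know the encoding can be shown with Python's standard `struct` module. The record layout below (a 2-byte id, a 4-byte count, and a 4-byte float) is a hypothetical format invented for this sketch, not a real file type.

```python
import struct

# Hypothetical layout: little-endian, no padding.
#   H = unsigned 16-bit id, I = unsigned 32-bit count, f = 32-bit float price
RECORD_FORMAT = "<HIf"

raw = struct.pack(RECORD_FORMAT, 7, 120000, 2.5)
print(len(raw))  # 10 bytes in total

# Reading with the correct layout recovers the original values.
decoded = struct.unpack(RECORD_FORMAT, raw)
print(decoded)   # (7, 120000, 2.5)

# Reading the same bytes with a wrong layout (five 16-bit integers)
# raises no error; it just produces different, meaningless numbers.
misread = struct.unpack("<5H", raw)
print(misread)
```

The same values written as the text `7,120000,2.5` would be readable in any editor but would need to be parsed from characters back into numbers before use.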
There are many different binary file encodings:
Image: jpg, png, gif, bmp, tiff, psd, …
Videos: mp4, mkv, avi, mov, mpg, vob, …
Audio: mp3, aac, wav, flac, ogg, mka, wma, …
Documents: pdf, doc, xls, ppt, docx, odt, …
Archive: zip, rar, 7z, tar, iso, …
Database: mdb, accde, frm, sqlite, …
Executable: exe, dll, so, class, …
Activity¶
Here are the instructions for this activity:
- Download the fileEncodings.xlsx file.
- Open fileEncodings.xlsx in Microsoft Excel.
- Save the fileEncodings.xlsx file as a “CSV”, or comma-separated values, file. Open it in VS Code to see how it looks.
- Save the fileEncodings.xlsx file as a “TSV”, or tab-separated values, file. Open it in VS Code to see how it looks.
- Save the fileEncodings.xlsx file as an “XML”, or Extensible Markup Language, file. Open it in VS Code to see how it looks.

Note: Other encodings are also possible, including JSON, YAML, etc.
Look at each file using VS Code.
Opening csv in text editor
Note: This screenshot is old and does not use VS Code, but the data will look similar.

Opening tab-separated in text editor
Note: This screenshot is old and does not use VS Code, but the data will look similar.

Opening xml in text editor
Note: This screenshot is old and does not use VS Code, but the data will look similar.

UPC Barcodes¶
Universal Product Codes (UPC) encode the manufacturer on the left side and the product on the right side. Each digit uses 7 bits, with different bit combinations for each side (so a scanner can tell if the code is upside down).
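A sketch of how the two sides can be told apart: the left-hand patterns below are the UPC-A digit patterns as commonly documented (written here from memory, so treat the table as an assumption), and each right-hand pattern is the left-hand one with every bit flipped. Left-hand patterns then always contain an odd number of 1s and right-hand patterns an even number. The function names are made up for illustration.

```python
# Left-hand 7-bit patterns for digits 0-9 (1 = dark bar, 0 = space).
# Assumed from the commonly documented UPC-A symbology.
LEFT_PATTERNS = {
    0: "0001101", 1: "0011001", 2: "0010011", 3: "0111101", 4: "0100011",
    5: "0110001", 6: "0101111", 7: "0111011", 8: "0110111", 9: "0001011",
}

def right_pattern(digit: int) -> str:
    """Right-hand pattern: the left-hand pattern with every bit flipped."""
    return "".join("1" if bit == "0" else "0" for bit in LEFT_PATTERNS[digit])

def side_of(pattern: str) -> str:
    """Odd number of 1s means left-hand side, even means right-hand side."""
    return "left" if pattern.count("1") % 2 == 1 else "right"

for digit in (0, 5, 9):
    left, right = LEFT_PATTERNS[digit], right_pattern(digit)
    print(digit, left, side_of(left), right, side_of(right))

# Reading a right-hand pattern backwards, as an upside-down scan would,
# keeps its even parity, so it cannot be mistaken for a left-hand digit.
print(side_of(right_pattern(5)[::-1]))  # right
```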
QR code¶
A QR (Quick Response) code is a 2D optical encoding with support for error correction, developed in 1994 by Denso Wave, a Toyota group company.

Make your own codes at: www.qrstuff.com.
NATO Broadcast Alphabet¶
The code for broadcast communication is purposefully inefficient, to be distinctive when spoken amid noise.
Letter | Meaning | Letter | Meaning | Letter | Meaning |
---|---|---|---|---|---|
A | Alpha | J | Juliet | S | Sierra |
B | Bravo | K | Kilo | T | Tango |
C | Charlie | L | Lima | U | Uniform |
D | Delta | M | Mike | V | Victor |
E | Echo | N | November | W | Whiskey |
F | Foxtrot | O | Oscar | X | X-ray |
G | Golf | P | Papa | Y | Yankee |
H | Hotel | Q | Quebec | Z | Zulu |
I | India | R | Romeo | | |
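The table can be treated as a simple encoding: one spoken word per letter. The sketch below spells a word using the table's spellings, with R taken to be Romeo, the standard code word. The function name `spell` is hypothetical, and characters that are not letters are passed through unchanged.

```python
NATO = {
    "A": "Alpha", "B": "Bravo", "C": "Charlie", "D": "Delta", "E": "Echo",
    "F": "Foxtrot", "G": "Golf", "H": "Hotel", "I": "India", "J": "Juliet",
    "K": "Kilo", "L": "Lima", "M": "Mike", "N": "November", "O": "Oscar",
    "P": "Papa", "Q": "Quebec", "R": "Romeo", "S": "Sierra", "T": "Tango",
    "U": "Uniform", "V": "Victor", "W": "Whiskey", "X": "X-ray",
    "Y": "Yankee", "Z": "Zulu",
}

def spell(text: str) -> list:
    """Replace each letter with its code word; skip whitespace."""
    return [NATO.get(ch.upper(), ch) for ch in text if not ch.isspace()]

print(spell("Git"))  # ['Golf', 'India', 'Tango']
```

Three letters become sixteen, which is the deliberate inefficiency: the extra length makes each symbol hard to confuse with another over a noisy channel.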
Clicker Question: Memory Size¶
Example 1
Which is biggest?
A) 10 TB
B) 100 GB
C) 1,000,000,000,000 bytes
D) 1 PB
Clicker Question: Metadata vs. Data¶
Example 2
How many of the following are TRUE?
It is possible to have data without metadata.
Growth rates of data generation are decreasing.
It is possible to represent all decimal numbers precisely on a computer.
A character encoded in Unicode uses twice as much space as ASCII.
A) 0 B) 1 C) 2 D) 3 E) 4
Conclusion¶
All data is encoded as bits on a computer.
Metadata provides the context to understand how to interpret the data to make it useful.
Memory capacity and data sizes are measured in bytes.
Files are sequences of bytes stored on a device.
A file encoding is how the bytes are organized to represent data.
- Text files (comma/tab separated, JSON, XML) are often processed during data analytics tasks.
- Binary files are usually only processed by the program that creates them.
As a data analyst, understanding the different ways of representing data is critical as it is often necessary to transform data from one format to another.