Monday 25th February 2018
Digital Journalism – Big Data Journalism
Introduction to Data Science
Big data refers to information assets characterised by the four Vs:
• Volume: how much?
• Velocity: growth rate?
• Variety: types of data?
• Veracity: reliability/consistency?
While all four Vs are growing, Variety is becoming the single biggest driver of big-data investments.
Data Security and Governance
Big data environments currently need a complex security architecture. Security mechanisms (encryption, obfuscation, loggers, monitors) must protect both data at rest and data in transit.
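As one illustration of the obfuscation mechanism mentioned above, the sketch below replaces a personal identifier with a keyed hash (pseudonymisation) before the record is stored. It is a minimal example under stated assumptions, not a full security model: the function name `pseudonymise`, the key value and the sample email address are all hypothetical, and keyed hashing is obfuscation rather than encryption, since the original value cannot be recovered from the token.

```python
import hashlib
import hmac


def pseudonymise(identifier: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest that stands in for the raw identifier."""
    # Normalise first so trivially different spellings map to the same token.
    normalised = identifier.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalised, hashlib.sha256).hexdigest()


# Hypothetical key for illustration only; a real key would be loaded from a
# secrets store and never written into source code.
SECRET_KEY = b"replace-with-a-real-secret"

token_a = pseudonymise("Reader@Example.com", SECRET_KEY)
token_b = pseudonymise(" reader@example.com ", SECRET_KEY)
token_other_key = pseudonymise("Reader@Example.com", b"a-different-key")
```

The same identifier always yields the same token under one key, so records can still be linked across data sets without exposing the identifier itself.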
1. Get data
2. Clean, Prepare & Manipulate Data
3. Train Model
4. Test Data
Phase 2 (cleaning, preparing and manipulating data) typically accounts for around 80% of the whole analytic process.
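The four phases can be sketched end to end in a few lines of Python. This is a toy illustration only: the records are made up (and deliberately lie on a straight line so the result is easy to check), the "model" is a simple least-squares line, and the helper name `fit_line` is an assumption rather than anything from a particular library.

```python
import random

# Phase 1: get data. Hypothetical (hours_studied, exam_score) records;
# None marks a missing value.
raw = [(1, 52), (2, 54), (3, None), (4, 58), (5, 60),
       (6, 62), (None, 70), (8, 66), (9, 68), (10, 70)]

# Phase 2: clean and prepare. Drop incomplete rows, then shuffle and split
# into a training set and a held-out test set.
clean = [(x, y) for x, y in raw if x is not None and y is not None]
random.Random(0).shuffle(clean)  # fixed seed so the split is repeatable
split = int(len(clean) * 0.75)
train, test = clean[:split], clean[split:]


# Phase 3: train a model. Here, an ordinary least-squares straight line.
def fit_line(points):
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(train)

# Phase 4: test. Mean absolute error of the predictions on the held-out rows.
mae = sum(abs(y - (slope * x + intercept)) for x, y in test) / len(test)
```

Even in this small example most of the decisions (what counts as missing, how to split) sit in phase 2, which is consistent with the point about where the time goes.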
Know your data sources
• IoT: Device, Network and Sensor Data
• Provenance of data can be an issue… get consent!
• Use “reliable” open source data repositories, e.g. Kaggle, Data.gov.uk etc.
Qualitative and Quantitative
• Combining data from many sources and reconciling it so that it can be used to create reports can be incredibly difficult.
• Vendors offer a variety of ETL and data integration tools designed to make the process easier.
• Many enterprises have not solved the data integration problem yet.
Extract, Transform, Load (ETL) is a data pre-processing step: it organizes, cleans and unifies data before the data is loaded into a data warehouse.
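A minimal sketch of the three ETL stages, using only the Python standard library, might look like the following. The CSV content, the column names and the `turnout` table are hypothetical, and an in-memory SQLite database stands in for a real data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with typical problems: stray whitespace,
# inconsistent capitalisation, a duplicate row and a non-numeric value.
RAW_CSV = """constituency,turnout_pct
 Northtown ,64.2
southfield,58.9
Northtown,64.2
Eastvale,n/a
"""


def extract(text):
    """Extract: parse the CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: tidy names, convert turnout to float, drop bad and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["constituency"].strip().title()
        try:
            turnout = float(row["turnout_pct"])
        except ValueError:
            continue  # skip rows whose turnout is not a number
        if name in seen:
            continue  # skip duplicates of an already-seen constituency
        seen.add(name)
        cleaned.append((name, turnout))
    return cleaned


def load(rows, conn):
    """Load: write the cleaned rows into a database table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS turnout "
        "(constituency TEXT PRIMARY KEY, turnout_pct REAL)"
    )
    conn.executemany("INSERT INTO turnout VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a data warehouse
load(transform(extract(RAW_CSV)), conn)
result = conn.execute(
    "SELECT constituency, turnout_pct FROM turnout ORDER BY constituency"
).fetchall()
```

Keeping the three stages as separate functions makes it easier to swap the source (a file, an API) or the destination without touching the cleaning rules.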
Generating Useful Insights: Skills
• Easy to learn
• Statistics-based functions
• Relatively easy to learn
• Requires knowledge of programming fundamentals
Finding Data Sets
Search for central government sources
Office for National Statistics – elections data
Data is available on almost any topic you wish to search for, e.g. American government data, European data etc.