Top Free Essential Books for the Data Scientist

The true data enthusiast has a lot to read about: big data, machine learning, data science, data mining, etc. Besides these technology domains, there are also specific implementations and languages to keep up with: Hadoop, Spark, Python, and R, to name a few, not to mention the myriad tools for automating various aspects of our professional lives, which seem to pop up daily. Fortunately (unfortunately?) there is no shortage of books available on all of these subjects.

A list of free book recommendations follows. If you’re interested in books on data, this diverse list of top picks should be right up your alley.

Data Science & Big Data

The Art of Data Science

This book describes the process of analyzing data in simple and general terms. The authors have extensive experience both managing data analysts and conducting their own data analyses, and this book is a distillation of their experience in a format that is applicable to both practitioners and managers in data science.

Big Data Now: 2015 Edition

Now in its fifth year, O’Reilly’s annual Big Data Now report recaps the trends, tools, applications, and forecasts we’ve talked about over the past year. For 2015, we’ve included a collection of blog posts, authored by leading thinkers and experts in the field, that reflect a unique set of themes we’ve identified as gaining significant attention and traction.

Apache Hadoop & Spark

Hadoop Explained

Hadoop is one of the most important technologies in a world that is built on data. Find out how it has developed and progressed to address the continuing challenge of Big Data with this insightful guide.

Mastering Apache Spark

This collection of notes (what some may rashly call a “book”) serves as the ultimate place of mine to collect all the nuts and bolts of using Apache Spark. The notes aim to help me design and develop better products with Spark.

Statistics for Data Science

Think Stats: Exploratory Data Analysis in Python

Think Stats emphasizes simple techniques you can use to explore real data sets and answer interesting questions. The book presents a case study using data from the National Institutes of Health. Readers are encouraged to work on a project with real datasets.

Machine (Statistical) Learning & Deep Learning

An Introduction to Statistical Learning with Applications in R

This book provides an introduction to statistical learning methods. It is aimed at upper-level undergraduate students, master's students, and Ph.D. students in the non-mathematical sciences. The book also contains a number of R labs with detailed explanations of how to implement the various methods in real-life settings, and should be a valuable resource for a practicing data scientist.

Elements of Statistical Learning

The good news is, this is pretty much the most important book you are going to read in the space. It will tie everything together for you in a way that I haven’t seen any other book attempt.

Neural Networks and Deep Learning

Neural Networks and Deep Learning is a free online book. The book will teach you about:

  • Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data.
  • Deep learning, a powerful set of techniques for learning in neural networks.

Deep Learning

The in-preparation, likely-to-be-definitive deep learning book of the near future, written by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. The development version is updated monthly and will be freely available until publication.

Data Mining & SQL

Data Mining: Concepts and Techniques, Third Edition

Data Mining is a comprehensive overview of the field, and I think it is best for a graduate class in data mining, or perhaps as a reference book. The book’s focus is on technique (i.e., how to analyze data, including preparation), and it addresses all the major topics in the field including data storage and pre-processing. However, the book is really about classification methods, and the two chapters on cluster analysis are particularly strong and thorough.

Mining of Massive Datasets

The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.

Learning SQL, Second Edition

If you’re writing any type of database-driven code and you think that you don’t need to understand SQL, read this book. You do need to understand it, and this book teaches it very well.

Learn SQL The Hard Way

This book will teach you the 80% of SQL you probably need to use it effectively, and will mix in concepts in data modeling at the same time. If you’ve been fumbling around building web, desktop, or mobile applications because you don’t know SQL, then this book is for you. It is written for people with no prior database, programming, or SQL knowledge, but knowing at least one programming language will help.