{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data analyses  with Python & Jupyter"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Introduction\n",
    "\n",
    "You can do complex biological data manipulation and analyses using the `pandas` python package (or by switching kernels, using `R`!)\n",
    "\n",
    "We will look at pandas here, which provides `R`-like functions for data manipulation and analyses. `pandas` is built on top of NumPy. Most importantly, it offers an R-like `DataFrame` object: a multidimensional array with explicit row and column names that can contain heterogeneous types of data as well as  missing values, which would not be possible using numpy arrays.\n",
    "\n",
    "`pandas` also implements a number of powerful data operations for filtering, grouping and reshaping data similar to R or spreadsheet programs."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Installing Pandas\n",
    "\n",
    "`pandas` requires NumPy. See the [Pandas documentation](http://pandas.pydata.org/).\n",
    "If you installed Anaconda, you already have Pandas installed. Otherwise, you can `sudo apt install` it.\n",
    "\n",
    "Assuming `pandas` is installed, you can import it and check the version:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'0.17.1'"
      ]
     },
     "execution_count": 1,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import pandas as pd\n",
    "pd.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Also import scipy: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "import scipy as sc"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Reminder about tabbing and help!\n",
    "\n",
    "As you read through these chapters, don't forget that Jupyter gives you the ability to quickly explore the contents of a package or methods applicable to an an object by using the tab-completion feature. Also documentation of various functions can be accessed using the ``?`` character. For example, to display all the contents of the pandas namespace, you can type\n",
    "\n",
    "```ipython\n",
    "In [1]: pd.<TAB>\n",
    "```\n",
    "\n",
    "And to display Pandas's built-in documentation, you can use this:\n",
    "\n",
    "```ipython\n",
    "In [2]: pd?\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pandas `dataframes`\n",
    "\n",
    "The dataframes is the main data object in pandas. \n",
    "\n",
    "### importing data\n",
    "Dataframes can be created from multiple sources - e.g. CSV files, excel files, and JSON."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>Species</th>\n",
       "      <th>Infraorder</th>\n",
       "      <th>Family</th>\n",
       "      <th>Distribution</th>\n",
       "      <th>Body mass male (Kg)</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Daubentonia_madagascariensis</td>\n",
       "      <td>Chiromyiformes</td>\n",
       "      <td>Daubentoniidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>2.700</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Allocebus_trichotis</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Cheirogaleidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.100</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Avahi_laniger</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Indridae</td>\n",
       "      <td>America</td>\n",
       "      <td>1.030</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Avahi_occidentalis</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Indridae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.814</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Avahi_unicolor</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Indridae</td>\n",
       "      <td>America</td>\n",
       "      <td>0.830</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>Cheirogaleus_adipicaudatus</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Cheirogaleidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.200</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>Cheirogaleus_crossleyi</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Cheirogaleidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.400</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>Cheirogaleus_major</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Cheirogaleidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.450</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>Cheirogaleus_medius</td>\n",
       "      <td>Lemuriformes</td>\n",
       "      <td>Cheirogaleidae</td>\n",
       "      <td>Madagascar</td>\n",
       "      <td>0.217</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        Species       Infraorder          Family Distribution  \\\n",
       "0  Daubentonia_madagascariensis  Chiromyiformes   Daubentoniidae   Madagascar   \n",
       "1           Allocebus_trichotis     Lemuriformes  Cheirogaleidae   Madagascar   \n",
       "2                 Avahi_laniger     Lemuriformes        Indridae      America   \n",
       "3            Avahi_occidentalis     Lemuriformes        Indridae   Madagascar   \n",
       "4                Avahi_unicolor     Lemuriformes        Indridae      America   \n",
       "5    Cheirogaleus_adipicaudatus     Lemuriformes  Cheirogaleidae   Madagascar   \n",
       "6        Cheirogaleus_crossleyi     Lemuriformes  Cheirogaleidae   Madagascar   \n",
       "7            Cheirogaleus_major     Lemuriformes  Cheirogaleidae   Madagascar   \n",
       "8           Cheirogaleus_medius     Lemuriformes  Cheirogaleidae   Madagascar   \n",
       "\n",
       "   Body mass male (Kg)  \n",
       "0                2.700  \n",
       "1                0.100  \n",
       "2                1.030  \n",
       "3                0.814  \n",
       "4                0.830  \n",
       "5                0.200  \n",
       "6                0.400  \n",
       "7                0.450  \n",
       "8                0.217  "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MyDF = pd.read_csv('../data/testcsv.csv', sep=',')\n",
    "MyDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Creating dataframes\n",
    "\n",
    "You can also create dataframes using a python dictionary like syntax: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "      <th>col3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Var1</td>\n",
       "      <td>Grass</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Var2</td>\n",
       "      <td>Rabbit</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Var3</td>\n",
       "      <td>Fox</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Var4</td>\n",
       "      <td>Wolf</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1    col2  col3\n",
       "0  Var1   Grass     1\n",
       "1  Var2  Rabbit     2\n",
       "2  Var3     Fox   NaN\n",
       "3  Var4    Wolf     4"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "MyDF = pd.DataFrame({\n",
    "   'col1': ['Var1', 'Var2', 'Var3', 'Var4'],\n",
    "   'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],\n",
    "   'col3': [1, 2, sc.nan, 4]\n",
    "})\n",
    "\n",
    "MyDF"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Examining your data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "      <th>col3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Var1</td>\n",
       "      <td>Grass</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Var2</td>\n",
       "      <td>Rabbit</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Var3</td>\n",
       "      <td>Fox</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Var4</td>\n",
       "      <td>Wolf</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1    col2  col3\n",
       "0  Var1   Grass     1\n",
       "1  Var2  Rabbit     2\n",
       "2  Var3     Fox   NaN\n",
       "3  Var4    Wolf     4"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Displays the top 5 rows. Accepts an optional int parameter - num. of rows to show\n",
    "MyDF.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col1</th>\n",
       "      <th>col2</th>\n",
       "      <th>col3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Var1</td>\n",
       "      <td>Grass</td>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Var2</td>\n",
       "      <td>Rabbit</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Var3</td>\n",
       "      <td>Fox</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Var4</td>\n",
       "      <td>Wolf</td>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   col1    col2  col3\n",
       "0  Var1   Grass     1\n",
       "1  Var2  Rabbit     2\n",
       "2  Var3     Fox   NaN\n",
       "3  Var4    Wolf     4"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Similar to head, but displays the last rows\n",
    "MyDF.tail()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(4, 3)"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The dimensions of the dataframe as a (rows, cols) tuple\n",
    "MyDF.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The number of columns. Equal to df.shape[0]\n",
    "len(MyDF) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "Index(['col1', 'col2', 'col3'], dtype='object')"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# An array of the column names\n",
    "MyDF.columns "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "col1     object\n",
       "col2     object\n",
       "col3    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Columns and their types\n",
    "MyDF.dtypes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([['Var1', 'Grass', 1.0],\n",
       "       ['Var2', 'Rabbit', 2.0],\n",
       "       ['Var3', 'Fox', nan],\n",
       "       ['Var4', 'Wolf', 4.0]], dtype=object)"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Converts the frame to a two-dimensional table\n",
    "MyDF.values "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>col3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>count</th>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>mean</th>\n",
       "      <td>2.333333</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.527525</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>1.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>1.500000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>2.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>4.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           col3\n",
       "count  3.000000\n",
       "mean   2.333333\n",
       "std    1.527525\n",
       "min    1.000000\n",
       "25%    1.500000\n",
       "50%    2.000000\n",
       "75%    3.000000\n",
       "max    4.000000"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Displays descriptive stats for all columns\n",
    "MyDF.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using `tidyr` + `ggplot`) for this. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Readings and Resources\n",
    "\n",
    "* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)\n",
    "* A [Jupyter + pandas quickstart tutorial](http://nikgrozev.com/2015/12/27/pandas-in-jupyter-quickstart-and-useful-snippets/)"
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autoclose": false,
   "autocomplete": false,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {},
   "toc_section_display": false,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}