{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Data analyses with Python & Jupyter" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Introduction\n", "\n", "You can do complex biological data manipulation and analyses using the `pandas` Python package (or, by switching kernels, using `R`!).\n", "\n", "We will look at `pandas` here, which provides `R`-like functions for data manipulation and analyses. `pandas` is built on top of NumPy. Most importantly, it offers an R-like `DataFrame` object: a two-dimensional table with explicit row and column names that can contain heterogeneous types of data as well as missing values, which would not be possible using NumPy arrays.\n", "\n", "`pandas` also implements a number of powerful data operations for filtering, grouping and reshaping data, similar to those in R or spreadsheet programs." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installing Pandas\n", "\n", "`pandas` requires NumPy. See the [Pandas documentation](http://pandas.pydata.org/).\n", "If you installed Anaconda, you already have Pandas installed. Otherwise, you can install it with `pip install pandas` or, on Ubuntu, `sudo apt install python3-pandas`.\n", "\n", "Assuming `pandas` is installed, you can import it and check the version:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'0.17.1'" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "pd.__version__" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Also import scipy: " ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "import scipy as sc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Reminder about tabbing and help!\n", "\n", "As you read through these chapters, don't forget that Jupyter gives you the ability to quickly explore the contents of a package or methods applicable to an object by using the tab-completion feature. 
Also, documentation for functions can be accessed using the ``?`` character. For example, to display all the contents of the pandas namespace, you can type\n", "\n", "```ipython\n", "In [1]: pd.<TAB>\n", "```\n", "\n", "And to display Pandas's built-in documentation, you can use this:\n", "\n", "```ipython\n", "In [2]: pd?\n", "```" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Pandas `dataframes`\n", "\n", "The dataframe is the main data object in pandas. \n", "\n", "### Importing data\n", "Dataframes can be created from multiple sources, e.g. CSV files, Excel files, and JSON." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "scrolled": true }, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>Species</th><th>Infraorder</th><th>Family</th><th>Distribution</th><th>Body mass male (Kg)</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Daubentonia_madagascariensis</td><td>Chiromyiformes</td><td>Daubentoniidae</td><td>Madagascar</td><td>2.700</td></tr>\n",
"    <tr><th>1</th><td>Allocebus_trichotis</td><td>Lemuriformes</td><td>Cheirogaleidae</td><td>Madagascar</td><td>0.100</td></tr>\n",
"    <tr><th>2</th><td>Avahi_laniger</td><td>Lemuriformes</td><td>Indridae</td><td>America</td><td>1.030</td></tr>\n",
"    <tr><th>3</th><td>Avahi_occidentalis</td><td>Lemuriformes</td><td>Indridae</td><td>Madagascar</td><td>0.814</td></tr>\n",
"    <tr><th>4</th><td>Avahi_unicolor</td><td>Lemuriformes</td><td>Indridae</td><td>America</td><td>0.830</td></tr>\n",
"    <tr><th>5</th><td>Cheirogaleus_adipicaudatus</td><td>Lemuriformes</td><td>Cheirogaleidae</td><td>Madagascar</td><td>0.200</td></tr>\n",
"    <tr><th>6</th><td>Cheirogaleus_crossleyi</td><td>Lemuriformes</td><td>Cheirogaleidae</td><td>Madagascar</td><td>0.400</td></tr>\n",
"    <tr><th>7</th><td>Cheirogaleus_major</td><td>Lemuriformes</td><td>Cheirogaleidae</td><td>Madagascar</td><td>0.450</td></tr>\n",
"    <tr><th>8</th><td>Cheirogaleus_medius</td><td>Lemuriformes</td><td>Cheirogaleidae</td><td>Madagascar</td><td>0.217</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " Species Infraorder Family Distribution \\\n", "0 Daubentonia_madagascariensis Chiromyiformes Daubentoniidae Madagascar \n", "1 Allocebus_trichotis Lemuriformes Cheirogaleidae Madagascar \n", "2 Avahi_laniger Lemuriformes Indridae America \n", "3 Avahi_occidentalis Lemuriformes Indridae Madagascar \n", "4 Avahi_unicolor Lemuriformes Indridae America \n", "5 Cheirogaleus_adipicaudatus Lemuriformes Cheirogaleidae Madagascar \n", "6 Cheirogaleus_crossleyi Lemuriformes Cheirogaleidae Madagascar \n", "7 Cheirogaleus_major Lemuriformes Cheirogaleidae Madagascar \n", "8 Cheirogaleus_medius Lemuriformes Cheirogaleidae Madagascar \n", "\n", " Body mass male (Kg) \n", "0 2.700 \n", "1 0.100 \n", "2 1.030 \n", "3 0.814 \n", "4 0.830 \n", "5 0.200 \n", "6 0.400 \n", "7 0.450 \n", "8 0.217 " ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "MyDF = pd.read_csv('../data/testcsv.csv', sep=',')\n", "MyDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Creating dataframes\n", "\n", "You can also create dataframes using a Python dictionary-like syntax: " ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>col1</th><th>col2</th><th>col3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Var1</td><td>Grass</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Var2</td><td>Rabbit</td><td>2</td></tr>\n",
"    <tr><th>2</th><td>Var3</td><td>Fox</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>Var4</td><td>Wolf</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " col1 col2 col3\n", "0 Var1 Grass 1\n", "1 Var2 Rabbit 2\n", "2 Var3 Fox NaN\n", "3 Var4 Wolf 4" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "MyDF = pd.DataFrame({\n", " 'col1': ['Var1', 'Var2', 'Var3', 'Var4'],\n", " 'col2': ['Grass', 'Rabbit', 'Fox', 'Wolf'],\n", " 'col3': [1, 2, np.nan, 4] # np.nan marks a missing value\n", "})\n", "\n", "MyDF" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Examining your data" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
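<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>col1</th><th>col2</th><th>col3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Var1</td><td>Grass</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Var2</td><td>Rabbit</td><td>2</td></tr>\n",
"    <tr><th>2</th><td>Var3</td><td>Fox</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>Var4</td><td>Wolf</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>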
" ], "text/plain": [ " col1 col2 col3\n", "0 Var1 Grass 1\n", "1 Var2 Rabbit 2\n", "2 Var3 Fox NaN\n", "3 Var4 Wolf 4" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Displays the top 5 rows. Accepts an optional int parameter - num. of rows to show\n", "MyDF.head()" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>col1</th><th>col2</th><th>col3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Var1</td><td>Grass</td><td>1</td></tr>\n",
"    <tr><th>1</th><td>Var2</td><td>Rabbit</td><td>2</td></tr>\n",
"    <tr><th>2</th><td>Var3</td><td>Fox</td><td>NaN</td></tr>\n",
"    <tr><th>3</th><td>Var4</td><td>Wolf</td><td>4</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " col1 col2 col3\n", "0 Var1 Grass 1\n", "1 Var2 Rabbit 2\n", "2 Var3 Fox NaN\n", "3 Var4 Wolf 4" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Similar to head, but displays the last rows\n", "MyDF.tail()" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(4, 3)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The dimensions of the dataframe as a (rows, cols) tuple\n", "MyDF.shape" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "4" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The number of rows. Equal to MyDF.shape[0]\n", "len(MyDF) " ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "Index(['col1', 'col2', 'col3'], dtype='object')" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The column names, as a pandas Index object\n", "MyDF.columns " ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "col1 object\n", "col2 object\n", "col3 float64\n", "dtype: object" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Columns and their types\n", "MyDF.dtypes" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([['Var1', 'Grass', 1.0],\n", " ['Var2', 'Rabbit', 2.0],\n", " ['Var3', 'Fox', nan],\n", " ['Var4', 'Wolf', 4.0]], dtype=object)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Returns the data as a two-dimensional NumPy array\n", "MyDF.values " ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr style=\"text-align: right;\"><th></th><th>col3</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>count</th><td>3.000000</td></tr>\n",
"    <tr><th>mean</th><td>2.333333</td></tr>\n",
"    <tr><th>std</th><td>1.527525</td></tr>\n",
"    <tr><th>min</th><td>1.000000</td></tr>\n",
"    <tr><th>25%</th><td>1.500000</td></tr>\n",
"    <tr><th>50%</th><td>2.000000</td></tr>\n",
"    <tr><th>75%</th><td>3.000000</td></tr>\n",
"    <tr><th>max</th><td>4.000000</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"</div>" ], "text/plain": [ " col3\n", "count 3.000000\n", "mean 2.333333\n", "std 1.527525\n", "min 1.000000\n", "25% 1.500000\n", "50% 2.000000\n", "75% 3.000000\n", "max 4.000000" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Displays descriptive stats for the numeric columns (here only col3)\n", "MyDF.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, I am going to stop this brief intro to Jupyter with pandas here! I think you can already see the potential value of Jupyter for data analyses and visualization. As I mentioned above, you can also use R (e.g., using `tidyr` + `ggplot2`) for this. " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Readings and Resources\n", "\n", "* [Python Data Science Handbook](https://github.com/jakevdp/PythonDataScienceHandbook)\n", "* A [Jupyter + pandas quickstart tutorial](http://nikgrozev.com/2015/12/27/pandas-in-jupyter-quickstart-and-useful-snippets/)" ] } ], "metadata": { "anaconda-cloud": {}, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.8.5" }, "latex_envs": { "LaTeX_envs_menu_present": true, "autoclose": false, "autocomplete": false, "bibliofile": "biblio.bib", "cite_by": "apalike", "current_citInitial": 1, "eqLabelWithNumbers": true, "eqNumInitial": 1, "hotkeys": { "equation": "Ctrl-E", "itemize": "Ctrl-I" }, "labels_anchors": false, "latex_user_defs": false, "report_style_numbering": false, "user_envs_cfg": false }, "toc": { "base_numbering": 1, "nav_menu": {}, "number_sections": true, "sideBar": false, "skip_h1_title": false, "title_cell": "Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": {}, "toc_section_display": false, "toc_window_display": false } }, "nbformat": 4, 
"nbformat_minor": 4 }