Memento Pandas

Introduction to Pandas

What is Pandas?

- Pandas is a powerful open-source data analysis and manipulation library for Python.
- It provides data structures such as DataFrame and Series, which are well suited to structured (tabular) data.
- It is commonly used for data cleaning, exploration, and analysis.

Prerequisites

Before starting, you'll need:

- Python 3 (a version supported by the Pandas release you install)
- Pandas installed (use pip install pandas or conda install pandas)
- A basic understanding of Python and data analysis concepts

Importing Pandas

Import the Pandas library under its conventional alias, pd:

import pandas as pd

Creating DataFrames

Create a DataFrame from a dictionary:

data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]}
df = pd.DataFrame(data)
print(df)

Create a DataFrame from a CSV file:

df = pd.read_csv('file.csv')
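
The two construction routes above can be tried without a file on disk. The following is a minimal sketch: the CSV text is a hypothetical stand-in for 'file.csv', and the list-of-dictionaries input is an additional, commonly used option rather than something the section above prescribes.

```python
import io

import pandas as pd

# Hypothetical CSV content, standing in for the 'file.csv' used above.
csv_text = "Name,Age\nAlice,25\nBob,30\nCharlie,35\n"

# read_csv accepts a path or any file-like object; StringIO avoids touching disk.
df_csv = pd.read_csv(io.StringIO(csv_text))

# A list of dictionaries (one per row) is another common starting point.
rows = [{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 30}]
df_rows = pd.DataFrame(rows)

print(df_csv.shape)    # (number of rows, number of columns)
print(df_rows.dtypes)  # inferred type of each column
```

Reading from an in-memory buffer is mainly useful for experiments and tests; for real data, pass the file path as shown above.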

Basic DataFrame Operations

View the first few rows of the DataFrame:

df.head()

View basic summary statistics of the DataFrame:

df.describe()

Get DataFrame info (e.g., number of non-null entries):

df.info()

Access a single column:

df['Name']

Access multiple columns:

df[['Name', 'Age']]
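
As a rough illustration of what these calls return, the sketch below rebuilds the small example DataFrame and inspects the results. The variable names (first_two, stats, names, subset) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

first_two = df.head(2)        # first 2 rows instead of the default 5
stats = df.describe()         # summarises numeric columns only by default
names = df["Name"]            # single brackets return a Series
subset = df[["Name", "Age"]]  # a list of column names returns a DataFrame

print(type(names).__name__)
print(type(subset).__name__)
print(stats.loc["mean", "Age"])  # mean of the Age column
```

The Series versus DataFrame distinction matters later: some methods exist on only one of the two types.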

Data Selection and Filtering

Select rows based on condition:

df[df['Age'] > 30]

Select specific rows and columns by label using .loc[] (the end label of a slice is included, so 0:2 selects labels 0, 1, and 2):

df.loc[0:2, ['Name', 'Age']]

Select specific rows and columns by integer position using .iloc[] (the end position of a slice is excluded, so 0:2 selects positions 0 and 1):

df.iloc[0:2, [0, 1]]
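
The difference between label-based and position-based slicing is easy to miss, so here is a small sketch that compares the two on the example data. It also shows one way to combine conditions in a filter, which the section above does not cover; treat that part as an optional extension.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

by_label = df.loc[0:2, ["Name", "Age"]]  # labels 0, 1 and 2 (end included)
by_position = df.iloc[0:2, [0, 1]]       # positions 0 and 1 (end excluded)

# Combine conditions with & (and) or | (or); each condition needs parentheses.
in_range = df[(df["Age"] > 25) & (df["Age"] < 35)]

print(len(by_label), len(by_position))
print(in_range["Name"].tolist())
```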

Modifying DataFrames

Add a new column:

df['Salary'] = [50000, 60000, 70000]

Remove a column:

df.drop('Salary', axis=1, inplace=True)

Rename columns:

df.rename(columns={'Name': 'Full Name'}, inplace=True)

Sort values by a column (sort_values returns a new DataFrame, so assign the result to keep it):

df = df.sort_values(by='Age', ascending=False)
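
The same modifications can also be written without inplace=True, by chaining methods that each return a new DataFrame. This is a sketch of one possible style, not the only way to do it; the names result and without_salary are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

# Each method returns a new DataFrame, so the original df is left untouched.
result = (
    df.assign(Salary=[50000, 60000, 70000])      # add a column
      .rename(columns={"Name": "Full Name"})     # rename a column
      .sort_values(by="Age", ascending=False)    # sort rows, oldest first
      .reset_index(drop=True)                    # renumber rows after sorting
)

# Dropping a column by name, again without modifying result in place.
without_salary = result.drop(columns="Salary")

print(result)
print(list(df.columns))  # the original columns are unchanged
```

Chaining keeps intermediate states out of the namespace and makes it harder to accidentally reuse a DataFrame that has already been altered.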

Handling Missing Data

Check for missing values:

df.isnull().sum()

Fill missing values with a specific value:

df.fillna(0, inplace=True)

Drop rows with missing values:

df.dropna(inplace=True)
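
The example DataFrame used so far has no gaps, so the sketch below builds a small one with missing entries to show what these calls do. Filling with the column mean and restricting dropna to one column are illustrative choices, not recommendations for every dataset.

```python
import pandas as pd

# Small example with one missing name and one missing age.
df = pd.DataFrame({
    "Name": ["Alice", "Bob", None],
    "Age": [25, float("nan"), 35],
})

# Count missing values per column.
missing_per_column = df.isnull().sum()

# Fill only the Age column, here with the mean of the known ages.
filled = df.fillna({"Age": df["Age"].mean()})

# Drop rows only when Name is missing, keeping rows with a missing Age.
complete_names = df.dropna(subset=["Name"])

print(missing_per_column)
print(filled)
print(complete_names)
```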

Grouping and Aggregating

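Group data by a column and calculate the mean (this and the next example assume df still has the Name, Age, and Salary columns, as it did before the drop and rename shown earlier):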

df.groupby('Age')['Salary'].mean()

Group by multiple columns and apply aggregation:

df.groupby(['Age', 'Salary']).agg({'Name': 'count'})
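
Grouping is easier to see on data with repeated keys. The sketch below uses a hypothetical Department column (not part of the example data above) and shows both a single aggregation and named aggregation, where each output column is given as a (source column, function) pair.

```python
import pandas as pd

# Hypothetical data: the Department column is introduced only for this example.
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "IT", "IT"],
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Salary": [50000, 60000, 70000, 90000],
})

# Mean salary per department (returns a Series indexed by Department).
mean_salary = df.groupby("Department")["Salary"].mean()

# Several aggregations at once, with explicit output column names.
summary = df.groupby("Department").agg(
    headcount=("Name", "count"),
    max_salary=("Salary", "max"),
)

print(mean_salary)
print(summary)
```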

Merging DataFrames

Merge two DataFrames on a common column:

df1 = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
df2 = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 35]})
merged_df = pd.merge(df1, df2, on='ID', how='inner')

Join DataFrames on an index (join performs a left join by default, so every row of df1 is kept):

df1.set_index('ID', inplace=True)
df2.set_index('ID', inplace=True)
joined_df = df1.join(df2)
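
To see how the how argument changes the result, the sketch below runs an inner and an outer merge on the same two DataFrames. The indicator=True option is an addition for illustration: it adds a _merge column showing where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({"ID": [1, 2, 3], "Name": ["Alice", "Bob", "Charlie"]})
df2 = pd.DataFrame({"ID": [1, 2, 4], "Age": [25, 30, 35]})

# Inner merge: only IDs present in both DataFrames (1 and 2).
inner = pd.merge(df1, df2, on="ID", how="inner")

# Outer merge: all IDs from either side; _merge records the origin of each row.
outer = pd.merge(df1, df2, on="ID", how="outer", indicator=True)

# Index-based join without inplace=True; a left join keeps all rows of df1.
joined = df1.set_index("ID").join(df2.set_index("ID"))

print(inner)
print(outer)
print(joined)  # ID 3 has no match in df2, so its Age is NaN
```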

Plotting with Pandas

Plot a simple line chart (Pandas plotting relies on Matplotlib, which must be installed separately):

df['Age'].plot(kind='line')

Plot a histogram:

df['Age'].plot(kind='hist')

Plot a scatter plot (df needs both an Age and a Salary column):

df.plot(kind='scatter', x='Age', y='Salary')
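
Outside a notebook, a plot usually has to be shown or saved explicitly. The sketch below assumes Matplotlib is installed and selects the non-interactive "Agg" backend so that it runs without a display; the title text and the in-memory buffer are illustrative choices, and a file name could be passed to savefig instead.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no window is required

import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35], "Salary": [50000, 60000, 70000]})

# DataFrame.plot returns a Matplotlib Axes object that can be customised further.
ax = df.plot(kind="scatter", x="Age", y="Salary", title="Salary by age")

# Save the figure as PNG into an in-memory buffer.
buffer = io.BytesIO()
ax.get_figure().savefig(buffer, format="png")

print(buffer.getbuffer().nbytes, "bytes of PNG data")
```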

Exporting Data

Write a DataFrame to a CSV file:

df.to_csv('output.csv', index=False)

Write a DataFrame to an Excel file (writing .xlsx files requires an Excel engine such as openpyxl to be installed):

df.to_excel('output.xlsx', index=False)
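
A quick way to check an export is to read it back. The sketch below does a CSV round trip in memory: calling to_csv without a path returns the CSV text as a string. The variable names are illustrative, and the Excel route is left out because it depends on an optional engine.

```python
import io

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

# With no path argument, to_csv returns the CSV content as a string.
csv_text = df.to_csv(index=False)

# Read the text back to confirm that nothing was lost on the way.
restored = pd.read_csv(io.StringIO(csv_text))

print(csv_text)
print(restored.equals(df))
```

Passing index=False matters here: without it, the row index is written as an extra unnamed column and the restored DataFrame no longer matches the original.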

Conclusion

Key Takeaways:

- Pandas makes data manipulation and analysis easier with structures like DataFrame and Series.
- You can import/export data, handle missing data, filter and aggregate data, and visualize results.
- Pandas is a great tool for any data analysis project in Python.

Next Steps:

- Explore Pandas documentation for advanced operations.
- Learn about multi-indexing and time series data manipulation.
- Try using Pandas alongside NumPy and Matplotlib for full data science workflows.

Last update: 2025-05-04T19:34:16Z