Data Driven Design Notes
Posted on February 2, 2022
Tags: softdev
1 Loading big data
1.1 Pandas
series.astype("Sparse[str]")
Sparse series for tables with lots of missing data
= 0
tot for i in pd.read_csv("test.csv",chunksize=100):
+= i["price"].sum() tot
Server is sqlalchemy, Client is pandas. We want to get Server side cursors (streaming) to lazy fetch when the client pandas needs it.
import pandas as pd
from sqlalchemy import create_engine
def process_sql_using_pandas():
= create_engine(
engine "postgresql://postgres:pass@localhost/example"
)= engine.connect().execution_options(
conn =True)
stream_results
for chunk_dataframe in pd.read_sql(
"SELECT * FROM users", conn, chunksize=1000):
print(f"Got dataframe w/{len(chunk_dataframe)} rows")
# ... do something with dataframe ...
if __name__ == '__main__':
process_sql_using_pandas()
2 Chapter 2
2.1 SQL relational drawback
- Requires a translational layer between your application code and database data
- ORMs make this easier to translate DB data to object oriented data
2.2 noSQL JSON Trees
2.3 SQL vs NoSQL
- SQL better support for joins, many-one relation, many-many relation
- NoSQL schema flexibility, better performance due to locality, tree-like structure like JSON