Lessons in SMB Data Empowerment
The journey to democratize data analytics for small and medium-sized businesses (SMBs) is one filled with both ambition and learning. DataCopilot was born from a vision to empower SMBs with the same analytical prowess as larger enterprises, utilizing the rich ecosystem of python data libraries like pandas, statsmodels and prophet, along with algorithms like Sarimax and LSTM, to bridge the gap between raw data and actionable insights.
Although the project didn’t reach its final destination, it provided invaluable lessons in data preparation, automation, and the challenges of scaling innovation for diverse business needs. In this post, I’ll reflect on what worked, what didn’t, and the enduring insights that guide my approach to future projects.
Empowering SMBs Through Data
SMBs often lack the resources to hire dedicated data science teams, yet they generate vast amounts of data that can drive growth. DataCopilot aimed to level the playing field by providing tools that:
- Simplified Data Understanding: Profile and transform data from multiple sources.
- Enabled Predictive Analytics: Lay the groundwork for automated machine learning (AutoML) pipelines.
- Democratized Insights: Make actionable intelligence accessible to all, regardless of technical expertise.
The philosophy was simple: data tools should adapt to businesses, not the other way around.
Challenges Along the Way
Building a platform to support both structured and semi-structured data had its challenges:
- Diverse Data Types: SMBs often rely on both relational databases (e.g., MySQL) and semi-structured data (e.g., JSON from APIs or IoT devices).
- Automation Complexity: Automated pipelines require clean, well-prepared data, but SMBs often lack standardized processes for data management.
- Scalability for SMB Budgets: Ensuring tools were efficient and cost-effective for smaller businesses.
Unlocking Relational Data Potential
Relational databases are the backbone of many SMB operations, storing critical information on customers, inventory, and transactions. The SQL Data Analyzer was designed to simplify schema discovery and enable advanced feature engineering for machine learning workflows.
Key Features
- Schema-Level Analysis:
- Extracts metadata about tables, columns, and relationships.
- Maps foreign key dependencies for cross-table analysis.
- Constraint Awareness:
- Identifies primary keys, foreign keys, and unique constraints to support accurate data models.
- Efficient Querying:
- Uses lightweight SQL queries (e.g.,
SHOW CREATE TABLE
) to minimize database load.
- Uses lightweight SQL queries (e.g.,
How It Works
The SQL Data Analyzer processes database schemas in several steps, with each step focused on extracting specific metadata.
Step 1: Retrieve DDL Statements
The analyzer retrieves DDL statements for each table using the SHOW CREATE TABLE
query. This provides a blueprint of the database schema, including relationships between tables.
def get_tables(cnx):
return do_raw_query(cnx, 'SHOW TABLES;')
def process_ddl(cnx, tbl_name):
ddl_query = f'SHOW CREATE TABLE {tbl_name}'
res = do_query(cnx, ddl_query)
ddl = "".join(line.strip() for line in res[0]['Create Table'].split("\n"))
return ddl
Step 2: Extract Foreign Key Relationships
Foreign key relationships are crucial for understanding how tables interact. The analyzer parses these relationships from DDL statements using regular expressions.
def extract_foreign_keys(ddl):
fk_pattern = r"FOREIGN KEY \((.+?)\) REFERENCES (\w+)\((.+?)\)"
foreign_keys = {}
for match in re.finditer(fk_pattern, ddl):
local_field, ref_table, ref_field = match.groups()
foreign_keys[local_field] = {"table": ref_table, "field": ref_field}
return foreign_keys
Step 3: Summarize Metadata
The parsed metadata is compiled into a summary, providing visibility into the database structure and relationships.
Example Output
For a database with orders
and customers
, the analyzer produces the following schema summary:
{
"orders": [
{"name": "order_id", "data_type": "int", "primary_key": true, "not_null": true},
{"name": "customer_id", "data_type": "int", "foreign_key": {"table": "customers", "field": "customer_id"}},
{"name": "order_date", "data_type": "datetime", "not_null": true}
]
}
Making Sense of Semi-Structured Data
Semi-structured data from APIs, IoT devices, and logs is increasingly common but notoriously difficult to work with. The JSON Data Analyzer was designed to handle this complexity, summarizing hierarchical data structures and providing insights that traditional tools often miss.
Key Features
- Field-Level Insights:
- Computes statistics for numeric, boolean, and string fields.
- Handles nested structures with dot-separated keys (e.g.,
user.address.city
).
- Hierarchical Support:
- Recursively parses nested dictionaries and lists.
- Efficient Processing:
- Uses probabilistic data structures like CountMinSketch for memory-efficient frequency tracking.
How It Works
The JSON Data Analyzer dynamically adapts to different field types, supporting hierarchical structures and ensuring no information is missed.
Dynamic Field Processing
Here’s how the analyzer processes various field types:
def process_value(key, value, fields):
field_type = type(value).__name__
if field_type in ['int', 'float']:
fields[key] = update_numeric_field(fields.get(key, {}), value)
elif field_type == 'bool':
fields[key] = update_boolean_field(fields.get(key, {}), value)
elif field_type == 'str':
fields[key] = update_string_field(fields.get(key, {}), value)
elif field_type == 'dict':
for sub_key, sub_value in value.items():
fields = process_value(f"{key}.{sub_key}", sub_value, fields)
return fields
Example Output
After processing a JSON file, the analyzer might output:
{
"user.age": {"type": "int", "min": 18, "max": 65, "avg": 35.4},
"user.is_active": {"type": "bool", "hist": {"True": 800, "False": 200}},
"user.address.city": {"type": "str", "unique_count": 50}
}
Forecasting
Regardless of the type of analyzer, we wanted the data in a common format before performing pre-processing and forecasting. The pandas DataFrame
served nicely for that. Time series forecasting algorithms like SARIMAX, Prophet, and LSTM rely heavily on the quality of input data. Without proper preprocessing, you may encounter erroneous results, model errors, missing interpretations, and so forth. Consider a simplified pre-processing function:
def preprocess_data(self, data, date_col, value_col):
"""
Preprocess data to ensure it is ready for forecasting.
Args:
data (pd.DataFrame): Input data.
date_col (str): Column name for datetime values.
value_col (str): Column name for values to forecast.
Returns:
pd.DataFrame: Processed data with datetime index.
"""
# Convert the date column to datetime format
data[date_col] = pd.to_datetime(data[date_col])
# Select only the relevant columns
data = data[[date_col, value_col]].dropna()
# Set the date column as the DataFrame index
data = data.set_index(date_col)
return data
It ensures timestamps are correctly formatted and aligned, and reduces the likelihood of runtime errors due to missing or misaligned data.
Once the data is pre-processed, forecasting with various models can look pretty similar. Let’s look at SARIMAX and Prophet.
from statsmodels.tsa.statespace.sarimax import SARIMAX
def forecast_sarimax(self, data, order=(1, 1, 1), seasonal_order=(0, 0, 0, 0), steps=10):
model = SARIMAX(data, order=order, seasonal_order=seasonal_order)
results = model.fit(disp=False)
forecast = results.get_forecast(steps=steps)
forecast_df = forecast.summary_frame()
return forecast_df[['mean', 'mean_ci_lower', 'mean_ci_upper']]
from prophet import Prophet
def forecast_prophet(self, data):
data = data.reset_index().rename(columns={"index": "ds", data.columns[0]: "y"})
model = Prophet()
model.fit(data)
future = model.make_future_dataframe(periods=10)
forecast = model.predict(future)
return forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']]
Exogenous Data
While forecasting models like SARIMAX, Prophet, and LSTM can offer valuable insights, they are often limited by their reliance on endogenous data—historical patterns within the target variable. For small businesses like restaurants, cafes, and Mom & Pop storefronts, external factors (exogenous data) such as weather, holidays, local events, or economic indicators can significantly influence performance. Ignoring these external factors can lead to forecasts that are disconnected from reality.
Exogenous variables enrich forecasting models by providing context beyond historical trends. For example:
- Weather: A spike in rainfall might reduce foot traffic for a coffee shop while increasing delivery orders.
- Holidays and Events: A major local festival could drive up sales for nearby restaurants but reduce sales for suburban stores as people flock to the city center.
- Economic Indicators: Changes in fuel prices or unemployment rates can alter consumer spending habits, impacting retail businesses differently.
Incorporating these variables helps models adjust predictions dynamically, ensuring forecasts align more closely with real-world fluctuations. The ability to seamlessly ingest and integrate these external data sources was one of the biggest challenges for DataCopilot. Forecasting tools must prioritize the inclusion of exogenous data to deliver more actionable and accurate predictions.
Reflections and Lessons Learned
While DataCopilot was ultimately shelved, it offered several key takeaways:
- Diverse Data Requires Diverse Tools: Supporting both SQL and JSON ensured compatibility with a wide range of business systems.
- Automation is Key: AutoML frameworks are only as good as the data preparation tools supporting them.
- Timing Matters: As tools like GitHub Copilot gain traction, the demand for user-friendly, automated systems is growing.
- Exogenous Data is Essential: Forecasting tools must incorporate external factors like weather, events, and economic trends to provide SMBs with actionable insights.