# Comment Processing with Agentic Workflow

A scalable, modular system for processing comments from multiple data sources using the OpenAI API, LangChain, and LangGraph. The system performs language detection, translation, and context-aware sentiment analysis using an agentic workflow architecture.

## Data Sources Supported

- **Social Media Comments**: External platforms (Facebook, Instagram, YouTube, etc.)
- **Musora Internal Comments**: Comments from Musora internal applications
- **Extensible Architecture**: Easily add new data sources via configuration

## Features

- **Multi-Source Support**: Process comments from multiple data sources with a single codebase
- **Configuration-Driven**: Add new data sources without code changes
- **Parent Comment Context**: Automatically includes parent comment text for reply analysis
- **Modular Agent Architecture**: Extensible base classes for easy addition of new agents
- **Language Detection**: Hybrid approach using the lingua library for fast English detection, with an LLM fallback for non-English languages
- **Translation**: High-quality translation of non-English comments using OpenAI models
- **Context-Aware Sentiment Analysis**:
  - Uses the content description for context
  - Includes parent comment text when analyzing replies
  - Multi-label intent classification
- **LangGraph Workflow**: Flexible graph-based orchestration of agent operations
- **Snowflake Integration**: Seamless data fetching and storage with source-specific tables
- **Parallel Processing**: Multiprocessing support for high-performance batch processing
- **Dynamic Batch Sizing**: Intelligent batch size calculation based on workload and available resources
- **Independent Batch Execution**: Each batch processes and stores results independently
- **Comprehensive Logging**: Detailed logging for monitoring and debugging
- **Scalable Configuration**: Easy-to-modify sentiment categories and intents via JSON config

## Project Structure

```
musora-sentiment-analysis/
├── agents/
│   ├── __init__.py
│   ├── base_agent.py                       # Base class for all agents
│   ├── language_detection_agent.py         # Language detection agent
│   ├── translation_agent.py                # Translation agent
│   └── sentiment_analysis_agent.py         # Sentiment analysis agent (parent context support)
├── workflow/
│   ├── __init__.py
│   └── comment_processor.py                # LangGraph workflow orchestrator
├── sql/
│   ├── fetch_comments.sql                  # Query for social media comments (with parent join)
│   ├── fetch_musora_comments.sql           # Query for Musora internal comments (with parent join)
│   ├── create_ml_features_table.sql        # Schema for social media table (with parent fields)
│   ├── init_musora_table.sql               # Initialize empty Musora table (run first!)
│   └── create_musora_ml_features_table.sql # Full Musora schema with views (optional)
├── config_files/
│   ├── data_sources_config.json            # Data source configuration (NEW)
│   ├── sentiment_config.json               # Configuration for agents and workflow
│   └── sentiment_analysis_config.json      # Sentiment categories and intents
├── logs/                                   # Processing logs (auto-created)
├── LLM.py                                  # LLM utility class
├── SnowFlakeConnection.py                  # Snowflake connection handler
├── main.py                                 # Main execution script (multi-source support)
├── requirements.txt                        # Python dependencies
├── .env                                    # Environment variables (not in git)
├── README.md                               # This file
└── CLAUDE.md                               # Detailed technical documentation
```

## Setup

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Configure Environment Variables

Ensure your `.env` file contains the required credentials:

```env
# Snowflake
SNOWFLAKE_USER=your_user
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_ACCOUNT=your_account
SNOWFLAKE_ROLE=your_role
SNOWFLAKE_DATABASE=SOCIAL_MEDIA_DB
SNOWFLAKE_WAREHOUSE=your_warehouse
SNOWFLAKE_SCHEMA=ML_FEATURES

# OpenAI
OPENAI_API_KEY=your_openai_key
```
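The project presumably loads these variables with a library such as python-dotenv; for illustration, the sketch below is a minimal, dependency-free equivalent (the function name and exact behavior are assumptions, not the project's actual loader):

```python
import os

def load_env_file(path: str = ".env") -> None:
    """Minimal .env loader: reads KEY=value lines into os.environ.

    Blank lines and '#' comments are skipped, and variables already
    present in the environment are not overwritten. Illustrative only;
    the project itself may use python-dotenv instead.
    """
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip())
```

After calling `load_env_file()`, the credentials are available via `os.environ["SNOWFLAKE_USER"]` and so on.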

### 3. Create Snowflake Tables

Run the SQL scripts to create the output tables:

```bash
# Execute the SQL files in Snowflake
# For social media comments (if the table does not already exist)
sql/create_ml_features_table.sql

# For Musora internal comments - INITIAL SETUP (first time only)
# This creates the empty table structure
sql/init_musora_table.sql
```

**Note**: Run `init_musora_table.sql` before the first Musora comments processing run. After that, you can optionally run `create_musora_ml_features_table.sql` to create the additional views if needed.

## Usage

### Basic Usage (Process All Data Sources)

Process unprocessed comments from all enabled data sources:

```bash
python main.py
```

This will:
- Process all enabled data sources (social media and Musora comments)
- Fetch only comments that haven't been processed yet
- Process them through the workflow using parallel workers (CPU count - 2, max 5)
- Store each batch to Snowflake independently as it completes
- Append new results to the existing tables (no overwrite)

### Process Specific Data Source

Process only social media comments:

```bash
python main.py --data-source social_media
```

Process only Musora internal comments:

```bash
python main.py --data-source musora_comments
```

### Process Limited Number of Comments

The limit applies per data source:

```bash
# Process 100 comments from each enabled data source
python main.py --limit 100

# Process 100 comments from only the Musora source
python main.py --limit 100 --data-source musora_comments
```

### Sequential Processing (Debug Mode)

For debugging purposes, use sequential processing:

```bash
python main.py --limit 100 --sequential
```

This processes all comments in a single batch, making it easier to debug issues.

### First Run for New Data Source

For the first run of Musora comments:

1. **First**: Run the initialization SQL script in Snowflake:
   ```sql
   -- Execute in Snowflake
   sql/init_musora_table.sql
   ```

2. **Then**: Run the processing with the overwrite flag:
   ```bash
   python main.py --overwrite --data-source musora_comments --limit 100
   ```

**Why two steps?**
- The fetch query checks for already-processed comments by querying the output table
- On the first run, that table doesn't exist, causing an error
- The init script creates the empty table structure first
- Then processing can run normally

**Warning**: Overwrite replaces all existing data in the output table. Only use it for initial table creation or when reprocessing from scratch.

### Custom Configuration File

```bash
python main.py --config path/to/custom_config.json
```

### Command-Line Arguments

- `--limit N`: Process only N comments per data source (default: 10000)
- `--overwrite`: Overwrite the existing Snowflake table (default: append mode)
- `--config PATH`: Custom configuration file path
- `--sequential`: Use sequential processing instead of parallel (for debugging)
- `--data-source SOURCE`: Process only a specific data source (e.g., social_media, musora_comments)
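The interface above maps naturally onto argparse. This sketch mirrors the documented flags; defaults other than `--limit` are assumptions, not necessarily what `main.py` uses:

```python
import argparse

def build_arg_parser() -> argparse.ArgumentParser:
    """CLI mirroring the documented flags (illustrative sketch)."""
    parser = argparse.ArgumentParser(
        description="Process comments with the agentic workflow")
    parser.add_argument("--limit", type=int, default=10000,
                        help="Max comments to process per data source")
    parser.add_argument("--overwrite", action="store_true",
                        help="Overwrite the output table instead of appending")
    parser.add_argument("--config", default=None,
                        help="Path to a custom configuration file")
    parser.add_argument("--sequential", action="store_true",
                        help="Process everything in a single batch (debugging)")
    parser.add_argument("--data-source", dest="data_source", default=None,
                        help="Process only this data source (e.g. social_media)")
    return parser
```

Note that argparse converts the hyphenated `--data-source` flag into the attribute `args.data_source`.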

### Parallel Processing

The system uses multiprocessing to process comments in parallel:

**Worker Calculation**:
- Number of workers: `CPU count - 2` (max 5 workers)
- Leaves CPU cores available for system operations
- Example: 8-core system → 5 workers (8 - 2 = 6, capped at 5)

**Dynamic Batch Sizing**:
- Batch size calculated as: `total_comments / num_workers`
- Minimum batch size: 20 comments
- Maximum batch size: 1000 comments
- Workloads of ≤ 20 comments are not split
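The worker and batch-size rules above can be sketched as follows (function and constant names are illustrative, not the actual identifiers in `main.py`):

```python
import os
from typing import Optional

MIN_BATCH_SIZE = 20
MAX_BATCH_SIZE = 1000
MAX_WORKERS = 5

def num_workers(cpu_count: Optional[int] = None) -> int:
    """CPU count minus 2, clamped to at least 1 and at most MAX_WORKERS."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    return max(1, min(cpus - 2, MAX_WORKERS))

def batch_size(total_comments: int, workers: int) -> int:
    """Even split across workers, clamped to [MIN_BATCH_SIZE, MAX_BATCH_SIZE].

    Workloads of MIN_BATCH_SIZE comments or fewer stay in a single batch.
    """
    if total_comments <= MIN_BATCH_SIZE:
        return total_comments
    return max(MIN_BATCH_SIZE, min(total_comments // workers, MAX_BATCH_SIZE))
```

For example, 10,000 comments on an 8-core machine yields 5 workers and a batch size of 1000 (2000 per worker, capped at the maximum).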

**Independent Execution**:
- Each batch runs in a separate process
- Batches store to Snowflake immediately upon completion
- No waiting for all batches to complete
- Failed batches don't affect successful ones

**Performance**:
- Expected speedup: ~1.8-4.5x depending on the number of workers
- Real-time progress reporting as batches complete
- Processing time and average per comment displayed in the summary

### Incremental Processing

The pipeline is designed for incremental processing:
- **Automatic deduplication**: The SQL query excludes comments already in `COMMENT_SENTIMENT_FEATURES`
- **Append-only by default**: New results are added without overwriting existing data
- **Failed comment retry**: Comments with `success=False` are not stored and will be retried in future runs
- **Run regularly**: Safe to run daily/weekly to process new comments
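The deduplication step amounts to an anti-join against the output table. The query below shows the general shape only; the source table name and join key are assumptions for illustration, and the real queries live in the `sql/` directory:

```python
# Illustrative shape of the deduplicating fetch query. SOURCE_COMMENTS is an
# assumed placeholder; the output table name comes from this README's config.
FETCH_UNPROCESSED_SQL = """
SELECT c.*
FROM SOURCE_COMMENTS c
LEFT JOIN SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES f
    ON c.COMMENT_SK = f.COMMENT_SK
WHERE f.COMMENT_SK IS NULL   -- keep only comments not yet processed
LIMIT %(limit)s
"""
```

Because failed comments are never written to the output table, this same anti-join is what makes them eligible for retry on the next run.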

## Configuration

### Data Sources Configuration

The `config_files/data_sources_config.json` file defines the available data sources:

```json
{
  "data_sources": {
    "social_media": {
      "name": "Social Media Comments",
      "enabled": true,
      "sql_query_file": "sql/fetch_comments.sql",
      "output_config": {
        "table_name": "COMMENT_SENTIMENT_FEATURES",
        "database": "SOCIAL_MEDIA_DB",
        "schema": "ML_FEATURES"
      }
    },
    "musora_comments": {
      "name": "Musora Internal Comments",
      "enabled": true,
      "sql_query_file": "sql/fetch_musora_comments.sql",
      "output_config": {
        "table_name": "MUSORA_COMMENT_SENTIMENT_FEATURES",
        "database": "SOCIAL_MEDIA_DB",
        "schema": "ML_FEATURES"
      },
      "additional_fields": [
        "PERMALINK_URL",
        "THUMBNAIL_URL"
      ]
    }
  }
}
```
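Loading this file and selecting the enabled sources is a few lines of standard-library code. The function name is illustrative, not the one in `main.py`:

```python
import json
from typing import Any, Dict

def load_enabled_sources(
        path: str = "config_files/data_sources_config.json") -> Dict[str, Dict[str, Any]]:
    """Return only the data sources whose 'enabled' flag is true."""
    with open(path) as fh:
        config = json.load(fh)
    return {key: source
            for key, source in config["data_sources"].items()
            if source.get("enabled", False)}
```

Disabling a source is then just a matter of flipping its `"enabled"` flag to `false`; no code changes are needed.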

**To add a new data source**: Add a new entry to this config file and create the corresponding SQL query file.

### Agent Configuration

The `config_files/sentiment_config.json` file controls agent behavior:

```json
{
  "agents": {
    "language_detection": {
      "model": "gpt-5-nano",
      "temperature": 0.0,
      "max_retries": 3
    },
    "translation": {
      "model": "gpt-5-nano",
      "temperature": 0.3,
      "max_retries": 3
    },
    "sentiment_analysis": {
      "model": "gpt-5-nano",
      "temperature": 0.2,
      "max_retries": 3
    }
  },
  "workflow": {
    "description": "Batch size is calculated dynamically based on number of workers (min: 20, max: 1000)",
    "parallel_processing": {
      "enabled": true,
      "worker_calculation": "CPU count - 2, max 5 workers",
      "min_batch_size": 20,
      "max_batch_size": 1000
    }
  },
  "snowflake": {
    "output_table": "COMMENT_SENTIMENT_FEATURES",
    "database": "SOCIAL_MEDIA_DB",
    "schema": "ML_FEATURES"
  }
}
```

**Note**: Batch size is calculated dynamically and no longer needs to be configured manually.

### Sentiment Categories Configuration

The `config_files/sentiment_analysis_config.json` file defines sentiment categories and intents (easily extensible):

```json
{
  "sentiment_polarity": {
    "categories": [
      {"value": "very_positive", "label": "Very Positive", "description": "..."},
      {"value": "positive", "label": "Positive", "description": "..."},
      {"value": "neutral", "label": "Neutral", "description": "..."},
      {"value": "negative", "label": "Negative", "description": "..."},
      {"value": "very_negative", "label": "Very Negative", "description": "..."}
    ]
  },
  "intent": {
    "categories": [
      {"value": "praise", "label": "Praise", "description": "..."},
      {"value": "question", "label": "Question", "description": "..."},
      {"value": "request", "label": "Request", "description": "..."},
      {"value": "feedback_negative", "label": "Negative Feedback", "description": "..."},
      {"value": "suggestion", "label": "Suggestion", "description": "..."},
      {"value": "humor_sarcasm", "label": "Humor/Sarcasm", "description": "..."},
      {"value": "off_topic", "label": "Off Topic", "description": "..."},
      {"value": "spam_selfpromo", "label": "Spam/Self-Promotion", "description": "..."}
    ]
  },
  "reply_policy": {
    "requires_reply_intents": ["question", "request"],
    "description": "Comments with these intents should be flagged for reply"
  },
  "intent_settings": {
    "multi_label": true,
    "description": "Intent can have multiple labels as a comment can express multiple intents"
  }
}
```

## Adding New Agents

The system is designed for easy extensibility. To add a new agent:

### 1. Create Agent Class

```python
from typing import Any, Dict

from agents.base_agent import BaseAgent

class MyNewAgent(BaseAgent):
    def __init__(self, config: Dict[str, Any], api_key: str):
        super().__init__("MyNewAgent", config)
        # Initialize your agent-specific components here

    def validate_input(self, input_data: Dict[str, Any]) -> bool:
        # Validate input data
        return True

    def process(self, input_data: Dict[str, Any]) -> Dict[str, Any]:
        # Implement your agent logic
        return {"success": True, "result": "..."}
```

### 2. Update Workflow

Add the agent to `workflow/comment_processor.py`:

```python
# Add to the CommentState TypedDict
new_agent_result: str

# Add the node
workflow.add_node("my_new_agent", self._my_new_agent_node)

# Add the edges
workflow.add_edge("translation", "my_new_agent")
workflow.add_edge("my_new_agent", END)
```

### 3. Update Configuration

Add the agent config to `sentiment_config.json`:

```json
{
  "agents": {
    "my_new_agent": {
      "name": "MyNewAgent",
      "model": "gpt-4o-mini",
      "temperature": 0.5,
      "max_retries": 3
    }
  }
}
```

## Output Schema

### Social Media Comments Table

Stored in `SOCIAL_MEDIA_DB.ML_FEATURES.COMMENT_SENTIMENT_FEATURES`

### Musora Comments Table

Stored in `SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES`

### Common Columns (Both Tables)

| Column | Type | Description |
|--------|------|-------------|
| COMMENT_SK | NUMBER | Surrogate key from source |
| COMMENT_ID | VARCHAR | Platform comment ID |
| ORIGINAL_TEXT | VARCHAR | Original comment text |
| **PARENT_COMMENT_ID** | **VARCHAR** | **ID of the parent comment if this is a reply** |
| **PARENT_COMMENT_TEXT** | **VARCHAR** | **Text of the parent comment, for context** |
| DETECTED_LANGUAGE | VARCHAR | Detected language name |
| LANGUAGE_CODE | VARCHAR | ISO 639-1 code |
| IS_ENGLISH | BOOLEAN | Whether the comment is in English |
| TRANSLATED_TEXT | VARCHAR | English translation |
| TRANSLATION_PERFORMED | BOOLEAN | Whether translation was performed |
| SENTIMENT_POLARITY | VARCHAR | Sentiment (very_positive, positive, neutral, negative, very_negative) |
| INTENT | VARCHAR | Multi-label intents (comma-separated) |
| REQUIRES_REPLY | BOOLEAN | Whether the comment needs a response |
| SENTIMENT_CONFIDENCE | VARCHAR | Analysis confidence (high, medium, low) |
| PROCESSING_SUCCESS | BOOLEAN | Processing status |
| PROCESSED_AT | TIMESTAMP | Processing timestamp |

### Musora-Specific Additional Columns

| Column | Type | Description |
|--------|------|-------------|
| PERMALINK_URL | VARCHAR | Web URL path of the content |
| THUMBNAIL_URL | VARCHAR | Thumbnail URL of the content |

### Available Views

**Social Media:**
- `VW_COMMENTS_REQUIRING_REPLY`: Comments that need responses (includes parent comment info)
- `VW_SENTIMENT_DISTRIBUTION`: Sentiment and intent statistics by channel (includes reply comment count)
- `VW_NON_ENGLISH_COMMENTS`: Filtered view of non-English comments

**Musora:**
- `VW_MUSORA_COMMENTS_REQUIRING_REPLY`: Musora comments needing responses
- `VW_MUSORA_SENTIMENT_DISTRIBUTION`: Musora sentiment and intent statistics
- `VW_MUSORA_NON_ENGLISH_COMMENTS`: Non-English Musora comments

## Workflow Architecture

The system uses LangGraph to create a flexible, state-based workflow:

```
┌─────────────────────┐
│   Fetch Comments    │
│   from Snowflake    │
│ (Unprocessed Only)  │
└──────────┬──────────┘
           │
           ▼
┌─────────────────────┐
│ Language Detection  │
│        Agent        │
└──────────┬──────────┘
           │
           ▼
      ┌────┴────┐
      │ English?│
      └────┬────┘
           │
     ┌─────┴─────┐
     │           │
    Yes          No
     │           │
     │           ▼
     │    ┌─────────────┐
     │    │ Translation │
     │    │    Agent    │
     │    └──────┬──────┘
     │           │
     └─────┬─────┘
           │
           ▼
 ┌──────────────────┐
 │    Sentiment     │
 │  Analysis Agent  │
 └────────┬─────────┘
          │
          ▼
  ┌──────────────┐
  │Store Results │
  │to Snowflake  │
  │(Append Mode) │
  └──────────────┘
```

**Note**: The fetch step automatically excludes comments already present in `COMMENT_SENTIMENT_FEATURES`, enabling incremental processing.

## Logging

Logs are automatically created in the `logs/` directory with timestamps:

```
logs/comment_processing_20251001_143022.log
```

## Adding New Data Sources

The system is designed to make adding new data sources easy.

### Steps to Add a New Source

1. **Update Configuration** (`config_files/data_sources_config.json`):
   ```json
   "your_new_source": {
     "name": "Your New Source Name",
     "enabled": true,
     "sql_query_file": "sql/fetch_your_source.sql",
     "output_config": {
       "table_name": "YOUR_SOURCE_SENTIMENT_FEATURES",
       "database": "SOCIAL_MEDIA_DB",
       "schema": "ML_FEATURES"
     },
     "additional_fields": ["FIELD1", "FIELD2"]
   }
   ```
   (`additional_fields` is optional.)

2. **Create SQL Query File** (`sql/fetch_your_source.sql`):
   - Fetch comments with consistent column names
   - Include a self-join for parent comments if available
   - Exclude already-processed comments (LEFT JOIN with the output table)

3. **Create Table Initialization Script** (`sql/init_your_source_table.sql`):
   - Creates the empty table structure
   - Base the schema on `init_musora_table.sql`
   - Add source-specific fields as needed
   - **Run this in Snowflake FIRST, before processing**

4. **Create Full Schema** (optional):
   - Base the schema on `create_musora_ml_features_table.sql`
   - Include views and indexes

5. **Run for the First Time**:
   ```bash
   # Step 1: Run the init script in Snowflake
   sql/init_your_source_table.sql

   # Step 2: Process the first batch
   python main.py --overwrite --data-source your_new_source --limit 100
   ```

**No code changes required!**

## Best Practices

1. **Testing**: Always test with the `--limit` flag first (e.g., `--limit 100`)
2. **New Data Sources**: Test new sources with `--sequential --limit 100` first
3. **Debugging**: Use the `--sequential` flag for easier debugging of processing issues
4. **Incremental Processing**: Run regularly without `--overwrite` to process only new comments
5. **Monitoring**: Check logs for processing errors and batch completion
6. **Performance**: Use the default parallel mode for production workloads
7. **Extensibility**: Follow the base agent pattern for consistency
8. **Error Handling**: All agents include robust error handling
9. **Failed Comments**: Review logs for failed comments; they'll be retried automatically in future runs
10. **Resource Management**: The system automatically adapts to available CPU resources
11. **Parent Comments**: Ensure SQL queries include parent comment joins for best accuracy

## Sentiment Analysis Features

### Multi-Label Intent Classification

The system supports **multi-label intent classification**, meaning a single comment can have multiple intents:

- **Example**: "This is amazing! What scale are you using?" → `["praise", "question"]`
- **Example**: "Love this but can you make a tutorial on it?" → `["praise", "request"]`

### Context-Aware Analysis with Parent Comment Support

The sentiment analysis agent provides rich context understanding:

1. **Content Context**: Uses the `content_description` field to understand what the comment is about
2. **Parent Comment Context** (NEW): When analyzing reply comments, the system:
   - Automatically detects when a comment is a reply
   - Fetches the parent comment text from the database
   - Includes the parent comment in the LLM prompt
   - Explicitly instructs the LLM that this is a reply comment
   - Results in more accurate sentiment and intent classification
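Conceptually, the context assembly looks like the sketch below. The wording and function name are illustrative assumptions, not the agent's actual prompt:

```python
from typing import Optional

def build_sentiment_context(comment_text: str,
                            content_description: Optional[str] = None,
                            parent_comment_text: Optional[str] = None) -> str:
    """Assemble the context passed to the LLM (illustrative, not the exact prompt)."""
    parts = []
    if content_description:
        parts.append(f"Content being discussed: {content_description}")
    if parent_comment_text:
        # Flag the reply relationship explicitly so the model analyzes
        # the comment in light of what it responds to.
        parts.append(f'This comment is a REPLY to: "{parent_comment_text}"')
    parts.append(f'Comment to analyze: "{comment_text}"')
    return "\n".join(parts)
```

When no parent comment is present, the reply line is simply omitted and the comment is analyzed on its own.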

**Example**:
- Parent Comment: "Does anyone know how to play this riff?"
- Reply Comment: "Yes!"
- Without parent context: might be classified as unclear/off-topic
- With parent context: correctly classified as answering a question

This dramatically improves accuracy for:
- Short reply comments ("Yes", "Thanks!", "Agreed")
- Sarcastic replies (context is crucial for understanding)
- Continuations of discussions
- Agreement/disagreement comments

### Failure Handling & Reprocessing

Comments that fail sentiment analysis (missing critical fields such as sentiment_polarity or intents) are:
- Marked as `success=False` in the workflow
- **NOT stored in Snowflake**
- **Automatically available for reprocessing** in future runs

This ensures only successfully processed comments are stored, while failed comments remain available for retry.

### Incremental Processing & Deduplication

The pipeline automatically handles incremental processing:
- **SQL-level deduplication**: The query excludes comments already in `COMMENT_SENTIMENT_FEATURES` using a `LEFT JOIN`
- **Automatic retry**: Failed comments (not stored) are automatically retried on the next run
- **Append-only mode**: The default behavior appends new records without overwriting
- **Production-ready**: Safe to run daily/weekly/monthly to process new comments

### Scalable Configuration

To add or modify sentiment categories or intents:

1. Edit `config_files/sentiment_analysis_config.json`
2. Add/modify categories in the `sentiment_polarity` or `intent` sections
3. Update `reply_policy.requires_reply_intents` if needed
4. No code changes required!
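Reading the categories and applying the reply policy against a comment's multi-label intents might look like this (helper names are illustrative, not the project's actual code):

```python
from typing import Dict, Iterable, List

def category_values(config: Dict, section: str) -> List[str]:
    """List the configured category values for a section (e.g. 'intent')."""
    return [cat["value"] for cat in config[section]["categories"]]

def requires_reply(intents: Iterable[str], config: Dict) -> bool:
    """True if any detected intent appears in the reply policy list."""
    flagged = set(config["reply_policy"]["requires_reply_intents"])
    return bool(flagged.intersection(intents))
```

Because the policy is read from the config at runtime, adding a new flagged intent is purely a JSON edit.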

## Future Extensions

The modular architecture supports easy addition of:

- Topic classification agent
- Entity extraction agent
- Engagement score prediction agent
- Named entity recognition agent

Simply create a new agent inheriting from `BaseAgent` and add it to the workflow graph.

## Troubleshooting

### Issue: "Object does not exist or not authorized" on First Run

**Error**: `Object 'SOCIAL_MEDIA_DB.ML_FEATURES.MUSORA_COMMENT_SENTIMENT_FEATURES' does not exist or not authorized`

**Cause**: The fetch query checks for already-processed comments, but the output table doesn't exist yet on the first run.

**Solution**:
1. Run the initialization script first:
   ```sql
   -- Execute in Snowflake
   sql/init_musora_table.sql
   ```
2. Then run the processing:
   ```bash
   python main.py --overwrite --data-source musora_comments --limit 100
   ```

### Issue: API Rate Limits

If you hit API rate limits, reduce the number of parallel workers or process fewer comments:

```bash
# Process fewer comments at a time
python main.py --limit 500

# Or use sequential mode
python main.py --sequential --limit 100
```

### Issue: Memory Issues

Process in smaller batches using `--limit`:

```bash
python main.py --limit 500
```

### Issue: Debugging Processing Errors

Use sequential mode to debug issues more easily:

```bash
python main.py --sequential --limit 50
```

This processes all comments in a single batch with clearer error messages.

### Issue: Connection Timeouts

Check the Snowflake credentials in `.env` and network connectivity.

### Issue: Parallel Processing Not Working

If multiprocessing issues occur, use sequential mode:

```bash
python main.py --sequential
```

## Performance

### Expected Speedup

Parallel processing provides significant performance improvements:

- **Sequential**: 1x (baseline)
- **2 workers**: ~1.8-1.9x faster
- **5 workers**: ~4-4.5x faster

Speedup isn't perfectly linear due to:
- Snowflake connection overhead
- LLM API rate limits (shared across workers)
- I/O operations

### Monitoring Performance

The processing summary includes:
- Total processing time
- Average time per comment
- Number of workers used
- Batch size calculations
- Failed batches (if any)

## License

Internal use only - Musora sentiment analysis project.