Hi everyone, today I will show how to save scraped data into MongoDB, continuing the project I introduced in the previous lesson. Link
Introducing MongoDB
- MongoDB is an open-source database of the NoSQL type.
- It is document-oriented: data is stored in collections, which are similar to tables in relational database systems such as MySQL and PostgreSQL.
- Compared to an RDBMS, a MongoDB collection corresponds to a table, and a document corresponds to a row; MongoDB uses documents instead of rows.
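To make the row/document mapping concrete, here is a small sketch (the product fields are made up for illustration):

```python
# A row in an RDBMS table is flat and fixed by the table's schema:
row = ("MacBook Pro", 1999, "laptop")

# The equivalent MongoDB document is a JSON-like structure; it can nest
# sub-documents and add or omit fields without altering any schema.
document = {
    "name": "MacBook Pro",
    "price": 1999,
    "category": "laptop",
    "specs": {"ram_gb": 16, "storage_gb": 512},  # nested sub-document
}
```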
Write code for pipeline
We rewrite the pipeline using pymongo (a library for interacting with MongoDB from Python).
```python
import pymongo
from scrapy.exceptions import DropItem


class MongoDBPipeline(object):
    def __init__(self):
        # Connect to the local MongoDB server on the default port
        connection = pymongo.MongoClient('localhost', 27017)
        db = connection["Mac"]
        self.collection = db["Item"]

    def process_item(self, item, spider):
        # Drop any item that has an empty field
        for data in item:
            if not item[data]:
                raise DropItem("Missing {0}!".format(data))
        self.collection.insert_one(dict(item))
        return item
```
In the MongoDBPipeline constructor we create a MongoClient connected to “localhost” on port 27017, open the database named “Mac”, and use the collection named “Item”. (Note: in current pymongo the method for inserting a single document is `insert_one`; the old `collection.insert` was deprecated in pymongo 3 and removed in pymongo 4.)
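The validation step can be tried on its own, outside Scrapy. The sketch below (a hypothetical `validate_item` helper with made-up item fields) mirrors what `process_item` does before inserting:

```python
def validate_item(item):
    """Raise ValueError for any empty field, the way the pipeline raises DropItem."""
    for field in item:
        if not item[field]:
            raise ValueError("Missing {0}!".format(field))
    return item

# A complete item passes through unchanged:
ok = validate_item({"title": "New MacBook rumor", "link": "https://example.com/post"})

# An item with an empty field is rejected:
try:
    validate_item({"title": "", "link": "https://example.com/post"})
except ValueError as exc:
    print(exc)  # Missing title!
```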
After adding this code, register the pipeline in the settings file:
```python
ITEM_PIPELINES = {
    'code9to5mac.pipelines.MongoDBPipeline': 300,
}
```
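Hardcoding 'localhost' and 27017 in the pipeline works fine for this lesson. As a side note, a common refinement is to keep the connection details in settings.py under keys of your choosing (the names below are my own, not part of the project) and have the pipeline read them via `crawler.settings` in a `from_crawler` classmethod:

```python
# settings.py — hypothetical keys; the pipeline would read them with
# crawler.settings.get("MONGO_URI") inside a from_crawler classmethod
MONGO_URI = 'mongodb://localhost:27017'
MONGO_DATABASE = 'Mac'
MONGO_COLLECTION = 'Item'
```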
Rerun the project, then check MongoDB by opening its shell:
```
mongo
```
then:
```
show dbs
```
We can see that the Mac database has been created.
Okay, let’s query it. First:
```
use Mac
```
And then:
```
db.Item.find().pretty()
```
The data has been saved, as shown below:
Conclusion
In this article I showed how to import data into MongoDB from Scrapy. In the next article I will introduce Scrapy Cluster. Thank you for your interest.